ROBERT F. MULLIGAN
A Brief Introduction to Econometrics
1. The basic univariate model: Yt = a + bXt + et. Y is called the left-hand-side variable, LHS variable, the dependent variable, or the explained variable. X is the right-hand-side, RHS variable, dependent, or explanatory variable. e is the residual, error, or disturbance term. The expression (a + bXt) is called the fitted value of Y and is also referred to as Y-hat. The regression calculates values of a and b, the "regression coefficients," that best fit the data for X and Y (hence regression is sometimes called "curve fitting,") by minimizing the sum of squared residuals, ∑tet2 (or e'e in vector notation.) This most basic and commonly used regression technique is called ordinary least squares, OLS, OLSQ, or LS.
To run a
regression in MS Excel 97, enter your data in a worksheet. Normally each
variable is placed in its own column, with the name of the variable and any
other information you may wish to place there, like units and data
transformations you performed (which are not used for computing the regression
coefficients), at the top of the column. Click on "Tools" in
the toolbar at the top of the screen to get the "Tools" menu, then go
down to the bottom and click on "Data Analysis." Scroll down
the "Data Analysis" window until you find "Regression,"
highlight it, and press "enter." You will be provided a dialog
box which enables you to set the range for each variable in the regression.
Y is the left-hand-side variable. Click on the Icon on the right side of
the window for the Y variable range (it looks like a little table with tiny
text windows - real cute.) Then highlight all the numerical values of the
LHS variable. The screen will show a dotted line around the numbers you
highlight. Press enter, and the dialog box will reappear with the range
of the LHS variable you selected. Repeat for the RHS variable. You
can highlight more than one column on the RHS to estimate a multivariate
regression. Make sure each variable has the same number of
observations. When you are satisfied with the Y and X ranges in the
dialog box, click on the "OK" button, and your output will appear on
a new sheet in your current worksheet.
2. The basic univariate model in vector notation: y = Xβ + e where y is an (n x 1) column vector of the different observations of the left-hand-side variable, X is an (n x 2) matrix for which the first column is a column of ones (representing the constant or intercept term) and the second column is a column of the n observations of the right-hand-side variable, β is a (2 x 1) vector of estimated coefficients (the intercept or constant, and the slope or X coefficient,) and e is an (n x 1) vector of residuals or disturbance terms. This model is the same as in part 1, and looks like this:
Note that I have renamed the intercept b0 (instead of a) and the slope b1 (instead of just b).
MS Excel 97, you can force the regression line through the origin, setting a =
b0 = 0, by checking the "Constant is Zero" option
on the regression dialog box.
3. The basic multivariate model (in vector notation): y = Xβ + e is the same as the simple univariate case except that now X is an (n x k+1) matrix (called the observation or data or independent variable matrix) consisting of a column of n ones (representing the intercept) and k columns of n observations of the k right-hand-side variables, and β is a (k+1 x 1) column vector of estimated coefficients. This model has more than just one explanatory or independent variable and looks like this:
This model can also be written as: Yt = b0 + b1Xt + b2Zt + b3Wt + b4Rt + et, supposing there are only four right-hand-side variables.
MS Excel 97, you estimate a multivariate regression by selecting an X range
with more than one column, which the software reads as more than one
variable. It is important each column has the same height, lined up row
by row, and that each column be adjacent.
4. Non-linear regression: What if Y is a function of X, but not a linear function of X? Then form second-order terms by squaring each X, and estimate the regression as: Yt = a + bXt + cXt2 + et. You can also cube X or raise it to any power you want. If it is a multivariate regression, (with, say, X and Z as right-hand-side variables,) you can also add cross-product terms (in this case XZ as well as X2 and Z2.)
could do this by creating new columns from your original X variables. Or
you could transform your data into logarithms.
5. Some regression output:
MS Excel 97 provides three output tables whenever you run a regression.
Table 1: Regression Statistics
The first table gives statistics which measure the fit of the regression as a whole. The multiple R is the coefficient of correlation between Y and Y-hat. (For a univariate regression, with only one RHS variable, R is the coefficient of correlation between X and Y). R is the square root of the R-squared, the coefficient of determination. The range of R is between -1 and +1, so R-squared ranges from 0 to 1. Theoretically, the adjusted R-squared, the coefficient of determination adjusted for degrees of freedom, or R-bar-squared, has the same range, but it can be negative when the number of coefficients being estimated—that is, the number of RHS variables—is high compared to the number of observations or sample size. R-squared = R-bar-squared = 1 indicate a perfect fit. The closer these are to one, the better the regression fits the data. The standard error of the regression, which should be as low as possible, and the number of observation are also provided.
Table 2: The F-Statistic
The second table, labeled ANOVA for "Analysis of Variance," computes an F statistic which tests whether the estimated coefficients (the b's) are jointly significantly different from zero. The last column in this table gives the significance level of the F test, which should be less than or equal to 5% (= 0.05) or some other arbitrarily chosen low percentage. The table gives degrees of freedom ("df") and sums of squares ("SS") for the regression (SSR), the residuals or error terms (SSE), and the total, for the LHS variable (SST). To get the SST, square each observation of the LHS variable and add the squares. To get the SSE, square each error term and add the squares. To get the SSR, subtract SSE from SST, or square each value of Y-hat (the fitted values of Y) and add the squares.
Next, the table gives mean squares, which are the SSR and SSE divided by their degrees of freedom. For the SSR, the df is the number of RHS variables with coefficients being estimated, not including the constant. The df for the SSE is the number of observations, minus the number of coefficients being estimated, including the constant. The SSR and SSE df add up to one less than the number of observations (1 - n), which is the SST df. One degree of freedom is eaten up by the constant, which is why the SST df is one less than the number of observations of each variable, as long as a constant is included in the regression.
The ratio of the two mean squares MSR/MSE is F-distributed under ideal conditions. This is listed under F. MST is not used to calculate the F statistic, so it is not listed in the ANOVA table. The F ratio tests the joint hypothesis H0: (b1 = b2 = b3 = . . . = bk = 0), the null hypothesis that all the b coefficients (except b0 the intercept or constant) are jointly equal to zero. The significance level of the F ratio is greater than 5% (or some other low value you choose) to accept (or "fail to reject") the null hypothesis, which tells you the regression has no explanatory power. The significance level of the F ratio is less than 5% (or some other value you choose) to reject the null hypothesis, which tells you the regression has some explanatory power.
Table 3: Estimated Coefficients and their t-Statistics
The third table provides (finally!) the estimated coefficients for the intercept and each RHS X variable. Each coefficient has a standard error. The t-statistic or t-ratio is the ratio of each coefficient divided by its standard error. Each t-ratio is t-distributed under ideal conditions. Each t test tests the null hypothesis H0: (bj = 0), that each coefficient is equal to zero. The t-statistic independently tests the significance of each estimated coefficient while the F jointly tests whether they all work together. Usually if any t-stats are greater than 2.00, the F statistic will also be significant. The significance level of the t ratio is greater than 5% (or some other low value you choose) to accept (or "fail to reject") the null hypothesis, which tells you the variable may be removed from the regression, so take it out and estimate the regression without it. The significance level of the t ratio is less than 5% (or some other value you choose) to reject the null hypothesis, which tells you the variable may be kept.
In addition to estimated coefficients for each variable and the intercept or constant term, most regression softwares provide some or all of the following:
1. Standard error (of a coefficient) tells you how likely the true or population value of the estimated coefficient will lie within a certain distance from the estimate (or sample value,) based on the dispersion of the data. The smaller the better. The true value is about 67% likely to lie within plus or minus one standard error from the estimated value, and about 95% likely to lie within plus or minus two standard errors. MS Excel 97 provides this in the coefficient estimate table.
2. t-statistic (t-ratio): The estimated coefficient divided by its standard error. The higher the better. If t exceeds 1, it is at least about 67% likely that the true value of the coefficient is not zero, and if t exceeds 2, it is at least about 95% likely. This tests the hypothesis that the true value of an estimated coefficient is equal to zero, and that that variable should be deleted. MS Excel 97 provides this next to the standard error in the coefficient estimate table.
3. Probability value of the t statistic (2-tailed significance level): Gives the probability value of the computed t statistic from an internal table of the t distribution. The lower the better. Less than .05 (or .10 or .01, this is called an alpha or level of significance, which you choose) for statistically significant coefficients.
4. R2 (Coefficient of determination): Ranges from 0 to 1. The higher the better. Measures the % of variation in Y that is explained by variation in the right-hand-side variables. In a univariate regression it is the square of the coefficient of correlation (rXY) between X and Y. It is calculated by this formula:
Y-bar is the sample mean of Y, SSE means "sum of squared errors" (or sum of squared residuals, e'e in vector notation), and SST means "sum of squares total." SSR is the "sum of squares regression," and they all fit into this identity: SST = SSR + SSE. SSE can't be referred to as SSR (though, unfortunately, sometimes it is.) To make things more confusing, SSR is also called the "explained sum of squares" because it's what's explained by the regression, and SSE is also known as the "unexplained sum of squares," because it's what's left over. If the regression were a perfect fit, all the e's would be zero, and so SSE would equal zero and R2 would equal one. (This would mean that SSR = SST.) MS Excel 97 provides the R2 in the regression statistics table, just under its square root, "Multiple R."
5. Adjusted R2 (R2 adjusted for degrees of freedom, aka R-bar-squared): Interpreted the same way as R2, but can be negative. Should never be greater than R2. It is calculated by this formula:
The formula is basically the same as for R2, but includes adjustments for the size of the sample (n) and the number of coefficients being estimated (k+1). MS Excel 97 provides this in the regression statistics table under the R2.
6. Durbin-Watson statistic (DW or d): Tests for serial correlation of residuals: Serial correlation means that the error terms are systematically related through time, and not statistically independent of each other. Theoretically the errors should be white noise, or perfectly random. When they aren't, the whole estimate's validity is questionable. DW is close to 2 in the absence of serial correlation. Critical values for DW can be found in the (old) Durbin-Watson or (newer) Savin-White tables in the back of most econometrics texts. In cross-sectional data, serial correlation is called autocorrelation, and can be solved by just scrambling the order of the data (basically you can do this because the order of the observations wasn't important to start with.) With time-series data—commonly used in macroeconomics—you can't change the order, so you have two choices: difference your regression equation (see below), or apply a correction for serial correlation of the error terms, typically the Cochrane-Orcutt iterative correction (provided by MicroTSP) or Beach-MacKinnon iterative maximum likelihood method (provided by TSP). MS Excel 97 does not provide a test for serial correlation or an estimation technique robust to serial correlation.
7. Durbin's h and/or Durbin's h alternative: TSP provides this, but MicroTSP doesn't (yet). It's an improved test statistic for serial correlation and should be used in preference to DW whenever there are a large number of explanatory variables, or when some of the explanatory variables are lagged values of Y. TSP 4.2 prints out the probability, which you want to be less than the significance level you choose, a = .10, .05, or .01. MS Excel 97 does not provide a test for serial correlation. Too bad.
8. Log likelihood (logarithm of the likelihood function): If n is the number of observations of each variable in the regression, this is calculated by the formula:
The larger L the better (consequently the larger logL the better) but this is only meaningful when comparing two regressions: the one with the larger logL is the better of the two. The difference of the logL's is chi-square distributed, and if its probability value is below some critical significance level (a), usually .05, the model with the lower likelihood is rejected. This is called a likelihood ratio test because the test statistic (the difference between logL's) is the log of a ratio. The formula used for the log likelihood is based on the normal distribution; theoretically the distribution of the residuals should become more nearly normal as the number of observations n approaches infinity. Maximum likelihood estimation is an alternative technique to least squares that chooses the estimated coefficients to make the residuals most nearly normally distributed. Instead of minimizing the sum of squared residuals, it maximizes the value of the normal likelihood function. The higher logL is for a least squares estimate, the closer it is to the maximum likelihood or ML estimate. It is not always useful to assume that residuals are normally distributed, see sect. 7.5 below. MS Excel 97 does not provide this.
9. Standard error of the regression: Measures the magnitude of the residuals. About 67% will lie between one standard error above zero and one below, and about 95% will lie between plus or minus two standard errors. The smaller the better. MS Excel 97 provides this in the first regression output table, under "Standard Error."
10. Sum of squared residuals: (e'e) This is what is minimized by a least squares estimate. The smaller the better. MS Excel 97 provides this in the SS column of the ANOVA table, under "Residual."
11. F statistic (zero slopes): Tests whether the estimated values of the coefficients are all significantly different from zero. Doesn't include the constant or intercept. For a univariate regression is equal to the square of the t statistic for the right-hand-side variable. The higher the better. You want the probability value of the F statistic to be less than a = .10, .05, or .01. MS Excel 97 provides the F statistic and its probability level at the far right of the ANOVA table.
12. Akaike Information Criterion (AIC): AIC = ln(SSE/n) + 2k/n, where n is the number of observations or sample size, k is the number of coefficients being estimated, including the constant if any, and SSE is the sum of squared residuals from the regression, aka e'e. The lower this number is, the more efficient the regression model is in making use of the available information contained in the data to estimate the regression coefficients. Note that the formulas for the three information criteria are similar to the R-square. R-square always increases when additional RHS variables are added. The information criteria attempt to measure whether the information contained in additional RHS variables, which always increases the R-square, is sufficient to warrant adding more variables to the model. If adding a variable to the regression results in the AIC falling, then add the variable. If removing a variable results in the AIC falling, then remove the variable. The AIC is automatically computed by Eviews, but not by MS Excel.
13. Schwarz Bayesian Information Criterion (SBIC or SC): SBIC = ln(SSE/n) + (k ln(n))/n. The SBIC is an alternative approach to measuring the information efficiency of a regression model. SBIC is interpreted the same as the other information criteria, but does not always give the same results. It is automatically computed by Eviews as part of the regression output, but not by MS Excel.
14. Amemiya's Prediction Criterion (PC): Amemiya's PC = SSE(1 + k/n)/(n – k). Amemiya's PC is yet another approach to
measuring the information efficiency of a regression model. This is interpreted the same way as the AIC
or the SBIC, lower is better, but does not always give the same results. It is not computed by either Eviews or MS
5. The kinds of data.
a. Time-series data: A different value of each variable is observed for each period of time, e.g., annually, quarterly, monthly, weekly, daily, etc. Gross domestic product (GDP), interest rates, stock market quotations are examples. Used mostly by macroeconomists, and is often subject to serial correlation. ECON 300 Forecasting Projects will deal mostly with time-series data.
b. Cross-sectional data: Each observation of each variable is for a different unit being observed, e.g., different firms, plants, households, individuals. Census data on the number of lathe machines operated at different factories, the number of workers employed by different firms, the size of different households are examples. Used mostly by microeconomists, and the order can be scrambled if it improves regression estimates.
data: Each observation is for a specific unit and a specific point in time. A
set of time series consisting of the number of workers employed by different
firms over several years would be panel data. Used mostly by microeconomists.
6. Differencing a regression: If a time-series regression has serial correlated residuals, or if any one of the variables is non-stationary, that is, if a variable has a moving trend like GDP, CPI, or many other steadily-growing macroeconomic variables, or has a cyclical pattern of variation, the regression can be differenced. This involves lagging the equation one time period, then taking the difference:
The b0 drops out, but all the other b's (the slopes) retain their identities, so this is an alternative way of estimating the slopes. The equation can be differenced as many times as necessary to achieve acceptable Durbin-Watson or Durbin's h statistics. An intercept term (I call it "a") is left in the differenced equation, even though it doesn't belong there. This is because if you estimate the regression without an intercept, that imposes the very unrealistic restriction that the regression line passes through the origin. It is better to estimate the unrestricted equation, though a is not the same as b0. Also the u's are different from the (serial correlated) e's.
regressions with non-stationary data can benefit from the desirable statistical
property of superconsistency. This
means that the estimated coefficients converge to their true, unbiased values
faster than for OLS estimates with stationary data. Superconsistency is a large sample property,
and the dividing line between large and small samples is ambiguous. A sample less than 30 observations is always
"small," but a sample less than 300 observations may not be
7. Violations of regression assumptions:
1. Heteroskedasticity: Regression theory assumes homoskedasticity of regression residuals, that is, that the residuals have a constant variance, (otherwise the standard error of the regression would vary systematically.) If you have this, invoke White's heteroskedasticity consistent covariance matrix to compute the results. In MicroTSP use the command [LS(H) Y C X Z W] in TSP use [OLSQ (ROBUST) Y C,X,Z,W ;]. How do you know if you have it? If you don't get the same answer without White's matrix as with it, heteroskedasticity is the reason why. There are a battery of statistical hypothesis tests available to test for heteroskedasticity, but it's much easier to estimate the regression both ways. MS Excel 97 does not provide this.
2. Serial correlation (or Autocorrelation): Use the Cochrane-Orcutt correction in MicroTSP: [LS Y C X Z W AR(1)]. If necessary, add AR(2) or more terms as well. When the Durbin-Watson statistic goes close to two, this problem is solved. You can also difference the equation, see section 6. In TSP use the Beach-MacKinnon iterative maximum likelihood technique: [AR1 Y C,X,Z,W ;], but usually this only works for annual data because you can't add more terms if you need to. With cross-sectional data, just change the order of the data, (you can do this as long as the order doesn't matter.) MS Excel 97 does not provide this.
3. Multicollinearity (1): Some of your variables are too similar for the computer algorithm for minimizing the sum of squared residuals (e'e) to handle. You'll get an error message regarding a "near singular matrix" which the computer can't invert (see section 8 below). Excel returns the LINEST function. There are two possibilities:
a. It is an exactly singular matrix (the computer can't tell the difference) because you have accidentally entered the same variable twice in the same regression. Look for the left-hand-side variable on the right, or one of the right-hand-side variables repeated. Excel's LINEST function error message indicates the regression cannot be computed because one variable is a linear function of another. One of the few advantages of MS Excel 97 is that Excel is so user-hostile it is very difficult to make this mistake.
b. One (or more) of your variables is very much smaller in relative magnitude than the rest. This is called a scale problem. The computer treats it as a string of zeroes, and it's a linear function of the string of ones it uses for the constant. You can tell if you have this by plotting all your data together: in MicroTSP the command is: [PLOT Y X Z W]. In MS Excel 97, plot your data by clicking on the "chart wizard" icon on the standard toolbar. It looks like a bar chart. If one variable is just about always equal to zero on the plot, then you should select some time period and divide each variable by its own value for that time period. This is called indexing, and should remove any scale problems.
4. Multicollinearity (2): Although the software can successfully perform the estimate, it is difficult to interpret. (This is a less serious problem.) Typically you may find that the F statistic is very high (indicating a good estimate and a good model,) but the t statistics are all very low, (indicating that many or all of the variables should be removed from the model.) It is O.K. to remove variables from the model as long as the F statistic remains high (or its probability level remains low.) If you cannot remove any variables without hurting your F statistic, just note multicollinearity as the reason why the t statistics are so low.
5. Non-normality: Contrary to common opinion, it is not a requirement of the standard regression model that regression residuals be normally distributed. However, if they are normally distributed, that guarantees that the residuals are also homoskedastic and not serially correlated. To test for normality, calculate the Jarque-Bera efficient normality test statistic. In MicroTSP, right after you run the regression, use: [HIST RESID] and the Jarque-Bera statistic will be at the bottom of the histogram, next to its probability level. J-B tests the null hypothesis of normality. You want the probability to be greater (not less) than .05, or some other significance level, otherwise the residuals aren't normally distributed. TSP does this automatically whenever you estimate a least squares regression, but in MicroTSP you can test any variable for normality with the HIST command, which is very convenient. SAS does a different test, the Wilson statistic, with the PROC UNIVARIATE command. The closer the Wilson statistic is to one, the more nearly normal the variable being tested. The traditional test for normality, the Kolmolgorov-Smirnoff test, was difficult to compute, and was invented by two Russian mathematicians. The Jarque-Bera test was invented by economists. If your residuals are normally distributed, you can claim that your estimate is close to the maximum likelihood estimate.
97 does not provide a test for normality, but you can compute skewness and
kurtosis of your residuals with the "Descriptive Statistics" option
in the "Data Analysis" menu. The skewness of the normal
distribution is 0 and the kurtosis 3. Check the "Residuals" box in
the regression dialog box to save your residuals when you run a regression.
8. Regression computations: (Inverting the matrix). In the general multivariate regression model, y = Xβ + e, a least squares regression is performed by calculating an estimate of β, called b. Both b and β are (k+1 x 1) column vectors. The least squares formula is b = (X'X)-1X'y. The matrix formulae X'X and X'y are not too difficult to calculate, even with large numbers of explanatory variables. However, inverting the matrix X'X to get (X'X)-1 is not easy, and before the days of computers (up until the early sixties), regressions could not generally be calculated with more than five or six right-hand-side variables. Even supercomputers have been known to balk at large inversion problems.
Some alternatives to ordinary least squares: (none are provided by MS Excel
1. Single equation techniques:
a. minimum absolute deviation (MAD)
b. minimum sum of absolute errors (MSAE)
c. instrumental variables (IV) [You can do this in MS Excel 97, but you have to do all the work yourself.]
d. two-stage least squares (2SLS) [This can be done in Excel the same as with instrumental variables.]
e. weighted least squares
f. limited information maximum likelihood (LIML)
g. full information maximum likelihood (FIML or just ML)
h. recursive least squares
i. Kalman filtering
j. non-linear least squares (NLS)
k. least median of squares (LMS) [This is an outlier-resistant technique designed to avoid having the estimate unduly influenced by unusually large or small observations.]
2. Simultaneous equation techniques
a. seemingly unrelated regressions (Zellner's method) (SUR)
b. limited information maximum likelihood (LIML)
c. full information maximum likelihood (FIML or just ML)
d. three stage least squares (3SLS)
e. generalized method of moments (GMM)