9  Least Squares with Matrix Algebra

The mathematics of least squares is best expressed in matrix form. Proofs of results are much more concise and more general, and we can draw on the insights of linear algebra to understand least squares at a deeper level. Furthermore, the same mathematics applies to a vast number of advanced linear models including multiple equation models, and is useful even for non-linear ones, so it is well worth the time and effort to master the mathematics of least squares estimation expressed using matrix algebra. The objective of this chapter is to help you become familiar with the mathematics of least squares estimation of linear models, and to provide proofs of results previously omitted or only partially proven.

We use the following packages in this chapter.

library(readxl)
library(car)
library(sandwich)

9.1 The Setup

Suppose that you have a sample \(\{Y_i,X_{1,i},X_{2,i},\dots,X_{K-1,i}\}_{i=1}^N\) such that \[ Y_i = \beta_0 + \beta_1 X_{1,i}+ \beta_2 X_{2,i} + ... + \beta_{K-1} X_{K-1,i} + \epsilon_i \, , \; i=1,2,...,N. \tag{9.1}\] We use \(X_{k,i}\) to denote the \(i\mathrm{th}\) observation of variable \(X_k\). We can write the regression in matrix form as \[ \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & X_{2,1} & \dots & X_{K-1,1} \\ 1 & X_{1,2} & X_{2,2} & \dots & X_{K-1,2} \\ \vdots & \vdots & \vdots &\ddots & \vdots \\ 1 & X_{1,N} & X_{2,N} & \dots & X_{K-1,N} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{K-1} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} \tag{9.2}\] or simply \[ y = X\beta + \varepsilon \tag{9.3}\] where \(y\) is the \((N \times 1)\) vector \(\begin{bmatrix}Y_1 & Y_2 & \dots & Y_N\end{bmatrix}^\mathrm{T}\), \(X\) is the \((N \times K)\) matrix of regressors, \(\beta\) is the \((K \times 1)\) coefficient vector, and \(\varepsilon\) is the \((N \times 1)\) vector of noise terms. We will use \(\{\epsilon_i\}_{i=1}^N\) to denote a sample of the noise variable \(\epsilon\). We organize the \(\{\epsilon_i\}_{i=1}^N\) into the vector \(\varepsilon\), as in (9.2) and (9.3). We assume throughout that \(N>K\).

We can partition the regressor matrix by observation: \[ X = \left[\begin{array}{ccccc} 1 & X_{1,1} & X_{2,1} & \dots & X_{K-1,1} \\ \hline 1 & X_{1,2} & X_{2,2} & \dots & X_{K-1,2} \\ \hline \vdots& \vdots & \vdots & \ddots & \vdots\\ \hline 1 & X_{1,N} & X_{2,N} & \dots & X_{K-1,N} \end{array}\right] = \begin{bmatrix} x_1^\mathrm{T} \\ x_2^\mathrm{T} \\ \vdots \\ x_N^\mathrm{T} \\ \end{bmatrix} \tag{9.4}\] and write the model as \[ \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} = \begin{bmatrix} x_1^\mathrm{T} \\ x_2^\mathrm{T} \\ \vdots \\ x_N^\mathrm{T} \\ \end{bmatrix} \beta + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} \] or \[ Y_i = x_i^\mathrm{T}\beta + \epsilon_i \, , \; i=1,2,...,N. \] It is sometimes helpful to partition the regressor matrix by variable: \[ X = \left[\begin{array}{c|c|c|c|c} 1 & X_{1,1} & X_{2,1} & \dots & X_{K-1,1} \\ 1 & X_{1,2} & X_{2,2} & \dots & X_{K-1,2} \\ \vdots& \vdots & \vdots & \ddots & \vdots\\ 1 & X_{1,N} & X_{2,N} & \dots & X_{K-1,N} \end{array}\right] = \begin{bmatrix} i_N & \mathrm{x}_{1} & \mathrm{x}_{2} & \dots & \mathrm{x}_{K-1} \end{bmatrix} \tag{9.5}\] where \(i_N\) is the \((N \times 1)\) vector of ones, and \(\mathrm{x}_k\) is the vector of observations of variable \(X_k\). Feasibility of OLS estimation will require \(X\) to have full column rank, i.e., that there is no non-zero \((K \times 1)\) vector \(c\) such that \[ Xc = c_0 i_N + c_1\mathrm{x}_1 + c_2\mathrm{x}_2 + \dots + c_{K-1}\mathrm{x}_{K-1} = 0\,. \] In other words, we must assume that there is variation in each of the variables (apart from the constant vector), and that no one variable can be written as a linear combination of the other variables. The full column rank assumption implies that \(X^\mathrm{T}X\) is non-singular (i.e., has an inverse).

We continue to assume that the noise terms \(\epsilon_i\) have zero mean conditional on all observations of all regressors: \[ E[\epsilon_i | \mathrm{x}_{1}, \mathrm{x}_{2}, \dots, \mathrm{x}_{K-1}]=0\,. \] In matrix form, we can write this even more simply as \[ E[\varepsilon|X] = 0_{N \times 1}\,. \] In the basic model, the noise terms were assumed to be conditionally homoskedastic and uncorrelated: \[ \begin{aligned} var[\epsilon_i | \mathrm{x}_{1}, \mathrm{x}_{2}, \dots, \mathrm{x}_{K-1}] = \sigma^2 \quad &\text{for all} \quad i = 1,2,...,N,\\ cov[\epsilon_i, \epsilon_j | \mathrm{x}_{1}, \mathrm{x}_{2}, \dots, \mathrm{x}_{K-1}] = 0 \quad &\text{for all} \quad i \neq j\,,\, i,j=1,2,...,N. \end{aligned} \] The (conditional) variance-covariance matrix of \(\varepsilon\) contains the conditional variances and covariances of the noise terms: \[ var[\varepsilon | X] = E[\varepsilon \varepsilon^\mathrm{T} | X] = \begin{bmatrix} var[\epsilon_1 | X] & cov[\epsilon_1, \epsilon_2| X] & \dots & cov[\epsilon_1, \epsilon_N| X] \\ cov[\epsilon_2, \epsilon_1| X] & var[\epsilon_2| X] & \dots & cov[\epsilon_2, \epsilon_N| X] \\ \vdots & \vdots & \ddots & \vdots \\ cov[\epsilon_N, \epsilon_1| X] & cov[\epsilon_N, \epsilon_2| X] & \dots & var[\epsilon_N| X] \end{bmatrix}. \] In the case of homoskedastic uncorrelated noise terms, the variance-covariance matrix of \(\varepsilon\) is simply \[ E[\varepsilon \varepsilon^\mathrm{T} | X] = \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{bmatrix} = \sigma^2 I \] where \(I\) is the \((N \times N)\) identity matrix (we will generally not indicate the dimensions of identity matrices, and leave the reader to deduce dimensions from context). If the errors are conditionally heteroskedastic with \(var[\epsilon_i | X] = \sigma_i^2\) but uncorrelated, then the conditional variance-covariance matrix of \(\varepsilon\) becomes \[ E[\varepsilon \varepsilon^\mathrm{T} | X] = \begin{bmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_N^2 \end{bmatrix} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_N^2). \] If there is correlation between the errors, then the variance-covariance matrix of \(\varepsilon\) will no longer be diagonal. We write the general conditional variance-covariance matrix of \(\varepsilon\) as \[ E[\varepsilon \varepsilon^\mathrm{T} | X] = \Omega, \] although we will have to impose some structure on \(\Omega\) for the analysis to be feasible. We list the assumptions of the basic model below:

Assumption Set D   The sample \(\{Y_i,X_{1,i},X_{2,i},\dots,X_{K-1,i}\}_{i=1}^N\) satisfies

(D1) \(y = X\beta + \varepsilon\),

(D2) \(E[\varepsilon | X] = 0\),

(D3) \(E[\varepsilon \varepsilon^\mathrm{T} | X] = \sigma^2I\),

(D4) \(Xc \neq 0\) for all \(c \neq 0\),

where \(y\), \(X\) and \(\varepsilon\) are as defined in this section.

9.2 Ordinary Least Squares

Let \(\hat{\beta}\) denote some estimator for \(\beta\). Then the fitted values associated with these estimators are \[ \hat{y} = X\hat{\beta} \] and the residuals are \[ \hat{\varepsilon} = y - \hat{y} = y - X\hat{\beta}. \] The sum of squared residuals is then \[ \begin{aligned} SSR = \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} &= (y - X\hat{\beta})^\mathrm{T}(y - X\hat{\beta}) \\ &= y^\mathrm{T}y - \hat{\beta}^\mathrm{T}X^\mathrm{T}y - y^\mathrm{T}X\hat{\beta} + \hat{\beta}^\mathrm{T}X^\mathrm{T}X\hat{\beta} \\ &= y^\mathrm{T}y - 2\hat{\beta}^\mathrm{T}X^\mathrm{T}y + \hat{\beta}^\mathrm{T}X^\mathrm{T}X\hat{\beta} \end{aligned} \] where we have used the fact that \(\hat{\beta}^\mathrm{T}X^\mathrm{T}y\) is the transpose of \(y^\mathrm{T}X\hat{\beta}\), and the transpose of a scalar is the scalar itself. OLS estimators are those that minimize SSR: \[ \hat{\beta}^{ols} = \text{argmin}_{\hat{\beta}} SSR\,. \] The first-order conditions are \[ \left. \frac{\partial SSR}{\partial \hat{\beta}} \right|_{\hat{\beta}^{ols}} = - 2X^\mathrm{T}y + 2X^\mathrm{T}X\hat{\beta}^{ols} = 0. \] This implies \[ X^\mathrm{T}X\hat{\beta}^{ols} = X^\mathrm{T}y \] which, given our assumption that \(X\) is full column rank, can be solved for \(\hat{\beta}^{ols}\): \[ \hat{\beta}^{ols} = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y. \] The matrix of second partial derivatives of the SSR, \[ \frac{\partial^2 SSR}{\partial \hat{\beta} \partial \hat{\beta}^\mathrm{T}} = 2X^\mathrm{T}X \] is positive definite, since the assumption of full column rank of \(X\) means that \(Xc \neq 0\) for all \(c \neq 0\), from which it follows that \[ c^\mathrm{T}X^\mathrm{T}Xc = (Xc)^\mathrm{T}Xc > 0 \] for all \(c \neq 0\). The FOC can also be written as \[ X^\mathrm{T}(y - X\hat{\beta}^{ols}) = X^\mathrm{T}\hat{\varepsilon}^{ols}= 0. \tag{9.6}\] Partitioning \(X^\mathrm{T}\) “by variable” as in (9.5), we can see that (9.6) says that OLS residuals sum to zero, and are orthogonal to each of the regressors: \[ X^\mathrm{T}\hat{\varepsilon}^{ols} = \begin{bmatrix} i_N^\mathrm{T} \\ \mathrm{x}_{1}^\mathrm{T} \\ \mathrm{x}_{2}^\mathrm{T} \\ \vdots \\ \mathrm{x}_{K-1}^\mathrm{T} \end{bmatrix}\hat{\varepsilon}^{ols} = \begin{bmatrix} i_N^\mathrm{T}\hat{\varepsilon}^{ols} \\ \mathrm{x}_{1}^\mathrm{T} \hat{\varepsilon}^{ols} \\ \mathrm{x}_{2}^\mathrm{T} \hat{\varepsilon}^{ols} \\ \vdots \\ \mathrm{x}_{K-1}^\mathrm{T}\hat{\varepsilon}^{ols} \end{bmatrix} = 0\,. \] Partitioning \(X\) by observation, as in (9.4), we can also write the OLS estimator as \[ \begin{aligned} \hat{\beta}^{ols} &= (X^\mathrm{T}X)^{-1}X^\mathrm{T}y \\ &= \left\{ \begin{bmatrix} x_1 & x_2 & \dots & x_N \end{bmatrix} \begin{bmatrix} x_1^\mathrm{T} \\ x_2^\mathrm{T} \\ \vdots \\ x_N^\mathrm{T} \end{bmatrix} \right\}^{-1} \begin{bmatrix} x_1 & x_2 & \dots & x_N \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} \\ &= \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1}\sum_{i=1}^N x_iY_i. \end{aligned} \] This form emphasizes the role that sample averages play in the estimation of \(\beta\): \[ \hat{\beta}^{ols} = \left(\frac{1}{N} \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \left(\frac{1}{N} \sum_{i=1}^N x_iY_i \right). \]
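
To make the matrix formulas concrete, here is a minimal R sketch (the simulated data, sample size and variable names are purely illustrative and not part of the text's examples) that computes \(\hat{\beta}^{ols}\) directly from \((X^\mathrm{T}X)^{-1}X^\mathrm{T}y\) and checks the result against lm().

set.seed(123)                           # simulated data, for illustration only
n  <- 100
X1 <- rnorm(n); X2 <- rnorm(n)
ys <- 1 + 2*X1 - 1*X2 + rnorm(n)        # true beta = (1, 2, -1)
Xs <- cbind(const=1, X1, X2)            # regressor matrix with a column of ones
bs <- solve(t(Xs)%*%Xs, t(Xs)%*%ys)     # solves (X'X) b = X'y
cbind(bs, coef(lm(ys ~ X1 + X2)))       # same estimates as lm()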

9.3 Algebraic Properties of OLS Estimators

We list here some algebraic properties of OLS estimators. These hold as long as Assumption D4 holds. From this point onwards, we drop the ‘OLS’ superscript from the OLS estimators, fitted values and residuals, and write \(\hat{\beta}\), \(\hat{y}\) and \(\hat{\varepsilon}\) for \(\hat{\beta}^{ols}\), \(\hat{y}^{ols}\) and \(\hat{\varepsilon}^{ols}\). We will reinstate the superscript whenever context demands it.

  1. OLS estimators are linear estimators: \(\hat{\beta} = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y = Ay\) means that each OLS estimator \(\hat{\beta}_k\), \(k=0,1,...,K-1\) can be written as \[ \hat{\beta}_k = \sum_{i=1}^N a_{k,i}Y_i \] where \(a_{k,i}, i=1,2,...,N\) are the elements of the \(k\mathrm{th}\) row of \(A = (X^\mathrm{T}X)^{-1}X^\mathrm{T}\).

  2. The OLS estimators can also be written as \[ \begin{aligned} \hat{\beta} &= (X^\mathrm{T}X)^{-1}X^\mathrm{T}y \\ &= (X^\mathrm{T}X)^{-1}X^\mathrm{T}(X\beta+\varepsilon) \\ &= \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon. \end{aligned} \]

  3. The OLS fitted values can be written as \[ \hat{y} = X\hat{\beta}=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}y. \] The matrix \(X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\) is sometimes called the ‘hat’ matrix (because it puts a ‘hat’ on \(y\)). It is also called the “Projection” matrix (since it projects \(y\) onto the space spanned by the columns of \(X\)) and denoted \(P\). It has the convenient property that it is symmetric: \[ \begin{aligned} P^\mathrm{T} &= (X(X^\mathrm{T}X)^{-1}X^\mathrm{T})^\mathrm{T} \\ &= (X^{\mathrm{T}\mathrm{T}}[(X^\mathrm{T}X)^{-1}]^\mathrm{T}X^\mathrm{T}) \\ &= X(X^\mathrm{T}X)^{-1}X^\mathrm{T} = P \end{aligned} \] where we have used the fact that \((X^\mathrm{T}X)^{-1}\) is symmetric (why is \((X^\mathrm{T}X)^{-1}\) symmetric?). The matrix \(P\) is also idempotent, meaning that \(PP=P\): \[ PP=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}X(X^\mathrm{T}X)^{-1}X^\mathrm{T}=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}=P. \] Symmetric and idempotent matrices have the convenient property that their rank is equal to their trace, which is easy to compute. Since \[ \mathrm{tr}(P) = \mathrm{tr}(X(X^\mathrm{T}X)^{-1}X^\mathrm{T}) = \mathrm{tr}(X^\mathrm{T}X(X^\mathrm{T}X)^{-1}) = \mathrm{tr}(I_K) = K, \] the rank of \(P\) is \(K\).

  4. The OLS residuals can be written as \[ \hat{\varepsilon} = y - \hat{y} = (I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T})y. \] The matrix \(I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\) is also symmetric and idempotent, and its trace, and therefore its rank, is \(N-K\) (see exercises). It is often denoted by \(M\), and has the property that it “eliminates \(X\)” in the sense that \[ MX = (I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T})X = X-X = 0. \] As a consequence of this, we have \[ MP = MX(X^\mathrm{T}X)^{-1}X^\mathrm{T} = 0. \] Of course, you can also see this from \(MP=(I-P)P = P-PP = P-P = 0\).

  5. We have already noted from the FOC that the OLS residuals sum to zero, and are orthogonal to each of the regressors. Since \(y = \hat{y} + \hat{\varepsilon}\), averaging both sides and using the fact that the residuals sum to zero gives \(\overline{Y} = \overline{\hat{Y}}\). Furthermore, \(\hat{y}^\mathrm{T}\hat{\varepsilon} = 0\). That is, the fitted values and the residuals are orthogonal. We can also use the fact that \(MP=PM=0\): \[ \hat{y}^\mathrm{T}\hat{\varepsilon} = y^\mathrm{T}PMy = 0. \] For those who are comfortable thinking about \(N\)-dimensional vectors in geometric terms, this means the fitted values and residuals are at “right-angles” in \(N\)-dimensional space. The length of a vector \(y\) is \(\sqrt{y^\mathrm{T}y}\). Using orthogonality of the fitted values and residuals, we get \[ \begin{aligned} y^\mathrm{T}y &= \hat{y}^\mathrm{T}\hat{y} + 2\hat{y}^\mathrm{T}\hat{\varepsilon} + \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} \\ &= \hat{y}^\mathrm{T}\hat{y} + \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}. \end{aligned} \tag{9.7}\] This is just Pythagoras’s Theorem (in \(N\)-dimensional space). (Several of the properties in this list are verified numerically in the sketch following the list.)

  6. One useful application of the fact that \(M\) eliminates \(X\) is to derive a formula linking the residuals to the noise terms. We have \[ \hat{\varepsilon} = My = M(X\beta+\varepsilon) = M\varepsilon \] This result, and the fact that \(M\) is symmetric and idempotent, means that the sum of squared residuals can be written as \[ \hat{\varepsilon}^\mathrm{T} \hat{\varepsilon} = (M\varepsilon)^\mathrm{T}M\varepsilon = \varepsilon^\mathrm{T}M^\mathrm{T}M\varepsilon = \varepsilon^\mathrm{T}M\varepsilon. \]
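
The following sketch verifies several of these algebraic properties numerically, reusing the illustrative simulated \(Xs\) and \(ys\) from the earlier sketch: the symmetry and idempotency of \(P\), its trace, the fact that \(M\) eliminates \(X\), and the Pythagorean decomposition (9.7).

P <- Xs %*% solve(t(Xs)%*%Xs) %*% t(Xs)      # projection ('hat') matrix
M <- diag(n) - P                             # annihilator matrix
all.equal(P, t(P))                           # P is symmetric
all.equal(P %*% P, P)                        # P is idempotent
sum(diag(P))                                 # trace of P equals K = 3
max(abs(M %*% Xs))                           # M eliminates X (zero up to rounding error)
yhat_s <- P %*% ys; ehat_s <- M %*% ys       # fitted values and residuals
c(sum(ys^2), sum(yhat_s^2) + sum(ehat_s^2))  # y'y = yhat'yhat + ehat'ehat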

9.4 Statistical Properties of OLS Estimators

Under Assumption Set D, \(\hat{\beta}\) is unbiased. Using \(\hat{\beta} = \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon\), we have \[ E[\hat{\beta}|X] = \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}E[\varepsilon|X] = \beta \] which implies \(E[\hat{\beta}]=\beta\).

The proof of unbiasedness uses Assumption D2 directly, and Assumptions D1 and D4 indirectly, but it does not make any use of Assumption D3. Unbiasedness of OLS estimators does not depend on the structure of the variance-covariance matrix of the noise terms.

The variances and covariances of all of the OLS coefficient estimators can be obtained by computing the (conditional) variance-covariance matrix of \(\hat{\beta}\): \[ \begin{aligned} var[\hat{\beta}|X] &= E[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])^\mathrm{T} | X] \\ &= E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^\mathrm{T} | X] \\ &= E[(X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon \varepsilon^\mathrm{T}X(X^\mathrm{T}X)^{-1} | X] \\ &= (X^\mathrm{T}X)^{-1}X^\mathrm{T}E[\varepsilon \varepsilon^\mathrm{T} | X]X(X^\mathrm{T}X)^{-1}. \end{aligned} \] In the general case \(E[\varepsilon \varepsilon^\mathrm{T} | X] = \Omega\), this becomes \[ var[\hat{\beta}|X] = (X^\mathrm{T}X)^{-1}X^\mathrm{T}\Omega X(X^\mathrm{T}X)^{-1}. \] To get any further we would have to put more structure on \(\Omega\). If we have uncorrelated but possibly heteroskedastic noise terms, then \[ \begin{aligned} &var[\hat{\beta}|X] \\ &= (X^\mathrm{T}X)^{-1}X^\mathrm{T}\mathrm{diag}(\sigma_1^2,\sigma_2^2,\dots,\sigma_N^2) X(X^\mathrm{T}X)^{-1} \\ &= \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \begin{bmatrix} x_1 & x_2 & \dots & x_N \end{bmatrix} \begin{bmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_N^2 \end{bmatrix} \begin{bmatrix} x_1^\mathrm{T} \\ x_2^\mathrm{T} \\ \vdots \\ x_N^\mathrm{T}\end{bmatrix} \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \\ &= \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \left( \sum_{i=1}^N \sigma_i^2 x_i x_i^\mathrm{T}\right) \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \\ \end{aligned}. \tag{9.8}\] If we further assume conditional homoskedasticity as in Assumption D3, then \(var[\hat{\beta}|X]\) further simplifies to \[ \begin{aligned} var[\hat{\beta}|X] &= \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \left( \sum_{i=1}^N \sigma^2 x_i x_i^\mathrm{T}\right) \left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \\ &= \sigma^2\left( \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \\ &= \sigma^2(X^\mathrm{T}X)^{-1}. \end{aligned} \tag{9.9}\] Under conditional homoskedasticity, an unbiased estimator for \(\sigma^2\) is \[ \begin{aligned} \widehat{\sigma^2} = \frac{\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}}{N-K} = \frac{\sum_{i=1}^N \hat{\epsilon}_i^2}{N-K}. \end{aligned} \] To prove unbiasedness of this estimator, we note that \[ \begin{aligned} E[\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} | X] &= E[\varepsilon^\mathrm{T}M\varepsilon | X] = E[\mathrm{tr}(\varepsilon^\mathrm{T}M\varepsilon) | X] \\ &= E[\mathrm{tr}(\varepsilon\varepsilon^\mathrm{T}M) | X] = \mathrm{tr}(E[\varepsilon\varepsilon^\mathrm{T}M | X]) \\ &= \mathrm{tr}(E[\varepsilon\varepsilon^\mathrm{T} | X]M) = \mathrm{tr}(\sigma^2M) \\ &= \sigma^2 (N-K). \end{aligned} \] This implies that \(E[\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} ] = \sigma^2(N-K)\) and unbiasedness of \(\widehat{\sigma^2}\) follows. We therefore estimate \(var[\hat{\beta}]\) using \[ \widehat{var}[\hat{\beta} | X] = \widehat{\sigma^2}(X^\mathrm{T}X)^{-1}. \]
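
The following small Monte Carlo experiment, again based on the illustrative simulated regressors \(Xs\) (with \(\sigma^2 = 1\)), demonstrates the unbiasedness results numerically: holding the regressors fixed, we repeatedly redraw the noise, re-estimate, and average the estimates of \(\beta\) and \(\sigma^2\).

nrep <- 2000                                 # number of Monte Carlo replications
bstore  <- matrix(0, nrep, 3)
s2store <- numeric(nrep)
XsXsinv <- solve(t(Xs)%*%Xs)
for (m in 1:nrep){
  ysim <- Xs %*% c(1, 2, -1) + rnorm(n)      # redraw homoskedastic noise, sigma^2 = 1
  b    <- XsXsinv %*% t(Xs) %*% ysim
  bstore[m,]  <- b
  s2store[m]  <- sum((ysim - Xs %*% b)^2)/(n - 3)
}
colMeans(bstore)                             # approximately (1, 2, -1)
mean(s2store)                                # approximately 1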

Example 9.1 In the simple linear regression \(Y_i = \beta_0 + \beta_1 X_{1,i} + \epsilon_i\), \(i=1,2,...,N\), we have \[ X = \begin{bmatrix} 1 & X_{1,1} \\ 1 & X_{1,2} \\ \vdots & \vdots \\ 1 & X_{1,N} \end{bmatrix} \quad \text{and} \quad X^\mathrm{T}X = \begin{bmatrix} N & \sum_{i=1}^N X_{1,i} \\ \sum_{i=1}^N X_{1,i} & \sum_{i=1}^N X_{1,i}^2 \end{bmatrix}\,. \] The OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) can be found from \[ \hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y \,. \] If Assumption Set D holds, their variances and covariances can be found from \[ var[\hat{\beta}|X] = \begin{bmatrix} var[\hat{\beta}_0|X] & cov[\hat{\beta}_0,\hat{\beta}_1|X] \\ cov[\hat{\beta}_0, \hat{\beta}_1|X] & var[\hat{\beta}_1|X] \end{bmatrix} = \sigma^2(X^\mathrm{T}X)^{-1}. \]

We will discuss estimation of the variance-covariance matrix under conditional heteroskedasticity when we discuss asymptotic properties of OLS estimators. In the meantime, we continue our discussion under the assumption of conditional homoskedasticity.

Under Assumption Set D (with conditional homoskedasticity), the OLS estimators are the most efficient estimators among all linear unbiased estimators, i.e., \[ var[c^\mathrm{T}\hat{\beta}|X] \leq var[c^\mathrm{T}\tilde{\beta}|X] \tag{9.10}\] for any \((K \times 1)\) vector \(c\) and all unbiased estimators of the form \(\tilde{\beta} = By\). In other words, each individual \(\hat{\beta}_k\) has the smallest variance among all linear unbiased estimators of \(\beta_k\), and any linear combination of \(\hat{\beta}\) will have a variance no larger than that of the same linear combination of any other linear unbiased estimator of \(\beta\).

Example 9.2 Suppose \(Y_i = x_i^\mathrm{T}\beta + \epsilon_i\), with Assumption Set D holding. The prediction of \(Y_i\) at \(x_i = x_0\) is \(\hat{Y}(x_0)= x_0^\mathrm{T}\hat{\beta}\). This is a linear combination of the estimators in \(\hat{\beta}\). Since the OLS estimators \(\hat{\beta}\) are most efficient among all linear unbiased estimators, this prediction rule gives us the most precise linear unbiased prediction of \(Y\) at \(x=x_0\).

To prove (9.10), let \(\tilde{\beta} = By\), where \(B\) may depend on constants and elements of \(X\) but not on \(y\), and suppose \(\tilde{\beta}\) is unbiased. Write \(B = D + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\), so \[ \tilde{\beta} = DX\beta+D\varepsilon + \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon. \] We have already assumed \(E[\varepsilon|X]=0\). To ensure unbiasedness of \(\tilde{\beta}\) we must also have \(DX=0\), so we impose this condition. Then \[ \begin{aligned} var[\tilde{\beta}|X] &= E[(\tilde{\beta}-\beta)(\tilde{\beta}-\beta)^\mathrm{T}|X] \\ &= E[(D+(X^\mathrm{T}X)^{-1}X^\mathrm{T})\varepsilon \varepsilon^\mathrm{T}(D+(X^\mathrm{T}X)^{-1}X^\mathrm{T})^\mathrm{T}|X] \\ &= \sigma^2[(X^\mathrm{T}X)^{-1}+DD^\mathrm{T}] \\ &= var[\hat{\beta}|X] + \sigma^2DD^\mathrm{T} \,, \end{aligned} \] where the cross terms vanish because \(DX=0\). Result (9.10) follows immediately, since \[ var[c^\mathrm{T}\tilde{\beta}|X] - var[c^\mathrm{T}\hat{\beta}|X] = \sigma^2c^\mathrm{T}DD^\mathrm{T}c = \sigma^2(D^\mathrm{T}c)^\mathrm{T}(D^\mathrm{T}c) \geq 0. \]
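
To see the efficiency result at work numerically, consider the linear unbiased estimator that applies OLS to only the first half of the illustrative simulated sample above; its conditional variance is \(\sigma^2(X_1^\mathrm{T}X_1)^{-1}\), where \(X_1\) holds the first \(n/2\) rows of the regressor matrix. A minimal check (illustrative only) that full-sample OLS has the smaller variance:

Xh <- Xs[1:(n/2), ]                          # regressors for the first half of the sample only
V_ols  <- solve(t(Xs)%*%Xs)                  # var[bhat|X] divided by sigma^2 (full-sample OLS)
V_half <- solve(t(Xh)%*%Xh)                  # variance of the half-sample estimator, divided by sigma^2
round(diag(V_half)/diag(V_ols), 2)           # each ratio is at least 1
eigen(V_half - V_ols, symmetric=TRUE)$values # all non-negative: the difference is positive semi-definite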

9.5 Hypothesis Testing

Suppose in addition to (D1)-(D4) we also assume that the noise terms are normally distributed. We can write this as

(D5) \(\varepsilon | X \sim \text{Normal}_N(0, \sigma^2 I)\).

As written, assumption D5 actually subsumes D2 and D3, but we will keep D2, D3, and D5 as separate statements. Obviously if you have conditional heteroskedasticity or correlation in the noise terms, then the variance expression in D5 must be modified accordingly.

With assumption D5, we have \[ \hat{\beta}|X \sim \text{Normal}_K(\beta, \sigma^2(X^\mathrm{T}X)^{-1}) \tag{9.11}\] since \(\hat{\beta}\) is a constant plus a linear combination of normally distributed noise terms. This can be used to develop t- and F-tests of linear hypotheses concerning elements of \(\beta\). A general single linear hypothesis can be written as \[ H_0: r^\mathrm{T}\beta = r_0 \quad \text{vs} \quad H_A: r^\mathrm{T}\beta \neq r_0. \]

Example 9.3 To test \(\beta_1 + \beta_2 = 1\) in the regression \[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i, \] set \(r^\mathrm{T}=\begin{bmatrix} 0 & 1 & 1 \end{bmatrix}\) and \(r_0 = 1\).

From (9.11), we have \[ r^\mathrm{T}\hat{\beta} | X \sim \text{Normal}(r^\mathrm{T}\beta,r^\mathrm{T}[\sigma^2 (X^\mathrm{T}X)^{-1}]r). \] If the null hypothesis \(r^\mathrm{T}\beta = r_0\) holds, then \[ r^\mathrm{T}\hat{\beta} | X \sim \text{Normal}(r_0,r^\mathrm{T}[\sigma^2 (X^\mathrm{T}X)^{-1}]r) \] and \[ \frac{r^\mathrm{T}\hat{\beta} - r_0}{\sqrt{r^\mathrm{T}[\sigma^2 (X^\mathrm{T}X)^{-1}]r}} \sim \text{Normal}(0,1). \] Furthermore, it can be shown that if we replace \(\sigma^2\) with \(\widehat{\sigma^2}\), then \[ \begin{aligned} t &= \frac{r^\mathrm{T}\hat{\beta} - r_0}{\sqrt{ r^\mathrm{T}[\widehat{\sigma^2}(X^\mathrm{T}X)^{-1}]r}} \\ &= \frac{r^\mathrm{T}\hat{\beta} - r_0}{\sqrt{ r^\mathrm{T}\widehat{var}[\hat{\beta}|X]r}} \sim \text{t}_{(N-K)}. \end{aligned} \tag{9.12}\] This can be used to test the hypothesis \(H_0: r^\mathrm{T}\beta = r_0\) in the usual way.

To test multiple hypotheses jointly, write the hypotheses as \[ H_0: R\beta = r_0 \quad \text{vs} \quad H_A: R\beta \neq r_0 \] where now \(R\) is a \((J \times K)\) matrix, and \(r_0\) is a \((J \times 1)\) vector.

Example 9.4 To test the hypotheses \[ H_0: \beta_1 + \beta_2 = 1 \; \text{and} \; \beta_3 = 0 \quad \text{vs} \quad H_A: \beta_1 + \beta_2 \neq 1 \; \text{or} \; \beta_3 \neq 0 \; \text{(or both)}, \] set the matrices \(R\) and \(r_0\) to \[ R = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad r_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \]

To test multiple hypotheses jointly, we can again compare the sum of squared residuals from the restricted and unrestricted regressions, as explained in a previous chapter.

Example 9.5 We continue with Example 9.4. The restricted regression is \[ \begin{aligned} Y_i &= \beta_0 + \beta_1X_{1,i} + (1-\beta_1) X_{2,i} + 0X_{3,i} + \epsilon_i \\ &= \beta_0 + \beta_1(X_{1,i}-X_{2,i}) + X_{2,i} + \epsilon_i. \end{aligned} \] To estimate this, regress \(Y_i - X_{2,i}\) on a constant and \(X_{1,i}-X_{2,i}\). Then calculate the restricted OLS residuals \[ \hat{\epsilon}_{i,r} = Y_i - \hat{\beta}_{0,r} - \hat{\beta}_{1,r}(X_{1,i}-X_{2,i}) - X_{2,i} \] and finally calculate \(SSR_r = \sum_{i=1}^N \hat{\epsilon}_{i,r}^2 = \hat{\varepsilon}_r^\mathrm{T}\hat{\varepsilon}_r\), where \(\hat{\varepsilon}_r\) is the vector of restricted OLS residuals.

It can be shown that if the null is true, then \[ F = \frac{(\hat{\varepsilon}_r^\mathrm{T}\hat{\varepsilon}_r - \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}) / J} {\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} / (N-K)} \sim \text{F}_{(J,N-K)} \tag{9.13}\] where \(J\) is the number of restrictions being tested and \(\hat{\varepsilon}\) is the vector of unrestricted OLS residuals. You would reject \(H_0\) if the \(F\)-statistic is “improbably large”, meaning that \(F > F_{\alpha, J, N-K}\), where \(F_{\alpha, J, N-K}\) is the \((1-\alpha)\) quantile of the \(\mathrm{F}_{J,N-K}\) distribution. Typically \(\alpha\) = 0.01, 0.05 or 0.1.
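
As a concrete illustration of (9.13), the sketch below uses simulated data (all names and numbers here are illustrative, not the text's dataset) to test the joint hypothesis of Examples 9.4 and 9.5 by running the restricted and unrestricted regressions and comparing their sums of squared residuals; the linearHypothesis() function from the car package reproduces the same F-statistic and p-value.

set.seed(42)                                     # simulated data, for illustration only
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
yy <- 2 + 0.7*x1 + 0.3*x2 + 0*x3 + rnorm(n)      # H0: beta1 + beta2 = 1 and beta3 = 0 is true
unres <- lm(yy ~ x1 + x2 + x3)                   # unrestricted regression
res   <- lm(I(yy - x2) ~ I(x1 - x2))             # restricted regression, as in Example 9.5
SSR_u <- sum(resid(unres)^2)
SSR_r <- sum(resid(res)^2)
Fstat <- ((SSR_r - SSR_u)/2) / (SSR_u/(n - 4))   # J = 2 restrictions, K = 4 coefficients
c(Fstat, 1 - pf(Fstat, 2, n - 4))                # F-statistic and its p-value
linearHypothesis(unres, c("x1 + x2 = 1", "x3 = 0"))  # same F and p-value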

In practice you do not have to compute the restricted regression. It can be shown that \[ \hat{\varepsilon}_r^\mathrm{T}\hat{\varepsilon}_r - \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} = (R\hat{\beta}-r_0)^\mathrm{T}[R(X^\mathrm{T}X)^{-1}R^\mathrm{T}]^{-1}(R\hat{\beta}-r_0) \tag{9.14}\] where \(\hat{\beta}\) is the vector of unrestricted OLS estimators (we show this in the Appendix). Furthermore, since the denominator of the \(F\)-statistic is \(\widehat{\sigma^2}\), we can write the \(F\)-statistic as \[ \begin{aligned} F &= (R\hat{\beta}-r_0)^\mathrm{T} [R(\widehat{\sigma^2}(X^\mathrm{T}X)^{-1})R^\mathrm{T}]^{-1} (R\hat{\beta}-r_0)/J \\ &= (R\hat{\beta}-r_0)^\mathrm{T} [R\,\widehat{\text{var}}[\hat{\beta}|X]\,R^\mathrm{T}]^{-1} (R\hat{\beta}-r_0)/J \\ &\sim \text{F}_{(J,N-K)}. \end{aligned} \tag{9.15}\]

Example 9.6 We use data in earnings.xlsx to estimate the equation \[ \ln(earnings_{i}) = \beta_0 + \beta_1 height_{i} + \beta_2 male_{i} + \beta_3 wexp_{i} + \beta_4 tenure_{i} + \epsilon_{i}. \] We compute the OLS estimates and associated statistics from the formulas derived in this chapter. The first four rows of the \(X\) matrix are

df_earnings <- read_excel("data\\earnings.xlsx")
y <- as.matrix(log(df_earnings$earnings))
N <- length(y)
X <- as.matrix(cbind("const"=rep(1,length(y)), 
                     df_earnings[,c('height','male','wexp','tenure')]))
K <- dim(X)[2]
head(X,4)
     const height male      wexp   tenure
[1,]     1     67    1 22.384610 2.750000
[2,]     1     67    1  8.903846 2.384615
[3,]     1     69    1 13.250000 5.750000
[4,]     1     72    1 18.250000 6.134615

The following code implements the formulas derived earlier.

XTX <- t(X)%*%X # t() for transpose
XTXinv <- solve(XTX) # solve() to get inverse
XTy <- t(X)%*%y
bhat  <- solve(XTX,XTy)        # beta_hat; alternatively XTXinv %*% XTy, but solve(XTX, XTy) is preferred
yhat  <- X%*%bhat              # fitted values
ehat  <- y - yhat              # residuals
s2hat <- sum(ehat^2)/(N-K)     # or t(ehat)%*%ehat
varbhat <- s2hat*XTXinv        # var(bhat)
sebhat  <- sqrt(diag(varbhat)) # se(bhat)
tbhat   <- bhat/sebhat
pval    <- 2*(1-pt(abs(tbhat),N-K))
star             <- rep("", K)
star[pval<0.1] <- "."
star[pval<0.05] <- "*"
star[pval<0.01] <- "**"
star[pval<0.001] <- "***"
Results <- data.frame("Estimate"=bhat, "Std. Err."=sebhat, "t-stat"=tbhat, "p-val"=pval, star)
names(Results)[K] <- ""
print(Results, digits=6, right=FALSE)
cat("---\nSignif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n")
       Estimate   Std..Err.  t.stat   p.val          
const  0.98423994 0.53948678 1.824400 0.068649145 .  
height 0.02271202 0.00824686 2.754021 0.006087062 ** 
male   0.17052245 0.07035690 2.423678 0.015694822 *  
wexp   0.00541464 0.00585844 0.924245 0.355775183    
tenure 0.01335911 0.00397157 3.363681 0.000824198 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As an illustration, we test the hypothesis that \(\beta_3 = \beta_4\), i.e., \(\beta_3 - \beta_4 = 0\).

r <- matrix(c(0,0,0,1,-1), ncol=1)
tstat <- t(r)%*%bhat / sqrt(t(r)%*%varbhat%*%r)   # 'tstat' rather than 't', to avoid masking t()
tpval <- 2*(1-pt(abs(tstat),N-K))
cat("H0 wexp = tenure: t-stat ",round(tstat,4),", pval", round(tpval,6))
H0 wexp = tenure: t-stat  -0.9779 , pval 0.328545

We do not reject the hypothesis. To jointly test the hypotheses \(\beta_3 - \beta_4 = 0\) and \(\beta_1=0\), we use the F-test, with \[ R = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{bmatrix} \quad \text{and} \quad r_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

R <- matrix(c(0,1,0,0,0,0,0,0,1,-1), nrow=2, byrow=TRUE)
r0 <- matrix(c(0,0),2,1)
J <- length(r0)
Rb <- R %*% bhat - r0
Fstat <- t(Rb) %*% solve(R%*%varbhat%*%t(R)) %*% Rb / J   # 'Fstat' rather than 'F', to avoid masking F (= FALSE)
Fpval <- (1-pf(Fstat,J,N-K))
cat("H0 height = 0 and wexp = tenure: F-stat ",round(Fstat,4),", pval", round(Fpval,6))
H0 height = 0 and wexp = tenure: F-stat  4.3802 , pval 0.012975

The following are the results using the lm() function for the regression, and the linearHypothesis() function from the car package for the general t- and F-tests.

mdl <- lm(log(earnings)~height+male+wexp+tenure, data=df_earnings)
summary(mdl)

Call:
lm(formula = log(earnings) ~ height + male + wexp + tenure, data = df_earnings)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.18146 -0.35799 -0.02521  0.32146  2.13918 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.984240   0.539487   1.824 0.068649 .  
height      0.022712   0.008247   2.754 0.006087 ** 
male        0.170522   0.070357   2.424 0.015695 *  
wexp        0.005415   0.005858   0.924 0.355775    
tenure      0.013359   0.003972   3.364 0.000824 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5528 on 535 degrees of freedom
Multiple R-squared:  0.1244,    Adjusted R-squared:  0.1179 
F-statistic: 19.01 on 4 and 535 DF,  p-value: 1.255e-14
linearHypothesis(mdl,c('wexp=tenure'))  ## Testing one hypothesis
Linear hypothesis test

Hypothesis:
wexp - tenure = 0

Model 1: restricted model
Model 2: log(earnings) ~ height + male + wexp + tenure

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    536 163.77                           
2    535 163.48  1   0.29223 0.9564 0.3285
linearHypothesis(mdl,c('height=0','wexp=tenure'))   ## Testing two hypotheses jointly
Linear hypothesis test

Hypothesis:
height = 0
wexp - tenure = 0

Model 1: restricted model
Model 2: log(earnings) ~ height + male + wexp + tenure

  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    537 166.15                              
2    535 163.48  2    2.6769 4.3802 0.01297 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the linearHypothesis() function returns an F-statistic even when testing a single hypothesis. It can be shown (see exercises) that the F-statistic for a test of a single hypothesis is the square of the corresponding t-statistic. The two tests will produce identical p-values.

9.6 Asymptotic Properties

In addition to showing consistency and asymptotic normality of OLS estimators, the asymptotic properties discussed in this section deal with two additional issues: how do we do hypothesis testing if we cannot assume that the noise terms are normally distributed? And how do we estimate the variance-covariance matrix of \(\hat{\beta}\) if there is conditional heteroskedasticity? (We have up to this point in the chapter only discussed estimation of \(var[\hat{\beta}]\) under conditional homoskedasticity. We briefly mentioned heteroskedasticity-robust standard errors in previous chapters; we develop the theory in this chapter.)

Recall that a “Law of Large Numbers” gives conditions under which a sample mean converges in probability to the true mean, e.g., Khinchine’s LLN says that if \(\{Z_i\}_{i=1}^N\) are iid with mean \(E[Z_i]=\mu < \infty\), then \(\overline{Z} \rightarrow_p \mu\). A “Central Limit Theorem” gives conditions that guarantee convergence in distribution, e.g., if \(\{Z_i\}_{i=1}^N\) are iid with \(E[Z_i]=\mu < \infty\) and \(var[Z_i]=\sigma^2 < \infty\), then \[ \sqrt{N}(\overline{Z} - \mu) \rightarrow_d \text{Normal}(0,\sigma^2). \] In particular, if \(\mu = 0\), then \[ \sqrt{N}\,\overline{Z} = \frac{1}{\sqrt{N}}\sum_{i=1}^N Z_i \rightarrow_d \text{Normal}(0,\sigma^2). \]
In this section, we will use multivariate versions of these rules. First we extend the definition of convergence in probability and convergence in distribution to sequences of vectors and matrices of random variables: convergence in probability of a sequence of \((M \times K)\)-dimensional matrices of random variables means convergence in probability element-by-element. Convergence in distribution of a sequence of \((K \times 1)\) vectors of random variables means convergence to some multivariate distribution. We have the following results:

  • If \(A_N\) is \((M \times K )\) such that \(A_N \rightarrow_p A\) and \(g(.)\) is a continuous function, then \(g(A_N) \rightarrow_p g(A)\).
  • If \(A_N\) is \((M \times K )\) such that \(A_N \rightarrow_p A\), and \(\overline{Z}_N\) is \((K \times 1 )\) such that \(\overline{Z}_N \rightarrow_p \mu\), then \(A_N \overline{Z}_N \rightarrow_p A \mu\).
  • If \(A_N \rightarrow_p A\) and \(\overline{Z}_N \rightarrow_d Z\) (meaning that the distribution of \(\overline{Z}_N\) converges to the distribution of \(Z\)), then \(A_N \overline{Z}_N \rightarrow_d AZ\). In the special case that \(Z \sim \text{Normal}_K(0,\Sigma)\), we have \[ A_N \overline{Z}_N \rightarrow_d \text{Normal}_M(0,A\Sigma A^\mathrm{T}). \]

Now let \(Z_i\) be \((K\times 1)\) and let \(\{Z_i\}_{i=1}^N\) be an iid sample with \(E[Z_i] = \mu\) for all \(i\). Let \(\overline{Z}_N\) be the sample mean computed from a sample of size \(N\). Then \[ \overline{Z} = \frac{1}{N}\sum_{i=1}^N Z_i = \begin{bmatrix} (1/N)\sum_{i=1}^N Z_{1,i} \\ (1/N)\sum_{i=1}^N Z_{2,i} \\ \vdots \\ (1/N)\sum_{i=1}^N Z_{K,i} \end{bmatrix} \rightarrow_p \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_K \end{bmatrix} = \mu\,. \] If \(\{Z_i\}_{i=1}^N\) is iid with \(E[Z_i] = 0\) and \(var[Z_i] = \Sigma\) for all \(i\), then \[ \sqrt{N}\,\overline{Z} = \frac{1}{\sqrt{N}}\sum_{i=1}^N Z_i = \begin{bmatrix} (1/\sqrt{N})\sum_{i=1}^N Z_{1,i} \\ (1/\sqrt{N})\sum_{i=1}^N Z_{2,i} \\ \vdots \\ (1/\sqrt{N})\sum_{i=1}^N Z_{K,i} \end{bmatrix} \rightarrow_d \text{Normal}_K(0,\Sigma)\,. \] If \(Z \sim \text{Normal}_K(0,\Sigma)\), then we write \(\sqrt{N}\,\overline{Z} \rightarrow_d Z\).
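
The short simulation below (purely illustrative) checks the vector CLT numerically: the \(Z_i\) are iid and deliberately non-normal, yet the sampling variance of \(\sqrt{N}\,\overline{Z}\) is close to \(\Sigma\) and its quantiles are close to the corresponding normal quantiles.

set.seed(1)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)               # variance matrix of Z_i
U <- chol(Sigma)                                       # U'U = Sigma
nrep <- 5000; n <- 500
zbar <- t(replicate(nrep, {
  W <- matrix(rexp(2*n) - 1, n, 2)                     # iid entries, mean 0, variance 1, skewed
  sqrt(n) * colMeans(W %*% U)                          # rows of W %*% U have variance matrix Sigma
}))
round(var(zbar), 2)                                    # close to Sigma
round(quantile(zbar[,2], c(0.05, 0.95)), 2)            # close to Normal(0, Sigma[2,2]) quantiles
round(qnorm(c(0.05, 0.95), sd=sqrt(Sigma[2,2])), 2)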

We will work with the following form of \(\hat{\beta}\):
\[ \begin{aligned} \hat{\beta} &= \left(\frac{1}{N} \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \left(\frac{1}{N} \sum_{i=1}^N x_iY_i \right) \\ &= \beta + \left(\frac{1}{N} \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \left(\frac{1}{N} \sum_{i=1}^N x_i\epsilon_i \right). \end{aligned} \tag{9.16}\] The asymptotic properties of the OLS estimators depend on what happens to the sample means in (9.16) as \(N \rightarrow \infty\). We state a set of assumptions that is a slightly modified version of the basic assumptions we have been using.

Assumption Set E

(E1) The sample \(\{Y_i, X_{1,i}, X_{2,i}, \dots, X_{K-1,i}\}_{i=1}^N\) comprises iid draws from a \(K\)-dimensional joint distribution with the following properties:

(E2) \(E[x_ix_i^\mathrm{T}] = \Sigma_{XX}\) is finite and non-singular,

(E3) \(E[\epsilon_i | x_i] = 0\) where \(\epsilon_i = Y_i - x_i^\mathrm{T}\beta\) for some constant vector \(\beta\), and

(E4) \(E[\epsilon_i^2 x_i x_i^\mathrm{T}] = S\) is finite and non-singular.

Remarks:

  • Assumption E1 is the random sampling assumption. Since the draws are iid, the random variables in the \((K \times K)\) matrix \(x_i x_i^\mathrm{T}\) and the \((K \times 1)\) vector \(x_i \epsilon_i\) are also iid over \(i\).
  • Assumption E2 is the population analogue of the no-perfect-collinearity assumption: no regressor is an exact linear combination of the other regressors (and the constant) in the population. Together with the random sampling assumption, it guarantees that \[ \frac{1}{N}\sum_{i=1}^N x_i x_i^\mathrm{T} \] converges in probability to a limit that is finite and has an inverse.
  • The assumption E3 says a number of things: first, \(\epsilon_i\) is zero mean for all \(i\). Second, it implies that \(\epsilon_i\) is uncorrelated with each of the \(K-1\) regressors in \(x_i\), i.e., \(E[\epsilon_i x_i]=0\). This, and the random sampling assumption, means that \[ \frac{1}{N}\sum_{i=1}^N x_i \epsilon_i \rightarrow_p 0. \] The implication \(E[\epsilon_i x_i]=0\) of Assumption E3 is the key to obtaining consistent OLS coefficient estimators.
  • Since \(E[\epsilon_i x_i]=0\), the expectation \(E[\epsilon_i^2 x_i x_i^\mathrm{T}]\) in Assumption E4 is the variance-covariance matrix of \(x_i \epsilon_i\).
  • The assumptions impose unconditional homoskedasticity, but allow for conditional heteroskedasticity. Since \(\{Y_i, X_{1,i}, X_{2,i}, \dots, X_{K-1,i}\}_{i=1}^N\) are iid draws, \(\{\epsilon_i\}_{i=1}^N\) are also iid. The “identically distributed” part implies that \(var[\epsilon_i]\) does not vary with \(i\), so there is unconditional homoskedasticity. However, we allow for conditional heteroskedasticity, i.e., the conditional variance may depend on the regressors. For example, suppose \(E[\epsilon_i^2| x_i] = \sigma^2 X_{1,i}^2\), so the noise variance depends on \(X_{1,i}\) (there is conditional heteroskedasticity). Nonetheless, \(E[\epsilon_i^2] = E[E[\epsilon_i^2| x_i]] = \sigma^2 E[X_{1,i}^2]\), which does not vary with \(i\) because the \(X_{1,i}\) are identically distributed.
  • We do not assume that the noise terms are normally distributed.

Given Assumption Set E, \(\hat{\beta}\) is consistent: \[ \hat{\beta} = \beta + \underbrace{\left(\frac{1}{N} \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1}} _{\rightarrow_p \; \Sigma_{XX}^{-1}} \underbrace{\left(\frac{1}{N} \sum_{i=1}^N x_i\epsilon_i \right)} _{\rightarrow_p \; 0} \; \rightarrow_p \,\beta \] This is basically the general version of the simple linear regression argument that \[ \hat{\beta}_1 = \beta_1 + \frac{\text{sample cov}(X_i, \epsilon_i)}{\text{sample var}(X_i)} \rightarrow_p \, \beta_1 \] if \(cov[X_i,\epsilon_i]=0\) and \(var[X_i]\) is finite, and if their sample counterparts converge in probability to them.
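
A quick numerical illustration of consistency (simulated and purely illustrative): even when the noise is conditionally heteroskedastic, the OLS estimates settle down around the true coefficients as the sample size grows.

set.seed(11)
for (n in c(100, 1000, 10000, 100000)){
  x1  <- rnorm(n)
  eps <- rnorm(n, sd=abs(x1))                # conditionally heteroskedastic noise with E[eps|x1] = 0
  yy  <- 1 + 2*x1 + eps
  print(round(coef(lm(yy ~ x1)), 3))         # approaches (1, 2) as n grows
}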

Since \(\hat{\beta}\) is consistent, its distribution is essentially degenerate in the limit. To talk about limiting distributions, we have to scale \(\hat{\beta}\). We use \[ \sqrt{N}(\hat{\beta} - \beta) = \left(\frac{1}{N} \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1} \left(\frac{1}{\sqrt{N}} \sum_{i=1}^N x_i\epsilon_i \right). \] Our assumptions and the CLT imply \[ \frac{1}{\sqrt{N}} \sum_{i=1}^N x_i\epsilon_i \rightarrow_d \text{Normal}_K(0,S), \] therefore \[ \sqrt{N}(\hat{\beta} - \beta) = \underbrace{\left(\frac{1}{N} \sum_{i=1}^N x_i x_i^\mathrm{T}\right)^{-1}} _{\rightarrow_p \; \Sigma_{XX}^{-1}} \underbrace{\left(\frac{1}{\sqrt{N}} \sum_{i=1}^N x_i\epsilon_i \right)} _{\rightarrow_d \; \text{Normal}_K(0,S)} \rightarrow_d \text{Normal}_K(0,\Sigma_{XX}^{-1}S\Sigma_{XX}^{-1}). \] That is, \(\sqrt{N}(\hat{\beta} - \beta)\) is asymptotically normal, and \(\hat{\beta}\) has asymptotic variance \(\text{avar}[\hat{\beta}]=\Sigma_{XX}^{-1}S\Sigma_{XX}^{-1}\). This result justifies the approximation \[ var[\hat{\beta}] \approx (1/N)\Sigma_{XX}^{-1}S\Sigma_{XX}^{-1}. \] An obvious estimator for \(\Sigma_{XX}\) is \[ \hat{\Sigma}_{XX}=\frac{1}{N}\sum_{i=1}^N x_ix_i^\mathrm{T} = (1/N)X^\mathrm{T}X\,. \] Some additional assumptions (see advanced econometrics textbooks) guarantee \[ \hat{S} = \frac{1}{N}\sum_{i=1}^N \hat{\epsilon}_i^2x_ix_i^\mathrm{T} \; \rightarrow_p \,S. \tag{9.17}\] This allows us to consistently estimate the asymptotic variance of \(\hat{\beta}\) by \[ \widehat{\text{avar}}[\hat{\beta}] = \left(\frac{1}{N}\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^N \hat{\epsilon}_i^2x_ix_i^\mathrm{T}\right) \left(\frac{1}{N}\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1}, \] and justifies the use of \[ \begin{aligned} \widehat{\text{var}}_{HC0}[\hat{\beta}] &= \frac{1}{N}\widehat{\text{avar}}[\hat{\beta}] \\ &= \frac{1}{N} \left(\frac{1}{N}\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^N \hat{\epsilon}_i^2x_ix_i^\mathrm{T}\right) \left(\frac{1}{N}\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \\ &= \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \left(\sum_{i=1}^N \hat{\epsilon}_i^2x_ix_i^\mathrm{T}\right) \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \\ \end{aligned} \tag{9.18}\] as an estimator for the variance of \(\hat{\beta}\). The variance estimator in (9.18) is a “Heteroskedasticity-Consistent Variance Estimator”. There are several versions. The version presented in (9.18) is often referred to as “HC0”, which is why we label it as such. Other versions will be discussed in the exercises. Because of its form, (9.18) is often called a “sandwich” estimator.

If there is conditional heteroskedasticity in the noise terms, the usual OLS variance estimator \(\widehat{\sigma^2}(X^\mathrm{T}X)^{-1}\) is not appropriate since \(\widehat{\sigma^2}((1/N)X^\mathrm{T}X)^{-1}\) is not a consistent estimator for the asymptotic variance. On the other hand, the variance estimator (9.18) remains consistent even if in fact the noise terms are conditionally homoskedastic. In this sense it is safer to use (9.18) for estimating the estimator variance if there is any possibility of conditional heteroskedasticity.

We have already noted that OLS estimators are efficient if there is conditional homoskedasticity. If there is conditional heteroskedasticity, then OLS estimators are no longer efficient. (The formula (9.18) allows us to estimate the estimator variance consistently, but doesn’t do anything about the efficiency of \(\hat{\beta}\) itself.) We have already addressed how we might try to get efficient estimators in the previous chapter.

We can use the heteroskedasticity consistent variance estimator to construct heteroskedasticity robust \(t\) and \(F\) statistics, by replacing \(\widehat{\text{var}}[\hat{\beta}|X]\) in (9.12) and (9.15) with the heteroskedasticity robust variance estimator in (9.18).

Example 9.7 We compute below the HC0 Heteroskedasticity Robust covariance matrix for the regression in Example 9.6.

hatS <- matrix(0, K, K)       # accumulates sum of ehat_i^2 * x_i x_i'
for (i in 1:N){
  xi <- as.matrix(X[i,])      # x_i as a (K x 1) column vector
  hatS <- hatS + ehat[i]^2*xi%*%t(xi)
}
varhat_HC0 <- XTXinv %*% hatS %*% XTXinv ## XTXinv computed earlier
round(varhat_HC0,6)
           const    height      male      wexp    tenure
const   0.281450 -0.004322  0.024655 -0.000212  0.000364
height -0.004322  0.000069 -0.000390 -0.000004 -0.000005
male    0.024655 -0.000390  0.004595 -0.000040  0.000033
wexp   -0.000212 -0.000004 -0.000040  0.000033 -0.000006
tenure  0.000364 -0.000005  0.000033 -0.000006  0.000013

The function hccm() from the car package also calculates this, as does the vcovHC() function from the sandwich package. We use the latter.

varhat_HC0_v2 <- vcovHC(mdl,type="HC0")
round(varhat_HC0_v2,6)
            (Intercept)    height      male      wexp    tenure
(Intercept)    0.281450 -0.004322  0.024655 -0.000212  0.000364
height        -0.004322  0.000069 -0.000390 -0.000004 -0.000005
male           0.024655 -0.000390  0.004595 -0.000040  0.000033
wexp          -0.000212 -0.000004 -0.000040  0.000033 -0.000006
tenure         0.000364 -0.000005  0.000033 -0.000006  0.000013

The heteroskedasticity-robust standard errors are the square roots of the diagonal of this matrix. The table below presents the heteroskedasticity robust t-stats and recomputes the corresponding p-values.

sebhat  <- sqrt(diag(varhat_HC0)) # se(bhat)
tbhat   <- bhat/sebhat
pval    <- 2*(1-pt(abs(tbhat),N-K))
star    <- rep("", K)
star[pval<0.1] <- "."
star[pval<0.05] <- "*"
star[pval<0.01] <- "**"
star[pval<0.001] <- "***"
Results <- data.frame("Estimate"=bhat, "HC0 Std. Err."=sebhat, "t-val"=tbhat, "p-val"=pval, star)
names(Results)[K] <- ""
print(Results, digits=6, right=FALSE)
cat("---\nSignif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n")
       Estimate   HC0.Std..Err. t.val   p.val          
const  0.98423994 0.53051905    1.85524 0.064111865 .  
height 0.02271202 0.00828080    2.74273 0.006297216 ** 
male   0.17052245 0.06778799    2.51553 0.012177295 *  
wexp   0.00541464 0.00574455    0.94257 0.346326070    
tenure 0.01335911 0.00365123    3.65880 0.000278498 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Comparing these results with the previous ones, we find that the heteroskedasticity-robust standard errors are not very different from those estimated using the usual OLS formulas under conditional homoskedasticity. This suggests that heteroskedasticity is not a significant issue in this regression. Nonetheless, for illustration purposes we use the heteroskedasticity-robust t- and F-tests to test the hypotheses in Example 9.6. In the code below, we compute the robust F-test manually, as well as using the linearHypothesis() function from the car package.

Fstat <- t(Rb) %*% solve(R%*%varhat_HC0%*%t(R)) %*% Rb / J
Fpval <- (1-pf(Fstat,J,N-K))
cat("H0 height = 0 and wexp = tenure: F-stat ",round(Fstat,4),", pval", round(Fpval,6))
H0 height = 0 and wexp = tenure: F-stat  4.3413 , pval 0.013481
linearHypothesis(mdl,c('height=0','wexp=tenure'), vcov=varhat_HC0)
Linear hypothesis test

Hypothesis:
height = 0
wexp - tenure = 0

Model 1: restricted model
Model 2: log(earnings) ~ height + male + wexp + tenure

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)  
1    537                    
2    535  2 4.3413 0.01348 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.7 Exercises

Exercise 9.1 Find expressions for \(var[\hat{\beta}_0 | X]\) and \(cov[\hat{\beta}_0, \hat{\beta}_1 | X]\) in the simple linear regression \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \, , \; i=1,2,...,N \] where \(E[\epsilon_i | X] = 0\) and \(E[\epsilon_i^2 | X] = \sigma^2\) for all \(i=1,2,...,N\), and \(E[\epsilon_i \epsilon_j | X] = 0\) for all \(i \neq j\), \(i,j=1,2,...,N\). What is the sign of \(cov[\hat{\beta}_0, \hat{\beta}_1 | X]\)?

Exercise 9.2 Let \(X\) be an \((N \times K)\) matrix with full column rank. Show that \[ M = I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T} \] is symmetric and idempotent, with rank \(N-K\).

Exercise 9.3 Show that the residuals from the regression \(Y_i = \beta_0 + \epsilon_i\), \(i=1,2,...,N\) can be written as \[ \hat{\varepsilon} = M_0y \] where \[ M_0 = I - i_N(i_N^\mathrm{T}i_N)^{-1}i_N^\mathrm{T}. \] Show by direct computation that \(M_0y\) is the vector of deviations from means, i.e., \[ M_0y = (I-i_N(i_N^\mathrm{T}i_N)^{-1}i_N^\mathrm{T})y = \begin{bmatrix} Y_1 - \overline{Y} \\ Y_2 - \overline{Y} \\ \vdots \\ Y_N - \overline{Y} \end{bmatrix}. \] Show that \(y^\mathrm{T}M_0y = \sum_{i=1}^N(Y_i - \overline{Y})^2\)

Exercise 9.4 Show that \[ y^\mathrm{T}M_0y = \hat{y}^\mathrm{T}M_0\hat{y} + \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}. \] This is the \(SST=SSR+SSE\) equality that forms the basis of the \(R^2\).

Exercise 9.5 Consider the regression \(E^i = X\beta + \varepsilon\), where \(E^i\) is the \((N \times 1)\) vector comprising all zeros except for a ‘1’ in the \(i\mathrm{th}\) position. Let the matrix \(X\) be \((N \times K)\) with full column rank.

  1. Show that the vector of fitted values \(\widehat{E^i}\) has the expression \[ \widehat{E^i} = X(X^\mathrm{T}X)^{-1}x_i \] where \(x_i^\mathrm{T}\) is the \(i\mathrm{th}\) row of the \(X\) matrix.

  2. Define the ‘leverage’ of observation \(i\) to be \[ h_{i} = x_i^\mathrm{T}(X^\mathrm{T}X)^{-1}x_i. \] Show that \(0 \leq h_{i} \leq 1\). Hint: Use part (a) and the “Pythagoras’s Theorem” result in (9.7).

  3. Explain why \(\sum_{i=1}^N h_i\) is the trace of the matrix \(P = X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\). Show that \(\sum_{i=1}^N h_{i} = K\). (In other words, the “average value” of \(h_{i}\) is \(K/N\).)

Remark: it can be shown that \[ \hat{\beta} - \hat{\beta}_{-i} = \left( \frac{1}{1-h_i}\right)(X^\mathrm{T}X)^{-1}x_i\hat{\epsilon}_i \] where \(\hat{\beta}_{-i}\) is the OLS estimator obtained when observation \(i\) is left out. An observation with \(h_i\) close to 1 therefore has very high leverage and is said to be “influential”.

Exercise 9.6 Consider the linear regression \(y = X\beta + \varepsilon\) where \(E[\varepsilon|X]=0\) and \(E[\varepsilon \varepsilon^\mathrm{T}|X] = \sigma^2I\). The fact that \(\widetilde{\sigma^2} = \frac{1}{N}\sum_{i=1}^N \hat{\epsilon}_i^2\) is biased implies that each individual \(\hat{\varepsilon}_i^2\) must in general be a biased estimator for \(\sigma^2\). Show that \[ E[\hat{\epsilon}_i^2] = (1-h_{i})\sigma^2. \] Hint: use \(\hat{\epsilon}_i^2 = (E^i)^\mathrm{T}\hat{\varepsilon}\hat{\varepsilon}^\mathrm{T}E^i\) and \(\hat{\varepsilon} = M\varepsilon\) where \(M = I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\).

Exercise 9.7 The “HC0” version of the heteroskedasticity-consistent standard errors given in (9.18) is sometimes criticized for not taking into consideration the fact that \(K\) degrees of freedom are used in the computation of \(\hat{\epsilon_i}\). Another version proposes to take this into account by estimating \(S\) with \[ \hat{S}_1 = \frac{1}{N-K}\sum_{i=1}^N \hat{\epsilon}_i^2x_ix_i^\mathrm{T} \] which also consistently estimates \(S\).

  1. Show that using \(\hat{S}_1\) instead of \(\hat{S}\) to estimate \(S\) results in the heteroskedasticity-consistent variance estimator \[ \begin{aligned} \widehat{\text{var}}_{HC1}[\hat{\beta}] &= \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \left(\frac{N}{N-K}\sum_{i=1}^N \hat{\epsilon}_i^2x_ix_i^\mathrm{T}\right) \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \,. \end{aligned} \tag{9.19}\] Amend the code in Example 9.7 to use the HC1 version of the variance estimator, and verify your results using the vcovHC() function.

  2. Another version, based on the result in Exercise 9.6, is
    \[ \begin{aligned} \widehat{\text{var}}_{HC2}[\hat{\beta}] &= \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \left(\sum_{i=1}^N \frac{\hat{\epsilon}_i^2}{1-h_i}x_ix_i^\mathrm{T}\right) \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \,. \end{aligned} \tag{9.20}\] Amend the code in Example 9.7 to use the HC2 version of the variance estimator, and verify your results using the vcovHC() function.

  3. The result in Exercise 9.6, of course, assumes conditional homoskedasticity. Yet another version is
    \[ \begin{aligned} \widehat{\text{var}}_{HC3}[\hat{\beta}] &= \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \left(\sum_{i=1}^N \frac{\hat{\epsilon}_i^2}{(1-h_i)^2}x_ix_i^\mathrm{T}\right) \left(\sum_{i=1}^N x_ix_i^\mathrm{T}\right)^{-1} \,. \end{aligned} \tag{9.21}\] This version puts more weight on observations that are more influential. Amend the code in Example 9.7 to use the HC3 version of the variance estimator, and verify your results using the vcovHC() function.

Exercise 9.8 Consider the linear regression model under Assumption Set E. Show, by application of the Law of Iterated Expectations, that if the noise terms are conditionally homoskedastic, i.e., if \(E[\epsilon_i^2|x_i] = \sigma^2\), then \[ S = E[\epsilon_i^2x_ix_i^\mathrm{T}] = \sigma^2 \Sigma_{XX}. \] and that the asymptotic variance of \(\hat{\beta}\) then becomes \[ \text{avar}[\hat{\beta}] = \sigma^2\Sigma_{XX}^{-1}. \] Remark: this result is in accordance with the fact that under conditional homoskedasticity, \(var[\hat{\beta}|X] = \sigma^2(X^\mathrm{T}X)^{-1}\).

Exercise 9.9 Suppose we have \(y = X\beta + \varepsilon\) with \(E[\varepsilon|X] = 0\). Suppose \(var[\varepsilon|X]\) is possibly heteroskedastic and correlated, but known up to some parameter \(\sigma^2\), i.e., \(var[\varepsilon|X] = \sigma^2\Omega\), where \(\Omega\) is known (it may involve elements of \(X\), but with no unknown parameters). Of course, \(\Omega\) must be symmetric and positive definite. For example, we may have \[ var[\varepsilon|X] = \sigma^2\Omega_1 = \sigma^2\begin{bmatrix} 1/X_{k,1}^2 & 0 & 0 & \dots & 0 \\ 0 & 1/X_{k,2}^2 & 0 & \dots & 0 \\ 0 & 0 & 1/X_{k,3}^2 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1/X_{k,N}^2 \end{bmatrix} \] where \(X_{k,i}\) is the \(i\mathrm{th}\) observation of variable \(X_k\). Another example is \[ var[\varepsilon|X] = \sigma^2\Omega_2 = \frac{\sigma^2}{1-0.9^2}\begin{bmatrix} 1 & 0.9 & 0.9^2 & \dots & 0.9^{N-1} \\ 0.9 & 1 & 0.9 & \dots & 0.9^{N-2} \\ 0.9^2 & 0.9 & 1 & \dots & 0.9^{N-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.9^{N-1} & 0.9^{N-2} & 0.9^{N-3} & \dots & 1 \end{bmatrix} \] Since \(\Omega\) is symmetric and positive definite, we can find \(P\) such that \[ P\Omega P^\mathrm{T} = I \]

  1. Find \(P\) such that \(P\Omega_1P^\mathrm{T} = I\).

  2. Verify that \(P \Omega_2 P^\mathrm{T} = I\) where \[ P = \begin{bmatrix} \sqrt{1-0.9^2} & 0 & 0 & \dots & 0 & 0 \\ -0.9 & 1 & 0 & \dots & 0 & 0 \\ 0 & -0.9 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \dots & -0.9 & 1 \end{bmatrix} \]

Remark: Transforming the regression \(y = X\beta + \varepsilon\), \(E[\varepsilon|X] = 0\), \(var[\varepsilon|X] = \sigma^2\Omega\), by premultiplying by \(P\), where \(P\) is the matrix such that \(P\Omega P^\mathrm{T} = I\), produces the regression \[ Py = PX\beta + P\varepsilon \] \[ \text{or} \qquad y^* = X^*\beta + \varepsilon^* \quad \quad \] where \(y^{*} = Py\), \(X^* = PX\) and \(\varepsilon^* = P\varepsilon\). Note that \(var[\varepsilon^*|X] = \sigma^2 P\Omega P^\mathrm{T} = \sigma^2 I\), so the transformed regression satisfies the assumptions of the basic model. Let \(\hat{\beta}^{gls}\) be the estimator of \(\beta\) obtained by applying OLS to the transformed regression. This is called the “Generalized Least Squares”, or GLS, estimator for \(\beta\). In general, \(\Omega\) will contain parameters that must (somehow) be estimated. Methods where \(\Omega\) is replaced by some estimated \(\hat{\Omega}\) are called “Feasible Generalized Least Squares” or FGLS.
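
A minimal numerical sketch of the GLS idea in the Remark (the data and the particular \(\Omega\) below are arbitrary and purely illustrative): one convenient choice of \(P\) is \(L^{-1}\), where \(L\) is the lower-triangular Cholesky factor of \(\Omega\) (so \(\Omega = LL^\mathrm{T}\) and \(P\Omega P^\mathrm{T}=I\)). Since \(P^\mathrm{T}P = \Omega^{-1}\), applying OLS to the transformed data reproduces the formula \(\hat{\beta}^{gls} = (X^\mathrm{T}\Omega^{-1}X)^{-1}X^\mathrm{T}\Omega^{-1}y\).

set.seed(7)
n  <- 50
Xg <- cbind(1, rnorm(n))                           # illustrative regressors: intercept and one variable
A  <- matrix(rnorm(n*n), n, n)
Omega <- crossprod(A) + diag(n)                    # an arbitrary symmetric positive definite Omega
L  <- t(chol(Omega))                               # lower Cholesky factor: Omega = L L'
P  <- solve(L)                                     # P Omega P' = I
eg <- as.vector(L %*% rnorm(n))                    # noise with variance matrix proportional to Omega
yg <- Xg %*% c(1, 2) + eg
ystar <- P %*% yg; Xstar <- P %*% Xg               # transformed regression y* = X* beta + eps*
b_gls  <- solve(t(Xstar)%*%Xstar, t(Xstar)%*%ystar)                     # OLS on the transformed data
b_gls2 <- solve(t(Xg)%*%solve(Omega)%*%Xg, t(Xg)%*%solve(Omega)%*%yg)   # (X' Omega^-1 X)^-1 X' Omega^-1 y
cbind(b_gls, b_gls2)                               # identical up to rounding error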

Exercise 9.10  

  1. Show that the \(F\)-statistic for testing a single restriction, i.e., when \(J=1\), is equal to the square of the \(t\)-statistic for testing the same hypothesis. (Hint: compare (9.12) and (9.15) after setting \(R=r^\mathrm{T}\) in the \(F\)-statistic.)

  2. Suppose you have the regressions
    \[ \begin{aligned} &\text{[A]} \qquad Y_i = \beta_0 + \beta_1X_{1,i} + \dots + \beta_{K-1}X_{K-1,i} + \epsilon_i \; , \\ &\text{[B]} \qquad Y_i = \beta_0 + \beta_1X_{1,i} + \dots + \beta_{K-1}X_{K-1,i} + \beta_{K}X_{K,i} + \epsilon_i \;. \end{aligned} \] One way of deciding whether or not to include the variable \(X_{K,i}\) in the regression is to do a t-test (or an F-test) of the hypothesis \(\beta_K=0\). Another way is to see whether or not the Adjusted-\(R^2\) in [B] is greater than the Adjusted-\(R^2\) in [A]. Show that the latter method is equivalent to including \(X_{K,i}\) if the absolute value of the t-statistic for \(\hat{\beta}_K\) in [B] is greater than 1. Hint: use the version of the F-statistic in (9.13).

Exercise 9.11 Suppose \(Y_i = x_i^\mathrm{T}\beta + \epsilon_i\), \(i=1,2,...,N\), with \(E[\varepsilon|X] = 0\) and \(E[\varepsilon \varepsilon^\mathrm{T} | X] = \sigma^2 I\). Let \(\hat{\beta}\) be the OLS estimator for \(\beta\). Suppose we predict \(Y\) at \(x = x_0\) using \(\hat{Y}(x_0)=x_0^\mathrm{T}\hat{\beta}\). The prediction error is \[ \hat{e}(x_0)=Y(x_0)-\hat{Y}(x_0)=x_0^\mathrm{T}\beta + \epsilon_0 - x_0^\mathrm{T}\hat{\beta} = x_0^\mathrm{T}(\beta-\hat{\beta})+\epsilon_0. \]

  1. Derive an expression for the prediction error variance.

  2. Specialize your answer in part 1 to the simple linear regression \(Y_i = \beta_0 + \beta_1X_{1,i} + \epsilon_i\), predicting \(Y\) at \(X_{1}= x_1^0\). Show that the prediction error variance is \[ var[\hat{e}(x_1^0)] = \sigma^2\left(1+ \frac{1}{N} + \frac{(x_1^0-\overline{X}_1)^2}{\sum_{i=1}^N(X_{1,i} - \overline{X}_1)^2}\right). \]

9.8 Appendix

In this appendix, we prove the equality (9.14). Let the regression model be \(y=X\beta+\varepsilon\), and let \(\hat{\beta}^r\) be the least squares estimator for \(\beta\) subject to the restriction that \(R\beta=r_0\). We first show that \[ \hat{\beta}^r = \hat{\beta}^{ols} + (X^\mathrm{T}X)^{-1}R^\mathrm{T}[R(X^\mathrm{T}X)^{-1}R^\mathrm{T}]^{-1}(r_0-R\hat{\beta}^{ols}) \] where \(\hat{\beta}^{ols}\) is the usual unrestricted OLS estimator. The restricted SSR minimization problem is \[ \hat{\beta}^r = \text{argmin}_{\hat{\beta}}(y-X\hat{\beta})^\mathrm{T}(y-X\hat{\beta}) \; \text{subject to} \; R\hat{\beta}-r_0=0. \] The Lagrangian is \[ L= (y-X\hat{\beta})^\mathrm{T}(y-X\hat{\beta}) + 2(r_0^\mathrm{T} - \hat{\beta}^\mathrm{T} R^\mathrm{T})\lambda\,. \] The FOC is \[ \begin{aligned} \left.\frac{\partial L}{\partial \hat{\beta}}\right|_{\hat{\beta}^r,\hat{\lambda}} &= -2X^\mathrm{T}y + 2X^\mathrm{T}X\hat{\beta}^r-2R^\mathrm{T}\hat{\lambda}=0 \\ \left.\frac{\partial L}{\partial \hat{\lambda}}\right|_{\hat{\beta}^r,\hat{\lambda}} &= \phantom{-}2(r_0-R\hat{\beta}^r)=0 \\ \end{aligned} \] The second equation in the FOC merely says that the restriction must hold. The first equation in the FOC implies \[ \hat{\beta}^r = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y + (X^\mathrm{T}X)^{-1} R^\mathrm{T} \hat{\lambda} = \hat{\beta}^{ols} + (X^\mathrm{T}X)^{-1} R^\mathrm{T} \hat{\lambda}\,. \] Premultiplying by \(R\) gives \[ R\hat{\beta}^r = R\hat{\beta}^{ols} + R(X^\mathrm{T}X)^{-1} R^\mathrm{T} \hat{\lambda}\,. \] It follows that \[ \begin{aligned} \hat{\lambda} &= [R(X^\mathrm{T}X)^{-1} R^\mathrm{T}]^{-1} (R\hat{\beta}^r- R\hat{\beta}^{ols} ) \\ &= [R(X^\mathrm{T}X)^{-1} R^\mathrm{T}]^{-1} (r_0- R\hat{\beta}^{ols} )\,, \end{aligned} \] and therefore \[ \hat{\beta}^r=\hat{\beta}^{ols} + (X^\mathrm{T}X)^{-1}R^\mathrm{T}[R(X^\mathrm{T}X)^{-1}R^\mathrm{T}]^{-1}(r_0-R\hat{\beta}^{ols})\,. \tag{9.22}\] Now let \[ \begin{aligned} \hat{\varepsilon}_r &= y - X\hat{\beta}^r \\ &= y - X\hat{\beta}^{ols} + X\hat{\beta}^{ols} - X\hat{\beta}^r \\ &= \hat{\varepsilon}_{ols} + X(\hat{\beta}^{ols} - \hat{\beta}^r)\,. \end{aligned} \tag{9.23}\] Since (unrestricted) OLS residuals are orthogonal to the regressors, we have \[ \hat{\varepsilon}_{ols}^\mathrm{T} \hat{\varepsilon}_r = \hat{\varepsilon}_{ols}^\mathrm{T} \hat{\varepsilon}_{ols} + \hat{\varepsilon}_{ols}^\mathrm{T}X(\hat{\beta}^{ols} - \hat{\beta}^r) = \hat{\varepsilon}_{ols}^\mathrm{T} \hat{\varepsilon}_{ols}. \] Therefore \[ (\hat{\varepsilon}_r - \hat{\varepsilon}_{ols})^\mathrm{T}(\hat{\varepsilon}_r - \hat{\varepsilon}_{ols}) = \hat{\varepsilon}_r^\mathrm{T}\hat{\varepsilon}_r - \hat{\varepsilon}_{ols}^\mathrm{T}\hat{\varepsilon}_{ols}. \tag{9.24}\] Finally, (9.23) gives \(\hat{\varepsilon}_r - \hat{\varepsilon}_{ols} = X(\hat{\beta}^{ols} - \hat{\beta}^r)\), so \((\hat{\varepsilon}_r - \hat{\varepsilon}_{ols})^\mathrm{T}(\hat{\varepsilon}_r - \hat{\varepsilon}_{ols}) = (\hat{\beta}^{ols} - \hat{\beta}^r)^\mathrm{T}X^\mathrm{T}X(\hat{\beta}^{ols} - \hat{\beta}^r)\). Substituting \(\hat{\beta}^{ols} - \hat{\beta}^r = -(X^\mathrm{T}X)^{-1}R^\mathrm{T}[R(X^\mathrm{T}X)^{-1}R^\mathrm{T}]^{-1}(r_0-R\hat{\beta}^{ols})\) from (9.22) gives \[ (\hat{\varepsilon}_r - \hat{\varepsilon}_{ols})^\mathrm{T}(\hat{\varepsilon}_r - \hat{\varepsilon}_{ols}) = (R\hat{\beta}^{ols}-r_0)^\mathrm{T}[R(X^\mathrm{T}X)^{-1}R^\mathrm{T}]^{-1}(R\hat{\beta}^{ols}-r_0), \] which, combined with (9.24), is exactly (9.14).
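
As a numerical check on this algebra, the sketch below reuses the objects already computed for the earnings regression in Examples 9.6 and 9.7 (R, r0, bhat, XTXinv, X, y, ehat) to evaluate the restricted estimator (9.22) and confirm the identity (9.14); the difference in sums of squared residuals matches the value (about 2.6769) reported by linearHypothesis() in Example 9.6.

bres <- bhat + XTXinv %*% t(R) %*% solve(R %*% XTXinv %*% t(R), r0 - R %*% bhat)   # restricted estimator (9.22)
eres <- y - X %*% bres                                                             # restricted residuals
sum(eres^2) - sum(ehat^2)                                                          # SSR_r - SSR
t(R %*% bhat - r0) %*% solve(R %*% XTXinv %*% t(R), R %*% bhat - r0)               # right-hand side of (9.14): same value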