The mathematics of least squares is best expressed in matrix form. Proofs of results are much more concise and more general, and we can draw on the insights of linear algebra to understand least squares at a deeper level. Furthermore, the same mathematics applies to a vast number of advanced linear models including multiple equation models, and is useful even for non-linear ones, so it is well worth the time and effort to master the mathematics of least squares estimation expressed using matrix algebra.
6.1 The Setup
Suppose you are interested in estimating the conditional expectation \(E(Y \mid X_1, \dots, X_{K-1})\) which is assumed to have the form \[
E(Y \mid X_1, \dots, X_{K-1}) = \beta_0 + \beta_1 X_{1}+ \beta_2 X_{2} + ... + \beta_{K-1} X_{K-1}\,.
\] Here \(X_1, \dots, X_{K-1}\) represent \(K-1\) different variables, not observations of a single variable. We define the noise term \(\epsilon\) as \[
\epsilon = Y - \beta_0 - \beta_1 X_{1} - \beta_2 X_{2} - ... - \beta_{K-1} X_{K-1}
\] so we can write \[
Y = \beta_0 + \beta_1 X_{1}+ \beta_2 X_{2} + ... + \beta_{K-1} X_{K-1} + \epsilon\,,\,E(\epsilon \mid X_1, \dots, X_{K-1})=0\,.
\] Suppose you have a representative iid sample \(\{Y_i,X_{i1},X_{i2},\dots,X_{i,K-1}\}_{i=1}^n\) from this population, where \(X_{ik}\) denotes the \(i\mathrm{th}\) observation of variable \(X_k\). Then we can write \[
Y_i = \beta_0 + \beta_1 X_{i1}+ \beta_2 X_{i2} + ... + \beta_{K-1} X_{i,K-1} + \epsilon_i \, , \; i=1,2,...,n.
\tag{6.1}\] We can also write (6.1) in matrix form as \[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & \dots & X_{1,K-1} \\
1 & X_{21} & X_{22} & \dots & X_{2,K-1} \\
\vdots & \vdots & \vdots &\ddots & \vdots \\
1 & X_{n1} & X_{n2} & \dots & X_{n,K-1}
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{K-1}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{bmatrix}
\tag{6.2}\] or simply \[
y = X\beta + \varepsilon
\tag{6.3}\] where \(y\) is the \(n \times 1\) vector \(\begin{bmatrix}Y_1 & Y_2 & \dots & Y_n\end{bmatrix}^\mathrm{T}\), \(X\) is the \(n \times K\) matrix of regressors, \(\beta\) is the \(K \times 1\) coefficient vector, and \(\varepsilon\) is the \(n \times 1\) vector of noise terms. We assume that \(n>K\).
We can partition the regressor matrix by observation: \[
X =
\left[\begin{array}{ccccc}
1 & X_{11} & X_{12} & \dots & X_{1, K-1} \\
\hline
1 & X_{21} & X_{22} & \dots & X_{2,K-1} \\
\hline
\vdots& \vdots & \vdots & \ddots & \vdots\\
\hline
1 & X_{n1} & X_{n2} & \dots & X_{n,K-1}
\end{array}\right]
=
\begin{bmatrix}
X_{1*} \\
X_{2*} \\
\vdots \\
X_{n*} \\
\end{bmatrix}
\tag{6.4}\] and write the model as \[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
X_{1*} \\
X_{2*} \\
\vdots \\
X_{n*} \\
\end{bmatrix}
\beta
+
\begin{bmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{bmatrix}
\] or \[
Y_i = X_{i*}\beta + \epsilon_i \, , \; i=1,2,...,n.
\] It is sometimes helpful to partition the regressor matrix by variable: \[
X =
\left[\begin{array}{c|c|c|c|c}
1 & X_{11} & X_{12} & \dots & X_{1,K-1} \\
1 & X_{21} & X_{22} & \dots & X_{2,K-1} \\
\vdots& \vdots & \vdots & \ddots & \vdots\\
1 & X_{n1} & X_{n2} & \dots & X_{n,K-1}
\end{array}\right]
=
\begin{bmatrix}
i_n & X_{*1} & X_{*2} & \dots & X_{*,K-1}
\end{bmatrix}
\tag{6.5}\] where \(i_n\) is the \(n \times 1\) vector of ones, and \(X_{*k}\) is the vector of observations of variable \(X_k\). Feasibility of OLS estimation will require \(X\) to have full column rank, i.e., \[
Xc = c_0 i_n + c_1X_{*1} + c_2X_{*2} + \dots + c_{K-1}X_{*, K-1} = 0_{n \times 1} \;\Longleftrightarrow\; c = 0_{K \times 1}
\] where \(c = \begin{bmatrix} c_0 & \dots & c_{K-1} \end{bmatrix}^\mathrm{T}\). In other words, we must assume that there is variation in each of the variables (apart from the constant vector), and that no one variable can be written as a linear combination of the other variables. The full column rank assumption implies that \(X^\mathrm{T}X\) is non-singular (i.e., has an inverse).
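A quick numerical check of this condition (a minimal sketch in R with simulated data; the variable names are illustrative) uses the QR decomposition of \(X\):

```r
# Minimal sketch (simulated data, illustrative only): checking the full column rank condition
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- runif(n)
X  <- cbind(1, x1, x2)                 # intercept plus two regressors
qr(X)$rank == ncol(X)                  # TRUE: full column rank, so t(X) %*% X is invertible

X_bad <- cbind(1, x1, 2 - 3 * x1)      # third column is a linear combination of the first two
qr(X_bad)$rank                         # 2 < 3: rank deficient, so t(X_bad) %*% X_bad is singular
```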
Since \(E(\epsilon \mid X_1, \dots, X_{K-1}) = 0\) in population, and we have an iid representative sample from the population, we can assume that: \[
E(\epsilon_i \mid X_{*1}, X_{*2}, \dots, X_{*,K-1})=0\;\text{ for all }\; i=1,2, \dots, n
\] and \[
\begin{aligned}
\mathit{Cov}(\epsilon_i, \epsilon_j \mid X_{*1}, X_{*2}, \dots, X_{*,K-1}) = 0
\quad &\text{for all} \quad i \neq j\,,\, i,j=1,2,...,n.
\end{aligned}
\] We continue to assume homoskedastic errors \[
\begin{aligned}
\mathit{Var}(\epsilon_i \mid X_{*1}, X_{*2}, \dots, X_{*,K-1}) = \sigma^2
\quad &\text{for all} \quad i = 1,2,...,n,\\
\end{aligned}
\] In matrix form, we can write these assumptions even more simply as \[
E(\varepsilon \mid X) = 0_{n \times 1} \;\text{ and }\;
\mathit{Var}(\varepsilon \mid X) =
E(\varepsilon \varepsilon^\mathrm{T} \mid X) = \sigma^2 I_{n}
\] Remember that \(\mathit{Var}(\varepsilon \mid X)\) is the conditional variance-covariance matrix of \(\varepsilon\): \[
\mathit{Var}(\varepsilon \mid X)
=
\begin{bmatrix}
\mathit{Var}(\varepsilon_1 \mid X) & \mathit{Cov}(\varepsilon_1, \varepsilon_2 \mid X) & \dots & \mathit{Cov}(\varepsilon_1, \varepsilon_n \mid X) \\
\mathit{Cov}(\varepsilon_2, \varepsilon_1 \mid X) & \mathit{Var}(\varepsilon_2 \mid X) & \dots & \mathit{Cov}(\varepsilon_2, \varepsilon_n \mid X) \\
\vdots & \vdots & \ddots & \vdots \\
\mathit{Cov}(\varepsilon_n, \varepsilon_1 \mid X) & \mathit{Cov}(\varepsilon_n, \varepsilon_2 \mid X) & \dots & \mathit{Var}(\varepsilon_n \mid X)
\end{bmatrix}.
\] It is equal to \(E(\varepsilon \varepsilon^\mathrm{T} \mid X)\) because \(E(\varepsilon \mid X) = 0\). We have \[
E(\varepsilon \varepsilon^\mathrm{T} \mid X)
=
\begin{bmatrix}
\sigma^2 & 0 & \dots & 0 \\
0 & \sigma^2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma^2
\end{bmatrix}
= \sigma^2 I
\] because of our assumption of iid samples and homoskedasticity. If the errors are conditionally heteroskedastic with \(\mathit{Var}(\epsilon_i \mid X) = \sigma_i^2\) but uncorrelated, then the conditional variance-covariance matrix of \(\varepsilon\) becomes \[
E(\varepsilon \varepsilon^\mathrm{T} \mid X)
=
\begin{bmatrix}
\sigma_1^2 & 0 & \dots & 0 \\
0 & \sigma_2^2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma_n^2
\end{bmatrix}
= \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2).
\] If there is correlation between the errors, then the var-cov matrix of \(\varepsilon\) will no longer be diagonal.
6.2 Ordinary Least Squares
Let \(\hat{\beta}\) denote some estimator for \(\beta\). Then the fitted values associated with this estimator are \[
\underset{n \times 1}{\hat{y}} = \underset{n \times K}{\phantom{\beta}X\phantom{\beta}} \;\underset{K \times 1}{\hat{\beta}}
\] and the residuals are \[
\underset{n \times 1}{\hat{\varepsilon}} = \underset{n \times 1}{y} - \underset{n \times 1}{\hat{y}} = \underset{n \times 1}{y} - \underset{n \times K}{\phantom{\beta}X\phantom{\beta}}\underset{K \times 1}{\hat{\beta}}.
\] The residual sum of squares is then \[
\begin{aligned}
\underset{1 \times 1}{RSS}
=
\underset{1 \times n}{\phantom{i^I}\hat{\varepsilon}^\mathrm{T}}\;\underset{n \times 1}{\hat{\varepsilon}}
&=
\underset{---1 \times n---}{(y - X\hat{\beta})^\mathrm{T}}\;\;\underset{--n \times 1--}{(y - X\hat{\beta})} \\[2ex]
&=
\underset{1 \times n}{\phantom{i}y^\mathrm{T}}\;\underset{n \times 1}{y}
- \underset{1 \times K}{\hat{\beta}^\mathrm{T}} \;\underset{K \times n}{\phantom{\beta}X^\mathrm{T}} \;\underset{n \times 1}{y}
- \underset{1 \times n}{y^\mathrm{T}} \; \underset{n \times K}{\phantom{y}X\phantom{y}}\;\underset{K \times 1}{\hat\beta}
+
\underset{1 \times K}{\hat{\beta}^\mathrm{T}}\;\underset{K \times n}{\phantom{y}X^\mathrm{T}}\;\underset{n \times K}{\phantom{\beta}X\phantom{\beta}}\underset{K \times 1}{\hat{\beta}} \\[2ex]
&=
\underset{1 \times 1}{y^\mathrm{T}y}
-
2\underset{--1 \times 1--}{\hat{\beta}^\mathrm{T}X^\mathrm{T}y}
+
\underset{---1 \times 1---}{\hat{\beta}^\mathrm{T}X^\mathrm{T}X\hat{\beta}}
\end{aligned}
\] where we have used the fact that \(\hat{\beta}^\mathrm{T}X^\mathrm{T}y\) is the transpose of \(y^\mathrm{T}X\hat{\beta}\), and the transpose of a scalar is the scalar itself. I have indicated the dimensions of the matrices in the expressions above to help you read the matrix equations. I will not do so from this point, but I strongly recommend that you continue this practice until you feel comfortable reading matrix expressions.
The OLS estimators are those that minimize RSS: \[
\hat{\beta}^{ols} = \text{argmin}_{\hat{\beta}} RSS\,.
\] The first-order conditions are \[
\left. \frac{\partial RSS}{\partial \hat{\beta}} \right |_{\hat{\beta}^{ols}}
= - 2X^\mathrm{T}y + 2X^\mathrm{T}X\hat{\beta}^{ols} = 0.
\tag{6.6}\] This implies \[
X^\mathrm{T}X\hat{\beta}^{ols} = X^\mathrm{T}y
\] which, given our assumption that \(X\) is full column rank, can be solved for \(\hat{\beta}^{ols}\): \[
\hat{\beta}^{ols} = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y.
\tag{6.7}\] The matrix of second partial derivatives of the RSS is positive definite: \[
\frac{\partial^2 RSS}{\partial \hat{\beta} \partial \hat{\beta}^\mathrm{T}}
= 2X^\mathrm{T}X
\] since \(X\) is full column rank, which means that \(Xc \neq 0\) for all \(c \neq 0\), and therefore \[
c^\mathrm{T}X^\mathrm{T}Xc = (Xc)^\mathrm{T}Xc > 0\,.
\] This guarantees that (6.7) solves the minimization problem.
The FOC can also be written as \[
X^\mathrm{T}(y - X\hat{\beta}^{ols}) = X^\mathrm{T}\hat{\varepsilon}^{ols}= 0.
\tag{6.8}\] Partitioning \(X^\mathrm{T}\) “by variable” as in (6.5), we can see that (6.8) says that OLS residuals sum to zero, and are orthogonal to each of the regressors: \[
X^\mathrm{T}\hat{\varepsilon}^{ols}
= \begin{bmatrix}
i_n^\mathrm{T} \\
X_{*1}^\mathrm{T} \\
X_{*2}^\mathrm{T} \\
\vdots \\
X_{*,K-1}^\mathrm{T}
\end{bmatrix}\hat{\varepsilon}^{ols}
=
\begin{bmatrix}
i_n^\mathrm{T}\hat{\varepsilon}^{ols} \\
X_{*1}^\mathrm{T} \hat{\varepsilon}^{ols} \\
X_{*2}^\mathrm{T} \hat{\varepsilon}^{ols} \\
\vdots \\
X_{*,K-1}^\mathrm{T}\hat{\varepsilon}^{ols}
\end{bmatrix}
= 0_{K \times 1}\,.
\] In other words, we have \[
\sum_{i=1}^n \widehat{\varepsilon}_{i}^{ols} = 0 \;\text{ and }\;
\sum_{i=1}^n X_{ik}\widehat{\varepsilon}_{i}^{ols} = 0 \;\text{ for all }\; k=1, \dots, K-1\,.
\tag{6.9}\] We can also view our estimators as arising from a “method of moments” perspective. The assumption that \(E(\epsilon \mid X_1, \dots, X_{K-1}) = 0\) implies \(E(\epsilon) = 0\) and \(E(X_k \epsilon) = 0\) for all \(k = 1, \dots, K-1\). By choosing our estimators to solve (6.9), we are choosing them to satisfy the sample moment conditions corresponding to these population moments.
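The following sketch (simulated data, not one of the text's datasets) computes \(\hat{\beta}^{ols}\) directly from (6.7) and verifies the orthogonality conditions (6.9); the coefficients from lm() are shown only as a cross-check:

```r
# Minimal sketch with simulated data: OLS via (6.7) and the orthogonality conditions (6.9)
set.seed(42)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y, solved without explicit inversion
res      <- y - X %*% beta_hat              # residuals

t(X) %*% res                                # ~ 0 (K x 1): residuals sum to zero and are
                                            # orthogonal to each regressor
coef(lm(y ~ x1 + x2))                       # matches beta_hat
```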
6.3 Algebraic Properties of OLS Estimators
We list here some algebraic properties of OLS estimators. From this point onwards, we drop the ‘OLS’ superscript from the OLS estimators, fitted values and residuals, and write \(\hat{\beta}\), \(\hat{Y}\) and \(\hat{\varepsilon}\) for \(\hat{\beta}^{ols}\), \(\hat{Y}^{ols}\) and \(\hat{\varepsilon}^{ols}\). We will reinstate the superscript whenever the context demands it.
OLS estimators are linear estimators, by which we mean: \[
\hat{\beta} = (X^\mathrm{T}X)^{-1}X^\mathrm{T}y = Ay\,.
\] This says that each OLS estimator \(\hat{\beta}_k\), \(k=0,1,...,K-1\) can be written as \[
\hat{\beta}_k = \sum_{i=1}^n a_{ki}Y_i
\] where \(a_{ki}\,,\, i=1,2,...,n\) are the elements of the \(k\mathrm{th}\) row of \(A = (X^\mathrm{T}X)^{-1}X^\mathrm{T}\).
The OLS estimators can also be written as \[
\hat{\beta}
= (X^\mathrm{T}X)^{-1}X^\mathrm{T}y
= (X^\mathrm{T}X)^{-1}X^\mathrm{T}(X\beta+\varepsilon)
= \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon.
\] This form of the OLS estimator is useful for deriving its statistical properties, since it expresses \(\widehat{\beta}\) in terms of the actual parameter value \(\beta\).
The OLS fitted values can be written as \[
\hat{y} = X\hat{\beta}=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}y.
\] The matrix \(X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\) is sometimes called the “hat” matrix (because it puts a “hat” on \(y\)). It is also called the “projection” matrix, since it projects \(y\) onto the space spanned by the columns of \(X\) (see Tay, Preve, and Baydur (2025) Chapter 10). It is often denoted \(P\). It has the convenient property that it is symmetric: \[
P^\mathrm{T}
= (X(X^\mathrm{T}X)^{-1}X^\mathrm{T})^\mathrm{T}
= (X^{\mathrm{T}\mathrm{T}}((X^\mathrm{T}X)^{-1})^\mathrm{T}X^\mathrm{T}) = X(X^\mathrm{T}X)^{-1}X^\mathrm{T} = P
\] where we have used the fact that \((X^\mathrm{T}X)^{-1}\) is symmetric (why is \((X^\mathrm{T}X)^{-1}\) symmetric?).
The matrix \(P\) is also idempotent, meaning that \(PP=P\): \[
PP=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}X(X^\mathrm{T}X)^{-1}X^\mathrm{T}=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}=P.
\] Symmetric and idempotent matrices have the convenient property that their rank is equal to their trace, which is easy to compute. Since \[
\mathrm{tr}(P) = \mathrm{tr}(X(X^\mathrm{T}X)^{-1}X^\mathrm{T}) = \mathrm{tr}(X^\mathrm{T}X(X^\mathrm{T}X)^{-1}) = \mathrm{tr}(I_K) = K,
\] the rank of \(P\) is \(K\).
The OLS residuals can be written as \[
\hat{\varepsilon} = y - \hat{y} = (I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T})y.
\] The matrix \(I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\) is also symmetric and idempotent, and its trace, and therefore its rank, is \(n-K\) (see exercises). It is often denoted by \(M\), and has the property that it “eliminates \(X\)” in the sense that \[
MX = (I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T})X = X-X = 0.
\] As a consequence of this, we have \[
MP = MX(X^\mathrm{T}X)^{-1}X^\mathrm{T} = 0.
\] Of course, you can also see this from \(MP=(I-P)P = P-PP = P-P = 0\).
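A small numerical illustration of these properties of \(P\) and \(M\) (again a sketch with simulated data):

```r
# Minimal sketch with simulated data: properties of the projection matrix P and
# the residual-maker matrix M
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                 # n x K regressor matrix, K = 3
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

P <- X %*% solve(t(X) %*% X) %*% t(X)             # "hat" / projection matrix
M <- diag(n) - P                                  # residual-maker matrix

max(abs(P - t(P)))                                # ~ 0: P is symmetric
max(abs(P %*% P - P))                             # ~ 0: P is idempotent
sum(diag(P))                                      # = K = 3: trace (= rank) of P
max(abs(M %*% X))                                 # ~ 0: M "eliminates" X
max(abs(M %*% y - residuals(lm(y ~ X[, -1]))))    # ~ 0: M y gives the OLS residuals
```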
We have already noted from the FOC that the OLS residuals sum to zero and are orthogonal to each of the regressors. Since \(y = \hat{y} + \hat{\varepsilon}\) and the residuals sum to zero, it follows that \(\overline{Y} = \overline{\hat{Y}}\). Furthermore, \(\hat{y}^\mathrm{T}\hat{\varepsilon} = \hat{\beta}^\mathrm{T}X^\mathrm{T}\hat{\varepsilon} = 0\); that is, the fitted values and the residuals are orthogonal. We can also see this from the fact that \(MP=PM=0\): \[
\hat{y}^\mathrm{T}\hat{\varepsilon} = y^\mathrm{T}PMy = 0.
\] For those who are comfortable thinking about \(n\)-dimensional vectors in geometric terms, this means the fitted values and residuals are at “right-angles” in \(n\)-dimensional space. The length of a vector \(y\) is \(\sqrt{y^\mathrm{T}y}\). Using the orthogonality of the fitted values and residuals, we get \[
\begin{aligned}
y^\mathrm{T}y &= \hat{y}^\mathrm{T}\hat{y} + 2\hat{y}^\mathrm{T}\hat{\varepsilon} + \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} \\
&= \hat{y}^\mathrm{T}\hat{y} + \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}.
\end{aligned}
\tag{6.10}\] This is just Pythagoras’s Theorem (in \(n\)-dimensional space).
The \(TSS = FSS + RSS\) equality continues to hold in the multiple regression case \[
\sum_{i=1}^n (Y_i - \overline{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \overline{\hat{Y}})^2 + \sum_{i=1}^n \hat{\epsilon}_i^2.
\tag{6.11}\] See Exercise 6.3 and Exercise 6.4 for a matrix algebra version of this identity.
One useful application of the fact that \(M\) eliminates \(X\) is to derive a formula linking the residuals to the noise terms. We have \[
\hat{\varepsilon} = My = M(X\beta+\varepsilon) = M\varepsilon\,.
\] This result, and the fact that \(M\) is symmetric and idempotent, means that the sum of squared residuals can be written as \[
\hat{\varepsilon}^\mathrm{T} \hat{\varepsilon}
= (M\varepsilon)^\mathrm{T}M\varepsilon
= \varepsilon^\mathrm{T}M^\mathrm{T}M\varepsilon
= \varepsilon^\mathrm{T}M\varepsilon.
\] This expression is also very useful for deriving properties of OLS estimators.
6.4 Statistical Properties of OLS Estimators
6.4.1 Unbiasedness
Under our assumptions, \(\hat{\beta}\) is unbiased. Using \(\hat{\beta} = \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon\), we have \[
E(\hat{\beta} \mid X)
= \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}E(\varepsilon \mid X)
= \beta
\] which implies \(E(\hat{\beta})=\beta\). The key assumption delivering unbiasedness is, of course, \(E(\varepsilon \mid X) = 0\). Note that the structure of the variance-covariance matrix of \(\varepsilon\) is irrelevant as far as unbiasedness of OLS estimators is concerned.
6.4.2 Variance-Covariance Matrices
The variances and covariances of all of the OLS coefficient estimators can be obtained by computing the (conditional) variance-covariance matrix of \(\hat{\beta}\): \[
\begin{aligned}
\mathit{Var}(\hat{\beta} \mid X)
&= E((\hat{\beta} - E(\hat{\beta}))(\hat{\beta} - E(\hat{\beta}))^\mathrm{T} \mid X) \\
&= E((\hat{\beta} - \beta)(\hat{\beta} - \beta)^\mathrm{T} \mid X) \\
&= E((X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon \varepsilon^\mathrm{T}X(X^\mathrm{T}X)^{-1} \mid X) \\
&= (X^\mathrm{T}X)^{-1}X^\mathrm{T}E(\varepsilon \varepsilon^\mathrm{T} \mid X)X(X^\mathrm{T}X)^{-1}.
\end{aligned}
\] If we further assume homoskedastic and uncorrelated errors, then \(E(\varepsilon \varepsilon^\mathrm{T} \mid X) = \sigma^2 I\), and \(\mathit{Var}(\hat{\beta} \mid X)\) simplifies to \[
\begin{aligned}
\mathit{Var}(\hat{\beta} \mid X)
&= (X^\mathrm{T}X)^{-1}X^\mathrm{T}(\sigma^2 I)X(X^\mathrm{T}X)^{-1} \\[1ex]
&= \sigma^2(X^\mathrm{T}X)^{-1}X^\mathrm{T}X(X^\mathrm{T}X)^{-1} \\[1ex]
&= \sigma^2(X^\mathrm{T}X)^{-1}
\end{aligned}
\tag{6.12}\]
Example 6.1 In the simple linear regression \(Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_i\), \(i=1,2,...,n\), we have \[
X =
\begin{bmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1}
\end{bmatrix}
\quad \text{and} \quad
X^\mathrm{T}X =
\begin{bmatrix}
n & \sum_{i=1}^n X_{i1} \\
\sum_{i=1}^n X_{i1} & \sum_{i=1}^n X_{i1}^2
\end{bmatrix}\,.
\] The OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) can be found from the \(2 \times 1\) vector \[
\hat{\beta}
= \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}
= (X^\mathrm{T}X)^{-1}X^\mathrm{T}y \,.
\] The formulas obtained are the same as the ones previously derived. With homoskedastic and uncorrelated errors, the variances and covariances of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) can be found from \[
\mathit{Var}(\hat{\beta} \mid X)
= \begin{bmatrix}
\mathit{Var}(\hat{\beta}_0 \mid X) & \mathit{Cov}(\hat{\beta}_0,\hat{\beta}_1 \mid X) \\
\mathit{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X) & \mathit{Var}(\hat{\beta}_1 \mid X) \end{bmatrix}
= \sigma^2(X^\mathrm{T}X)^{-1}.
\tag{6.13}\] Previously we derived the expression for the variance of \(\hat{\beta}_1\) only. The expression in (6.13) contains the variances of both \(\hat{\beta}_0\) and \(\hat{\beta}_1\), as well as their covariance. You will explore these expressions in the exercises.
In order to operationalize (6.13), we have to estimate \(\sigma^2\). Under homoskedasticity, an unbiased estimator for \(\sigma^2\) is \[
\begin{aligned}
\widehat{\sigma^2}
= \frac{\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}}{n-K}
= \frac{\sum_{i=1}^n \hat{\epsilon}_i^2}{n-K}.
\end{aligned}
\] To prove unbiasedness of this estimator, we note that \[
\begin{aligned}
E(\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} \mid X)
&= E(\varepsilon^\mathrm{T}M\varepsilon \mid X)
= E(\mathrm{tr}(\varepsilon^\mathrm{T}M\varepsilon) \mid X) \\
&= E(\mathrm{tr}(\varepsilon\varepsilon^\mathrm{T}M) \mid X)
= \mathrm{tr}(E(\varepsilon\varepsilon^\mathrm{T}M \mid X)) \\
&= \mathrm{tr}(E(\varepsilon\varepsilon^\mathrm{T} \mid X)M)
= \mathrm{tr}(\sigma^2M) \\
&= \sigma^2 (n-K).
\end{aligned}
\] Unbiasedness of \(\widehat{\sigma^2}\) follows. We therefore estimate \(\mathit{Var}(\hat{\beta})\) using \[
\widehat{\mathit{Var}}(\hat{\beta} \mid X) = \widehat{\sigma^2}(X^\mathrm{T}X)^{-1}.
\]
Example 6.2 In Example 3.2 we regressed \(\ln earn\) on \(educ\) using data from earnings2019.csv with the lm() function; the regression is repeated here for reference.
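A sketch of that regression, together with the matrix computations of this section, might look as follows; it assumes the file earnings2019.csv and the variable names earn and educ from Example 3.2 are available:

```r
# Sketch (assumes earnings2019.csv with variables earn and educ, as in Example 3.2)
df  <- read.csv("earnings2019.csv")
fit <- lm(log(earn) ~ educ, data = df)
summary(fit)

# The same estimates and standard errors by direct matrix computation
y <- log(df$earn)
X <- cbind(1, df$educ)
n <- nrow(X); K <- ncol(X)

beta_hat  <- solve(t(X) %*% X, t(X) %*% y)          # (6.7)
res       <- y - X %*% beta_hat
sigma2hat <- sum(res^2) / (n - K)                   # unbiased estimator of sigma^2
V_hat     <- sigma2hat * solve(t(X) %*% X)          # estimated Var(beta_hat | X), (6.12)
sqrt(diag(V_hat))                                   # standard errors; should match summary(fit)
```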
In the general case \(E(\varepsilon \varepsilon^\mathrm{T} \mid X) = \Omega\), the formula for the variance-covariance matrix of \(\hat\beta\) is \[
\mathit{Var}(\hat{\beta} \mid X)
= (X^\mathrm{T}X)^{-1}X^\mathrm{T}\Omega X(X^\mathrm{T}X)^{-1}.
\] To get any further we would have to put more structure on \(\Omega\). If we have uncorrelated but possibly heteroskedastic noise terms, then \[
\begin{aligned}
&\mathit{Var}(\hat{\beta} \mid X) \\[2ex]
&= (X^\mathrm{T}X)^{-1}X^\mathrm{T}\mathrm{diag}(\sigma_1^2,\sigma_2^2,\dots,\sigma_n^2) X(X^\mathrm{T}X)^{-1} \\
&= \left( \sum_{i=1}^n X_{i*}^\mathrm{T} X_{i*}\right)^{-1}
\begin{bmatrix} X_{1*}^\mathrm{T} & X_{2*}^\mathrm{T} & \dots & X_{n*}^\mathrm{T} \end{bmatrix}
\begin{bmatrix}
\sigma_1^2 & 0 & \dots & 0 \\
0 & \sigma_2^2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma_n^2
\end{bmatrix}
\begin{bmatrix}
X_{1*} \\ X_{2*} \\ \vdots \\ X_{n*} \end{bmatrix}
\left( \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1} \\
&=
\left( \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left( \sum_{i=1}^n \sigma_i^2 X_{i*}^\mathrm{T}X_{i*}\right)
\left( \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}\,.
\end{aligned}
\tag{6.14}\]
Expression (6.14) is not “operational” because we do not know the \(\sigma_i^2\) nor can we estimate them — there are \(n\) of them and we only have \(n\) observations. Nonetheless it is still possible to develop an estimator for \(\mathit{Var}(\hat\beta)\) from (6.14). We will see how this can be done when we discuss asymptotic properties of OLS estimators. In the meantime, we continue our discussion under the assumption of homoskedasticity. In particular, we show that if the noise terms are homoskedastic, then OLS estimators are the most precise estimators among all linear unbiased estimators, or “best linear unbiased” in the sense that \[
\mathit{Var}(c^\mathrm{T}\hat{\beta} \mid X) \leq \mathit{Var}(c^\mathrm{T}\tilde{\beta} \mid X)
\tag{6.15}\] for all \(K \times 1\) vectors \(c\), and for all unbiased estimators of the form \(\tilde{\beta} = By\). Then we discuss hypothesis testing under homoskedastic errors, and then finally the asymptotic properties of OLS estimators.
6.4.3 Best Linear Unbiasedness
We elaborate on (6.15) a little. Recall that OLS estimators are linear estimators, meaning that they can be written in the form \(\hat{\beta} = Ay\). In the case of OLS estimators, \(A = (X^\mathrm{T}X)^{-1}X^\mathrm{T}\). Suppose someone came up with a different linear unbiased estimator for \(\beta\), call it \(\widetilde{\beta} = By\) where \(B \neq A\). Which is better, \(\hat{\beta}\) or \(\widetilde{\beta}\)? They are both unbiased, so they are equally good from that perspective. But given two unbiased estimators we should prefer the one with smaller variance. The inequality (6.15) says that each individual \(\hat{\beta}_k\) will have a variance less than, or at worst equal to, the variance of \(\widetilde{\beta}_k\) (choose \(c\) such that \(c_k = 1\), \(c_j = 0\) for all \(j \neq k\)). It also says that any linear combination of \(\hat{\beta}\) will have variance less than, or at worst equal to, the variance of the same linear combination of \(\widetilde{\beta}\).
We say that under homoskedasticity, OLS estimators of the linear regression model are best linear unbiased. This result is known as the Gauss-Markov theorem. We also say OLS estimators under homoskedasticity are efficient.
Example 6.3 For a regression \(Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{K-1} X_{i,K-1} + \epsilon_i\) with homoskedastic \(\epsilon_i\), the OLS prediction for \(Y\) at \((X_1, \dots, X_{K-1}) = (X_{0,1}, \dots, X_{0,K-1})\) is \[
\hat{Y}(X_0)= \hat\beta_0 + \hat\beta_1 X_{01} + \dots + \hat\beta_{K-1} X_{0,K-1}\,.
\]
This is a linear combination of OLS estimators, and by (6.15) it is the best linear unbiased prediction of \(Y\) at \(X_{0}\).
Example 6.4 Recall the dataset multireg_eg.csv containing data on three variables \(X\), \(Y\) and \(Z\). The first two are continuous random variables whereas \(Z\) is discrete, taking values in \(\{1, 2, 3, 4, 5\}\). See Fig. 4.3. If it helps, imagine that \(Y\) is test score for a certain course, \(X\) is study time per week a student spends on the course, and \(Z\) is the student’s background for the course. To estimate the effect of study time \(X\) on test scores \(Y\) controlling for student prior preparedness for the course, you estimate the regression \[
Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon
using OLS, with interest in the parameter \(\beta_1\). You find little evidence of heteroskedasticity, so the OLS estimator \(\widehat{\beta}_1\) is best linear unbiased.
Suppose someone now suggests that you estimate five simple linear regressions of \(Y\) on \(X\), one for each value of \(Z\), i.e., estimate the regressions \[
Y_{i} = \beta_{0j} + \beta_{1j}X_{i} + \epsilon_{i}\,,\,\text{ for all }\,i\; \text{ such that }\, Z_i = j\,,\,j\in\{1, 2, 3, 4, 5\}
\] and then average the five estimates, i.e., let \(\tilde{\beta}_1 = \frac{1}{5}\sum_{j=1}^5 \widehat{\beta}_{1j}\). You are asked in an exercise to show that \(\widetilde{\beta}_1\) is indeed a linear unbiased estimator for \(\beta_1\). However, we should still prefer the OLS estimator, since it is best among all linear unbiased estimators.
To prove (6.15), let \(\tilde{\beta} = By\) where \(B\) comprises constants and elements of \(X\), and such that \(\tilde{\beta}\) is unbiased. Write \(B = D + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\), so \[
\tilde{\beta} = DX\beta+D\varepsilon + \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon.
\] We have already assumed \(E(\varepsilon \mid X)=0\). For \(\tilde{\beta}\) to be unbiased for every value of \(\beta\), we must also have \(DX=0\), so we impose this condition. Then \[
\begin{aligned}
\mathit{Var}(\tilde{\beta} \mid X)
&= E((\tilde{\beta}-\beta)(\tilde{\beta}-\beta)^\mathrm{T} \mid X) \\
&= E((D+(X^\mathrm{T}X)^{-1}X^\mathrm{T})\varepsilon
\varepsilon^\mathrm{T}(D+(X^\mathrm{T}X)^{-1}X^\mathrm{T})^\mathrm{T} \mid X) \\
&= \sigma^2(D+(X^\mathrm{T}X)^{-1}X^\mathrm{T})(D+(X^\mathrm{T}X)^{-1}X^\mathrm{T})^\mathrm{T} \\
&= \sigma^2(DD^\mathrm{T}+(X^\mathrm{T}X)^{-1}) \qquad (\text{the cross terms vanish because } DX=0) \\
&= \mathit{Var}(\hat{\beta} \mid X) + \sigma^2DD^\mathrm{T} \,.
\end{aligned}
\] It follows that for any \(c \neq 0\), \[
\begin{aligned}
\mathit{Var}(c^\mathrm{T}\tilde{\beta} \mid X)
&= c^\mathrm{T}\mathit{Var}(\tilde{\beta} \mid X)c \\
&= c^\mathrm{T}\mathit{Var}(\hat{\beta} \mid X)c + \sigma^2c^\mathrm{T}DD^\mathrm{T}c \\
&= \mathit{Var}(c^\mathrm{T}\hat{\beta} \mid X) + \sigma^2(D^\mathrm{T}c)^\mathrm{T}(D^\mathrm{T}c) \,.
\end{aligned}
\] The second term on the right-hand side is non-negative, so result (6.15) follows.
6.4.4 Hypothesis Testing
We have already seen how to test single and joint hypotheses in Section 4.5. In this section, we develop matrix algebra formulas for the \(t\) and \(F\) tests.
If in the regression \(y = X\beta + \varepsilon\), \(E(\varepsilon \mid X) = 0\) and \(\mathit{Var}(\varepsilon \mid X) = \sigma^2 I\), we add the assumption that the noise terms are normally distributed, i.e., \[
y = X\beta + \varepsilon\,,\,\, \varepsilon \mid X \sim \text{Normal}_n(0, \sigma^2 I)\,,
\] then we have \[
\hat{\beta} \mid X \sim \text{Normal}_K(\beta, \sigma^2(X^\mathrm{T}X)^{-1})
\tag{6.16}\] since \(\hat{\beta} = \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon\) is a constant plus a linear combination of normally distributed noise terms. This can be used to develop \(t\) and \(F\) tests of linear hypotheses concerning elements of \(\beta\). A general single linear hypothesis can be written as \[
H_0: r^\mathrm{T}\beta = r_0 \quad \text{vs} \quad H_A: r^\mathrm{T}\beta \neq r_0.
\]
Example 6.5 To test \(\beta_1 + \beta_2 = 1\) in the regression \[
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i,
\] set \(r^\mathrm{T}=\begin{bmatrix} 0 & 1 & 1 \end{bmatrix}\) and \(r_0 = 1\).
From (6.16), we have \[
r^\mathrm{T}\hat{\beta} \mid X \sim \text{Normal}(r^\mathrm{T}\beta,r^\mathrm{T}(\sigma^2 (X^\mathrm{T}X)^{-1})r).
\] If the null hypothesis \(r^\mathrm{T}\beta = r_0\) holds, then \[
r^\mathrm{T}\hat{\beta} \mid X \sim \text{Normal}(r_0,r^\mathrm{T}(\sigma^2 (X^\mathrm{T}X)^{-1})r)
\] and \[
\frac{r^\mathrm{T}\hat{\beta} - r_0}{\sqrt{r^\mathrm{T}(\sigma^2 (X^\mathrm{T}X)^{-1})r}} \sim \text{Normal}(0,1).
\] Furthermore, it can be shown that if we replace \(\sigma^2\) with \(\widehat{\sigma^2}\), then \[
\begin{aligned}
t &= \frac{r^\mathrm{T}\hat{\beta} - r_0}{\sqrt{ r^\mathrm{T}(\widehat{\sigma^2}(X^\mathrm{T}X)^{-1})r}} \\
&= \frac{r^\mathrm{T}\hat{\beta} - r_0}{\sqrt{ r^\mathrm{T}\widehat{\mathit{Var}}(\hat{\beta} \mid X)r}}
\sim \text{t}{(n-K)}.
\end{aligned}
\tag{6.17}\] This can be used to test the hypothesis \(H_0: r^\mathrm{T}\beta = r_0\) in the usual way.
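The following sketch computes the \(t\)-statistic in (6.17) for the hypothesis of Example 6.5 (\(\beta_1+\beta_2=1\)) on simulated data; vcov() returns the estimated variance-covariance matrix \(\widehat{\sigma^2}(X^\mathrm{T}X)^{-1}\) of the fitted lm object:

```r
# Minimal sketch with simulated data: the t-statistic (6.17) for H0: beta1 + beta2 = 1
set.seed(7)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 0.5 + 0.4 * x1 + 0.6 * x2 + rnorm(n)      # beta1 + beta2 = 1 holds in this DGP
fit <- lm(y ~ x1 + x2)

r  <- c(0, 1, 1)                                # r'beta = beta1 + beta2
r0 <- 1
tstat <- drop((sum(r * coef(fit)) - r0) /
              sqrt(t(r) %*% vcov(fit) %*% r))   # vcov(fit) = sigma2hat * (X'X)^{-1}
pval  <- 2 * pt(-abs(tstat), df = fit$df.residual)
c(t = tstat, p = pval)
```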
To test multiple hypotheses jointly, write the hypotheses as \[
H_0: \mathcal{R}\beta = r_0 \quad \text{vs} \quad H_A: \mathcal{R}\beta \neq r_0
\] where now \(\mathcal{R}\) is a \((J \times K)\) matrix, and \(r_0\) is a \((J \times 1)\) vector.
Example 6.6 To test the hypotheses \[
H_0: \beta_1 + \beta_2 = 1 \; \text{and} \; \beta_3 = 0
\quad \text{vs} \quad
H_A: \beta_1 + \beta_2 \neq 1 \; \text{or} \; \beta_3 \neq 0 \; \text{(or both)},
\] in the regression \[
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i
\]
set the matrices \(\mathcal{R}\) and \(r_0\) to \[
\mathcal{R} = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\quad \text{and} \quad r_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\]
To test multiple hypotheses jointly, we can again compare the residual sum of squares from the restricted and unrestricted regressions, as explained in Section 4.5. In particular, it can be shown that if the null is true, then \[
F = \frac{(\hat{\varepsilon}_{rls}^\mathrm{T}\hat{\varepsilon}_{rls} - \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}) / J}
{\hat{\varepsilon}^\mathrm{T}\hat{\varepsilon} / (n-K)}
\sim \text{F}{(J,n-K)}
\tag{6.18}\] where \(J\) is the number of restrictions being tested and \(\hat{\varepsilon}\) is the vector of unrestricted OLS residuals. You would reject \(H_0\) if the \(F\)-statistic is “improbably large”, meaning that \(F > \text{F}_{1-\alpha}(J, n-K)\) where \(\text{F}_{1-\alpha}(J, n-K)\) is the \(1-\alpha\) percentile of the \(\mathrm{F}(J,n-K)\) distribution. Typically \(\alpha\) = 0.01, 0.05 or 0.1.
It turns out that in practice, one does not actually have to compute the restricted regression. It can be shown that \[
\hat{\varepsilon}_{rls}^\mathrm{T}\hat{\varepsilon}_{rls} - \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}
= (\mathcal{R}\hat{\beta}-r_0)^\mathrm{T}(\mathcal{R}(X^\mathrm{T}X)^{-1}\mathcal{R}^\mathrm{T})^{-1}(\mathcal{R}\hat{\beta}-r_0)
\tag{6.19}\] where \(\hat{\beta}\) is the vector of unrestricted OLS estimators (we will show this shortly). Furthermore, since the denominator of the \(F\)-statistic is \(\widehat{\sigma^2}\), we can write the \(F\)-statistic as \[
\begin{aligned}
F
&= (\mathcal{R}\hat{\beta}-r_0)^\mathrm{T}
(\mathcal{R}(\widehat{\sigma^2}(X^\mathrm{T}X)^{-1})\mathcal{R}^\mathrm{T})^{-1}
(\mathcal{R}\hat{\beta}-r_0)/J \\
&= (\mathcal{R}\hat{\beta}-r_0)^\mathrm{T}
(\mathcal{R}\,\widehat{\mathit{Var}}(\hat{\beta} \mid X)\,\mathcal{R}^\mathrm{T})^{-1}
(\mathcal{R}\hat{\beta}-r_0)/J \\
&\sim \text{F}{(J,n-K)}
\end{aligned}
\tag{6.20}\] If the error terms are not normally distributed, we can use the asymptotically valid chi-square version of the test, namely \[
JF \overset{a}\sim \chi^2(J)\,.
\]
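The Wald form of the \(F\)-statistic in (6.20) is straightforward to compute directly. A sketch with simulated data, testing the two restrictions of Example 6.6:

```r
# Minimal sketch with simulated data: Wald F-statistic (6.20) for
# H0: beta1 + beta2 = 1 and beta3 = 0
set.seed(7)
n   <- 300
x1  <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y   <- 0.5 + 0.4 * x1 + 0.6 * x2 + 0 * x3 + rnorm(n)   # H0 holds in this DGP
fit <- lm(y ~ x1 + x2 + x3)

R  <- rbind(c(0, 1, 1, 0),
            c(0, 0, 0, 1))
r0 <- c(1, 0)
J  <- nrow(R)

d     <- R %*% coef(fit) - r0
Fstat <- drop(t(d) %*% solve(R %*% vcov(fit) %*% t(R)) %*% d) / J
pF    <- pf(Fstat, J, fit$df.residual, lower.tail = FALSE)
pChi  <- pchisq(J * Fstat, J, lower.tail = FALSE)       # asymptotic chi-square version
c(F = Fstat, p_F = pF, p_chisq = pChi)
```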
The following is the proof of (6.19). Let the regression model be \(y=X\beta+\varepsilon\), and let \(\hat{\beta}^{rls}\) be the least squares estimator for \(\beta\) subject to the restriction that \(\mathcal{R}\beta=r\) (we write \(r\) in place of \(r_0\) to reduce clutter). We first show that \[
\hat{\beta}^{rls}
= \hat{\beta}^{ols}
+ (X^\mathrm{T}X)^{-1}\mathcal{R}^\mathrm{T}(\mathcal{R}(X^\mathrm{T}X)^{-1}\mathcal{R}^\mathrm{T})^{-1}(r-\mathcal{R}\hat{\beta}^{ols})
\] where \(\hat{\beta}^{ols}\) is the usual unrestricted OLS estimator. The restricted RSS minimization problem is \[
\hat{\beta}^{rls} = \text{argmin}_{\hat{\beta}}(y-X\hat{\beta})^\mathrm{T}(y-X\hat{\beta}) \; \text{subject to} \; \mathcal{R}\hat{\beta}-r=0.
\] The Lagrangian and FOC are \[
L= (y-X\hat{\beta})^\mathrm{T}(y-X\hat{\beta}) + 2(r^\mathrm{T} - \hat{\beta}^\mathrm{T} \mathcal{R}^\mathrm{T})\lambda\,.
\]\[
\begin{aligned}
\left.\frac{\partial L}{\partial \hat{\beta}}\right|_{\hat{\beta}^{rls},\hat{\lambda}}
&= -2X^\mathrm{T}y + 2X^\mathrm{T}X\hat{\beta}^{rls}-2\mathcal{R}^\mathrm{T}\hat{\lambda}=0 \\
\left.\frac{\partial L}{\partial \hat{\lambda}}\right|_{\hat{\beta}^{rls},\hat{\lambda}}
&= \phantom{-}2(r-\mathcal{R}\hat{\beta}^{rls})=0 \\
\end{aligned}
\] The second equation in the FOC merely says that the restriction must hold. The first equation in the FOC implies \[
\hat{\beta}^{rls}
= (X^\mathrm{T}X)^{-1}X^\mathrm{T}y + (X^\mathrm{T}X)^{-1} \mathcal{R}^\mathrm{T} \hat{\lambda}
= \hat{\beta}^{ols} + (X^\mathrm{T}X)^{-1} \mathcal{R}^\mathrm{T} \hat{\lambda}\,.
\] Multiplying throughout by \(\mathcal{R}\) gives \[
\mathcal{R}\hat{\beta}^{rls}
= \mathcal{R}\hat{\beta}^{ols} + \mathcal{R}(X^\mathrm{T}X)^{-1} \mathcal{R}^\mathrm{T} \hat{\lambda}\,.
\] It follows that \[
\begin{aligned}
\hat{\lambda}
&= (\mathcal{R}(X^\mathrm{T}X)^{-1} \mathcal{R}^\mathrm{T})^{-1}
(\mathcal{R}\hat{\beta}^{rls}- \mathcal{R}\hat{\beta}^{ols} ) \\
&= (\mathcal{R}(X^\mathrm{T}X)^{-1} \mathcal{R}^\mathrm{T})^{-1}
(r- \mathcal{R}\hat{\beta}^{ols} )\,,
\end{aligned}
\] and therefore \[
\hat{\beta}^{rls}=\hat{\beta}^{ols} +
(X^\mathrm{T}X)^{-1}\mathcal{R}^\mathrm{T}(\mathcal{R}(X^\mathrm{T}X)^{-1}\mathcal{R}^\mathrm{T})^{-1}(r-\mathcal{R}\hat{\beta}^{ols})\,.
\tag{6.21}\] Now let \[
\begin{aligned}
\hat{\varepsilon}_{rls}
&= y - X\hat{\beta}^{rls} \\
&= y - X\hat{\beta}^{ols} + X\hat{\beta}^{ols} - X\hat{\beta}^{rls}
= \hat{\varepsilon}_{ols} + X(\hat{\beta}^{ols} - \hat{\beta}^{rls})\,.
\end{aligned}
\tag{6.22}\] Since (unrestricted) OLS residuals are orthogonal to the regressors, we have \[
\hat{\varepsilon}_{ols}^\mathrm{T} \hat{\varepsilon}_{rls}
= \hat{\varepsilon}_{ols}^\mathrm{T} \hat{\varepsilon}_{ols}
+ \hat{\varepsilon}_{ols}^\mathrm{T}X(\hat{\beta}^{ols} - \hat{\beta}^{rls})
= \hat{\varepsilon}_{ols}^\mathrm{T} \hat{\varepsilon}_{ols}.
\] Therefore \[
(\hat{\varepsilon}_{rls} - \hat{\varepsilon}_{ols})^\mathrm{T}(\hat{\varepsilon}_{rls} - \hat{\varepsilon}_{ols})
=
\hat{\varepsilon}_{rls}^\mathrm{T}\hat{\varepsilon}_{rls} - \hat{\varepsilon}_{ols}^\mathrm{T}\hat{\varepsilon}_{ols}.
\tag{6.23}\] Finally, use (6.21), (6.22) and (6.23) to show (6.19).
Example 6.7 Previously we estimated the equation \[
\ln earn = \beta_0 + \beta_1 educ + \beta_2 black + \beta_3 female + \beta_4 black.female+ \epsilon
by OLS using the lm function, and used the linearHypothesis function from the car package to carry out a t-test of the hypothesis \(H_0: \beta_2 = \beta_3\) vs \(H_A: \beta_2 \neq \beta_3\), and an F-test of the joint hypotheses \(H_0: \beta_2 = \beta_3\) and \(\beta_4 = 0\). The relevant calls are repeated below:
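A sketch of those calls (it assumes the earnings2019.csv file and the variable names used earlier; running it reproduces the tests):

```r
# Sketch (assumes earnings2019.csv with variables earn, educ, black, female)
library(car)
df  <- read.csv("earnings2019.csv")
fit <- lm(log(earn) ~ educ + black * female, data = df)   # black*female includes the interaction

linearHypothesis(fit, "black = female")                   # single hypothesis: beta2 = beta3
linearHypothesis(fit, c("black = female",                 # joint hypotheses: beta2 = beta3
                        "black:female = 0"))              # and beta4 = 0
```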
Notice that the linearHypothesis() function returns an \(F\)-statistic even when testing a single hypothesis. Of course, even though the \(F\)-test is designed for testing joint hypotheses, there is no reason why it can't be used to test a single hypothesis. It can be shown (see exercises) that in general the \(F\)-statistic for a test of a single hypothesis is the square of the corresponding \(t\)-statistic. You can verify that this is the case in the example above. The two tests will produce identical p-values.
6.5 Some Asymptotic Results
In this section, we provide rough arguments for the consistency and asymptotic normality of the OLS estimators. We also provide an asymptotically valid estimator for the variance-covariance matrix of the OLS estimator under conditional heteroskedasticity. For more complete arguments, see an advanced econometrics text such as Hayashi (2000).
6.5.1 Consistency
Partitioning \(X\) by observation, as in (6.4), we can write the OLS estimator as \[
\begin{aligned}
\hat{\beta}^{ols}
&= \beta + (X^\mathrm{T}X)^{-1}X^\mathrm{T}\varepsilon \\
&= \beta + \left\{
\begin{bmatrix}
X_{1*}^\mathrm{T} & X_{2*}^\mathrm{T} & \dots & X_{n*}^\mathrm{T}
\end{bmatrix}
\begin{bmatrix}
X_{1*} \\ X_{2*} \\ \vdots \\ X_{n*}
\end{bmatrix}
\right\}^{-1}
\begin{bmatrix}
X_{1*}^\mathrm{T} & X_{2*}^\mathrm{T} & \dots & X_{n*}^\mathrm{T}
\end{bmatrix}
\begin{bmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{bmatrix} \\
&=
\beta + \left( \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}\sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i \\[1ex]
&=
\beta + \left(\frac{1}{n} \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i\right).
\end{aligned}
\tag{6.24}\] Note that \[
\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}
=
\begin{bmatrix}
1 & \frac{1}{n}\sum_{i=1}^n X_{i1} & \dots & \frac{1}{n}\sum_{i=1}^n X_{i,K-1} \\
\frac{1}{n}\sum_{i=1}^n X_{i1} & \frac{1}{n}\sum_{i=1}^n X_{i1}^2 & \dots &\frac{1}{n}\sum_{i=1}^n X_{i1}X_{i,K-1} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{n}\sum_{i=1}^n X_{i,K-1} & \frac{1}{n}\sum_{i=1}^n X_{i1}X_{i,K-1} & \dots & \frac{1}{n}\sum_{i=1}^n X_{i, K-1}^2
\end{bmatrix}
\] Since we have iid samples, we should expect each of the component sample means to converge to the corresponding population moment, e.g., \(\frac{1}{n}\sum_{i=1}^n X_{i1} \overset{p}\rightarrow E(X_1)\), \(\frac{1}{n}\sum_{i=1}^n X_{i1}^2 \overset{p}\rightarrow E(X_1^2)\), \(\frac{1}{n}\sum_{i=1}^n X_{i1}X_{i,K-1} \overset{p}\rightarrow E(X_1X_{K-1})\) and so on. That is, \[
\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}
\overset{p}\longrightarrow
\begin{bmatrix}
1 & E(X_1) & \dots & E(X_{K-1}) \\
E{(X_1)} & E(X_1^2) & \dots & E(X_1X_{K-1}) \\
\vdots & \vdots & \ddots & \vdots \\
E(X_{K-1}) & E(X_1X_{K-1}) & \dots & E(X_{K-1}^2)
\end{bmatrix}
= "\Sigma_{XX}"
\] which we assume is invertible. Likewise, we have \[
\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i
=
\begin{bmatrix}
\frac{1}{n}\sum_{i=1}^n \epsilon_{i} \\ \frac{1}{n}\sum_{i=1}^n X_{i1}\epsilon_i \\ \vdots \\ \frac{1}{n}\sum_{i=1}^n X_{i,K-1}\epsilon_i
\end{bmatrix}
\overset{p}{\rightarrow}
\begin{bmatrix}
E(\epsilon) \\ E(X_{1}\epsilon) \\ \vdots \\ E(X_{K-1}\epsilon)
\end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\] Therefore \[
\begin{aligned}
\hat{\beta}^{ols}
&=
\beta + \left(\frac{1}{n} \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i\right)
\end{aligned}
\overset{p}\longrightarrow \beta + \Sigma_{XX}^{-1} 0_K = \beta\,.
\] The key assumptions for consistency are \(E(\epsilon)=0\), \(E(X_1\epsilon)=0\), \(\dots\), \(E(X_{K-1}\epsilon)=0\); i.e., the noise term \(\epsilon\) has zero mean and is uncorrelated with each of the regressors in the population.
This is basically the general version of the simple linear regression argument that \[
\hat{\beta}_1
= \beta_1 + \frac{\text{sample cov}(X_i, \epsilon_i)}{\text{sample var}(X_i)}
\overset{p}\rightarrow \, \beta_1
which holds if \(\mathit{Cov}(X_i,\epsilon_i)=0\) and \(\mathit{Var}(X_i)\) is finite and non-zero, and if the sample covariance and sample variance converge in probability to their population counterparts.
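A quick simulation (a sketch with a made-up data-generating process) illustrates the consistency argument: the sampling spread of \(\hat{\beta}_1\) shrinks as \(n\) grows:

```r
# Minimal sketch: OLS estimates concentrate around the true beta as n grows (made-up DGP)
set.seed(123)
beta <- c(1, 2, -1)
one_beta1_hat <- function(n) {
  X <- cbind(1, rnorm(n), rnorm(n))
  y <- X %*% beta + rnorm(n)             # E(eps) = 0, eps uncorrelated with the regressors
  solve(t(X) %*% X, t(X) %*% y)[2]       # the estimate of beta_1
}
sapply(c(50, 500, 5000), function(n) sd(replicate(1000, one_beta1_hat(n))))
# the sampling spread of beta1-hat shrinks toward zero as n increases
```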
6.5.2 Asymptotic Normality
Since \(\hat{\beta}\) is consistent, its distribution is essentially degenerate in the limit. To obtain a non-degenerate limiting distribution, we have to scale \(\hat{\beta} - \beta\) by \(\sqrt{n}\). Rearranging (6.24), we have \[
\sqrt{n}(\hat{\beta} - \beta)
= \left(\frac{1}{n} \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i \right).
\] If in addition to our previous assumptions we add the assumption that \[
\frac{1}{n} \sum_{i=1}^n \epsilon_i^2X_{i*}^\mathrm{T}X_{i*}
\overset{p}\longrightarrow
S\;\text{ finite and non-singular}
\] then a CLT allows us to claim asymptotic normality. We have: \[
\frac{1}{\sqrt{n}} \sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i
\overset{d}\longrightarrow
\text{Normal}_K(0,S),
\] therefore \[
\sqrt{n}(\hat{\beta} - \beta)
= \underbrace{\left(\frac{1}{n} \sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}}
_{\overset{p}\longrightarrow \; \Sigma_{XX}^{-1}}
\underbrace{\left(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_{i*}^\mathrm{T}\epsilon_i \right)}
_{\overset{d}\longrightarrow \; \text{Normal}_K(0,S)}
\overset{d}\longrightarrow
\text{Normal}_K(0,\Sigma_{XX}^{-1}S\Sigma_{XX}^{-1})\,.
\] That is, \(\hat{\beta}\) is consistent and asymptotically normal, with asymptotic variance \(\mathit{Avar}(\hat{\beta})=\Sigma_{XX}^{-1}S\Sigma_{XX}^{-1}\).
6.5.3 Heteroskedasticity-Robust Standard Errors
This result justifies the approximation \[
\mathit{Var}(\hat{\beta}) \approx (1/n)\Sigma_{XX}^{-1}S\Sigma_{XX}^{-1}.
\] An obvious estimator for \(\Sigma_{XX}\) is \[
\hat{\Sigma}_{XX}=\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*} = \frac{1}{n}X^\mathrm{T}X\,.
\] Some additional assumptions (see Hayashi (2000), for example) guarantee \[
\hat{S} = \frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2 X_{i*}^\mathrm{T}X_{i*} \; \rightarrow_p \,S
\tag{6.25}\] where \(\widehat{\epsilon}_i\) are the OLS residuals. This allows us to consistently estimate the asymptotic variance of \(\hat{\beta}\) by \[
\widehat{\mathit{Avar}}(\hat{\beta})
=
\left(\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*} \right)^{-1}
\left(\frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2X_{i*}^\mathrm{T}X_{i*}\right)
\left(\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*} \right)^{-1},
\] and justifies the use of the following as an estimator for the variance of \(\hat{\beta}\): \[
\begin{aligned}
\widehat{\mathit{Var}}_{HC0}(\hat{\beta})
&=
\frac{1}{n}\widehat{\mathit{Avar}}(\hat{\beta}) \\
&=
\frac{1}{n}
\left(\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left(\frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2X_{i*}^\mathrm{T}X_{i*}\right)
\left(\frac{1}{n}\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1} \\
&=
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left(\sum_{i=1}^n \hat{\epsilon}_i^2X_{i*}^\mathrm{T}X_{i*}\right)
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}.
\end{aligned}
\tag{6.26}\]
Note that we have so far assumed iid sampling but we have not assumed homoskedasticity. In other words, (6.26) is valid under heteroskedasticity. The variance estimator in (6.26) is called a “Heteroskedasticity-Consistent Variance Estimator”. There are several versions. The version presented in (6.26) is often referred to as “HC0”, which is why we label it as such. Other versions will be discussed in the exercises. Because of its form, (6.26) is often called a “sandwich” estimator.
If there is conditional heteroskedasticity in the noise terms, the usual OLS variance estimator \(\widehat{\sigma^2}(X^\mathrm{T}X)^{-1}\) is not appropriate since \(\widehat{\sigma^2}((1/n)X^\mathrm{T}X)^{-1}\) is not a consistent estimator for the asymptotic variance. On the other hand, the variance estimator (6.26) remains consistent even if in fact the noise terms are conditionally homoskedastic. In this sense it is safer to use (6.26) for estimating the estimator variance if there is any possibility of conditional heteroskedasticity.
We have already noted that OLS estimators are efficient if there is conditional homoskedasticity. If there is conditional heteroskedasticity, then OLS estimators are no longer efficient. (The formula (6.26) allows us to estimate the estimator variance consistently, but doesn’t do anything about the efficiency of \(\hat{\beta}\) itself.) We have already addressed how we might try to get efficient estimators in the previous chapter.
Example 6.8 We adapt our earlier mlr function to produce HC0 heteroskedasticity-robust standard errors and apply it to the regression \[
\ln earn = \beta_0 + \beta_1 educ + \beta_2 black + \beta_3 female + \beta_4 black.female + \epsilon\,.
\]
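A sketch of the computation in (6.26), checked against vcovHC() from the sandwich package (the file and variable names are assumed as before; this is not the text's mlr function itself):

```r
# Sketch (assumes earnings2019.csv with variables earn, educ, black, female):
# HC0 robust standard errors computed from (6.26)
library(sandwich)
df  <- read.csv("earnings2019.csv")
fit <- lm(log(earn) ~ educ + black * female, data = df)

X   <- model.matrix(fit)
e   <- residuals(fit)
XtX_inv <- solve(t(X) %*% X)
meat    <- t(X) %*% (e^2 * X)                 # sum_i e_i^2 x_i x_i'
V_hc0   <- XtX_inv %*% meat %*% XtX_inv       # the "sandwich" in (6.26)

cbind(manual = sqrt(diag(V_hc0)),
      vcovHC = sqrt(diag(vcovHC(fit, type = "HC0"))))   # the two columns should agree
```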
We can use the heteroskedasticity-consistent variance estimator to construct heteroskedasticity-robust \(t\) and \(F\) statistics by replacing \(\widehat{\mathit{Var}}(\hat{\beta} \mid X)\) in (6.17) and (6.20) with the robust variance estimator in (6.26). With the linearHypothesis() function from car, you can carry out the robust test as shown below, where we use the chi-square version of the test (since the robust standard errors are only valid asymptotically):
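A sketch of the call that would produce output of this form, continuing from the code above (the covariance type HC0 is an assumption; vcov. and test are the arguments linearHypothesis() uses to accept a robust covariance matrix and to request the chi-square test):

```r
# Sketch: robust joint test; assumes fit, car and sandwich from the earlier sketches.
# The covariance type "HC0" is an assumption here.
linearHypothesis(fit, c("black = female", "black:female = 0"),
                 vcov. = vcovHC(fit, type = "HC0"), test = "Chisq")
```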
```
Linear hypothesis test:
black - female = 0
black:female = 0

Model 1: restricted model
Model 2: log(earn) ~ educ + black * female

Note: Coefficient covariance matrix supplied.

  Res.Df Df  Chisq Pr(>Chisq)   
1   4943                        
2   4941  2 9.6196    0.00815 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
6.6 Exercises
Exercise 6.1 Let \(y = X\beta + \varepsilon\,,\,\, E(\varepsilon \mid X) = 0\,,\,\, E(\varepsilon\varepsilon^\mathrm{T} \mid X) = \sigma^2 I\) represent the simple linear regression \[
Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_i \, , \; i=1,2,...,n
\] (a) Use (6.7) to find expressions for \(\hat\beta_0\) and \(\hat\beta_1\) and show that they are the same as those obtained in (3.21).
(b) Use (6.13) to find expressions for \(\mathit{Var}(\hat{\beta}_0 \mid X)\) and \(\mathit{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X)\) in the simple linear regression. What is the sign of \(\mathit{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X)\)?
Remark: For intuition regarding the sign of \(\mathit{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X)\), consider the fact that estimated regression lines always pass through the point \((\overline{X}, \overline{Y})\).
Exercise 6.2 Let \(X\) be an \(n \times K\) matrix with full column rank. Show that \[
M = I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T}
\] is symmetric and idempotent, with rank \(n-K\).
Exercise 6.3 Let \(y = X\beta + \varepsilon\) represent the regression \(Y_i = \beta_0 + \epsilon_i\), \(i=1,2,...,n\).
(a) Show that the residuals from this regression can be written as \[
\hat{\varepsilon} = M_0y
\] where \(M_0 = I - i_n(i_n^\mathrm{T}i_n)^{-1}i_n^\mathrm{T}\) and \(i_n\) is the \(n \times 1\) vector of ones.
(b) Show by direct computation that \(M_0y\) is the vector of deviations from means, i.e., \[
M_0y
= (I-i_n(i_n^\mathrm{T}i_n)^{-1}i_n^\mathrm{T})y
= \begin{bmatrix}
Y_1 - \overline{Y} \\
Y_2 - \overline{Y} \\
\vdots \\
Y_n - \overline{Y}
\end{bmatrix}.
\]
(c) Show that \(y^\mathrm{T}M_0y = \sum_{i=1}^n(Y_i - \overline{Y})^2\)
Exercise 6.4 Show that for the general linear regression model (with intercept) that \[
y^\mathrm{T}M_0y = \hat{y}^\mathrm{T}M_0\hat{y} + \hat{\varepsilon}^\mathrm{T}\hat{\varepsilon}.
\] This is the \(TSS = FSS + RSS\) equality that forms the basis of the \(R^2\) goodness-of-fit measure.
Exercise 6.5 Consider the regression \(e^i = X\beta + \varepsilon\), where \(e^i\) is the \(n \times 1\) vector comprising all zeros except for a ‘1’ in the \(i\mathrm{th}\) position. Let the matrix \(X\) be \(n \times K\) with full column rank.
(a) Show that the vector of fitted values \(\widehat{e^i}\) has the expression \[
\widehat{e^i} = X(X^\mathrm{T}X)^{-1}X_{i*}^\mathrm{T}
\] where \(X_{i*}\) is the \(i\mathrm{th}\) row of the \(X\) matrix.
(b) Define the ‘leverage’ of observation \(i\) to be \[
h_{i} = X_{i*}(X^\mathrm{T}X)^{-1}X_{i*}^\mathrm{T}.
\] Show that \(0 \leq h_{i} \leq 1\). Hint: Use part (a) and the “Pythagoras’s Theorem” result in (6.10).
(c) Explain why \(\sum_{i=1}^n h_i\) is the trace of the matrix \(P = X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\). Show that \(\sum_{i=1}^n h_{i} = K\). (In other words, the “average value” of \(h_{i}\) is \(K/n\).)
Remark: It can be shown that \[
\hat{\beta} - \hat{\beta}_{-i}
= \left( \frac{1}{1-h_i}\right)(X^\mathrm{T}X)^{-1}X_{i*}^\mathrm{T}\,\hat{\epsilon}_i
\] where \(\hat{\beta}_{-i}\) is the OLS estimator obtained when observation \(i\) is left out. An observation with \(h_i\) close to 1 therefore has very high leverage, or is “influential”.
Exercise 6.6 Consider the linear regression \(y = X\beta + \varepsilon\) where \(E(\varepsilon \mid X)=0\) and \(E(\varepsilon \varepsilon^\mathrm{T} \mid X) = \sigma^2I\). The fact that \(\widetilde{\sigma^2} = \frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2\) is biased implies that each individual \(\hat{\epsilon}_i^2\) must in general be a biased estimator for \(\sigma^2\). Show that \[
E(\hat{\epsilon}_i^2) = (1-h_{i})\sigma^2.
\]Hint: Use \(\hat{\epsilon}_i^2 = (e^i)^\mathrm{T}\hat{\varepsilon}\hat{\varepsilon}^\mathrm{T}e^i\) where \(e^i\) is as defined in the previous question, and \(\hat{\varepsilon} = M\varepsilon\) where \(M = I - X(X^\mathrm{T}X)^{-1}X^\mathrm{T}\).
Exercise 6.7 The “HC0” version of the heteroskedasticity-consistent standard errors given in (6.26) is sometimes criticized for not taking into consideration the fact that \(K\) degrees of freedom are used in the computation of \(\hat{\epsilon}_i\). Another version proposes to take this into account by estimating \(S\) with \[
\hat{S}_1 = \frac{1}{n-K}\sum_{i=1}^n \hat{\epsilon}_i^2X_{i*}^\mathrm{T}X_{i*}
\] which also consistently estimates \(S\).
(a) Show that using \(\hat{S}_1\) instead of \(\hat{S}\) in (6.25) results in the Heteroskedasticity-Consistent variance estimator \[
\begin{aligned}
\widehat{\mathit{Var}}_{HC1}(\hat{\beta})
&=
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left(\frac{n}{n-K}\sum_{i=1}^n \hat{\epsilon}_i^2X_{i*}^\mathrm{T}X_{i*}\right)
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1} \,.
\end{aligned}
\tag{6.27}\] Amend the code in Example 6.8 to use the HC1 version of the variance estimator, and verify your results using the vcovHC() function.
(b) Another version, based on the result in Exercise 6.6, is \[
\begin{aligned}
\widehat{\mathit{Var}}_{HC2}(\hat{\beta})
&=
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left(\sum_{i=1}^n \frac{\hat{\epsilon}_i^2}{1-h_i}X_{i*}^\mathrm{T}X_{i*}\right)
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1} \,.
\end{aligned}
\tag{6.28}\] Amend the code in Example 6.8 to use the HC2 version of the variance estimator, and verify your results using the vcovHC() function.
(c) The result in Exercise 6.6, of course, assumes conditional homoskedasticity. Yet another version is \[
\begin{aligned}
\widehat{\mathit{Var}}_{HC3}(\hat{\beta})
&=
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1}
\left(\sum_{i=1}^n \frac{\hat{\epsilon}_i^2}{(1-h_i)^2}X_{i*}^\mathrm{T}X_{i*}\right)
\left(\sum_{i=1}^n X_{i*}^\mathrm{T}X_{i*}\right)^{-1} \,.
\end{aligned}
\tag{6.29}\] This version puts more weight on observations that are more influential. Amend the code in Example 6.8 to use the HC3 version of the variance estimator, and verify your results using the vcovHC() function.
Exercise 6.8 (a) Show that the \(F\)-statistic for testing a single restriction, i.e., when \(J=1\), is equal to the square of the \(t\)-statistic for testing the same hypothesis. (Hint: Compare (6.17) and (6.20) after setting \(R=r^\mathrm{T}\) in the \(F\)-statistic.)
(b) Suppose you have the regressions \[
\begin{aligned}
&\text{(A)} \qquad Y_i = \beta_0 + \beta_1X_{i1} + \dots + \beta_{K-1}X_{i,K-1} + \epsilon_i \; , \\[1ex]
&\text{(B)} \qquad Y_i = \beta_0 + \beta_1X_{i1} + \dots + \beta_{K-1}X_{i,K-1} + \beta_{K}X_{iK} + \epsilon_i \;.
\end{aligned}
\] There are two possible ways of deciding whether or not to include the variable \(X_{K}\) in the regression. One is to do a \(t\)-test (or an \(F\)-test) of the hypothesis \(\beta_K=0\). Another is to see whether or not the Adjusted-\(R^2\) in (B) is greater than the Adjusted-\(R^2\) in (A). Show that the latter method is equivalent to including \(X_{K}\) if the absolute value of the \(t\)-statistic for \(\hat{\beta}_K\) in (B) is greater than 1. Hint: use the version of the \(F\)-statistic in (6.18).
Exercise 6.9 Suppose you estimated the regression equation \(y = X\beta + \varepsilon\,,\, E(\varepsilon \mid X) = 0\,,\,\,E(\varepsilon \varepsilon^\mathrm{T} \mid X) = \sigma^2 I\) using the iid sample \(\{Y_i, X_{i*}\}_{i=1}^n\). Let \[
\widehat{Y}(X_{0*}) = X_{0*}\hat{\beta}
\] be the prediction of \(Y\) for an independent draw from the population with \(X\) characteristics equal to \(X_{0*}\). The prediction error is \[
\hat{e}(X_{0*})=Y(X_{0*})-\hat{Y}(X_{0*})=X_{0*}\beta + \epsilon_0 - X_{0*}\hat{\beta} = X_{0*}(\beta-\hat{\beta})+\epsilon_0\,.
\]
(a) Derive an expression for the prediction error variance.
(b) Specialize your answer in (a) to the simple linear regression \(Y = \beta_0 + \beta_1X_{1} + \epsilon\), predicting \(Y\) at \(X_{1}= X_{01}\). Show that the prediction mean squared error is \[
E(\hat{e}(X_{01})^2 \mid X) = \sigma^2\left(1+ \frac{1}{n} + \frac{(X_{01}-\overline{X}_1)^2}{\sum_{i=1}^n(X_{i1} - \overline{X}_1)^2}\right).
\] Explain why this is also the prediction error variance.
Hayashi, Fumio. 2000. Econometrics. Princeton University Press.
Tay, Anthony, Daniel Preve, and Ismail Baydur. 2025. Mathematics and Programming for the Quantitative Economist. World Scientific.