How does ridge regression use an L2 penalty to handle multicollinearity, and how do you choose the penalty parameter?
I understand that OLS can produce unstable coefficients when predictors are highly correlated. Ridge regression supposedly fixes this by adding a penalty term. But what does the L2 penalty actually do to the coefficient estimates geometrically? And how does lambda control the tradeoff between bias and variance?
Ridge regression adds a squared-magnitude penalty (the L2 norm) to the OLS objective, shrinking coefficients toward zero without eliminating any. This stabilizes estimates when predictors are correlated, trading a small amount of bias for a large reduction in variance.

Objective Function:

OLS minimizes: sum_i (y_i - x_i'beta)^2

Ridge minimizes: sum_i (y_i - x_i'beta)^2 + lambda * sum_j beta_j^2

The closed-form solution is: beta_ridge = (X'X + lambda*I)^{-1} X'y

Adding lambda*I to X'X makes the matrix invertible even when predictors are perfectly collinear.

Worked Example:

Analyst Haruki at Pinebrook Capital regresses quarterly earnings surprises on three predictors: analyst sentiment (X1), momentum (X2), and a sentiment-momentum composite (X3). Because X3 is nearly a linear combination of X1 and X2, OLS produces:

| Predictor | OLS Coefficient | Std Error |
|---|---|---|
| X1 | 12.4 | 8.7 |
| X2 | -9.8 | 7.2 |
| X3 | 6.1 | 11.3 |

The coefficients are large in absolute value with enormous standard errors -- classic multicollinearity.

Applying ridge with lambda = 2.5:

| Predictor | Ridge Coefficient | Effective Std Error |
|---|---|---|
| X1 | 3.8 | 1.9 |
| X2 | -2.1 | 1.6 |
| X3 | 1.4 | 2.0 |

All coefficients shrink substantially, standard errors drop, and out-of-sample predictions become much more stable.

Choosing Lambda:

The penalty parameter lambda controls the bias-variance tradeoff:
- lambda = 0: pure OLS (no shrinkage, possibly high variance)
- lambda approaching infinity: all coefficients shrink toward zero (high bias)
- Optimal lambda: typically chosen by cross-validation, selecting the value that minimizes out-of-sample prediction error

Geometric Interpretation:

The L2 penalty constrains coefficients to lie within a hypersphere (a circle in 2D) centered at the origin. OLS finds the unconstrained minimum; ridge finds the point where the elliptical contours of the residual sum of squares first touch the constraint boundary.
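The closed-form solution and the shrinkage effect can be sketched in a few lines of numpy. The data below is hypothetical, constructed so that X3 is nearly a linear combination of X1 and X2, mirroring the worked example; the coefficient values and lambda = 2.5 are illustrative, not a reproduction of Haruki's regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: x3 is nearly a linear combination of x1 and x2,
# creating the multicollinearity the text describes.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # lam = 0 reduces to OLS
beta_ridge = ridge(X, y, 2.5)

# Ridge shrinks the coefficient vector toward zero relative to OLS.
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```

For any lambda > 0 the ridge estimate has strictly smaller norm than the OLS estimate, because each singular-value direction of X is shrunk by a factor of d^2 / (d^2 + lambda).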
Because the constraint region is round, with no corners on the coordinate axes, ridge shrinks all coefficients toward zero but never sets any exactly to zero.

Key Properties for CFA:
- Ridge always retains all predictors (no variable selection)
- Works best when all predictors contribute some information but are correlated
- Requires standardizing predictors first so the penalty treats them equally
- Does not produce sparse models (contrast with LASSO, whose L1 penalty can zero out coefficients)
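The standardization step and cross-validated choice of lambda can be sketched together. This is a minimal illustration on simulated data with a hand-picked lambda grid; in practice a finer (often logarithmic) grid and a library routine such as scikit-learn's RidgeCV would be used.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Standardize predictors so the penalty treats them equally.
X = (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean squared prediction error across k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = y[test_idx] - X[test_idx] @ beta
        errs.append(np.mean(resid ** 2))
    return float(np.mean(errs))

# Pick the lambda with the lowest out-of-sample error.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(best)
```

Note that lambda = 0 is included in the grid, so cross-validation itself decides whether any shrinkage helps on this data.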