Why does the LASSO's L1 penalty produce sparse models by setting some coefficients exactly to zero?
I learned that ridge regression shrinks coefficients but keeps them all nonzero. LASSO uses the absolute value penalty instead and can eliminate variables entirely. But why does switching from squared to absolute value create this sparsity property? I'm struggling with the geometric intuition behind this difference.
The LASSO (Least Absolute Shrinkage and Selection Operator) uses an L1 penalty -- the sum of the absolute values of the coefficients -- rather than ridge's squared L2 penalty. This seemingly small change has a profound geometric consequence: the L1 constraint region has corners on the axes, and the constrained optimum often lands on one of those corners, setting some coefficients exactly to zero.

Objective Function:

LASSO minimizes: sum_i (y_i - X_i * beta)^2 + lambda * sum_j |beta_j|

Geometric Insight:

```mermaid
graph LR
    subgraph "Ridge (L2)"
        A["Circular constraint<br/>No corners on axes<br/>All coefficients nonzero"]
    end
    subgraph "LASSO (L1)"
        B["Diamond constraint<br/>Corners on axes<br/>Some coefficients = 0"]
    end
    A --> C["Shrinks but keeps all variables"]
    B --> D["Shrinks AND selects variables"]
```

Imagine the OLS contours (ellipses of equal RSS) expanding outward from the unconstrained minimum. For ridge, they hit the circular boundary, which is smooth everywhere -- the contact point almost never falls exactly on an axis. For LASSO, the diamond has sharp corners on the axes, and the expanding ellipse frequently touches a corner first, zeroing out that dimension's coefficient.

Worked Example:

Portfolio manager Svetlana at Ashworth Investments uses 15 macroeconomic variables to predict the equity risk premium. Most are noisy or redundant.

OLS with 15 variables: R-squared = 0.28, adjusted R-squared = 0.11 (many insignificant coefficients).

LASSO with lambda chosen by 10-fold cross-validation:

| Variable | OLS Coefficient | LASSO Coefficient |
|---|---|---|
| Term Spread | 0.42 | 0.31 |
| Credit Spread | 0.38 | 0.25 |
| Earnings Yield | 0.29 | 0.18 |
| Inflation Surprise | -0.15 | -0.08 |
| Industrial Production | 0.22 | 0.00 |
| Consumer Confidence | 0.08 | 0.00 |
| Trade Balance | -0.04 | 0.00 |
| ... (8 more) | various | 0.00 |

LASSO keeps 4 variables and zeros out 11. The sparse model has a cross-validated R-squared of 0.16 -- lower than OLS in-sample but dramatically better out-of-sample.

LASSO vs. Ridge Summary:
- LASSO produces interpretable sparse models (built-in variable selection)
- Ridge is better when all variables contribute signal and are correlated
- LASSO struggles when predictors are highly correlated (it tends to pick one arbitrarily)
- LASSO's solution path is piecewise linear, making it computationally efficient

CFA Exam Tips:
- LASSO = L1 = sparsity = variable selection
- Ridge = L2 = shrinkage without selection
- Both require standardized predictors and cross-validation to choose lambda
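The corner-touching picture has an algebraic counterpart. In the special case of a single standardized predictor (or an orthonormal design -- an assumption beyond the source's general setup), the LASSO solution has a closed form: the OLS estimate soft-thresholded by lambda. A minimal sketch:

```python
# Soft-thresholding: closed-form LASSO solution for one standardized
# predictor (orthonormal-design assumption). Any OLS coefficient whose
# magnitude is below lambda is set EXACTLY to zero -- the algebraic
# source of sparsity. Ridge, by contrast, only rescales, so a nonzero
# OLS estimate stays nonzero.
import numpy as np

def soft_threshold(beta_ols: float, lam: float) -> float:
    """Return sign(beta_ols) * max(|beta_ols| - lam, 0)."""
    return float(np.sign(beta_ols) * max(abs(beta_ols) - lam, 0.0))

print(soft_threshold(0.42, 0.10))   # strong signal survives, shrunk toward zero
print(soft_threshold(0.04, 0.10))   # weak coefficient is zeroed exactly
print(soft_threshold(-0.15, 0.10))  # sign is preserved for survivors
```

This makes the contrast with ridge precise: ridge's shrinkage factor multiplies every coefficient by a number strictly between 0 and 1, so nothing ever reaches zero; soft-thresholding subtracts a fixed amount and clips at zero.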
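The keep-4-drop-11 pattern from the worked example can be reproduced on simulated data. This is a minimal sketch, not Svetlana's actual dataset: the sample size, seed, and penalty strengths (`alpha=0.1` for LASSO, `alpha=10` for ridge) are illustrative assumptions.

```python
# Contrast LASSO's exact zeros with ridge's shrink-but-keep behaviour
# using scikit-learn, on simulated data where 4 of 15 standardized
# predictors carry signal and 11 are pure noise.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 15
X = rng.standard_normal((n, p))          # features already ~standardized
beta = np.zeros(p)
beta[:4] = [0.40, 0.35, 0.30, -0.15]     # only the first 4 carry signal
y = X @ beta + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)       # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)      # L2 penalty

print("LASSO coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
print("Ridge coefficients set exactly to zero:", int(np.sum(ridge.coef_ == 0)))
```

With the L1 penalty, most of the 11 noise coefficients come out exactly zero; the ridge fit shrinks them toward zero but leaves every one nonzero.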