AcadiFi
biology_undergrad · 2026-04-09
CFA · Level II · Quantitative Methods

Why is stepwise regression considered dangerous, and what are the main pitfalls of automated variable selection?

In my CFA quant review, the textbook strongly warns against stepwise regression. But it seems like a convenient way to find the best predictors automatically. If I'm running forward selection and adding variables one at a time based on p-values, what exactly goes wrong? Why do practitioners call it data mining?

131 upvotes
AcadiFi Team · Verified Expert
AcadiFi Certified Professional

Stepwise regression automates variable selection by iteratively adding or removing predictors based on statistical significance thresholds. While convenient, it introduces multiple serious problems that can produce misleading models.

How Stepwise Works:

  • Forward selection: Start with no variables, add the most significant one at each step
  • Backward elimination: Start with all variables, remove the least significant one at each step
  • Bidirectional: Combine both, adding and removing at each step
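
To make the forward-selection loop concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names (`forward_select`, `residualize`) and the simulated data are hypothetical, the entry threshold is alpha = 0.05, and the p-value uses a normal approximation to the t-test to avoid external libraries.

```python
import random
from statistics import NormalDist


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def residualize(v, u):
    """Return v minus its projection onto u."""
    coef = dot(v, u) / dot(u, u)
    return [vi - coef * ui for vi, ui in zip(v, u)]


def forward_select(X, y, alpha=0.05):
    """Forward selection by p-value. X maps names to columns; y is the target."""
    n = len(y)
    y_res = center(y)  # centering takes care of the intercept
    cands = {name: center(col) for name, col in X.items()}
    selected = []
    while cands:
        rss_old = dot(y_res, y_res)
        best_name, best_p = None, 1.0
        for name, col in cands.items():
            cc = dot(col, col)
            if cc < 1e-12:  # candidate is (nearly) collinear with the model
                continue
            drop = dot(col, y_res) ** 2 / cc  # RSS reduction from adding it
            rss_new = rss_old - drop
            df = n - (len(selected) + 2)  # intercept + selected + candidate
            if df <= 0 or rss_new <= 0:
                continue
            t_stat = (drop / (rss_new / df)) ** 0.5
            # Assumption: normal approximation to the two-sided t-test
            p = 2 * (1 - NormalDist().cdf(t_stat))
            if p < best_p:
                best_name, best_p = name, p
        if best_name is None or best_p >= alpha:
            break  # nothing left that clears the threshold
        chosen = cands.pop(best_name)
        selected.append(best_name)
        # Partial out the chosen variable from y and the remaining candidates
        y_res = residualize(y_res, chosen)
        cands = {k: residualize(c, chosen) for k, c in cands.items()}
    return selected


# Hypothetical data: only x0 truly drives y; x1..x9 are pure noise.
random.seed(0)
n = 120
X = {f"x{j}": [random.gauss(0, 1) for _ in range(n)] for j in range(10)}
y = [2.0 * x0 + random.gauss(0, 0.5) for x0 in X["x0"]]
selected = forward_select(X, y)
print(selected)
```

Notice that the loop never "pays" for the number of candidates it looked at before picking one. Every step reports the p-value of the winner as if it had been the only variable tested.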

The Core Problems:

[Diagram: the core problems of stepwise regression]

Worked Example:

Researcher Simone at Hartwell Economics has 120 monthly observations of stock returns and 40 candidate macro predictors. She runs forward stepwise selection.

With 40 candidates each tested at alpha = 0.05, and assuming the tests are independent and none of the candidates has any true predictive power, the probability that at least one looks significant purely by chance is:

P(at least one false positive) = 1 - (1 - 0.05)^40 = 1 - 0.95^40 = 1 - 0.129 = 87.1%
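
The same arithmetic as a short Python check, assuming independent tests with no true effects. The function name is made up for illustration.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """P(at least one of n_tests independent null tests is 'significant')."""
    return 1.0 - (1.0 - alpha) ** n_tests


p_any = prob_at_least_one_false_positive(0.05, 40)
print(f"{p_any:.1%}")  # 87.1%

# How quickly the risk grows with the number of candidates searched
for k in (1, 5, 10, 20, 40):
    print(k, round(prob_at_least_one_false_positive(0.05, k), 3))
```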

Stepwise selects 6 variables with an in-sample R-squared of 0.34. But when Simone tests on 60 months of holdout data, R-squared drops to 0.04, leaving almost no explanatory power. The model was fitting noise.
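
A small simulation shows the same in-sample versus holdout gap. This is a deliberately simplified sketch, not a reproduction of Simone's numbers: it assumes every one of the 40 candidates is pure noise, picks only the single best predictor rather than running full stepwise, and uses hypothetical helper names. The sample sizes (120 training months, 60 holdout months) are taken from the example.

```python
import random


def fit_line(x, y):
    """OLS fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(x, y, a, b):
    """1 - SSE/SST for predictions a + b*x; can be negative out of sample."""
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst


def noise(n):
    return [random.gauss(0, 1) for _ in range(n)]


def one_trial(n_train=120, n_test=60, n_candidates=40):
    """Pick the best of n_candidates noise predictors in sample, then test it."""
    y_train, y_test = noise(n_train), noise(n_test)
    best = None
    for _ in range(n_candidates):
        x_train, x_test = noise(n_train), noise(n_test)
        a, b = fit_line(x_train, y_train)
        r2_in = r_squared(x_train, y_train, a, b)
        if best is None or r2_in > best[0]:
            best = (r2_in, r_squared(x_test, y_test, a, b))
    return best  # (in-sample R^2, holdout R^2) of the selected predictor


random.seed(42)
trials = [one_trial() for _ in range(100)]
avg_in = sum(t[0] for t in trials) / len(trials)
avg_out = sum(t[1] for t in trials) / len(trials)
print(f"in-sample R^2: {avg_in:.3f}, holdout R^2: {avg_out:.3f}")
```

Even though nothing here has any real relationship with the target, the selected predictor always shows a positive in-sample R-squared, because it was chosen for exactly that. The holdout figure is the honest one.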

Specific Dangers:

  1. P-value distortion: After searching across many models, reported p-values no longer reflect true significance levels. A variable showing p = 0.02 after stepwise selection might have a corrected p-value above 0.10.

  2. Coefficient bias: Variables that survive selection are systematically those with larger sample estimates (by luck or noise). Their coefficients are biased upward in absolute value.

  3. Instability: Remove or add a few data points, and the selected model can change dramatically. This makes the approach unreliable for inference.
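
To get a feel for the p-value distortion in point 1, here is a rough sketch using two standard multiplicity corrections, Bonferroni and Šidák. These are swapped in for illustration only: they are simple, conservative adjustments, not a full post-selection inference procedure. The count of 40 tests is an assumption borrowed from the worked example.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


def sidak(p_value, n_tests):
    """Sidak-adjusted p-value, which assumes independent tests."""
    return 1.0 - (1.0 - p_value) ** n_tests


p_reported = 0.02   # p-value printed by the stepwise output
n_candidates = 40   # assumption: number of variables that were searched

print(f"Bonferroni: {bonferroni(p_reported, n_candidates):.3f}")
print(f"Sidak:      {sidak(p_reported, n_candidates):.3f}")
```

Under either correction, a reported p = 0.02 ends up well above 0.10 once the search over 40 candidates is accounted for.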

Better Alternatives:

  • Use information criteria (AIC, BIC), which penalize model complexity
  • Apply regularization (ridge, LASSO), which shrinks coefficients toward zero; LASSO can also set some of them exactly to zero
  • Use cross-validation to honestly evaluate out-of-sample performance
  • Let economic theory guide variable selection rather than the data alone
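
As a quick illustration of how an information criterion penalizes complexity, the sketch below compares two models using the usual Gaussian-regression forms of AIC and BIC (up to an additive constant). The RSS values are hypothetical, and k is assumed to count estimated coefficients including the intercept.

```python
import math


def aic(rss, n, k):
    """AIC for a Gaussian linear model, up to a constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """BIC for a Gaussian linear model, up to a constant: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


n = 120                           # monthly observations
small = {"rss": 100.0, "k": 2}    # hypothetical: intercept + 1 predictor
large = {"rss": 99.0, "k": 3}     # hypothetical: one more predictor, tiny RSS gain

for label, m in (("small", small), ("large", large)):
    print(label, round(aic(m["rss"], n, m["k"]), 2), round(bic(m["rss"], n, m["k"]), 2))
```

Lower is better for both criteria. Here the extra predictor reduces RSS by only 1%, which is not enough to offset the penalty, so both AIC and BIC prefer the smaller model. BIC's penalty of ln(n) per parameter is stricter than AIC's 2 whenever n exceeds about 7.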

#stepwise-regression #data-mining #overfitting #variable-selection #multiple-testing