Why is stepwise regression considered dangerous, and what are the main pitfalls of automated variable selection?
In my CFA quant review, the textbook strongly warns against stepwise regression. But it seems like a convenient way to find the best predictors automatically. If I'm running forward selection and adding variables one at a time based on p-values, what exactly goes wrong? Why do practitioners call it data mining?
Stepwise regression automates variable selection by iteratively adding or removing predictors based on statistical significance thresholds. While convenient, it introduces multiple serious problems that can produce misleading models.
How Stepwise Works:
- Forward selection: Start with no variables, add the most significant one at each step
- Backward elimination: Start with all variables, remove the least significant one at each step
- Bidirectional: Combine both, adding and removing at each step
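The forward-selection loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the name `forward_select`, the dictionary input format, the simulated data, and the use of a normal approximation to the t-distribution for p-values are all choices made here to keep the example dependency-free.

```python
import math
import random
from statistics import NormalDist

def _center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _residualize(v, z):
    """Remove from v its least-squares projection onto z."""
    zz = _dot(z, z)
    if zz == 0.0:
        return list(v)
    b = _dot(v, z) / zz
    return [a - b * c for a, c in zip(v, z)]

def forward_select(y, candidates, alpha=0.05):
    """Add one predictor per step while the best candidate has p < alpha.

    candidates maps a name to a list of values. P-values use a normal
    approximation to the t-distribution (an assumption of this sketch).
    """
    n = len(y)
    y_res = _center(y)  # centering accounts for the intercept
    pool = {name: _center(col) for name, col in candidates.items()}
    selected = []
    while pool:
        df = n - len(selected) - 2  # residual df once one more predictor is added
        if df <= 0:
            break
        best_name, best_p = None, 1.0
        syy = _dot(y_res, y_res)
        for name, x in pool.items():
            sxx = _dot(x, x)
            if sxx < 1e-12 or syy < 1e-12:
                continue
            # Correlation of residuals = partial correlation given selected variables
            r = _dot(x, y_res) / math.sqrt(sxx * syy)
            r = max(-0.999999, min(0.999999, r))
            t = r * math.sqrt(df / (1.0 - r * r))
            p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
            if p < best_p:
                best_name, best_p = name, p
        if best_name is None or best_p >= alpha:
            break  # nothing left that clears the threshold
        z = pool.pop(best_name)
        selected.append(best_name)
        # Orthogonalize y and the remaining candidates against the new variable
        y_res = _residualize(y_res, z)
        pool = {name: _residualize(x, z) for name, x in pool.items()}
    return selected

# Hypothetical data: one real predictor plus five pure-noise columns
rng = random.Random(0)
n = 120
signal = [rng.gauss(0, 1) for _ in range(n)]
data = {"signal": signal}
for j in range(5):
    data[f"noise_{j}"] = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * s + rng.gauss(0, 1) for s in signal]

chosen = forward_select(y, data)
print(chosen)
```

The real predictor is picked first here because its effect is large. Note that the p-value used at each step is the smallest of many, which is exactly the quantity that the search distorts.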
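The Core Problems:
Every problem traces back to one fact: the same data is used both to choose the variables and to test them. The procedure searches many candidate specifications and reports only the winner, so apparent significance may reflect the search rather than a real relationship. That is why practitioners call it data mining.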
Worked Example:
Researcher Simone at Hartwell Economics has 120 monthly observations of stock returns and 40 candidate macro predictors. She runs forward stepwise selection.
Suppose none of the 40 candidates has any true relationship with returns, and treat the 40 tests at alpha = 0.05 as independent. The probability that at least one candidate appears significant by chance alone is:
P(at least one false positive) = 1 - (1 - 0.05)^40 = 1 - 0.95^40 = 1 - 0.129 = 87.1%
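The same arithmetic as a small Python check. The function name `familywise_error_rate` is introduced here for illustration, and the formula assumes the tests are independent and every null hypothesis is true.

```python
def familywise_error_rate(alpha, num_tests):
    """P(at least one false positive) across independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** num_tests

fwer = familywise_error_rate(0.05, 40)
print(f"{fwer:.1%}")  # 87.1%
```

In practice macro predictors are correlated, so the exact figure will differ, but the qualitative point holds: the more candidates searched, the more likely a spurious "discovery".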
Stepwise selects 6 variables with an in-sample R-squared of 0.34. But when Simone tests on 60 months of holdout data, R-squared drops to 0.04 -- almost no explanatory power. The model was fitting noise.
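A rough simulation shows the same collapse on data that is pure noise by construction. This is a simplified sketch, not a reproduction of Simone's model: it keeps only the single best of 40 noise predictors instead of running full stepwise, the sample sizes are borrowed from the example, and all data and function names are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y, intercept, slope):
    """1 - SSE/SST; can be negative on data the line was not fitted to."""
    my = sum(y) / len(y)
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1.0 - sse / sst

def one_trial(rng, n_train=120, n_test=60, n_candidates=40):
    y_train = [rng.gauss(0, 1) for _ in range(n_train)]
    best = None
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n_train)]  # unrelated to y
        a, b = fit_line(x, y_train)
        r2 = r_squared(x, y_train, a, b)
        if best is None or r2 > best[0]:
            best = (r2, a, b)
    r2_in, a, b = best
    # Fresh holdout draws; the predictor is noise, so new values are independent
    x_test = [rng.gauss(0, 1) for _ in range(n_test)]
    y_test = [rng.gauss(0, 1) for _ in range(n_test)]
    return r2_in, r_squared(x_test, y_test, a, b)

rng = random.Random(42)
results = [one_trial(rng) for _ in range(100)]
avg_in = sum(r[0] for r in results) / len(results)
avg_out = sum(r[1] for r in results) / len(results)
print(f"average in-sample R-squared:  {avg_in:.3f}")
print(f"average holdout R-squared:    {avg_out:.3f}")
```

Even with a single selected variable, the in-sample fit looks better than zero while the holdout fit does not. Selecting several variables in sequence compounds the effect.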
Specific Dangers:
- P-value distortion: After searching across many models, reported p-values no longer reflect true significance levels. A variable showing p = 0.02 after stepwise selection might have a corrected p-value above 0.10.
- Coefficient bias: Variables that survive selection are systematically those with larger sample estimates (by luck or noise). Their coefficients are biased upward in absolute value.
- Instability: Remove or add a few data points, and the selected model can change dramatically. This makes the approach unreliable for inference.
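The coefficient-bias point can be illustrated with a small simulation. This is a sketch under assumptions chosen here: a true slope of 0.10, 120 observations per sample, standard normal noise, and an approximate two-sided 5% critical value of 1.98 for roughly 118 degrees of freedom.

```python
import math
import random

def slope_and_t(x, y):
    """OLS slope and its t-statistic for a one-predictor regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    resid_var = (syy - slope * sxy) / (n - 2)
    return slope, slope / math.sqrt(resid_var / sxx)

TRUE_BETA = 0.10  # assumed small but real effect
T_CRIT = 1.98     # approximate 5% two-sided critical value, ~118 df

rng = random.Random(7)
all_slopes, kept = [], []
for _ in range(1000):
    x = [rng.gauss(0, 1) for _ in range(120)]
    y = [TRUE_BETA * a + rng.gauss(0, 1) for a in x]
    b, t = slope_and_t(x, y)
    all_slopes.append(b)
    if abs(t) > T_CRIT:  # mimic "survives selection"
        kept.append(b)

mean_all = sum(all_slopes) / len(all_slopes)
mean_kept_abs = sum(abs(b) for b in kept) / len(kept)
print(f"true slope:                        {TRUE_BETA:.3f}")
print(f"mean estimate, all samples:        {mean_all:.3f}")
print(f"mean |estimate|, significant only: {mean_kept_abs:.3f}")
```

Across all samples the estimate is centered on the true value. Conditioning on significance keeps only the samples where noise happened to push the estimate away from zero, so the surviving estimates overstate the effect.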
Better Alternatives:
- Use information criteria (AIC, BIC), which penalize model complexity
- Apply regularization (ridge, LASSO), which shrinks coefficients toward zero; LASSO can also set some coefficients exactly to zero
- Use cross-validation to honestly evaluate out-of-sample performance
- Let economic theory guide variable selection rather than the data alone
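As one concrete illustration of the first alternative, the sketch below computes AIC and BIC from a residual sum of squares, using the common Gaussian-likelihood form with additive constants dropped. The RSS values and parameter counts are made-up numbers for illustration only; lower values are preferred.

```python
import math

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k = number of estimated coefficients."""
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    """Gaussian BIC up to an additive constant; the penalty grows with log(n)."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical comparison: five extra predictors reduce RSS only slightly
n = 120
small = {"rss": 100.0, "k": 3}
large = {"rss": 97.0, "k": 8}

for label, m in (("small", small), ("large", large)):
    print(label, round(aic(m["rss"], n, m["k"]), 2), round(bic(m["rss"], n, m["k"]), 2))
```

With these numbers both criteria favor the smaller model, because the drop in RSS does not offset the penalty for five additional coefficients. With 120 observations BIC penalizes each coefficient by log(120), roughly 4.8, versus 2 for AIC.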