What's the difference between supervised and unsupervised learning, and how are they used in finance?
CFA Level II now covers machine learning basics. I get that supervised learning uses labeled data and unsupervised doesn't, but I'm unclear on practical finance applications. When would a portfolio manager or analyst use each type?
Machine learning (ML) in finance is a growing topic on the CFA exam. The fundamental distinction between supervised and unsupervised learning is whether you have a target variable (label) to predict.
Supervised Learning:
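You have input features (X) and a known output (Y). The algorithm learns the relationship X -> Y from historical data, then predicts Y for new data. When Y is a category (default vs. no default) the task is classification; when Y is a number (next-month return, quarterly EPS) it is regression.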
Finance Applications:
- Credit scoring: Predict default probability (Y = default/no default) from borrower characteristics (income, debt ratio, credit history)
- Stock return prediction: Predict next-month return from factors (value, momentum, quality)
- Fraud detection: Classify transactions as fraudulent or legitimate
- Earnings forecasting: Predict quarterly EPS from fundamental and market data
Common Algorithms:
- Linear/logistic regression (simplest, most interpretable)
- Decision trees and random forests
- Support vector machines
- Neural networks (most flexible, least interpretable)
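To make the credit-scoring case concrete, here is a minimal sketch of logistic regression fitted by gradient descent. Everything in it is an assumption for illustration: the borrowers are synthetic, the two features (`debt_ratio`, `income`) and the coefficients used to generate the labels are made up, and a real project would normally use a library implementation rather than this hand-rolled loop.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the feature weights
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit weights by batch gradient descent on the average log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Synthetic (made-up) borrowers: [debt_ratio, income in $100k]
random.seed(0)
X, y = [], []
for _ in range(200):
    debt_ratio = random.uniform(0.0, 1.0)
    income = random.uniform(0.2, 2.0)
    # Hypothetical "true" relationship used only to generate labels
    true_logit = 4.0 * debt_ratio - 2.0 * income
    X.append([debt_ratio, income])
    y.append(1 if random.random() < sigmoid(true_logit) else 0)

w = fit_logistic(X, y)

# Score two new, unseen borrowers
risky = predict_proba(w, [0.9, 0.3])   # high debt ratio, low income
safe = predict_proba(w, [0.1, 1.8])    # low debt ratio, high income
print(f"P(default) risky={risky:.2f}, safe={safe:.2f}")
```

The labels (Y = default / no default) are what make this supervised: the model is told the outcome for each historical borrower and learns weights that map the features to it.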
Unsupervised Learning:
You only have input features — no target variable. The algorithm finds hidden patterns, groupings, or structure in the data.
Finance Applications:
- Portfolio clustering: Group stocks by return behavior (not just sector), so that a portfolio is diversified across groups that actually move differently rather than across sector labels alone
- Regime detection: Identify market regimes (bull/bear/sideways) from price and volatility patterns
- Anomaly detection: Flag unusual trading patterns without pre-defining what 'unusual' means
- Risk factor discovery: Find hidden factors driving asset returns beyond traditional Fama-French factors
Common Algorithms:
- K-means clustering
- Principal Component Analysis (PCA)
- Hierarchical clustering
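As an illustration of portfolio clustering, the sketch below runs a bare-bones K-means on six hypothetical stocks described by two features (annualized volatility and market beta). The numbers are invented, and the features are left unscaled for brevity; in practice you would usually standardize them first so that no single feature dominates the distance.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical stocks: [annualized volatility, market beta]
stocks = [
    [0.12, 0.6], [0.14, 0.7], [0.11, 0.5],   # low-volatility, low-beta names
    [0.35, 1.5], [0.40, 1.6], [0.38, 1.4],   # high-volatility, high-beta names
]

labels, centroids = kmeans(stocks, k=2)
print(labels)
```

No label was supplied anywhere: the algorithm only sees the features and recovers the two groups on its own. Which cluster gets index 0 or 1 is arbitrary, so it is the grouping that matters, not the cluster number.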
Hybrid Approach — Semi-Supervised:
In practice, financial data often has a small amount of labeled data and a large amount of unlabeled data. Semi-supervised methods use both: train on labeled examples and let the unlabeled data improve the model's understanding of the data distribution.
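One simple semi-supervised technique is self-training: fit a model on the labeled examples, assign pseudo-labels to the unlabeled points it is confident about, and refit. The sketch below uses a nearest-centroid classifier as the base model, which is a choice made here for brevity rather than anything the curriculum prescribes. The transaction features, the `margin` confidence rule, and all data values are hypothetical.

```python
def centroid(rows):
    # Column-wise mean of a list of feature vectors
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroids(labeled):
    c0 = centroid([x for x, y in labeled if y == 0])
    c1 = centroid([x for x, y in labeled if y == 1])
    return c0, c1

def self_train(labeled, unlabeled, margin=0.5, rounds=5):
    """Self-training with a nearest-centroid base classifier.

    A point is pseudo-labeled only when its nearer centroid is at most
    `margin` times as far away (in squared distance) as the other one.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        c0, c1 = fit_centroids(labeled)
        remaining = []
        for x in pool:
            d0, d1 = dist2(x, c0), dist2(x, c1)
            if min(d0, d1) <= margin * max(d0, d1):
                labeled.append((x, 0 if d0 < d1 else 1))
            else:
                remaining.append(x)  # too ambiguous to pseudo-label
        if len(remaining) == len(pool):
            break  # nothing new was labeled this round
        pool = remaining
    c0, c1 = fit_centroids(labeled)
    return c0, c1, labeled, pool

def classify(x, c0, c1):
    return 0 if dist2(x, c0) < dist2(x, c1) else 1

# Hypothetical transactions: [amount z-score, frequency z-score]
labeled = [([0.1, 0.2], 0), ([0.3, 0.1], 0),     # confirmed legitimate
           ([2.5, 2.8], 1), ([3.0, 2.4], 1)]     # confirmed fraud
unlabeled = [[0.2, 0.4], [0.5, 0.3], [2.7, 3.1],
             [2.2, 2.6], [1.4, 1.5], [0.0, 0.1]]

c0, c1, labeled_out, pool = self_train(labeled, unlabeled)
print(len(labeled_out), "labeled after self-training;", len(pool), "left unlabeled")
```

The point near the midpoint between the two groups stays unlabeled because neither centroid is clearly closer. Leaving ambiguous cases out is deliberate: a wrong pseudo-label would be fed back into the model as if it were true.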
Key Tradeoffs for Analysts:
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Data requirement | Labeled data (expensive) | Unlabeled data (abundant) |
| Evaluation | Clear metrics (accuracy, RMSE) | Subjective (are clusters meaningful?) |
| Interpretability | Varies by algorithm | Often harder to interpret |
| Overfitting risk | High (model can fit noise in the training labels) | Still present but harder to detect (no labels to fit, yet too many clusters or components can capture noise) |
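The two supervised metrics named in the table can be computed in a few lines. This is a minimal sketch with made-up predictions: accuracy applies to classification (e.g., default / no default) and RMSE to regression (e.g., forecast returns). To guard against the overfitting risk noted above, both should be measured on data the model was not trained on.

```python
import math

def accuracy(y_true, y_pred):
    # Share of predictions that match the actual class
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error: typical size of the forecast miss
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

# Hypothetical out-of-sample results
default_actual = [1, 0, 0, 1]
default_pred = [1, 0, 1, 1]
acc = accuracy(default_actual, default_pred)   # 3 of 4 correct

return_actual = [0.02, -0.01]
return_pred = [0.01, 0.01]
err = rmse(return_actual, return_pred)

print(f"accuracy={acc:.2f}, RMSE={err:.4f}")
```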