AcadiFi
OptimAlgo_Derek · 2026-04-09
CFA · Level II · Quantitative Methods

How does gradient descent work in training financial models, and what are the key variants I should know for the CFA exam?

I keep encountering gradient descent in my CFA quant studies. I understand it's an optimization algorithm that minimizes a loss function, but what's the difference between batch, stochastic, and mini-batch gradient descent? And why do we need learning rate schedules? My models sometimes diverge or get stuck in local minima.

76 upvotes
Verified Expert
AcadiFi Certified Professional

Gradient descent iteratively adjusts model parameters by moving in the direction of steepest descent of the loss function. The gradient points uphill, so you step the opposite way; the learning rate controls the step size.

Update Rule:

theta_new = theta_old - eta x gradient(Loss)

where eta is the learning rate and gradient(Loss) is the vector of partial derivatives of the loss function with respect to each parameter.

Three Variants:

| Variant | Data Per Update | Speed | Stability |
|---|---|---|---|
| Batch GD | Entire dataset | Slow per step | Very stable |
| Stochastic GD (SGD) | Single observation | Fast per step | Noisy, can escape local minima |
| Mini-Batch GD | Subset (e.g., 32-256) | Balanced | Good balance |

Worked Example:

Atlantic Ridge Capital trains a neural network to predict credit default probabilities using 50,000 loan records.

- Batch GD computes the gradient over all 50,000 records per update. Each step is accurate but takes 12 seconds; after 1,000 steps (3.3 hours), loss reaches 0.142.
- SGD updates after each record. Each step takes 0.3 ms, but the loss zigzags wildly: after scanning all records twice (100,000 updates, ~30 seconds), loss fluctuates between 0.135 and 0.180.
- Mini-batch GD (size 128) processes 128 records per update at 1.2 ms per step. With roughly 390 updates per epoch (50,000 / 128) over 10 epochs, that is about 3,900 updates (~4.7 seconds total), and loss converges smoothly to 0.128.

Learning Rate Considerations:

- Too large: parameters overshoot the minimum and diverge (loss increases)
- Too small: convergence is painfully slow and may stall in a shallow local minimum
- Adaptive methods (Adam, RMSProp) automatically adjust per-parameter learning rates based on gradient history

Financial Pitfalls:

- Non-stationary financial data means the optimal parameters shift over time; online learning with SGD can adapt continuously
- Fat-tailed return distributions create occasional extreme gradients that destabilize training; gradient clipping helps
- Feature scaling is essential: if one feature ranges 0-1 and another 0-10,000, the large-scale feature dominates the gradients

For hands-on optimization practice, check our CFA Quantitative Methods question bank.
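The update rule and all three variants can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer: the function name, synthetic data, and hyperparameters are my own, and the loss is plain MSE rather than a credit-default objective.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, eta=0.1, epochs=50, seed=0):
    """Fit y = X @ theta by gradient descent on MSE loss.
    batch_size = len(X) gives batch GD; batch_size = 1 gives SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                   # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            residual = X[b] @ theta - y[b]
            grad = 2.0 * X[b].T @ residual / len(b)  # gradient of mean squared error
            theta = theta - eta * grad               # theta_new = theta_old - eta * grad
    return theta

# Synthetic noiseless data with true coefficients [3, -2],
# so every variant should recover roughly the same answer.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -2.0])

for bs, name in [(500, "batch GD"), (1, "SGD"), (64, "mini-batch")]:
    print(name, np.round(minibatch_gd(X, y, batch_size=bs), 2))
```

On this noiseless problem all three variants land near [3, -2]; with real, noisy data the SGD estimate would keep fluctuating around the optimum, which is exactly the zigzag described above.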
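The learning-rate trade-off is easiest to see on a toy 1-D quadratic loss, where the behavior can be worked out exactly: each step multiplies the distance to the minimum by |1 - 2·eta|, so eta > 1 guarantees divergence. The quadratic and the step sizes below are chosen purely for illustration.

```python
# Toy loss L(theta) = (theta - 5)^2, so gradient = 2 * (theta - 5)
# and each step scales the error by |1 - 2 * eta|.
def run(eta_fn, steps=100):
    theta = 0.0
    for t in range(steps):
        grad = 2.0 * (theta - 5.0)
        theta -= eta_fn(t) * grad
    return theta

print(run(lambda t: 0.4))                   # converges to 5: contraction factor 0.2
print(run(lambda t: 1.1))                   # diverges: error grows by 1.2 per step
print(run(lambda t: 0.8 / (1 + 0.1 * t)))  # decaying schedule: also converges to 5
```

The decaying schedule in the last line is the simplest form of a learning rate schedule: start with large steps for fast progress, then shrink them so the iterate settles instead of bouncing around the minimum.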
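The gradient clipping mentioned under the fat-tail pitfall is commonly implemented as global-norm clipping: if the gradient's L2 norm exceeds a cap, rescale it so the direction is kept but the magnitude is bounded. A minimal sketch (the helper name and numbers are illustrative):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])      # extreme gradient, e.g. from a fat-tailed return
clipped = clip_gradient(g, max_norm=1.0)
print(clipped)                   # same direction as g, norm capped at 1
```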
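Feature scaling is usually handled by standardization: subtract each feature's mean and divide by its standard deviation so every feature contributes gradients on a comparable scale. A minimal sketch with made-up feature ranges (in practice, fit the mean and standard deviation on training data only and reuse them at prediction time):

```python
import numpy as np

# Two features on very different scales, e.g. a ratio in [0, 1]
# and a loan amount in the thousands (made-up values).
X = np.array([[0.2, 4000.0],
              [0.8, 9500.0],
              [0.5, 1200.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))  # each column now has mean 0, std 1
```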


Master Level II with our CFA Course

107 lessons · 200+ hours · Expert instruction

#gradient-descent #sgd #mini-batch #learning-rate #optimization #neural-network