AcadiFi
OptimAlgo_Derek · 2026-04-09
CFA · Level II · Quantitative Methods

How does gradient descent work in training financial models, and what are the key variants I should know for the CFA exam?

I keep encountering gradient descent in my CFA quant studies. I understand it's an optimization algorithm that minimizes a loss function, but what's the difference between batch, stochastic, and mini-batch gradient descent? And why do we need learning rate schedules? My models sometimes diverge or get stuck in local minima.

76 upvotes
Verified Expert
AcadiFi Certified Professional

Gradient descent iteratively adjusts model parameters by moving in the direction of steepest descent of the loss function. The gradient points uphill, so you step the opposite way; the learning rate controls the step size.

Update Rule:

theta_new = theta_old - eta x gradient(Loss)

where eta is the learning rate and gradient(Loss) is the vector of partial derivatives of the loss function with respect to each parameter.

Three Variants:

| Variant | Data Per Update | Speed | Stability |
|---|---|---|---|
| Batch GD | Entire dataset | Slow per step | Very stable |
| Stochastic GD (SGD) | Single observation | Fast per step | Noisy, can escape local minima |
| Mini-Batch GD | Subset (e.g., 32-256) | Balanced | Good balance |

Worked Example:

Atlantic Ridge Capital trains a neural network to predict credit default probabilities using 50,000 loan records.

- Batch GD computes the gradient over all 50,000 records per update. Each step is accurate but takes 12 seconds; after 1,000 steps (3.3 hours), loss reaches 0.142.
- SGD updates after each record. Each step takes 0.3 ms, but the loss zigzags wildly: after scanning all records twice (100,000 updates, ~30 seconds), loss fluctuates between 0.135 and 0.180.
- Mini-batch GD (size 128) processes 128 records per update at 1.2 ms per step. With roughly 390 updates per epoch (50,000 / 128) over 10 epochs, that is about 3,900 updates (~4.7 seconds total), and loss converges smoothly to 0.128.

Learning Rate Considerations:

- Too large: parameters overshoot the minimum and diverge (loss increases)
- Too small: convergence is painfully slow and may stall in a shallow local minimum
- Adaptive methods (Adam, RMSProp) automatically adjust per-parameter learning rates based on gradient history

Financial Pitfalls:

- Non-stationary financial data means the optimal parameters shift over time; online learning with SGD can adapt continuously
- Fat-tailed return distributions create occasional extreme gradients that destabilize training; gradient clipping helps
- Feature scaling is essential: if one feature ranges 0-1 and another 0-10,000, the large-scale feature dominates the gradients

For hands-on optimization practice, check our CFA Quantitative Methods question bank.
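The update rule and all three variants can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer: the function name, synthetic data, and hyperparameters are my own, and the loss is plain MSE rather than a credit-default objective.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, eta=0.1, epochs=50, seed=0):
    """Fit y = X @ theta by gradient descent on MSE loss.
    batch_size = len(X) gives batch GD; batch_size = 1 gives SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                   # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            residual = X[b] @ theta - y[b]
            grad = 2.0 * X[b].T @ residual / len(b)  # gradient of mean squared error
            theta = theta - eta * grad               # theta_new = theta_old - eta * grad
    return theta

# Synthetic noiseless data with true coefficients [3, -2],
# so every variant should recover roughly the same answer.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -2.0])

for bs, name in [(500, "batch GD"), (1, "SGD"), (64, "mini-batch")]:
    print(name, np.round(minibatch_gd(X, y, batch_size=bs), 2))
```

On this noiseless problem all three variants land near [3, -2]; with real, noisy data the SGD estimate would keep fluctuating around the optimum, which is exactly the zigzag described above.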
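The learning-rate trade-off is easiest to see on a toy 1-D quadratic loss, where the behavior can be worked out exactly: each step multiplies the distance to the minimum by |1 - 2·eta|, so eta > 1 guarantees divergence. The quadratic and the step sizes below are chosen purely for illustration.

```python
# Toy loss L(theta) = (theta - 5)^2, so gradient = 2 * (theta - 5)
# and each step scales the error by |1 - 2 * eta|.
def run(eta_fn, steps=100):
    theta = 0.0
    for t in range(steps):
        grad = 2.0 * (theta - 5.0)
        theta -= eta_fn(t) * grad
    return theta

print(run(lambda t: 0.4))                   # converges to 5: contraction factor 0.2
print(run(lambda t: 1.1))                   # diverges: error grows by 1.2 per step
print(run(lambda t: 0.8 / (1 + 0.1 * t)))  # decaying schedule: also converges to 5
```

The decaying schedule in the last line is the simplest form of a learning rate schedule: start with large steps for fast progress, then shrink them so the iterate settles instead of bouncing around the minimum.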
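The gradient clipping mentioned under the fat-tail pitfall is commonly implemented as global-norm clipping: if the gradient's L2 norm exceeds a cap, rescale it so the direction is kept but the magnitude is bounded. A minimal sketch (the helper name and numbers are illustrative):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])      # extreme gradient, e.g. from a fat-tailed return
clipped = clip_gradient(g, max_norm=1.0)
print(clipped)                   # same direction as g, norm capped at 1
```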
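Feature scaling is usually handled by standardization: subtract each feature's mean and divide by its standard deviation so every feature contributes gradients on a comparable scale. A minimal sketch with made-up feature ranges (in practice, fit the mean and standard deviation on training data only and reuse them at prediction time):

```python
import numpy as np

# Two features on very different scales, e.g. a ratio in [0, 1]
# and a loan amount in the thousands (made-up values).
X = np.array([[0.2, 4000.0],
              [0.8, 9500.0],
              [0.5, 1200.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))  # each column now has mean 0, std 1
```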


Master Level II with our CFA Course

107 lessons · 200+ hours · Expert instruction

#gradient-descent #sgd #mini-batch #learning-rate #optimization #neural-network