AcadiFi
NLP_Quant_Ludmila · 2026-03-27
CFA Level II · Quantitative Methods · Machine Learning

What is a Transformer model and why did it replace RNNs in most NLP tasks?

Everyone talks about Transformers now. How does self-attention work, and what makes it so powerful?

141 upvotes
AcadiFi Team · Verified Expert · AcadiFi Certified Professional
A Transformer replaces recurrence with self-attention. Each token is projected into query (Q), key (K), and value (V) vectors; attention weights are the softmax of the scaled dot products QKᵀ/√d_k, and each token's output is the weights-weighted sum of the value vectors. Because no step depends on a previous time step's hidden state, the whole sequence is processed in parallel, and attention gives a direct one-hop path between any two positions, so long-range dependencies are captured without the vanishing-gradient problems that plague RNNs.
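The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full Transformer: the projection matrices `Wq`, `Wk`, `Wv` are random stand-ins for weights that would be learned in practice, and masking, multi-head splitting, and positional encoding are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))          # toy token embeddings
# Hypothetical projections; learned parameters in a real model
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out, w = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8): one output vector per token
```

Note that every row of `scores` is computed at once with a single matrix multiply — there is no loop over time steps, which is exactly why Transformers parallelize across the sequence while RNNs cannot.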


#transformer #attention #self-attention