01 The Transformer Architecture
The Transformer (Vaswani et al., 2017) replaced recurrence with self-attention, processing all positions in parallel. This single architectural change enabled modern LLMs, from BERT to GPT-4.
In the architecture diagram, the left stack is the encoder (reads the input) and the right stack is the decoder (generates the output).
Unlike RNNs, which process tokens one at a time, the Transformer attends to all positions simultaneously. Within a layer, the number of sequential operations drops from O(n) to O(1), so training parallelizes across the whole sequence.
Self-attention connects every token to every other token in a single step. The path length between any two tokens is O(1), so long-range dependencies do not have to survive a long chain of recurrent steps where gradients can vanish.
Base model hyperparameters: d_model = 512, h = 8 heads, d_k = d_v = 64, d_ff = 2048, N = 6 layers.
02 Embeddings & Positional Encoding
Tokens become dense vectors, then positional information is added. Without position encoding, the Transformer can't distinguish "dog bites man" from "man bites dog."
Each token maps to a learned vector of dimension d_model=512 in the paper; the examples here use d_model=4 for visibility. Embeddings are multiplied by √d_model (Section 3.4 of the paper).
Position information uses sine/cosine waves at different frequencies — each dimension gets a unique pattern. Low dimensions oscillate fast (fine position), high dimensions oscillate slowly (coarse position).
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
The final input is the element-wise sum of token embedding + positional encoding. One vector carries both meaning and position.
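The embedding-plus-position sum can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the three-word vocabulary and its embedding values are made up, and d_model=4 is chosen only to keep the numbers readable.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            i = dim // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def embed(token_ids, embedding_table, d_model):
    """Look up token vectors, scale by sqrt(d_model), add positional encoding."""
    pe = positional_encoding(len(token_ids), d_model)
    scale = math.sqrt(d_model)
    return [
        [scale * e + p for e, p in zip(embedding_table[tok], pe[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical vocabulary with made-up embeddings (assumption for illustration).
embedding_table = [
    [0.1, 0.2, 0.3, 0.4],  # id 0: "dog"
    [0.5, 0.1, 0.0, 0.2],  # id 1: "bites"
    [0.3, 0.3, 0.1, 0.0],  # id 2: "man"
]

dog_bites_man = embed([0, 1, 2], embedding_table, d_model=4)
man_bites_dog = embed([2, 1, 0], embedding_table, d_model=4)
```

The vector for "dog" at position 0 differs from the vector for "dog" at position 2, which is what lets later layers tell the two sentences apart.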
03 The Intuition Behind Self-Attention
Before we see the math, let's understand why self-attention works. The core idea: every word should be able to "look at" every other word to understand context.
Consider this sentence: "The animal didn't cross the street because it was too tired."
What does "it" refer to? A human instantly knows it's the animal (not the street). But a model processing words one-by-one (like an RNN) may lose this connection over the intervening tokens. Self-attention solves this by letting every word directly examine every other word.
The Query/Key/Value mechanism is like a library search system:
Query: You walk into a library with a question in mind. Each word creates a query vector representing what information it needs from context.
Key: Each book has a label describing its contents. Each word creates a key vector advertising what information it offers.
Value: Once you find the right book (Q·K match), you read its content. Each word provides a value vector, the actual information to extract.
The process: (1) Compare your query against all keys → get relevance scores. (2) Turn scores into probabilities (softmax). (3) Read a weighted mix of all values. The output is a context-enriched version of each word.
Self-attention transforms isolated word vectors into contextualized representations. The word "bank" means something different in "river bank" vs. "bank account" — self-attention resolves this by mixing in context.
04 Scaled Dot-Product Attention
Now the math. Each token creates a Query, Key, and Value vector. Attention weights come from Q·K similarity, scaled by √d_k, then softmaxed to produce weights over V.
Four steps: (1) Score: compute QKᵀ to measure similarity between each pair of tokens. (2) Scale: divide by √d_k to prevent softmax saturation. (3) Normalize: softmax turns scores into probabilities (rows sum to 1). (4) Aggregate: take the weighted sum of the V vectors.
Tokens are projected to Q, K, and V through learned weight matrices.
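The four steps can be traced on a tiny 3-token, 4-dimensional example. The sketch below uses plain Python lists for readability; the Q, K, and V values are made up and stand in for the result of the learned projections, which are not modeled here.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                 # (1) Score
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # (2) Scale
    weights = [softmax(row) for row in scaled]                       # (3) Normalize
    return matmul(weights, v), weights                               # (4) Aggregate

# Made-up values for 3 tokens with d_k = d_v = 4 (assumption for illustration).
Q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
K = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
V = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each output row is a blend of the rows of V, weighted by how well that token's query matched each key.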
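When d_k is large, dot products grow in magnitude, pushing softmax into regions with near-zero gradients. If the components of q and k are independent with mean 0 and variance 1, their dot product has variance d_k, so dividing by √d_k keeps the variance ≈ 1 regardless of dimension.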
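The variance argument is easy to check empirically. The sketch below draws random query and key vectors with unit-variance components (an assumption, since real projections need not be exactly unit variance) and estimates the variance of their dot product before and after scaling. The trial count and seed are arbitrary.

```python
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate Var(q . k) for q, k with independent N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((d - mean) ** 2 for d in dots) / trials

for d_k in (4, 64):
    raw = dot_product_variance(d_k)
    scaled = raw / d_k  # Var(x / sqrt(d_k)) = Var(x) / d_k
    print(f"d_k={d_k}: raw variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

The raw variance should land close to d_k, and the scaled variance close to 1, for both dimensions.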
05 Multi-Head Attention
Instead of one attention function, the model runs h parallel attention heads, each looking at different aspects of the relationships between tokens.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
// h=8 heads, d_k = d_v = d_model/h = 512/8 = 64
Each head projects the input into its own lower-dimensional subspace. With 8 heads and d_model=512, each head works with d_k=64 dimensions, so the heads can learn different relationship types in parallel at roughly the cost of a single full-width head.
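A minimal sketch of the head-splitting arithmetic, using d_model=4 and h=2 so that d_k=2. The projection matrices are random placeholders (an assumption; in a trained model they are learned), and the helper names are illustrative only.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)  # final output projection W^O

d_model, h = 4, 2
d_k = d_model // h  # each head works in a d_k-dimensional subspace
rng = random.Random(0)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)

x = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
out = multi_head_attention(x, heads, w_o)
```

The output has the same shape as the input (3 tokens by d_model), which is what allows the block to be stacked.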
Analyses of trained models suggest that heads specialize in different linguistic patterns.
06 The Encoder Block
Each encoder layer has two sub-layers: (1) multi-head self-attention and (2) a position-wise feed-forward network. Both use residual connections and layer normalization.
Every sub-layer is wrapped with: LayerNorm(x + SubLayer(x)). The residual path lets gradients flow directly through deep networks. LayerNorm stabilizes training.
// LayerNorm normalizes across the d_model dimension:
LayerNorm(x) = γ · (x − μ) / (σ + ε) + β
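The wrapper LayerNorm(x + SubLayer(x)) can be sketched for a single token vector as follows. The code follows the formula above, with ε added to σ; some libraries instead compute √(σ² + ε), which behaves almost identically for small ε. The ε value and the stand-in sub-layer are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize one token vector across its d_model features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var)
    return [g * (v - mean) / (std + eps) + b for v, g, b in zip(x, gamma, beta)]

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection followed by layer normalization."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)], gamma, beta)

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4  # learned scale, initialized to 1
beta = [0.0] * 4   # learned shift, initialized to 0

# Hypothetical sub-layer: halves its input (stands in for attention or the FFN).
normed = add_and_norm(x, lambda v: [0.5 * a for a in v], gamma, beta)
```

With γ = 1 and β = 0 the result has mean 0 and variance close to 1, whatever the scale of the sub-layer's output.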
The feed-forward network is two linear layers with a ReLU in between, applied to each token independently. The inner dimension d_ff=2048 is 4× the model dimension; think of it as "thinking in a higher-dimensional space."
// W₁: 512 × 2048 (expand) → ReLU → W₂: 2048 × 512 (contract)
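The expand, ReLU, contract pattern corresponds to FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂. A small sketch with d_model=2 and d_ff=8 (keeping the 4× ratio) is shown below; the weights are random placeholders rather than trained values.

```python
import random

def matvec(x, w):
    """Multiply a length-d_in vector by a d_in x d_out matrix."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def feed_forward(x, w1, b1, w2, b2):
    """Apply the position-wise FFN to a single token vector."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(x, w1), b1)]  # expand + ReLU
    return [o + b for o, b in zip(matvec(hidden, w2), b2)]         # contract

d_model, d_ff = 2, 8
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# The same weights are applied to every token, one token at a time.
tokens = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
outputs = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
```

Because each token is processed on its own, this sub-layer mixes information across features but never across positions; only attention does that.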
07 The Decoder & Masking
The decoder adds two things: (1) masked self-attention that prevents looking at future tokens, and (2) cross-attention that reads the encoder's output.
During training, the decoder sees the full target at once (for parallelism). The mask sets future positions to −∞ before softmax, giving them zero weight.
// Mask[i][j] = 0 if j ≤ i (allowed), −∞ if j > i (blocked)
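The mask rule above can be sketched directly. The attention scores here are made-up numbers; the point is only to show that adding −∞ before softmax yields exactly zero weight on future positions.

```python
import math

def causal_mask(n):
    """Mask[i][j] is 0 where j <= i (allowed) and -inf where j > i (blocked)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

# Made-up scaled attention scores for 3 target positions.
scores = [[0.5, 1.2, -0.3], [0.1, 0.4, 2.0], [1.0, 0.0, 0.7]]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first position can attend only to itself, so its row is [1.0, 0.0, 0.0]; the last position sees all three.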
The decoder's second attention uses K, V from the encoder but Q from the decoder. The decoder asks: "what parts of the input are relevant to what I'm generating now?"
Q = decoder state × W^Q // "what is relevant to what I'm generating now?"
K = encoder output × W^K // "what does the source contain?"
V = encoder output × W^V // "what to extract?"
08 Training & Generation
The model is trained with teacher forcing (parallel) but generates output autoregressively (sequential). The paper's optimization recipe became standard practice.
The decoder receives the ground-truth target (shifted right) as input. All positions are computed in parallel, and masking prevents cheating.
Encoder input: [I] [love] [ML]
Decoder input: [<SOS>] [Ich] [liebe] [ML] // shifted right
Decoder target: [Ich] [liebe] [ML] [<EOS>] // predict this
// Loss = cross-entropy(predictions, targets)
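The shift-right construction and the loss can be sketched as below. The predicted distributions are made-up numbers standing in for the decoder's softmax output, and the helper names are illustrative rather than taken from any library.

```python
import math

SOS, EOS = "<SOS>", "<EOS>"

def teacher_forcing_pair(target_tokens):
    """Build the decoder input (shifted right) and the prediction target."""
    decoder_input = [SOS] + target_tokens
    decoder_target = target_tokens + [EOS]
    return decoder_input, decoder_target

def cross_entropy(predicted_probs, targets):
    """Mean negative log-probability assigned to the correct tokens."""
    total = sum(math.log(p[t]) for p, t in zip(predicted_probs, targets))
    return -total / len(targets)

dec_in, dec_tgt = teacher_forcing_pair(["Ich", "liebe", "ML"])

# Hypothetical model output: one distribution over the vocabulary per position.
predicted = [
    {"Ich": 0.7, "liebe": 0.1, "ML": 0.1, EOS: 0.1},
    {"Ich": 0.1, "liebe": 0.6, "ML": 0.2, EOS: 0.1},
    {"Ich": 0.05, "liebe": 0.05, "ML": 0.8, EOS: 0.1},
    {"Ich": 0.03, "liebe": 0.03, "ML": 0.04, EOS: 0.9},
]
loss = cross_entropy(predicted, dec_tgt)
```

Position t of the decoder input lines up with position t of the target, so every position contributes to the loss in a single forward pass.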
At inference time no ground truth is available. The model generates one token at a time: predict → append → predict next → ... until <EOS>.
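The generation loop can be sketched with greedy decoding, which always takes the single most likely next token. The `toy_model` below is a hypothetical stand-in, a lookup table rather than a trained Transformer, and `max_len` is an assumed safety limit.

```python
def greedy_decode(next_token_fn, max_len=10, sos="<SOS>", eos="<EOS>"):
    """Repeatedly predict the next token and append it until <EOS> or max_len."""
    tokens = [sos]
    while len(tokens) < max_len + 1:
        token = next_token_fn(tokens)  # predict from everything generated so far
        if token == eos:
            break
        tokens.append(token)           # append, then predict again
    return tokens[1:]                  # drop the <SOS> marker

# Hypothetical stand-in for a trained model: maps the last token to the next one.
CANNED = {"<SOS>": "Ich", "Ich": "liebe", "liebe": "ML", "ML": "<EOS>"}

def toy_model(tokens_so_far):
    return CANNED[tokens_so_far[-1]]

translation = greedy_decode(toy_model)
print(translation)
```

Each step depends on the previous step's output, which is why generation is sequential even though training is parallel.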
// Linear warmup → inverse-sqrt decay
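The paper's schedule is lrate = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5)), with warmup_steps = 4000. A short sketch follows; the function name is illustrative, and clamping step to at least 1 is an added guard against division by zero.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard: step 0 would raise on step ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")
```

The two branches of `min` meet at step = warmup_steps, which is where the learning rate peaks.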