The Transformer Architecture
The Transformer (Vaswani et al., 2017) replaced recurrence and convolution with attention mechanisms alone. It processes all positions in parallel, enabling much faster training and better long-range dependency modeling.
This is the architecture from Figure 1 of the paper. The left side is the Encoder (processes the input); the right side is the Decoder (generates the output).
Unlike RNNs that process tokens sequentially, the Transformer attends to all positions simultaneously. The number of sequential operations per layer drops from O(n) (one step per token in an RNN) to O(1).
Self-attention connects every position to every other position directly. No more vanishing gradients over long sequences — path length is O(1).
The paper uses: d_model = 512, h = 8 heads, d_k = d_v = 64, d_ff = 2048, N = 6 layers.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
// "The dominant sequence transduction models are based on complex
// recurrent or convolutional neural networks... We propose a new
// simple network architecture, the Transformer, based solely on
// attention mechanisms, dispensing with recurrence and convolutions
// entirely." — Abstract, Vaswani et al. (2017)
Input Embeddings & Positional Encoding
Tokens are mapped to dense vectors, then positional information is added so the model knows token order. Without positional encoding, the Transformer would be a "bag of words" — it couldn't distinguish "dog bites man" from "man bites dog".
Each token in the vocabulary gets mapped to a learned vector of dimension d_model = 512. In our simplified example we use d_model = 4 for visibility. The embeddings are multiplied by √d_model (Section 3.4 of the paper).
embedding(token) = lookup(token) × √d_model
Since the Transformer has no recurrence or convolution, position information must be injected explicitly. The paper uses sine and cosine functions at different frequencies — each dimension gets a unique wave pattern. This allows the model to learn relative positions.
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
// pos = position in sequence (0, 1, 2, ...)
// i = dimension-pair index (0, 1, ..., d_model/2 − 1)
The final input to the encoder/decoder is simply the element-wise sum of the token embedding and the positional encoding. This combined vector carries both the token's meaning and its position in the sequence.
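The lookup, scaling, and element-wise sum described above can be sketched in NumPy. This is a minimal illustration: the 10-token vocabulary and random embedding table are placeholders, not values from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: sin on even dimensions, cos on odd dimensions
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)    # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 4                                      # toy size used in the text
vocab = np.random.randn(10, d_model)             # hypothetical 10-token vocab

def embed(token_ids):
    # lookup, scale by sqrt(d_model), then add position info element-wise
    x = vocab[np.array(token_ids)] * np.sqrt(d_model)
    return x + positional_encoding(len(token_ids), d_model)

x = embed([3, 1, 4])                             # 3 tokens -> (3, 4) matrix
```

Note that position 0 always encodes as sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones, regardless of d_model.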
Scaled Dot-Product Attention
The core mechanism of the Transformer. Each token creates a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I provide?). Attention weights are computed from Q·K similarity, then used to weight V.
This formula has 4 steps: (1) compute similarity scores via QKᵀ, (2) scale by √d_k to prevent softmax saturation, (3) apply softmax to get attention weights (probabilities that sum to 1), (4) multiply by V to get the output.
We use a tiny 3-token, 4-dimensional example so you can see every number. The tokens are projected to Q, K, V through learned weight matrices WQ, WK, WV.
When d_k is large, the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients. Dividing by √d_k keeps the variance of the dot products at ≈1, regardless of dimension.
// "We suspect that for large values of d_k, the dot products grow
// large in magnitude, pushing the softmax function into regions
// where it has extremely small gradients." — Vaswani et al. (2017)
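The four steps can be written out directly in NumPy; a minimal sketch with random Q, K, V standing in for the projected tokens:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # steps 1-2: similarity, then scale
    weights = softmax(scores)         # step 3: each row sums to 1
    return weights @ V, weights       # step 4: weighted sum of values

# tiny 3-token, 4-dimensional example, as in the text
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 3, 4))
out, weights = attention(Q, K, V)     # out: (3, 4), weights: (3, 3)
```

Each row of `weights` is one token's probability distribution over all three tokens, so every row sums to 1.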
Multi-Head Attention
Instead of one big attention function, the paper splits Q, K, V into h parallel heads, each attending to different representation subspaces. The outputs are concatenated and projected back.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · WO
where head_i = Attention(Q · WiQ, K · WiK, V · WiV)
// h = 8 heads, d_k = d_v = d_model/h = 512/8 = 64
Each head sees a different "slice" of the embedding dimensions. With 8 heads and d_model=512, each head gets d_k=64 dimensions. This lets different heads learn different types of relationships (syntax, semantics, position, etc.).
Research on trained Transformers shows heads specializing in different linguistic patterns.
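The split-attend-concatenate-project pattern can be sketched as follows. This is a simplified single-matrix variant in which each head takes a contiguous slice of the projected dimensions; the random weights and the toy d_model = 8 are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, WQ, WK, WV, WO, h):
    d_model = X.shape[-1]
    d_k = d_model // h                    # 512/8 = 64 in the paper
    Q, K, V = X @ WQ, X @ WK, X @ WV
    heads = []
    for i in range(h):                    # each head sees its own slice
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ WO  # Concat(head_1..head_h) · WO

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))               # 3 tokens, toy d_model = 8
WQ, WK, WV, WO = rng.normal(size=(4, 8, 8))
Y = multi_head_attention(X, WQ, WK, WV, WO, h=2)  # shape (3, 8)
```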
The Encoder Block
Each encoder layer has two sub-layers: (1) multi-head self-attention and (2) a position-wise feed-forward network. Both use residual connections and layer normalization.
Every sub-layer is wrapped with: LayerNorm(x + SubLayer(x)). The residual connection (adding x back) allows gradients to flow directly through the network, solving the vanishing gradient problem. LayerNorm stabilizes training.
// "We employ a residual connection around each of
// the two sub-layers, followed by layer normalization."
output = LayerNorm( x + SubLayer(x) )
// LayerNorm normalizes across the feature dimension:
LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
// where μ, σ are computed per-token across d_model dimensions
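The add-and-norm wrapper is short enough to write out in full; a minimal sketch where the sub-layer is a dummy function and γ, β take their initial values (ones and zeros):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # mean and variance per token, across the d_model (last) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # LayerNorm(x + SubLayer(x)): residual connection, then normalize
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 4
gamma, beta = np.ones(d_model), np.zeros(d_model)
x = np.random.randn(3, d_model)
y = add_and_norm(x, lambda t: t * 0.5, gamma, beta)  # dummy sub-layer
```

With γ = 1 and β = 0, every output token ends up with mean ≈ 0 and standard deviation ≈ 1 across its features.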
The feed-forward network is applied identically to each position (token) independently. It's essentially two linear transformations with a ReLU in between. The inner dimension d_ff = 2048 is 4× the model dimension.
// W₁: d_model × d_ff = 512 × 2048 (expand)
// W₂: d_ff × d_model = 2048 × 512 (contract)
// This is equivalent to two convolutions with kernel size 1
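The expand-ReLU-contract pattern is a one-liner; a minimal sketch with toy sizes (the paper's are 512 and 2048) and random placeholder weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, same weights at every position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 4, 16                 # toy sizes (paper: 512 and 2048)
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)   # expand
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)  # contract
y = feed_forward(rng.normal(size=(3, d_model)), W1, b1, W2, b2)  # (3, 4)
```

Because the same W1, W2 are applied to every row, the result for each token does not depend on any other token; all cross-token mixing happens in the attention sub-layer.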
Trace a sequence of 4 tokens through one complete encoder layer. Each token starts as a d_model vector and ends as a transformed d_model vector, enriched with contextual information from all other tokens.
The Decoder & Masking
The decoder is similar to the encoder but with two key differences: (1) masked self-attention prevents attending to future tokens, and (2) cross-attention lets the decoder attend to the encoder's output.
During training, the decoder sees the entire target sequence at once for parallel processing. But each position should only attend to earlier positions (auto-regressive property). The mask sets future positions to -∞ before softmax, giving them zero attention weight.
// "We need to prevent leftward information flow in the decoder
// to preserve the auto-regressive property."
Masked_Attention = softmax( QKᵀ/√d_k + Mask ) · V
// Mask[i][j] = 0 if j ≤ i (allowed), −∞ if j > i (blocked)
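The mask itself is just an upper-triangular matrix of −∞; a minimal sketch showing its effect on uniform scores:

```python
import numpy as np

def causal_mask(n):
    # Mask[i][j] = 0 if j <= i (allowed), -inf if j > i (blocked)
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                   # pretend all similarities are equal
weights = softmax(scores + causal_mask(3))
# row 0 attends only to position 0; row 2 spreads evenly over positions 0..2
```

After softmax, the −∞ entries become exactly zero weight: position 0 puts all its attention on itself, while position 2 splits attention evenly over positions 0, 1, 2.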
The decoder's second attention layer uses K and V from the encoder but Q from the decoder. This is how the decoder "reads" the source input. The decoder asks "what parts of the input are relevant to what I'm generating now?"
Q = decoder hidden states × WQ // "what am I looking for?"
K = encoder output × WK // "what does the source contain?"
V = encoder output × WV // "what information to extract?"
// This allows every position in the decoder to attend over
// all positions in the input sequence.
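The asymmetry above (Q from one sequence, K and V from another) is the only change from self-attention; a minimal sketch with random placeholder weights, where the source and target lengths deliberately differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_output, WQ, WK, WV):
    Q = dec_states @ WQ          # queries come from the decoder
    K = enc_output @ WK          # keys come from the encoder output
    V = enc_output @ WV          # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V           # one "read" of the source per decoder position

rng = np.random.default_rng(3)
enc = rng.normal(size=(5, 4))    # 5 source tokens, toy d_model = 4
dec = rng.normal(size=(2, 4))    # 2 target positions generated so far
WQ, WK, WV = rng.normal(size=(3, 4, 4))
ctx = cross_attention(dec, enc, WQ, WK, WV)   # shape (2, 4)
```

The output has one row per decoder position, no matter how long the source is, because the source length only appears inside the attention weights.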
Training & Generation
The Transformer is trained with teacher forcing and generates output auto-regressively at inference. The paper used specific optimization choices that became standard practice.
During training, the decoder receives the ground-truth target (shifted right by one) as input. The model predicts the next token at each position, and all positions are computed in parallel thanks to masking.
Encoder input: [I] [love] [ML]
Decoder input: [<SOS>] [Ich] [liebe] [ML] // shifted right
Decoder target: [Ich] [liebe] [ML] [<EOS>] // what we predict
// Loss = cross-entropy between predictions and targets
// All positions computed in parallel (masked attention
// prevents cheating by looking at future tokens)
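The "shifted right" trick is just list slicing; a minimal sketch pairing each decoder input with the token it must predict:

```python
target = ["Ich", "liebe", "ML", "<EOS>"]          # what we predict
decoder_input = ["<SOS>"] + target[:-1]           # shifted right by one
# at training position t, the model sees decoder_input[:t+1]
# and is scored against target[t]
pairs = list(zip(decoder_input, target))
# [("<SOS>", "Ich"), ("Ich", "liebe"), ("liebe", "ML"), ("ML", "<EOS>")]
```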
At inference, there's no ground truth. The model generates one token at a time: predict → append → predict next → append → ... until <EOS>.
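The predict-append loop can be sketched as greedy decoding. Here `model(src, tgt)` is a hypothetical callable returning next-token logits (the stub below always predicts one fixed token, just to demonstrate the loop):

```python
import numpy as np

def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    # `model(src, tgt)` is a hypothetical callable returning a logit
    # vector over the vocabulary for the next token
    tgt = [sos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(model(src, tgt)))
        tgt.append(next_id)               # predict -> append -> repeat
        if next_id == eos_id:
            break                         # stop at <EOS>
    return tgt

# stub model that always favors token 2 (our <EOS> here), for demonstration
stub = lambda src, tgt: np.array([0.0, 0.1, 0.9])
out = greedy_decode(stub, src=[1], sos_id=0, eos_id=2)  # [0, 2]
```

Real systems typically replace the argmax with beam search (the paper used beam size 4), but the outer loop is the same.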
The paper used Adam optimizer with a custom learning rate schedule — a warmup phase followed by decay. This became one of the most influential training recipes.
// warmup_steps = 4000
// Linearly increases lr for the first warmup_steps,
// then decreases proportionally to step^(-0.5)
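The schedule is a single expression (Section 5.3 of the paper):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# lr rises linearly during warmup, peaks at step 4000, then decays as step^-0.5
peak = lrate(4000)
```

At exactly step = warmup_steps the two terms inside the min are equal, which is why the peak sits there.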
Base model: ~65M parameters · BLEU 27.3 (EN→DE)
Big model: ~213M parameters · BLEU 28.4 (EN→DE)