Attention Is All You Need
Interactive Transformer Guide · Vaswani et al. (2017)

The Transformer Architecture

The Transformer (Vaswani et al., 2017) replaced recurrence and convolution with attention mechanisms alone. It processes all positions in parallel, enabling much faster training and better long-range dependency modeling.

🏗️ Architecture Overview (Figure 1)

This is the architecture from Figure 1 of the paper. The left side is the Encoder (processes the input), the right side is the Decoder (generates the output). Each stack is listed top-down, output first:

Encoder (×N)
    Feed-Forward Network
    ↑ Add & Norm
    Multi-Head Attention
    ↑ Add & Norm
    + Positional Encoding
    Input Embedding

Decoder (×N)
    Linear + Softmax → Output
    Feed-Forward Network
    ↑ Add & Norm
    Cross-Attention (Enc→Dec)
    ↑ Add & Norm
    Masked Multi-Head Attention
    ↑ Add & Norm
    + Positional Encoding
    Output Embedding
Each component is explained below with its formula and role in the architecture.
Parallel Processing

Unlike RNNs that process tokens sequentially, the Transformer attends to all positions simultaneously. The number of sequential operations per layer drops from O(n) to O(1), so training parallelizes across the entire sequence.

🔗 Long-Range Dependencies

Self-attention connects every position to every other position directly. No more vanishing gradients over long sequences — path length is O(1).

📐 Key Dimensions

The paper uses: d_model = 512, h = 8 heads, d_k = d_v = 64, d_ff = 2048, N = 6 layers.

📄 Paper Reference

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).

// The dominant sequence transduction models are based on complex
// recurrent or convolutional neural networks... We propose a new
// simple network architecture, the Transformer, based solely on
// attention mechanisms, dispensing with recurrence and convolutions
// entirely. — Abstract, Vaswani et al. (2017)

Input Embeddings & Positional Encoding

Tokens are mapped to dense vectors, then positional information is added so the model knows token order. Without positional encoding, the Transformer would be a "bag of words" — it couldn't distinguish "dog bites man" from "man bites dog".

🔤 Step 1: Token Embedding

Each token in the vocabulary gets mapped to a learned vector of dimension d_model = 512. In our simplified example we use d_model = 4 for visibility. The embeddings are multiplied by √d_model (Section 3.4 of the paper).

// Section 3.4: "we multiply those weights by √d_model"
embedding(token) = lookup(token) × √d_model
🌊 Step 2: Positional Encoding (Sinusoidal)

Since the Transformer has no recurrence or convolution, position information must be injected explicitly. The paper uses sine and cosine functions at different frequencies — each dimension gets a unique wave pattern. This allows the model to learn relative positions.

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

// pos = position in sequence (0, 1, 2, ...)
// i = dimension index (0, 1, 2, ..., d_model/2 − 1)
Why sinusoidal?
For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). This means the model can learn to attend to relative positions. The wavelengths form a geometric progression from 2π to 10000·2π.
Positional Encoding Heatmap — rows = positions, cols = dimensions
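The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the function name and the small dimensions are our own choices:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from Section 3.5: sin on even dims, cos on odd."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)         # (50, 16)
print(pe[0, :4])        # position 0: sin(0)=0 in even dims, cos(0)=1 in odd dims
```

Plotting `pe` as an image reproduces the heatmap above: low dimensions oscillate fast, high dimensions slowly.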
Step 3: Embedding + Positional Encoding = Model Input

The final input to the encoder/decoder is simply the element-wise sum of the token embedding and the positional encoding. This combined vector carries both the token's meaning and its position in the sequence.

💡
Key insight: The embedding and PE live in the same vector space (d_model). Adding them lets the model use both meaning and position simultaneously. Later work (like RoPE in LLaMA) explored multiplicative alternatives.

Scaled Dot-Product Attention

The core mechanism of the Transformer. Each token creates a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I provide?). Attention weights are computed from Q·K similarity, then used to weight V.

📐 The Attention Formula (Equation 1 in the paper)
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

This formula has 4 steps: (1) compute similarity scores via QKᵀ, (2) scale by √d_k to prevent softmax saturation, (3) apply softmax to get attention weights (probabilities that sum to 1), (4) multiply by V to get the output.

🔢 Step-by-Step Attention Calculation

We use a tiny 3-token, 4-dimensional example so you can see every number. The tokens are projected to Q, K, V through learned weight matrices W^Q, W^K, W^V:
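The same 4 steps, written out for a tiny 3×4 example (random Q, K, V stand in for the projected tokens):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Equation 1 of the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # steps (1) similarity and (2) scale
    weights = softmax(scores)         # step (3): each row sums to 1
    return weights @ V, weights       # step (4): weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape)        # (3, 4)
print(w.sum(axis=1))    # [1. 1. 1.]
```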

🌡️ Temperature & Scaling — Why divide by √d_k?

When d_k is large, the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients. Dividing by √d_k keeps the variance of the dot products at ≈1, regardless of dimension.

Softmax comparison — with vs. without √d_k scaling
// Section 3.2.1:
"We suspect that for large values of d_k, the dot products grow
large in magnitude, pushing the softmax function into regions
where it has extremely small gradients."
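The variance claim is easy to check numerically. With unit-variance entries, a d_k-dimensional dot product has standard deviation ≈ √d_k; dividing by √d_k brings it back to ≈ 1 (a quick NumPy demonstration with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(d_k,))        # one query with unit-variance entries
K = rng.normal(size=(1000, d_k))   # 1000 keys
scores = K @ q
scaled = scores / np.sqrt(d_k)
print(round(scores.std(), 1))      # ≈ √512 ≈ 22.6 — huge logit spread saturates softmax
print(round(scaled.std(), 2))      # ≈ 1 — softmax stays in a well-behaved regime
```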

Multi-Head Attention

Instead of one big attention function, the paper splits Q, K, V into h parallel heads, each attending to different representation subspaces. The outputs are concatenated and projected back.

🔀 The Multi-Head Formula (Equation 2-3)
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W^O

where head_i = Attention(Q · W_i^Q, K · W_i^K, V · W_i^V)

// h = 8 heads, d_k = d_v = d_model/h = 512/8 = 64
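A compact self-attention sketch of Equations 2–3. The weight matrices are random stand-ins for learned parameters; in this formulation the per-head projections live as slices of full d_model × d_model matrices, which is how implementations usually batch them:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h=8):
    """Multi-head self-attention: project, split into h heads, attend, concat, project."""
    seq, d_model = x.shape
    d_k = d_model // h
    def split(m):                                      # (seq, d_model) -> (h, seq, d_k)
        return m.reshape(seq, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq, seq): one map per head
    heads = softmax(scores) @ V                        # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                 # final projection W^O

rng = np.random.default_rng(0)
seq, d_model = 5, 64
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, h=8).shape)  # (5, 64)
```

Note that the total arithmetic matches single-head attention at full width, since h × d_k = d_model.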
👁️ How the Heads Split the Embedding

Each head sees a different "slice" of the embedding dimensions. With 8 heads and d_model=512, each head gets d_k=64 dimensions. This lets different heads learn different types of relationships (syntax, semantics, position, etc.).

Head Allocation — each color = one head's subspace
🧠 What Different Heads Learn

Research on trained Transformers shows heads specializing in different linguistic patterns:

Head A — Positional
Attends to adjacent tokens. Captures local syntax (e.g., adjective → noun).
Head B — Semantic
Attends to semantically related tokens across the sentence, even far apart.
Head C — Separator
Attends to punctuation / special tokens. Tracks sentence boundaries.
📐
Computational cost: Multi-head attention with h heads of d_k dimensions has the same total cost as single-head attention with d_model dimensions, because h × d_k = d_model. You get diversity of attention patterns for free.

The Encoder Block

Each encoder layer has two sub-layers: (1) multi-head self-attention and (2) a position-wise feed-forward network. Both use residual connections and layer normalization.

🔄 Residual Connection + LayerNorm

Every sub-layer is wrapped with: LayerNorm(x + SubLayer(x)). The residual connection (adding x back) allows gradients to flow directly through the network, mitigating vanishing gradients. LayerNorm stabilizes training.

// Section 3.1: "We employ a residual connection around each of
// the two sub-layers, followed by layer normalization."

output = LayerNorm( x + SubLayer(x) )

// LayerNorm normalizes across the feature dimension:
LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
// where μ, σ² are the mean and variance per token across d_model dimensions
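The wrapper can be sketched directly from the formula. The toy sub-layer below is a placeholder for attention or the FFN:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each token across the feature (d_model) dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def sublayer_wrap(x, sublayer):
    """LayerNorm(x + SubLayer(x)) — the post-norm wrapper from Section 3.1."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = sublayer_wrap(x, lambda t: t * 0.5)   # toy sub-layer standing in for attention/FFN
print(np.round(y.mean(axis=-1), 6))       # ≈ 0 per token after normalization
```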
🧮 Position-wise Feed-Forward Network

Applied identically to each position (token) independently. It's essentially two linear transformations with a ReLU in between. The inner dimension d_ff = 2048 is 4× the model dimension.

FFN(x) = max(0, x · W₁ + b₁) · W₂ + b₂

// W₁: d_model × d_ff = 512 × 2048 (expand)
// W₂: d_ff × d_model = 2048 × 512 (contract)
// Equivalent to two convolutions with kernel size 1 (Section 3.3)
FFN Shape — expand then contract
💡
Why expand-then-contract? The inner layer (d_ff=2048) acts as a larger representational space where the model can compute complex non-linear transformations. Think of it as the model "thinking" in a higher-dimensional space before compressing back.
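The expand-then-contract FFN is one line of NumPy. Shrunk dimensions and random weights keep the example readable; the paper uses 512 and 2048 with learned parameters:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand with ReLU, then contract back to d_model."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # max(0, ·) is ReLU

rng = np.random.default_rng(0)
seq, d_model, d_ff = 4, 8, 32                      # paper: d_model=512, d_ff=2048
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
x = rng.normal(size=(seq, d_model))
print(ffn(x, W1, b1, W2, b2).shape)  # (4, 8) — same shape in and out
```

Because the same W₁, W₂ apply at every position, each token is transformed independently, exactly as described above.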
📊 Full Encoder Data Flow

Trace a sequence of 4 tokens through one complete encoder layer. Each token starts as a d_model vector and ends as a transformed d_model vector, enriched with contextual information from all other tokens.

The Decoder & Masking

The decoder is similar to the encoder but with two key differences: (1) masked self-attention prevents attending to future tokens, and (2) cross-attention lets the decoder attend to the encoder's output.

🎭 Masked Self-Attention — Preventing Future Leakage

During training, the decoder sees the entire target sequence at once for parallel processing. But each position should only attend to earlier positions (auto-regressive property). The mask sets future positions to -∞ before softmax, giving them zero attention weight.

// Section 3.2.3:
// "We need to prevent leftward information flow in the decoder
// to preserve the auto-regressive property."

Masked_Attention = softmax( QKᵀ/√d_k + Mask ) · V

// Mask[i][j] = 0 if j ≤ i (allowed), −∞ if j > i (blocked)
Attention mask and resulting weights after softmax (rows sum to 1)
💡
Why −∞ and not 0? We need the softmax output to be exactly zero for future positions. Since softmax(−∞) = 0, setting masked positions to −∞ guarantees zero attention weight, regardless of the actual QKᵀ values.
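A lower-triangular mask makes this concrete. With uniform scores standing in for QKᵀ/√d_k, each row spreads its weight only over positions 0..i:

```python
import numpy as np

def causal_mask(n):
    """0 where attending is allowed (j ≤ i), −inf where blocked (j > i)."""
    return np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # pretend QKᵀ/√d_k gave uniform scores
w = softmax(scores + causal_mask(4))
print(np.round(w, 2))
# row i spreads weight uniformly over positions 0..i; every future position gets exactly 0
```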
🔗 Cross-Attention — Encoder ↔ Decoder Bridge

The decoder's second attention layer uses K and V from the encoder but Q from the decoder. This is how the decoder "reads" the source input. The decoder asks "what parts of the input are relevant to what I'm generating now?"

// Cross-attention (Section 3.2.3):
Q = decoder hidden states × WQ // "what am I looking for?"
K = encoder output × WK // "what does the source contain?"
V = encoder output × WV // "what information to extract?"

// This allows every position in the decoder to attend over
// all positions in the input sequence.
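The asymmetry is just where Q comes from versus K and V. A tiny sketch with random stand-ins for the encoder output and decoder hidden states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
enc_out = rng.normal(size=(5, d))    # 5 source positions from the encoder
dec_h = rng.normal(size=(3, d))      # 3 target-side hidden states
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

Q = dec_h @ Wq                       # queries come from the decoder
K, V = enc_out @ Wk, enc_out @ Wv    # keys/values come from the encoder
w = softmax(Q @ K.T / np.sqrt(d))    # (3, 5): each target position attends over all source
print((w @ V).shape)                 # (3, 8) — one context vector per target position
```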
📊 Full Decoder Block Comparison
Encoder Self-Attention
Q, K, V all come from the same source (encoder input). Every token attends to every other token — no masking.
Decoder Masked Self-Attention
Q, K, V from decoder input. Token i can only attend to positions 0..i — future is masked.
Decoder Cross-Attention
Q from decoder, K and V from encoder output. Decoder attends to all encoder positions (no masking). This is the information bridge between input and output.

Training & Generation

The Transformer is trained with teacher forcing and generates output auto-regressively at inference. The paper used specific optimization choices that became standard practice.

🎓 Training: Teacher Forcing

During training, the decoder receives the ground-truth target (shifted right by one) as input. The model predicts the next token at each position, and all positions are computed in parallel thanks to masking.

// Training example: translate "I love ML" → "Ich liebe ML"

Encoder input: [I] [love] [ML]
Decoder input: [<SOS>] [Ich] [liebe] [ML] // shifted right
Decoder target: [Ich] [liebe] [ML] [<EOS>] // what we predict

// Loss = cross-entropy between predictions and targets
// All positions computed in parallel (masked attention
// prevents cheating by looking at future tokens)
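The shift-right construction from the example above, in code (illustrative token strings, not a real tokenizer):

```python
# Build decoder input/target by shifting the target sequence right by one.
target = ["Ich", "liebe", "ML", "<EOS>"]
decoder_input = ["<SOS>"] + target[:-1]    # what the decoder sees at each step
decoder_target = target                    # what it must predict at each step
for inp, tgt in zip(decoder_input, decoder_target):
    print(f"{inp:>7} -> {tgt}")
# At position i the model reads decoder_input[0..i] and predicts decoder_target[i].
```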
🔮 Inference: Auto-Regressive Generation

At inference, there's no ground truth. The model generates one token at a time: predict → append → predict next → append → ... until <EOS>.

⚙️ Optimizer & Learning Rate Schedule

The paper used Adam optimizer with a custom learning rate schedule — a warmup phase followed by decay. This became one of the most influential training recipes.

lr = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))

// warmup_steps = 4000
// Linearly increases lr for the first warmup_steps,
// then decreases proportionally to step^(-0.5)
Learning Rate Schedule
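The schedule (often called the "Noam" schedule) is a one-liner; evaluating it at a few steps shows the warmup-then-decay shape:

```python
def lr(step, d_model=512, warmup=4000):
    """Warmup-then-decay schedule from Section 5.3 of the paper."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for s in (100, 4000, 40000):
    print(s, f"{lr(s):.2e}")   # rises during warmup, peaks at step 4000, then decays
```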
📊 Model Configurations from the Paper
Transformer (Base)
N=6 layers, d_model=512, d_ff=2048, h=8 heads, d_k=64
Parameters: ~65M · BLEU: 27.3 (EN→DE)
Transformer (Big)
N=6 layers, d_model=1024, d_ff=4096, h=16 heads, d_k=64
Parameters: ~213M · BLEU: 28.4 (EN→DE)
🌍
Impact: The Transformer architecture became the foundation for BERT, GPT, T5, LLaMA, and virtually all modern LLMs. The key insight — replacing recurrence with attention — unlocked massive parallelism and enabled training on unprecedented data scales.