The Transformer Architecture
The Transformer (Vaswani et al., 2017) replaced recurrence and convolution with attention mechanisms alone. It processes all positions in parallel, enabling much faster training and better long-range dependency modeling.
This is the architecture from Figure 1 of the paper. The left side is the Encoder (processes the input); the right side is the Decoder (generates the output).
Unlike RNNs that process tokens sequentially, the Transformer attends to all positions simultaneously. The number of sequential operations per layer drops from O(n) (one step per token in an RNN) to O(1).
Self-attention connects every position to every other position directly. No more vanishing gradients over long sequences — path length is O(1).
The paper uses: d_model = 512, h = 8 heads, d_k = d_v = 64, d_ff = 2048, N = 6 layers.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
// "The dominant sequence transduction models are based on complex
// recurrent or convolutional neural networks... We propose a new
// simple network architecture, the Transformer, based solely on
// attention mechanisms, dispensing with recurrence and convolutions
// entirely." — Abstract, Vaswani et al. (2017)
Input Embeddings & Positional Encoding
Tokens are mapped to dense vectors, then positional information is added so the model knows token order. Without positional encoding, the Transformer would be a "bag of words" — it couldn't distinguish "dog bites man" from "man bites dog".
Each token in the vocabulary gets mapped to a learned vector of dimension d_model = 512. In our simplified example we use d_model = 4 for visibility. The embeddings are multiplied by √d_model (Section 3.4 of the paper).
embedding(token) = lookup(token) × √d_model
Since the Transformer has no recurrence or convolution, position information must be injected explicitly. The paper uses sine and cosine functions at different frequencies — each dimension gets a unique wave pattern. This allows the model to learn relative positions.
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
// pos = position in sequence (0, 1, 2, ...)
// i = dimension-pair index (0, 1, ..., d_model/2 − 1)
The final input to the encoder/decoder is simply the element-wise sum of the token embedding and the positional encoding. This combined vector carries both the token's meaning and its position in the sequence.
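The lookup, scaling, and element-wise sum described above can be sketched in NumPy. This is a minimal illustration: the 10-token vocabulary and random embedding table are placeholders, not values from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: sin on even dimensions, cos on odd dimensions
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)    # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 4                                      # toy size used in the text
vocab = np.random.randn(10, d_model)             # hypothetical 10-token vocab

def embed(token_ids):
    # lookup, scale by sqrt(d_model), then add position info element-wise
    x = vocab[np.array(token_ids)] * np.sqrt(d_model)
    return x + positional_encoding(len(token_ids), d_model)

x = embed([3, 1, 4])                             # 3 tokens -> (3, 4) matrix
```

Note that position 0 always encodes as sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones, regardless of d_model.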
Scaled Dot-Product Attention
The core mechanism of the Transformer. Each token creates a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I provide?). Attention weights are computed from Q·K similarity, then used to weight V.
This formula has 4 steps: (1) compute similarity scores via QKᵀ, (2) scale by √d_k to prevent softmax saturation, (3) apply softmax to get attention weights (probabilities that sum to 1), (4) multiply by V to get the output.
We use a tiny 3-token, 4-dimensional example so you can see every number. The tokens are projected to Q, K, V through learned weight matrices WQ, WK, WV.
When d_k is large, the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients. Dividing by √d_k keeps the variance of the dot products at ≈1, regardless of dimension.
// "We suspect that for large values of d_k, the dot products grow
// large in magnitude, pushing the softmax function into regions
// where it has extremely small gradients." — Vaswani et al. (2017)
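The four steps can be written out directly in NumPy; a minimal sketch with random Q, K, V standing in for the projected tokens:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # steps 1-2: similarity, then scale
    weights = softmax(scores)         # step 3: each row sums to 1
    return weights @ V, weights       # step 4: weighted sum of values

# tiny 3-token, 4-dimensional example, as in the text
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 3, 4))
out, weights = attention(Q, K, V)     # out: (3, 4), weights: (3, 3)
```

Each row of `weights` is one token's probability distribution over all three tokens, so every row sums to 1.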
Multi-Head Attention
Instead of one big attention function, the paper splits Q, K, V into h parallel heads, each attending to different representation subspaces. The outputs are concatenated and projected back.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · WO
where head_i = Attention(Q · WiQ, K · WiK, V · WiV)
// h = 8 heads, d_k = d_v = d_model/h = 512/8 = 64
Each head sees a different "slice" of the embedding dimensions. With 8 heads and d_model=512, each head gets d_k=64 dimensions. This lets different heads learn different types of relationships (syntax, semantics, position, etc.).
Research on trained Transformers shows heads specializing in different linguistic patterns.
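The split-attend-concatenate-project pattern can be sketched as follows. This is a simplified single-matrix variant in which each head takes a contiguous slice of the projected dimensions; the random weights and the toy d_model = 8 are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, WQ, WK, WV, WO, h):
    d_model = X.shape[-1]
    d_k = d_model // h                    # 512/8 = 64 in the paper
    Q, K, V = X @ WQ, X @ WK, X @ WV
    heads = []
    for i in range(h):                    # each head sees its own slice
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ WO  # Concat(head_1..head_h) · WO

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))               # 3 tokens, toy d_model = 8
WQ, WK, WV, WO = rng.normal(size=(4, 8, 8))
Y = multi_head_attention(X, WQ, WK, WV, WO, h=2)  # shape (3, 8)
```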
The Encoder Block
Each encoder layer has two sub-layers: (1) multi-head self-attention and (2) a position-wise feed-forward network. Both use residual connections and layer normalization.
Every sub-layer is wrapped with: LayerNorm(x + SubLayer(x)). The residual connection (adding x back) allows gradients to flow directly through the network, solving the vanishing gradient problem. LayerNorm stabilizes training.
// "We employ a residual connection around each of
// the two sub-layers, followed by layer normalization."
output = LayerNorm( x + SubLayer(x) )
// LayerNorm normalizes across the feature dimension:
LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
// where μ, σ are computed per-token across d_model dimensions
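The add-and-norm wrapper is short enough to write out in full; a minimal sketch where the sub-layer is a dummy function and γ, β take their initial values (ones and zeros):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # mean and variance per token, across the d_model (last) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # LayerNorm(x + SubLayer(x)): residual connection, then normalize
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 4
gamma, beta = np.ones(d_model), np.zeros(d_model)
x = np.random.randn(3, d_model)
y = add_and_norm(x, lambda t: t * 0.5, gamma, beta)  # dummy sub-layer
```

With γ = 1 and β = 0, every output token ends up with mean ≈ 0 and standard deviation ≈ 1 across its features.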
The feed-forward network is applied identically to each position (token) independently. It's essentially two linear transformations with a ReLU in between. The inner dimension d_ff = 2048 is 4× the model dimension.
// W₁: d_model × d_ff = 512 × 2048 (expand)
// W₂: d_ff × d_model = 2048 × 512 (contract)
// This is equivalent to two convolutions with kernel size 1
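The expand-ReLU-contract pattern is a one-liner; a minimal sketch with toy sizes (the paper's are 512 and 2048) and random placeholder weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, same weights at every position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 4, 16                 # toy sizes (paper: 512 and 2048)
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)   # expand
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)  # contract
y = feed_forward(rng.normal(size=(3, d_model)), W1, b1, W2, b2)  # (3, 4)
```

Because the same W1, W2 are applied to every row, the result for each token does not depend on any other token; all cross-token mixing happens in the attention sub-layer.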
Trace a sequence of 4 tokens through one complete encoder layer. Each token starts as a d_model vector and ends as a transformed d_model vector, enriched with contextual information from all other tokens.
The Decoder & Masking
The decoder is similar to the encoder but with two key differences: (1) masked self-attention prevents attending to future tokens, and (2) cross-attention lets the decoder attend to the encoder's output.
During training, the decoder sees the entire target sequence at once for parallel processing. But each position should only attend to earlier positions (auto-regressive property). The mask sets future positions to -∞ before softmax, giving them zero attention weight.
// "We need to prevent leftward information flow in the decoder
// to preserve the auto-regressive property."
Masked_Attention = softmax( QKᵀ/√d_k + Mask ) · V
// Mask[i][j] = 0 if j ≤ i (allowed), −∞ if j > i (blocked)
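The mask itself is just an upper-triangular matrix of −∞; a minimal sketch showing its effect on uniform scores:

```python
import numpy as np

def causal_mask(n):
    # Mask[i][j] = 0 if j <= i (allowed), -inf if j > i (blocked)
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                   # pretend all similarities are equal
weights = softmax(scores + causal_mask(3))
# row 0 attends only to position 0; row 2 spreads evenly over positions 0..2
```

After softmax, the −∞ entries become exactly zero weight: position 0 puts all its attention on itself, while position 2 splits attention evenly over positions 0, 1, 2.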
The decoder's second attention layer uses K and V from the encoder but Q from the decoder. This is how the decoder "reads" the source input. The decoder asks "what parts of the input are relevant to what I'm generating now?"
Q = decoder hidden states × WQ // "what am I looking for?"
K = encoder output × WK // "what does the source contain?"
V = encoder output × WV // "what information to extract?"
// This allows every position in the decoder to attend over
// all positions in the input sequence.
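The asymmetry above (Q from one sequence, K and V from another) is the only change from self-attention; a minimal sketch with random placeholder weights, where the source and target lengths deliberately differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_output, WQ, WK, WV):
    Q = dec_states @ WQ          # queries come from the decoder
    K = enc_output @ WK          # keys come from the encoder output
    V = enc_output @ WV          # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V           # one "read" of the source per decoder position

rng = np.random.default_rng(3)
enc = rng.normal(size=(5, 4))    # 5 source tokens, toy d_model = 4
dec = rng.normal(size=(2, 4))    # 2 target positions generated so far
WQ, WK, WV = rng.normal(size=(3, 4, 4))
ctx = cross_attention(dec, enc, WQ, WK, WV)   # shape (2, 4)
```

The output has one row per decoder position, no matter how long the source is, because the source length only appears inside the attention weights.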
Training & Generation
The Transformer is trained with teacher forcing and generates output auto-regressively at inference. The paper used specific optimization choices that became standard practice.
During training, the decoder receives the ground-truth target (shifted right by one) as input. The model predicts the next token at each position, and all positions are computed in parallel thanks to masking.
Encoder input: [I] [love] [ML]
Decoder input: [<SOS>] [Ich] [liebe] [ML] // shifted right
Decoder target: [Ich] [liebe] [ML] [<EOS>] // what we predict
// Loss = cross-entropy between predictions and targets
// All positions computed in parallel (masked attention
// prevents cheating by looking at future tokens)
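The "shifted right" trick is just list slicing; a minimal sketch pairing each decoder input with the token it must predict:

```python
target = ["Ich", "liebe", "ML", "<EOS>"]          # what we predict
decoder_input = ["<SOS>"] + target[:-1]           # shifted right by one
# at training position t, the model sees decoder_input[:t+1]
# and is scored against target[t]
pairs = list(zip(decoder_input, target))
# [("<SOS>", "Ich"), ("Ich", "liebe"), ("liebe", "ML"), ("ML", "<EOS>")]
```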
At inference, there's no ground truth. The model generates one token at a time: predict → append → predict next → append → ... until <EOS>.
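The predict-append loop can be sketched as greedy decoding. Here `model(src, tgt)` is a hypothetical callable returning next-token logits (the stub below always predicts one fixed token, just to demonstrate the loop):

```python
import numpy as np

def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    # `model(src, tgt)` is a hypothetical callable returning a logit
    # vector over the vocabulary for the next token
    tgt = [sos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(model(src, tgt)))
        tgt.append(next_id)               # predict -> append -> repeat
        if next_id == eos_id:
            break                         # stop at <EOS>
    return tgt

# stub model that always favors token 2 (our <EOS> here), for demonstration
stub = lambda src, tgt: np.array([0.0, 0.1, 0.9])
out = greedy_decode(stub, src=[1], sos_id=0, eos_id=2)  # [0, 2]
```

Real systems typically replace the argmax with beam search (the paper used beam size 4), but the outer loop is the same.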
The paper used Adam optimizer with a custom learning rate schedule — a warmup phase followed by decay. This became one of the most influential training recipes.
// warmup_steps = 4000
// Linearly increases lr for the first warmup_steps,
// then decreases proportionally to step^(-0.5)
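The schedule is a single expression (Section 5.3 of the paper):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# lr rises linearly during warmup, peaks at step 4000, then decays as step^-0.5
peak = lrate(4000)
```

At exactly step = warmup_steps the two terms inside the min are equal, which is why the peak sits there.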
Base model: ~65M parameters · BLEU 27.3 (EN→DE)
Big model: ~213M parameters · BLEU 28.4 (EN→DE)