
How GPT-2 Works (Transformer Block)

Explore causal self-attention with a heatmap and next-token scoring.

Tags: transformer · gpt-2 · self-attention · language-modeling
Explanation

GPT-2 is a decoder-only transformer: it reads a prompt left to right and repeatedly predicts the next token. The core operation is causal self-attention.

High-level picture

  • Convert tokens into vectors: token embeddings + positional information.
  • In each layer, every token forms a weighted mix of previous tokens via self-attention.
  • A small MLP (feed-forward network) refines the representation.
  • The final vector at the last position is turned into next-token logits over the vocabulary (see the sketch after this list).
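
To make these steps concrete, here is a minimal NumPy sketch of one forward pass. Everything in it is a toy assumption: random weights, a made-up 5-token prompt, a single attention head, and no residual connections or layer norm (real GPT-2 has all of these, plus many stacked layers and learned weights). Names like `d_model` and `W_out` are illustrative, not GPT-2's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_tokens = 50, 16, 5

# 1) Token embeddings + positional information
tok_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(n_tokens, d_model))
token_ids = rng.integers(0, vocab_size, size=n_tokens)   # stand-in for a tokenized prompt
x = tok_emb[token_ids] + pos_emb                          # (n_tokens, d_model)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 2) One causal self-attention layer (single head for brevity;
#    residuals and layer norm are omitted)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf                                  # causal: no peeking at the future
attn_out = softmax(scores) @ v

# 3) A small MLP refines each position (ReLU here; GPT-2 uses GELU)
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
h = np.maximum(attn_out @ W1, 0) @ W2

# 4) The vector at the last position becomes next-token logits over the vocabulary
W_out = rng.normal(size=(d_model, vocab_size))
logits = h[-1] @ W_out
probs = softmax(logits)
print("most likely next token id:", int(probs.argmax()))
```

With random weights the prediction is meaningless, of course; the point is the shape of the computation: embed, attend, refine, score.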

Causal self-attention

“Causal” means token i can only attend to tokens ≤ i (no peeking at the future). Multiple attention heads let the model capture different relationships in parallel.
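
Below is a minimal sketch of multi-head causal attention, again with random weights and toy dimensions (4 heads of 4 dimensions each, not GPT-2's real configuration). It shows the two ideas in this paragraph: an upper-triangular mask that blocks attention to future positions, and independent heads that each produce their own attention pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(n_tokens, d_model))          # stand-in for embedded tokens

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Project to queries/keys/values, then split into heads: (n_heads, n_tokens, d_head)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
def split_heads(m):
    return m.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)
q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

# Scaled dot-product scores, then the causal mask: token i attends only to j <= i
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # (n_heads, n_tokens, n_tokens)
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[:, future] = -np.inf
weights = softmax(scores)                                   # each row sums to 1

# Each head mixes values with its own weights; heads are then concatenated back together
out = (weights @ v).transpose(1, 0, 2).reshape(n_tokens, d_model)

print(weights[0].round(2))    # head 0: a lower-triangular matrix, one row per query token
```

Printing `weights[0]` gives exactly the kind of lower-triangular matrix the heatmap below visualizes; switching heads in the widget corresponds to looking at a different slice of `weights`.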

How to use the visualization

  • Write a short prompt and select a layer/head.
  • Click a token to see which earlier tokens it attends to.
  • Look at the “next-token” panel to connect attention → prediction.
GPT-2 (Transformer) Intuition
A tiny, educational transformer: causal self-attention over your prompt, visualized as a heatmap in the spirit of a “transformer explainer”.
layers: 2 · heads: 4 · tokens: 10
Prompt
Tip: keep it short; the heatmap is N×N tokens.
Self-Attention Heatmap
focus token: 10
Each row is “where token i looks” (weights sum to 1). GPT-2 uses a causal mask: tokens can’t attend to the future.
Attention From Focus Token
pos  token  weight
 1   The    0.40
 2   cat    0.05
 3   sat    0.19
 4   on     0.09
 5   the    0.10
 6   mat    0.04
 7   ,      0.03
 8   and    0.07
 9   then   0.01
10   it     0.03
Next-Token (Toy)
token  probability
it     73.5%
cat     6.0%
to      4.3%
is      3.6%
sat     3.6%
in      2.3%
the     1.7%
.       1.5%
This is a simplified simulation (not real GPT-2 weights/tokenization), but it shows the mechanics: embeddings → self-attention → next-token scores.
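
As a last piece of glue, this is roughly how a panel like “Next-Token (Toy)” turns logits into the percentages shown above: a softmax followed by sorting. The tokens and logit values here are arbitrary illustrations, not output from any real model.

```python
import numpy as np

# Hypothetical next-token logits over a tiny vocabulary (arbitrary values)
vocab = ["it", "cat", "to", "is", "sat", "in", "the", "."]
logits = np.array([4.0, 1.5, 1.2, 1.0, 1.0, 0.6, 0.3, 0.2])

# Softmax: subtract the max for numerical stability, exponentiate, normalize
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Rank tokens by probability, highest first
for tok, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{tok:>4s}  {100 * p:5.1f}%")
```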