
How GPT-2 Works (Transformer Block)

Explore causal self-attention with a heatmap and next-token scoring.

Tags: transformer · gpt-2 · self-attention · language-modeling
Explanation

GPT-2 is a decoder-only transformer: it reads a prompt left to right and repeatedly predicts the next token. The core operation is causal self-attention.

High-level picture

  • Convert tokens into vectors: token embeddings + positional information.
  • In each layer, every token forms a weighted mix of previous tokens via self-attention.
  • A small MLP (feed-forward network) refines the representation.
  • The final vector at the last position is turned into next-token logits over the vocabulary (see the sketch after this list).
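
To make these steps concrete, here is a minimal NumPy sketch of one forward pass. Everything in it is a toy assumption: random weights, a made-up 5-token prompt, a single attention head, and no residual connections or layer norm (real GPT-2 has all of these, plus many stacked layers and learned weights). Names like `d_model` and `W_out` are illustrative, not GPT-2's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_tokens = 50, 16, 5

# 1) Token embeddings + positional information
tok_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(n_tokens, d_model))
token_ids = rng.integers(0, vocab_size, size=n_tokens)   # stand-in for a tokenized prompt
x = tok_emb[token_ids] + pos_emb                          # (n_tokens, d_model)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 2) One causal self-attention layer (single head for brevity;
#    residuals and layer norm are omitted)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf                                  # causal: no peeking at the future
attn_out = softmax(scores) @ v

# 3) A small MLP refines each position (ReLU here; GPT-2 uses GELU)
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
h = np.maximum(attn_out @ W1, 0) @ W2

# 4) The vector at the last position becomes next-token logits over the vocabulary
W_out = rng.normal(size=(d_model, vocab_size))
logits = h[-1] @ W_out
probs = softmax(logits)
print("most likely next token id:", int(probs.argmax()))
```

With random weights the prediction is meaningless, of course; the point is the shape of the computation: embed, attend, refine, score.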

Causal self-attention

“Causal” means token i can only attend to tokens ≤ i (no peeking at the future). Multiple attention heads let the model capture different relationships in parallel.
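
Below is a minimal sketch of multi-head causal attention, again with random weights and toy dimensions (4 heads of 4 dimensions each, not GPT-2's real configuration). It shows the two ideas in this paragraph: an upper-triangular mask that blocks attention to future positions, and independent heads that each produce their own attention pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(n_tokens, d_model))          # stand-in for embedded tokens

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Project to queries/keys/values, then split into heads: (n_heads, n_tokens, d_head)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
def split_heads(m):
    return m.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)
q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

# Scaled dot-product scores, then the causal mask: token i attends only to j <= i
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # (n_heads, n_tokens, n_tokens)
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[:, future] = -np.inf
weights = softmax(scores)                                   # each row sums to 1

# Each head mixes values with its own weights; heads are then concatenated back together
out = (weights @ v).transpose(1, 0, 2).reshape(n_tokens, d_model)

print(weights[0].round(2))    # head 0: a lower-triangular matrix, one row per query token
```

Printing `weights[0]` gives exactly the kind of lower-triangular matrix the heatmap below visualizes; switching heads in the widget corresponds to looking at a different slice of `weights`.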

How to use the visualization

  • Write a short prompt and select a layer/head.
  • Click a token to see which earlier tokens it attends to.
  • Look at the “next-token” panel to connect attention → prediction.
GPT-2 (Transformer) Intuition
A tiny, educational transformer: causal self-attention over your prompt, visualized as a heatmap in the spirit of a “transformer explainer”.
layers: 2 · heads: 4 · tokens: 10
Prompt
Tip: keep it short; the heatmap is N×N tokens.
Self-Attention Heatmap
focus token: 10
Each row is “where token i looks” (weights sum to 1). GPT-2 uses a causal mask: tokens can’t attend to the future.
Attention From Focus Token
pos  token  weight
 1   The    0.40
 2   cat    0.05
 3   sat    0.19
 4   on     0.09
 5   the    0.10
 6   mat    0.04
 7   ,      0.03
 8   and    0.07
 9   then   0.01
10   it     0.03
Next-Token (Toy)
token  probability
it     73.5%
cat     6.0%
to      4.3%
is      3.6%
sat     3.6%
in      2.3%
the     1.7%
.       1.5%
This is a simplified simulation (not real GPT-2 weights/tokenization), but it shows the mechanics: embeddings → self-attention → next-token scores.
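
As a last piece of glue, this is roughly how a panel like “Next-Token (Toy)” turns logits into the percentages shown above: a softmax followed by sorting. The tokens and logit values here are arbitrary illustrations, not output from any real model.

```python
import numpy as np

# Hypothetical next-token logits over a tiny vocabulary (arbitrary values)
vocab = ["it", "cat", "to", "is", "sat", "in", "the", "."]
logits = np.array([4.0, 1.5, 1.2, 1.0, 1.0, 0.6, 0.3, 0.2])

# Softmax: subtract the max for numerical stability, exponentiate, normalize
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Rank tokens by probability, highest first
for tok, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{tok:>4s}  {100 * p:5.1f}%")
```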