How LLMs Think
From tokens to reasoning: understand the complete pipeline inside a large language model - tokenization, embeddings, attention, transformers, training, generation, and emergent reasoning.
Tokens
Token
A small chunk of text - roughly syllable-sized - that serves as the fundamental unit of input and output for a language model. Every computation, every cost, and every latency measurement is denominated in tokens.
Before a language model can process text, it must convert that text into a form it can do math on. Text as you type it is a string of characters. The model needs numbers. The first step is tokenization: breaking input text into small chunks called tokens and mapping each to a unique integer.
Are tokens words? Close, but not quite. Common short words like "the" or "and" are single tokens. Longer or rarer words get split into multiple tokens. The word "tokenization" itself might split into "token" + "ization" - two tokens for one word.
How Byte Pair Encoding Works
Byte Pair Encoding (BPE)
The algorithm used by most modern language models to build their vocabulary. It starts with individual characters and iteratively merges the most frequent adjacent pairs until reaching a target vocabulary size.
The method behind most modern tokenizers is called Byte Pair Encoding (BPE). The core idea is elegant:
- Start with the raw characters of a language
- Scan a large text corpus for the two characters that appear side by side most frequently
- Merge that pair into a single new token
- Repeat: find the next most frequent adjacent pair, merge it
- Continue for tens of thousands of iterations until you reach the desired vocabulary size
The result is a vocabulary that mirrors the actual patterns of the language. Extremely common sequences like "ing", "the", or "tion" become single tokens because they appeared constantly during merging. Rare sequences stay split into smaller pieces because they never triggered a merge.
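The merge loop above can be sketched in a few lines of Python. This is a toy word-level version for illustration - production tokenizers such as GPT-2's run the same idea over raw bytes and far larger corpora:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with one merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
print(bpe_train(corpus, 4))
```

Run it and you can watch common fragments like "est" fuse into single symbols while rare ones stay split.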
| Model | Year | Vocabulary Size |
|---|---|---|
| GPT-2 | 2019 | 50,257 tokens |
| Llama 2 | 2023 | 32,000 tokens |
| GPT-4 | 2023 | ~100,000 tokens |
| Gemini 2.0 | 2025 | 262,000 tokens |
Vocabulary sizes have grown dramatically - an eightfold increase from Llama 2 to Gemini in just two years. Larger vocabularies mean shorter token sequences for the same text, which reduces computation in the attention layers.
The Practical Token Math
In clean English prose, one token is roughly three-quarters of a word. So 1,000 tokens ≈ 750 words. But this ratio shifts dramatically with different content types.
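The rule of thumb can be wrapped in a small helper. The 4/3 tokens-per-word ratio comes from the figure above; the price used here is an illustrative assumption, not any provider's actual rate:

```python
def estimate_tokens(word_count, tokens_per_word=4/3):
    """~4/3 tokens per word for clean English prose (1 token ≈ 0.75 words)."""
    return round(word_count * tokens_per_word)

def estimate_cost(word_count, usd_per_1k_tokens, tokens_per_word=4/3):
    """Approximate API cost for a word count (hypothetical price)."""
    return estimate_tokens(word_count, tokens_per_word) * usd_per_1k_tokens / 1000

print(estimate_tokens(750))       # → 1000 tokens
print(estimate_cost(750, 0.01))   # → 0.01 (at a hypothetical $0.01 per 1K tokens)
```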
Key Takeaway
Tokens are the currency of everything in the LLM system. Context windows are measured in tokens. API costs are charged per token. Latency scales with token count. What you pay per word depends on the words.
Tokenization Pitfalls
Tokenization is not just a preprocessing step. It creates concrete failure modes that affect accuracy, cost, and quality. Understanding these pitfalls is essential for anyone building with LLMs.
The Number Problem
Token Boundary
The point where a tokenizer splits text into separate tokens. Token boundaries that fall in the middle of meaningful units (like numbers) can prevent the model from processing them correctly.
Numbers are where tokenization fails most visibly. "480" might be a single token, while "481" splits into "4" and "81". The model never sees digits aligned for column addition.
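A toy greedy tokenizer with a hypothetical vocabulary makes the boundary problem concrete (real BPE merge tables differ, but the effect is the same):

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization against a fixed vocabulary (toy)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first; fall back to single characters.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# Hypothetical vocabulary: "480" was merged during BPE training, "481" was not.
vocab = {"480", "4", "81", "1"}
print(greedy_tokenize("480", vocab))   # → ['480']
print(greedy_tokenize("481", vocab))   # → ['4', '81']
```

Adjacent numbers get carved along entirely different boundaries, so the model never sees a consistent digit-by-digit view.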
Research at ICLR 2025 demonstrated that simply reversing the tokenization order (right-to-left instead of left-to-right) improves arithmetic accuracy by over 22 percentage points. The math is the same. The tokens are different. The answers change.
Code Is Expensive
Python code tokenizes far less efficiently than English prose. Variable names, operators, brackets, and indentation each become separate tokens. A single line of code can easily consume 15-20 tokens.
| Content Type | Tokens per Word | Cost Multiplier |
|---|---|---|
| English prose | ~1.3 | 1× (baseline) |
| Technical writing | ~1.5 | 1.2× |
| Python code | ~2.5 | 1.9× |
| JSON / structured data | ~3.0 | 2.3× |
| Non-English (e.g. Chinese) | ~2-4 | 1.5-3× |
| Mathematical notation | ~3-5 | 2.5-4× |
The Multilingual Gap
Most vocabularies were built primarily on English text. When you send text in an underrepresented language, individual words fragment into many small tokens - each covering just a few characters. This means:
- More tokens → more computation per request
- Higher API costs for the same semantic content
- Lower quality output because the model has less room in its context window
Real Impact: Research on Ukrainian showed that models spend significantly more compute on the same semantic content compared to English because the tokenizer fragments it so heavily. A 128K-token context window holds far fewer words in fragmented languages.
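The fragmentation effect can be sketched with the same kind of toy greedy tokenizer and a hypothetical English-heavy vocabulary (illustrative only; real tokenizers use byte-level BPE and do have some multilingual merges):

```python
def greedy_tokenize(text, vocab):
    """Longest-match tokenization; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in vocab or length == 1:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

# Hypothetical vocabulary built almost entirely from English text.
vocab = {"the", " cat", " sat"}
english = "the cat sat"
ukrainian = "кіт сидів"   # roughly "the cat sat" in Ukrainian

print(len(greedy_tokenize(english, vocab)))    # → 3 tokens
print(len(greedy_tokenize(ukrainian, vocab)))  # → 9 tokens (one per character)
```

Same meaning, triple the tokens - triple the compute and cost, and a third of the effective context window.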
Key Takeaway
Tokenization is not neutral. It encodes biases from the training corpus - favoring English, common words, and natural language over code, numbers, and underrepresented languages. Every token boundary is a decision that affects model performance.
Embeddings
Embedding
A vector (list of numbers) that represents a token in a high-dimensional space, where proximity encodes semantic similarity. Modern LLMs use embeddings with 768 to 12,288 dimensions.
After tokenization, the model has a sequence of integers. But integer IDs carry no semantic information - the number 472 being close to 473 says nothing about their meanings. The tokens for "happy" and "joyful" might have IDs thousands apart, despite meaning nearly the same thing.
Before any processing begins, each integer token gets converted into an embedding: a list of floating-point numbers, typically between 768 and 12,288 values depending on model size.
Meaning as Geometry
The intuition: imagine placing every word in the English language as a point in a high-dimensional space. Similar words cluster together. Unrelated words sit far apart. But words can be similar in many ways simultaneously - meaning, grammar, category, analogy - and with enough dimensions, you can capture all of these relationships at once.
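The arithmetic can be sketched with hand-crafted 2-dimensional vectors. Real embeddings have hundreds of learned dimensions; the axes here (dimension 0 ≈ "royalty", dimension 1 ≈ "gender") are an assumption for illustration:

```python
import math

# Hand-crafted toy embeddings; real models learn these from data.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, component by component
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest remaining word to the result:
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)   # → queen
```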
This famous result - king − man + woman ≈ queen - demonstrated in 2013 that vector arithmetic on embeddings produces semantically meaningful results. There exists a direction in the space corresponding to gender, and arithmetic along that direction transforms concepts accordingly.
Static vs. Contextual Embeddings
The king/queen example uses static embeddings, where each word has one fixed vector. Modern language models are more sophisticated - the same token gets different vector representations depending on context.
| Property | Static Embeddings (Word2Vec) | Contextual Embeddings (Transformers) |
|---|---|---|
| Same word, different contexts | Same vector always | Different vector per context |
| "bank" (river vs. financial) | One vector for both meanings | Distinct vectors for each meaning |
| Dimensions | 100-300 | 768-12,288 |
| Learned from | Word co-occurrence | Full sequence context via attention |
| Examples | Word2Vec, GloVe | GPT, BERT, Claude, Llama |
When text gets tokenized and embedded, you start with vectors that mostly encode token identity. By the time the sequence has passed through all the model's layers, each vector has absorbed information from every other token around it. The embedding for "bank" encodes whether it's geographical or financial, based on all surrounding context.
The process that updates these representations - transforming static token vectors into rich, context-aware representations - is the attention mechanism.
Key Takeaway
Embeddings are the bridge between human language and machine computation. They convert discrete tokens into a continuous space where proximity means similarity, arithmetic means analogy, and every dimension captures some facet of meaning. Modern embeddings are contextual - the same word gets different representations depending on what surrounds it.
Self-Attention
Self-Attention
A mechanism that lets each token in a sequence compute how relevant every other token is to it, then blend information from the most relevant tokens into its own representation. This is the core operation inside every transformer.
The 2017 paper that introduced the transformer architecture was titled "Attention Is All You Need." It was not modest. It was also correct.
The core problem attention solves: when processing a given token, which other tokens in the sequence are most relevant to understanding it?
The Coreference Problem
Consider this sentence: "The animal didn't cross the street because it was too tired."
What does "it" refer to? The animal, not the street. You resolved that immediately. A model has to do the same thing computationally. is the mechanism that makes this possible.
Query, Key, Value
Query-Key-Value (QKV)
The three vectors computed from each token's embedding. The Query asks "what am I looking for?", the Key advertises "what do I contain?", and the Value provides "what information do I contribute if attended to."
For every token, the model computes three vectors from its embedding:
- Query (Q): What this token is looking for
- Key (K): What this token is advertising about itself
- Value (V): What this token actually contributes if attended to
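A pure-Python sketch of scaled dot-product attention. For simplicity it uses the embeddings directly as Q, K, and V; a real model first multiplies them by learned projection matrices:

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head (pure-Python sketch)."""
    d_k = len(K[0])
    out = []
    for q in Q:                                   # each token's query...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                     # ...scored against every key
        weights = softmax(scores)                 # relevance distribution
        # Blend value vectors, weighted by relevance.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens with 2-dimensional toy vectors (Q = K = V = embeddings here).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)
print([[round(v, 2) for v in row] for row in result])
```

Each output row is a weighted blend of all three value vectors - the token's new, context-aware representation.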
The computation happens simultaneously for every token attending to every other token - all at once, not sequentially. This is what makes transformers parallelizable and fast.
Why Attention Replaced Everything Before It
| Property | Recurrent Networks (RNN/LSTM) | Self-Attention (Transformer) |
|---|---|---|
| Processing order | Sequential (left to right) | All positions at once |
| Long-range connections | Information degrades over distance | Direct connection between any two tokens |
| Parallelization | Cannot parallelize across time steps | Fully parallelizable |
| Computation per layer | O(n × d²) | O(n² × d) |
| Training speed | Slow (sequential bottleneck) | Fast (GPU-parallel) |
Recurrent networks processed tokens sequentially. To get information from the beginning of a long sequence to the end, it had to pass through every intermediate step - and in practice, it rarely survived the trip. Early context degraded or disappeared entirely.
Self-attention eliminates this distance problem. Every token can directly attend to every other token in a single operation, regardless of distance.
Key Takeaway
Self-attention is the operation that lets a model understand context. For each token, it computes relevance scores against every other token, then blends information from the most relevant ones. This single mechanism - scaled up - is what makes modern language models work.
Multi-Head Attention
Multi-Head Attention
Running multiple independent attention computations in parallel, each with its own learned Query, Key, and Value matrices. Each head can specialize in tracking different types of relationships in the text.
A single attention computation can only capture one type of relationship at a time. But language has many simultaneous relationships: syntax, semantics, coreference, temporal ordering, and more. The solution: run many attention heads in parallel.
How It Works
Large models run 32 to 96 attention heads in parallel. Each head has its own set of weight matrices, so each head learns to attend to different types of patterns.
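A minimal sketch of the head-splitting idea: slice each vector into per-head pieces, attend within each slice independently, then concatenate. Real models also apply learned per-head Q/K/V projections and a final output projection, omitted here for brevity:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [v / sum(e) for v in e]

def attend(Q, K, V):
    """Scaled dot-product attention (same computation as single-head)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head(X, num_heads):
    """Split vectors across heads, attend per head, concatenate results."""
    d = len(X[0])
    assert d % num_heads == 0
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = [x[h * hd:(h + 1) * hd] for x in X]   # this head's slice of each token
        heads.append(attend(sl, sl, sl))
    # Concatenate head outputs back to the original dimension.
    return [sum((heads[h][t] for h in range(num_heads)), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]]
out = multi_head(X, num_heads=2)
print(len(out), len(out[0]))   # → 2 4 (shape preserved)
```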
Heads Specialize for Specific Tasks
Interpretability research in 2025, studying individual attention heads in Mistral-7B, found that specific heads specialize for concrete semantic categories:
| Head Type | What It Tracks | Evidence |
|---|---|---|
| Coreference heads | Pronoun resolution ('it' → 'the animal') | Consistent cross-sentence linking |
| Syntactic heads | Subject-verb agreement, clause structure | Grammar-sensitive attention patterns |
| Nationality heads | Country/nationality relationships | Cluster around geopolitical tokens |
| Temporal heads | Time references and ordering | Attend to dates, tenses, sequences |
| Retrieval heads | Pull facts from long contexts | Pruning them causes hallucination |
Retrieval Heads Are Critical: Researchers identified a class of attention heads responsible for pulling relevant information from long contexts. When you prune those specific heads, the model starts hallucinating. When you prune other heads, retrieval ability is unaffected. Specific heads, specific jobs.
The Concatenation Step
After all heads compute independently, their outputs are concatenated back into a single vector of the original model dimension, then passed through a final linear projection. This lets the model combine insights from all heads into a unified representation.
Key Takeaway
Multi-head attention is what gives transformers their expressive power. Each head is a different lens on the same data - one tracking grammar, another tracking meaning, another tracking long-range references. The model combines all these perspectives into a single, rich representation for each token.
Transformer Architecture
Transformer Block
A single processing unit consisting of multi-head attention followed by a feedforward network, with residual connections and layer normalization around each. Large models stack 80 to 120 of these blocks.
Multi-head attention is the core mechanism, but a complete transformer block pairs it with several other components that make deep networks trainable.
The Four Components
One transformer block consists of:
- Multi-Head Self-Attention - each token gathers context from all others
- Feedforward Network (FFN) - two linear layers with a nonlinearity, applied independently to each token
- Residual Connections - shortcut paths that add the input directly to the output
- Layer Normalization - keeps activation values in a stable range
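The four components compose like this. The attention and feedforward sub-layers below are simplified stand-ins (token averaging and fixed toy weights) so the block structure - residual add, then normalize - stays visible:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def ffn(x):
    """Stand-in feedforward net: expand 4x, ReLU, project back.
    Weights are fixed toy values; real models learn them."""
    hidden = [max(0.0, v * w) for v in x for w in (0.5, -0.5, 1.0, 2.0)]
    return [sum(hidden[i * 4:(i + 1) * 4]) / 4 for i in range(len(x))]

def attention_stub(X):
    """Stand-in for multi-head attention: each token averages all tokens."""
    n, d = len(X), len(X[0])
    mean = [sum(X[t][j] for t in range(n)) / n for j in range(d)]
    return [mean[:] for _ in range(n)]

def transformer_block(X):
    # Sub-layer 1: attention, with residual connection + layer norm
    X = [layer_norm([a + b for a, b in zip(x, attn)])
         for x, attn in zip(X, attention_stub(X))]
    # Sub-layer 2: feedforward, with residual connection + layer norm
    return [layer_norm([a + b for a, b in zip(x, ffn(x))]) for x in X]

X = [[1.0, 2.0, 3.0], [0.5, 0.0, -0.5]]
out = transformer_block(X)
print(len(out), len(out[0]))   # → 2 3 (shape preserved, so blocks can stack)
```

Because input and output shapes match, you can feed the output straight into another block - which is exactly what stacking 80-120 of them means.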
Why Residual Connections Matter
Residual Connection
A shortcut that adds a layer's input directly to its output: output = input + layer(input). This lets each layer learn a small modification rather than a complete rewrite, and allows gradients to flow backward without vanishing.
Without residual connections, training a 100+ layer network would be impossible. Gradients would vanish as they propagated backward through dozens of layers. The residual shortcut gives gradients a direct highway back to earlier layers.
| Component | Purpose | What Happens Without It |
|---|---|---|
| Multi-Head Attention | Gather context from other tokens | No inter-token communication |
| Feedforward Network | Transform individual token representations | Limited representational capacity |
| Residual Connection | Preserve information across layers | Gradient vanishing; untrainable at depth |
| Layer Normalization | Stabilize activation magnitudes | Training instability; exploding values |
Why the FFN Is Bigger Than You'd Expect
The feedforward network in each block expands to 4× the model dimension, then projects back down. In a model with d_model = 4096, the FFN's hidden layer is 16,384 dimensions. This expansion is where much of the model's factual knowledge is believed to be stored - as patterns in the weight matrices.
Key Takeaway
A transformer block is attention + feedforward + residual connections + normalization. The shape of data is preserved through every block, which is what lets you stack 80-120 of them. Each block reads the current representation, enriches it slightly, and passes it on. The architecture is simple. The depth is what creates capability.
Depth & the Residual Stream
Residual Stream
A conceptual model of how transformers work: rather than each layer creating a new representation from scratch, all layers read from and write to a single shared vector space. The representation is collective and cumulative.
Large models stack 80 to 120 transformer blocks. What is actually happening across all those layers?
What Each Layer Knows
Interpretability research has established a consistent pattern:
| Layer Depth | What It Captures | Examples |
|---|---|---|
| Early layers (1-20) | Surface features, syntax, part of speech | Word boundaries, grammar rules, punctuation patterns |
| Middle layers (20-60) | Semantic content, entity types, relationships | "Paris is a city", "Einstein was a physicist" |
| Later layers (60-100+) | Abstract representations, task-specific features | Next-token prediction, reasoning patterns, output formatting |
The Residual Stream View: Rather than thinking of each transformer block as creating a new representation, researchers now understand the blocks as successively writing into a single, shared space. Each block reads from this stream and adds to it. No single block owns the representation - it accumulates gradually across all layers.
Visualizing the Stream
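A toy sketch of the stream: each layer reads the shared vector, adds a small update, and writes it back. The update values here are made up for illustration; a real layer computes its update from the whole sequence:

```python
# Toy residual stream for a single token position.
stream = [0.0, 0.0, 0.0, 0.0]          # the token's shared representation

layer_updates = [
    [0.9, 0.1, 0.0, 0.0],              # early layer: surface features
    [0.0, 0.7, 0.3, 0.0],              # middle layer: semantic content
    [0.0, 0.0, 0.2, 0.8],              # late layer: next-token features
]

for i, update in enumerate(layer_updates, start=1):
    # Read from the stream, add this layer's contribution, write back.
    stream = [s + u for s, u in zip(stream, update)]
    print(f"after layer {i}: {stream}")
```

No layer overwrites the stream; every layer's contribution remains present in the final vector.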
Practical Implications
This cumulative architecture has important consequences:
- No single "understanding" layer exists. Comprehension builds gradually - you can't point to one layer and say "this is where the model understands the sentence."
- Early exit is possible. For simple tasks, the model may have enough information after 30 layers. Research on "early exit" strategies shows you can skip later layers on easy inputs to save computation.
- Layer pruning works selectively. Some layers contribute more than others. Studies show you can remove certain middle layers with minimal quality loss - they were adding redundant information to the stream.
Key Takeaway
Understanding in a transformer does not live in any one place. It accumulates across the full depth of the network, with each layer reading from and writing to a shared residual stream. The first layers capture what words are. The middle layers capture what they mean. The last layers decide what comes next.
Training & Scaling Laws
Pre-training
The initial training phase where a language model learns to predict the next token across trillions of tokens of text. This establishes the model's core capabilities - language understanding, factual knowledge, and reasoning patterns.
The training objective for language models is almost insultingly simple: predict the next token.
The Training Loop
Take an enormous corpus of text - hundreds of billions to trillions of words. Feed sequences of tokens into the model. At each position, ask: what does the model predict comes next? Compare the prediction to the actual next token. Compute the error. Propagate that error backward through all layers. Adjust each weight by a tiny amount in the direction that reduces the error.
Repeat across enough token positions and the model's weights organize into a compressed representation of enormous amounts of human knowledge.
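The loop collapses to a few lines if the "model" is just one vector of logits. This toy version shows the mechanics - softmax, cross-entropy, gradient step - that a real training run applies to billions of weights via backpropagation:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [v / sum(e) for v in e]

# Toy "model": one set of logits over a 4-token vocabulary,
# trained to predict that token 2 always comes next.
target, lr = 2, 0.5
logits = [0.0, 0.0, 0.0, 0.0]

for step in range(200):
    probs = softmax(logits)
    loss = -math.log(probs[target])   # cross-entropy on the true next token
    # Gradient of softmax + cross-entropy w.r.t. logits is (probs - one_hot).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = [w - lr * g for w, g in zip(logits, grad)]   # descend the gradient

final = softmax(logits)
print(round(loss, 4), final.index(max(final)))   # loss shrinks; argmax is 2
```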
Chinchilla Scaling Laws
In 2022, DeepMind published the Chinchilla scaling laws: the compute-optimal training ratio is approximately 20 tokens per parameter. A 70B parameter model should train on ~1.4 trillion tokens.
| Model | Parameters | Training Tokens | Tokens/Parameter | Strategy |
|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7× | Undertrained (pre-Chinchilla) |
| Chinchilla (2022) | 70B | 1.4T | 20× | Compute-optimal |
| Llama 3 8B (2024) | 8B | 15T | 1,875× | Inference-optimal |
| Qwen3-0.6B (2025) | 0.6B | 36T | 60,000× | Maximum overtrain |
The Chinchilla Trap: Training only to the compute-optimal point gives you a model that's too large and expensive to run at inference. Modern labs train much smaller models far beyond the Chinchilla ratio - Llama 3's 8B model at 1,875× - because smaller models cost less to serve at scale.
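The table's ratios are simple arithmetic; a small hypothetical helper reproduces them:

```python
def chinchilla_optimal_tokens(params):
    """Compute-optimal training tokens per the ~20 tokens/parameter rule."""
    return 20 * params

def overtrain_ratio(params, training_tokens):
    """How far past (or short of) the Chinchilla point a model was trained."""
    return training_tokens / params

print(chinchilla_optimal_tokens(70e9) / 1e12)   # Chinchilla 70B → 1.4 (trillion)
print(overtrain_ratio(8e9, 15e12))              # Llama 3 8B → 1875.0
```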
Pre-training vs. Post-training
Pre-training on next-token prediction teaches the model what language looks like. After that, most models go through post-training, where human preferences guide fine-tuning toward useful behavior.
Pre-training establishes capability. Post-training shapes behavior.
Key Takeaway
Language models learn by predicting the next token across trillions of words. The scaling laws reveal that the relationship between model size, data, and compute follows predictable patterns - and that modern practice favors smaller, heavily trained models over large, undertrained ones.
Generation & Sampling
Autoregressive Generation
The process of generating text one token at a time, where each new token is sampled from a probability distribution and then fed back as input for predicting the next token.
When you submit a prompt, here is what happens:
1. Your input gets tokenized and embedded
2. The sequence passes through all transformer blocks
3. The final block outputs a probability distribution over the vocabulary (32K-262K entries)
4. The model samples one token from that distribution
5. That token gets appended to the sequence
6. Steps 2-5 repeat until an end-of-sequence token or length limit

This is autoregressive generation. One token at a time, each feeding into the next forward pass.
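The loop can be sketched with a stub model - a hand-written distribution standing in for the full transformer stack:

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]

def model(tokens):
    """Stub: returns a probability distribution over VOCAB given the
    sequence so far. A real model runs the full transformer stack here."""
    if len(tokens) >= 3:
        return [0.0, 0.0, 0.0, 1.0]      # force <eos> once the toy text is long
    return [0.1, 0.4, 0.4, 0.1]

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = model(tokens)
        next_id = rng.choices(range(len(VOCAB)), weights=probs)[0]  # sample one token
        tokens.append(VOCAB[next_id])    # append it and feed the sequence back in
        if VOCAB[next_id] == "<eos>":
            break
    return tokens

print(generate(["the"]))
```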
Temperature Controls Randomness
Temperature
A parameter that controls how peaked or flat the probability distribution is before sampling. Low temperature (→ 0) makes the model deterministic. High temperature (→ 2) makes it creative but potentially incoherent.
Temperature divides the logits (raw scores) before the softmax function:
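A minimal sketch of the operation, using made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax.
    T < 1 sharpens the distribution; T > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    e = [math.exp(x - m) for x in scaled]
    return [v / sum(e) for v in e]

logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At T = 0.1 nearly all probability piles onto the top token; at T = 2.0 the distribution flattens toward uniform.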
Top-p (Nucleus) Sampling
Top-p Sampling
A sampling strategy that limits the candidate pool to the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g., 0.9). This prevents sampling extremely unlikely tokens while preserving natural diversity.
Most production systems combine temperature with top-p sampling:
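A sketch of the nucleus filter, assuming the made-up probabilities below as the model's output:

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p,
    then renormalize. Returns (kept indices, renormalized probabilities)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    renorm = [probs[i] / total for i in kept]
    return kept, renorm

probs = [0.55, 0.30, 0.10, 0.04, 0.01]   # model's output distribution
kept, renorm = top_p_filter(probs, p=0.9)
print(kept)                              # → [0, 1, 2]: 0.55 + 0.30 < 0.9, so token 2 joins
print([round(v, 3) for v in renorm])

# Sample only from the nucleus:
idx = random.Random(0).choices(kept, weights=renorm)[0]
```

The long tail of near-zero-probability tokens is cut off entirely, which is what prevents the occasional absurd token without flattening the rest of the distribution.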
| Setting | Temperature | Top-p | Use Case |
|---|---|---|---|
| Deterministic | 0 | 1.0 | Code generation, data extraction, pipelines |
| Balanced | 0.7 | 0.9 | General conversation, Q&A |
| Creative | 1.0 | 0.95 | Creative writing, brainstorming |
| Experimental | 1.5 | 0.99 | Poetry, unusual ideas (may lose coherence) |
Key Takeaway
The same prompt does not give the same answer every time. Unless temperature is zero, there is randomness in the sampling. This is not a bug - for creative tasks, diversity is the point. For deterministic pipelines, set temperature to zero. That control is always available.
Emergent Reasoning
Is a language model actually reasoning? Or is it a very sophisticated pattern matcher producing text that resembles reasoning?
The question has gotten considerably more interesting in the past year.
Chain-of-Thought Prompting
Chain-of-Thought (CoT)
A prompting technique where the model is asked to "think step by step," generating intermediate reasoning tokens that serve as working memory for subsequent steps. This significantly improves performance on complex reasoning tasks.
When you ask a model to think step by step, performance on complex tasks improves dramatically. This is not just stylistic - writing out intermediate steps forces the model to generate tokens that encode intermediate results, which become context for the next steps.
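In practice this is often just a prompt-construction choice. A hypothetical helper - the cue here is the classic zero-shot phrasing; exact wording varies by model:

```python
def build_prompt(question, chain_of_thought=False):
    """Append a step-by-step cue to elicit intermediate reasoning tokens."""
    if chain_of_thought:
        return f"{question}\nLet's think step by step."
    return question

q = ("A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. "
     "How much is the ball?")
print(build_prompt(q, chain_of_thought=True))
```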
DeepSeek-R1: Reasoning from Reinforcement Learning
The chain-of-thought research assumed that teaching reasoning required human-labeled reasoning examples. DeepSeek's R1 paper in January 2025 challenged this assumption.
The experiment: Train a model on complex reasoning tasks using pure reinforcement learning. No human-labeled reasoning examples. The only reward signal: whether the final answer was correct.
What emerged was striking. The model began generating longer responses that incorporated:
- Self-verification - checking its own intermediate results
- Self-reflection - recognizing when an approach isn't working
- Dynamic strategy adaptation - switching methods mid-solution
The researchers tracked the word "wait" (appearing as a self-correction signal) across training steps:
| Training Step | Frequency of 'wait' | AIME 2024 Score | Behavior |
|---|---|---|---|
| Step 0 | Absent | 15.6% | Direct answers, no reflection |
| Step 4,000 | Sporadic | ~40% | Occasional self-correction |
| Step 8,000+ | Frequent | 77.9% | Systematic verification and reflection |
The model discovered, without being told, that pausing to reconsider was useful. The 77.9% score on AIME 2024 exceeds the average performance of human competitors.
What This Means
The behaviors researchers assumed required human demonstration emerged from a reward signal alone. The model was not told to reflect. It was told to get the right answer. Reflection was what it discovered.
Key Takeaway
The output text functions as working memory. Chain-of-thought forces the model to generate intermediate tokens that encode reasoning steps. DeepSeek-R1 showed that sophisticated reasoning behaviors - self-verification, reflection, strategy switching - can emerge from pure reinforcement learning without any human-labeled reasoning examples. The model learns to think because thinking helps it get the right answer.
Interpretability
Mechanistic Interpretability
The field of research that attempts to understand what happens inside neural networks by identifying specific circuits, features, and computations - moving from treating models as black boxes to understanding their internal mechanisms.
Emergent reasoning is striking. But what is actually happening inside the model while it produces these outputs?
The Polysemanticity Problem
Polysemanticity
The phenomenon where individual neurons in a neural network activate in response to many different, unrelated concepts simultaneously - because the model needs to represent more concepts than it has neurons.
Individual neurons in a language model are polysemantic: they activate for many unrelated concepts at once. A single neuron might fire for both "legal documents" and "cooking recipes" and "the color blue." This makes direct interpretation of individual neurons nearly impossible.
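A toy illustration of why this happens: with more concepts than neurons, concept directions must share neurons. The vectors and weights below are made up; real models learn both:

```python
# Three unrelated concept directions packed into a 2-neuron activation space.
concepts = {
    "legal documents": [1.0, 0.0],
    "cooking recipes": [0.71, 0.71],
    "the color blue":  [0.0, 1.0],
}

neuron_weights = [0.9, 0.4]   # one neuron's (made-up) readout weights

def activation(concept_vec):
    """How strongly this single neuron fires for a given concept."""
    return sum(w * c for w, c in zip(neuron_weights, concept_vec))

for name, vec in concepts.items():
    print(f"{name}: {activation(vec):.2f}")   # the same neuron fires for all three
```

Sparse autoencoders attack exactly this: they learn to pull the tangled concept directions apart into separate, interpretable features.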
What Sparse Autoencoders Found
Sparse Autoencoder
A technique that decomposes polysemantic neurons into individual interpretable features by learning a sparse representation where each feature corresponds to a single coherent concept.
Anthropic's work applying sparse autoencoders to Claude 3 Sonnet found features corresponding to surprisingly abstract, high-level concepts:
| Feature Category | Examples Found | Significance |
|---|---|---|
| Behavioral features | Sycophantic praise, deceptive responses | Model has internal representations of its own behavioral tendencies |
| Meta-cognitive features | Confidence assessment, knowledge uncertainty | Model tracks its own knowledge state |
| Task recognition features | Summarization requests, joke detection | Dedicated circuits for recognizing task types |
| Semantic features | Countries, professions, emotions | Clean categorical representations |
| Safety features | Harmful content detection, refusal triggers | Learned safety-relevant representations |
These are not low-level pattern detectors. They are abstract representations that would not be out of place in a theory of cognition. Features for "deception" and "confidence in own knowledge" suggest internal structure far richer than naive pattern matching.
The Honest Position
The distinction between reasoning and very sophisticated pattern matching may be less clean than it sounds. The question of where pattern-matching ends and genuine reasoning begins has not been settled for biological systems either.
What is practically useful:
- Reasoning quality is sensitive to context. The tokens you provide serve as working memory. Structure what the model sees and you structure what it produces.
- Internal representations are richer than expected. The model has abstract features, not just surface patterns.
- The debate may be the wrong question. What matters for practitioners: does the model produce correct, useful output for your task? Understanding the mechanisms helps you engineer better inputs and evaluate outputs more accurately.
Key Takeaway
A language model is a function that takes tokens and returns a probability distribution over what comes next. The computation between input and output is a stack of attention operations, layered 80-120 times, each enriching token representations with contextual information. That this produces behavior we recognize as intelligent is not magic. It is not fake either. It is what happens when you scale a simple objective with enough data and enough compute.
How LLMs Think - Exercises
Test your understanding of language model internals with 7 progressive challenges.
Exercise 1: Token Counting (Warm-up)
Exercise 2: BPE Merge Steps
Exercise 3: Embedding Analogies
Exercise 4: Attention Score Calculator
Exercise 5: Transformer Block Assembly
Exercise 6: Temperature Exploration
Exercise 7: Chain-of-Thought Detection (Stretch)
Build a Transformer from Scratch
From raw text to generated output: build a working transformer language model step by step in pure Python.
Step 1: Setup & Data
Step 2: Tokenizer
Build a simple word-level tokenizer with encode/decode methods.
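One possible solution sketch (your version may differ; this one has no handling for unknown words):

```python
class WordTokenizer:
    """Minimal word-level tokenizer with encode/decode."""
    def __init__(self, text):
        words = sorted(set(text.split()))
        self.word_to_id = {w: i for i, w in enumerate(words)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}

    def encode(self, text):
        return [self.word_to_id[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_word[i] for i in ids)

tok = WordTokenizer("the cat sat on the mat")
ids = tok.encode("the cat sat")
print(ids)
print(tok.decode(ids))   # → "the cat sat"
```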
Step 3: Embeddings
Create an embedding layer that converts token IDs to dense vectors.
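A possible sketch: a randomly initialized lookup table, which training would later adjust:

```python
import random

class Embedding:
    """Lookup table mapping token IDs to dense vectors."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # One small random vector per token ID; gradients would tune these.
        self.table = [[rng.gauss(0, 0.02) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def __call__(self, ids):
        return [self.table[i] for i in ids]

emb = Embedding(vocab_size=6, dim=4)
vectors = emb([0, 3, 3])
print(len(vectors), len(vectors[0]))   # → 3 4
```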
Step 4: Single-Head Attention
Implement the core QKV attention mechanism.
Step 5: Multi-Head Attention
Run multiple attention heads in parallel and concatenate.
Step 6: Transformer Block
Combine attention, feedforward, residual connections, and layer normalization.
Step 7: Text Generation
Build autoregressive generation - predict one token at a time.
Step 8: Temperature & Top-p Sampling
Add temperature and nucleus sampling for controlled generation.
What You Built:
- A tokenizer that converts text to integers and back
- An embedding layer that gives tokens vector meaning
- Single-head and multi-head attention mechanisms
- A full transformer block with residual connections and normalization
- Autoregressive text generation
- Temperature and top-p sampling for controlled output
In production, these same components are scaled to billions of parameters and trained on trillions of tokens.