How LLMs Think
From tokens to reasoning: understand the complete pipeline inside a large language model - tokenization, embeddings, attention, transformers, training, generation, and emergent reasoning.
Tokens
Token
A small chunk of text - roughly syllable-sized - that serves as the fundamental unit of input and output for a language model. Every computation, every cost, and every latency measurement is denominated in tokens.
Before a language model can process text, it must convert that text into a form it can do math on. Text as you type it is a string of characters. The model needs numbers. The first step is tokenization: breaking input text into small chunks called tokens and mapping each to a unique integer.
Are tokens words? Close, but not quite. Common short words like "the" or "and" are single tokens. Longer or rarer words get split into multiple tokens. The word "tokenization" itself might split into "token" + "ization" - two tokens for one word.
How Byte Pair Encoding Works
Byte Pair Encoding (BPE)
The algorithm used by most modern language models to build their vocabulary. It starts with individual characters and iteratively merges the most frequent adjacent pairs until reaching a target vocabulary size.
The method behind most modern tokenizers is called Byte Pair Encoding (BPE). The core idea is elegant:
- Start with the raw characters of a language
- Scan a large text corpus for the two characters that appear side by side most frequently
- Merge that pair into a single new token
- Repeat: find the next most frequent adjacent pair, merge it
- Continue for tens of thousands of iterations until you reach the desired vocabulary size
The result is a vocabulary that mirrors the actual patterns of the language. Extremely common sequences like "ing", "the", or "tion" become single tokens because they appeared constantly during merging. Rare sequences stay split into smaller pieces because they never triggered a merge.
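The merge loop above can be sketched in a few lines of Python. This is a toy word-level version for illustration - production tokenizers such as GPT-2's run the same idea over raw bytes and far larger corpora:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with one merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
print(bpe_train(corpus, 4))
```

Run it and you can watch common fragments like "est" fuse into single symbols while rare ones stay split.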
| Model | Year | Vocabulary Size |
|---|---|---|
| GPT-2 | 2019 | 50,257 tokens |
| Llama 2 | 2023 | 32,000 tokens |
| GPT-4 | 2023 | ~100,000 tokens |
| Gemini 2.0 | 2025 | 262,000 tokens |
Vocabulary sizes have grown dramatically - an eightfold increase from Llama 2 to Gemini in just two years. Larger vocabularies mean shorter token sequences for the same text, which reduces computation in the attention layers.
The Practical Token Math
In clean English prose, one token is roughly three-quarters of a word. So 1,000 tokens ≈ 750 words. But this ratio shifts dramatically with different content types.
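The rule of thumb can be wrapped in a small helper. The 4/3 tokens-per-word ratio comes from the figure above; the price used here is an illustrative assumption, not any provider's actual rate:

```python
def estimate_tokens(word_count, tokens_per_word=4/3):
    """~4/3 tokens per word for clean English prose (1 token ≈ 0.75 words)."""
    return round(word_count * tokens_per_word)

def estimate_cost(word_count, usd_per_1k_tokens, tokens_per_word=4/3):
    """Approximate API cost for a word count (hypothetical price)."""
    return estimate_tokens(word_count, tokens_per_word) * usd_per_1k_tokens / 1000

print(estimate_tokens(750))       # → 1000 tokens
print(estimate_cost(750, 0.01))   # → 0.01 (at a hypothetical $0.01 per 1K tokens)
```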
Key Takeaway
Tokens are the currency of everything in the LLM system. Context windows are measured in tokens. API costs are charged per token. Latency scales with token count. What you pay per word depends on the words.
Tokenization Pitfalls
Tokenization is not just a preprocessing step. It creates concrete failure modes that affect accuracy, cost, and quality. Understanding these pitfalls is essential for anyone building with LLMs.
The Number Problem
Token Boundary
The point where a tokenizer splits text into separate tokens. Token boundaries that fall in the middle of meaningful units (like numbers) can prevent the model from processing them correctly.
Numbers are where tokenization fails most visibly. "480" might be a single token, while "481" splits into "4" and "81". The model never sees digits aligned for column addition.
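A toy greedy tokenizer with a hypothetical vocabulary makes the boundary problem concrete (real BPE merge tables differ, but the effect is the same):

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization against a fixed vocabulary (toy)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first; fall back to single characters.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# Hypothetical vocabulary: "480" was merged during BPE training, "481" was not.
vocab = {"480", "4", "81", "1"}
print(greedy_tokenize("480", vocab))   # → ['480']
print(greedy_tokenize("481", vocab))   # → ['4', '81']
```

Adjacent numbers get carved along entirely different boundaries, so the model never sees a consistent digit-by-digit view.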
Research at ICLR 2025 demonstrated that simply reversing the tokenization order (right-to-left instead of left-to-right) improves arithmetic accuracy by over 22 percentage points. The math is the same. The tokens are different. The answers change.
Code Is Expensive
Python code tokenizes far less efficiently than English prose. Variable names, operators, brackets, and indentation each become separate tokens. A single line of code can easily consume 15-20 tokens.
| Content Type | Tokens per Word | Cost Multiplier |
|---|---|---|
| English prose | ~1.3 | 1× (baseline) |
| Technical writing | ~1.5 | 1.2× |
| Python code | ~2.5 | 1.9× |
| JSON / structured data | ~3.0 | 2.3× |
| Non-English (e.g. Chinese) | ~2-4 | 1.5-3× |
| Mathematical notation | ~3-5 | 2.5-4× |
The Multilingual Gap
Most vocabularies were built primarily on English text. When you send text in an underrepresented language, individual words fragment into many small tokens - each covering just a few characters. This means:
- More tokens → more computation per request
- Higher API costs for the same semantic content
- Lower quality output because the model has less room in its context window
Real Impact: Research on Ukrainian showed that models spend significantly more compute on the same semantic content compared to English because the tokenizer fragments it so heavily. A 128K-token context window holds far fewer words in fragmented languages.
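The fragmentation effect can be sketched with the same kind of toy greedy tokenizer and a hypothetical English-heavy vocabulary (illustrative only; real tokenizers use byte-level BPE and do have some multilingual merges):

```python
def greedy_tokenize(text, vocab):
    """Longest-match tokenization; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in vocab or length == 1:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

# Hypothetical vocabulary built almost entirely from English text.
vocab = {"the", " cat", " sat"}
english = "the cat sat"
ukrainian = "кіт сидів"   # roughly "the cat sat" in Ukrainian

print(len(greedy_tokenize(english, vocab)))    # → 3 tokens
print(len(greedy_tokenize(ukrainian, vocab)))  # → 9 tokens (one per character)
```

Same meaning, triple the tokens - triple the compute and cost, and a third of the effective context window.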
Key Takeaway
Tokenization is not neutral. It encodes biases from the training corpus - favoring English, common words, and natural language over code, numbers, and underrepresented languages. Every token boundary is a decision that affects model performance.
Embeddings
Embedding
A vector (list of numbers) that represents a token in a high-dimensional space, where proximity encodes semantic similarity. Modern LLMs use embeddings with 768 to 12,288 dimensions.
After tokenization, the model has a sequence of integers. But integer IDs carry no semantic information - the number 472 being close to 473 says nothing about their meanings. The tokens for "happy" and "joyful" might have IDs thousands apart, despite meaning nearly the same thing.
Before any processing begins, each integer token gets converted into an embedding: a list of floating-point numbers, typically between 768 and 12,288 values depending on model size.
Meaning as Geometry
The intuition: imagine placing every word in the English language as a point in a high-dimensional space. Similar words cluster together. Unrelated words sit far apart. But words can be similar in many ways simultaneously - meaning, grammar, category, analogy - and with enough dimensions, you can capture all of these relationships at once.
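The arithmetic can be sketched with hand-crafted 2-dimensional vectors. Real embeddings have hundreds of learned dimensions; the axes here (dimension 0 ≈ "royalty", dimension 1 ≈ "gender") are an assumption for illustration:

```python
import math

# Hand-crafted toy embeddings; real models learn these from data.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, component by component
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest remaining word to the result:
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)   # → queen
```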
This famous result - king − man + woman ≈ queen - demonstrated in 2013 that vector arithmetic on embeddings produces semantically meaningful results. There exists a direction in the space corresponding to gender, and arithmetic along that direction transforms concepts accordingly.
Static vs. Contextual Embeddings
The king/queen example uses static embeddings, where each word has one fixed vector. Modern language models are more sophisticated - the same token gets different vector representations depending on context.
| Property | Static Embeddings (Word2Vec) | Contextual Embeddings (Transformers) |
|---|---|---|
| Same word, different contexts | Same vector always | Different vector per context |
| "bank" (river vs. financial) | One vector for both meanings | Distinct vectors for each meaning |
| Dimensions | 100-300 | 768-12,288 |
| Learned from | Word co-occurrence | Full sequence context via attention |
| Examples | Word2Vec, GloVe | GPT, BERT, Claude, Llama |
When text gets tokenized and embedded, you start with vectors that mostly encode token identity. By the time the sequence has passed through all the model's layers, each vector has absorbed information from every other token around it. The embedding for "bank" encodes whether it's geographical or financial, based on all surrounding context.
The process that updates these representations - transforming static token vectors into rich, context-aware representations - is the attention mechanism.
Key Takeaway
Embeddings are the bridge between human language and machine computation. They convert discrete tokens into a continuous space where proximity means similarity, arithmetic means analogy, and every dimension captures some facet of meaning. Modern embeddings are contextual - the same word gets different representations depending on what surrounds it.
Self-Attention
Self-Attention
A mechanism that lets each token in a sequence compute how relevant every other token is to it, then blend information from the most relevant tokens into its own representation. This is the core operation inside every transformer.
The 2017 paper that introduced the transformer architecture was titled "Attention Is All You Need." It was not modest. It was also correct.
The core problem attention solves: when processing a given token, which other tokens in the sequence are most relevant to understanding it?
The Coreference Problem
Consider this sentence: "The animal didn't cross the street because it was too tired."
What does "it" refer to? The animal, not the street. You resolved that immediately. A model has to do the same thing computationally. is the mechanism that makes this possible.
Query, Key, Value
Query-Key-Value (QKV)
The three vectors computed from each token's embedding. The Query asks "what am I looking for?", the Key advertises "what do I contain?", and the Value provides "what information do I contribute if attended to."
For every token, the model computes three vectors from its embedding:
- Query (Q): What this token is looking for
- Key (K): What this token is advertising about itself
- Value (V): What this token actually contributes if attended to
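A pure-Python sketch of scaled dot-product attention. For simplicity it uses the embeddings directly as Q, K, and V; a real model first multiplies them by learned projection matrices:

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head (pure-Python sketch)."""
    d_k = len(K[0])
    out = []
    for q in Q:                                   # each token's query...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                     # ...scored against every key
        weights = softmax(scores)                 # relevance distribution
        # Blend value vectors, weighted by relevance.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens with 2-dimensional toy vectors (Q = K = V = embeddings here).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)
print([[round(v, 2) for v in row] for row in result])
```

Each output row is a weighted blend of all three value vectors - the token's new, context-aware representation.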
The computation happens simultaneously for every token attending to every other token - all at once, not sequentially. This is what makes transformers parallelizable and fast.
Why Attention Replaced Everything Before It
| Property | Recurrent Networks (RNN/LSTM) | Self-Attention (Transformer) |
|---|---|---|
| Processing order | Sequential (left to right) | All positions at once |
| Long-range connections | Information degrades over distance | Direct connection between any two tokens |
| Parallelization | Cannot parallelize across time steps | Fully parallelizable |
| Computation per layer | O(n × d²) | O(n² × d) |
| Training speed | Slow (sequential bottleneck) | Fast (GPU-parallel) |
Recurrent networks processed tokens sequentially. To get information from the beginning of a long sequence to the end, it had to pass through every intermediate step - and in practice, it rarely survived the trip. Early context degraded or disappeared entirely.
Self-attention eliminates this distance problem. Every token can directly attend to every other token in a single operation, regardless of distance.
Key Takeaway
Self-attention is the operation that lets a model understand context. For each token, it computes relevance scores against every other token, then blends information from the most relevant ones. This single mechanism - scaled up - is what makes modern language models work.
Multi-Head Attention
Multi-Head Attention
Running multiple independent attention computations in parallel, each with its own learned Query, Key, and Value matrices. Each head can specialize in tracking different types of relationships in the text.
A single attention computation can only capture one type of relationship at a time. But language has many simultaneous relationships: syntax, semantics, coreference, temporal ordering, and more. The solution: run many attention heads in parallel.
How It Works
Large models run 32 to 96 attention heads in parallel. Each head has its own set of weight matrices, so each head learns to attend to different types of patterns.
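A minimal sketch of the head-splitting idea: slice each vector into per-head pieces, attend within each slice independently, then concatenate. Real models also apply learned per-head Q/K/V projections and a final output projection, omitted here for brevity:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [v / sum(e) for v in e]

def attend(Q, K, V):
    """Scaled dot-product attention (same computation as single-head)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head(X, num_heads):
    """Split vectors across heads, attend per head, concatenate results."""
    d = len(X[0])
    assert d % num_heads == 0
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = [x[h * hd:(h + 1) * hd] for x in X]   # this head's slice of each token
        heads.append(attend(sl, sl, sl))
    # Concatenate head outputs back to the original dimension.
    return [sum((heads[h][t] for h in range(num_heads)), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]]
out = multi_head(X, num_heads=2)
print(len(out), len(out[0]))   # → 2 4 (shape preserved)
```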
Heads Specialize for Specific Tasks
Interpretability research in 2025, studying individual attention heads in Mistral-7B, found that specific heads specialize for concrete semantic categories:
| Head Type | What It Tracks | Evidence |
|---|---|---|
| Coreference heads | Pronoun resolution ('it' → 'the animal') | Consistent cross-sentence linking |
| Syntactic heads | Subject-verb agreement, clause structure | Grammar-sensitive attention patterns |
| Nationality heads | Country/nationality relationships | Cluster around geopolitical tokens |
| Temporal heads | Time references and ordering | Attend to dates, tenses, sequences |
| Retrieval heads | Pull facts from long contexts | Pruning them causes hallucination |
Retrieval Heads Are Critical: Researchers identified a class of attention heads responsible for pulling relevant information from long contexts. When you prune those specific heads, the model starts hallucinating. When you prune other heads, retrieval ability is unaffected. Specific heads, specific jobs.
The Concatenation Step
After all heads compute independently, their outputs are concatenated back into a single vector of the original model dimension, then passed through a final linear projection. This lets the model combine insights from all heads into a unified representation.
Key Takeaway
Multi-head attention is what gives transformers their expressive power. Each head is a different lens on the same data - one tracking grammar, another tracking meaning, another tracking long-range references. The model combines all these perspectives into a single, rich representation for each token.
Transformer Architecture
Transformer Block
A single processing unit consisting of multi-head attention followed by a feedforward network, with residual connections and layer normalization around each. Large models stack 80 to 120 of these blocks.
Multi-head attention is the core mechanism, but a complete transformer block pairs it with several other components that make deep networks trainable.
The Four Components
One transformer block consists of:
- Multi-Head Self-Attention - each token gathers context from all others
- Feedforward Network (FFN) - two linear layers with a nonlinearity, applied independently to each token
- Residual Connections - shortcut paths that add the input directly to the output
- Layer Normalization - keeps activation values in a stable range
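The four components compose like this. The attention and feedforward sub-layers below are simplified stand-ins (token averaging and fixed toy weights) so the block structure - residual add, then normalize - stays visible:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def ffn(x):
    """Stand-in feedforward net: expand 4x, ReLU, project back.
    Weights are fixed toy values; real models learn them."""
    hidden = [max(0.0, v * w) for v in x for w in (0.5, -0.5, 1.0, 2.0)]
    return [sum(hidden[i * 4:(i + 1) * 4]) / 4 for i in range(len(x))]

def attention_stub(X):
    """Stand-in for multi-head attention: each token averages all tokens."""
    n, d = len(X), len(X[0])
    mean = [sum(X[t][j] for t in range(n)) / n for j in range(d)]
    return [mean[:] for _ in range(n)]

def transformer_block(X):
    # Sub-layer 1: attention, with residual connection + layer norm
    X = [layer_norm([a + b for a, b in zip(x, attn)])
         for x, attn in zip(X, attention_stub(X))]
    # Sub-layer 2: feedforward, with residual connection + layer norm
    return [layer_norm([a + b for a, b in zip(x, ffn(x))]) for x in X]

X = [[1.0, 2.0, 3.0], [0.5, 0.0, -0.5]]
out = transformer_block(X)
print(len(out), len(out[0]))   # → 2 3 (shape preserved, so blocks can stack)
```

Because input and output shapes match, you can feed the output straight into another block - which is exactly what stacking 80-120 of them means.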
Why Residual Connections Matter
Residual Connection
A shortcut that adds a layer's input directly to its output: output = input + layer(input). This lets each layer learn a small modification rather than a complete rewrite, and allows gradients to flow backward without vanishing.
Without residual connections, training a 100+ layer network would be impossible. Gradients would vanish as they propagated backward through dozens of layers. The residual shortcut gives gradients a direct highway back to earlier layers.
| Component | Purpose | What Happens Without It |
|---|---|---|
| Multi-Head Attention | Gather context from other tokens | No inter-token communication |
| Feedforward Network | Transform individual token representations | Limited representational capacity |
| Residual Connection | Preserve information across layers | Gradient vanishing; untrainable at depth |
| Layer Normalization | Stabilize activation magnitudes | Training instability; exploding values |
Why the FFN Is Bigger Than You'd Expect
The feedforward network in each block expands to 4× the model dimension, then projects back down. In a model with d_model = 4096, the FFN's hidden layer is 16,384 dimensions. This expansion is where much of the model's factual knowledge is believed to be stored - as patterns in the weight matrices.
Key Takeaway
A transformer block is attention + feedforward + residual connections + normalization. The shape of data is preserved through every block, which is what lets you stack 80-120 of them. Each block reads the current representation, enriches it slightly, and passes it on. The architecture is simple. The depth is what creates capability.
Depth & the Residual Stream
Residual Stream
A conceptual model of how transformers work: rather than each layer creating a new representation from scratch, all layers read from and write to a single shared vector space. The representation is collective and cumulative.
Large models stack 80 to 120 transformer blocks. What is actually happening across all those layers?
What Each Layer Knows
Interpretability research has established a consistent pattern:
| Layer Depth | What It Captures | Examples |
|---|---|---|
| Early layers (1-20) | Surface features, syntax, part of speech | Word boundaries, grammar rules, punctuation patterns |
| Middle layers (20-60) | Semantic content, entity types, relationships | "Paris is a city", "Einstein was a physicist" |
| Later layers (60-100+) | Abstract representations, task-specific features | Next-token prediction, reasoning patterns, output formatting |
The Residual Stream View: Rather than thinking of each transformer block as creating a new representation, researchers now understand the blocks as successively writing into a single, shared space. Each block reads from this stream and adds to it. No single block owns the representation - it accumulates gradually across all layers.
Visualizing the Stream
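A toy sketch of the stream: each layer reads the shared vector, adds a small update, and writes it back. The update values here are made up for illustration; a real layer computes its update from the whole sequence:

```python
# Toy residual stream for a single token position.
stream = [0.0, 0.0, 0.0, 0.0]          # the token's shared representation

layer_updates = [
    [0.9, 0.1, 0.0, 0.0],              # early layer: surface features
    [0.0, 0.7, 0.3, 0.0],              # middle layer: semantic content
    [0.0, 0.0, 0.2, 0.8],              # late layer: next-token features
]

for i, update in enumerate(layer_updates, start=1):
    # Read from the stream, add this layer's contribution, write back.
    stream = [s + u for s, u in zip(stream, update)]
    print(f"after layer {i}: {stream}")
```

No layer overwrites the stream; every layer's contribution remains present in the final vector.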
Practical Implications
This cumulative architecture has important consequences:
- No single "understanding" layer exists. Comprehension builds gradually - you can't point to one layer and say "this is where the model understands the sentence."
- Early exit is possible. For simple tasks, the model may have enough information after 30 layers. Research on "early exit" strategies shows you can skip later layers on easy inputs to save computation.
- Layer pruning works selectively. Some layers contribute more than others. Studies show you can remove certain middle layers with minimal quality loss - they were adding redundant information to the stream.
Key Takeaway
Understanding in a transformer does not live in any one place. It accumulates across the full depth of the network, with each layer reading from and writing to a shared residual stream. The first layers capture what words are. The middle layers capture what they mean. The last layers decide what comes next.
Training & Scaling Laws
Pre-training
The initial training phase where a language model learns to predict the next token across trillions of tokens of text. This establishes the model's core capabilities - language understanding, factual knowledge, and reasoning patterns.
The training objective for language models is almost insultingly simple: predict the next token.
The Training Loop
Take an enormous corpus of text - hundreds of billions to trillions of words. Feed sequences of tokens into the model. At each position, ask: what does the model predict comes next? Compare the prediction to the actual next token. Compute the error. Propagate that error backward through all layers. Adjust each weight by a tiny amount in the direction that reduces the error.
Repeat across enough token positions and the model's weights organize into a compressed representation of enormous amounts of human knowledge.
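The loop collapses to a few lines if the "model" is just one vector of logits. This toy version shows the mechanics - softmax, cross-entropy, gradient step - that a real training run applies to billions of weights via backpropagation:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [v / sum(e) for v in e]

# Toy "model": one set of logits over a 4-token vocabulary,
# trained to predict that token 2 always comes next.
target, lr = 2, 0.5
logits = [0.0, 0.0, 0.0, 0.0]

for step in range(200):
    probs = softmax(logits)
    loss = -math.log(probs[target])   # cross-entropy on the true next token
    # Gradient of softmax + cross-entropy w.r.t. logits is (probs - one_hot).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = [w - lr * g for w, g in zip(logits, grad)]   # descend the gradient

final = softmax(logits)
print(round(loss, 4), final.index(max(final)))   # loss shrinks; argmax is 2
```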
Chinchilla Scaling Laws
In 2022, DeepMind published the Chinchilla scaling laws: the compute-optimal training ratio is approximately 20 tokens per parameter. A 70B parameter model should train on ~1.4 trillion tokens.
| Model | Parameters | Training Tokens | Tokens/Parameter | Strategy |
|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7× | Undertrained (pre-Chinchilla) |
| Chinchilla (2022) | 70B | 1.4T | 20× | Compute-optimal |
| Llama 3 8B (2024) | 8B | 15T | 1,875× | Inference-optimal |
| Qwen3-0.6B (2025) | 0.6B | 36T | 60,000× | Maximum overtrain |
The Chinchilla Trap: Training only to the compute-optimal point gives you a model that's too large and expensive to run at inference. Modern labs train much smaller models far beyond the Chinchilla ratio - Llama 3's 8B model at 1,875× - because smaller models cost less to serve at scale.
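The table's ratios are simple arithmetic; a small hypothetical helper reproduces them:

```python
def chinchilla_optimal_tokens(params):
    """Compute-optimal training tokens per the ~20 tokens/parameter rule."""
    return 20 * params

def overtrain_ratio(params, training_tokens):
    """How far past (or short of) the Chinchilla point a model was trained."""
    return training_tokens / params

print(chinchilla_optimal_tokens(70e9) / 1e12)   # Chinchilla 70B → 1.4 (trillion)
print(overtrain_ratio(8e9, 15e12))              # Llama 3 8B → 1875.0
```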
Pre-training vs. Post-training
Pre-training on next-token prediction teaches the model what language looks like. After that, most models go through post-training, where human preferences guide fine-tuning toward useful behavior.
Pre-training establishes capability. Post-training shapes behavior.
Key Takeaway
Language models learn by predicting the next token across trillions of words. The scaling laws reveal that the relationship between model size, data, and compute follows predictable patterns - and that modern practice favors smaller, heavily trained models over large, undertrained ones.
Generation & Sampling
Autoregressive Generation
The process of generating text one token at a time, where each new token is sampled from a probability distribution and then fed back as input for predicting the next token.
When you submit a prompt, here is what happens:
1. Your input gets tokenized and embedded
2. The sequence passes through all transformer blocks
3. The final block outputs a probability distribution over the vocabulary (32K-262K entries)
4. The model samples one token from that distribution
5. That token gets appended to the sequence
6. Steps 2-5 repeat until an end-of-sequence token or length limit

This is autoregressive generation. One token at a time, each feeding into the next forward pass.
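The loop can be sketched with a stub model - a hand-written distribution standing in for the full transformer stack:

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]

def model(tokens):
    """Stub: returns a probability distribution over VOCAB given the
    sequence so far. A real model runs the full transformer stack here."""
    if len(tokens) >= 3:
        return [0.0, 0.0, 0.0, 1.0]      # force <eos> once the toy text is long
    return [0.1, 0.4, 0.4, 0.1]

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = model(tokens)
        next_id = rng.choices(range(len(VOCAB)), weights=probs)[0]  # sample one token
        tokens.append(VOCAB[next_id])    # append it and feed the sequence back in
        if VOCAB[next_id] == "<eos>":
            break
    return tokens

print(generate(["the"]))
```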
Temperature Controls Randomness
Temperature
A parameter that controls how peaked or flat the probability distribution is before sampling. Low temperature (→ 0) makes the model deterministic. High temperature (→ 2) makes it creative but potentially incoherent.
Temperature divides the logits (raw scores) before the softmax function:
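A minimal sketch of the operation, using made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax.
    T < 1 sharpens the distribution; T > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    e = [math.exp(x - m) for x in scaled]
    return [v / sum(e) for v in e]

logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At T = 0.1 nearly all probability piles onto the top token; at T = 2.0 the distribution flattens toward uniform.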
Top-p (Nucleus) Sampling
Top-p Sampling
A sampling strategy that limits the candidate pool to the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g., 0.9). This prevents sampling extremely unlikely tokens while preserving natural diversity.
Most production systems combine temperature with top-p sampling:
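A sketch of the nucleus filter, assuming the made-up probabilities below as the model's output:

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p,
    then renormalize. Returns (kept indices, renormalized probabilities)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    renorm = [probs[i] / total for i in kept]
    return kept, renorm

probs = [0.55, 0.30, 0.10, 0.04, 0.01]   # model's output distribution
kept, renorm = top_p_filter(probs, p=0.9)
print(kept)                              # → [0, 1, 2]: 0.55 + 0.30 < 0.9, so token 2 joins
print([round(v, 3) for v in renorm])

# Sample only from the nucleus:
idx = random.Random(0).choices(kept, weights=renorm)[0]
```

The long tail of near-zero-probability tokens is cut off entirely, which is what prevents the occasional absurd token without flattening the rest of the distribution.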
| Setting | Temperature | Top-p | Use Case |
|---|---|---|---|
| Deterministic | 0 | 1.0 | Code generation, data extraction, pipelines |
| Balanced | 0.7 | 0.9 | General conversation, Q&A |
| Creative | 1.0 | 0.95 | Creative writing, brainstorming |
| Experimental | 1.5 | 0.99 | Poetry, unusual ideas (may lose coherence) |
Key Takeaway
The same prompt does not give the same answer every time. Unless temperature is zero, there is randomness in the sampling. This is not a bug - for creative tasks, diversity is the point. For deterministic pipelines, set temperature to zero. That control is always available.
Emergent Reasoning
Is a language model actually reasoning? Or is it a very sophisticated pattern matcher producing text that resembles reasoning?
The question has gotten considerably more interesting in the past year.
Chain-of-Thought Prompting
Chain-of-Thought (CoT)
A prompting technique where the model is asked to "think step by step," generating intermediate reasoning tokens that serve as working memory for subsequent steps. This significantly improves performance on complex reasoning tasks.
When you ask a model to think step by step, performance on complex tasks improves dramatically. This is not just stylistic - writing out intermediate steps forces the model to generate tokens that encode intermediate results, which become context for the next steps.
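In practice this is often just a prompt-construction choice. A hypothetical helper - the cue here is the classic zero-shot phrasing; exact wording varies by model:

```python
def build_prompt(question, chain_of_thought=False):
    """Append a step-by-step cue to elicit intermediate reasoning tokens."""
    if chain_of_thought:
        return f"{question}\nLet's think step by step."
    return question

q = ("A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. "
     "How much is the ball?")
print(build_prompt(q, chain_of_thought=True))
```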
DeepSeek-R1: Reasoning from Reinforcement Learning
The chain-of-thought research assumed that teaching reasoning required human-labeled reasoning examples. DeepSeek's R1 paper in January 2025 challenged this assumption.
The experiment: Train a model on complex reasoning tasks using pure reinforcement learning. No human-labeled reasoning examples. The only reward signal: whether the final answer was correct.
What emerged was striking. The model began generating longer responses that incorporated:
- Self-verification - checking its own intermediate results
- Self-reflection - recognizing when an approach isn't working
- Dynamic strategy adaptation - switching methods mid-solution
The researchers tracked the word "wait" (appearing as a self-correction signal) across training steps:
| Training Step | Frequency of 'wait' | AIME 2024 Score | Behavior |
|---|---|---|---|
| Step 0 | Absent | 15.6% | Direct answers, no reflection |
| Step 4,000 | Sporadic | ~40% | Occasional self-correction |
| Step 8,000+ | Frequent | 77.9% | Systematic verification and reflection |
The model discovered, without being told, that pausing to reconsider was useful. The 77.9% score on AIME 2024 exceeds the average performance of human competitors.
What This Means
The behaviors researchers assumed required human demonstration emerged from a reward signal alone. The model was not told to reflect. It was told to get the right answer. Reflection was what it discovered.
Key Takeaway
The output text functions as working memory. Chain-of-thought forces the model to generate intermediate tokens that encode reasoning steps. DeepSeek-R1 showed that sophisticated reasoning behaviors - self-verification, reflection, strategy switching - can emerge from pure reinforcement learning without any human-labeled reasoning examples. The model learns to think because thinking helps it get the right answer.
Interpretability
Mechanistic Interpretability
The field of research that attempts to understand what happens inside neural networks by identifying specific circuits, features, and computations - moving from treating models as black boxes to understanding their internal mechanisms.
Emergent reasoning is striking. But what is actually happening inside the model while it produces these outputs?
The Polysemanticity Problem
Polysemanticity
The phenomenon where individual neurons in a neural network activate in response to many different, unrelated concepts simultaneously - because the model needs to represent more concepts than it has neurons.
Individual neurons in a language model are polysemantic: they activate for many unrelated concepts at once. A single neuron might fire for both "legal documents" and "cooking recipes" and "the color blue." This makes direct interpretation of individual neurons nearly impossible.
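A toy illustration of why this happens: with more concepts than neurons, concept directions must share neurons. The vectors and weights below are made up; real models learn both:

```python
# Three unrelated concept directions packed into a 2-neuron activation space.
concepts = {
    "legal documents": [1.0, 0.0],
    "cooking recipes": [0.71, 0.71],
    "the color blue":  [0.0, 1.0],
}

neuron_weights = [0.9, 0.4]   # one neuron's (made-up) readout weights

def activation(concept_vec):
    """How strongly this single neuron fires for a given concept."""
    return sum(w * c for w, c in zip(neuron_weights, concept_vec))

for name, vec in concepts.items():
    print(f"{name}: {activation(vec):.2f}")   # the same neuron fires for all three
```

Sparse autoencoders attack exactly this: they learn to pull the tangled concept directions apart into separate, interpretable features.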
What Sparse Autoencoders Found
Sparse Autoencoder
A technique that decomposes polysemantic neurons into individual interpretable features by learning a sparse representation where each feature corresponds to a single coherent concept.
Anthropic's work applying sparse autoencoders to Claude 3 Sonnet found features corresponding to surprisingly abstract, high-level concepts:
| Feature Category | Examples Found | Significance |
|---|---|---|
| Behavioral features | Sycophantic praise, deceptive responses | Model has internal representations of its own behavioral tendencies |
| Meta-cognitive features | Confidence assessment, knowledge uncertainty | Model tracks its own knowledge state |
| Task recognition features | Summarization requests, joke detection | Dedicated circuits for recognizing task types |
| Semantic features | Countries, professions, emotions | Clean categorical representations |
| Safety features | Harmful content detection, refusal triggers | Learned safety-relevant representations |
These are not low-level pattern detectors. They are abstract representations that would not be out of place in a theory of cognition. Features for "deception" and "confidence in own knowledge" suggest internal structure far richer than naive pattern matching.
The Honest Position
The distinction between reasoning and very sophisticated pattern matching may be less clean than it sounds. The question of where pattern-matching ends and genuine reasoning begins has not been settled for biological systems either.
What is practically useful:
- Reasoning quality is sensitive to context. The tokens you provide serve as working memory. Structure what the model sees and you structure what it produces.
- Internal representations are richer than expected. The model has abstract features, not just surface patterns.
- The debate may be the wrong question. What matters for practitioners: does the model produce correct, useful output for your task? Understanding the mechanisms helps you engineer better inputs and evaluate outputs more accurately.
Key Takeaway
A language model is a function that takes tokens and returns a probability distribution over what comes next. The computation between input and output is a stack of attention operations, layered 80-120 times, each enriching token representations with contextual information. That this produces behavior we recognize as intelligent is not magic. It is not fake either. It is what happens when you scale a simple objective with enough data and enough compute.
How LLMs Think - Exercises
Test your understanding of language model internals with 7 progressive challenges.
Exercise 1: Token Counting (Warm-up)
Exercise 2: BPE Merge Steps
Exercise 3: Embedding Analogies
Exercise 4: Attention Score Calculator
Exercise 5: Transformer Block Assembly
Exercise 6: Temperature Exploration
Exercise 7: Chain-of-Thought Detection (Stretch)
Build a Transformer from Scratch
From raw text to generated output: build a working transformer language model step by step in pure Python.
Step 1: Setup & Data
Step 2: Tokenizer
Build a simple word-level tokenizer with encode/decode methods.
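One possible solution sketch (your version may differ; this one has no handling for unknown words):

```python
class WordTokenizer:
    """Minimal word-level tokenizer with encode/decode."""
    def __init__(self, text):
        words = sorted(set(text.split()))
        self.word_to_id = {w: i for i, w in enumerate(words)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}

    def encode(self, text):
        return [self.word_to_id[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_word[i] for i in ids)

tok = WordTokenizer("the cat sat on the mat")
ids = tok.encode("the cat sat")
print(ids)
print(tok.decode(ids))   # → "the cat sat"
```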
Step 3: Embeddings
Create an embedding layer that converts token IDs to dense vectors.
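A possible sketch: a randomly initialized lookup table, which training would later adjust:

```python
import random

class Embedding:
    """Lookup table mapping token IDs to dense vectors."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # One small random vector per token ID; gradients would tune these.
        self.table = [[rng.gauss(0, 0.02) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def __call__(self, ids):
        return [self.table[i] for i in ids]

emb = Embedding(vocab_size=6, dim=4)
vectors = emb([0, 3, 3])
print(len(vectors), len(vectors[0]))   # → 3 4
```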
Step 4: Single-Head Attention
Implement the core QKV attention mechanism.
Step 5: Multi-Head Attention
Run multiple attention heads in parallel and concatenate.
Step 6: Transformer Block
Combine attention, feedforward, residual connections, and layer normalization.
Step 7: Text Generation
Build autoregressive generation - predict one token at a time.
Step 8: Temperature & Top-p Sampling
Add temperature and nucleus sampling for controlled generation.
What You Built:
- A tokenizer that converts text to integers and back
- An embedding layer that gives tokens vector meaning
- Single-head and multi-head attention mechanisms
- A full transformer block with residual connections and normalization
- Autoregressive text generation
- Temperature and top-p sampling for controlled output
In production, these same components are scaled to billions of parameters and trained on trillions of tokens.