How LLMs Work: Tokens, Transformers & Context Explained


What is a Large Language Model?

A Large Language Model (LLM) is a type of artificial intelligence trained on massive amounts of text data to understand and generate human language. Models like GPT-4, Claude 3, and Gemini Ultra are LLMs: they can write code, answer questions, summarize documents, translate languages, and much more.

But how do they actually work? Under the hood, LLMs are statistical machines that predict the most likely next token given everything that came before it. That simple idea, scaled to hundreds of billions of parameters and trained on trillions of words, produces surprisingly intelligent behavior.

Step 1 โ€” Tokenization

Before an LLM can process text, it must convert words into tokens. A token is not exactly a word; it's a chunk of text that the model's vocabulary recognizes. Common words are single tokens; rare words get split into multiple tokens.

Example tokenization (GPT-4's cl100k_base vocabulary):
"Hello world" → ["Hello", " world"] → [9906, 1917]
"tokenization" → ["token", "ization"] → [3239, 2065]
"DevBench" → ["Dev", "B", "ench"] → [7469, 33, 8097]

This is why token counting matters for API costs: you pay per token, not per word. On average, 1 token ≈ 4 characters or ¾ of a word in English. Non-English languages often use more tokens per word.
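The splitting behavior above can be sketched with a toy tokenizer. Real tokenizers (such as OpenAI's BPE-based ones) apply learned byte-pair merge rules; this greedy longest-match version over a handmade vocabulary is only an illustration of how text gets chopped into vocabulary chunks:

```python
# Toy greedy longest-match tokenizer. The vocabulary here is hand-picked to
# reproduce the article's examples; real BPE tokenizers learn tens of
# thousands of merges from data instead of greedy matching.

TOY_VOCAB = {"Hello", " world", "token", "ization", "Dev", "B", "ench"}

def tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary match.
    Unknown characters fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: single unknown character
            i += 1
    return tokens

print(tokenize("Hello world"))    # ['Hello', ' world']
print(tokenize("tokenization"))   # ['token', 'ization']
print(tokenize("DevBench"))       # ['Dev', 'B', 'ench']
```

Note how "DevBench" needs three tokens: product names and rare words rarely exist whole in the vocabulary, which is exactly why they cost more.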

Step 2 โ€” Embeddings

Each token ID gets converted into a high-dimensional vector called an embedding. These vectors capture semantic meaning โ€” similar words end up close together in vector space.

King − Man + Woman ≈ Queen

The classic example showing that embeddings capture relationships.

Paris − France + Italy ≈ Rome

Geographic relationships are encoded in the vectors.

GPT-3 uses 12,288-dimensional embeddings

Each token becomes a vector of 12,288 numbers. (GPT-4's embedding size has not been published.)
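The analogy arithmetic can be demonstrated with tiny hand-built vectors. These 3-dimensional "embeddings" are made up so the king/queen example works exactly; real models learn thousands of dimensions from data:

```python
import math

# Hand-picked toy vectors: dimension 0 ≈ "person", 1 ≈ "female", 2 ≈ "royal".
# Purely illustrative; learned embeddings are not this interpretable.
embeddings = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c (inputs excluded)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```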

Step 3 โ€” The Transformer Architecture

The core of every modern LLM is the Transformer, introduced in the 2017 paper "Attention Is All You Need." It processes all tokens in parallel (unlike older RNNs that processed sequentially) using a mechanism called self-attention.

• Self-Attention: For each token, the model calculates how much "attention" to pay to every other token in the context. This lets it understand that "it" in "The cat sat on the mat because it was tired" refers to "cat", not "mat".

• Multi-Head Attention: Multiple attention heads run in parallel, each learning different types of relationships: one might focus on syntax, another on semantics, another on coreference.

• Feed-Forward Layers: After attention, each token passes through a feed-forward neural network that transforms its representation. This is where most of the model's "knowledge" is stored.

• Layer Stacking: GPT-3 has 96 transformer layers (GPT-4's layer count is unpublished). Each layer refines the representation, building from low-level syntax in early layers to high-level semantics in later layers.
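The core computation, scaled dot-product attention, fits in a few lines. This sketch uses plain Python lists and tiny made-up 2-dimensional vectors; in a real transformer, Q, K, and V come from learned linear projections of the token embeddings and there are many heads and layers:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of d-dimensional row vectors, one per token."""
    d = len(K[0])
    out = []
    for q in Q:
        # How strongly this token's query matches every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)    # attention paid to each token, sums to 1
        # Output is the attention-weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens with toy 2-d vectors (using the same matrix for Q, K, V).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(Q, K, V)
```

Each output row is a weighted blend of all value vectors, which is exactly how context from other tokens flows into each position.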

Context Windows Explained

The context window is the maximum number of tokens an LLM can process at once โ€” both your input and its output combined. Everything outside the context window is invisible to the model.

Model               Context Window    Approx. Pages
GPT-3.5 Turbo       16K tokens        ~12 pages
GPT-4o              128K tokens       ~96 pages
Claude 3.5 Sonnet   200K tokens       ~150 pages
Gemini 1.5 Flash    1M tokens         ~750 pages
Gemini 1.5 Pro      2M tokens         ~1,500 pages
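Before sending a long prompt, it's worth checking that it fits. This sketch budgets with the rough 4-characters-per-token rule from the tokenization section; the window sizes and the `reserve_for_output` parameter are illustrative, and an exact count would need the model's real tokenizer:

```python
# Rough token budgeting using the ~4 chars/token rule of thumb for English.
# Only an estimate: for exact counts, use the model's actual tokenizer.

CONTEXT_WINDOWS = {            # illustrative sizes, in tokens
    "gpt-3.5-turbo": 16_000,
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, model: str,
                    reserve_for_output: int = 1_000) -> bool:
    """True if the prompt plus space reserved for the model's reply fits.
    Input and output share the same window, so both must be budgeted."""
    return estimate_tokens(prompt) + reserve_for_output <= CONTEXT_WINDOWS[model]

print(fits_in_context("Hello " * 100, "gpt-3.5-turbo"))  # True
```

Reserving room for the output matters because the window covers input and output combined: a prompt that exactly fills the window leaves the model no room to answer.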

Training: Pre-training vs Fine-tuning

Pre-training

The model is trained on a massive corpus of internet text (books, websites, code, Wikipedia) to predict the next token. This teaches it language, facts, reasoning patterns, and world knowledge. GPT-4 was reportedly trained on roughly 13 trillion tokens, though OpenAI has not published the exact figure.

Supervised Fine-tuning (SFT)

Human trainers write example conversations showing ideal assistant behavior. The model is fine-tuned on these examples to learn to be helpful, follow instructions, and format responses well.

RLHF (Reinforcement Learning from Human Feedback)

Human raters compare pairs of model responses and rank which is better. A reward model is trained on these preferences, then used to further fine-tune the LLM via reinforcement learning. This is what makes ChatGPT feel "aligned."
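The reward model at the heart of RLHF is trained with a pairwise (Bradley-Terry style) objective: it should score the human-preferred response higher than the rejected one. The scalar reward values below are made up for illustration:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)).
    Near zero when the chosen response scores much higher than the
    rejected one; large when the reward model ranks them backwards."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = preference_loss(2.0, -1.0)   # correct ranking, wide margin: low loss
bad = preference_loss(-1.0, 2.0)    # ranking backwards: high loss
print(good, bad)
```

Minimizing this loss over many human-ranked pairs produces a reward model whose score can then steer the LLM during the reinforcement learning step.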

Why Do LLMs Hallucinate?

Hallucination, when an LLM confidently states something false, happens because LLMs are trained to produce plausible text, not true text. Key reasons:

• The model has no internal "fact checker": it generates tokens based on statistical patterns, not verified knowledge
• Training data has a cutoff date, so the model doesn't know about events after its training
• Rare or niche topics have less training data, so the model fills gaps with plausible-sounding guesses
• The model can't naturally say "I don't know": it's trained to always produce a response
• Long contexts can cause the model to "forget" information in the middle of the prompt (the "lost in the middle" problem)

💡 Mitigation: Use RAG (Retrieval-Augmented Generation) to ground the model in real documents, or use a model with web search access, such as GPT-4o with browsing.
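The retrieval step of RAG can be sketched in a few lines. This toy version picks the document with the greatest word overlap with the question and prepends it to the prompt; the documents and prompt template are invented for the example, and production systems use embedding similarity search rather than word overlap:

```python
# Minimal RAG retrieval sketch: ground the model in a real document so it
# answers from retrieved text instead of guessing from statistical patterns.

DOCS = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "Python 3.12 was released in October 2023.",
    "Transformers process all tokens in parallel using self-attention.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(question: str) -> str:
    context = retrieve(question, DOCS)
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt("How tall is the Eiffel Tower?")
print(prompt)
```

Because the answer is now present in the prompt itself, the model no longer has to rely on whatever it memorized during training.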

Temperature & Sampling

When generating the next token, the model produces a probability distribution over its entire vocabulary (typically 50,000 to 100,000+ tokens; GPT-4's cl100k_base vocabulary has about 100,000). Temperature controls how "random" the selection is:

Temperature 0.0: Always picks the highest-probability token. Effectively deterministic, repetitive, best for factual tasks.
Temperature 0.3–0.7: Balanced. Good for most tasks: some creativity but still coherent.
Temperature 1.0+: High randomness. Creative but can become incoherent. Good for brainstorming.
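Mechanically, temperature just divides the logits before the softmax: low values sharpen the distribution toward the top token, high values flatten it. A sketch over a made-up three-token vocabulary (the logit values are invented):

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float,
                 rng=random) -> str:
    """Sample one token from temperature-scaled softmax over the logits."""
    if temperature == 0.0:
        return max(logits, key=logits.get)      # greedy: always the top token
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())                    # stabilize the exponentials
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    r = rng.random()                            # inverse-CDF sampling
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r <= cum:
            return tok
    return tok  # guard against floating-point rounding at the boundary

logits = {"cat": 3.0, "dog": 1.5, "pizza": -1.0}
print(sample_token(logits, 0.0))   # cat (greedy)
```

At temperature 0.0 the function always returns "cat"; at 1.0 "dog" appears sometimes and "pizza" rarely; well above 1.0 the three choices approach equal probability, which is why high temperatures drift toward incoherence.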

Explore AI Tools on DevBench

Count tokens before sending to the API, compare model capabilities, and build structured prompts.