How LLMs Work: Tokens, Transformers & Context Explained
14 min read · AI & Machine Learning
What is a Large Language Model?
A Large Language Model (LLM) is a type of artificial intelligence trained on massive amounts of text data to understand and generate human language. Models like GPT-4, Claude 3, and Gemini Ultra are LLMs: they can write code, answer questions, summarize documents, translate languages, and much more.
But how do they actually work? Under the hood, LLMs are statistical machines that predict the most likely next token given everything that came before it. That simple idea, scaled to hundreds of billions of parameters and trained on trillions of words, produces surprisingly intelligent behavior.
Step 1: Tokenization
Before an LLM can process text, it must convert words into tokens. A token is not exactly a word; it's a chunk of text that the model's vocabulary recognizes. Common words are single tokens; rare words get split into multiple tokens.
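To make the splitting concrete, here is a minimal sketch. Real LLM tokenizers use byte-pair encoding (BPE) learned from data; the greedy longest-match approach and the tiny vocabulary below are simplifications invented for illustration.

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Real tokenizers use BPE with tens of thousands of entries; this only
# illustrates how a rare word splits into multiple known pieces.
VOCAB = {"token", "ization", "is", "fun", " "}

def tokenize(text, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# "is" and "fun" stay whole; "tokenization" splits into two pieces.
print(tokenize("tokenization is fun"))
```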
This is why token counting matters for API costs: you pay per token, not per word. On average, 1 token ≈ 4 characters or ¾ of a word in English. Non-English languages often use more tokens per word.
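The 4-characters-per-token rule of thumb can be turned into a quick cost estimator. The price parameter below is a placeholder, not any provider's real rate, and a real count requires the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate via the ~4 characters per token rule of thumb
    for English text. Exact counts need the model's real tokenizer."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """usd_per_1k_tokens is a placeholder rate for illustration."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the following document in three bullet points."
print(estimate_tokens(prompt))
```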
Step 2: Embeddings
Each token ID gets converted into a high-dimensional vector called an embedding. These vectors capture semantic meaning: similar words end up close together in vector space.
- King − Man + Woman ≈ Queen: the classic example showing embeddings capture relationships
- Paris − France + Italy ≈ Rome: geographic relationships are encoded in the vectors
- GPT-3 used 12,288-dimensional embeddings, so each token became a vector of 12,288 numbers (GPT-4's embedding size has not been published)
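The vector arithmetic above can be demonstrated with cosine similarity. The 3-dimensional vectors below are hand-picked for illustration; real embeddings are learned from data and have thousands of dimensions.

```python
import math

# Toy 3-d vectors, invented so the analogy works out cleanly.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman should land near queen in vector space.
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]
best = max(EMB, key=lambda word: cosine(EMB[word], target))
print(best)  # → queen
```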
Step 3: The Transformer Architecture
The core of every modern LLM is the Transformer, introduced in the 2017 paper "Attention Is All You Need." It processes all tokens in parallel (unlike older RNNs that processed sequentially) using a mechanism called self-attention.
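Self-attention can be sketched in a few lines. The version below uses identity query/key/value projections for brevity; a real Transformer learns separate weight matrices for each, and stacks many attention heads and layers.

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections
    (a simplification; real models learn these projections)."""
    d = len(X[0])
    out = []
    for q in X:  # every token attends to every token, in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-d token vectors
print(self_attention(X))
```

Each output row is a weighted average of all input vectors, which is why attention lets every token "see" every other token at once.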
Context Windows Explained
The context window is the maximum number of tokens an LLM can process at once, your input and its output combined. Everything outside the context window is invisible to the model.
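One common way applications deal with this limit is to drop the oldest messages until the conversation fits (summarizing old turns is another strategy). A minimal sketch, using a crude length-based stand-in for a real tokenizer:

```python
def fit_to_context(messages, max_tokens,
                   count_tokens=lambda m: len(m) // 4 + 1):
    """Keep the most recent messages that fit in the window,
    dropping the oldest first. count_tokens is a crude stand-in
    for a real tokenizer."""
    kept, used = [], 0
    for msg in reversed(messages):  # newest first
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["first question", "a long answer " * 10, "follow-up question"]
print(fit_to_context(history, max_tokens=40))
```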
| Model | Context Window | Approx. Pages |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~12 pages |
| GPT-4o | 128K tokens | ~96 pages |
| Claude 3.5 Sonnet | 200K tokens | ~150 pages |
| Gemini 1.5 Pro | 1M tokens | ~750 pages |
| Gemini 1.5 Pro (2M preview) | 2M tokens | ~1,500 pages |
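The page estimates in the table follow from a single implied ratio, roughly 1,333 tokens per page (16,000 / 12). A quick sanity check of the other rows under that assumption:

```python
# The table implies roughly 1,333 tokens per "page" (16,000 / 12).
TOKENS_PER_PAGE = 16_000 / 12

def approx_pages(context_tokens):
    return round(context_tokens / TOKENS_PER_PAGE)

for tokens in (16_000, 128_000, 200_000, 1_000_000):
    print(tokens, approx_pages(tokens))
```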
Training: Pre-training vs Fine-tuning
Pre-training: The model is trained on a massive corpus of internet text (books, websites, code, Wikipedia) to predict the next token. This teaches it language, facts, reasoning patterns, and world knowledge. GPT-4 was reportedly trained on roughly 13 trillion tokens, though OpenAI has not confirmed the figure.
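At its simplest, next-token prediction is learning which token tends to follow which context. The toy bigram model below captures the same training objective with plain counting; it is nothing like a Transformer internally, but it shows what "predict the next token" means.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which token follows it. A Transformer
    learns a far richer version of this conditional distribution,
    conditioning on the whole context rather than one previous word."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, token):
    return follows[token].most_common(1)[0][0]

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often
```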
Supervised fine-tuning: Human trainers write example conversations showing ideal assistant behavior. The model is fine-tuned on these examples to learn to be helpful, follow instructions, and format responses well.
Reinforcement learning from human feedback (RLHF): Human raters compare pairs of model responses and rank which is better. A reward model is trained on these preferences, then used to further fine-tune the LLM via reinforcement learning. This is what makes ChatGPT feel "aligned."
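The reward model is typically trained with a pairwise preference loss of the form −log σ(r_chosen − r_rejected): the loss shrinks as the reward model scores the human-preferred response higher than the rejected one. A minimal sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)). Small when the model
    already ranks the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Agreeing with the raters gives a small loss...
print(preference_loss(2.0, -1.0))
# ...preferring the rejected response gives a large one.
print(preference_loss(-1.0, 2.0))
```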
Why Do LLMs Hallucinate?
Hallucination, when an LLM confidently states something false, happens because LLMs are trained to produce plausible text, not true text: the training objective rewards fluent, statistically likely continuations; the model has no built-in mechanism to check its claims against a source of truth; and gaps in its training data get filled with plausible-sounding guesses.
Temperature & Sampling
When generating the next token, the model produces a probability distribution over its entire vocabulary (tens of thousands of entries; roughly 50,000 for GPT-3, and larger for many newer models). Temperature controls how "random" the selection is:
- Temperature 0.0: always picks the highest-probability token. Deterministic and repetitive; best for factual tasks.
- Temperature 0.3–0.7: balanced. Good for most tasks; some creativity but still coherent.
- Temperature 1.0+: high randomness. Creative but can become incoherent; good for brainstorming.
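Mechanically, temperature divides the model's raw scores (logits) before the softmax: low temperature sharpens the distribution toward the top token, high temperature flattens it. A sketch with a made-up 3-token vocabulary:

```python
import math

def temperature_softmax(logits, temperature):
    """Divide logits by temperature, then softmax. Temperature 0 is
    handled as greedy argmax, since dividing by zero is undefined."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for a 3-token vocabulary
for t in (0.0, 0.7, 1.5):
    print(t, temperature_softmax(logits, t))
```

In practice the next token is then sampled from this distribution, so higher temperature makes lower-ranked tokens more likely to be chosen.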