How LLMs Work: Tokens, Transformers & Context Explained


What is a Large Language Model?

A Large Language Model (LLM) is a type of artificial intelligence trained on massive amounts of text data to understand and generate human language. Models like GPT-4, Claude 3, and Gemini Ultra are LLMs: they can write code, answer questions, summarize documents, translate languages, and much more.

But how do they actually work? Under the hood, LLMs are statistical machines that predict the most likely next token given everything that came before it. That simple idea, scaled to hundreds of billions of parameters and trained on trillions of words, produces surprisingly intelligent behavior.

Step 1 โ€” Tokenization

Before an LLM can process text, it must convert words into tokens. A token is not exactly a word; it's a chunk of text that the model's vocabulary recognizes. Common words are single tokens; rare words get split into multiple tokens.

Example tokenization (GPT-4's cl100k_base vocabulary):
"Hello world" → ["Hello", " world"] → [9906, 1917]
"tokenization" → ["token", "ization"] → [3239, 2065]
"DevBench" → ["Dev", "B", "ench"] → [7469, 33, 8097]

This is why token counting matters for API costs: you pay per token, not per word. On average, 1 token ≈ 4 characters or ¾ of a word in English. Non-English languages often use more tokens per word.
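The splitting behavior above can be sketched with a toy tokenizer. Real tokenizers (such as OpenAI's BPE-based ones) apply learned byte-pair merge rules; this greedy longest-match version over a handmade vocabulary is only an illustration of how text gets chopped into vocabulary chunks:

```python
# Toy greedy longest-match tokenizer. The vocabulary here is hand-picked to
# reproduce the article's examples; real BPE tokenizers learn tens of
# thousands of merges from data instead of greedy matching.

TOY_VOCAB = {"Hello", " world", "token", "ization", "Dev", "B", "ench"}

def tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary match.
    Unknown characters fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: single unknown character
            i += 1
    return tokens

print(tokenize("Hello world"))    # ['Hello', ' world']
print(tokenize("tokenization"))   # ['token', 'ization']
print(tokenize("DevBench"))       # ['Dev', 'B', 'ench']
```

Note how "DevBench" needs three tokens: product names and rare words rarely exist whole in the vocabulary, which is exactly why they cost more.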

Step 2 โ€” Embeddings

Each token ID gets converted into a high-dimensional vector called an embedding. These vectors capture semantic meaning โ€” similar words end up close together in vector space.

King − Man + Woman ≈ Queen

The classic example showing that embeddings capture relationships.

Paris − France + Italy ≈ Rome

Geographic relationships are encoded in the vectors.

GPT-3 uses 12,288-dimensional embeddings

Each token becomes a vector of 12,288 numbers. (GPT-4's embedding size has not been published.)
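The analogy arithmetic can be demonstrated with tiny hand-built vectors. These 3-dimensional "embeddings" are made up so the king/queen example works exactly; real models learn thousands of dimensions from data:

```python
import math

# Hand-picked toy vectors: dimension 0 ≈ "person", 1 ≈ "female", 2 ≈ "royal".
# Purely illustrative; learned embeddings are not this interpretable.
embeddings = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c (inputs excluded)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```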

Step 3 โ€” The Transformer Architecture

The core of every modern LLM is the Transformer, introduced in the 2017 paper "Attention Is All You Need." It processes all tokens in parallel (unlike older RNNs that processed sequentially) using a mechanism called self-attention.

• Self-Attention: For each token, the model calculates how much "attention" to pay to every other token in the context. This lets it understand that "it" in "The cat sat on the mat because it was tired" refers to "cat", not "mat".

• Multi-Head Attention: Multiple attention heads run in parallel, each learning different types of relationships: one might focus on syntax, another on semantics, another on coreference.

• Feed-Forward Layers: After attention, each token passes through a feed-forward neural network that transforms its representation. This is where most of the model's "knowledge" is stored.

• Layer Stacking: GPT-3 has 96 transformer layers (GPT-4's layer count is unpublished). Each layer refines the representation, building from low-level syntax in early layers to high-level semantics in later layers.
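The core computation, scaled dot-product attention, fits in a few lines. This sketch uses plain Python lists and tiny made-up 2-dimensional vectors; in a real transformer, Q, K, and V come from learned linear projections of the token embeddings and there are many heads and layers:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of d-dimensional row vectors, one per token."""
    d = len(K[0])
    out = []
    for q in Q:
        # How strongly this token's query matches every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)    # attention paid to each token, sums to 1
        # Output is the attention-weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens with toy 2-d vectors (using the same matrix for Q, K, V).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(Q, K, V)
```

Each output row is a weighted blend of all value vectors, which is exactly how context from other tokens flows into each position.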

Context Windows Explained

The context window is the maximum number of tokens an LLM can process at once โ€” both your input and its output combined. Everything outside the context window is invisible to the model.

Model               Context Window    Approx. Pages
GPT-3.5 Turbo       16K tokens        ~12 pages
GPT-4o              128K tokens       ~96 pages
Claude 3.5 Sonnet   200K tokens       ~150 pages
Gemini 1.5 Flash    1M tokens         ~750 pages
Gemini 1.5 Pro      2M tokens         ~1,500 pages
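Before sending a long prompt, it's worth checking that it fits. This sketch budgets with the rough 4-characters-per-token rule from the tokenization section; the window sizes and the `reserve_for_output` parameter are illustrative, and an exact count would need the model's real tokenizer:

```python
# Rough token budgeting using the ~4 chars/token rule of thumb for English.
# Only an estimate: for exact counts, use the model's actual tokenizer.

CONTEXT_WINDOWS = {            # illustrative sizes, in tokens
    "gpt-3.5-turbo": 16_000,
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, model: str,
                    reserve_for_output: int = 1_000) -> bool:
    """True if the prompt plus space reserved for the model's reply fits.
    Input and output share the same window, so both must be budgeted."""
    return estimate_tokens(prompt) + reserve_for_output <= CONTEXT_WINDOWS[model]

print(fits_in_context("Hello " * 100, "gpt-3.5-turbo"))  # True
```

Reserving room for the output matters because the window covers input and output combined: a prompt that exactly fills the window leaves the model no room to answer.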

Training: Pre-training vs Fine-tuning

Pre-training

The model is trained on a massive corpus of internet text (books, websites, code, Wikipedia) to predict the next token. This teaches it language, facts, reasoning patterns, and world knowledge. GPT-4 was reportedly trained on roughly 13 trillion tokens, though OpenAI has not published the exact figure.

Supervised Fine-tuning (SFT)

Human trainers write example conversations showing ideal assistant behavior. The model is fine-tuned on these examples to learn to be helpful, follow instructions, and format responses well.

RLHF (Reinforcement Learning from Human Feedback)

Human raters compare pairs of model responses and rank which is better. A reward model is trained on these preferences, then used to further fine-tune the LLM via reinforcement learning. This is what makes ChatGPT feel "aligned."
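The reward model at the heart of RLHF is trained with a pairwise (Bradley-Terry style) objective: it should score the human-preferred response higher than the rejected one. The scalar reward values below are made up for illustration:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)).
    Near zero when the chosen response scores much higher than the
    rejected one; large when the reward model ranks them backwards."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = preference_loss(2.0, -1.0)   # correct ranking, wide margin: low loss
bad = preference_loss(-1.0, 2.0)    # ranking backwards: high loss
print(good, bad)
```

Minimizing this loss over many human-ranked pairs produces a reward model whose score can then steer the LLM during the reinforcement learning step.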

Why Do LLMs Hallucinate?

Hallucination, when an LLM confidently states something false, happens because LLMs are trained to produce plausible text, not true text. Key reasons:

• The model has no internal "fact checker": it generates tokens based on statistical patterns, not verified knowledge
• Training data has a cutoff date, so the model doesn't know about events after its training
• Rare or niche topics have less training data, so the model fills gaps with plausible-sounding guesses
• The model can't naturally say "I don't know": it's trained to always produce a response
• Long contexts can cause the model to "forget" information in the middle of the prompt (the "lost in the middle" problem)

💡 Mitigation: Use RAG (Retrieval-Augmented Generation) to ground the model in real documents, or use a model with web search access, such as GPT-4o with browsing.
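The retrieval step of RAG can be sketched in a few lines. This toy version picks the document with the greatest word overlap with the question and prepends it to the prompt; the documents and prompt template are invented for the example, and production systems use embedding similarity search rather than word overlap:

```python
# Minimal RAG retrieval sketch: ground the model in a real document so it
# answers from retrieved text instead of guessing from statistical patterns.

DOCS = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "Python 3.12 was released in October 2023.",
    "Transformers process all tokens in parallel using self-attention.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(question: str) -> str:
    context = retrieve(question, DOCS)
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt("How tall is the Eiffel Tower?")
print(prompt)
```

Because the answer is now present in the prompt itself, the model no longer has to rely on whatever it memorized during training.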

Temperature & Sampling

When generating the next token, the model produces a probability distribution over its entire vocabulary (typically 50,000 to 100,000+ tokens; GPT-4's cl100k_base vocabulary has about 100,000). Temperature controls how "random" the selection is:

Temperature 0.0: Always picks the highest-probability token. Effectively deterministic, repetitive, best for factual tasks.
Temperature 0.3–0.7: Balanced. Good for most tasks: some creativity but still coherent.
Temperature 1.0+: High randomness. Creative but can become incoherent. Good for brainstorming.
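Mechanically, temperature just divides the logits before the softmax: low values sharpen the distribution toward the top token, high values flatten it. A sketch over a made-up three-token vocabulary (the logit values are invented):

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float,
                 rng=random) -> str:
    """Sample one token from temperature-scaled softmax over the logits."""
    if temperature == 0.0:
        return max(logits, key=logits.get)      # greedy: always the top token
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())                    # stabilize the exponentials
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    r = rng.random()                            # inverse-CDF sampling
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r <= cum:
            return tok
    return tok  # guard against floating-point rounding at the boundary

logits = {"cat": 3.0, "dog": 1.5, "pizza": -1.0}
print(sample_token(logits, 0.0))   # cat (greedy)
```

At temperature 0.0 the function always returns "cat"; at 1.0 "dog" appears sometimes and "pizza" rarely; well above 1.0 the three choices approach equal probability, which is why high temperatures drift toward incoherence.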

Explore AI Tools on DevBench

Count tokens before sending to the API, compare model capabilities, and build structured prompts.