What is RAG? Retrieval-Augmented Generation Explained
12 min read · AI & Machine Learning
The Problem RAG Solves
LLMs like GPT-4 and Claude are trained on data up to a certain cutoff date. They don't know about your company's internal documents, your product's latest features, or anything that happened after their training ended. They also hallucinate, confidently stating false information when they don't know the answer.
Retrieval-Augmented Generation (RAG) solves this by giving the LLM access to a knowledge base at query time. Instead of relying purely on memorized training data, the model retrieves relevant documents and uses them as context to generate accurate, grounded answers.
How RAG Works: Step by Step
1. **Ingest and chunk.** Your documents (PDFs, web pages, database records, etc.) are loaded and split into smaller chunks, typically 200–500 tokens each. Smaller chunks improve retrieval precision.
2. **Embed.** Each chunk is converted into a vector embedding using an embedding model (e.g., OpenAI text-embedding-3-small, or open-source models like BGE or E5). These vectors capture semantic meaning.
3. **Index.** The embeddings are stored in a vector database (Pinecone, Weaviate, Chroma, pgvector). The database is optimized for fast similarity search across millions of vectors.
4. **Retrieve.** When a user asks a question, the query is also converted to an embedding using the same model. The vector database finds the most semantically similar chunks using cosine similarity or dot product.
5. **Augment.** The top-k retrieved chunks (typically 3–10) are injected into the LLM prompt as context, and the model is instructed to answer based on this context.
6. **Generate.** The LLM generates an answer grounded in the retrieved documents. It can cite sources, acknowledge uncertainty, and avoid hallucinating facts not in the context.
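The similarity measure used in the retrieval step is simple to state: cosine similarity is the dot product of two vectors divided by the product of their lengths. Vector databases compute this at scale with approximate nearest-neighbor indexes, but a minimal sketch of the underlying math looks like this:

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; higher means more semantically similar.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions score 1, orthogonal vectors score 0, which is why embeddings of a question and of a chunk that answers it end up close together.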
RAG Pipeline Code Example
```typescript
// Simplified RAG pipeline with OpenAI + Chroma
import { OpenAI } from 'openai';
import { ChromaClient } from 'chromadb';

const openai = new OpenAI();
const chroma = new ChromaClient();
const collection = await chroma.getCollection({ name: 'docs' });

async function ragQuery(userQuestion: string) {
  // 1. Embed the user's question
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuestion,
  });

  // 2. Retrieve the top 5 most relevant chunks
  const results = await collection.query({
    queryEmbeddings: [queryEmbedding.data[0].embedding],
    nResults: 5,
  });
  const context = results.documents[0].join('\n\n');

  // 3. Generate an answer grounded in the retrieved context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content:
          'Answer based only on the provided context. If the answer is not in the context, say so.',
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
      },
    ],
  });

  return response.choices[0].message.content;
}
```

Chunking Strategies
How you split documents dramatically affects retrieval quality: chunks that are too large dilute the embedding's meaning, while chunks that are too small lose the surrounding context the LLM needs.
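As a concrete baseline, here is a minimal fixed-size chunker with overlap. It is character-based for simplicity; production chunkers typically count tokens and prefer splitting on sentence or paragraph boundaries, and the default sizes here are illustrative, not prescriptive.

```typescript
// Fixed-size chunking with overlap (character-based for simplicity;
// real pipelines usually count tokens, not characters).
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward by chunkSize minus overlap so adjacent chunks
    // share `overlap` characters and no boundary sentence is lost.
    start += chunkSize - overlap;
  }
  return chunks;
}
```

The overlap matters: without it, a fact that straddles a chunk boundary is split across two embeddings and may match neither query.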
Vector Databases Compared
| Database | Best for | Hosting |
|---|---|---|
| Pinecone | Production, managed, scalable | Cloud (managed) |
| Chroma | Local dev, prototyping | Self-hosted / local |
| Weaviate | Hybrid search (vector + keyword) | Cloud or self-hosted |
| pgvector | Already using PostgreSQL | Self-hosted |
| Qdrant | High performance, open source | Cloud or self-hosted |
| Milvus | Large scale (billions of vectors) | Self-hosted |
Advanced RAG Techniques
- **Hybrid search:** Combine vector similarity search with keyword (BM25) search. Catches exact matches that semantic search might miss.
- **Re-ranking:** After retrieving the top 20 chunks, use a cross-encoder model to re-rank them by relevance. Improves precision significantly.
- **HyDE (Hypothetical Document Embeddings):** Ask the LLM to generate a hypothetical answer, then embed that answer to search. Often retrieves better results than embedding the raw question.
- **Parent-document retrieval:** Store small chunks for retrieval but return their larger parent chunks as context. Better precision plus more context.
- **Multi-query retrieval:** Generate multiple variations of the user's question and retrieve for all of them. Reduces sensitivity to exact wording.
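Several of these techniques end the same way: with multiple ranked result lists (vector hits plus BM25 hits, or hits from several query rewrites) that must be merged into one. Reciprocal Rank Fusion is a common, score-free way to do that merge; a sketch, where the constant k = 60 is the conventional default rather than anything from this article:

```typescript
// Reciprocal Rank Fusion: merge several ranked lists of document IDs.
// Each document scores sum(1 / (k + rank)) over the lists it appears in,
// so items ranked highly by multiple retrievers rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.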