What is RAG? Retrieval-Augmented Generation Explained


The Problem RAG Solves

LLMs like GPT-4 and Claude are trained on data up to a certain cutoff date. They don't know about your company's internal documents, your product's latest features, or anything that happened after their training ended. They also hallucinate: confidently stating false information when they don't know the answer.

Retrieval-Augmented Generation (RAG) solves this by giving the LLM access to a knowledge base at query time. Instead of relying purely on memorized training data, the model retrieves relevant documents and uses them as context to generate accurate, grounded answers.

How RAG Works: Step by Step

1. Document Ingestion

Your documents (PDFs, web pages, database records, etc.) are loaded and split into smaller chunks, typically 200–500 tokens each. Smaller chunks improve retrieval precision.
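The splitting step can be sketched as a small helper. This is a hypothetical function (not a library API), and it splits on words as a rough stand-in for real tokenization; a production pipeline would count tokens with a tokenizer such as tiktoken. It assumes `overlap < chunkSize`.

```typescript
// Fixed-size chunking with overlap. Consecutive chunks share `overlap`
// words so that context at chunk boundaries is not lost.
function chunkText(text: string, chunkSize = 300, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  // Advance by (chunkSize - overlap) so each chunk repeats the tail of the previous one
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}
```

With 700 words, a chunk size of 300, and an overlap of 50, this yields three chunks, with the second chunk repeating the last 50 words of the first.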

2. Embedding Generation

Each chunk is converted into a vector embedding using an embedding model (e.g., OpenAI text-embedding-3-small, or open-source models like BGE or E5). These vectors capture semantic meaning.

3. Vector Storage

The embeddings are stored in a vector database (Pinecone, Weaviate, Chroma, pgvector). The database is optimized for fast similarity search across millions of vectors.
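Steps 1–3 together form the indexing pass. Here is a minimal sketch of that flow with the embeddings API and the vector-store write injected as hypothetical callbacks (`embedBatch` would wrap an embeddings call; `store` a vector-DB upsert such as Chroma's `collection.add`), so the shape is visible independent of any particular SDK:

```typescript
// Indexing pass: embed every chunk, then write (id, vector, text) triples
// into the vector store. Returns the number of vectors written.
async function indexChunks(
  chunks: string[],
  embedBatch: (texts: string[]) => Promise<number[][]>,
  store: (ids: string[], embeddings: number[][], documents: string[]) => Promise<void>,
): Promise<number> {
  const embeddings = await embedBatch(chunks);
  // Stable, deterministic IDs; a real pipeline might hash the source + offset instead
  const ids = chunks.map((_, i) => `chunk-${i}`);
  await store(ids, embeddings, chunks);
  return chunks.length;
}
```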

4. Query Processing

When a user asks a question, the query is also converted to an embedding using the same model. The vector database finds the most semantically similar chunks using cosine similarity or dot product.
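For concreteness, cosine similarity is just the dot product of two vectors divided by the product of their lengths. The linear-scan function below is only an illustration; vector databases compute this (or a plain dot product over normalized vectors) using approximate nearest-neighbor indexes rather than comparing against every stored vector.

```typescript
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|). Ranges from -1 (opposite) to 1 (identical direction).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```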

5. Context Injection

The top-k retrieved chunks (typically 3–10) are injected into the LLM prompt as context. The model is instructed to answer based on this context.

6. Generation

The LLM generates an answer grounded in the retrieved documents. It can cite sources, acknowledge uncertainty, and avoid hallucinating facts not in the context.

RAG Pipeline Code Example

// Simplified RAG pipeline with OpenAI + Chroma
import { OpenAI } from 'openai';
import { ChromaClient } from 'chromadb';

const openai = new OpenAI();
const chroma = new ChromaClient();
const collection = await chroma.getCollection({ name: 'docs' });

async function ragQuery(userQuestion: string) {
  // 1. Embed the user's question
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuestion,
  });

  // 2. Retrieve top 5 relevant chunks
  const results = await collection.query({
    queryEmbeddings: [queryEmbedding.data[0].embedding],
    nResults: 5,
  });

  const context = results.documents[0].join('\n\n');

  // 3. Generate answer with context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'Answer based only on the provided context. If the answer is not in the context, say so.',
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
      },
    ],
  });

  return response.choices[0].message.content;
}

Chunking Strategies

How you split documents dramatically affects retrieval quality:

→ Fixed-size chunking: Split every N tokens. Simple but can cut sentences mid-thought. Add overlap (e.g., 50 tokens) to avoid losing context at boundaries.
→ Sentence-based chunking: Split on sentence boundaries. More natural, but chunks vary in size. Good for prose documents.
→ Recursive character splitting: Try to split on paragraphs → sentences → words, in that order. Used by LangChain's RecursiveCharacterTextSplitter. The best general-purpose approach.
→ Semantic chunking: Use embeddings to detect topic shifts and split there. Highest quality but computationally expensive.
→ Document-aware chunking: Respect document structure: keep headings with their content, keep code blocks intact, keep table rows together.

Vector Databases Compared

Database | Best for | Hosting
Pinecone | Production, managed, scalable | Cloud (managed)
Chroma | Local dev, prototyping | Self-hosted / local
Weaviate | Hybrid search (vector + keyword) | Cloud or self-hosted
pgvector | Already using PostgreSQL | Self-hosted
Qdrant | High performance, open source | Cloud or self-hosted
Milvus | Large scale (billions of vectors) | Self-hosted

Advanced RAG Techniques

Hybrid Search

Combine vector similarity search with keyword (BM25) search. Catches exact matches that semantic search might miss.
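A common way to merge the two result lists is Reciprocal Rank Fusion (RRF): each document scores 1 / (k + rank) for every list it appears in, and the scores are summed. The function below is a minimal stand-alone sketch; the k = 60 constant is the conventional default, not tied to any particular database's API.

```typescript
// Merge a vector-search ranking with a keyword (BM25) ranking via RRF.
// Documents appearing high in either list, or in both, rise to the top.
function rrfFuse(vectorRanked: string[], keywordRanked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, keywordRanked]) {
    list.forEach((id, rank) => {
      // rank is 0-based here, so the best hit scores 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Documents found by both searches ("b" and "c" below) outrank documents found by only one, which is exactly the behavior hybrid search is after.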

Re-ranking

After retrieving top-20 chunks, use a cross-encoder model to re-rank them by relevance. Improves precision significantly.
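The re-ranking step itself is simple once you have a pairwise relevance score. In the sketch below, `scorePair` is a hypothetical callback standing in for a real cross-encoder (e.g. a Cohere Rerank call or a sentence-transformers model behind an API); the function just scores every candidate against the query and keeps the best few.

```typescript
// Re-rank a generous candidate set down to topK using a cross-encoder score.
async function rerank(
  query: string,
  chunks: string[],
  scorePair: (query: string, chunk: string) => Promise<number>,
  topK = 5,
): Promise<string[]> {
  const scored = await Promise.all(
    chunks.map(async (chunk) => ({ chunk, score: await scorePair(query, chunk) })),
  );
  return scored
    .sort((a, b) => b.score - a.score) // highest relevance first
    .slice(0, topK)
    .map((s) => s.chunk);
}
```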

HyDE (Hypothetical Document Embeddings)

Ask the LLM to generate a hypothetical answer, then embed that to search. Often retrieves better results than embedding the raw question.
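The HyDE flow can be sketched with the model calls injected as callbacks, so the shape of the technique is visible without tying it to a specific SDK. All three callbacks are hypothetical stand-ins: `generate` would wrap a chat-completion call, `embed` an embeddings call, and `search` a vector-DB query.

```typescript
// HyDE: embed a hypothetical answer instead of the raw question. The
// generated passage may be wrong, but it lives in the same embedding
// neighborhood as real answer passages, so retrieval improves.
async function hydeRetrieve(
  question: string,
  generate: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
  search: (vector: number[]) => Promise<string[]>,
): Promise<string[]> {
  // 1. Ask the LLM for a plausible (possibly incorrect) answer
  const hypothetical = await generate(
    `Write a short passage that answers this question: ${question}`,
  );
  // 2. Embed the hypothetical answer rather than the question itself
  const vector = await embed(hypothetical);
  // 3. Search the vector store with that embedding
  return search(vector);
}
```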

Parent-Child Chunking

Store small chunks for retrieval but return their larger parent chunks as context. Better precision + more context.
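The expansion step amounts to a lookup from each retrieved child to its parent, with deduplication when several children share one parent. The `ChildChunk` shape and the `Map` of parents below are a hypothetical illustration, not any library's API.

```typescript
interface ChildChunk { id: string; text: string; parentId: string }

// Swap retrieved small chunks for the larger parent chunks they came from,
// returning each parent at most once.
function expandToParents(
  retrievedChildren: ChildChunk[],
  parents: Map<string, string>, // parentId -> full parent text
): string[] {
  const seen = new Set<string>();
  const contexts: string[] = [];
  for (const child of retrievedChildren) {
    if (seen.has(child.parentId)) continue; // children sharing a parent collapse to one context
    seen.add(child.parentId);
    const parent = parents.get(child.parentId);
    if (parent !== undefined) contexts.push(parent);
  }
  return contexts;
}
```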

Query Expansion

Generate multiple variations of the user's question and retrieve for all of them. Reduces sensitivity to exact wording.
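After retrieving for each variant, the result lists need merging. The helper below is a hypothetical sketch of that merge step (generating the variants themselves would be one extra LLM call, not shown): it interleaves the lists round-robin so every variant contributes its best hits first, deduplicating along the way.

```typescript
// Merge per-variant retrieval results, round-robin, deduplicated,
// capped at `limit` total chunks.
function mergeRetrievals(resultLists: string[][], limit = 10): string[] {
  const merged: string[] = [];
  const seen = new Set<string>();
  const maxLen = Math.max(0, ...resultLists.map((l) => l.length));
  for (let i = 0; i < maxLen && merged.length < limit; i++) {
    for (const list of resultLists) {
      if (i < list.length && !seen.has(list[i])) {
        seen.add(list[i]);
        merged.push(list[i]);
        if (merged.length === limit) break;
      }
    }
  }
  return merged;
}
```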
