
What Is RAG?

Understanding Retrieval-Augmented Generation — how combining search with LLMs creates more accurate, grounded AI systems.

February 22, 2026 · 3 min read

Retrieval-Augmented Generation (RAG) is a technique that makes LLMs smarter by giving them access to external knowledge at query time. Instead of relying solely on what the model learned during training, RAG first retrieves relevant documents, then generates a response grounded in that context.

The Problem RAG Solves

LLMs have two fundamental limitations:

  1. Knowledge cutoff — they only know what was in their training data
  2. Hallucination — they sometimes generate confident but incorrect information

RAG addresses both by providing the model with fresh, verified information at query time.

How RAG Works

The RAG pipeline has three stages:

1. Indexing (Offline)

Your knowledge base (documents, FAQs, code, etc.) is processed:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
 
# Split documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)
 
# Create embeddings and store them in a FAISS vector index
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

Each chunk is converted into an embedding — a dense vector that captures its semantic meaning — and stored in a vector database.

2. Retrieval (At Query Time)

When a user asks a question, the system:

  1. Embeds the question into the same vector space
  2. Finds the most similar document chunks via similarity search
  3. Returns the top-k most relevant chunks

Similarity is typically measured as the cosine of the angle between the query and document vectors:

$$\text{similarity}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}$$
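The retrieval step above can be sketched in a few lines of NumPy. The 3-dimensional vectors here are toy stand-ins for real embeddings (which have hundreds of dimensions), but the cosine top-k logic is the same:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity of each doc vs. the query
    return np.argsort(scores)[::-1][:k]  # highest scores first

# Toy 3-dimensional "embeddings" -- real ones have hundreds of dimensions
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
print(top_k(query, docs))  # [0 1] -- the two chunks closest to the query
```

A real system would delegate this search to the vector database (e.g. FAISS's `similarity_search`), which uses approximate nearest-neighbor indexes to stay fast at scale.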

3. Generation

The retrieved chunks are injected into the LLM's prompt as context:

Given the following context, answer the question.

Context:
{retrieved_chunks}

Question: {user_question}

The model generates its response grounded in the retrieved information.
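In code, this step is just string assembly: join the retrieved chunks and fill the template. The `llm.invoke` call in the comment is a placeholder for whichever client you use:

```python
PROMPT_TEMPLATE = """Given the following context, answer the question.

Context:
{retrieved_chunks}

Question: {user_question}"""

def build_prompt(chunks, question):
    """Join retrieved chunks and fill the prompt template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(retrieved_chunks=context, user_question=question)

prompt = build_prompt(
    ["RAG retrieves documents before generating.", "Embeddings enable semantic search."],
    "What does RAG do first?",
)
# The prompt is then sent to the model, e.g. response = llm.invoke(prompt)
```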

When to Use RAG

RAG is ideal when your application needs to answer questions about specific, frequently updated, or proprietary data that wasn't in the LLM's training set.

Common use cases:

  • Customer support — answering questions from product documentation
  • Internal knowledge bases — making company wikis searchable via natural language
  • Research assistants — querying academic papers or reports
  • Code documentation — answering questions about a specific codebase

Key Takeaways

  • RAG = Retrieve relevant context + Generate a grounded response
  • It reduces hallucinations by anchoring responses in real documents
  • The pipeline: chunk → embed → store → retrieve → generate
  • Vector databases enable fast semantic similarity search
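The whole chunk → embed → store → retrieve → generate loop can be sketched end to end with toy components. The bag-of-words `embed` function here only stands in for a real embedding model, and the final prompt would go to an LLM rather than being the answer itself:

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding: one dimension per vocabulary word."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

# 1. Chunk + 2. Embed + 3. Store
corpus = ["rag retrieves documents", "llms hallucinate without context", "faiss stores vectors"]
vocab = sorted({w for doc in corpus for w in doc.split()})
index = np.array([embed(doc, vocab) for doc in corpus])

# 4. Retrieve: cosine similarity between the query embedding and each stored chunk
query = "which documents does rag retrieve"
q = embed(query, vocab)
scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
best = corpus[int(np.argmax(scores))]

# 5. Generate: the retrieved chunk is injected into the LLM prompt as context
prompt = f"Context:\n{best}\n\nQuestion: {query}"
print(best)  # rag retrieves documents
```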

References

  1. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
  2. LangChain documentation — building RAG pipelines
