How To Build a RAG Pipeline
A step-by-step guide to building a Retrieval-Augmented Generation pipeline from scratch using Python, FAISS, and Claude.
In this tutorial, we'll build a complete RAG pipeline that can answer questions about any collection of documents. We'll use FAISS for vector storage and Claude for generation.
What We're Building
By the end of this tutorial, you'll have a working system that:
- Loads and chunks documents
- Creates embeddings and stores them in FAISS
- Retrieves relevant chunks for a query
- Generates answers using Claude with the retrieved context
Setup
First, install the required packages:
```
pip install anthropic faiss-cpu sentence-transformers
```

Step 1: Document Loading and Chunking
We need to split our documents into manageable chunks:
```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks
```
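Before chunking real documents, it can help to sanity-check the overlap arithmetic on a toy string. The snippet below repeats the function so it runs standalone; with `chunk_size=4` and `overlap=1`, consecutive chunks share one character and the stride is `chunk_size - overlap = 3`:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

# Each chunk starts 3 characters after the previous one (stride = 4 - 1),
# so the last character of one chunk reappears as the first of the next.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij', 'j']
```

Note that `overlap` must be smaller than `chunk_size`, otherwise `start` never advances and the loop runs forever.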
```python
# Load your documents
with open("knowledge_base.txt", "r") as f:
    text = f.read()

chunks = chunk_text(text)
print(f"Created {len(chunks)} chunks")
```

Step 2: Creating Embeddings
We'll use sentence-transformers for free, local embeddings:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed all chunks
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")
```

The `all-MiniLM-L6-v2` model produces 384-dimensional embeddings and runs fast on CPU. For production, consider larger models like `all-mpnet-base-v2` for better quality.
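One subtlety worth knowing before building the index: `IndexFlatL2` (used in the next step) ranks by squared Euclidean distance, while embedding similarity is usually discussed in terms of cosine similarity. If you L2-normalize the embeddings first (FAISS provides `faiss.normalize_L2` for this), the two rankings agree, because for unit vectors ||a − b||² = 2 − 2·(a · b). A numpy-only check of that identity, no FAISS required:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384).astype("float32")
b = rng.normal(size=384).astype("float32")

# Normalize to unit length (what faiss.normalize_L2 does in place).
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity,
# so sorting by L2 distance orders results the same as cosine similarity.
assert np.isclose(squared_l2, 2.0 - 2.0 * cosine, atol=1e-5)
```

In practice this means the tutorial's unnormalized setup works, but normalizing before `index.add()` and before each query makes the distances directly interpretable as cosine similarity.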
Step 3: Building the FAISS Index
```python
import faiss

# Create a FAISS index
dimension = embeddings.shape[1]  # 384
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print(f"Index contains {index.ntotal} vectors")
```

Step 4: Retrieval Function
```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Retrieve the top-k most relevant chunks for a query."""
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, k)
    return [chunks[i] for i in indices[0]]
```

Step 5: Generation with Claude
```python
import anthropic

client = anthropic.Anthropic()

def ask(question: str) -> str:
    """Answer a question using RAG."""
    # Retrieve relevant context
    context_chunks = retrieve(question, k=3)
    context = "\n\n---\n\n".join(context_chunks)

    # Generate answer
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {question}"""
        }]
    )
    return message.content[0].text
```

Putting It All Together
```python
# Ask questions!
answer = ask("What is the main advantage of transformers over RNNs?")
print(answer)
```

To improve retrieval quality, experiment with: different chunk sizes, embedding models, reranking retrieved results, and hybrid search (combining semantic + keyword search).
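To make the last idea concrete, here is a minimal, hypothetical hybrid-scoring sketch. The names `keyword_score`, `hybrid_score`, and the `alpha` weight are illustrative, not part of FAISS or any library; a real system would use BM25 for the keyword side:

```python
def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that literally appear in the chunk."""
    terms = set(query.lower().split())
    words = set(chunk.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.7) -> float:
    """Blend semantic and keyword scores; alpha weights the semantic side."""
    return alpha * semantic + (1 - alpha) * keyword

# Two chunks with the same (made-up) semantic score of 0.5:
# keyword overlap breaks the tie in favor of the first chunk.
query = "transformer attention"
candidates = ["attention is all you need", "recurrent networks process sequences"]
scores = [hybrid_score(0.5, keyword_score(query, c)) for c in candidates]
```

The blend matters most when semantic scores are close: exact-term matches (product names, error codes) are where pure embedding search tends to miss.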
Next Steps
- Add persistent storage with `faiss.write_index()` / `faiss.read_index()`
- Implement metadata filtering (filter by date, source, etc.)
- Add a web interface with Gradio or Streamlit
- Try hybrid search combining FAISS with BM25 keyword search
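One caveat on the first item: `faiss.write_index()` saves only the vectors, so the `chunks` list must be persisted separately or retrieved indices can no longer be mapped back to text. A standard-library sketch of that half (the file names here are arbitrary):

```python
import json
import os
import tempfile

chunks = ["first chunk", "second chunk"]  # stand-in for the real chunk list

# Save the chunks next to where you'd call faiss.write_index(index, "kb.index").
path = os.path.join(tempfile.gettempdir(), "chunks.json")
with open(path, "w") as f:
    json.dump(chunks, f)

# On startup, reload them alongside faiss.read_index("kb.index").
with open(path) as f:
    restored = json.load(f)

assert restored == chunks
```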
Key Takeaways
- RAG pipelines are straightforward to build with modern tools
- The quality of your chunking strategy directly impacts retrieval quality
- Local embedding models like `sentence-transformers` work well and are free
- Claude excels at synthesizing retrieved context into coherent answers