How To Build a RAG Pipeline
A step-by-step guide to building a Retrieval-Augmented Generation pipeline from scratch using Python, FAISS, and Claude.
In this tutorial, we'll build a complete RAG pipeline that can answer questions about any collection of documents. We'll use FAISS for vector storage and Claude for generation.
What We're Building
By the end of this tutorial, you'll have a working system that:
- Loads and chunks documents
- Creates embeddings and stores them in FAISS
- Retrieves relevant chunks for a query
- Generates answers using Claude with the retrieved context
Setup
First, install the required packages:
```
pip install anthropic faiss-cpu sentence-transformers
```

Step 1: Document Loading and Chunking
We need to split our documents into manageable chunks:
```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks
```
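Before chunking real documents, it can help to sanity-check the overlap arithmetic on a toy string. The snippet below repeats the function so it runs standalone; with `chunk_size=4` and `overlap=1`, consecutive chunks share one character and the stride is `chunk_size - overlap = 3`:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

# Each chunk starts 3 characters after the previous one (stride = 4 - 1),
# so the last character of one chunk reappears as the first of the next.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij', 'j']
```

Note that `overlap` must be smaller than `chunk_size`, otherwise `start` never advances and the loop runs forever.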
```python
# Load your documents
with open("knowledge_base.txt", "r") as f:
    text = f.read()

chunks = chunk_text(text)
print(f"Created {len(chunks)} chunks")
```

Step 2: Creating Embeddings
We'll use sentence-transformers for free, local embeddings:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed all chunks
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")
```

The `all-MiniLM-L6-v2` model produces 384-dimensional embeddings and runs fast on CPU. For production, consider larger models like `all-mpnet-base-v2` for better quality.
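One subtlety worth knowing before building the index: `IndexFlatL2` (used in the next step) ranks by squared Euclidean distance, while embedding similarity is usually discussed in terms of cosine similarity. If you L2-normalize the embeddings first (FAISS provides `faiss.normalize_L2` for this), the two rankings agree, because for unit vectors ||a − b||² = 2 − 2·(a · b). A numpy-only check of that identity, no FAISS required:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384).astype("float32")
b = rng.normal(size=384).astype("float32")

# Normalize to unit length (what faiss.normalize_L2 does in place).
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity,
# so sorting by L2 distance orders results the same as cosine similarity.
assert np.isclose(squared_l2, 2.0 - 2.0 * cosine, atol=1e-5)
```

In practice this means the tutorial's unnormalized setup works, but normalizing before `index.add()` and before each query makes the distances directly interpretable as cosine similarity.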
Step 3: Building the FAISS Index
```python
import faiss

# Create a FAISS index
dimension = embeddings.shape[1]  # 384
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print(f"Index contains {index.ntotal} vectors")
```

Step 4: Retrieval Function
```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Retrieve the top-k most relevant chunks for a query."""
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, k)
    return [chunks[i] for i in indices[0]]
```

Step 5: Generation with Claude
```python
import anthropic

client = anthropic.Anthropic()

def ask(question: str) -> str:
    """Answer a question using RAG."""
    # Retrieve relevant context
    context_chunks = retrieve(question, k=3)
    context = "\n\n---\n\n".join(context_chunks)

    # Generate answer
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {question}"""
        }]
    )
    return message.content[0].text
```

Putting It All Together
```python
# Ask questions!
answer = ask("What is the main advantage of transformers over RNNs?")
print(answer)
```

To improve retrieval quality, experiment with: different chunk sizes, embedding models, reranking retrieved results, and hybrid search (combining semantic + keyword search).
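To make the last idea concrete, here is a minimal, hypothetical hybrid-scoring sketch. The names `keyword_score`, `hybrid_score`, and the `alpha` weight are illustrative, not part of FAISS or any library; a real system would use BM25 for the keyword side:

```python
def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that literally appear in the chunk."""
    terms = set(query.lower().split())
    words = set(chunk.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.7) -> float:
    """Blend semantic and keyword scores; alpha weights the semantic side."""
    return alpha * semantic + (1 - alpha) * keyword

# Two chunks with the same (made-up) semantic score of 0.5:
# keyword overlap breaks the tie in favor of the first chunk.
query = "transformer attention"
candidates = ["attention is all you need", "recurrent networks process sequences"]
scores = [hybrid_score(0.5, keyword_score(query, c)) for c in candidates]
```

The blend matters most when semantic scores are close: exact-term matches (product names, error codes) are where pure embedding search tends to miss.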
Next Steps
- Add persistent storage with `faiss.write_index()` / `faiss.read_index()`
- Implement metadata filtering (filter by date, source, etc.)
- Add a web interface with Gradio or Streamlit
- Try hybrid search combining FAISS with BM25 keyword search
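One caveat on the first item: `faiss.write_index()` saves only the vectors, so the `chunks` list must be persisted separately or retrieved indices can no longer be mapped back to text. A standard-library sketch of that half (the file names here are arbitrary):

```python
import json
import os
import tempfile

chunks = ["first chunk", "second chunk"]  # stand-in for the real chunk list

# Save the chunks next to where you'd call faiss.write_index(index, "kb.index").
path = os.path.join(tempfile.gettempdir(), "chunks.json")
with open(path, "w") as f:
    json.dump(chunks, f)

# On startup, reload them alongside faiss.read_index("kb.index").
with open(path) as f:
    restored = json.load(f)

assert restored == chunks
```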
Key Takeaways
- RAG pipelines are straightforward to build with modern tools
- The quality of your chunking strategy directly impacts retrieval quality
- Local embedding models like `sentence-transformers` work well and are free
- Claude excels at synthesizing retrieved context into coherent answers