How To · Intermediate

How To Build a RAG Pipeline

A step-by-step guide to building a Retrieval-Augmented Generation pipeline from scratch using Python, FAISS, and Claude.

February 24, 2026 · 3 min read
Prerequisites: Basic Python, Understanding of RAG concepts

In this tutorial, we'll build a complete RAG pipeline that can answer questions about any collection of documents. We'll use FAISS for vector storage and Claude for generation.

What We're Building

By the end of this tutorial, you'll have a working system that:

  1. Loads and chunks documents
  2. Creates embeddings and stores them in FAISS
  3. Retrieves relevant chunks for a query
  4. Generates answers using Claude with the retrieved context

Setup

First, install the required packages:

pip install anthropic faiss-cpu sentence-transformers

Step 1: Document Loading and Chunking

We need to split our documents into manageable chunks:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # avoid a final chunk that is pure overlap of the previous one
        start = end - overlap
    return chunks
 
# Load your documents
with open("knowledge_base.txt", "r") as f:
    text = f.read()
 
chunks = chunk_text(text)
print(f"Created {len(chunks)} chunks")

Step 2: Creating Embeddings

We'll use sentence-transformers for free, local embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np
 
model = SentenceTransformer("all-MiniLM-L6-v2")
 
# Embed all chunks
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")

The all-MiniLM-L6-v2 model produces 384-dimensional embeddings and runs fast on CPU. For production, consider larger models like all-mpnet-base-v2 for better quality.

Step 3: Building the FAISS Index

import faiss
 
# Create a FAISS index
dimension = embeddings.shape[1]  # 384
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
 
print(f"Index contains {index.ntotal} vectors")
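A note on the distance metric: IndexFlatL2 ranks by Euclidean distance. A common alternative is to L2-normalize the embeddings (FAISS provides faiss.normalize_L2 and an inner-product index, faiss.IndexFlatIP), after which inner product equals cosine similarity. The sketch below demonstrates the underlying math with toy 2-D vectors, not real embeddings:

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product == cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

a = np.array([[3.0, 4.0]])
b = np.array([[6.0, 8.0]])  # parallel to a, but twice the magnitude
an, bn = normalize_rows(a), normalize_rows(b)
# After normalization, the dot product ignores magnitude entirely
print(float(an @ bn.T))  # → 1.0
```

Which metric works better depends on the embedding model; for MiniLM-family models both tend to give similar rankings in practice.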

Step 4: Retrieval Function

def retrieve(query: str, k: int = 3) -> list[str]:
    """Retrieve the top-k most relevant chunks for a query."""
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, k)
    return [chunks[i] for i in indices[0]]

Step 5: Generation with Claude

import anthropic
 
client = anthropic.Anthropic()
 
def ask(question: str) -> str:
    """Answer a question using RAG."""
    # Retrieve relevant context
    context_chunks = retrieve(question, k=3)
    context = "\n\n---\n\n".join(context_chunks)
 
    # Generate answer
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.
If the context doesn't contain enough information, say so.
 
Context:
{context}
 
Question: {question}"""
        }]
    )
    return message.content[0].text

Putting It All Together

# Ask questions!
answer = ask("What is the main advantage of transformers over RNNs?")
print(answer)

To improve retrieval quality, experiment with: different chunk sizes, embedding models, reranking retrieved results, and hybrid search (combining semantic + keyword search).
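As one illustration of reranking, here is a minimal sketch that blends the FAISS distance with simple keyword overlap (a crude stand-in for hybrid search; the function name, the alpha weight, and the scoring formula are all illustrative choices, not a standard algorithm):

```python
import numpy as np

def hybrid_rerank(query: str, candidates: list[str], distances: list[float],
                  alpha: float = 0.7) -> list[str]:
    """Re-order candidates by blending semantic distance with keyword overlap."""
    query_terms = set(query.lower().split())
    scores = []
    for chunk, dist in zip(candidates, distances):
        chunk_terms = set(chunk.lower().split())
        # Fraction of query terms that literally appear in the chunk
        keyword_score = len(query_terms & chunk_terms) / max(len(query_terms), 1)
        # Map L2 distance (smaller is better) to a similarity in (0, 1]
        semantic_score = 1.0 / (1.0 + dist)
        scores.append(alpha * semantic_score + (1 - alpha) * keyword_score)
    order = np.argsort(scores)[::-1]  # highest combined score first
    return [candidates[i] for i in order]
```

You would call this on the candidates and distances returned by index.search, fetching more than k candidates first so the reranker has something to reorder. For serious use, a cross-encoder reranker will outperform keyword heuristics like this.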

Next Steps

  • Add persistent storage with faiss.write_index() / faiss.read_index()
  • Implement metadata filtering (filter by date, source, etc.)
  • Add a web interface with Gradio or Streamlit
  • Try hybrid search combining FAISS with BM25 keyword search

Key Takeaways

  • RAG pipelines are straightforward to build with modern tools
  • The quality of your chunking strategy directly impacts retrieval quality
  • Local embedding models like sentence-transformers work well and are free
  • Claude excels at synthesizing retrieved context into coherent answers
