Fully local retrieval-augmented generation pipeline
Replaces DGX Spark: RAG in AI Workbench
rag, langchain
Basic idea
Retrieval-Augmented Generation (RAG) solves the problem of LLMs not knowing your private data. Instead of fine-tuning, which is expensive, slow, and bakes knowledge into weights that go stale, RAG retrieves relevant document passages at query time and injects them into the LLM's context window. The pipeline: embed your documents as dense vectors → store them in a vector database → at query time, embed the query → find the k nearest document vectors → send those chunks as context to the LLM → get a grounded answer with citations. Updating your knowledge base is as simple as re-embedding new documents; no retraining needed. The entire pipeline runs locally on your Mac using Ollama for both embeddings and the LLM.
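The embed → store → retrieve loop can be sketched in a few lines of plain Python. This is a toy stand-in, not the real pipeline: the bag-of-words "embedding" below replaces nomic-embed-text, and a Python list replaces ChromaDB, purely to show the mechanics.

```python
from collections import Counter
from math import sqrt

# Toy embedding: a bag-of-words count vector. (The real pipeline uses
# dense 768-d vectors from nomic-embed-text via Ollama.)
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy "vector store": a list of (vector, chunk) pairs standing in for ChromaDB.
docs = [
    "Ollama serves local language models over HTTP",
    "ChromaDB persists embeddings on disk",
    "Paris is the capital of France",
]
store = [(embed(d), d) for d in docs]

def retrieve(query: str, k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(qv, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# Retrieved chunks become the grounding context in the LLM prompt.
context = "\n".join(retrieve("which database persists embeddings"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```

The real pipeline swaps each piece for a production component (Ollama embeddings, ChromaDB, qwen2.5:7b), but the data flow is identical.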
What you'll accomplish
A fully local RAG pipeline: documents loaded from text or PDF files, split into overlapping chunks, embedded with nomic-embed-text via Ollama, stored in ChromaDB on disk, and queried with semantic search. Answers come from qwen2.5:7b with source document citations. The vector store persists between sessions: you embed once and query repeatedly.
What to know before starting
Text embeddings: Dense floating-point vectors (768 dimensions for nomic-embed-text) where semantically similar texts are geometrically close. "The cat sat on the mat" and "A feline rested on a rug" produce vectors with high cosine similarity (~0.9), even though they share no words. This is the magic that makes semantic search work.
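Cosine similarity is just the normalized dot product. The 3-d vectors below are hypothetical stand-ins for real 768-d embeddings, chosen so the "cat/feline" pair points in nearly the same direction:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical 3-d "embeddings" (real nomic-embed-text vectors are 768-d):
cat_sat = [0.9, 0.1, 0.3]        # "The cat sat on the mat"
feline_rested = [0.85, 0.15, 0.35]  # "A feline rested on a rug"
stock_market = [0.1, 0.95, 0.2]  # an unrelated sentence

print(cosine(cat_sat, feline_rested))  # close to 1.0: same meaning
print(cosine(cat_sat, stock_market))   # much lower: different topic
```

Note that similarity here comes entirely from vector direction, not shared words, which is why paraphrases score high.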
Vector database: ChromaDB stores your document vectors and supports approximate nearest-neighbor (ANN) search. Given a query vector, it finds the k document vectors with highest cosine similarity in milliseconds, even across millions of documents.
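The operation an ANN index accelerates is exact top-k search, which can be stated in a few lines. This sketch uses tiny hypothetical 2-d vectors and assumes they are pre-normalized, so the dot product equals cosine similarity; ChromaDB's index gives approximately these results without scanning every vector.

```python
import heapq

# Tiny in-memory "index" of pre-normalized vectors (hypothetical data;
# real embeddings are 768-d and live in ChromaDB on disk).
index = {
    "doc1": [1.0, 0.0],
    "doc2": [0.8, 0.6],
    "doc3": [0.0, 1.0],
}

def dot(a: list[float], b: list[float]) -> float:
    # For unit-length vectors, dot product == cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def top_k(query: list[float], k: int) -> list[str]:
    # Brute-force exact k-NN: score every stored vector, keep the best k.
    return heapq.nlargest(k, index, key=lambda doc_id: dot(query, index[doc_id]))

top_k([1.0, 0.0], k=2)  # doc1 (identical direction), then doc2
```

Brute force is O(n) per query; ANN indexes trade a little recall for sublinear query time, which is what makes million-document stores practical.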
Chunking and why it matters: LLMs have context windows (e.g., 8k tokens for qwen2.5:7b). You can't fit a 100-page PDF. Chunks of 1000 tokens balance two competing concerns: large chunks give the LLM more context per retrieval, but small chunks give more precise retrieval (less noise). Overlapping chunks (200-token overlap) prevent losing context at chunk boundaries.
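A sliding-window splitter with the 1000/200 sizes above can be sketched as follows. This is a simplified token-window version; the actual pipeline would more likely use LangChain's text splitters, which also respect sentence and paragraph boundaries.

```python
def chunk(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Split a token list into windows of `size`, with `overlap` tokens
    shared between consecutive chunks so sentences that straddle a
    boundary appear whole in at least one chunk."""
    step = size - overlap  # advance 800 tokens per chunk
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# A hypothetical 2200-token document yields three overlapping chunks:
tokens = [f"t{i}" for i in range(2200)]
chunks = chunk(tokens)
len(chunks)  # 3 chunks: tokens 0-999, 800-1799, 1600-2199
```

Each chunk's last 200 tokens reappear at the start of the next one, which is exactly the boundary-loss protection the overlap buys.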
k-nearest-neighbor retrieval: At query time, the top-k most similar chunks (typically k=3 to 5) are retrieved and concatenated into the LLM prompt. If k is too small, you miss relevant context; if too large, you dilute the context window with noise.
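The concatenation step is plain string assembly. The chunks and filenames below are made up for illustration (in the real pipeline they come back from ChromaDB with their source metadata), but the prompt shape is representative of how grounded, citable answers are requested:

```python
# Hypothetical top-k retrieval results: (source file, chunk text) pairs.
retrieved = [
    ("report.pdf", "Q3 revenue grew 12% year over year."),
    ("report.pdf", "Operating margin held at 21%."),
    ("notes.txt", "Board approved the new hiring plan."),
]

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # Tag each chunk with its source so the LLM can cite filenames.
    context = "\n\n".join(f"[{src}] {text}" for src, text in chunks)
    return (
        "Answer the question using only the context below. "
        "Cite sources by filename.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("How did revenue change in Q3?", retrieved)
```

This assembled string is what gets sent to qwen2.5:7b; the k trade-off from the text shows up directly as how many `[source]` blocks land in the context.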
Why a dedicated embedding model: `nomic-embed-text` is a 137M-parameter model optimized specifically for embedding quality. A general-purpose 7B model can also embed, but nomic-embed-text is ~50× smaller (faster) and produces embeddings that rank comparably on retrieval benchmarks.
Prerequisites
โข Ollama installed and running (`ollama serve`)
โข `nomic-embed-text` and `qwen2.5:7b` pulled
โข Python 3.10+
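Assuming Ollama is already installed, the prerequisites above amount to commands along these lines (the rollback section names `chromadb` and `langchain`; the additional `langchain-community` and `pypdf` packages are assumptions for the Ollama integration and PDF loading):

```shell
# Pull the two models named above
ollama pull nomic-embed-text
ollama pull qwen2.5:7b

# Python dependencies (langchain-community and pypdf are assumed extras)
pip install langchain langchain-community chromadb pypdf
```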
Time & risk
Duration:: 30 minutes setup; embedding time depends on document volume (~1 min per 100 pages)
Risk level:: Low — reads local files, writes to a local `chroma_db/` directory
Rollback:: Delete `chroma_db/` directory and `pip uninstall chromadb langchain`