Fully local retrieval-augmented generation pipeline
Replaces DGX Spark: RAG in AI Workbench
rag, langchain
Basic idea
Retrieval-Augmented Generation (RAG) solves the problem of LLMs not knowing your private data. Instead of fine-tuning, which is expensive, slow, and bakes knowledge into weights that go stale, RAG retrieves relevant document passages at query time and injects them into the LLM's context window. The pipeline: embed your documents as dense vectors → store them in a vector database → at query time, embed the query → find the k nearest document vectors → send those chunks as context to the LLM → get a grounded answer with citations. Updating your knowledge base is as simple as re-embedding new documents; no retraining needed. The entire pipeline runs locally on your Mac using Ollama for both embeddings and the LLM.
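The embed → store → retrieve loop can be sketched in a few lines of plain Python. This is a toy stand-in, not the real pipeline: the bag-of-words "embedding" below replaces nomic-embed-text, and a Python list replaces ChromaDB, purely to show the mechanics.

```python
from collections import Counter
from math import sqrt

# Toy embedding: a bag-of-words count vector. (The real pipeline uses
# dense 768-d vectors from nomic-embed-text via Ollama.)
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy "vector store": a list of (vector, chunk) pairs standing in for ChromaDB.
docs = [
    "Ollama serves local language models over HTTP",
    "ChromaDB persists embeddings on disk",
    "Paris is the capital of France",
]
store = [(embed(d), d) for d in docs]

def retrieve(query: str, k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(qv, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# Retrieved chunks become the grounding context in the LLM prompt.
context = "\n".join(retrieve("which database persists embeddings"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```

The real pipeline swaps each piece for a production component (Ollama embeddings, ChromaDB, qwen2.5:7b), but the data flow is identical.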
What you'll accomplish
A fully local RAG pipeline: documents loaded from text or PDF files, split into overlapping chunks, embedded with nomic-embed-text via Ollama, stored in ChromaDB on disk, and queried with semantic search. Answers come from qwen2.5:7b with source document citations. The vector store persists between sessions: you embed once and query repeatedly.
What to know before starting
Text embeddings: Dense floating-point vectors (768 dimensions for nomic-embed-text) where semantically similar texts are geometrically close. "The cat sat on the mat" and "A feline rested on a rug" produce vectors with high cosine similarity (~0.9), even though they share no words. This is the magic that makes semantic search work.
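Cosine similarity is just the normalized dot product. The 3-d vectors below are hypothetical stand-ins for real 768-d embeddings, chosen so the "cat/feline" pair points in nearly the same direction:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical 3-d "embeddings" (real nomic-embed-text vectors are 768-d):
cat_sat = [0.9, 0.1, 0.3]        # "The cat sat on the mat"
feline_rested = [0.85, 0.15, 0.35]  # "A feline rested on a rug"
stock_market = [0.1, 0.95, 0.2]  # an unrelated sentence

print(cosine(cat_sat, feline_rested))  # close to 1.0: same meaning
print(cosine(cat_sat, stock_market))   # much lower: different topic
```

Note that similarity here comes entirely from vector direction, not shared words, which is why paraphrases score high.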
Vector database: ChromaDB stores your document vectors and supports approximate nearest-neighbor (ANN) search. Given a query vector, it finds the k document vectors with highest cosine similarity in milliseconds, even across millions of documents.
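The operation an ANN index accelerates is exact top-k search, which can be stated in a few lines. This sketch uses tiny hypothetical 2-d vectors and assumes they are pre-normalized, so the dot product equals cosine similarity; ChromaDB's index gives approximately these results without scanning every vector.

```python
import heapq

# Tiny in-memory "index" of pre-normalized vectors (hypothetical data;
# real embeddings are 768-d and live in ChromaDB on disk).
index = {
    "doc1": [1.0, 0.0],
    "doc2": [0.8, 0.6],
    "doc3": [0.0, 1.0],
}

def dot(a: list[float], b: list[float]) -> float:
    # For unit-length vectors, dot product == cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def top_k(query: list[float], k: int) -> list[str]:
    # Brute-force exact k-NN: score every stored vector, keep the best k.
    return heapq.nlargest(k, index, key=lambda doc_id: dot(query, index[doc_id]))

top_k([1.0, 0.0], k=2)  # doc1 (identical direction), then doc2
```

Brute force is O(n) per query; ANN indexes trade a little recall for sublinear query time, which is what makes million-document stores practical.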
Chunking and why it matters: LLMs have context windows (e.g., 8k tokens for qwen2.5:7b). You can't fit a 100-page PDF. Chunks of 1000 tokens balance two competing concerns: large chunks give the LLM more context per retrieval, but small chunks give more precise retrieval (less noise). Overlapping chunks (200-token overlap) prevent losing context at chunk boundaries.
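A sliding-window splitter with the 1000/200 sizes above can be sketched as follows. This is a simplified token-window version; the actual pipeline would more likely use LangChain's text splitters, which also respect sentence and paragraph boundaries.

```python
def chunk(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Split a token list into windows of `size`, with `overlap` tokens
    shared between consecutive chunks so sentences that straddle a
    boundary appear whole in at least one chunk."""
    step = size - overlap  # advance 800 tokens per chunk
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# A hypothetical 2200-token document yields three overlapping chunks:
tokens = [f"t{i}" for i in range(2200)]
chunks = chunk(tokens)
len(chunks)  # 3 chunks: tokens 0-999, 800-1799, 1600-2199
```

Each chunk's last 200 tokens reappear at the start of the next one, which is exactly the boundary-loss protection the overlap buys.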
k-nearest-neighbor retrieval: At query time, the top-k most similar chunks (typically k=3 to 5) are retrieved and concatenated into the LLM prompt. If k is too small, you miss relevant context; if too large, you dilute the context window with noise.
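The concatenation step is plain string assembly. The chunks and filenames below are made up for illustration (in the real pipeline they come back from ChromaDB with their source metadata), but the prompt shape is representative of how grounded, citable answers are requested:

```python
# Hypothetical top-k retrieval results: (source file, chunk text) pairs.
retrieved = [
    ("report.pdf", "Q3 revenue grew 12% year over year."),
    ("report.pdf", "Operating margin held at 21%."),
    ("notes.txt", "Board approved the new hiring plan."),
]

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # Tag each chunk with its source so the LLM can cite filenames.
    context = "\n\n".join(f"[{src}] {text}" for src, text in chunks)
    return (
        "Answer the question using only the context below. "
        "Cite sources by filename.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("How did revenue change in Q3?", retrieved)
```

This assembled string is what gets sent to qwen2.5:7b; the k trade-off from the text shows up directly as how many `[source]` blocks land in the context.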
Why a dedicated embedding model: `nomic-embed-text` is a 137M-parameter model optimized specifically for embedding quality. A general-purpose 7B model can also embed, but nomic-embed-text is ~50× smaller (faster) and produces embeddings that rank comparably on retrieval benchmarks.
Prerequisites
โข Ollama installed and running (`ollama serve`)
โข `nomic-embed-text` and `qwen2.5:7b` pulled
โข Python 3.10+
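Assuming Ollama is already installed, the prerequisites above amount to commands along these lines (the rollback section names `chromadb` and `langchain`; the additional `langchain-community` and `pypdf` packages are assumptions for the Ollama integration and PDF loading):

```shell
# Pull the two models named above
ollama pull nomic-embed-text
ollama pull qwen2.5:7b

# Python dependencies (langchain-community and pypdf are assumed extras)
pip install langchain langchain-community chromadb pypdf
```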
Time & risk
Duration:: 30 minutes setup; embedding time depends on document volume (~1 min per 100 pages)
Risk level:: Low — reads local files, writes to a local `chroma_db/` directory
Rollback:: Delete `chroma_db/` directory and `pip uninstall chromadb langchain`