Extract knowledge triples with local LLMs and graph DBs
Replaces DGX Spark: Text to Knowledge Graph
knowledge graph, nlp
Basic idea
A knowledge graph represents information as (subject, predicate, object) triples, for example (Apple, founded_by, Steve Jobs) or (iPhone, released_in, 2007). Extracting these triples from unstructured text with a local LLM converts free-form documents into a structured, queryable graph where you can ask "what are all the things connected to Apple, two hops away?"
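The triple-and-traversal idea can be sketched in plain Python before any LLM or database is involved. This is a minimal illustration (the entities and relations are made up for the example): triples become an adjacency list, and a two-hop question is a two-step frontier expansion.

```python
# Minimal sketch: triples as Python tuples, plus a two-hop neighborhood
# query over them. Entity and relation names are illustrative.
from collections import defaultdict

triples = [
    ("Apple", "founded_by", "Steve Jobs"),
    ("Apple", "released", "iPhone"),
    ("iPhone", "released_in", "2007"),
    ("Steve Jobs", "co_founded", "Pixar"),
]

# Adjacency list: subject -> [(predicate, object), ...]
graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))

def neighbors_within(start, hops):
    """Return every node reachable from `start` in at most `hops` edges."""
    frontier, seen = {start}, set()
    for _ in range(hops):
        frontier = {obj for node in frontier for _, obj in graph[node]} - seen - {start}
        seen |= frontier
    return seen

print(sorted(neighbors_within("Apple", 2)))
# ['2007', 'Pixar', 'Steve Jobs', 'iPhone']
```

Everything connected to Apple within two hops falls out of two set expansions; this is exactly the query shape graph databases are built to answer at scale.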
On your Mac, Ollama provides the LLM for extraction. NetworkX stores the graph in memory for immediate visualization, and Neo4j (optional, via Docker) provides persistent storage with a query language built for graph traversal.
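A sketch of the extraction step against Ollama's local REST API (default port 11434). The prompt wording and the expected JSON shape are assumptions for illustration; only the parser runs here, since it needs no model. Ollama's `format: "json"` option asks the server to constrain output to valid JSON.

```python
# Sketch: extract triples via Ollama's /api/generate endpoint, assuming
# Ollama is serving llama3.1:8b locally. Prompt and JSON shape are
# illustrative assumptions.
import json
import urllib.request

PROMPT = """Extract (subject, predicate, object) triples from the text below.
Respond with a JSON list of 3-element lists and nothing else.

Text: {text}"""

def extract_triples(text: str, model: str = "llama3.1:8b"):
    """Call the local Ollama server and parse the triples it returns."""
    payload = json.dumps({
        "model": model,
        "prompt": PROMPT.format(text=text),
        "stream": False,
        "format": "json",  # ask Ollama to emit valid JSON only
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = json.load(resp)["response"]
    return parse_triples(raw)

def parse_triples(raw: str):
    """Keep only well-formed 3-element triples; never trust LLM output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return [tuple(t) for t in data if isinstance(t, list) and len(t) == 3]

# The parser is pure, so it can be checked without a running model:
sample = '[["Apple", "founded_by", "Steve Jobs"], ["bad row"]]'
print(parse_triples(sample))  # [('Apple', 'founded_by', 'Steve Jobs')]
```

Keeping parsing separate from the network call makes the fragile part (LLM output) easy to validate and the malformed rows easy to drop.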
What you'll accomplish
A Python pipeline that takes any text document, extracts knowledge triples using a local LLM, and stores them as a directed graph. You'll produce an interactive HTML visualization (drag nodes, zoom, inspect edges) and optionally export the graph to Neo4j for persistent, Cypher-queryable storage.
What to know before starting
Knowledge graphs: Nodes are entities (Apple, Steve Jobs, 1976). Edges are typed, directional relationships (founded_by, acquired, competes_with). Direction matters: Apple --[founded_by]--> Steve Jobs is not the same as Steve Jobs --[founded_by]--> Apple.
Triple extraction limitations: LLMs hallucinate relationships. Always spot-check extracted triples against the source text before treating them as facts. Critical applications need human review.
Graph databases vs relational: Relational databases optimize for row-based queries. Graph databases optimize for traversal: "find everything connected to node X within N hops" is a single query in Cypher but requires complex JOINs in SQL.
NetworkX: Python's in-memory graph library. Fast for analysis and visualization, but data is lost when the process ends. Use Neo4j if you need persistence.
Cypher: Neo4j's query language. `MATCH (a)-[:founded_by]->(b) RETURN a, b` is the graph equivalent of a SQL SELECT with JOIN.
Few-shot prompting: Including 2-3 examples in the prompt dramatically improves JSON formatting consistency from the LLM.
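A sketch of what that few-shot prompt construction can look like. The examples and wording are assumptions; the point is that showing the model two worked input/output pairs teaches it the exact JSON shape to imitate.

```python
# Sketch: build a few-shot extraction prompt. Example texts and triples
# are illustrative assumptions.
import json

FEW_SHOT = [
    ("Apple was founded by Steve Jobs in 1976.",
     [["Apple", "founded_by", "Steve Jobs"], ["Apple", "founded_in", "1976"]]),
    ("The iPhone was released in 2007.",
     [["iPhone", "released_in", "2007"]]),
]

def build_prompt(text: str) -> str:
    """Prepend worked examples so the model imitates their JSON format."""
    parts = ["Extract (subject, predicate, object) triples as a JSON list."]
    for example_text, example_triples in FEW_SHOT:
        parts.append(f"Text: {example_text}")
        parts.append(f"Triples: {json.dumps(example_triples)}")
    parts.append(f"Text: {text}")
    parts.append("Triples:")  # end at the slot the model should fill
    return "\n\n".join(parts)

print(build_prompt("Pixar was co-founded by Steve Jobs."))
```

Ending the prompt at `Triples:` puts the model right at the completion slot, which further nudges it toward emitting only the JSON list.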
Prerequisites
• Ollama running with `llama3.1:8b` pulled (or `qwen2.5:7b` as fallback)
• Python 3.10+
• Docker Desktop (optional, for Neo4j persistent storage)
Time & risk
Duration:: 30 minutes
Risk level:: Low; pure Python, no system changes, no data leaves your machine