Install and run LLMs locally with a single command
Replaces DGX Spark: Ollama
inference
Basic idea
Ollama packages llama.cpp with an automatic model manager, a REST API server, and a CLI into a single installable binary. You do not manage model files, choose quantization formats, or configure GPU settings manually; Ollama handles all of that.
Under the hood, Ollama uses llama.cpp's Metal backend to dispatch matrix multiplications to your Mac's GPU. On NVIDIA hardware (Linux/Windows), the same workload runs through CUDA. On Apple Silicon it goes through Metal, Apple's GPU compute framework, because Apple Silicon has no CUDA support. Ollama abstracts this entirely, so you interact with the same CLI and API regardless of hardware.
The practical benefit: you run `ollama run qwen2.5:7b` and get a working chatbot. You don't need to know what a GGUF file is, how many GPU layers to offload, or which quantization format to choose.
What you'll accomplish
After following this playbook you will have:
• Ollama running as a background service on port 11434
• `qwen2.5:7b` downloaded and cached locally (~4.7 GB on disk)
• A working interactive CLI chat session with the model
• A working `curl` call to the REST API confirming the model responds
• A rough tokens/sec baseline from the built-in benchmark
On an M2 Pro with 16 GB RAM, qwen2.5:7b at Q4_K_M quantization runs at roughly 40–55 tokens/sec.
What to know before starting
What LLMs are:: Large language models are next-token predictors. Given a sequence of text tokens, they predict the probability distribution of the next token and sample from it. "Generating text" is just doing this thousands of times in sequence. They are not databases and do not look things up; they predict plausible continuations based on patterns learned during training.
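The predict-then-sample loop above can be illustrated with a toy sketch. This is not a real model: the probability table is hand-written for one made-up context, purely to show what "predict a distribution over the next token and sample from it" means mechanically.

```python
import random

# Hand-written toy table: one context mapped to a probability
# distribution over candidate next tokens (a real LLM computes
# this distribution with a neural network, for any context).
NEXT_TOKEN_PROBS = {
    ("the", "sky", "is"): {"blue": 0.7, "clear": 0.2, "falling": 0.1},
}

def sample_next(context, rng):
    """Sample the next token from the context's distribution."""
    dist = NEXT_TOKEN_PROBS[tuple(context)]
    tokens = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Sampling many times shows the distribution: "blue" dominates,
# but the lower-probability tokens still appear sometimes.
rng = random.Random(0)
counts = {"blue": 0, "clear": 0, "falling": 0}
for _ in range(1000):
    counts[sample_next(["the", "sky", "is"], rng)] += 1
print(counts)
```

This is also why the same prompt can produce different answers on different runs: generation is sampling, not lookup.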
What quantization means:: A 7B-parameter model in full 32-bit float precision requires ~28 GB of RAM. Quantization reduces each weight from 32-bit to fewer bits (4-bit in Q4_K_M). The 7B model then fits in ~4.7 GB. Quality degrades slightly but is usually imperceptible for chat tasks. Ollama downloads Q4_K_M by default for most models.
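The memory figures above are simple arithmetic, sketched here as a back-of-envelope check: 4 bytes per weight for 32-bit floats versus an idealized 0.5 bytes per weight at 4-bit. The real Q4_K_M file is somewhat larger than the raw 4-bit figure (the ~4.7 GB in the text) because, as I understand the format, it also stores per-block scale factors and keeps some tensors at higher precision.

```python
# Back-of-envelope memory math for a 7B-parameter model.
PARAMS = 7_000_000_000

fp32_gb = PARAMS * 4 / 1e9    # 32-bit float: 4 bytes per weight
q4_gb = PARAMS * 0.5 / 1e9    # idealized 4-bit: 0.5 bytes per weight

print(f"fp32 weights: ~{fp32_gb:.0f} GB")   # ~28 GB
print(f"4-bit weights: ~{q4_gb:.1f} GB")    # ~3.5 GB raw, before format overhead
```

The gap between 3.5 GB (raw 4-bit) and 4.7 GB (actual download) is that format overhead.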
What an API server means:: `ollama serve` starts an HTTP server. Clients send JSON requests describing the model and messages; the server runs inference and returns JSON responses. This lets any app (Python scripts, web UIs, IDEs) use local models without embedding the inference engine themselves.
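A minimal sketch of such a client, using only the standard library. It targets Ollama's `/api/chat` endpoint with streaming disabled; the endpoint and response shape follow Ollama's documented API, but the network call itself is left commented out because it only works while `ollama serve` is running on port 11434.

```python
import json
import urllib.request

def build_chat_request(model, messages):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {"model": model, "messages": messages, "stream": False}

payload = build_chat_request(
    "qwen2.5:7b",
    [{"role": "user", "content": "Why is the sky blue?"}],
)
print(json.dumps(payload))

# With the server running, the request would look like:
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["message"]["content"])
```

With `"stream": False` the server returns one complete JSON object; by default it streams a sequence of partial responses instead, which clients must concatenate.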
What Metal is:: Metal is Apple's GPU compute and graphics framework. When llama.cpp (inside Ollama) runs a matrix multiplication, it dispatches it as a Metal shader to the GPU cores in your M-series chip. This is what makes inference fast; without Metal, every matrix operation would run on the CPU only.
Why unified memory matters:: On Apple Silicon, CPU and GPU share the same physical RAM pool. There is no separate "VRAM." A 16 GB M2 has 16 GB accessible to both CPU and GPU simultaneously. This means a 7B Q4 model loaded into RAM is already in GPU-accessible memory; no copying is required. On discrete GPUs (NVIDIA), models must be copied from system RAM to VRAM before inference can begin.