Install and run LLMs locally with a single command
Replaces DGX Spark: Ollama
inference
Basic idea
Ollama packages llama.cpp with an automatic model manager, a REST API server, and a CLI into a single installable binary. You do not manage model files, choose quantization formats, or configure GPU settings manually; Ollama handles all of that.
Under the hood, Ollama uses llama.cpp's Metal backend to dispatch matrix multiplications to your Mac's GPU. On NVIDIA hardware (Linux/Windows), the same workload runs through CUDA. On Apple Silicon it goes through Metal, Apple's GPU compute framework, because Apple Silicon has no CUDA support. Ollama abstracts this entirely, so you interact with the same CLI and API regardless of hardware.
The practical benefit: you run `ollama run qwen2.5:7b` and get a working chatbot. You don't need to know what a GGUF file is, how many GPU layers to offload, or which quantization format to choose.
What you'll accomplish
After following this playbook you will have:
• Ollama running as a background service on port 11434
• `qwen2.5:7b` downloaded and cached locally (~4.7 GB on disk)
• A working interactive CLI chat session with the model
• A working `curl` call to the REST API confirming the model responds
• A rough tokens/sec baseline from the built-in benchmark
On an M2 Pro with 16 GB RAM, qwen2.5:7b at Q4_K_M quantization runs at roughly 40–55 tokens/sec.
What to know before starting
What LLMs are:: Large language models are next-token predictors. Given a sequence of text tokens, they predict the probability distribution of the next token and sample from it. "Generating text" is just doing this thousands of times in sequence. They are not databases and do not look things up; they predict plausible continuations based on patterns learned during training.
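The predict-then-sample loop above can be illustrated with a toy sketch. This is not a real model: the probability table is hand-written for one made-up context, purely to show what "predict a distribution over the next token and sample from it" means mechanically.

```python
import random

# Hand-written toy table: one context mapped to a probability
# distribution over candidate next tokens (a real LLM computes
# this distribution with a neural network, for any context).
NEXT_TOKEN_PROBS = {
    ("the", "sky", "is"): {"blue": 0.7, "clear": 0.2, "falling": 0.1},
}

def sample_next(context, rng):
    """Sample the next token from the context's distribution."""
    dist = NEXT_TOKEN_PROBS[tuple(context)]
    tokens = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Sampling many times shows the distribution: "blue" dominates,
# but the lower-probability tokens still appear sometimes.
rng = random.Random(0)
counts = {"blue": 0, "clear": 0, "falling": 0}
for _ in range(1000):
    counts[sample_next(["the", "sky", "is"], rng)] += 1
print(counts)
```

This is also why the same prompt can produce different answers on different runs: generation is sampling, not lookup.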
What quantization means:: A 7B-parameter model in full 32-bit float precision requires ~28 GB of RAM. Quantization reduces each weight from 32-bit to fewer bits (4-bit in Q4_K_M). The 7B model then fits in ~4.7 GB. Quality degrades slightly but is usually imperceptible for chat tasks. Ollama downloads Q4_K_M by default for most models.
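The memory figures above are simple arithmetic, sketched here as a back-of-envelope check: 4 bytes per weight for 32-bit floats versus an idealized 0.5 bytes per weight at 4-bit. The real Q4_K_M file is somewhat larger than the raw 4-bit figure (the ~4.7 GB in the text) because, as I understand the format, it also stores per-block scale factors and keeps some tensors at higher precision.

```python
# Back-of-envelope memory math for a 7B-parameter model.
PARAMS = 7_000_000_000

fp32_gb = PARAMS * 4 / 1e9    # 32-bit float: 4 bytes per weight
q4_gb = PARAMS * 0.5 / 1e9    # idealized 4-bit: 0.5 bytes per weight

print(f"fp32 weights: ~{fp32_gb:.0f} GB")   # ~28 GB
print(f"4-bit weights: ~{q4_gb:.1f} GB")    # ~3.5 GB raw, before format overhead
```

The gap between 3.5 GB (raw 4-bit) and 4.7 GB (actual download) is that format overhead.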
What an API server means:: `ollama serve` starts an HTTP server. Clients send JSON requests describing the model and messages; the server runs inference and returns JSON responses. This lets any app (Python scripts, web UIs, IDEs) use local models without embedding the inference engine themselves.
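A minimal sketch of such a client, using only the standard library. It targets Ollama's `/api/chat` endpoint with streaming disabled; the endpoint and response shape follow Ollama's documented API, but the network call itself is left commented out because it only works while `ollama serve` is running on port 11434.

```python
import json
import urllib.request

def build_chat_request(model, messages):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {"model": model, "messages": messages, "stream": False}

payload = build_chat_request(
    "qwen2.5:7b",
    [{"role": "user", "content": "Why is the sky blue?"}],
)
print(json.dumps(payload))

# With the server running, the request would look like:
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["message"]["content"])
```

With `"stream": False` the server returns one complete JSON object; by default it streams a sequence of partial responses instead, which clients must concatenate.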
What Metal is:: Metal is Apple's GPU compute and graphics framework. When llama.cpp (inside Ollama) runs a matrix multiplication, it dispatches it as a Metal shader to the GPU cores in your M-series chip. This is what makes inference fast; without Metal, every matrix operation would run on the CPU only.
Why unified memory matters:: On Apple Silicon, CPU and GPU share the same physical RAM pool. There is no separate "VRAM." A 16 GB M2 has 16 GB accessible to both CPU and GPU simultaneously. This means a 7B Q4 model loaded into RAM is already in GPU-accessible memory; no copying is required. On discrete GPUs (NVIDIA), models must be copied from system RAM to VRAM before inference can begin.