Run GGUF models with first-class Apple Silicon Metal acceleration
Replaces the DGX Spark playbook (TRT-LLM / Nemotron) with llama.cpp
inference · metal
Basic idea
llama.cpp is the foundational C++ implementation of LLM inference that powers Ollama, LM Studio, and many other popular tools. When you run `ollama run qwen2.5:7b`, Ollama is invoking llama.cpp's Metal backend under the hood. Running llama.cpp directly removes that abstraction layer and gives you fine-grained control over every aspect of inference.
Why use llama.cpp directly instead of Ollama? Three reasons:
1. **Control:** You choose exactly how many GPU layers to offload (`-ngl`), what context size to allocate (`-c`), how many threads to use, and what sampling strategy to apply. Ollama picks sensible defaults; llama.cpp lets you tune for your specific hardware and use case.
2. **GGUF format:** llama.cpp uses GGUF, a single-file model format that packages weights and quantization metadata together. You can download individual GGUF files from Hugging Face and run them directly without a model manager. This is useful when you need a specific quantization variant that Ollama doesn't expose.
3. **Transparency:** You see exactly what the runtime is doing. The verbose output shows Metal initialization, layer allocation, memory usage, and tokens/sec, all in your terminal.
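A typical direct invocation exercising the control flags above might look like this (the model filename is a placeholder for any local GGUF file):

```shell
# -m: model file, -ngl: layers to offload to Metal, -c: context size,
# -t: CPU threads, --temp: sampling temperature
llama-cli -m ./model-q4_k_m.gguf -ngl 99 -c 8192 -t 8 --temp 0.7
```

Every one of these knobs is chosen by you rather than by a model manager; that is the whole point of running llama.cpp directly.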
On NVIDIA hardware, you'd use TensorRT-LLM or vLLM for maximum performance. On Apple Silicon, llama.cpp with Metal is the C++ equivalent: highly optimized for the hardware, maximum control.
What you'll accomplish
After following this playbook you will have:
• `llama-cli` installed and running interactive chat sessions via the command line
• `llama-server` running an OpenAI-compatible HTTP API on port 8080
• A downloaded GGUF model file running with full Metal GPU acceleration
• An understanding of the key flags so you can tune performance for your machine
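Once `llama-server` is up, its OpenAI-compatible endpoint can be exercised with a plain curl call (a sketch, assuming the default port 8080 and a model already loaded):

```shell
# POST to the OpenAI-style chat completions endpoint served by llama-server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}]}'
```

Because the API shape matches OpenAI's, existing OpenAI client libraries can be pointed at `http://localhost:8080/v1` with no code changes.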
What to know before starting
What GGUF format is:: GGUF (GPT-Generated Unified Format) is a single-file model format designed for llama.cpp. A GGUF file contains model weights, quantization metadata, tokenizer vocabulary, and model architecture information all in one file. You download one file and run it. Compare this to Hugging Face's multi-file safetensors format used by PyTorch and MLX.
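The single-file design is easy to verify yourself: every GGUF file begins with the 4-byte magic `GGUF`. A minimal sanity check (the path argument is whatever file you downloaded):

```python
def is_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

This is handy for catching truncated or mislabeled downloads before handing a file to `llama-cli`.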
What Metal GPU layers means:: A transformer model consists of multiple "layers" (a 7B model typically has 28–32 layers). Each layer is a set of matrix multiplications. The `-ngl N` flag tells llama.cpp to offload `N` layers to the Metal GPU. More layers on GPU = faster inference. If you set `-ngl 99` (more than the model has), all layers go to GPU. If you're short on RAM, reduce this number to leave some layers on the CPU and reduce peak memory usage.
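A rough, illustrative calculation for picking `-ngl` on a RAM-constrained machine, assuming a ~4.4 GB Q4_K_M 7B model spread evenly across 32 layers (real layers vary slightly in size):

```python
# Back-of-envelope estimate of GPU-resident weight memory per -ngl value
model_gb = 4.4        # approximate Q4_K_M 7B file size (assumption)
n_layers = 32         # typical layer count for a 7B model
per_layer_gb = model_gb / n_layers
gpu_layers = 16       # e.g. running with -ngl 16
print(f"~{per_layer_gb * gpu_layers:.1f} GB of weights on the GPU")
```

The real numbers depend on the model and quantization, but the linear relationship holds: halving `-ngl` roughly halves the GPU-resident weight memory.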
What quantization variants mean:: GGUF files come in many quantization levels, each a tradeoff between quality and size:
- Q2_K – 2-bit, aggressive compression; fits large models in small RAM, with noticeable quality loss
- Q4_K_M – 4-bit k-quant, medium variant; the standard recommendation, a good balance of quality and size
- Q5_K_M – 5-bit, noticeably higher quality than Q4 at ~20% more RAM
- Q8_0 – 8-bit, near-lossless quality; uses ~2x the RAM of Q4
- F16 – full 16-bit float, maximum quality; uses ~2x the RAM of Q8
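The size tradeoffs above follow directly from bits per weight. A sketch using ballpark bits-per-weight figures (the exact values vary by model architecture; these are assumptions for illustration):

```python
# Approximate bits-per-weight for common GGUF quants (illustrative)
bits_per_weight = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
                   "Q8_0": 8.5, "F16": 16.0}
params_billion = 7  # a 7B-parameter model

for name, bpw in bits_per_weight.items():
    size_gb = params_billion * bpw / 8  # bits -> bytes, in GB
    print(f"{name}: ~{size_gb:.1f} GB")
```

This is why Q8_0 lands at roughly twice the footprint of Q4_K_M, and F16 at roughly twice Q8_0.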
llama.cpp's lineage:: llama.cpp was created by Georgi Gerganov in March 2023, shortly after Meta's original LLaMA model weights leaked. It was initially a pure CPU C++ implementation; Metal GPU support was added within weeks by the community. It now supports Llama, Qwen, Mistral, Phi, Gemma, and nearly every major open model architecture. Ollama, LM Studio, Jan, and many other tools use llama.cpp as their inference engine.
Prerequisites
• macOS 12.0+ (macOS 14 Sonoma recommended for the latest Metal optimizations)
• Apple Silicon (M1 or later); Intel Macs work but without Metal acceleration
• Xcode Command Line Tools: run `xcode-select --install` if you haven't already
• Homebrew (for the Homebrew install path) OR CMake 3.14+ (for the from-source path)
• A GGUF model file; you will download one in the Install tab
Time & risk
Duration:: 15 minutes via Homebrew, 25–30 minutes if building from source
Risk level:: Low; standard CLI tools, no system modifications