⚡
Mac Playbook
⏱ 10 min

MLX LM for Inference

Apple's high-performance native LLM engine, the fastest on a Mac

Replaces DGX Spark: vLLM for Inference
mlx, inference

Basic idea

MLX is Apple's machine learning framework built from scratch for the unified memory architecture of Apple Silicon. Unlike PyTorch (which was designed for CUDA and then ported to Apple's MPS backend) or llama.cpp (written for CPU first and then accelerated with Metal), MLX was designed from the ground up assuming that CPU and GPU share a single memory pool, with no copying needed to move data between them.

The practical consequence: MLX never copies tensors between CPU and GPU memory. On a discrete-GPU system (NVIDIA), loading a model means copying weights from system RAM to VRAM; this takes seconds and consumes VRAM capacity separately from system RAM. On Apple Silicon with MLX, weights live in unified memory and are accessible to both CPU and GPU simultaneously. There is no copy step, and no VRAM cap separate from your total RAM.

MLX LM is the language model layer on top of MLX. It provides the model loading, tokenization, and sampling logic that turns the low-level MLX framework into something you can point at a Hugging Face model ID and get text out of.

Ollama also uses Apple Silicon acceleration (via llama.cpp's Metal backend), but MLX LM consistently benchmarks 10–30% faster on the same models because MLX's kernels are tuned specifically for the Apple Silicon memory hierarchy rather than being adapted from CUDA-first code.

What you'll accomplish

After following this playbook you will have:

• CLI inference at the highest available tokens/sec on your Mac (`mlx_lm.generate`)
• A running OpenAI-compatible API server you can hit with `curl` or any OpenAI SDK
• Python API access to run inference programmatically from your own code
• The ability to download and run any model from the `mlx-community` org on Hugging Face

What to know before starting

What the Hugging Face model hub is:: A public repository of pre-trained model weights, tokenizers, and configs. When you run `mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit`, MLX LM downloads the model from Hugging Face and caches it in `~/.cache/huggingface/hub/`. You need an internet connection for the first download; after that, inference is fully offline.
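The cache layout is predictable: each repo ID maps to a directory under the hub cache with slashes rewritten, following the standard `models--{org}--{name}` convention. A small helper to locate a cached model, assuming the default cache location (setting `HF_HOME` relocates it):

```python
# Compute where a Hugging Face repo lands in the local cache, using the
# standard "models--{org}--{name}" directory naming convention.
from pathlib import Path

def hf_cache_path(repo_id: str) -> Path:
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    return hub / ("models--" + repo_id.replace("/", "--"))

path = hf_cache_path("mlx-community/Qwen2.5-7B-Instruct-4bit")
print(path.name)    # models--mlx-community--Qwen2.5-7B-Instruct-4bit
print(path.exists())  # True only after the first download
```

Deleting that directory is how you reclaim disk space from models you no longer use.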
What the mlx-community org is:: A Hugging Face organization that maintains pre-converted MLX versions of popular models. Original models are released in PyTorch format (safetensors). mlx-community converts them to MLX-compatible quantized format and publishes them. You don't need to convert models yourself unless you want a specific quantization that doesn't exist yet.
What safetensors format is:: The file format MLX uses for model weights. It's a tensor serialization format that is faster to load than PyTorch's pickle-based `.bin` format and doesn't have security issues with untrusted weights. MLX model repos on Hugging Face contain `.safetensors` files alongside the tokenizer configs.
What tokenization is:: LLMs operate on "tokens," not characters. A tokenizer converts input text into a sequence of integer IDs (e.g., "Apple Silicon" → [15789, 22153]) and converts output IDs back to text. MLX LM loads the correct tokenizer for each model automatically. This matters because different models have different vocabularies and different chat formatting conventions.
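The encode/decode round trip is easy to see in miniature. This toy sketch uses a made-up four-entry vocabulary and invented IDs; a real model ships a tokenizer with tens of thousands of entries, which MLX LM loads for you:

```python
# Toy illustration of tokenize/detokenize. The vocabulary and IDs are
# invented for this example; real tokenizers are far larger and use
# subword schemes like BPE.
vocab = {"Apple": 0, " Silicon": 1, " is": 2, " fast": 3}
inv = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    ids, rest = [], text
    while rest:
        # Greedy longest match against the vocabulary (raises if no
        # token matches, which real tokenizers handle with byte fallback).
        tok = max((t for t in vocab if rest.startswith(t)), key=len)
        ids.append(vocab[tok])
        rest = rest[len(tok):]
    return ids

def decode(ids: list[int]) -> str:
    return "".join(inv[i] for i in ids)

ids = encode("Apple Silicon is fast")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # Apple Silicon is fast
```

The round trip is exact here, which is why the same tokenizer must be used for both input and output of a given model.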
Lazy vs. eager evaluation:: MLX uses lazy evaluation. When you write `y = mlx.core.matmul(a, b)`, nothing is computed yet; MLX records the operation in a computation graph. Computation only happens when the result is needed (e.g., when you call `.tolist()` or when `generate()` needs the next token probability). This allows MLX to optimize the full computation graph before executing, and it is why the first generation call is slightly slower than subsequent ones.
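The pattern is easy to see in miniature. This pure-Python sketch illustrates the concept only (it is not MLX's implementation): operations are recorded as graph nodes, and computation happens once, on demand:

```python
# Concept sketch of lazy evaluation: record an operation as a graph
# node instead of running it, then compute only when the value is
# demanded, caching the result for reuse.
class Lazy:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps
        self.result, self.ran = None, False

    def eval(self):
        if not self.ran:  # compute at most once
            args = [d.eval() for d in self.deps]
            self.result = self.fn(*args)
            self.ran = True
        return self.result

const = lambda v: Lazy(lambda: v)
a, b = const(3), const(4)
y = Lazy(lambda x, z: x * z, a, b)  # nothing computed yet
print(y.ran)     # False: the multiply was only recorded
print(y.eval())  # 12: computation happens when the result is needed
print(y.ran)     # True: later calls reuse the cached result
```

Deferring work this way is what lets a framework fuse and reorder operations across the whole graph before touching the GPU.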

Prerequisites

• macOS 14.0 (Sonoma) or later (macOS 15 Sequoia recommended; MLX gains performance improvements with each OS release)
• Apple Silicon (M1 or later); MLX does not run on Intel Macs
• Python 3.10, 3.11, or 3.12 (3.13 support depends on the MLX release)
• `pip` or a virtual environment manager (`uv`, `conda`, `venv`)

Time & risk

Duration:: 10 minutes (plus model download time; a 7B 4-bit model is ~4.3 GB)
Risk level:: Low; pip install into user Python, no system changes
Rollback:: `pip uninstall mlx-lm mlx`