⚡
Mac Playbook
โฑ 10 min

MLX Quantization

Quantize any HF model to 2/4/8-bit for Apple Silicon

Replaces DGX Spark: NVFP4 Quantization

Basic idea

Quantization reduces model weights from 32-bit or 16-bit floating-point numbers to lower-precision integers (2–8 bit). This dramatically shrinks memory requirements: a 32B model at FP16 needs 64 GB of RAM — at 4-bit it needs only ~20 GB.
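The arithmetic behind those numbers is simple byte-counting. A back-of-envelope sketch (real checkpoints add a little overhead for per-group scales/biases and layers that stay unquantized, such as embeddings):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in decimal GB: parameters * bits / 8 bits-per-byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(32, 16))  # 32B at FP16  -> 64.0 GB
print(model_memory_gb(32, 4))   # 32B at 4-bit -> 16.0 GB (plus group metadata)
print(model_memory_gb(7, 16))   # 7B at FP16   -> 14.0 GB
```

The ~20 GB figure quoted above for a 32B model at 4-bit is this 16 GB of packed weights plus scale/bias metadata and non-quantized layers.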

MLX uses **grouped quantization**: weights are divided into groups of 32 or 64 values, and each group gets its own scale factor and bias, computed from that group's value range. This per-group scaling preserves more accuracy than naive per-tensor quantization, because local variation in the weight matrix is captured group by group. The quality loss at 4-bit is surprisingly small — perplexity increases by only 1–3%, which is imperceptible in conversational use.

During inference, the quantized weights are dequantized on the fly within the Metal shader — Apple Silicon's unified memory bandwidth handles this efficiently because the memory transfer is 4× smaller, and the dequantization math is simple multiply-add operations that run nearly for free on the GPU cores.

What you'll accomplish

A locally quantized version of any Hugging Face model, ready to run with `mlx_lm.generate`. You'll also understand when to quantize yourself versus using pre-quantized models from the mlx-community organization, and how to evaluate quality degradation.
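In practice the whole workflow is two commands. A sketch using the `mlx_lm.convert` CLI — the model name and output path here are examples, and you should check `mlx_lm.convert --help` for the exact flags your installed version supports:

```shell
# Download a model from Hugging Face, quantize to 4-bit with group size 64,
# and write the result to a local directory (the original is not modified)
mlx_lm.convert --hf-path Qwen/Qwen2.5-7B-Instruct \
    -q --q-bits 4 --q-group-size 64 \
    --mlx-path ./qwen2.5-7b-4bit

# Run the quantized model locally
mlx_lm.generate --model ./qwen2.5-7b-4bit \
    --prompt "Explain grouped quantization in one sentence." \
    --max-tokens 100
```

Before converting yourself, it's worth searching the mlx-community organization on Hugging Face — many popular models are already published there in 4-bit and 8-bit MLX format.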

What to know before starting

FP16/BF16 — 16-bit floating-point formats use 2 bytes per parameter. A 7B model at FP16 = 7,000,000,000 × 2 bytes = 14 GB. BF16 (Brain Float) is similar but with a wider exponent range; both are common for inference.
INT4 / Q4 — 4-bit integer format uses 0.5 bytes per parameter. A 7B model at Q4 ≈ 3.5–4.5 GB (plus overhead for scales/biases). MLX's 4-bit format stores 2 weights per byte with per-group metadata.
Grouped quantization mechanics — given a group of 64 weight values, find the min and max, then map each value to the nearest of 16 levels (for 4-bit). Store the scale (range/15, since 16 levels span 15 steps) and zero-point alongside. At runtime, multiply the stored integer by the scale and add the zero-point to recover approximately the original float.
Perplexity — a measure of how surprised a language model is by a held-out text corpus. Lower perplexity = better language model. The FP16 baseline for Qwen2.5-7B is ~8.2; at 4-bit it rises to ~8.4 (+2.4%); at 2-bit it might reach ~10.5 (+28%). The 4-bit increase is imperceptible in practice.
Why conversion needs 2× RAM — the conversion process loads the full FP16 model (~14 GB for 7B), quantizes it layer by layer, then writes the quantized result (~4 GB). Both the full model and the output must fit in memory simultaneously. After conversion, the original is no longer needed.
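The mechanics described above can be sketched in a few lines of plain Python — an illustrative affine round-trip for one group, not MLX's actual packed storage or Metal kernel:

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to integers in [0, 2**bits - 1]."""
    levels = 2**bits - 1                   # 16 levels -> 15 steps between them
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # guard against an all-equal group
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                    # integers + per-group scale and zero-point

def dequantize_group(q, scale, zero_point):
    """Recover approximate floats: a simple multiply-add, as in the Metal shader."""
    return [qi * scale + zero_point for qi in q]

group = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.29]
q, scale, zp = quantize_group(group, bits=4)
restored = dequantize_group(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(max_err <= scale / 2)  # rounding error is at most half a step -> True
```

At 4-bit each stored integer fits in half a byte, which is why MLX can pack 2 weights per byte; the scale and zero-point are the per-group metadata that accounts for the small size overhead beyond 0.5 bytes per parameter.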

Prerequisites

• macOS 14.0+ (Sonoma or later)
• Apple Silicon Mac
• Python 3.10, 3.11, or 3.12
• `mlx-lm` installed: `pip install mlx-lm`
• Hugging Face account (free); `huggingface-cli login` for gated models
• RAM: at least 2× the FP16 model size (7B needs 28+ GB free during conversion; 32B needs 128+ GB)

Time & risk

Duration: 10 minutes setup + conversion time (~5 min for 7B, ~20 min for 32B, on an M2 Max)
Risk level: None — conversion creates new files and never modifies the original model
Rollback: Delete the output directory; the source model on Hugging Face is unchanged