⚡
Mac Playbook
โฑ 10 min

MLX Quantization

Quantize any HF model to 2/4/8-bit for Apple Silicon

Replaces DGX Spark: NVFP4 Quantization

Basic idea

Quantization reduces model weights from 32-bit or 16-bit floating-point numbers to lower-precision integers (2–8 bit). This dramatically shrinks memory requirements: a 32B model at FP16 needs 64 GB of RAM — at 4-bit it needs only ~20 GB.
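The arithmetic behind those numbers is simple byte-counting. A back-of-envelope sketch (real checkpoints add a little overhead for per-group scales/biases and layers that stay unquantized, such as embeddings):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in decimal GB: parameters * bits / 8 bits-per-byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(32, 16))  # 32B at FP16  -> 64.0 GB
print(model_memory_gb(32, 4))   # 32B at 4-bit -> 16.0 GB (plus group metadata)
print(model_memory_gb(7, 16))   # 7B at FP16   -> 14.0 GB
```

The ~20 GB figure quoted above for a 32B model at 4-bit is this 16 GB of packed weights plus scale/bias metadata and non-quantized layers.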

MLX uses **grouped quantization**: weights are divided into groups of 32 or 64 values, and each group gets its own scale factor and bias, computed from that group's value range. This per-group scaling preserves more accuracy than naive per-tensor quantization, because local variation in the weight matrix is captured group by group. The quality loss at 4-bit is surprisingly small — perplexity increases by only 1–3%, which is imperceptible in conversational use.

During inference, the quantized weights are dequantized on the fly within the Metal shader — Apple Silicon's unified memory bandwidth handles this efficiently because the memory transfer is 4× smaller, and the dequantization math is simple multiply-add operations that run nearly for free on the GPU cores.

What you'll accomplish

A locally quantized version of any Hugging Face model, ready to run with `mlx_lm.generate`. You'll also understand when to quantize yourself versus using pre-quantized models from the mlx-community organization, and how to evaluate quality degradation.
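In practice the whole workflow is two commands. A sketch using the `mlx_lm.convert` CLI — the model name and output path here are examples, and you should check `mlx_lm.convert --help` for the exact flags your installed version supports:

```shell
# Download a model from Hugging Face, quantize to 4-bit with group size 64,
# and write the result to a local directory (the original is not modified)
mlx_lm.convert --hf-path Qwen/Qwen2.5-7B-Instruct \
    -q --q-bits 4 --q-group-size 64 \
    --mlx-path ./qwen2.5-7b-4bit

# Run the quantized model locally
mlx_lm.generate --model ./qwen2.5-7b-4bit \
    --prompt "Explain grouped quantization in one sentence." \
    --max-tokens 100
```

Before converting yourself, it's worth searching the mlx-community organization on Hugging Face — many popular models are already published there in 4-bit and 8-bit MLX format.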

What to know before starting

FP16/BF16 — 16-bit floating-point formats use 2 bytes per parameter. A 7B model at FP16 = 7,000,000,000 × 2 bytes = 14 GB. BF16 (Brain Float) is similar but with a wider exponent range; both are common for inference.
INT4 / Q4 — 4-bit integer format uses 0.5 bytes per parameter. A 7B model at Q4 ≈ 3.5–4.5 GB (plus overhead for scales/biases). MLX's 4-bit format stores 2 weights per byte with per-group metadata.
Grouped quantization mechanics — given a group of 64 weight values, find the min and max, then map each value to the nearest of 16 levels (for 4-bit). Store the scale (range/15, since 16 levels span 15 steps) and zero-point alongside. At runtime, multiply the stored integer by the scale and add the zero-point to recover approximately the original float.
Perplexity — a measure of how surprised a language model is by a held-out text corpus. Lower perplexity = better language model. The FP16 baseline for Qwen2.5-7B is ~8.2; at 4-bit it rises to ~8.4 (+2.4%); at 2-bit it might reach ~10.5 (+28%). The 4-bit increase is imperceptible in practice.
Why conversion needs 2× RAM — the conversion process loads the full FP16 model (~14 GB for 7B), quantizes it layer by layer, then writes the quantized result (~4 GB). Both the full model and the output must fit in memory simultaneously. After conversion, the original is no longer needed.
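The mechanics described above can be sketched in a few lines of plain Python — an illustrative affine round-trip for one group, not MLX's actual packed storage or Metal kernel:

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to integers in [0, 2**bits - 1]."""
    levels = 2**bits - 1                   # 16 levels -> 15 steps between them
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # guard against an all-equal group
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                    # integers + per-group scale and zero-point

def dequantize_group(q, scale, zero_point):
    """Recover approximate floats: a simple multiply-add, as in the Metal shader."""
    return [qi * scale + zero_point for qi in q]

group = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.29]
q, scale, zp = quantize_group(group, bits=4)
restored = dequantize_group(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(max_err <= scale / 2)  # rounding error is at most half a step -> True
```

At 4-bit each stored integer fits in half a byte, which is why MLX can pack 2 weights per byte; the scale and zero-point are the per-group metadata that accounts for the small size overhead beyond 0.5 bytes per parameter.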

Prerequisites

• macOS 14.0+ (Sonoma or later)
• Apple Silicon Mac
• Python 3.10, 3.11, or 3.12
• `mlx-lm` installed: `pip install mlx-lm`
• Hugging Face account (free); `huggingface-cli login` for gated models
• RAM: at least 2× the FP16 model size (7B needs 28+ GB free during conversion; 32B needs 128+ GB)

Time & risk

Duration: 10 minutes setup + conversion time (~5 min for 7B, ~20 min for 32B, on an M2 Max)
Risk level: None — conversion creates new files and never modifies the original model
Rollback: Delete the output directory; the source model on Hugging Face is unchanged