🧪
Mac Playbook
⏱ 30 min

MLX LoRA Fine-tuning

Fine-tune LLMs with LoRA/QLoRA natively on Apple Silicon

Replaces DGX Spark: NeMo / Unsloth Fine-tune
mlx · fine-tuning

Basic idea

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small trainable matrices to a frozen pretrained model. Instead of updating all 7 billion parameters, which would require more GPU memory than a Mac has, LoRA inserts pairs of small matrices (A and B) at each attention layer. Only these adapter matrices are updated during training; the frozen model weights never change.

The math: a standard weight matrix W has shape [4096, 4096] = 16.7M parameters. LoRA replaces the weight update ΔW with the product of two smaller matrices: B [4096, r] and A [r, 4096], where r is the "rank" (typically 4–16). At rank 8, this is 4096×8 + 8×4096 = 65,536 parameters, just 0.4% of the original. During inference, the adapter is applied as: output = W·x + (B·A)·x·scaling_factor.
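The shapes and parameter count above can be checked with a minimal NumPy sketch (the zero-initialization of B and the `alpha = 16` scaling hyperparameter are common LoRA conventions assumed here, not values taken from MLX):

```python
import numpy as np

d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight: 16.7M params
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection [r, 4096]
B = np.zeros((d, r))                # trainable up-projection [4096, r], zero-init so ΔW starts at 0
alpha = 16                          # assumed hyperparameter
scaling = alpha / r                 # common LoRA scaling convention

x = rng.normal(size=(d,))
y = (W @ x) + scaling * (B @ (A @ x))  # adapted forward pass: W·x + (B·A)·x·scaling

print(A.size + B.size)              # 65536 trainable adapter parameters
```

Because B starts at zero, the adapted model is initially identical to the base model; training only ever moves the 65,536 adapter values.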

On Apple Silicon, MLX implements LoRA natively using Metal: the frozen model runs on the GPU, and gradients flow through only the adapter matrices, keeping memory usage manageable even on 16 GB machines.

What you'll accomplish

A LoRA adapter fine-tuned on your custom dataset (100–10,000 examples), tested for quality improvement over the base model, and merged into a standalone model ready for deployment with mlx_lm.generate or converted to GGUF for Ollama.

What to know before starting

Fine-tuning vs prompting: prompting changes how you ask; fine-tuning changes what the model knows. Use fine-tuning when you need consistent style, specialized vocabulary, a specific response format, or behavior that system prompts can't reliably produce.
LoRA rank: `r=4` adds ~0.1% of parameters, good for style adaptation. `r=16` adds ~0.5%, better for learning new facts or formats. Higher rank = more capacity but also more memory and training time. Start with `r=8` for general use.
Chat template: instruct models expect inputs wrapped in special tokens that signal role boundaries. Qwen2.5 uses the ChatML format: `<|im_start|>user\nYour message<|im_end|>\n<|im_start|>assistant\n`. If your training data doesn't match the model's expected template, training will proceed but results will be poor.
Learning rate: how much the weights change per gradient step. Too high (> 5e-4) causes loss spikes; too low (< 1e-6) means no learning. The safe range for LoRA is 1e-5 to 2e-4. The default of `1e-5` is conservative and safe.
Iterations vs epochs: MLX counts training steps (gradient updates), not passes through the data. With `batch_size=4` and 400 training examples, one epoch = 100 iterations, so 1000 iterations = 10 epochs.
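The chat-template point can be made concrete with a small sketch that wraps one (prompt, response) pair in Qwen2.5's ChatML format and emits a JSONL line. The `"text"` field name and the example strings are illustrative assumptions; check which JSONL layout your mlx-lm version expects:

```python
import json

def to_chatml(user_msg, assistant_msg):
    # Qwen2.5 ChatML: role boundaries marked by <|im_start|>/<|im_end|>
    return (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant_msg}<|im_end|>"
    )

record = {"text": to_chatml("What does LoRA train?",
                            "Only the small adapter matrices; the base weights stay frozen.")}
print(json.dumps(record))  # one JSONL line per training example
```

Writing one such JSON object per line produces the JSONL file the trainer consumes; the key point is that every example must carry the same role tokens the model saw during its own instruction tuning.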

Prerequisites

• macOS 14.0+ (Sonoma or later)
• Apple Silicon Mac (M1 or later)
• Python 3.10, 3.11, or 3.12
• `mlx-lm` installed: `pip install mlx-lm`
• 16 GB+ unified memory (32 GB recommended for 7B full-quality fine-tuning)
• Custom dataset in JSONL format (see the Prepare Data tab for the required structure)

Time & risk

Duration: 30 minutes setup + training time (1000 iters ≈ 20 min for 7B on an M3 Max, ≈ 45 min on an M1 Pro)
Risk level: Low. Adapter files are stored separately; the base model is never modified.
Rollback: Delete the `./adapters/` directory; the base model in `~/.cache/huggingface/` is unchanged.