Fine-tune LLMs with LoRA/QLoRA natively on Apple Silicon
Replaces DGX Spark: NeMo / Unsloth Fine-tune
mlx, fine-tuning
Basic idea
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small trainable matrices to a frozen pretrained model. Instead of updating all 7 billion parameters, which would require more GPU memory than a Mac has, LoRA inserts pairs of small matrices (A and B) at each attention layer. Only these adapter matrices are updated during training. The frozen model weights never change.
The math: a standard weight matrix W has shape [4096, 4096] = 16.7M parameters. LoRA expresses the weight update ΔW as the product of two smaller matrices: B [4096, r] and A [r, 4096], where r is the "rank" (typically 4-16). At rank 8, this is 4096×8 + 8×4096 = 65,536 parameters, just 0.4% of the original matrix. During inference, the adapter is applied as: output = W·x + (B·A)·x·scaling_factor.
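The decomposition above can be sketched in a few lines of NumPy. The dimensions and the alpha/r scaling convention are illustrative, not MLX's internals:

```python
import numpy as np

d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)         # frozen pretrained weight
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # trainable down-projection
B = np.zeros((d, r), dtype=np.float32)  # trainable up-projection, zero-init so ΔW starts at 0
scaling = 2.0                           # alpha / r; alpha = 16 is a common choice

x = rng.standard_normal(d).astype(np.float32)
base = W @ x
adapted = base + scaling * (B @ (A @ x))  # never materializes the full d×d update

full_params = W.size           # 16,777,216
lora_params = A.size + B.size  # 65,536
print(f"trainable fraction: {lora_params / full_params:.4f}")  # 0.0039
```

Note the order of operations: computing `B @ (A @ x)` costs two thin matrix-vector products instead of building the [4096, 4096] product B·A.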
On Apple Silicon, MLX implements LoRA natively using Metal: the frozen model runs on the GPU, and gradients flow through only the adapter matrices, keeping memory usage manageable even on 16 GB machines.
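As a concrete starting point, the mlx-lm package ships a LoRA training entry point. A minimal invocation looks roughly like this; the model name, data path, and hyperparameters are placeholders, so check `python -m mlx_lm.lora --help` for the flags your installed version supports:

```shell
# Sketch: train a LoRA adapter with mlx-lm (pip install mlx-lm).
# ./data should contain train.jsonl (and optionally valid.jsonl).
python -m mlx_lm.lora \
  --model Qwen/Qwen2.5-7B-Instruct \
  --train \
  --data ./data \
  --batch-size 4 \
  --iters 1000 \
  --learning-rate 1e-5
```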
What you'll accomplish
A LoRA adapter fine-tuned on your custom dataset (100β10,000 examples), tested for quality improvement over the base model, and merged into a standalone model ready for deployment with mlx_lm.generate or converted to GGUF for Ollama.
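The merge step can be sketched with mlx-lm's fuse entry point; the paths and model name below are examples, and the exact flags should be verified against your installed version:

```shell
# Sketch: fuse the LoRA adapter into the base weights, then generate
# from the resulting standalone model.
python -m mlx_lm.fuse \
  --model Qwen/Qwen2.5-7B-Instruct \
  --adapter-path ./adapters \
  --save-path ./fused_model

python -m mlx_lm.generate --model ./fused_model --prompt "Test prompt"
```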
What to know before starting
Fine-tuning vs prompting: prompting changes how you ask; fine-tuning changes what the model knows. Use fine-tuning when you need consistent style, specialized vocabulary, a specific response format, or behavior that system prompts can't reliably produce.
LoRA rank: `r=4` adds ~0.1% of parameters, good for style adaptation. `r=16` adds ~0.5%, better for learning new facts or formats. Higher rank = more capacity but also more memory and training time. Start with `r=8` for general use.
Chat template: instruct models expect inputs wrapped in special tokens that signal role boundaries. Qwen2.5 uses ChatML format: `<|im_start|>user\nYour message<|im_end|>\n<|im_start|>assistant\n`. If your training data doesn't match the model's expected template, training will proceed but results will be poor.
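A hand-rolled helper makes the template concrete. In practice mlx-lm's chat-style training data lets the tokenizer apply the template for you; this sketch (with a hypothetical `chatml` function) only illustrates what the wrapped text looks like:

```python
# Illustrative only: wrap one training pair in Qwen2.5's ChatML template by hand.
def chatml(user: str, assistant: str,
           system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

sample = chatml("What is LoRA?", "A parameter-efficient fine-tuning method.")
print(sample)
```

Every role turn is delimited by `<|im_start|>` / `<|im_end|>`; mismatched or missing delimiters are exactly the silent failure mode the text warns about.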
Learning rate: how much the weights change per gradient step. Too high (> 5e-4) causes loss spikes; too low (< 1e-6) means no learning. The safe range for LoRA is 1e-5 to 2e-4. Default `1e-5` is conservative and safe.
Iterations vs epochs: MLX counts training steps (gradient updates), not passes through the data. With batch_size=4 and 400 training examples, one epoch = 100 iterations. 1000 iterations = 10 epochs.
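The arithmetic can be wrapped in a tiny helper for choosing a value to pass to `--iters` (`iters_for_epochs` is a hypothetical name, not an MLX API):

```python
import math

def iters_for_epochs(num_examples: int, batch_size: int, epochs: int) -> int:
    """Translate a target epoch count into MLX-style iteration (step) counts."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

print(iters_for_epochs(400, 4, 1))   # 100, matching the worked example above
print(iters_for_epochs(400, 4, 10))  # 1000
```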