🧪
Mac Playbook
⏱ 30 min

MLX LoRA Fine-tuning

Fine-tune LLMs with LoRA/QLoRA natively on Apple Silicon

Replaces DGX Spark: NeMo / Unsloth Fine-tune
mlx · fine-tuning

Basic idea

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small trainable matrices to a frozen pretrained model. Instead of updating all 7 billion parameters, which would require more GPU memory than a Mac has, LoRA inserts pairs of small matrices (A and B) at each attention layer. Only these adapter matrices are updated during training; the frozen model weights never change.

The math: a standard weight matrix W has shape [4096, 4096] = 16.7M parameters. LoRA replaces the weight update ΔW with the product of two smaller matrices: B [4096, r] and A [r, 4096], where r is the "rank" (typically 4–16). At rank 8, this is 4096×8 + 8×4096 = 65,536 parameters, just 0.4% of the original. During inference, the adapter is applied as: output = W·x + (B·A)·x·scaling_factor.
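The shapes and parameter count above can be checked with a minimal NumPy sketch (the zero-initialization of B and the `alpha = 16` scaling hyperparameter are common LoRA conventions assumed here, not values taken from MLX):

```python
import numpy as np

d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight: 16.7M params
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection [r, 4096]
B = np.zeros((d, r))                # trainable up-projection [4096, r], zero-init so ΔW starts at 0
alpha = 16                          # assumed hyperparameter
scaling = alpha / r                 # common LoRA scaling convention

x = rng.normal(size=(d,))
y = (W @ x) + scaling * (B @ (A @ x))  # adapted forward pass: W·x + (B·A)·x·scaling

print(A.size + B.size)              # 65536 trainable adapter parameters
```

Because B starts at zero, the adapted model is initially identical to the base model; training only ever moves the 65,536 adapter values.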

On Apple Silicon, MLX implements LoRA natively using Metal: the frozen model runs on the GPU, and gradients flow through only the adapter matrices, keeping memory usage manageable even on 16 GB machines.

What you'll accomplish

A LoRA adapter fine-tuned on your custom dataset (100–10,000 examples), tested for quality improvement over the base model, and merged into a standalone model ready for deployment with mlx_lm.generate or converted to GGUF for Ollama.

What to know before starting

Fine-tuning vs prompting: prompting changes how you ask; fine-tuning changes what the model knows. Use fine-tuning when you need consistent style, specialized vocabulary, a specific response format, or behavior that system prompts can't reliably produce.
LoRA rank: `r=4` adds ~0.1% of parameters, good for style adaptation. `r=16` adds ~0.5%, better for learning new facts or formats. Higher rank = more capacity but also more memory and training time. Start with `r=8` for general use.
Chat template: instruct models expect inputs wrapped in special tokens that signal role boundaries. Qwen2.5 uses the ChatML format: `<|im_start|>user\nYour message<|im_end|>\n<|im_start|>assistant\n`. If your training data doesn't match the model's expected template, training will proceed but results will be poor.
Learning rate: how much the weights change per gradient step. Too high (> 5e-4) causes loss spikes; too low (< 1e-6) means no learning. The safe range for LoRA is 1e-5 to 2e-4. The default of `1e-5` is conservative and safe.
Iterations vs epochs: MLX counts training steps (gradient updates), not passes through the data. With `batch_size=4` and 400 training examples, one epoch = 100 iterations, so 1000 iterations = 10 epochs.
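The chat-template point can be made concrete with a small sketch that wraps one (prompt, response) pair in Qwen2.5's ChatML format and emits a JSONL line. The `"text"` field name and the example strings are illustrative assumptions; check which JSONL layout your mlx-lm version expects:

```python
import json

def to_chatml(user_msg, assistant_msg):
    # Qwen2.5 ChatML: role boundaries marked by <|im_start|>/<|im_end|>
    return (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant_msg}<|im_end|>"
    )

record = {"text": to_chatml("What does LoRA train?",
                            "Only the small adapter matrices; the base weights stay frozen.")}
print(json.dumps(record))  # one JSONL line per training example
```

Writing one such JSON object per line produces the JSONL file the trainer consumes; the key point is that every example must carry the same role tokens the model saw during its own instruction tuning.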

Prerequisites

• macOS 14.0+ (Sonoma or later)
• Apple Silicon Mac (M1 or later)
• Python 3.10, 3.11, or 3.12
• `mlx-lm` installed: `pip install mlx-lm`
• 16 GB+ unified memory (32 GB recommended for 7B full-quality fine-tuning)
• Custom dataset in JSONL format (see the Prepare Data tab for the required structure)

Time & risk

Duration: 30 minutes setup + training time (1000 iters ≈ 20 min for 7B on an M3 Max, ≈ 45 min on an M1 Pro)
Risk level: Low. Adapter files are stored separately; the base model is never modified.
Rollback: Delete the `./adapters/` directory; the base model in `~/.cache/huggingface/` is unchanged.