Use draft models to accelerate generation 1.5–2.5×
Replaces DGX Spark: Speculative Decoding
inference · optimization
Basic idea
Speculative decoding is a technique where a small "draft" model proposes multiple tokens in one shot, then the large "target" model verifies all of them in a single forward pass. Tokens that pass verification are accepted for free; the first rejected token triggers a correction and the process repeats.
The key insight: verifying N tokens costs the same compute as generating 1 token, because a transformer's forward pass processes all positions in parallel. If the draft model proposes 8 tokens and 6 are accepted, you generated 6 tokens for the price of 1 verification. With a 60–80% acceptance rate, real-world speedups of 1.5–2.5× are typical, with **identical output quality** to running the target model alone.
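The propose/verify loop can be sketched with toy stand-ins for the two models. Here `target_next` and `draft_next` are hypothetical functions invented for illustration, not llama.cpp APIs, and the sketch uses greedy picks where real implementations compare probability distributions:

```python
# Toy greedy speculative decoding over integer "tokens".
# target_next / draft_next are hypothetical stand-ins for the two models.

def target_next(seq):
    # Expensive "large model": deterministic next token from the last two.
    return (seq[-1] * 3 + seq[-2]) % 97

def draft_next(seq):
    # Cheap "small model": agrees with the target most of the time.
    guess = (seq[-1] * 3 + seq[-2]) % 97
    return guess if seq[-1] % 7 else (guess + 1) % 97  # occasional mistake

def speculative_generate(prompt, n_tokens, k=8):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(seq + drafted))
        # 2. Target checks every drafted position; in a real model this is
        #    one parallel forward pass over all k positions.
        accepted = []
        for i in range(k):
            t = target_next(seq + drafted[:i])
            if t == drafted[i]:
                accepted.append(t)   # draft token verified, kept for free
            else:
                accepted.append(t)   # first mismatch: keep the correction
                break
        seq.extend(accepted)
    return seq[: len(prompt) + n_tokens]

prompt = [3, 5]
spec = speculative_generate(prompt, 20)

# Baseline: run the target alone, one token per pass.
base = list(prompt)
for _ in range(20):
    base.append(target_next(base))

assert spec == base  # greedy speculative decoding matches the target exactly
```

The final assertion is the whole point: because every accepted token is exactly what the target would have produced, the output is identical to target-only generation, just cheaper to obtain.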
What you'll accomplish
A quantified speedup benchmark comparing standard inference against speculative decoding on the same prompt, using llama.cpp's `llama-speculative` binary with a Qwen2.5-32B target model and a Qwen2.5-1.5B draft model. You'll measure tokens/sec in both modes and calculate the actual multiplier on your hardware.
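Once both runs finish, the multiplier is simple arithmetic on the reported throughput. The numbers below are hypothetical placeholders for whatever your hardware reports:

```python
def tokens_per_sec(n_tokens: int, seconds: float) -> float:
    """Throughput from a token count and wall-clock time."""
    return n_tokens / seconds

def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """The multiplier to report: speculative throughput over baseline."""
    return speculative_tps / baseline_tps

# Hypothetical measurements: 512 tokens in 52.2 s (baseline) vs 24.0 s.
baseline = tokens_per_sec(512, 52.2)
speculative = tokens_per_sec(512, 24.0)
print(f"{speedup(baseline, speculative):.2f}x")
```

Benchmark both modes on the same prompt and the same generation length, otherwise prompt-processing time skews the comparison.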
What to know before starting
Transformer forward passes: each pass produces one token's probability distribution over the vocabulary. The expensive part is the matrix multiplications across all layers, which process all positions in parallel, so a pass costs about the same whether it verifies one new token or several.
Draft-target alignment: the draft model must be from the same model family as the target (same tokenizer, same pre-training distribution) for acceptance rates to be high. A Qwen2.5-1.5B draft with a Qwen2.5-32B target achieves 65–80% acceptance; a mismatched draft might get 20–30%.
Why acceptance rate matters: at 100% acceptance with 8 draft tokens you'd get up to an 8× speedup. At 0% acceptance every draft guess is wrong, and you actually run slower than baseline because all the drafting work is wasted. The break-even acceptance rate is roughly 30%.
GGUF format: the binary format llama.cpp uses for quantized models. Each file contains model weights, tokenizer, and metadata, so GGUF files are self-contained and portable.
GPU offloading with -ngl: "number of GPU layers" controls how many transformer layers are computed on the Metal GPU versus CPU RAM. -ngl 99 offloads all layers, maximizing throughput.
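The acceptance-rate numbers above can be turned into a back-of-the-envelope model. Assuming (these are simplifying assumptions, not measurements) that each draft token is accepted independently with probability `a` and that one draft forward pass costs a fraction `c` of one target pass, the expected tokens gained per verification is the geometric sum 1 + a + a² + … + aᵏ:

```python
# Back-of-the-envelope speedup model for speculative decoding.
# Assumptions: independent per-token acceptance rate `a`, and a draft
# pass costing fraction `c` of a target pass. Real overheads differ.

def expected_accepted(a, k):
    # Expected tokens per target verification with k drafted tokens:
    # the accepted prefix plus the correction/bonus token, which sums
    # to 1 + a + a**2 + ... + a**k = (1 - a**(k+1)) / (1 - a) for a < 1.
    return sum(a**i for i in range(k + 1))

def speedup(a, k=8, c=0.05):
    # Each cycle costs k draft passes plus 1 target verification pass.
    return expected_accepted(a, k) / (1 + c * k)

for a in (0.2, 0.3, 0.65, 0.8):
    print(f"acceptance {a:.0%}: ~{speedup(a):.1f}x")
```

With these assumptions the model reproduces the figures in the text: roughly 1× at 30% acceptance (the break-even point), about 2× at 65%, and below 1× at 20%, where drafting costs more than it saves.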
Prerequisites
• llama.cpp installed with `llama-speculative` binary in your PATH (see the llama.cpp playbook)
• Two GGUF model files from the same model family at different sizes (see Setup tab for exact downloads)
• 32 GB+ unified memory (the 32B Q4_K_M target needs ~20 GB and the 1.5B draft ~1.7 GB; both must fit simultaneously)
• `huggingface-cli` for downloading models: `pip install huggingface_hub`
Time & risk
Duration:: 15 minutes setup (plus model download time; ~20 GB total on first run)
Risk level:: Low: read-only model inference, no system modifications
Rollback:: Nothing to roll back; simply stop the process