โšก
Mac Playbook
โฑ 15 min

Speculative Decoding

Use draft models to accelerate generation 1.5โ€“2.5ร—

Replaces the DGX Spark playbook: Speculative Decoding
inference optimization

Basic idea

Speculative decoding is a technique where a small "draft" model proposes several tokens at once, and the large "target" model then verifies all of them in a single forward pass. Tokens that pass verification are accepted essentially for free; at the first rejection, the target model's own prediction replaces the rejected token and the process repeats.

The key insight: verifying N tokens takes about the same time as generating 1 token, because a transformer's forward pass processes all positions in parallel and decoding is bottlenecked by memory bandwidth rather than raw arithmetic. If the draft model proposes 8 tokens and 6 are accepted, you get 6 tokens for the price of a single verification pass. With a 60–80% acceptance rate, real-world speedups of 1.5–2.5× are typical — with **identical output quality** to running the target model alone.
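To see where those numbers come from, here is a simplified back-of-the-envelope model (an illustration, not a measurement from this playbook): assume the draft proposes K tokens per round, each draft token is accepted independently with probability α, and a draft forward pass costs a fraction c of a target forward pass. Then, roughly:

```latex
\text{tokens per round} \approx \frac{1 - \alpha^{K+1}}{1 - \alpha},
\qquad
\text{speedup} \approx \frac{\text{tokens per round}}{1 + Kc}
```

With K = 8, α = 0.7, and c ≈ 0.05 (a 1.5B draft paired with a 32B target), that works out to about 3.2 tokens per verification round and a speedup of roughly 2.3×, squarely inside the 1.5–2.5× range quoted above. Real acceptance is correlated across positions rather than independent, so treat this as an estimate, not a guarantee.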

What you'll accomplish

A quantified speedup benchmark comparing standard inference against speculative decoding on the same prompt, using llama.cpp's `llama-speculative` binary with a Qwen2.5-32B target model and a Qwen2.5-1.5B draft model. You'll measure tokens/sec in both modes and calculate the actual multiplier on your hardware.
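As a preview, the two runs you'll compare look roughly like the sketch below. The file names are placeholders, and the flag that controls how many tokens are drafted per round has changed name across llama.cpp releases, so treat this as the shape of the commands and check `llama-speculative --help` (and the Setup tab) for the exact invocation.

```bash
# Baseline: target model alone; note the tokens/sec in the timing summary at the end
llama-cli -m models/qwen2.5-32b-instruct-q4_k_m.gguf -ngl 99 \
  -p "Explain how a hash table handles collisions." -n 256

# Speculative: same target plus the 1.5B draft
#   -md   = draft model
#   -ngld = GPU layers for the draft model (offload it fully too)
llama-speculative -m models/qwen2.5-32b-instruct-q4_k_m.gguf \
  -md models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -ngl 99 -ngld 99 \
  -p "Explain how a hash table handles collisions." -n 256
```

Dividing the speculative run's tokens/sec by the baseline's gives your multiplier.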

What to know before starting

Transformer forward passes: each pass produces one token's probability distribution over the vocabulary. The expensive part is the matrix multiplications across all layers, and those process every position in the sequence in parallel, which is why verifying several tokens costs about as much as generating one.
Draft-target alignment: the draft model must come from the same model family as the target (same tokenizer, same pre-training distribution) for acceptance rates to be high. A Qwen2.5-1.5B draft with a Qwen2.5-32B target achieves 65–80% acceptance; a mismatched draft might get 20–30%.
Why acceptance rate matters: at 100% acceptance with 8 draft tokens you'd get roughly an 8× speedup. At 0% acceptance, every draft guess is wrong and you actually run slower than the baseline because of the wasted draft computation. The break-even acceptance rate is roughly 30%.
GGUF format: the binary format llama.cpp uses for quantized models. Each file bundles the model weights, tokenizer, and metadata, so GGUF files are self-contained and portable.
GPU offloading with -ngl: "number of GPU layers" controls how many transformer layers run on the Metal GPU versus staying on the CPU. `-ngl 99` offloads all layers, maximizing throughput.

Prerequisites

โ€ข llama.cpp installed with `llama-speculative` binary in your PATH (see the llama.cpp playbook)
โ€ข Two GGUF model files โ€” same model family, different sizes (see Setup tab for exact downloads)
โ€ข 32 GB+ unified memory (the 32B Q4_K_M target needs ~20 GB, the 1.5B draft needs ~1.7 GB; both must fit simultaneously)
โ€ข `huggingface-cli` for downloading models: `pip install huggingface_hub`

Time & risk

Duration: 15 minutes setup (plus model download time — ~20 GB total on first run)
Risk level: Low — read-only model inference, no system modifications
Rollback: Nothing to roll back; simply stop the process