Use draft models to accelerate generation 1.5–2.5×
Replaces DGX Spark: Speculative Decoding
inference · optimization
Basic idea
Speculative decoding is a technique where a small "draft" model proposes multiple tokens in one shot, then the large "target" model verifies all of them in a single forward pass. Tokens that pass verification are accepted for free; the first rejected token triggers a correction and the process repeats.
The key insight: verifying N tokens costs the same compute as generating 1 token, because a transformer's forward pass processes all positions in parallel. If the draft model proposes 8 tokens and 6 are accepted, you generated 6 tokens for the price of 1 verification. With a 60–80% acceptance rate, real-world speedups of 1.5–2.5× are typical, with **identical output quality** to running the target model alone.
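The propose/verify loop can be sketched with toy stand-ins for the two models. Here `target_next` and `draft_next` are hypothetical functions invented for illustration, not llama.cpp APIs, and the sketch uses greedy picks where real implementations compare probability distributions:

```python
# Toy greedy speculative decoding over integer "tokens".
# target_next / draft_next are hypothetical stand-ins for the two models.

def target_next(seq):
    # Expensive "large model": deterministic next token from the last two.
    return (seq[-1] * 3 + seq[-2]) % 97

def draft_next(seq):
    # Cheap "small model": agrees with the target most of the time.
    guess = (seq[-1] * 3 + seq[-2]) % 97
    return guess if seq[-1] % 7 else (guess + 1) % 97  # occasional mistake

def speculative_generate(prompt, n_tokens, k=8):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(seq + drafted))
        # 2. Target checks every drafted position; in a real model this is
        #    one parallel forward pass over all k positions.
        accepted = []
        for i in range(k):
            t = target_next(seq + drafted[:i])
            if t == drafted[i]:
                accepted.append(t)   # draft token verified, kept for free
            else:
                accepted.append(t)   # first mismatch: keep the correction
                break
        seq.extend(accepted)
    return seq[: len(prompt) + n_tokens]

prompt = [3, 5]
spec = speculative_generate(prompt, 20)

# Baseline: run the target alone, one token per pass.
base = list(prompt)
for _ in range(20):
    base.append(target_next(base))

assert spec == base  # greedy speculative decoding matches the target exactly
```

The final assertion is the whole point: because every accepted token is exactly what the target would have produced, the output is identical to target-only generation, just cheaper to obtain.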
What you'll accomplish
A quantified speedup benchmark comparing standard inference against speculative decoding on the same prompt, using llama.cpp's `llama-speculative` binary with a Qwen2.5-32B target model and a Qwen2.5-1.5B draft model. You'll measure tokens/sec in both modes and calculate the actual multiplier on your hardware.
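Once both runs finish, the multiplier is simple arithmetic on the reported throughput. The numbers below are hypothetical placeholders for whatever your hardware reports:

```python
def tokens_per_sec(n_tokens: int, seconds: float) -> float:
    """Throughput from a token count and wall-clock time."""
    return n_tokens / seconds

def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """The multiplier to report: speculative throughput over baseline."""
    return speculative_tps / baseline_tps

# Hypothetical measurements: 512 tokens in 52.2 s (baseline) vs 24.0 s.
baseline = tokens_per_sec(512, 52.2)
speculative = tokens_per_sec(512, 24.0)
print(f"{speedup(baseline, speculative):.2f}x")
```

Benchmark both modes on the same prompt and the same generation length, otherwise prompt-processing time skews the comparison.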
What to know before starting
Transformer forward passes: each pass produces one token's probability distribution over the vocabulary. The expensive part is the matrix multiplications across all layers, which process all positions in parallel, so a pass costs about the same whether it verifies one new token or several.
Draft-target alignment: the draft model must be from the same model family as the target (same tokenizer, same pre-training distribution) for acceptance rates to be high. A Qwen2.5-1.5B draft with a Qwen2.5-32B target achieves 65–80% acceptance; a mismatched draft might get 20–30%.
Why acceptance rate matters: at 100% acceptance with 8 draft tokens you'd get up to an 8× speedup. At 0% acceptance every draft guess is wrong, and you actually run slower than baseline because all the drafting work is wasted. The break-even acceptance rate is roughly 30%.
GGUF format: the binary format llama.cpp uses for quantized models. Each file contains model weights, tokenizer, and metadata, so GGUF files are self-contained and portable.
GPU offloading with -ngl: "number of GPU layers" controls how many transformer layers are computed on the Metal GPU versus CPU RAM. -ngl 99 offloads all layers, maximizing throughput.
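The acceptance-rate numbers above can be turned into a back-of-the-envelope model. Assuming (these are simplifying assumptions, not measurements) that each draft token is accepted independently with probability `a` and that one draft forward pass costs a fraction `c` of one target pass, the expected tokens gained per verification is the geometric sum 1 + a + a² + … + aᵏ:

```python
# Back-of-the-envelope speedup model for speculative decoding.
# Assumptions: independent per-token acceptance rate `a`, and a draft
# pass costing fraction `c` of a target pass. Real overheads differ.

def expected_accepted(a, k):
    # Expected tokens per target verification with k drafted tokens:
    # the accepted prefix plus the correction/bonus token, which sums
    # to 1 + a + a**2 + ... + a**k = (1 - a**(k+1)) / (1 - a) for a < 1.
    return sum(a**i for i in range(k + 1))

def speedup(a, k=8, c=0.05):
    # Each cycle costs k draft passes plus 1 target verification pass.
    return expected_accepted(a, k) / (1 + c * k)

for a in (0.2, 0.3, 0.65, 0.8):
    print(f"acceptance {a:.0%}: ~{speedup(a):.1f}x")
```

With these assumptions the model reproduces the figures in the text: roughly 1× at 30% acceptance (the break-even point), about 2× at 65%, and below 1× at 20%, where drafting costs more than it saves.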
Prerequisites
• llama.cpp installed with `llama-speculative` binary in your PATH (see the llama.cpp playbook)
• Two GGUF model files from the same model family at different sizes (see Setup tab for exact downloads)
• 32 GB+ unified memory (the 32B Q4_K_M target needs ~20 GB and the 1.5B draft ~1.7 GB; both must fit simultaneously)
• `huggingface-cli` for downloading models: `pip install huggingface_hub`
Time & risk
Duration:: 15 minutes setup (plus model download time; ~20 GB total on first run)
Risk level:: Low: read-only model inference, no system modifications
Rollback:: Nothing to roll back; simply stop the process