--- name: dflash-mlx-speculative-decoding description: Lossless DFlash speculative decoding for MLX on Apple Silicon — 1.7–4x faster LLM inference using block diffusion drafting with target model verification. triggers: - use dflash for faster inference - speed up mlx generation with speculative decoding - dflash speculative decoding apple silicon - run dflash-serve openai compatible server - benchmark dflash vs baseline mlx - set up dflash with qwen model - dflash draft model auto resolution - faster token generation on apple silicon mlx --- # dflash-mlx Speculative Decoding > Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection. DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B params) generates 16 tokens in parallel using block diffusion; the target model verifies all 16 in a single forward pass. Tokens are only emitted after target verification — output is lossless (every token is the target model's greedy argmax). **Typical speedups**: 1.7x–4.1x over baseline `mlx_lm` depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models. ## Installation ```bash pip install dflash-mlx # or isolated install pipx install dflash-mlx ``` Requires Python 3.10+, MLX 0.31.1+, Apple Silicon Mac. ## Key CLI Commands ### Generate text ```bash # Auto-resolve draft model from registry dflash --model Qwen/Qwen3.5-9B --prompt "Explain backpropagation" # Explicit draft model dflash --model Qwen/Qwen3.5-9B \ --draft z-lab/Qwen3.5-9B-DFlash \ --prompt "Explain backpropagation" # Disable EOS (useful for benchmarking fixed token counts) dflash --model Qwen/Qwen3.5-9B --prompt "..." --max-tokens 1024 --no-eos ``` ### OpenAI-compatible server ```bash # Basic server dflash-serve --model Qwen/Qwen3.5-9B --port 8000 # With explicit draft dflash-serve --model Qwen/Qwen3.5-9B \ --draft z-lab/Qwen3.5-9B-DFlash \ --port 8000 # Disable thinking/reasoning tokens (Qwen3.5 thinking models) dflash-serve --model Qwen/Qwen3.5-9B --port 8000 \ --chat-template-args '{"enable_thinking": false}' # Raise fallback threshold for longer prompts (large models) dflash-serve --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8000 \ --chat-template-args '{"enable_thinking": false}' \ --dflash-max-ctx 16384 ``` ### Benchmark ```bash dflash-benchmark \ --model Qwen/Qwen3.5-9B \ --draft z-lab/Qwen3.5-9B-DFlash \ --prompt "The function f satisfies..." \ --max-tokens 1024 \ --repeat 3 \ --no-eos ``` Outputs per-run JSON reports with tok/s, acceptance rate, and speedup vs baseline. ## Supported Model Pairs | Target Model | Draft Model | |---|---| | `Qwen/Qwen3.5-4B` | `z-lab/Qwen3.5-4B-DFlash` | | `Qwen/Qwen3.5-9B` | `z-lab/Qwen3.5-9B-DFlash` | | `mlx-community/Qwen3.5-27B-4bit` | `z-lab/Qwen3.5-27B-DFlash` | | `mlx-community/Qwen3.5-35B-A3B-4bit` | `z-lab/Qwen3.5-35B-A3B-DFlash` | Draft models are auto-resolved from a registry — no `--draft` flag needed for listed pairs. Models without a matching draft are rejected at startup. ## Python API Usage ### Streaming generation ```python from dflash_mlx import DFlashRuntime runtime = DFlashRuntime.from_pretrained( model="Qwen/Qwen3.5-9B", draft="z-lab/Qwen3.5-9B-DFlash", # optional, auto-resolved ) prompt = "Explain the Pythagorean theorem step by step." for token_text in runtime.stream_generate( prompt=prompt, max_tokens=512, use_chat_template=True, ): print(token_text, end="", flush=True) print() ``` ### Full generation with stats ```python from dflash_mlx import DFlashRuntime runtime = DFlashRuntime.from_pretrained(model="Qwen/Qwen3.5-9B") result = runtime.generate( prompt="What is speculative decoding?", max_tokens=256, use_chat_template=True, ) print(result.text) print(f"Tokens/sec: {result.tokens_per_second:.2f}") print(f"Acceptance rate: {result.acceptance_rate:.2%}") print(f"Total tokens: {result.total_tokens}") ``` ### Custom draft block size and context ```python from dflash_mlx import DFlashRuntime, DFlashConfig config = DFlashConfig( draft_block_size=16, # tokens drafted per speculative step max_ctx=8192, # max context length before fallback enable_tape_replay=True, # GatedDeltaNet recurrent rollback jit_sdpa=True, # custom Metal SDPA for long contexts ) runtime = DFlashRuntime.from_pretrained( model="mlx-community/Qwen3.5-27B-4bit", config=config, ) ``` ### OpenAI client against dflash-serve ```python from openai import OpenAI client = OpenAI( base_url="http://localhost:8000/v1", api_key="not-needed", # dflash-serve does not require auth by default ) # Non-streaming response = client.chat.completions.create( model="Qwen/Qwen3.5-9B", messages=[ {"role": "user", "content": "Explain gradient descent."} ], max_tokens=512, ) print(response.choices[0].message.content) # Streaming stream = client.chat.completions.create( model="Qwen/Qwen3.5-9B", messages=[{"role": "user", "content": "Write a haiku about silicon."}], max_tokens=128, stream=True, ) for chunk in stream: delta = chunk.choices[0].delta.content if delta: print(delta, end="", flush=True) print() ``` ### Tool calling (via dflash-serve) ```python import json from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed") tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a city", "parameters": { "type": "object", "properties": { "city": {"type": "string", "description": "City name"}, }, "required": ["city"], }, }, } ] response = client.chat.completions.create( model="Qwen/Qwen3.5-9B", messages=[{"role": "user", "content": "What's the weather in Tokyo?"}], tools=tools, tool_choice="auto", ) tool_call = response.choices[0].message.tool_calls[0] print(f"Function: {tool_call.function.name}") print(f"Args: {json.loads(tool_call.function.arguments)}") ``` ## Common Patterns ### Side-by-side demo (baseline vs DFlash) ```bash PYTHONPATH=. python3 -m examples.demo --mode dflash \ --target-model Qwen/Qwen3.5-9B \ --draft-model z-lab/Qwen3.5-9B-DFlash \ --prompt "Solve: f(x) + f(y) = f(x+y) - xy - 1" \ --max-tokens 2048 \ --no-eos ``` ### Integrating with Open WebUI 1. Start `dflash-serve --model Qwen/Qwen3.5-9B --port 8000` 2. In Open WebUI settings → Connections → add OpenAI API with URL `http://localhost:8000/v1` 3. Select model `Qwen/Qwen3.5-9B` in the chat UI Works the same for Continue, aider, OpenCode, and any OpenAI-compatible client. ### Override draft for unsupported models ```bash # Force a custom draft — bypasses registry check dflash --model my-org/MyCustomModel \ --draft my-org/MyCustomModel-DFlash \ --prompt "Hello" ``` ### Disable thinking tokens for Qwen3.5 ```bash # CLI dflash --model Qwen/Qwen3.5-9B \ --chat-template-args '{"enable_thinking": false}' \ --prompt "What is 2+2?" # Server dflash-serve --model Qwen/Qwen3.5-9B \ --chat-template-args '{"enable_thinking": false}' \ --port 8000 ``` ## Architecture Notes - **Tape-replay rollback**: For hybrid GatedDeltaNet + attention models (Qwen3.5), dflash records an innovation tape during verify and replays only accepted steps via a custom Metal kernel — avoids full state snapshots. - **JIT SDPA 2-pass**: For contexts ≥ 1024 tokens, a custom Metal attention kernel maintains numerical alignment with stock MLX attention. - **Greedy acceptance**: Keeps the longest correct prefix from the 16 drafted tokens, rejects the rest. No temperature/sampling on verification — strictly lossless. - **Qwen3 (pure attention)** models work but don't benefit from tape-replay rollback (that's GatedDeltaNet-specific). ## Troubleshooting **Model rejected at startup** ``` Error: No DFlash draft found for model 'org/ModelName' ``` → Pass `--draft org/ModelName-DFlash` explicitly, or use a model from the supported pairs table. **Low acceptance rate (< 80%)** - Usually caused by very long context (4096+). Try `--dflash-max-ctx 8192` to extend the fallback threshold. - Qwen3 (non-3.5) models have lower acceptance than Qwen3.5 hybrid models. **Numerical divergence / output differs from pure AR** - Expected behavior: "Output can still differ from pure AR because of MLX dispatch divergence, but no unverified token is ever emitted." - If outputs seem wrong (not just different), ensure MLX 0.31.1+ is installed: `python -c "import mlx; print(mlx.__version__)"` **Server not accepting connections** ```bash # Check port is not in use lsof -i :8000 # Bind to all interfaces for network access dflash-serve --model Qwen/Qwen3.5-9B --port 8000 --host 0.0.0.0 ``` **Out of memory with large models** - Use 4-bit quantized variants: `mlx-community/Qwen3.5-27B-4bit` instead of the full model. - The draft model loads alongside the target — budget ~1–2GB extra for the draft. **Benchmark results JSON location** ```bash ls benchmark/results/ # Per-run JSON with tok/s, acceptance rate, repeat measurements ```