# DS4 Imatrix Pipeline

This directory contains the calibration dataset and instructions used to build
activation importance matrices for DeepSeek V4 Flash and Pro GGUF
quantization.

The current imatrix target is the routed MoE path.  Flash has 43 layers and
256 routed experts per layer.  Pro has 61 layers and 384 routed experts per
layer.  Both variants expose three routed expert tensors per layer:

- `blk.N.ffn_gate_exps.weight`
- `blk.N.ffn_up_exps.weight`
- `blk.N.ffn_down_exps.weight`

For gate/up tensors, the collector records the squared FFN-normalized input
activation.  For down tensors, it records the squared routed SwiGLU row after
route weighting.  The result tells the quantizer which input columns are used
more heavily by the actual DS4 inference graph.

## 1. Build The Calibration Dataset

The tracked dataset is in `gguf-tools/imatrix/dataset/`.  Regenerate it from
the repository root with:

```sh
python3 gguf-tools/imatrix/dataset/build_ds4_imatrix_dataset.py
```

The important output is:

```text
gguf-tools/imatrix/dataset/rendered_prompts.txt
```

It contains DS4-rendered chat prompts, separated by visible
`DS4_IMATRIX_PROMPT` markers.  The prompts include:

- C/Metal source-review prompts from this repository.
- Long-context snippets.
- Agent/tool-call prompts using DS4's DSML syntax.
- Language/prose rewriting, summarization, extraction, and translation prompts.
- `ds4-eval` GPQA Diamond, SuperGPQA, and AIME2025 benchmark prompts.
- Both thinking and non-thinking assistant prefixes.

The current tracked dataset has 4682 rendered prompts and roughly 2.91M tokens
by the coarse bytes/4 estimate.  Check
`gguf-tools/imatrix/dataset/manifest.json` for the exact generated-file
summary.

## 2. Collect The Imatrix

Use the DS4 runtime itself to collect routed MoE activation statistics.  The
collector uses the loaded GGUF metadata, so the same command shape works for
Flash and Pro.

Flash example:

```sh
./ds4 \
  -m ../deepseek-v4-quants/gguf/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf \
  --imatrix-dataset gguf-tools/imatrix/dataset/rendered_prompts.txt \
  --imatrix-out ../deepseek-v4-quants/imatrix/DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-1p5m.dat \
  --ctx 32768
```

Pro example with a smaller calibration budget:

```sh
./ds4 \
  -m ../deepseek-v4-quants/gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct.gguf \
  --imatrix-dataset gguf-tools/imatrix/dataset/rendered_prompts.txt \
  --imatrix-out ../deepseek-v4-quants/imatrix/DeepSeek-V4-Pro-Instruct-routed-moe-ds4-small.dat \
  --imatrix-max-prompts 16 \
  --imatrix-max-tokens 32768 \
  --ctx 32768
```

Useful smoke-test limits:

```sh
./ds4 \
  -m MODEL.gguf \
  --imatrix-dataset gguf-tools/imatrix/dataset/rendered_prompts.txt \
  --imatrix-out /tmp/ds4-test.imatrix.dat \
  --imatrix-max-prompts 1 \
  --imatrix-max-tokens 4096
```

The collector is Metal-only because it hooks the layer-major Metal prefill graph.
It does not change inference math; it reads the already materialized MoE inputs
and accumulates `sum(x[column]^2)` per routed expert.

The output format is llama.cpp's legacy binary `.dat` imatrix format.  DS4 packs
per-expert vectors into one entry per routed expert tensor:

```text
entry length = n_expert * n_columns
```

The quantizer slices the right expert's segment when quantizing each expert.
For Pro, that means each routed tensor entry contains 384 vectors instead of
Flash's 256.

## 3. Generate GGUF Files With The Imatrix

The local C quantization tool in `gguf-tools/` supports:

```text
--imatrix FILE
--imatrix-strict
```

Example Q4 routed-expert regeneration:

```sh
gguf-tools/deepseek4-quantize \
  --hf ../deepseek-v4-quants/hf/DeepSeek-V4-Flash \
  --template ../deepseek-v4-quants/gguf/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf \
  --out ../deepseek-v4-quants/gguf/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf \
  --imatrix ../deepseek-v4-quants/imatrix/DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-1p5m.dat
```

Example Q2 regeneration:

```sh
gguf-tools/deepseek4-quantize \
  --hf ../deepseek-v4-quants/hf/DeepSeek-V4-Flash \
  --template ../deepseek-v4-quants/gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  --out ../deepseek-v4-quants/gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
  --imatrix ../deepseek-v4-quants/imatrix/DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-1p5m.dat
```

Example Pro Q2 regeneration from a Pro template:

```sh
gguf-tools/deepseek4-quantize \
  --hf ../deepseek-v4-quants/hf/DeepSeek-V4-Pro \
  --template ../deepseek-v4-quants/gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct.gguf \
  --out ../deepseek-v4-quants/gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix-small.gguf \
  --imatrix ../deepseek-v4-quants/imatrix/DeepSeek-V4-Pro-Instruct-routed-moe-ds4-small.dat \
  --threads 24
```

`deepseek4-quantize` reads the routed expert count from the template GGUF.  Use
`--n-experts` only when working with an older template that lacks
`deepseek4.expert_count`.

For Q4, the imatrix does not change the runtime tensor type: routed experts
remain `Q4_K`.  It changes how quantization error is weighted while choosing
scales and codes.  For Q2, it replaces the previous synthetic weight-energy
fallback used for `IQ2_XXS` gate/up experts with real activation statistics.

## 4. Evaluate

Useful local tools:

```text
misc/quant_eval.c
gguf-tools/quality-testing/
```

`misc/quant_eval.c` compares local GGUF variants by greedy/top-logit behavior.
`gguf-tools/quality-testing/` can score local GGUFs against official DeepSeek
API continuations by target-token negative log likelihood.

The Q4 imatrix file uploaded to Hugging Face was tested on 100 official
DeepSeek V4 Flash continuations:

```text
old Q4 avg NLL:         0.177357819
Q4 imatrix avg NLL:     0.173895148
relative NLL change:   -1.95%
case wins:              54 imatrix / 46 old
first-token matches:    83 imatrix / 81 old
avg greedy LCP:         12.21 imatrix / 11.94 old
```

## Compatibility

The `.dat` file is intentionally in llama.cpp's legacy imatrix format, so the
data is not conceptually tied to DS4.  In practice, it is immediately useful
only with a quantizer that understands DS4's tensor names and packed per-expert
entries.  The current `deepseek4-quantize` tooling does that.

Other GGUF creation tools can use the same imatrix if they implement the same
name mapping and per-expert slicing convention.  Without that DS4-specific
mapping, a generic imatrix loader will see valid data but will not know how to
apply the packed routed-expert vectors correctly.