# PQC Training Data Transparency

![PQC Native](https://img.shields.io/badge/PQC-Native-blue)
![Merkle SHA3-256](https://img.shields.io/badge/Merkle-SHA3--256-green)
![ML-DSA](https://img.shields.io/badge/ML--DSA-FIPS%20204-green)
![License](https://img.shields.io/badge/License-Apache%202.0-orange)
![Version](https://img.shields.io/badge/version-0.1.0-lightgrey)

**Cryptographic transparency for AI training data.** Build an SHA3-256 Merkle tree over every record in your training set, sign the root with **ML-DSA** (FIPS 204), and publish it. Anyone who holds a single document can later receive an `O(log n)` inclusion proof showing that the record was in the training set — without revealing any of the other records. The audit trail survives the transition to post-quantum cryptography, so commitments made today remain verifiable in 2035 and beyond.

## The Problem

AI copyright litigation, regulatory audits, and red-team requests keep asking the same question: *what exactly was used to train this model?* Model creators today have no cryptographic answer.

- "Prove this document was NOT in your training set": answering requires revealing the entire training set, which is impossible for proprietary or licensed data.
- "Prove your model wasn't trained on PII": requires deleting the PII and then proving a negative.
- "Which records were used for fine-tune v2 vs v3?": no binding commitment exists, so claims are unfalsifiable.

And the few audit trails that do exist are typically RSA- or ECDSA-signed. A cryptographically relevant quantum computer breaks those signatures, and the entire audit chain collapses retroactively. Training data provenance has a 15-20 year shelf life; the crypto under it must survive that long.

## The Solution

Commit once, prove selectively:

- Hash every record into a leaf: `SHA3-256(SHA3-256(content) || "|" || canonical_json(metadata))`.
- Build an SHA3-256 Merkle tree over the leaves.
- Wrap the root in a `TrainingCommitment` (dataset name, version, record count, timestamps, licenses, tags).
- Sign the canonical commitment with **ML-DSA** at model-release time.
- Publish the commitment anywhere — on-chain, in a transparency log, on quantamrkt.com, stapled to the model card.

Later, anyone can ask "was record X in the training set?" The creator returns an inclusion proof (`log₂(n)` sibling hashes). The verifier checks the proof against the signed root. No other record is revealed.
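
To give a feel for how small these proofs are, the arithmetic can be sketched as below. This assumes 32-byte SHA3-256 digests and counts only the raw sibling hashes; `max_proof_siblings` and `max_proof_bytes` are illustrative helpers, not part of the package.

```python
def max_proof_siblings(n: int) -> int:
    """Upper bound on sibling hashes for a tree with n leaves: ceil(log2(n))."""
    if n < 1:
        raise ValueError("tree must have at least one leaf")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (n - 1).bit_length()


def max_proof_bytes(n: int, digest_size: int = 32) -> int:
    """Upper bound on raw sibling-hash bytes (SHA3-256 digests are 32 bytes)."""
    return max_proof_siblings(n) * digest_size


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> <= {max_proof_siblings(n)} siblings, "
          f"<= {max_proof_bytes(n)} bytes")
```

Even for a billion-record corpus the sibling hashes stay under one kilobyte, which is why a proof can be attached to an audit response without revealing or transferring the dataset.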

## Installation

```bash
pip install pqc-training-data-transparency
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Build and sign a commitment

```python
from quantumshield import AgentIdentity
from pqc_training_data import (
    CommitmentBuilder, CommitmentSigner, DataRecord,
)

identity = AgentIdentity.create("model-creator")
signer = CommitmentSigner(identity)

# `your_documents` is a placeholder: any iterable of `bytes`, one item per training record.
corpus = [
    DataRecord(content=doc_bytes, metadata={"source": "internal", "id": i})
    for i, doc_bytes in enumerate(your_documents)
]

builder = CommitmentBuilder(dataset_name="model-v1-train", dataset_version="1.0.0")
builder.add_records(corpus)
builder.licenses = ["cc-by-4.0"]
builder.tags = ["production"]

commitment = signer.sign(builder.build(description="Production training set"))

# Publish commitment.to_json() — this is the public audit artifact.
```

### Prove a single record is in the training set

```python
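from pqc_training_data import CommitmentVerifier

# Creator side: generate the proof for the requested record.
proof = builder.tree.inclusion_proof(index=42)

# Auditor side: needs only the record, the proof, and the public commitment.
result = CommitmentVerifier.verify(corpus[42], proof, commitment)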

assert result.fully_verified
# -> signature_valid=True, proof_valid=True, leaf_matches_record=True
```

### Detect a forged inclusion claim

```python
forged = DataRecord(content=b"never-in-training", metadata={"id": 999})
pretend_proof = builder.tree.inclusion_proof(index=0)  # hijack a real slot

result = CommitmentVerifier.verify(forged, pretend_proof, commitment)
assert not result.fully_verified            # rejected
# result.error: "record leaf_hash ... does not match proof ..."
```

## Architecture

```
  Training Pipeline (creator)                        Audit Path (third party)
  --------------------------                         ------------------------
                                                                |
  records = [doc1, doc2, ..., docN]                             |
         |                                                      |
         | 1. leaf_hash = SHA3-256(                             |
         |       SHA3-256(content) || "|" ||                    |
         |       canonical_json(metadata))                      |
         v                                                      |
  [leaf_1, leaf_2, ..., leaf_N]                                 |
         |                                                      |
         | 2. Merkle fold (SHA3-256, 0x00/0x01 domain sep)      |
         v                                                      |
       ROOT                                                     |
         |                                                      |
         | 3. wrap in TrainingCommitment                        |
         |    (id, dataset, version, created_at, ...)           |
         |                                                      |
         | 4. ML-DSA.sign(canonical(commitment))                |
         v                                                      |
  SIGNED COMMITMENT  -->  published (on-chain, log, model card) |
                                                                |
                                                                | 5. request
                                                                |    inclusion
                                                                |    proof for
                                                                |    record R
                                                                v
                          InclusionProof (leaf, siblings, dirs, root)
                                                                |
                                                                | 6. verify:
                                                                |    ML-DSA(commitment) OK?
                                                                |    leaf_hash(R) == proof.leaf?
                                                                |    walk siblings -> root?
                                                                |    proof.root == commitment.root?
                                                                v
                                                         VerificationResult
                                                         (fully_verified T/F)
```
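
Step 2 of the diagram (the Merkle fold) can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: it assumes the RFC 6962 convention of `0x00` for leaves and `0x01` for interior nodes, and it assumes an unpaired node is carried up unchanged. `merkle_root` and `sha3` are hypothetical helper names.

```python
import hashlib

LEAF_PREFIX = b"\x00"  # assumed domain-separation byte for leaves
NODE_PREFIX = b"\x01"  # assumed domain-separation byte for interior nodes


def sha3(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    """Fold leaf hashes into a root, promoting an unpaired last node unchanged."""
    if not leaf_hashes:
        raise ValueError("cannot compute the root of an empty tree")
    level = [sha3(LEAF_PREFIX + leaf) for leaf in leaf_hashes]
    while len(level) > 1:
        # Hash adjacent pairs; a trailing odd node is handled below.
        nxt = [sha3(NODE_PREFIX + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node out: carried up, not duplicated
        level = nxt
    return level[0]


leaves = [sha3(f"record-{i}".encode()) for i in range(5)]
root = merkle_root(leaves)
print(root.hex())
```

The distinct prefixes keep a leaf from being reinterpreted as an interior node, so an attacker cannot present a pair of child hashes as if it were a record.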

## Threat Model

| Threat | Handled | Notes |
|---|---|---|
| **Forged inclusion claim** (attacker claims doc X is in the set) | Yes | Verifier recomputes `leaf_hash(X)` and compares it to the proof's leaf; the claim is rejected if the leaf differs or the walk does not end at the signed root. |
| **Tampered commitment** (attacker edits `dataset_name`, `record_count`, or `root`) | Yes | Canonical bytes change, so the ML-DSA signature no longer verifies. |
| **Tampered inclusion proof** (attacker flips a sibling hash) | Yes | Root recomputation diverges from signed root. |
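| **Quantum forgery in 2035+** (CRQC forges the audit trail retroactively) | Yes | ML-DSA (FIPS 204) is a lattice-based post-quantum signature scheme; Shor's algorithm does not apply to it, and no efficient quantum attack on it is known. Grover's algorithm only reduces SHA3-256 preimage security to roughly 128 bits. |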
| **Proving NON-inclusion** (prove a record was *not* in training) | No | Requires a different construction, such as a sorted-leaf or sparse Merkle tree. Future work. |
| **Leaking private training data** (commitment or proofs expose other records) | Yes (by design) | The commitment contains only the root, and proofs reveal `log₂(n)` sibling hashes, never other records. The creator decides what to reveal. |
| **Selective disclosure of metadata fields** | No | A record's metadata is hashed into its leaf as a whole, so disclosure is all-or-nothing. If you need partial reveals, commit to those fields separately (for example, as their own leaves). |
| **Re-publication of old commitment** (attacker re-uses prior root for a new model release) | Partial | `commitment_id` + `dataset_version` + `created_at` are all signed; enforce freshness by policy. |

## API Reference

### `DataRecord`

Frozen dataclass. One training example.

| Field / Method | Description |
|---|---|
| `content: bytes` | Raw record payload (doc text, image bytes, serialized row, ...). |
| `metadata: dict` | Arbitrary metadata — participates in the leaf hash. |
| `canonical_bytes()` | Deterministic `SHA3-256(content) \|\| "\|" \|\| canonical_json(metadata)`. |
| `leaf_hash() -> RecordHash` | SHA3-256 of canonical bytes — the Merkle leaf value. |
| `to_dict()` | Safe serialization. **Does not include raw content.** |
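
The leaf derivation described in the table can be approximated with the standard library. This is a sketch under stated assumptions, not the package's code: it assumes canonical JSON means sorted keys with compact separators in UTF-8, and that the inner content digest is hex-encoded before concatenation. The function names mirror the table but are standalone illustrations.

```python
import hashlib
import json


def canonical_json(metadata: dict) -> bytes:
    """Assumed canonical form: sorted keys, no extra whitespace, UTF-8."""
    return json.dumps(metadata, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def canonical_bytes(content: bytes, metadata: dict) -> bytes:
    """SHA3-256(content) || "|" || canonical_json(metadata), digest as hex (assumed)."""
    content_digest = hashlib.sha3_256(content).hexdigest().encode("ascii")
    return content_digest + b"|" + canonical_json(metadata)


def leaf_hash(content: bytes, metadata: dict) -> str:
    """Hex SHA3-256 of the canonical bytes: the value stored as the Merkle leaf."""
    return hashlib.sha3_256(canonical_bytes(content, metadata)).hexdigest()


a = leaf_hash(b"hello", {"source": "internal", "id": 1})
b = leaf_hash(b"hello", {"id": 1, "source": "internal"})  # same metadata, other key order
c = leaf_hash(b"hello", {"source": "internal", "id": 2})  # one metadata value changed
print(a == b, a == c)  # True False
```

Canonicalization is what makes the hash reproducible by an auditor: two parties who serialize the same metadata in different key orders still arrive at the same leaf, while any change to a value produces a different one.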

### `MerkleTree`

SHA3-256 Merkle tree with RFC 6962-style odd-node promotion: an unpaired node at the end of a level is carried up unchanged rather than duplicated.

| Method | Description |
|---|---|
| `add(leaf_hash)` / `add_many(leaves)` | Append leaves. |
| `root() -> str` | Hex Merkle root. Raises `EmptyTreeError` for empty trees. |
| `inclusion_proof(index) -> InclusionProof` | `O(log n)` proof for leaf at `index`. |
| `MerkleTree.verify_inclusion(proof) -> bool` | Static verification — independent of tree state. |

### `InclusionProof`

Frozen dataclass carried from prover to verifier.

| Field | Description |
|---|---|
| `leaf_hash` | Hex of the leaf being proven. |
| `index`, `tree_size` | Position and total size at time of proof. |
| `root` | Hex root the prover claims. |
| `siblings`, `directions` | `log₂(n)` sibling hashes + `'L'`/`'R'` flags. |
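
How a verifier might consume `siblings` and `directions` is sketched below. The `Proof` class is a hypothetical stand-in for `InclusionProof` (it omits `index` and `tree_size`), the `0x01` interior-node prefix is an assumption, and `'L'` is assumed to mean that the sibling sits on the left. Leaf-level values are taken as given.

```python
import hashlib
from dataclasses import dataclass

NODE_PREFIX = b"\x01"  # assumed interior-node domain-separation byte


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha3_256(NODE_PREFIX + left + right).digest()


@dataclass(frozen=True)
class Proof:  # hypothetical stand-in for InclusionProof
    leaf_hash: str
    root: str
    siblings: tuple[str, ...]
    directions: tuple[str, ...]  # assumed: 'L' = sibling is the left child


def verify_inclusion(proof: Proof) -> bool:
    """Recompute the root from the leaf and its siblings, then compare."""
    if len(proof.siblings) != len(proof.directions):
        return False
    current = bytes.fromhex(proof.leaf_hash)
    for sibling_hex, direction in zip(proof.siblings, proof.directions):
        sibling = bytes.fromhex(sibling_hex)
        if direction == "L":
            current = node_hash(sibling, current)
        elif direction == "R":
            current = node_hash(current, sibling)
        else:
            return False  # unknown direction flag
    return current.hex() == proof.root


# Hand-built three-leaf tree: root = H(H(l0, l1), l2), with l2 promoted.
l0, l1, l2 = (hashlib.sha3_256(x).digest() for x in (b"a", b"b", b"c"))
h01 = node_hash(l0, l1)
root = node_hash(h01, l2)

proof_l1 = Proof(l1.hex(), root.hex(), (l0.hex(), l2.hex()), ("L", "R"))
proof_l2 = Proof(l2.hex(), root.hex(), (h01.hex(),), ("L",))
tampered = Proof(l1.hex(), root.hex(), (l2.hex(), l2.hex()), ("L", "R"))
print(verify_inclusion(proof_l1), verify_inclusion(proof_l2), verify_inclusion(tampered))
# True True False
```

Note that the promoted leaf `l2` needs only one sibling, so a proof can be shorter than `log₂(n)` entries for some positions.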

### `TrainingCommitment`

The signed audit artifact.

| Field | Description |
|---|---|
| `commitment_id` | `urn:pqc-td:<uuid>`. |
| `dataset_name`, `dataset_version`, `description` | Human-readable identification. |
| `record_count`, `root` | Cryptographic binding to the tree. |
| `created_at`, `licenses`, `tags`, `extra` | Provenance metadata — all signed. |
| `signer_did`, `algorithm`, `signature`, `public_key`, `signed_at` | ML-DSA signature block (populated by `CommitmentSigner.sign`). |
| `to_json()` / `from_json()` | Network-safe round-trip. |
| `canonical_bytes()` | Deterministic JSON covered by the signature. |
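
A rough picture of what "deterministic JSON covered by the signature" could mean is sketched below, using plain dictionaries. The sketch assumes the signature-block fields are excluded from the signed bytes because they are only populated at signing time; the real `canonical_bytes()` may draw that line differently. All values are placeholders.

```python
import json

# Signature-block fields named in the table above; assumed to be excluded
# from the signed bytes, since they are filled in by the signer.
SIGNATURE_FIELDS = {"signer_did", "algorithm", "signature", "public_key", "signed_at"}


def canonical_bytes(commitment: dict) -> bytes:
    """Deterministic JSON over every field except the signature block."""
    signed_view = {k: v for k, v in commitment.items() if k not in SIGNATURE_FIELDS}
    return json.dumps(signed_view, sort_keys=True, separators=(",", ":")).encode("utf-8")


commitment = {
    "commitment_id": "urn:pqc-td:00000000-0000-0000-0000-000000000000",  # placeholder
    "dataset_name": "model-v1-train",
    "dataset_version": "1.0.0",
    "record_count": 3,
    "root": "ab" * 32,  # placeholder root
    "licenses": ["cc-by-4.0"],
    "tags": ["production"],
}

before = canonical_bytes(commitment)
after_signing = canonical_bytes({**commitment, "signature": "deadbeef", "algorithm": "ML-DSA-65"})
tampered = canonical_bytes({**commitment, "record_count": 4})
print(before == after_signing, before == tampered)  # True False
```

Attaching the signature leaves the signed bytes unchanged, while editing any provenance field changes them, which is what causes verification to fail after tampering.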

### `CommitmentBuilder`

Accumulates records and emits an unsigned `TrainingCommitment`.

| Method | Description |
|---|---|
| `CommitmentBuilder(dataset_name, dataset_version)` | Start a build. |
| `add_record(record)` / `add_records(records)` | Queue records. |
| `add_leaf_hash_hex(hex)` | Direct-add when caller pre-hashed the data. |
| `build(description="") -> TrainingCommitment` | Produce unsigned commitment. |
| `.tree` | Underlying `MerkleTree` — use to generate inclusion proofs later. |

### `CommitmentSigner`

ML-DSA sign + verify.

| Method | Description |
|---|---|
| `CommitmentSigner(identity)` | Wrap a QuantumShield `AgentIdentity`. |
| `sign(commitment) -> TrainingCommitment` | Populate signature fields. |
| `CommitmentSigner.verify(commitment) -> bool` | Static — verify signature against embedded public key. |

### `CommitmentVerifier` + `VerificationResult`

End-to-end check of (record, proof, commitment).

| Method | Description |
|---|---|
| `CommitmentVerifier.verify(record, proof, commitment)` | Returns a `VerificationResult`. |
| `CommitmentVerifier.verify_or_raise(...)` | Raises `CommitmentVerificationError` on any failure. |

`VerificationResult` fields: `signature_valid`, `proof_valid`, `leaf_matches_record`, `commitment_id`, `record_leaf_hash`, `claimed_root`, `error`, and the `fully_verified` property.
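
The relationship between the individual flags and `fully_verified` can be pictured as follows. `VerificationResultSketch` is a hypothetical mirror of the fields listed above, and the property's semantics (all three checks must pass) are an assumption inferred from the Quick Start output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerificationResultSketch:  # hypothetical mirror of VerificationResult
    signature_valid: bool
    proof_valid: bool
    leaf_matches_record: bool
    commitment_id: str
    record_leaf_hash: str
    claimed_root: str
    error: Optional[str] = None

    @property
    def fully_verified(self) -> bool:
        # Assumed semantics: every individual check must pass.
        return self.signature_valid and self.proof_valid and self.leaf_matches_record


ok = VerificationResultSketch(True, True, True,
                              "urn:pqc-td:example", "aa" * 32, "bb" * 32)
forged = VerificationResultSketch(True, True, False,
                                  "urn:pqc-td:example", "cc" * 32, "bb" * 32,
                                  error="record leaf hash does not match proof")
print(ok.fully_verified, forged.fully_verified)  # True False
```

Keeping the flags separate lets an auditor report which check failed: a valid signature with a mismatched leaf points at a false inclusion claim rather than a corrupted commitment.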

### Exceptions

| Exception | When |
|---|---|
| `TrainingDataError` | Base class. |
| `EmptyTreeError` | Tree operation requires at least one leaf. |
| `InclusionProofError` | Malformed or unverifiable proof. |
| `CommitmentVerificationError` | Raised by `verify_or_raise` on failure. |
| `IndexOutOfRangeError` | Leaf index outside `[0, size)`. |

## Why PQC for Training Data

Training data provenance is a 15-to-20-year commitment:

- Regulatory discovery can ask about training data *decades* after the model was released.
- Copyright plaintiffs litigate on timelines that long outlive a model's commercial life.
- Medical, legal, and financial AI systems are audited for the lifetime of the decisions they influenced.

A Merkle commitment signed today with RSA-2048 or ECDSA-P256 becomes forgeable the moment a cryptographically relevant quantum computer exists. An adversary with a CRQC can retroactively forge arbitrary "signed commitments" and "inclusion proofs", collapsing the entire audit trail.

ML-DSA (FIPS 204) is not broken by Shor's algorithm. Commitments minted today remain verifiable through the post-quantum transition.

## Examples

See the `examples/` directory:

- **`commit_corpus.py`** — build a signed commitment over a small training corpus.
- **`prove_inclusion.py`** — produce and verify an `O(log n)` inclusion proof.
- **`detect_false_inclusion_claim.py`** — demonstrate rejection of a forged "my data was in training" claim.

Run them:

```bash
python examples/commit_corpus.py
python examples/prove_inclusion.py
python examples/detect_false_inclusion_claim.py
```

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
```

## Related

Part of the [QuantaMrkt](https://quantamrkt.com) post-quantum tooling registry. See also:

- **QuantumShield** — the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, `sign/verify`).
- **PQC RAG Signing** — sister tool for signing RAG corpus chunks with ML-DSA.
- **PQC Content Provenance** — signed manifests for content authenticity.
- **PQC MCP Transport** — signed JSON-RPC transport for Model Context Protocol.

## License

Apache License 2.0. See [LICENSE](LICENSE).