{ "cells": [ { "cell_type": "markdown", "id": "99dc9dec", "metadata": {}, "source": "# The audit workflow\n\nThis is the provenance and audit workflow end to end. You run an analysis,\nrecord its provenance digest next to the number you report, export the full\nresult as JSON, then verify and trace that number back to the raw data later.\nEvery cell below runs, so the digests, the lineage, and the True/False from\nverification are printed as real output, not pasted in.\n\n
" }, { "cell_type": "markdown", "id": "7caf3139", "metadata": {}, "source": [ "Install it first (skip this if mfgQC is already in your environment):" ] }, { "cell_type": "code", "execution_count": null, "id": "cdd67515", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:07.059758Z", "iopub.status.busy": "2026-06-26T13:09:07.059692Z", "iopub.status.idle": "2026-06-26T13:09:07.795947Z", "shell.execute_reply": "2026-06-26T13:09:07.795456Z" }, "tags": [ "skip-execution" ] }, "outputs": [], "source": [ "!pip install mfgqc" ] }, { "cell_type": "markdown", "id": "d03cde72", "metadata": {}, "source": [ "## The example\n", "\n", "We use a small, strictly-positive dataset and apply a Box-Cox transform, so the\n", "lineage has something interesting in it. The seed is fixed, so the digests this\n", "notebook prints are reproducible run to run." ] }, { "cell_type": "code", "execution_count": 2, "id": "fa8927ff", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:07.797560Z", "iopub.status.busy": "2026-06-26T13:09:07.797479Z", "iopub.status.idle": "2026-06-26T13:09:08.651784Z", "shell.execute_reply": "2026-06-26T13:09:08.651423Z" } }, "outputs": [ { "data": { "text/html": [ "
Process Capability (method=normal)\n",
       "==================================\n",
       "n = 80   mean = 1.7437\n",
       "sigma (within)  =   n/a\n",
       "sigma (overall) = 0.57318\n",
       "Cp/Cpk sigma    = overall\n",
       "\n",
       "Cp  = 3.344  95% CI (2.82, 3.86)\n",
       "Cpk = 0.7233  95% CI (0.589, 0.858)   (Cpu=5.965, Cpl=0.7233)\n",
       "Pp  = 3.344    Ppk = 0.7233   (Ppu=5.965, Ppl=0.7233)\n",
       "Cpm =   n/a\n",
       "\n",
       "Assumption checks:\n",
       "  [PASS] normality (Anderson-Darling): AD=0.343, p=0.481; est. Cpk impact 15.1%; n=80
" ], "text/plain": [ "Process Capability (method=normal)\n", "==================================\n", "n = 80 mean = 1.7437\n", "sigma (within) = n/a\n", "sigma (overall) = 0.57318\n", "Cp/Cpk sigma = overall\n", "\n", "Cp = 3.344 95% CI (2.82, 3.86)\n", "Cpk = 0.7233 95% CI (0.589, 0.858) (Cpu=5.965, Cpl=0.7233)\n", "Pp = 3.344 Ppk = 0.7233 (Ppu=5.965, Ppl=0.7233)\n", "Cpm = n/a\n", "\n", "Assumption checks:\n", " [PASS] normality (Anderson-Darling): AD=0.343, p=0.481; est. Cpk impact 15.1%; n=80" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "import dataclasses as dc\n", "import numpy as np, pandas as pd, mfgqc\n", "\n", "rng = np.random.default_rng(11)\n", "df = pd.DataFrame({\n", " \"cycles\": np.round(rng.lognormal(mean=1.2, sigma=0.35, size=80), 3),\n", "})\n", "\n", "qc = mfgqc.load(df, measure=\"cycles\").spec(lower=0.5, upper=12.0)\n", "cap = qc.transform(\"boxcox\").capability()\n", "cap" ] }, { "cell_type": "markdown", "id": "2ab81c55", "metadata": {}, "source": [ "## 1. Run the analysis and read its lineage\n", "\n", "Every result carries the full chain of operations that produced it.\n", "`lineage()` returns one dict per step. 
Pull the operation names to see the shape\n", "of the computation:" ] }, { "cell_type": "code", "execution_count": 3, "id": "acc1f7e3", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.652877Z", "iopub.status.busy": "2026-06-26T13:09:08.652790Z", "iopub.status.idle": "2026-06-26T13:09:08.654823Z", "shell.execute_reply": "2026-06-26T13:09:08.654599Z" } }, "outputs": [ { "data": { "text/plain": [ "['load', 'spec', 'transform', 'capability', 'assumption:normality']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[s[\"operation\"] for s in cap.lineage()]" ] }, { "cell_type": "markdown", "id": "48b8f2ef", "metadata": {}, "source": [ "That is the whole derivation: the frame was loaded, spec limits were attached,\n", "the measure was Box-Cox transformed, capability was computed, and a normality\n", "assumption check ran. Nothing happened that is not on this list." ] }, { "cell_type": "markdown", "id": "d4553eff", "metadata": {}, "source": [ "## 2. 
Record the digest when you report the number\n", "\n", "When you write the reported value down (into a report, a LIMS, a Certificate of\n", "Analysis), capture the provenance digest next to it:" ] }, { "cell_type": "code", "execution_count": 4, "id": "4dccb92e", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.655871Z", "iopub.status.busy": "2026-06-26T13:09:08.655820Z", "iopub.status.idle": "2026-06-26T13:09:08.657388Z", "shell.execute_reply": "2026-06-26T13:09:08.657120Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07\n" ] } ], "source": [ "digest = cap.provenance_digest()\n", "print(digest)" ] }, { "cell_type": "markdown", "id": "08fe510d", "metadata": {}, "source": [ "That SHA-256 string pins the *computation* that produced the number: the\n", "operations, their parameters (including the fitted Box-Cox lambda), and how many\n", "rows each step touched. The timestamp is deliberately not in the digest, so the\n", "digest is reproducible run to run.\n", "\n", "Store the digest as a sibling field of the reported value, not instead of it.\n", "The digest is a fingerprint, not the data. Keeping it next to the reported Cpk\n", "gives anyone re-deriving the number later something to check against." ] }, { "cell_type": "markdown", "id": "9ad45be0", "metadata": {}, "source": [ "## 3. Export the full result as JSON\n", "\n", "`to_dict()` is the canonical payload. It carries the fields, the flat summary,\n", "the assumption checks, and the lineage plus the digest: everything a downstream\n", "report builder needs, with no `report()` text to parse." 
] }, { "cell_type": "code", "execution_count": 5, "id": "4971ba18", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.658529Z", "iopub.status.busy": "2026-06-26T13:09:08.658448Z", "iopub.status.idle": "2026-06-26T13:09:08.661105Z", "shell.execute_reply": "2026-06-26T13:09:08.660759Z" } }, "outputs": [ { "data": { "text/plain": [ "['result_type',\n", " 'title',\n", " 'summary',\n", " 'fields',\n", " 'assumptions',\n", " 'history',\n", " 'provenance_digest']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = cap.to_dict()\n", "list(d.keys())" ] }, { "cell_type": "markdown", "id": "41d056e2", "metadata": {}, "source": [ "The two provenance keys are `history` (the lineage, each step carrying its\n", "running digest) and `provenance_digest` (the head digest from step 2). The\n", "latter is the same digest you recorded above, stamped into the export by construction:" ] }, { "cell_type": "code", "execution_count": 6, "id": "0cf2bbb6", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.661931Z", "iopub.status.busy": "2026-06-26T13:09:08.661884Z", "iopub.status.idle": "2026-06-26T13:09:08.663558Z", "shell.execute_reply": "2026-06-26T13:09:08.663292Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "provenance_digest: 7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07\n", "matches step 2: True\n", "history step keys: ['operation', 'params', 'n_affected', 'digest']\n" ] } ], "source": [ "print(\"provenance_digest:\", d[\"provenance_digest\"])\n", "print(\"matches step 2: \", d[\"provenance_digest\"] == digest)\n", "print(\"history step keys:\", list(d[\"history\"][0].keys()))" ] }, { "cell_type": "markdown", "id": "2487a838", "metadata": {}, "source": [ "The transform step in `history` shows that the fitted lambda and its confidence\n", "interval are recorded in the provenance, not buried in a log:" ] }, { "cell_type": "code", "execution_count": 7, 
"id": "2d9217a6", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.664463Z", "iopub.status.busy": "2026-06-26T13:09:08.664394Z", "iopub.status.idle": "2026-06-26T13:09:08.666300Z", "shell.execute_reply": "2026-06-26T13:09:08.665872Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"operation\": \"transform\",\n", " \"params\": {\n", " \"method\": \"boxcox\",\n", " \"lambda\": 0.5379697633151592,\n", " \"lambda_ci\": [\n", " -0.14218010515761142,\n", " 1.2255339555066445\n", " ]\n", " },\n", " \"n_affected\": 80,\n", " \"digest\": \"d64b236be9447d2bfa7672f3f56b9e12fda50bf93daad8f5406cfe0948535125\"\n", "}\n" ] } ], "source": [ "transform_step = next(s for s in d[\"history\"] if s[\"operation\"] == \"transform\")\n", "print(json.dumps(transform_step, indent=2))" ] }, { "cell_type": "markdown", "id": "051e0938", "metadata": {}, "source": [ "The assumption checks ride along too. Here is the normality check that justifies\n", "the normal-method capability:" ] }, { "cell_type": "code", "execution_count": 8, "id": "a7df9ed3", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.667244Z", "iopub.status.busy": "2026-06-26T13:09:08.667194Z", "iopub.status.idle": "2026-06-26T13:09:08.669013Z", "shell.execute_reply": "2026-06-26T13:09:08.668661Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"name\": \"normality\",\n", " \"test\": \"Anderson-Darling\",\n", " \"statistic\": 0.3432987956436193,\n", " \"p_value\": 0.4812437156608955,\n", " \"passed\": true,\n", " \"magnitude\": 0.15081585539601827,\n", " \"magnitude_label\": \"est. Cpk impact\",\n", " \"reliability\": \"ok\",\n", " \"n\": 80,\n", " \"recommendation\": null\n", "}\n" ] } ], "source": [ "print(json.dumps(d[\"assumptions\"][0], indent=2))" ] }, { "cell_type": "markdown", "id": "52ddb9fb", "metadata": {}, "source": [ "Write it to a file and you have a self-describing, archivable record. 
The\n", "`provenance_digest` stamped into the file equals the digest you reported, so the\n", "export and the reported number agree by construction." ] }, { "cell_type": "code", "execution_count": 9, "id": "172fbec5", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.669831Z", "iopub.status.busy": "2026-06-26T13:09:08.669782Z", "iopub.status.idle": "2026-06-26T13:09:08.671751Z", "shell.execute_reply": "2026-06-26T13:09:08.671443Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wrote result.json, 3494 bytes\n" ] } ], "source": [ "import pathlib\n", "payload = json.dumps(cap.to_dict(), indent=2)\n", "pathlib.Path(\"result.json\").write_text(payload)\n", "print(\"wrote result.json,\", len(payload), \"bytes\")" ] }, { "cell_type": "markdown", "id": "65bc1297", "metadata": {}, "source": [ "Frontends and report builders should consume `to_dict()` (or the flat\n", "`summary()`), never parse `report()` text. The JSON is the stable contract; the\n", "text report is for humans." ] }, { "cell_type": "markdown", "id": "3147a2bc", "metadata": {}, "source": [ "## 4. Verify later\n", "\n", "Months later, someone reopens the archived result (or recomputes it from the\n", "same inputs) and checks it against the digest you recorded:" ] }, { "cell_type": "code", "execution_count": 10, "id": "3f61db38", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.672678Z", "iopub.status.busy": "2026-06-26T13:09:08.672635Z", "iopub.status.idle": "2026-06-26T13:09:08.674344Z", "shell.execute_reply": "2026-06-26T13:09:08.674037Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cap.verify_provenance(digest)" ] }, { "cell_type": "markdown", "id": "26056159", "metadata": {}, "source": [ "`verify_provenance(expected)` recomputes the digest over the current history and\n", "compares it to the one you pass in. 
True means the recorded computation is\n", "intact." ] }, { "cell_type": "markdown", "id": "583de963", "metadata": {}, "source": [ "### Tamper-evidence, demonstrated honestly\n", "\n", "The chain is tamper-evident: changing the `operation`, `params`, or `n_affected`\n", "of any recorded step changes the head digest, so verification fails.\n", "\n", "The result and its history are frozen, so there is no in-place edit to make. To\n", "show this we construct an *altered copy* with `dataclasses.replace`. We are not\n", "mutating the original `cap`; we build a new object whose recorded transform step\n", "has its fitted lambda bumped by 1.0, then verify that copy against the original\n", "digest." ] }, { "cell_type": "code", "execution_count": 11, "id": "53fa8913", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.675329Z", "iopub.status.busy": "2026-06-26T13:09:08.675254Z", "iopub.status.idle": "2026-06-26T13:09:08.677619Z", "shell.execute_reply": "2026-06-26T13:09:08.677320Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original digest: 7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07\n", "tampered digest: e5e9dc334f030262e8cb9fc43f80e16bb0e78bc2f4978baf64a38193b3feb0db\n", "verify tampered: False\n", "original intact: True\n" ] } ], "source": [ "hist = list(cap.history)\n", "for i, s in enumerate(hist):\n", " if s.operation == \"transform\":\n", " bad = dict(s.params)\n", " bad[\"lambda\"] = bad[\"lambda\"] + 1.0 # alter a recorded parameter\n", " hist[i] = dc.replace(s, params=bad)\n", "\n", "tampered = dc.replace(cap, history=tuple(hist)) # a new, altered copy\n", "\n", "print(\"original digest: \", digest)\n", "print(\"tampered digest: \", tampered.provenance_digest())\n", "print(\"verify tampered: \", tampered.verify_provenance(digest))\n", "print(\"original intact: \", cap.verify_provenance(digest))" ] }, { "cell_type": "markdown", "id": "f034f803", "metadata": {}, "source": [ "One altered 
parameter, in one step, three steps deep, and the head digest moves\n", "and verification returns `False`. The original `cap` is untouched and still\n", "verifies `True`: we built a new object rather than editing it, because the\n", "history is append-only by construction." ] }, { "cell_type": "markdown", "id": "0e84e244", "metadata": {}, "source": [ "## 5. Trace a number back to raw data\n", "\n", "`lineage()` is the audit trail. Each step gives you its `operation`, its\n", "`params`, its `n_affected`, and the running digest folded in up to and including\n", "that step:" ] }, { "cell_type": "code", "execution_count": 12, "id": "19d3e786", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.678475Z", "iopub.status.busy": "2026-06-26T13:09:08.678430Z", "iopub.status.idle": "2026-06-26T13:09:08.680220Z", "shell.execute_reply": "2026-06-26T13:09:08.679900Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "load | n_affected: 80 | digest: a69636b71f0bddc7 ...\n", "spec | n_affected: None | digest: 72c30c2486ecf8c3 ...\n", "transform | n_affected: 80 | digest: d64b236be9447d2b ...\n", "capability | n_affected: 80 | digest: 4aae3812003203cd ...\n", "assumption:normality | n_affected: None | digest: 7cb845af09aa053b ...\n" ] } ], "source": [ "for s in cap.lineage():\n", " print(s[\"operation\"], \"| n_affected:\", s[\"n_affected\"], \"| digest:\", s[\"digest\"][:16], \"...\")" ] }, { "cell_type": "markdown", "id": "ee17e79a", "metadata": {}, "source": [ "Read it bottom-up to walk the reported number back to the raw frame. 
Each step's\n", "`params` records exactly what it did:" ] }, { "cell_type": "code", "execution_count": 13, "id": "7fc02c64", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.680996Z", "iopub.status.busy": "2026-06-26T13:09:08.680934Z", "iopub.status.idle": "2026-06-26T13:09:08.682578Z", "shell.execute_reply": "2026-06-26T13:09:08.682314Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "load\n", " {'measure': 'cycles', 'roles': {}, 'units': None, 'subgroup_size': None, 'spec': {'lower': None, 'upper': None, 'target': None}}\n", "spec\n", " {'lower': 0.5, 'upper': 12.0, 'target': None}\n", "transform\n", " {'method': 'boxcox', 'lambda': 0.5379697633151592, 'lambda_ci': [-0.14218010515761142, 1.2255339555066445]}\n", "capability\n", " {'method': 'normal', 'sigma_used': 'overall', 'cp': 3.343918554734314, 'cpk': 0.7232996917103978, 'pp': 3.343918554734314, 'ppk': 0.7232996917103978, 'cpm': None}\n", "assumption:normality\n", " {'test': 'Anderson-Darling', 'passed': True, 'magnitude': 0.15081585539601827, 'reliability': 'ok', 'p_value': 0.4812437156608955, 'statistic': 0.3432987956436193}\n" ] } ], "source": [ "for s in cap.lineage():\n", " print(s[\"operation\"])\n", " print(\" \", s[\"params\"])" ] }, { "cell_type": "markdown", "id": "5a1c017c", "metadata": {}, "source": [ "So the reported capability was computed after a Box-Cox transform with the fitted\n", "lambda above, against spec limits [0.5, 12.0], on 80 loaded rows, and the\n", "normality check that justifies the normal-method capability is right there in the\n", "chain. No step is hidden, and each step's `digest` lets you confirm where in the\n", "chain a difference first appears.\n", "\n", "The running digest also lets you cross-check intermediate state. 
The `QCData`\n", "after the transform exposes the same provenance surface, and its digest equals\n", "the transform step's running digest in the result's lineage:" ] }, { "cell_type": "code", "execution_count": 14, "id": "b5f114e7", "metadata": { "execution": { "iopub.execute_input": "2026-06-26T13:09:08.683442Z", "iopub.status.busy": "2026-06-26T13:09:08.683401Z", "iopub.status.idle": "2026-06-26T13:09:08.693871Z", "shell.execute_reply": "2026-06-26T13:09:08.693548Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "QCData-after-transform digest: d64b236be9447d2bfa7672f3f56b9e12fda50bf93daad8f5406cfe0948535125\n", "transform step running digest: d64b236be9447d2bfa7672f3f56b9e12fda50bf93daad8f5406cfe0948535125\n", "equal: True\n" ] } ], "source": [ "qct = qc.transform(\"boxcox\")\n", "transform_running = next(s[\"digest\"] for s in cap.lineage() if s[\"operation\"] == \"transform\")\n", "\n", "print(\"QCData-after-transform digest:\", qct.provenance_digest())\n", "print(\"transform step running digest:\", transform_running)\n", "print(\"equal: \", qct.provenance_digest() == transform_running)" ] }, { "cell_type": "markdown", "id": "41c786f4", "metadata": {}, "source": [ "`lineage()`, `provenance_digest()`, and `verify_provenance()` exist on both\n", "`QCData` and every result object. The trail is continuous from the loaded frame\n", "through to the final number." ] }, { "cell_type": "markdown", "id": "ac137f36", "metadata": {}, "source": [ "## What passing and failing verify actually mean\n", "\n", "A passing `verify_provenance()` means the recorded result is intact: the\n", "archived analysis has not been edited since the digest was captured. 
A failing\n", "one means the history no longer matches, so something in the recorded chain\n", "changed.\n", "\n", "What it does not do, on its own: it does not stop an actor who controls the\n", "Python interpreter at runtime from recomputing the whole analysis over\n", "fabricated inputs and stamping a fresh, self-consistent digest. The digest is a\n", "content hash, not a cryptographic signature. It defends against accidental\n", "corruption and post-hoc tampering with a stored result, not against an adversary\n", "who controls the process that produces it.\n", "\n", "Closing that gap requires anchoring the head digest outside the process: signing\n", "it with a key the operator does not hold, or writing it to an append-only\n", "external log. That is out of scope for the core library and left to the\n", "deployment. The full scope statement is in\n", "[Provenance model](/reference/provenance/)." ] }, { "cell_type": "markdown", "id": "1c6b383d", "metadata": {}, "source": [ "## Next\n", "\n", "- [Provenance model](/reference/provenance/): the data model, the hash-chain\n", " algorithm, and the honest scope of the guarantee.\n", "- [Reference](/reference/): the formula, assumptions, and source standard behind\n", " every method, plus the full result surface." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }