# Changelog

All notable changes to the `.cv` format and reference tooling are documented here. The project follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [Unreleased]

### Added

- **`@cvfile/mcp` 0.1.0, an MCP server for the format.** AI agents (Claude, Cursor, any MCP client) get five tools over local .cv files: `list_cvs`, `validate_cv`, `read_cv`, `search_cvs`, and `pack_cv`. Semantic search needs no database or index: every .cv file already carries its own vectors, so the server embeds only the query (hosted HF API when `HF_TOKEN` is set, local transformers otherwise), auto-detecting the corpus's embedding model. Run it with `npx -y @cvfile/mcp`.

## [0.3.2] (2026-07-02, Python only)

### Fixed

- **`cbor2` is a core dependency of `cvfile`** instead of living in the `embed` extra. Reading `embeddings.cbor` is first class format functionality; without `cbor2` the RAG loaders' chunks mode silently yielded documents with no vectors (caught by the new integrations CI job on its first run). The `embed` extra now only adds `httpx` for the network embedding backend. The three integrations move to 0.3.2 with a `cvfile>=0.3.2` floor; the JS and Go packages are unaffected and stay at 0.3.1.

## [0.3.1] (2026-07-02)

Launch hardening sweep across every component, driven by a full end-to-end audit. All SDKs, the CLI, and the three RAG integrations move to 0.3.1.

### Security

- **Decompression bomb protection in `extract()` (JS + Python).** Payload inflation is now capped at 16 MiB by default (`maxPayloadBytes` / `max_payload_bytes`, overridable). The JS SDK streams through `pako.Inflate` and aborts the moment output crosses the cap instead of inflating fully and checking after; `validate()` keeps reporting `payload-too-large` as an issue. Python enforces the cap on the encoded stream size before decoding and on the decoded size after (pypdf has no incremental decode), raising a typed `PayloadTooLargeError`.
- **RAG loaders validate untrusted files by default.** `langchain-cvfile`, `llama-index-readers-cvfile`, and `cvfile-haystack` gain `verify=True`: `validate()` runs before extraction (forbidden constructs, integrity digests, size caps) and raises with the issue codes on failure. `verify=False` restores the old behavior for trusted files.
- **`cv extract` warns on files that fail validation** (one line on stderr, stdout untouched) and gains `--require-valid` to refuse extraction with exit 65.
- README and integration docs now state explicitly that `extract()` skips the security scan and integrity digests: call `validate()` first on untrusted input.

### Fixed

- **cvfile.org now demos the shipped SDK.** The site depended on registry `@cvfile/sdk` `^0.1.0`, so the live `/create/` tool packed `cv:version 0.1` files; it now uses the workspace SDK. The live sample `jane-doe.cv` was a stale 0.1 file with no embeddings; replaced with the current 1.0 fixture (BGE-M3 embeddings, integrity digests).
- **The embed snippet works.** `cdn.cvfile.org` never existed; the snippet now points at `https://cvfile.org/embed/1/cv-embed.js`, backed by a new fully self-contained browser bundle of `` (all deps inlined, pdf.js worker loaded from a same-origin blob URL) published by the docs build.
- **Install instructions match reality.** The Go install path is `github.com/cvfile/cv/sdks/go/cmd/cv` (the advertised `cvfile/cv-go` repo never existed); the unpublished `winget install` command is removed (WinGet planned); the detector's npm name is `@cvfile/cv-detector` (`cvfile-cv-detector` is PyPI only); the README no longer undersells the Go SDK as reader-only (Go `pack` shipped in 0.2.1).
- **CJS consumers get correct types.** The exports maps of `@cvfile/sdk`, `@cvfile/embed`, and `@cvfile/server` now nest `types` per condition (`.d.cts` for `require`), with `typesVersions` fallbacks; arethetypeswrong reports no problems on any entrypoint.
- **Published tarballs now contain the Apache-2.0 LICENSE** (the `files` arrays referenced a file that did not exist in any package directory).
- **`cv version` can no longer lie.** The CLI version moved from a hand-bumped constant to ldflags-backed variables actually injected by GoReleaser; dev builds visibly report `-dev`. The Go generator string now stamps the SDK version (`cv-go/0.3.1`) instead of the spec version.
- Content negotiation is demonstrated live: `https://cvfile.org/samples/jane-doe.cv` serves the Markdown payload to `Accept: text/markdown` and the PDF (as `application/pdf`) otherwise; `/sitemap.xml` and `/llms.txt` resolve; the site's JSON-LD `softwareVersion` is read from the SDK package at build time instead of a stale hardcode.
- Stale pre-1.0 wording removed from validator messages (JS, Python, Go) and from the spec's non-goals, examples, and iframe-sandbox phrasing (spec 1.0 text, no normative changes).

### Added

- **Official test-vector corpus** under `spec/test-vectors/valid/` (minimal, full-with-embeddings, multilingual, integrity-mismatch, oversized-payload, future-major, missing-XMP) with a manifest of expected outcomes, generated by `packages/sdk-js/tools/build-valid.ts`, consumed by the JS, Python, and Go suites, and runnable via `pnpm test:interop`.
- **CI coverage for everything published**: the three RAG integrations and all three `cv-detector` variants now run in CI; every publish workflow runs tests before publishing; CodeQL (JS/TS, Python, Go, Actions) and Dependabot (npm, pip, gomod, actions) are enabled.
- **Real linting**: a root ESLint flat config (typescript-eslint) wired into `pnpm lint` per package; `turbo run lint` previously executed zero tasks.
- Contributor Covenant 2.1 code of conduct, issue templates, and a PR template.
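
The streaming cap described in this release's Security notes (abort the moment output crosses the limit, rather than inflating fully and checking afterwards) can be illustrated with Python's standard `zlib` module. This is a minimal sketch of the technique only, not the SDK's code: the Python SDK is described above as checking sizes before and after decoding because pypdf has no incremental decode. The names `inflate_capped` and the local `PayloadTooLargeError` class are hypothetical stand-ins.

```python
import zlib

MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # 16 MiB, the default cap named in this changelog


class PayloadTooLargeError(Exception):
    """Local stand-in for a typed 'payload too large' error (illustrative only)."""


def inflate_capped(data: bytes, max_bytes: int = MAX_PAYLOAD_BYTES,
                   chunk_size: int = 64 * 1024) -> bytes:
    """Inflate a zlib stream, aborting as soon as output exceeds max_bytes."""
    inflater = zlib.decompressobj()
    out = bytearray()
    buf = data
    while not inflater.eof:
        # Never request more than one byte past the remaining budget,
        # so a bomb cannot allocate much more than the cap.
        budget = max_bytes - len(out) + 1
        piece = inflater.decompress(buf, min(chunk_size, budget))
        out += piece
        if len(out) > max_bytes:
            raise PayloadTooLargeError(f"payload exceeds {max_bytes} bytes")
        # Input that was not consumed because of the output limit.
        buf = inflater.unconsumed_tail
        if not piece and not buf:
            break  # no progress possible: the stream is truncated
    if not inflater.eof:
        raise ValueError("truncated deflate stream")
    return bytes(out)
```

Calling `inflate_capped(zlib.compress(b"\0" * 10_000_000), max_bytes=1024)` raises after roughly one kilobyte of output instead of materialising ten megabytes, which is the property the cap is meant to provide.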

### Changed

- `@cvfile/embed`, `@cvfile/server`, and `@cvfile/viewer-web` are released against `@cvfile/sdk ^0.3.1` (their published 0.2.0 builds pinned sdk 0.2.0, which still stamped `cv:version 0.1`).
- GoReleaser publishes releases directly instead of drafts; Go CI/release workflows build on Go 1.26, matching `go.mod`.

## [0.3.0] (2026-06-08)

Promote the format to spec **1.0**. The SDKs and CLI now emit `cv:version 1.0`, matching the stable spec at `spec/cv-1.0.md` and the cvfile.org documentation. The 0.x and 1.x lines share the same field set, so 0.1 files remain readable without a warning; only a MAJOR of 2+ triggers the spec §8.3 forward-compatibility notice.

### Changed

- **Emitted spec version is now `1.0`** across all three SDKs (`SpecVersion`, `CV_SPEC_VERSION`). `cv version` reports `spec 1.0`; every freshly packed file declares `cv:version 1.0`. Tooling package versions move to 0.3.0, distinct from the format version.
- **Committed fixtures and demo samples regenerated at 1.0** (`jane-doe.cv`, `python-produced.cv`, viewer and middleware samples, unicode integration fixture). veraPDF PDF/A-3u conformance reverified on the regenerated output (148 rules, 1559 checks, 0 failures).

## [0.2.1] (2026-06-07)

Coherent, working, veraPDF-gated pack and validate across all three SDKs. Every producer now emits files that pass veraPDF PDF/A-3u and round-trip byte-identically through every reader (full 3x3 matrix verified).

### Added

- **Go `Pack()` is now implemented and lossless.** It writes a PDF increment (ISO 32000 §7.5.6) instead of a full rewrite, so objects living in the input object streams (the embedded font above all) survive verbatim. The CLI `cv pack` produces a single file that opens in any PDF viewer, passes veraPDF, and is readable by the JS and Python SDKs. Replaces the prior "writer not implemented" stub.
- **Honest in-process PDF/A-3u structural check** in every SDK (`pdfa.ts`, `_pdfa.py`, `pdfa.go`): verifies embedded fonts, sRGB output intent, pdfaid markers, and trailer `/ID`, returning `failed` or `structural-pass`. Replaces the placeholder that returned a strict PASS on veraPDF-failing files. veraPDF remains the authoritative gate.

### Fixed

- **Go filespec names readable by every reader.** `/F` is now a portable ASCII literal and `/UF` a UTF-16BE hex string (ISO 32000 §7.11.2), so the payload name resolves in pdf-lib and elsewhere (previously a UTF-16 literal that read as mojibake, breaking `extractMarkdown` on Go-packed files).
- **JS reader decodes MIME `/Subtype` from byte content**, so a lowercase name hex escape (`text#2fmarkdown`, as pdfcpu emits) resolves the same as uppercase, keeping the reader conformant to the PDF spec rather than to one library's quirk.
- **Python pack adds the sRGB GTS_PDFA1 output intent** (byte-identical profile shared with the Go and JS SDKs), so device-dependent colour in the input PDF stays conformant under veraPDF.
- **veraPDF runner** no longer crashes on the empty-array expansion under bash 3.2.

## [0.2.0] (2026-05-29)

End-to-end audit sweep across every component. All fixes verified by exercising the real code paths (CLI round-trips, live server negotiation, cross-SDK fixtures), not only the test suites.

### Fixed

- **Go `Pack()` no longer silently corrupts.** It previously returned no error while pdfcpu dropped every embedded payload, emitting a file that passed `IsCvFile` but had zero payloads (and panicked on minimal PDFs). The Go writer is deferred to a later release, so `Pack()` now returns an explicit "writer not implemented" error before any mutation.
- **Validator inline-action bypass (JS + Go).** `scanForbiddenConstructs` only walked indirect objects, so an inline `/OpenAction` JavaScript action (or `/AA`, annotation `/A`, AcroForm action) passed validation.
Both now recursively walk the catalog/trailer object graph, matching the Python implementation.
- **Server served HTML to browsers.** A normal browser request (`*/*` or `text/html` with a wildcard) now returns the visual PDF as documented; markdown is served only as an explicit top preference. Also: `q=0` is honored, `Content-Disposition` filenames are sanitized (no header injection), `defaultFormat` is a true fallback rather than a forced format, Hono reaches header parity with the other adapters, and `ETag`/`Last-Modified`/`304` are supported.
- **Chunk offsets are UTF-8 bytes (JS + Python).** The chunkers emitted UTF-16/code-point offsets, disagreeing with the spec (§5.1) and across SDKs on any non-ASCII résumé. Both now emit byte offsets; the RAG integrations slice on bytes.
- **Embedded-file `/Params` now carries the spec-mandated `/CheckSum`** (JS + Python), with valid PDF date zones.
- **Python `validate()` no longer crashes** on a malformed encrypted file (broadened parse handling; encryption detected via the parsed reader). Python `pack()` now emits the `cv:embeddings` XMP summary it previously omitted.

### Changed

- **RAG integrations now load embeddings.** `langchain-cvfile`, `llama-index-readers-cvfile`, and `cvfile-haystack` previously dropped `embeddings.cbor`; they now expose a chunks mode that attaches per-chunk vectors, delegating to a single SDK helper.
- **viewer-web**: portable pdf.js worker (was a Vite-only `?url` import that broke other bundlers), real lazy pdf.js loading, crawler-facing clean text projected into the light DOM, language-aware payload selection, hardened `src` fetch.
- **JS SDK**: portable-filename validation on pack and read, CBOR decode-path validation parity, `/DecodeParms` predictor rejection, `newer-format-version` warning (spec §8.3).
- **docs**: accessible file pickers, lazily loaded and sanitized `marked` on `/create`.
- `cv-detector` (Go/Python/TS, 0.1.1) recognizes the RDF attribute form of `cv:version`.
- Spec §6.3 corrected: `cv:alternates`/`integrity`/`embeddings` are XMP `Text` holding a JSON-encoded array (as all SDKs implement), not `rdf:Bag` of struct.

## [Unreleased]

### Added

- Initial monorepo scaffold (pnpm + Turborepo).
- Spec draft `cv-0.1.md`.
- `@cvfile/sdk` with `pack`, `extract`, `extractMarkdown`, `extractHtml`, `extractEmbeddings`, `extractEmbeddingsParsed`, `inspect`, `validate`, `isCvFile`, `encodeEmbeddings`, `decodeEmbeddings`.
- `@cvfile/viewer-web` with `` Lit web component (PDF/MD/HTML tabs, lazy PDF.js worker) and drag-drop demo.
- `@cvfile/server` with vanilla Node http handler + Express, Fastify, Hono adapters; HTTP `Link` header content negotiation.
- Embeddings CBOR data layer per spec §5; round-trip verified.
- PDF/A-3u conformance support: SDK now emits PDF/A Identification XMP, cv: extension schema declaration, sRGB ICC profile + OutputIntent, trailer `/ID`. veraPDF reports `PASS /3u` for the demo `.cv` when input PDF has embedded fonts.
- veraPDF Docker runner at `tools/verapdf-runner/`.
- **Python SDK `cvfile` 0.1.0** (sdks/python): pack, extract, extract_markdown, extract_html, inspect, validate, is_cv_file. Built on pypdf. 12 tests passing.
- **Go SDK `cv-go` 0.1.0** (sdks/go): reader-path complete (Extract, ExtractMarkdown, ExtractHTML, ExtractEmbeddings, Inspect, Validate, IsCvFile). Pack present but writer integration with pdfcpu's page tree is brittle on minimal inputs and is deferred to v0.2. Built on pdfcpu.
- **`cv` CLI binary** (sdks/go/cmd/cv): `cv inspect / extract / validate` operational across both JS- and Python-produced fixtures.
- **Three-language interop verified**: JS, Python, and Go all read each other's `.cv` files byte-identical; integrity hashes verify across SDKs. Both JS- and Python-produced files pass veraPDF PDF/A-3u.
- **Validator security hardening (spec §3.4 + §7.3)**: each SDK now rejects PDF JavaScript actions (`/JS`, `/JavaScript`, names-tree entries), `/Launch` and `/ImportData` actions, non-`mailto:` `/SubmitForm` targets, `/Encrypt` trailer entries, and external `/Filespec` references. Each construct surfaces a stable, documented error code. Decompressed payloads above the configurable cap (default 16 MiB, spec §7.3) are rejected with `payload-too-large`.
- **Shared malicious-corpus fixtures** under `spec/test-vectors/malicious/`, generated by `packages/sdk-js/tools/build-malicious.ts` from the canonical valid file. JS, Python, and Go security suites all consume the same 7 fixtures + manifest.
- **`@cvfile/embed` 0.1.0** (packages/embed-js): markdown chunker (section / paragraph / document), pluggable embedding backends, `embed()` builds a spec-§5 `EmbeddingsPayload` ready for `pack({ embeddings })`, `searchSemantic()` runs cosine/dot/euclidean search over the chunks. Two backends ship: `createTransformersBackend` (transformers.js, downloads model from HF Hub on first run, runs locally) and `createHuggingFaceBackend` (HF Inference API, requires `HF_TOKEN`). Default model is `BAAI/bge-m3` (1024-dim, multilingual, MIT). Live BGE-M3 round-trip verified end-to-end via the HF Inference API: chunk → embed → CBOR round-trip → cosine search ranks the semantically-correct section first in both directions. 6 tests passing (5 chunker + 1 live BGE-M3 with directional semantic-search assertions).
- **`cv search` CLI subcommand** (sdks/go/cmd/cv): `cv search file.cv "query" [--k 5] [--model BAAI/bge-m3]` reads embeddings.cbor from a `.cv`, embeds the query through HF Inference in the same model space, prints ranked chunks with previews. Built on new Go SDK pieces: `DecodeEmbeddings` (CBOR via fxamacker/cbor), `SearchSemantic` (cosine/dot/euclidean), `NewHuggingFaceClient` (Bearer-token HTTP).
Skip-when-no-token live test verifies directional ranking against a `jane-doe-with-bge-m3.cv` fixture.
- **`cvfile[embed]` Python extra** (sdks/python): full feature parity with `@cvfile/embed`. `cvfile.embed.chunk_markdown` / `embed` / `search_semantic` / `encode_embeddings` / `decode_embeddings` / `HuggingFaceBackend`. Default model BAAI/bge-m3. Cross-language CBOR interop verified: Python decodes the JS-produced `embeddings.cbor` byte-identical (handles cbor-x's RFC 8746 Tag 64 typed-array wrapping). 6 tests passing including live BGE-M3 directional search.
- **`cvfile.server` 0.1.0** (sdks/python): Python HTTP middleware with the same conneg algorithm as `@cvfile/server`. `cvfile.server.asgi.build_cv_asgi_app` (FastAPI/Starlette/native ASGI), `cvfile.server.wsgi.build_cv_wsgi_app` (Flask/Django/plain WSGI), framework-agnostic `serve_cv_bytes` core. Filesystem `root` and async/sync `loader` modes, path-traversal protection, Link/Vary/ETag headers per spec. 28 tests passing.
- **`cv-go/middleware` 0.1.0** (sdks/go/middleware): Go `net/http` handler with the same conneg algorithm as JS and Python. `middleware.Handler(Options{Root|Loader})`, parses Accept/Accept-Language with q-values, advertises Link alternates, sets Content-Disposition, supports HEAD, blocks path traversal. 14 tests passing. This completes the conneg triple across all three languages.
- **AI tooling integrations** (sdks/python/src/cvfile/integrations): `CvFileLoader` (LangChain) and `CvFileReader` (LlamaIndex). `mode="document"` yields one Document per file; `mode="chunks"` emits one Document per embedded section with the pre-computed BGE-M3 vector attached, ready for direct ingest into a vector DB without re-embedding.
- **Viewer polish** (packages/viewer-web): full ARIA tablist with keyboard nav (Arrow/Home/End), `theme="auto|light|dark"` with `prefers-color-scheme` honour, animated loading skeleton, error state with Retry button + ``, mobile layout under 480 px, `prefers-reduced-motion` support, `::part()` exposure for `toolbar`, `tab`, `tab-active`, `stage`, `pdf-canvas`, `md`, `html`, `pager`, `meta`, `skeleton`.
- **Spec v1.0 frozen** (spec/cv-1.0.md): promoted from pre-stable to stable, no normative changes from 0.1. IANA registration template ready to email at `spec/iana-registration-application-vnd-cv+pdf.txt`.
- **Release tooling** (sdks/go/.goreleaser.yml + tools/release-binaries/): GoReleaser config for 6-target cross-compile + SBOMs + checksums + Homebrew tap formula + Scoop manifest + .deb/.rpm/.apk with file-association payloads baked in. WinGet manifest templates + per-release runbook in tools/release-binaries/README.md.
- **File-association payloads** (tools/installer-payloads/): macOS `Info.plist.snippet` declaring UTI `org.cvfile.cv` conforming to `com.adobe.pdf`; Windows `.reg` template registering `.cv → CVFile.Document`; Linux `cvfile.desktop` + shared-mime-info XML for `application/vnd.cv+pdf`.
- **`cvfile.org` docs site** (docs/): Astro 5 single-page site with `/`, `/spec/`, `/install/`, `/view/` (live `` drag-drop demo), `/ecosystem/`. Builds 5 static pages, dark/light theme, ready to deploy to any static host.

### Notes

- For `cv-strict` conformance, the input PDF must have all fonts embedded. Standard PDF builtin fonts (Helvetica, Times, Courier) are referenced by name, not embedded, and will fail PDF/A-3u rule 6.2.11.4.1. Most modern PDF generators (LaTeX, browser print-to-PDF, Word export) embed fonts by default.
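
The font note above can be pre-checked roughly before packing. The sketch below is an assumption-laden illustration, not the SDKs' structural check: it takes font dictionaries that have already been read from a PDF and resolved into plain Python dicts (how you obtain them depends on your PDF library), and it only looks for an embedded font program. The function names `is_embedded` and `unembedded_fonts` are hypothetical, and veraPDF remains the authoritative gate.

```python
# Keys under which a font descriptor carries an embedded font program.
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font: dict) -> bool:
    """Return True if a resolved PDF font dictionary carries its font program."""
    # Type 3 fonts define their glyphs inline, so there is no font file to embed.
    if font.get("/Subtype") == "/Type3":
        return True
    # Composite (Type 0) fonts keep the descriptor on their descendant fonts.
    descendants = font.get("/DescendantFonts")
    if descendants:
        return all(is_embedded(d) for d in descendants)
    descriptor = font.get("/FontDescriptor") or {}
    return any(key in descriptor for key in FONT_FILE_KEYS)


def unembedded_fonts(fonts: list) -> list:
    """List the /BaseFont names of fonts that are referenced by name only."""
    return [f.get("/BaseFont", "<unnamed>") for f in fonts if not is_embedded(f)]


# A builtin font referenced by name, as described in the note above.
helvetica = {"/Type": "/Font", "/Subtype": "/Type1", "/BaseFont": "/Helvetica"}
# A TrueType font whose descriptor carries an embedded font program.
embedded = {
    "/Type": "/Font",
    "/Subtype": "/TrueType",
    "/BaseFont": "/ABCDEF+Inter",
    "/FontDescriptor": {"/FontFile2": b"..."},
}
print(unembedded_fonts([helvetica, embedded]))
```

A non-empty result means at least one font is referenced by name only, which is the situation the note says will fail `cv-strict` conformance.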