---
title: "Memory Pipeline"
description: "LLM-based memory extraction and processing pipeline."
order: 4
section: "Core Concepts"
---

Memory Pipeline v2
==================

Overview and Philosophy
---

Pipeline v2 exists because the original [[memory]] system was purely reactive:
callers wrote whatever they wanted, the database accepted it, and recall
quality depended entirely on how well the caller chose what to store. That
model worked for bootstrapping but doesn't scale — memories accumulate
noise, contradict each other, and fragment across overlapping phrasings of
the same fact.

The pipeline introduces a background extraction layer. When a memory
arrives, it is persisted immediately (raw-first safety), and a job is
enqueued to analyze it asynchronously. The job runs extraction and
decision passes using a local LLM, then optionally writes derived facts
back into the memory store. This means the caller's raw content is never
lost — it is always durably committed before any LLM call runs — and
derived facts are layered on top rather than replacing the original.

This is substrate work. The pipeline's job is to turn raw interaction
data into cleaner, more structured material the rest of the system can
use for retrieval, repair, and eventually learned context selection.

The central constraint governing every design decision here is: **no LLM
calls inside write-locked transactions.** SQLite write locks are exclusive,
and a blocking HTTP call to Ollama inside one would stall the entire [[daemon]].
The pipeline enforces a strict two-phase discipline: fetch and embed outside
the lock, then commit atomically inside `withWriteTx`. Any violation of this
rule introduces unbounded latency into every other writer.


Pipeline Modes
---

Three operational modes are composed from five boolean flags.

**Shadow mode** is active when `enabled` is true but `shadowMode` is also
true, or when `mutationsFrozen` is true. In this mode the pipeline runs the
full extraction and decision sequence, records all proposals to
`memory_history` for audit, but makes no writes to the memories table.
Shadow mode is useful for validating extraction quality without affecting
production data.

**Controlled-write mode** is active when `enabled` is true, `shadowMode` is
false, and `mutationsFrozen` is false. In this mode, ADD and NONE decisions
are applied. ADD creates new memory rows and embeddings; NONE is recorded
for audit only. UPDATE and DELETE proposals are blocked unless
`autonomous.allowUpdateDelete` is true.

**Full mode** is controlled-write mode with `allowUpdateDelete` set to true.
In this mode UPDATE proposals modify the referenced memory through the mutation
API path, and DELETE proposals soft-delete the referenced memory through the
forget path. The previous target state is archived to the cold tier first, and
pinned memories are skipped rather than deleted.

The five config flags in detail:

- `enabled` — Master switch. When false, no extraction jobs are processed.
- `shadowMode` — Run extraction and decisions without writing any facts.
- `allowUpdateDelete` — Permit UPDATE/DELETE decisions to mutate existing
  memories through guarded modify/forget paths.
- `mutationsFrozen` — Emergency brake. Disables all writes even if
  `shadowMode` is false.
- `autonomous.frozen` — Disables the maintenance worker's scheduled interval
  even if `autonomous.enabled` is true.


Extraction Stage
---

Extraction is the first LLM pass. Its job is to decompose a raw memory
string into a list of discrete, reusable facts and a list of entity
relationship triples.

The extraction prompt instructs the model to return a JSON object with two
arrays. Each fact carries a `content` string, a `type` discriminant
(`fact`, `preference`, `decision`, `procedural`, or `semantic`), and a
floating-point `confidence` in [0, 1]. Each entity triple carries `source`,
`relationship`, `target`, and `confidence`. The prompt includes worked
examples and explicitly tells the model to skip ephemeral details and return
only the JSON object — no surrounding text.

The model's output is post-processed before validation. `<think>` blocks
emitted by chain-of-thought models like qwen3 are stripped first. Then
Markdown code fences are removed if present. The resulting string is
parsed as JSON.

Validation is strict and partial-failure safe. Facts are capped at 20 per
input. Any fact shorter than 10 characters is rejected. Any fact longer
than 2000 characters is truncated. An unknown type string is coerced to
`fact` with a warning recorded. Entities are capped at 50 per input; each
must have non-empty `source` and `target` strings and a non-empty
`relationship`. Input longer than 12,000 characters is truncated before the
prompt is built.

Validation failures produce warnings that are accumulated in the
`ExtractionResult` and surfaced in the job's result payload. They never
throw — partial results are always returned.


Decision Stage
---

The decision stage evaluates each extracted fact independently against the
existing memory store. For each fact, the engine retrieves the top-5
candidate memories via hybrid search, then asks the LLM which of four
actions to take: ADD, UPDATE, DELETE, or NONE.

This stage is intentionally conservative. It is better understood as a
proposal and curation layer than as autonomous semantic rewriting. Its
output improves memory quality and auditability; it does not eliminate
the need for downstream relevance learning.

Candidate retrieval uses the same BM25 + vector hybrid search that powers
recall. The BM25 leg queries `memories_fts` with the fact's content as the
full-text query; scores are normalized to [0, 1] via `1 / (1 + |score|)`.
The vector leg embeds the fact content and calls `vectorSearch` against the
embeddings table. Results from both legs are merged by ID, then combined
with a weighted sum: `alpha × vector + (1 - alpha) × bm25` when both legs
returned a score, or the single available score otherwise. Candidates below
`min_score` are dropped. The top 5 are fetched from the memories table.

When no candidates are found, the engine immediately proposes ADD without an
LLM call, using the fact's own confidence as the proposal confidence and a
fixed reason string.

When candidates exist, the decision prompt presents the fact and a numbered
list of candidates with their IDs, types, and content. The model is asked
to return a JSON object with `action`, `targetId` (required for UPDATE and
DELETE), `confidence`, and `reason`. The response is parsed with the same
`<think>`-strip and fence-removal logic as extraction.

Validation on the decision output ensures that UPDATE and DELETE decisions
reference an ID that actually appears in the candidate set. Proposals with
missing or hallucinated IDs are dropped with a warning. An empty `reason`
string is also rejected.

The function is named `runShadowDecisions` regardless of mode — "shadow"
here means the function itself makes no writes. Whether the proposals are
applied or merely recorded is a concern of the worker that calls this
function.


Controlled Writes
---

When controlled-write mode is active, the worker applies ADD decisions
inside a single `withWriteTx` call after all LLM and embedding work has
completed. The write path is implemented in `applyPhaseCWrites`.

Before entering the transaction, the worker pre-fetches embeddings for all
ADD proposals in parallel. Each fact content is passed through
`normalizeAndHashContent` to compute a `contentHash`, and the storage
content (original casing) and hash are used as the key for caching the
vector. The embedding fetch is intentionally outside the transaction lock.

Inside the transaction, each ADD proposal passes through a sequence of
safety gates. First, the fact's confidence is compared to
`minFactConfidenceForWrite` (default 0.7); facts below this threshold are
skipped with reason `low_fact_confidence`. Second, the normalized content
is checked for zero length; empty facts are skipped with reason
`empty_fact_content`. Third, the `content_hash` is checked against the
memories table to detect exact duplicates — both at the pre-insert check
and defensively on UNIQUE constraint collision. Duplicates are recorded with
the existing memory's ID and counted as `deduped`.

For facts that clear all gates, `txIngestEnvelope` creates the memory row
in a single insert, with `who` set to `pipeline-v2`, `why` to
`extracted-fact`, and the pipeline's extraction model name in
`extractionModel`. If a pre-fetched embedding vector is available for this
content hash, it is upserted into the embeddings table in the same
transaction.

Audit records are written for every proposal in every outcome: ADD
(created), ADD (deduped), ADD (skipped), NONE (recorded), and destructive
(blocked). Each record lands in `memory_history` with enough metadata to
reconstruct the decision context: proposal action, fact content, confidence,
the source memory ID, the extraction model, and fact and entity counts.

The contradiction detector runs on UPDATE and DELETE proposals before they
are blocked. It tokenizes both the fact content and the target memory's
content, checks for lexical overlap of at least two tokens, and then looks
for either a negation-polarity difference (one has a negation token, the
other doesn't) or an antonym pair conflict (enabled/disabled, allow/deny,
etc.). Proposals that trigger the detector are flagged `reviewNeeded: true`
in their audit record.


Content Normalization
---

All content passes through `normalizeAndHashContent` before storage or
hashing. The function is deterministic and produces three derived values.

`storageContent` is the text after trimming and whitespace collapsing
(`/\s+/g → " "`). This is what gets written to the database. Original
casing is preserved.

`normalizedContent` takes `storageContent`, lowercases it, and strips
trailing punctuation (`[.,!?;:]+$`). This is used for FTS indexing and as
the hash basis when non-empty.

`contentHash` is a SHA-256 digest of the hash basis (normalized content if
non-empty, otherwise lowercased storage content). This 64-character hex
string is the deduplication key. Upserts on the embeddings table use it as
the unique key, and memory inserts check it to avoid exact-content
duplicates.


Inline Entity Linker
---

Before any async pipeline job runs, the inline entity linker
(`platform/daemon/src/inline-entity-linker.ts`) performs a fast,
synchronous mention-linking pass at memory write time. This is a
mechanical helper, not a semantic author.

The linker runs without an LLM call. It scans the memory's content text
for candidate proper nouns and links only entities that already exist
for the same `agent_id`. It writes `memory_entity_mentions` rows so a
new memory can be discovered from known entity pages immediately, but it
does not create entities, aspects, attributes, or dependencies.

Structured graph writes come from `POST /api/memory/remember` with a
`structured` payload, explicit user/agent actions, or reviewed
normalization passes. This keeps the default background path cheap,
predictable, and hard to poison: incidental capitalization can attach a
memory to an existing known entity, but it cannot invent graph structure.

Because the linker runs inside the write transaction, it must stay fast
and deterministic. There are no network calls, no LLM inference, and no
blocking I/O, only candidate matching and SQLite writes against existing
graph rows.


Structural Classification
---

When explicitly enabled, after extraction writes facts to the database, the
structural classification worker (`structural-classify.ts`) runs a second LLM
pass to assign each extracted fact to its entity's aspect hierarchy. Jobs are
enqueued as `structural_classify` entries in `memory_jobs` and processed by a
separate polling worker that batches by `entity_id`, all facts for the same
entity in one LLM call.

The prompt presents the entity name, type, existing aspects, and suggested
aspect names (from `ASPECT_SUGGESTIONS` keyed by entity type). The LLM
returns a JSON array of `{i, aspect, kind, new}` objects. Each fact is
assigned to a named aspect and classified as either `attribute` or
`constraint`. Aspects are upserted into `entity_aspects` on
`(entity_id, canonical_name)` conflict. The `entity_attributes` row written
during extraction has its `aspect_id` and `kind` filled in.

When an entity's type was not determinable during extraction (stored as
`"extracted"`), the classify prompt also asks the LLM to infer the type.
If a valid canonical type is returned (`person`, `project`, `system`,
`tool`, `concept`, `skill`, `task`, or `unknown`), the `entities` row is
updated in the same transaction.

The worker configuration lives under `structural` in the pipeline config:
`enabled` (default `false`), `pollIntervalMs` (how often to check for pending
jobs), and `classifyBatchSize` (max facts per entity per LLM call). The default
pipeline does not use a background LLM to author graph structure; structured
remember is the normal semantic write path.

For details on the knowledge graph persistence stage, see
[KNOWLEDGE-GRAPH.md](./KNOWLEDGE-GRAPH.md).


Knowledge Graph
---

When `graph.enabled` is true, graph reads, traversal, and recall boosting are
available. Background extraction only persists extracted entity triples when
`graph.extractionWritesEnabled` is also true. That write gate defaults to
`true` so new installs populate the graph from extraction. Set it to `false`
to keep graph navigation on without letting the async extractor author semantic
graph structure.

The daemon logs a startup warning when graph reads are enabled while extraction
writes are disabled. `/api/diagnostics` also reports
`graph.extractionWritesEnabled` and degrades graph health once enough active
memories exist but the graph still has no entities.

If extraction graph writes are explicitly enabled, they happen in a
**separate** transaction immediately after the main write transaction commits.
Graph persistence failure is non-fatal: it logs a warning but never reverts the
fact extraction results.

Entities are stored in the `entities` table with `name` (original casing),
`canonical_name` (lowercase, whitespace-normalized), `entity_type`, and
`mentions` (an integer count). New entities are inserted; existing entities
(matched by `canonical_name`) have their `mentions` counter incremented.
UNIQUE constraint collisions on the `name` column are handled gracefully by
falling back to the existing row and incrementing mentions there.

Relations are stored in the `relations` table linking two entity rows by
`source_entity_id`, `target_entity_id`, and `relation_type`. The `strength`
field is fixed at 1.0 for all pipeline-extracted relations. When a relation
already exists (same source, target, and type), `mentions` is incremented
and `confidence` is updated via a running average:
`(old_avg × n + new_confidence) / (n + 1)`.

Every source and target entity is linked back to the originating memory row
via `memory_entity_mentions`. The link stores `mention_text` (the raw
string before canonicalization) and `confidence`. Inserts use
`INSERT OR IGNORE` so re-processing the same memory is idempotent.


Aspect Feedback
---

After recall, `aspect-feedback.ts` feeds behavioral signals back to the
knowledge graph by measuring FTS overlap between retrieved content and
entity aspects. The function `applyFtsOverlapFeedback` is called at
session end with the session key and agent ID.

The feedback loop operates as follows. Memories that received at least one
FTS hit during the session (tracked in `session_memories.fts_hit_count`) are
looked up. For each confirmed memory, the `entity_attributes` table is
queried to find its parent `aspect_id`. Confirmation counts are summed per
aspect, and each aspect's `weight` column is incremented by
`delta × confirmations`, clamped to `[minWeight, maxWeight]`. This updates
which aspects were structurally "correct" for the session — aspects whose
memories were actively searched for gain weight, aspects whose memories were
ignored do not.

A separate `decayAspectWeights` function handles time-based decay. Aspects
that have not been updated in more than `staleDays` days have their weight
reduced by `decayRate`, floored at `minWeight`. Session decay is governed
by a counter so it runs every N sessions rather than on every call.

Telemetry is accumulated in an in-process snapshot (`getFeedbackTelemetry`)
and exposed on the pipeline status endpoint: `feedbackAspectsUpdated`,
`feedbackFtsConfirmations`, `feedbackDecayedAspects`, and
`feedbackPropagatedAttributes`.


Graph-Augmented Search
---

At query time, when `graph.enabled` is true and the caller requests a graph
boost, `getGraphBoostIds` is called synchronously against the read database.
The function returns a set of memory IDs that should receive a score boost
in the final recall ranking.

The lookup proceeds in three steps. First, query tokens (2+ character
alphanumeric runs, lowercased) are matched against `canonical_name LIKE ?`
for each token, with results ordered by `mentions` descending and capped at
20 entity hits. Second, the matched entity IDs are expanded one hop through
the `relations` table in both directions (source and target), collecting up
to 50 additional neighbor entity IDs. Third, the expanded entity ID set is
joined through `memory_entity_mentions` to collect up to 200 distinct
non-deleted memory IDs.

The entire function is deadline-bounded. A `Date.now()` cutoff is checked
after each step; if the deadline is exceeded, the function returns whatever
it has accumulated so far with `timedOut: true`. On any exception, it
returns an empty result. There is no degradation in recall correctness —
graph boosting is always additive.

The boost weight (default 0.15) is applied by the search layer on top of
the hybrid BM25 + vector score. IDs in the graph-linked set receive a score
increment of `graphBoostWeight`.


Worker Model
---

The extraction pipeline runs as a polling worker loop. A single
`startWorker` call starts a `setTimeout`-chain tick loop that leases one
job per tick from the `memory_jobs` table, processes it, and reschedules
itself. The use of `setTimeout` chains rather than `setInterval` allows
dynamic delay adjustment via exponential backoff on failure.

Job leasing is atomic. The tick calls `accessor.withWriteTx` to both select
and update the job row in one transaction: `SELECT ... LIMIT 1` on pending
extract jobs ordered by `created_at`, immediately followed by an `UPDATE`
setting `status = 'leased'`, `leased_at`, and incrementing `attempts`. This
ensures no two workers can lease the same job even if multiple processes
were running.

On failure, a job's `attempts` counter is already incremented (happens
during lease). If `attempts >= max_attempts` (default 3), the job is
moved to status `dead`; otherwise it returns to `pending` for retry on the
next tick. A dead job stays in the table for audit and cleanup purposes.

Job deduplication is enforced at enqueue time: `enqueueExtractionJob` checks
for any existing job for the same `memory_id` with status `pending` or
`leased` before inserting a new one.

A stale lease reaper runs on a fixed 60-second `setInterval`. Any job with
`status = 'leased'` and `leased_at` older than `leaseTimeoutMs` (default
300,000 ms / 5 minutes) is reset to `pending`. This handles worker crashes
that leave jobs leased indefinitely.

Backoff state tracks consecutive failures. On zero failures, the tick
interval is `workerPollMs` (default 2,000 ms). Each failure doubles the
delay (starting from 1,000 ms base) up to a 30,000 ms cap, with up to
500 ms of random jitter added.


Document Ingest
---

The document worker processes `document_ingest` jobs from the same
`memory_jobs` table. It runs as a fixed-interval polling loop separate from
the extraction worker, defaulting to 10,000 ms between ticks.

A document ingest job carries a `document_id` rather than a `memory_id`.
The referenced row in the `documents` table carries the source content and
type. Two source types are supported: `url` (content fetched via HTTP) and
anything else (content read from `raw_content`). URL fetch is bounded by
`documentMaxContentBytes` (default 10 MB). The URL fetcher accepts responses
with content types `text/html`, `text/*`, `application/json`, and
`application/xml`. For HTML, it extracts the page title and strips `<script>`
and `<style>` tags before passing text to the chunker. Non-matching content
types are rejected. The HTTP request timeout is 30 seconds, independent of
the byte limit. If the HTTP response provides a page title and the document
row has none, it is backfilled.

Processing advances through explicit status transitions recorded in the
`documents` table: `extracting` → `chunking` → `embedding` → `indexing`
→ `done`. These transitions serve as progress indicators visible via the
API.

Chunking splits the extracted content into overlapping windows.
`documentChunkSize` (default 2,000 chars) sets the window size;
`documentChunkOverlap` (default 200 chars) sets how many characters each
window shares with the previous one. A document shorter than one chunk is
not split.

Each chunk is independently embedded (outside any transaction), normalized
and hashed, deduplication-checked against existing memories already linked
to this document via `document_memories`, and then written as a memory row
in its own transaction. Embedding calls and write transactions alternate for
each chunk rather than batching. The chunk memory row has `type =
'document_chunk'`, `importance = 0.3`, and is tagged with the document
title if available.

The chunk-to-document relationship is recorded in `document_memories` with
the chunk index. This table allows the document's chunks to be enumerated
or deleted as a unit.

The document worker uses the same `workerMaxRetries` limit as the
extraction worker. On exhaustion, the document row status is set to
`failed` with the error string recorded.


Retention Worker
---

The retention worker purges expired data on a periodic schedule (default
6-hour interval). It runs independently of the extraction pipeline and is
started whenever the pipeline is active or as a standalone service for
users who don't run the full extraction pipeline.

Purges follow a strict ordering to maintain referential safety:

1. **Graph links** — `memory_entity_mentions` rows for memories that are
   soft-deleted and past the tombstone retention window are deleted. Entity
   mention counts are decremented; entities that reach zero mentions are
   orphaned and deleted along with their dangling relation rows.

2. **Embeddings** — Embedding rows for the same expired memories are
   deleted.

3. **Tombstones** — The memory rows themselves are hard-deleted. The
   SQLite `memories_ad` trigger handles FTS cleanup automatically.

4. **History** — `memory_history` rows older than the history retention
   window are purged.

5. **Completed jobs** — `memory_jobs` rows with `status = 'completed'`
   and `completed_at` older than the completed job retention window are
   deleted.

6. **Dead jobs** — `memory_jobs` rows with `status = 'dead'` and
   `failed_at` older than the dead job retention window are deleted.

Each step runs in its own short `withWriteTx` to avoid holding a write
lock across the full sweep. Each step is also batch-limited to 500 rows
per sweep to bound write latency. If more rows than the batch limit exist,
they will be caught in subsequent sweeps.

Default retention windows: tombstones 30 days, history 180 days, completed
jobs 14 days, dead jobs 30 days.


Maintenance Worker
---

The maintenance worker performs autonomous diagnostics and, optionally,
self-repair. It is governed by `autonomous.enabled` and `autonomous.frozen`.
If `autonomous.enabled` is false or `autonomous.frozen` is true, the interval
never starts, though the worker's `tick()` method remains callable for
on-demand inspection.

Each maintenance cycle runs three phases. First, `getDiagnostics` produces
a `DiagnosticsReport` that captures queue health (dead rate, stale lease
count), index health (FTS row count vs active memory count), storage health
(tombstone ratio and SQLite page size), and graph health. A composite score in
[0, 1] summarizes overall health, and when graph is enabled the composite
status propagates graph degradation when the graph has flatlined across many
active memories.

Second, `buildRecommendations` translates the report into a list of repair
actions:
- `requeueDeadJobs` when the dead job rate exceeds 1%.
- `releaseStaleLeases` when stale leases are detected.
- `checkFtsConsistency` when the FTS row count does not match active
  memories.
- `triggerRetentionSweep` when tombstones exceed 30% of total memories.

Third, if `maintenanceMode` is `observe`, the recommendations are logged and
the cycle returns. If `maintenanceMode` is `execute`, each recommendation
is executed through the corresponding repair action, subject to rate
limiting (cooldown and hourly budget per action type). After all repairs
run, diagnostics are re-evaluated and the health score delta is recorded.

The halt tracker prevents the maintenance worker from spinning on ineffective
repairs. Each repair action tracks consecutive non-improving runs. After 3
consecutive runs that do not improve the health score, the action is halted
for the lifetime of the worker. The tracker resets when a cycle produces no
recommendations (i.e., health is good).


Provider Abstraction
---

All LLM calls go through an `LlmProvider` interface with two methods:
`generate(prompt, opts?)` returning a `Promise<string>`, and `available()`
returning a `Promise<boolean>`.

Two implementations are shipped:

**LlamaCppProvider** calls the llama.cpp server via its OpenAI-compatible
`POST /v1/chat/completions` endpoint. The default base URL is
`http://localhost:8080` and the default model is `qwen3:4b`. No
authentication is required. The `available` check uses a 3-second timeout
against `GET /v1/models`.

**OllamaProvider** calls the Ollama HTTP API at `POST /api/generate` with
`stream: false`. The default base URL is `http://localhost:11434` and the
default model is `qwen3:4b` (deprecated — see below). `nemotron-3-nano:4b` is the
preferred local Ollama model going forward; Nemotron's superior reasoning produces
better extraction results and `qwen3:4b` will be removed in a future update. Each `generate` call sets an `AbortController`
timeout (default 45,000 ms) and throws a descriptive error on abort. HTTP
errors surface the status code and the first 200 characters of the response
body. The `available` check uses a 3-second timeout against `GET /api/tags`.
For live prompt harness commands, see
`platform/daemon/src/pipeline/README.md`.

**ClaudeCodeProvider** invokes the Claude Code CLI as a subprocess:
`claude -p <prompt> --model <model> --no-session-persistence --output-format text`.
The default model is `haiku`. Timeout is 60,000 ms. This provider is
available as a fallback when no local LLM server is running but the
Claude Code CLI is present on PATH.

The interface is intentionally minimal — no streaming, no chat history, no
tool use. Future providers can be added by implementing `LlmProvider` and
passing the instance to `startWorker`.


Predictor Schema Placeholders
---

The schema still carries predictor-oriented columns and historical comparison
tables, including nullable `session_memories.predictor_score`,
`session_memories.predictor_rank`, and `predictor_comparisons`. These fields
are retained so future scorer work can attach training and comparison data
without another migration churn.

The current daemon does not ship or start a predictive scorer sidecar.
`session-start` assembles candidates with hybrid search, graph traversal, and
baseline score ordering; predictor score and rank slots remain `null` unless a
future scorer path writes them. Dashboard predictor helpers currently return
empty slices, and entity health reads `predictor_comparisons` only when rows
exist.


Optional Reranking
---

After baseline hybrid search returns a scored candidate list, a reranking pass
can reorder the top-N entries. Reranking is enabled by default, but the default
provider is the pass-through `noopReranker` unless a concrete reranker path is
selected.

The `rerank` function accepts a query string, a mutable candidate list, a
`RerankProvider` callback, and a `RerankConfig`. It slices the list at
`topN` (default 20), passes the head to the provider, and appends the
untouched tail to the result. If the provider call exceeds `timeoutMs`
(default 2,000 ms) or throws, the original ordering is returned unchanged
via a `Promise.race` against a timeout promise. There is no secondary
attempt.

The `noopReranker` pass-through is provided for testing. Custom providers
implement the `RerankProvider` signature
`(query, candidates, cfg) => Promise<RerankCandidate[]>` and can call any
scoring backend.

Set `reranker.useExtractionModel: true` to run reranking through the
active extraction provider/model instead of the embedding reranker.
When enabled, recall also prepends a short synthesized summary card
grounded in the top recalled memories.

### Embedding-Based Reranker

An embedding-based reranker implementation is provided in
`reranker-embedding.ts`. It re-scores candidates using full-content cosine
similarity against the query embedding vector. Cached embeddings from the
database are used when available, avoiding extra provider calls in most
cases.

The factory function `createEmbeddingReranker` takes a `DbAccessor` and
a pre-computed `queryVector` (Float32Array) and returns a `RerankProvider`.
For each candidate with a cached embedding, the score is blended:
`0.7 × original_score + 0.3 × cosine_similarity`. Candidates without a
cached embedding keep their original score. Results are sorted by blended
score descending. This reranker is fast (no LLM call), deterministic, and
catches cases where BM25 candidates were not vector-compared at all.


Semantic Contradiction Detection
---

The pipeline includes two layers of contradiction detection for UPDATE and
DELETE proposals.

**Syntactic detection** (in `worker.ts`) is the fast path. It tokenizes
both the fact content and the target memory's content, checks for lexical
overlap of at least two tokens, then looks for either a negation-polarity
difference (one has a negation token, the other doesn't) or an antonym
pair conflict (enabled/disabled, allow/deny, etc.).

**Semantic detection** (in `contradiction.ts`) is the slow path. It uses
an LLM to catch semantic contradictions like "uses PostgreSQL" vs
"migrated to MongoDB". It is only called for update proposals with lexical
overlap >= 3 tokens where syntactic detection returned false. The LLM is
prompted to return a JSON object with `contradicts` (boolean), `confidence`
(0–1), and `reasoning` (string).

Semantic contradiction detection is gated by `semanticContradictionEnabled`
(default `true`). When enabled, the LLM call uses a configurable timeout
controlled by `semanticContradictionTimeoutMs` (default 120 seconds, range
5s-300s). On timeout or parse failure, the result defaults to "no
contradiction" — the check is advisory and never blocks a proposal.

These same detection primitives are reused by the retroactive supersession
system (`supersession.ts`), which applies contradiction detection to sibling
attributes on the same entity/aspect rather than to UPDATE/DELETE proposals.
See the [retroactive supersession spec](./specs/planning/retroactive-supersession.md)
and [KNOWLEDGE-GRAPH.md](./KNOWLEDGE-GRAPH.md#retroactive-supersession) for
details.

```yaml
memory:
  pipelineV2:
    semanticContradictionEnabled: true
    semanticContradictionTimeoutMs: 120000  # ms, range 5000-300000
```


URL Fetcher
---

The document ingest pipeline fetches web content through `url-fetcher.ts`.
The fetcher provides timeout and size guards, and strips HTML to plain text
for downstream chunking and embedding.

`fetchUrlContent(url, opts?)` accepts a URL and optional `FetchOptions`
(`timeoutMs` default 30,000 ms, `maxBytes` default 10 MB). It performs
a pre-flight size check from the `Content-Length` header, then stream-reads
the response body with a running byte counter. If total bytes exceed
`maxBytes` during streaming, the fetch is aborted.

Supported content types: `text/html`, `text/*`, `application/json`,
`application/xml`. Binary and unsupported types are rejected with an
error. For HTML responses, `<script>` and `<style>` blocks are stripped
entirely, remaining tags are removed, common HTML entities are decoded,
and the page title is extracted from the first `<title>` tag. The result
includes `content`, `contentType`, optional `title`, and `byteLength`.


Embedding Tracker
---

The embedding tracker (`platform/daemon/src/embedding-tracker.ts`) is a
background polling loop that detects stale or missing embeddings and
refreshes them in small batches. It is separate from the extraction
pipeline and runs alongside it.

Each cycle:

1. **Provider health check** — calls the embedding provider's health
   endpoint (uses existing 30-second cache). If the provider is
   unavailable, the cycle is skipped and `skippedCycles` is incremented.

2. **Stale detection query** — a read-only query finds memories where:
   - No embedding row exists (missing)
   - The embedding's `content_hash` differs from the memory's (stale)
   - The memory's `embedding_model` differs from the configured model
     (model switch)
   Results are ordered by `updated_at DESC` and capped at `batchSize`.

3. **Sequential embedding fetch** — each stale row's content is embedded
   one at a time, outside any transaction. Failed fetches increment the
   `failed` counter without aborting the cycle.

4. **Batch write** — all successful embeddings are upserted in a single
   `withWriteTx` call. For each result: stale embeddings are deleted by
   source (except the new hash), the new embedding row is upserted on
   `content_hash` conflict, the `vec_embeddings` virtual table is synced,
   and `embedding_model` is updated on the memory row.

At daemon startup, the `vec_embeddings` virtual table is checked against the
configured embedding dimensions. If the table was created with stale
`FLOAT[N]` dimensions, the daemon logs schema drift, recreates the virtual
table with the configured size, and backfills stored embeddings that match that
dimension.

The tracker uses `setTimeout` chains for natural backpressure. It
exposes a `getStats()` method returning `{ running, processed, failed,
skippedCycles, lastCycleAt, queueDepth }`.

Configuration lives under `embeddingTracker` in the pipeline config:

| Field | Default | Range | Description |
|-------|---------|-------|-------------|
| `enabled` | `true` | — | Master switch |
| `pollMs` | `5000` | 1000–60000 ms | Polling interval between cycles |
| `batchSize` | `8` | 1–20 | Max embeddings refreshed per cycle |


Session Checkpoints
---

Session checkpoints (`platform/daemon/src/session-checkpoints.ts`) capture
periodic snapshots of session state for continuity recovery. They store
a digest of the session's current focus, prompt count, memory queries,
and recent remembers.

Checkpoints are triggered by five event types:

- `periodic` — fired on a timer or prompt-count interval
- `pre_compaction` — fired when the harness signals context compaction
- `session_end` — fired when a session closes
- `agent` — fired by agent-initiated events
- `explicit` — fired by manual API calls

Each checkpoint row stores `session_key`, `harness`, `project`,
`project_normalized`, `trigger`, `digest`, `prompt_count`,
`memory_queries` (JSON array), and `recent_remembers` (JSON array).
Secrets are redacted before storage using pattern-based scrubbing
(Bearer tokens, API keys, base64 credential blobs, env variable values).

A buffered flush queue (`queueCheckpointWrite`) debounces writes at
2,500 ms intervals. If two triggers fire within the flush window for
the same session, queries and remembers are merged (union with caps:
20 queries, 10 remembers) and prompt counts are summed.

Per-session caps are enforced: when checkpoint count exceeds
`maxCheckpointsPerSession`, the oldest rows are deleted.

Digest formatters produce structured markdown for each trigger type:

- `formatPeriodicDigest` — project, prompt count, duration, recent
  prompts, memory activity
- `formatPreCompactionDigest` — same plus optional session context
- `formatSessionEndDigest` — same with total prompt count

Pruning is strict: `pruneCheckpoints(db, retentionDays)` hard-deletes
all checkpoints older than the retention window.

Configuration lives under `continuity` in the pipeline config:

| Field | Default | Range | Description |
|-------|---------|-------|-------------|
| `enabled` | `true` | — | Master switch |
| `promptInterval` | `10` | 1–1000 | Prompts between periodic checkpoints |
| `timeIntervalMs` | `900000` | 60s–1h | Time between periodic checkpoints (15 min) |
| `maxCheckpointsPerSession` | `50` | 1–500 | Per-session cap |
| `retentionDays` | `7` | 1–90 | Days before old checkpoints are pruned |
| `recoveryBudgetChars` | `2000` | 200–10000 | Max characters for recovery digest |


Continuity Scoring
---

At session end, the summary worker scores how effectively injected
memories were used during the session. The scoring flow:

1. **Load injected memories** — queries `session_memories` joined with
   `memories` for the session, filtered to `was_injected = 1`, ordered
   by `rank ASC`.

2. **LLM evaluation** — the injected memories and session transcript are
   sent to the LLM, which returns a JSON object with `score` (0–1),
   `confidence` (0–1), `memories_used` (count), `novel_context_count`,
   `reasoning`, and `per_memory` (array of `{ id, relevance }`).

3. **Per-memory relevance** — each entry in `per_memory` uses an 8-char
   prefix of the memory ID. The prefix is resolved to the full UUID via
   a map built from the injected memories. The `relevance_score` column
   on `session_memories` is updated for each matched memory.

4. **Score persistence** — the overall score, confidence, memory counts,
   reasoning, and continuity reasoning are written to `session_scores`.
   The `memories_recalled` field uses the actual injected count (not zero).

The scoring handles edge cases gracefully: markdown fences and `<think>`
blocks are stripped from LLM output, missing optional fields default to
zero/empty, out-of-range scores are clamped to [0, 1], and sessions
without `session_memories` data still get a valid score row.


Prospective Indexing (Hints)
---

After a memory is written, a `prospective_index` job is enqueued in
`memory_jobs`. The hints worker
(`platform/daemon/src/pipeline/prospective-index.ts`) processes these
jobs as a background polling loop, generating hypothetical future
queries — "hints" — that the memory might answer.

The approach is inspired by Kumiho (arXiv:2603.17244) prospective
indexing. Rather than relying solely on the memory's literal content
for retrieval, the pipeline asks the extraction LLM to imagine what
questions a user might ask that this memory would help answer. The LLM
returns up to `hints.max` (default 5) hint strings per memory.

Hints are stored in the `memory_hints` table, each linking back to the
source `memory_id`. A companion `memory_hints_fts` FTS5 index makes
hints searchable with BM25 scoring.

At search time, the hints FTS5 table is queried alongside the content
FTS5 table. When a hint matches, its BM25 score is merged with the
memory's content score using `Math.max` — a hint match elevates its
parent memory but does not stack additively with the content score.
This prevents a memory with both a content match and a hint match from
being double-boosted; instead, the stronger of the two signals wins.

Configuration lives under `hints` in the pipeline config:

| Field | Default | Range | Description |
|-------|---------|-------|-------------|
| `enabled` | `false` | — | Master switch |
| `max` | `5` | 1–20 | Max hints generated per memory |
| `timeout` | `45000` | 5000–300000 ms | LLM generation timeout |
| `poll` | `5000` | 1000–60000 ms | Worker polling interval |

```yaml
memory:
  pipelineV2:
    hints:
      enabled: true
      max: 5
```


Post-Fusion Dampening
---

After hybrid recall combines traversal, FTS, vector results, and
prospective hints into a candidate pool, structured evidence shaping
(`platform/daemon/src/pipeline/structured-evidence.ts`) scores candidates
across separate lexical, semantic, hint, and traversal channels. This is
the recall-side SEC layer: traversal can contribute structure, but
traversal-only memories are capped below directly anchored evidence;
prospective hints stay strong enough to recover class-to-instance
matches, such as "music streaming service" finding a memory that only
says "Spotify." A light facet-coverage pass then prefers top candidates
that cover different parts of multi-part queries instead of returning
near-duplicates for one facet.

After structured evidence shaping produces a fused score list, the
dampening pipeline
(`platform/daemon/src/pipeline/dampening.ts`) applies three corrections
before the final sort. The goal is to break score bunching where relevant
and irrelevant results land at similar fusion scores.

Structured currentness then applies a final correction before hydration.
Active attributes remain eligible as current evidence, while memories whose
structured attributes have been superseded are downweighted and annotated
with a `[Signet currentness]` note that points to the replacement
attribute when available. Structured supersession is grouped-claim-scoped: a newer
attribute can replace an older one only when it shares the same entity,
aspect, `group_key`, and `claim_key`. Sibling events under the same aspect stay active
unless the caller explicitly gives them the same group and claim key. This keeps stale
facts visible for historical questions without letting them win ordinary
"what is current?" recall.

**Stage 1: Gravity dampening** penalizes results that arrived via a
semantic path (vector, hybrid, or traversal) but share zero query-term
overlap with the actual content. These are "semantic hallucinations" —
the embedding model thinks they are related but the surface words have
nothing in common. Results with a score above 0.3 from a semantic source
are tokenized (lowercase, stop-word stripped) and checked against the
query tokens. Zero overlap halves the score (default `gravityPenalty:
0.5`).

**Stage 2: Hub dampening** penalizes results whose linked entities are
all high-degree hubs. Entity mention counts from
`memory_entity_mentions` are sorted to compute a P90 threshold (default
`hubPercentile: 0.9`). If every entity linked to a memory sits above
that threshold, the memory's score is multiplied by `hubPenalty` (default
0.7). This prevents popular entities like "Signet" or "Nicholai" from
dominating recall when the query targets something specific.

**Stage 3: Resolution boost** rewards actionable, specific memories.
Memories with type `constraint` or `decision` receive a 1.2x multiplier
(default `resolutionBoost: 1.2`). Other memories with temporal anchors
(ISO dates or month names) receive a lighter boost: `1 + (boost - 1) *
0.5`, which is 1.1x at default settings. Short or vague content (under
50 characters) receives no boost.

All three stages are independently togglable. After dampening, results
are re-sorted by adjusted score descending.


Lossless Session Transcripts
---

As hooks run, Signet stores the canonical cleaned conversation transcript as
JSONL at `$SIGNET_WORKSPACE/memory/{harness}/transcripts/transcript.jsonl`.
The `session_transcripts` table (migration 040) remains a compatibility and
indexing surface. Tool calls, tool outputs, and thinking traces are removed from
these memory surfaces so retrieval and summarization stay focused on the actual
conversation. Raw auditable traces may still be written to daemon logs outside
the memory lineage.

The table schema (`session_key TEXT PRIMARY KEY, content TEXT NOT NULL,
harness TEXT, project TEXT, agent_id TEXT, created_at TEXT`) is indexed
on `project` and `created_at`. The summary worker writes one row per
session via `INSERT OR IGNORE`, keyed on `session_key`.

The `/api/memory/remember` endpoint accepts an optional `transcript`
field. When present and a `sourceId` (session key) is available, the
transcript is written to `session_transcripts` in a separate write
transaction. This allows connectors to push cleaned conversation text
alongside memories without waiting for session-end summary processing.

At recall time, the `/api/memory/recall` endpoint supports `expand:
true`. When set, session keys from the result set are batch-looked up
in `session_transcripts` and the transcript content is joined into the
response. This lets callers retrieve the full conversation context
behind a recalled memory without a separate API call.


Canonical Markdown Lineage and MEMORY.md Projection
---

Rolling history now has an explicit authority split.

Canonical historical content lives as immutable markdown artifacts in
`$SIGNET_WORKSPACE/memory/`:

- `--transcript.md`
- `--summary.md`
- `--compaction.md`

Each session also has one mutable `--manifest.md` file. The manifest is the
only artifact that may gain new links after session end, such as a later
`compaction_path`.

`MEMORY.md` is no longer canonical history. It is a rebuildable projection over:

- durable memory rows for the Tier 1 head
- persisted thread heads plus temporal DAG state for Tier 2
- canonical artifact frontmatter for the strict 30-day session ledger

The renderer is programmatic. LLM output in this lane is limited to the single
`memory_sentence` stored in summary and compaction frontmatter, with a
deterministic fallback when the quality gate fails. The final `MEMORY.md`
projection always includes:

- `## Global Head (Tier 1)`
- `## Thread Heads (Tier 2)`
- `## Session Ledger (Last 30 Days)`
- `## Open Threads`
- `## Durable Notes & Constraints`
- `## Temporal Index`

Session-end jobs write canonical transcript artifacts immediately, then the
summary worker writes the matching canonical summary artifact for normal
`session_end` jobs. `compaction-complete` writes a canonical compaction
artifact and backfills the session manifest. Mid-session
`session-checkpoint-extract` jobs remain DB-native and only write checkpoint
nodes into `session_summaries`.


Decision Auto-Protection
---

The shared decision detector (`isDecisionContent`) runs a 14-pattern regex
battery on memory content. Structured graph writes use this detector when a
caller does not specify a stronger kind, so decision language can become a
`kind='constraint'` without requiring a background LLM.

The patterns cover common decision-indicating phrases:

- "chose/chosen to use X over Y", "decided to/on/against"
- "switched from/to", "migrated from/to/away"
- "picked X over Y", "went with", "sticking with"
- "committed to", "settled on", "will use/go with/stick with"
- "prefers X over/instead/rather", "adopted"
- "architecture decision", "design decision"

The detection function returns true if any pattern matches. This is a
write-time classification, no LLM call is involved. The regex battery is fast
and deterministic, consistent with the pipeline rule that default background
work should be mechanical and predictable.


Configuration Reference
---

Most pipeline config lives under `memory.pipelineV2` in `agent.yaml` (see
[[configuration]]). The config uses a nested structure with grouped
sub-objects. Legacy flat keys are also supported for backward
compatibility (nested keys take precedence).

Provider selection for extraction and session synthesis can also be bound to
the shared inference control plane through the top-level `inference.workloads`
config. If those workload bindings are present, the pipeline resolves its
inference target through the router. Legacy extraction and synthesis provider
fields are only used to build an implicit compatibility profile when no explicit
`inference:` block is configured.

### Top-level flags

```
enabled                         true
shadowMode                      false
mutationsFrozen                 false
semanticContradictionEnabled        true
semanticContradictionTimeoutMs      120000  # ms, range 5000-300000
telemetryEnabled                    false
```

### Nested sub-objects and defaults

Extraction safety note:

- intended usage is Claude Code on Haiku, Codex CLI on gpt-5.4-mini with a
  Pro/Max subscription, or local Ollama with at least `qwen3:4b`
- set `provider: none` on a VPS if you do not want background
  extraction
- remote API extraction can accumulate extreme fees quickly
  (`anthropic`, `openrouter`, `openai-compatible`, or remote OpenCode routes)

```yaml
extraction:
  provider: llama-cpp            # "none" | "llama-cpp" | "ollama" | "claude-code" | "codex" | "opencode" | "anthropic" | "openrouter" | "openai-compatible" | "command"
  model: qwen3:4b
  timeout: 90000                 # ms, range 5000–300000
  minConfidence: 0.7             # fraction 0.0–1.0
  structuredOutput: true         # send JSON schema in format field; set false for providers that reject it (e.g. GitHub Copilot)
  command:                       # required when legacy extraction.provider: command
    bin: node
    args: ["./extract.mjs", "--transcript", "$TRANSCRIPT", "--session", "$SESSION_KEY", "--agent", "$AGENT_ID"]
    # tokens: $TRANSCRIPT (temp file path), $SESSION_KEY, $PROJECT, $AGENT_ID, $SIGNET_PATH
    # command stdout/stderr are ignored; command writes memories to Signet state directly
    # top-level inference.targets.*.executor: command is the separate stdout-based inference-provider path

synthesis:
  enabled: true
  provider: ollama               # "none" | "llama-cpp" | "ollama" | "claude-code" | "codex" | "opencode" | "anthropic" | "openrouter" | "openai-compatible"
  model: qwen3:4b
  timeout: 120000                # ms, range 5000–300000
  # when omitted entirely, synthesis falls back to extraction provider/model
  # explicit top-level inference.workloads bindings override legacy provider selection

worker:
  pollMs: 2000                   # ms, range 100–60000
  maxRetries: 3                  # range 1–10
  leaseTimeoutMs: 300000         # ms, range 10000–600000
  maxLoadPerCpu: 0.8             # load-per-CPU threshold, range 0.1–8.0
  overloadBackoffMs: 30000       # ms, range 1000–300000

graph:
  enabled: true
  extractionWritesEnabled: true  # default; persists extracted entities when graph tables are available
  boostWeight: 0.15              # fraction 0.0–1.0
  boostTimeoutMs: 500            # ms, range 50–5000

structural:
  enabled: false
  classifyBatchSize: 8           # range 1–20
  dependencyBatchSize: 5         # range 1–10
  pollIntervalMs: 10000          # ms, range 2000–120000
  synthesisEnabled: false
  synthesisIntervalMs: 60000     # ms, range 10000–600000
  synthesisTopEntities: 20       # range 5–100
  synthesisMaxFacts: 10          # range 3–50
  synthesisMaxStallMs: 1800000   # 30 min, set 0 to disable
  supersessionEnabled: true
  supersessionSweepEnabled: true
  supersessionSemanticFallback: false
  supersessionMinConfidence: 0.7

reranker:
  enabled: true
  model: ""
  useExtractionModel: false
  topN: 20                       # range 1–100
  timeoutMs: 2000                # ms, range 100–30000

autonomous:
  enabled: true
  frozen: false
  allowUpdateDelete: true
  maintenanceIntervalMs: 1800000 # 30 min, range 60s–24h
  maintenanceMode: execute       # "observe" | "execute"

repair:
  reembedCooldownMs: 300000      # 5 min, range 10s–1h
  reembedHourlyBudget: 10        # range 1–1000
  requeueCooldownMs: 60000       # 1 min, range 5s–1h
  requeueHourlyBudget: 50        # range 1–1000
  dedupCooldownMs: 600000        # 10 min, range 10s–1h
  dedupHourlyBudget: 3           # range 1–100
  dedupSemanticThreshold: 0.92   # fraction 0.0–1.0
  dedupBatchSize: 100            # range 10–1000

documents:
  workerIntervalMs: 10000        # ms, range 1s–300s
  chunkSize: 2000                # chars, range 200–50000
  chunkOverlap: 200              # chars, range 0–10000
  maxContentBytes: 10485760      # 10 MB, range 1 KB–100 MB

guardrails:
  maxContentChars: 800           # range 50–100000
  chunkTargetChars: 600          # range 50–50000
  recallTruncateChars: 500       # range 50–100000
  contextBudgetChars: 4000

continuity:
  enabled: true
  promptInterval: 10             # range 1–1000
  timeIntervalMs: 900000         # 15 min, range 60s–1h
  maxCheckpointsPerSession: 50   # range 1–500
  retentionDays: 7               # range 1–90
  recoveryBudgetChars: 2000      # range 200–10000

telemetry:
  posthogHost: ""
  posthogApiKey: ""
  flushIntervalMs: 60000         # ms, range 5s–10min
  flushBatchSize: 50             # range 1–500
  retentionDays: 90              # range 1–365

embeddingTracker:
  enabled: true
  pollMs: 5000                   # ms, range 1s–60s
  batchSize: 8                   # range 1–20

hints:
  enabled: false
  max: 5                         # range 1–20
  timeout: 45000                 # ms, range 5000–300000
  poll: 5000                     # ms, range 1000–60000

dampening:
  gravityEnabled: true
  hubEnabled: true
  resolutionEnabled: true
  hubPercentile: 0.9             # fraction 0.0–1.0
  hubPenalty: 0.7                # fraction 0.0–1.0
  gravityPenalty: 0.5            # fraction 0.0–1.0
  resolutionBoost: 1.2           # multiplier
```

### Example configurations

A minimal configuration to enable the pipeline in shadow mode:

```yaml
memory:
  pipelineV2:
    enabled: true
    shadowMode: true
```

To enable controlled writes with graph support:

```yaml
memory:
  pipelineV2:
    enabled: true
    graph:
      enabled: true
      extractionWritesEnabled: true
    extraction:
      minConfidence: 0.75
```

To enable autonomous maintenance in execute mode:

```yaml
memory:
  pipelineV2:
    enabled: true
    autonomous:
      enabled: true
      maintenanceMode: execute
```

Full production configuration:

```yaml
memory:
  pipelineV2:
    enabled: true
    semanticContradictionEnabled: true
    extraction:
      provider: llama-cpp
      model: qwen3:4b
    graph:
      enabled: true
    autonomous:
      enabled: true
      maintenanceMode: execute
    continuity:
      enabled: true
      promptInterval: 10
    embeddingTracker:
      enabled: true
      pollMs: 5000
```


---

Multi-Agent Pipeline Notes
---------------------------

When multiple agents share a daemon, the pipeline tags each extracted memory
with the requesting agent's ID. The `agent_id` is resolved from the
session-start hook request: if the caller provides `agentId` in the body it
is used directly; otherwise the daemon parses OpenClaw's session key format
(`agent:{id}:{rest}`) as a fallback.

Extracted memories default to `visibility = 'global'`. Callers that want
private memories must set `visibility = 'private'` explicitly in the
remember request or via `signet remember --private`.

The pipeline worker itself is agent-agnostic: it operates on the `memory_jobs`
queue and reads `agent_id` from each job record. Entity graph operations
(extraction, traversal, aspect updates) all pass `agent_id` through to
ensure knowledge is scoped to the correct agent.