# Repository Analysis — ferret-scan

**Date:** 2026-05-28 · **Version analyzed:** 2.6.1 · **Scope:** fresh end-to-end
analysis covering architecture, positioning in the AI coding-agent landscape,
opportunities to be more useful, a deep security scan, and a documentation
review.

This document records both the findings and the concrete changes shipped in the
same change-set. Items are marked **[Fixed here]** when addressed in this PR and
**[Recommended]** when deferred (because they are architectural, behavior-changing,
or require a maintainer/legal decision).

---

## 1. Executive summary

`ferret-scan` is a mature (~20k LOC of TypeScript, 110 test files, enforced 80%+
coverage, multi-stage CI, LSP, VS Code extension, Docker) static security scanner
purpose-built for the **AI agent configuration surface** — the skills, agents,
hooks, MCP configs, rules files, and instruction markdown that AI coding CLIs read
and act on. It scans *the agent's own instruction/config surface* rather than
application code, which is a genuinely differentiated and timely niche.

The codebase is well-engineered and largely does what its docs claim. The most
significant finding of this review is a **silent failure of its headline ReDoS
defense**: the optional RE2 linear-time regex engine was never actually loading in
any published build, leaving a weak heuristic fallback as the only protection
(with confirmed exponential-time bypasses). This has been fixed here, along with
several other contained security hardening items and a batch of documentation
corrections.

One item requires a maintainer/legal decision and was deliberately **not** edited:
the presence of patent-prosecution material in a public MIT repo (§6).

---

## 2. Architecture & maturity

**Pipeline** (`src/scanner/Scanner.ts`):
`FileDiscovery` → `PatternMatcher` (regex rules + false-positive filters) →
per-file analyzers (`Entropy`, `Mcp`, `Dependency`, `Capability`, `Llm`,
`Semantic`, `ThreatIntel`) → cross-file `CorrelationAnalyzer` → post-processing
(ignore comments, MITRE ATLAS annotation, documentation dampening) → reporters.

- **Rules:** 80 built-in rules across 9 categories (24 CRITICAL / 38 HIGH / 16
  MEDIUM / 2 LOW): credentials, injection, exfiltration, backdoors, obfuscation,
  permissions, persistence, supply-chain, ai-specific — plus AST-driven `semantic`
  and cross-file `correlation` rules.
- **Differentiators:** true AST analysis of code blocks embedded in markdown
  (`AstAnalyzer.ts`, symbol-aware to avoid substring false positives); cross-file
  attack-chain correlation; MITRE ATLAS technique mapping; SBOM/AIBOM output;
  privacy-first (no network calls by default; air-gap capable).
- **Surface:** 17 CLI subcommands, 6 output formats (console/json/sarif/html/csv/
  atlas) plus sbom/aibom, an LSP server (Neovim/Emacs/Zed/Helix/Sublime), and a VS
  Code extension.
- **Maturity signals:** property tests (`fast-check`), dedicated ReDoS/SSRF tests,
  enforced coverage gates, `npm run quality` meta-gate, committed
  `npm-shrinkwrap.json`, `npm audit --production` clean (0 vulnerabilities), 20
  released versions with a detailed changelog.

---

## 3. Positioning in the AI coding-agent landscape

The 2025/2026 landscape — Claude Code, Cursor, Windsurf, Cline, Aider, Copilot,
ubiquitous MCP servers, and proliferating instruction files (`CLAUDE.md`,
`AGENTS.md`, `.cursorrules`, skills, hooks, slash commands) — turned the agent's
*configuration surface* into a real injection / exfiltration / persistence vector
that traditional tooling doesn't inspect.

| Tool | Focus | Relationship to ferret |
|------|-------|------------------------|
| gitleaks / trufflehog | secrets in source | overlaps only on credentials |
| semgrep | general code SAST | different target; ferret is narrow + deep on AI-config semantics |
| mcp-scan (Invariant/Snyk) | MCP runtime proxy, rug-pull, tool-shadowing, LLM guardrails | most direct competitor; see `docs/mcp-scan-comparison.md` |

**Defensible niche:** the only broad, local-first, multi-CLI scanner of the agent
configuration surface with SARIF/SBOM/LSP/CI integration. The acknowledged gaps vs
mcp-scan are runtime/proxy protection, MCP rug-pull detection, and obfuscated/
paraphrased-injection detection.

---

## 4. How it could be more useful (prioritized roadmap) — [Recommended]

**High value**
- **First-class `AGENTS.md` coverage** *(low effort).* The cross-tool `AGENTS.md`
  standard, nested `AGENTS.md` files, and `.github/copilot-instructions.md` should
  be explicit discovery targets in `FileDiscovery.ts`. Quick win.
- **Dedicated component types/rules for newer Claude Code surfaces** *(med effort):*
  `settings.json` hook command injection, `PreToolUse`/`PostToolUse` hook abuse,
  plugin `marketplace.json` trust, slash-command markdown injection. Rule `AI-011`
  is a start; expand it.
- **MCP rug-pull / tool-integrity tracking** *(med effort):* hash MCP tool
  descriptions on first scan, store in the baseline, alert on unauthorized changes.
  `baseline.ts` + `mcpTrustScore.ts` already provide the building blocks.
- **MCP runtime/proxy scanning** *(high effort):* the biggest functional gap vs
  mcp-scan; `runtimeMonitor.ts` (`--stdio`) is a foundation.

**Medium value**
- **`ferret mcp-serve` (agent-consumable mode)** *(low–med effort):* expose scanning
  as an MCP tool so agents can self-audit configs in-loop, plus a compact
  finding format tuned for LLM context windows.
- **Published reusable GitHub Action + pre-commit-framework hook** *(low effort).*
- **Fingerprint-based suppression + `--diff-only` CI mode** *(med effort):* stable
  finding hashes that survive line moves; scan only changed lines in CI.
- **Cross-origin / tool-shadowing detection** *(med effort):* extend
  `CorrelationAnalyzer` to flag when one MCP server's tool description overrides
  another's.

**Lower value / polish**
- Homoglyph / unicode-confusable normalization before regex matching (catches some
  obfuscated injections locally, without an LLM).
- JetBrains plugin (LSP already supports many editors; only VS Code ships a plugin).
- The live CI self-scan job runs `ferret scan .`, which scans only AI-config paths
  and reports 0 findings on this repo — i.e. it is effectively a no-op gate. Consider
  switching the dogfood to `--self` semantics (note: `scan --self --ci` exits
  non-zero by design because it includes the planted evil fixtures, so it needs a
  non-blocking step rather than a hard gate).

---

## 5. Security scan

The scanner ingests untrusted input (malicious config files, attacker-controlled
regex inputs, remote rule URLs, quarantine databases) and modifies files, so it was
audited as attacker-facing. Full audit results below; confirmed-impact items were
fixed in this change-set.

### 5.1 RE2 ReDoS protection silently disabled — **[Fixed here]**

**`src/utils/safeRegex.ts`.** The module loaded the optional linear-time RE2 engine
via a bare `require('re2')`. Because ferret-scan's own `package.json` declares
`"type": "module"`, the global `require` is undefined at runtime — the call always threw,
was swallowed by the surrounding `try/catch`, and `RE2` was permanently `null`. The
documented "linear-time engine that categorically eliminates ReDoS" therefore never
ran in any published build; every scan fell back to a 9-pattern heuristic screener.

Verified at runtime: `isRE2Active()` returned `false` even with `re2` installed and
loadable under CommonJS.

**Fix:** bridge to a working CommonJS `require` via
`createRequire(import.meta.url)` (isolated in `src/utils/esmRequire.ts` so the
CommonJS Jest runner can stub it). `isRE2Active()` now returns `true` in the real
CLI.

### 5.2 Fallback ReDoS screener bypasses — **[Fixed here]**

**`src/utils/safeRegex.ts`.** When RE2 is genuinely unavailable (native addon fails
to build on Alpine/musl/Windows), the heuristic screener is the only defense, and it
had confirmed exponential/polynomial bypasses — e.g. `(\d+)*$` (catastrophic on V8)
and `(.*a){20}` were accepted. These reach the engine via attacker-controlled custom
rules (`.ferret/rules.yml` in any scanned repo, or remote rules).

**Fix:** the screener now rejects the whole family of "group containing an inner
quantifier that is itself quantified" (`(\d+)*$`, `(\w+)*`, `([ab]+){2,}`, …), in
addition to the original patterns. RE2 (5.1) remains the real defense; this is
defense-in-depth.

**[Recommended]** The primary `PatternMatcher.findMatches` path runs built-in rule
regexes with `new RegExp().exec()` and a time budget that is only checked *between*
matches — a single catastrophic `exec()` cannot be interrupted. For a hard bound,
run matching in a worker thread with a wall-clock `terminate()`. (No exploitable
built-in ReDoS exists today — all 80 rules were fuzzed clean — so this is structural.)

### 5.3 SSRF in remote custom-rule fetching — **[Fixed here]**

**`src/features/customRules.ts`.** With `--allow-remote-rules`, `fetchText` issued
`fetch(url)` against any `http(s)` URL with no host validation and default redirect
following — allowing requests to `169.254.169.254` (cloud metadata), `localhost`,
and RFC1918 ranges, including via redirects from an allowed host.

**Fix:** `assertSafeRemoteUrl` rejects non-http(s) schemes and any host that
resolves (via DNS) to loopback/private/link-local/unique-local/CGNAT/metadata
addresses; redirects are followed manually (`redirect: 'manual'`, max 5 hops) and
re-validated at every hop. Verified live: a `--custom-rules http://169.254.169.254/…`
request is now blocked.

### 5.4 Secrets leaked unredacted in webhook payloads — **[Fixed here]**

**`src/features/webhooks.ts`.** The generic webhook formatter sent `match: f.match`
— the raw matched text, which can contain the secret that tripped the rule — to an
external URL with no redaction (reporters redact on their own path; the webhook path
did not).

**Fix:** `match` is now passed through `redactSecretsInString` before egress.

### 5.5 CSV formula injection — **[Fixed here]**

**`src/reporters/CsvReporter.ts`.** Cells derived from attacker-controlled file
content/paths beginning with `=`, `+`, `-`, `@`, tab, or CR were written verbatim,
so opening a CSV report in Excel/Sheets/LibreOffice could execute a formula.

**Fix:** such cells are prefixed with `'` to force text interpretation.

### 5.6 Remaining audit items — **[Recommended]**

- **Quarantine restore trusts `quarantine.json`** (`Quarantine.ts`): a poisoned DB
  can write attacker content anywhere under CWD (e.g. a git hook). Verify the stored
  `fileHash` and require an explicit restore base rather than defaulting to all of
  CWD.
- **Symlink / recursion-depth guards in `FileDiscovery`** (`FileDiscovery.ts`): uses
  `stat` (follows symlinks) with no visited-inode set or depth cap → symlink loops
  (DoS) and scan-scope escape. Use `lstat`, track visited realpaths, cap depth.
- **`npm audit` runs inside the untrusted scanned dir** (`dependencyRisk.ts`): honors
  a repo-local `.npmrc`/`registry=`. Pin `--registry`/`--userconfig` or parse the
  lockfile offline.
- **Defense-in-depth:** `runtimeMonitor` spawns the target with `shell: true`;
  `gitHooks.getChangedFiles` interpolates refs into a shell command; verbose logging
  can emit unredacted content fragments. None are reachable from scanned-file content
  today.

**Positive findings:** no `eval`/`new Function`/dynamic `require` on scanned content;
zod-validated parsing with size caps; explicit (pollution-safe) config merge; custom
rules cannot shadow built-in IDs; HTML reporter escapes all sinks (no XSS);
LRU-bounded content cache; clean production dependency tree.

---

## 6. Documentation review

### Fixed here
- **Broken `docs/RULES.md` reference** in `README.md` → now points to the inline
  "Custom Rules" section (no such file existed).
- **Exit codes undocumented:** added a full table (`0/1/2/3/4/5/130`) and all six
  `FERRET_EXIT_*` overrides (only three were listed).
- **`.ferretignore` undocumented:** added a Configuration subsection.
- **Undocumented subcommands:** added `check`, `mcp`, `deps`, `capabilities`,
  `policy`, and `webhook` to the command reference.
- **Security-contact mismatch:** `README.md` advertised `security@ferret-scan.dev`
  while `SECURITY.md` uses a different address; the README now defers to
  `SECURITY.md` as the single source of truth.
- **`--format` help** omitted `sbom`/`aibom` (both work) — now listed.
- **Discoverability:** the README Documentation section now links the full docs set;
  the "Documentation" link no longer points at a separate wiki.
- **Hygiene:** removed `.github/workflows/publish.yml.bak`; removed a duplicate
  `## [2.6.0]` CHANGELOG header; fixed the `typedoc.json` GitHub org link
  (`ferret-scan/ferret-scan` → `fubak/ferret-scan`); removed a dead, shadowed
  duplicate `self-scan` CI job (two `jobs:` keys shared the name, so YAML silently
  dropped the first).

### Recommended
- README is large (~35KB) with internal duplication (LLM analysis and "Planned
  Features" appear twice). Consider trimming to quick-start + command reference +
  links, moving deep-dives into dedicated `docs/` pages.
- `docs/TEST_RESULTS.md` and `docs/QUALITY_GATES.md` are point-in-time snapshots
  pinned to 2.6.0; auto-generate from CI or add staleness banners.
- `docs/publishing.md` and the VS Code `.vsix` filename in the README reference
  stale example version numbers.

---

## 7. Appendix — patent material removed from the public repo — **[Fixed here]**

The repo previously shipped, in a public MIT-licensed project:
`docs/PATENT_LANDSCAPE_ANALYSIS.md`, `docs/PATENT_ACTION_PLAN.md`, and
`docs/ip-submissions/` (five full provisional patent specification packages as
HTML). `PATENT_ACTION_PLAN.md` included commercial licensing tiers,
ROI/acquisition projections, and named licensing/cross-licensing targets
(including downstream consumers of the tool).

This material has been **removed from the working tree** at the maintainer's
direction. Reasons it did not belong in the public repo:

- Publishing patent-prosecution material is a public disclosure that can affect
  prosecution strategy (it becomes prior art / defensive publication once public).
- An aggressive monetization plan that names downstream consumers as licensing
  targets sits in tension with shipping under MIT and may chill adoption.
- These read as internal business/legal documents rather than user/operator docs.

**Note on git history:** removing the files from `HEAD` stops them from shipping
going forward, but they remain reachable in prior commits (they predate this
branch). Fully purging them requires rewriting published history
(`git filter-repo` + a coordinated force-push and re-clone), which is a
destructive operation that should be performed deliberately by the maintainer —
it is **not** done here. A prepared runbook and guarded helper script are
available: [`docs/patent-history-purge.md`](./patent-history-purge.md) and
`scripts/purge-patent-history.sh`.