# Changelog

All notable changes to MarkCrawl are documented in this file. The format
follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and this
project will follow [SemVer](https://semver.org/) once it reaches 1.0.

## [0.11.1] - 2026-05-11

### Added — default aggregator-page URL filter
MarkCrawl now rejects mdBook `/print.html` and Hugo `/_print/` pages
during crawl-time URL filtering. These single-render-of-whole-tree
pages have artificially high keyword density (they contain the entire
docs tree on one URL), which causes embedding-based retrieval to rank
them above the dedicated chapter pages a user actually wants.

- New default patterns rejected pre-fetch (saves crawl budget):
  `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`,
  `*/print/index.html`.
- New kwarg `include_aggregator_pages: bool = False` on
  `crawl(...)` and both engine classes for offline-archive use cases.
- CLI flag `--include-aggregators` mirrors the kwarg.
- User-supplied `exclude_paths` and `include_paths` still apply
  independently — the aggregator filter composes with both, doesn't
  replace either.
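
The default patterns above can be sketched with shell-style globs. This is a minimal illustration, not MarkCrawl's actual implementation: the function name `is_aggregator_url` is hypothetical, and matching the URL path with `fnmatch` is an assumption. Only the patterns and the `include_aggregator_pages` kwarg name come from this changelog.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Patterns copied from the list above.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def is_aggregator_url(url: str, include_aggregator_pages: bool = False) -> bool:
    """Return True when the URL should be rejected as an aggregator page.

    Hypothetical helper: matches the URL path (not the full URL) against
    the glob patterns. `*` in fnmatch also matches `/`.
    """
    if include_aggregator_pages:
        # Opt-out for offline-archive use cases: nothing is rejected.
        return False
    path = urlsplit(url).path
    return any(fnmatchcase(path, pattern) for pattern in AGGREGATOR_PATTERNS)
```

Because every pattern requires a literal `/` before `print` or `_print`, paths such as `/blueprint.html` and `/imprint/` do not match in this sketch.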

### Why now
The public `llm-crawler-benchmarks` v1.4 cycle surfaced this as a
markcrawl-specific issue: markcrawl was returning `/print.html` in
49% of rust-book top-5 retrieval slots and `/_print/` in 39% of
kubernetes-docs slots, while all four well-functioning competitors
returned 0% `/_print/` on kubernetes-docs. The retrieval-ranking
damage is structural — these pages will always beat real chapter
pages on cosine similarity because they contain everything.

### Expected impact
Per the v1.4 retrieval-bucket audit, ~9-12 of markcrawl's 43
retrieval-bucket misses concentrate on this issue. Predicted MRR
lift on the 9-site bench pool: **+0.02 to +0.04**, concentrated on
rust-book and kubernetes-docs. Measurement waits for the bench's
v1.5 methodology refresh (helpful-pages-universe approach replaces
the v1.4 single-tool-anchor query corpus).

### Tests
36 new tests covering: default rejection of observed bench failures,
substring-match safety (`/blueprint.html`, `/preprint.html`,
`/imprint/` all pass through), the opt-out flag, composition with user
`exclude_paths` and `include_paths`, and `CrawlEngine` /
`AsyncCrawlEngine` parity. Total test count: 647 (was 611).

## [0.11.0] - 2026-05-06

Two new modules expand markcrawl from "HTML to Markdown converter" to
"crawl + selectively download referenced files":

### Added — `markcrawl/binaries.py`
Binary file downloads (PDF, DOCX, etc.) referenced from crawled pages.

- New `crawl(..., download_types=["pdf", "docx"], ...)` kwarg. When set,
  matching links are routed to a separate download queue (parallel to
  HTML crawling) and streamed to `<out_dir>/downloads/`.
- **Streaming with size cap** — uses `requests` `stream=True` /
  `httpx.AsyncClient.stream(...)` with `iter_content` / `aiter_bytes`,
  enforcing `download_max_size_mb` per chunk. Atomic write via
  `.tmp` + `os.replace`. Partial files unlinked on cap-exceed.
- **Content-type validation BEFORE writing bytes** — a `.pdf` URL
  serving `text/html` (login wall, marketing splash) is dropped
  immediately, not saved.
- **JSONL row gains `downloads` field** when a page's binaries were
  downloaded: `[{url, path, size_bytes, content_type}, ...]`. The field
  is omitted when the page has no downloads (backward-compatible with
  existing JSONL parsers).
- **Sitemap entries route to download queue** when they match
  `download_types` (DS-3b symmetry with link discovery).
- All existing safety nets (`respect_robots`, `idle_timeout_s`,
  `include_subdomains` for scope) apply uniformly to downloads.
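
The size-cap and atomic-write behavior described above can be sketched as follows. This is an assumption-laden illustration rather than the code in `markcrawl/binaries.py`: `save_stream_capped` is a hypothetical name, and the chunk iterable stands in for whatever `iter_content` / `aiter_bytes` yields.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def save_stream_capped(
    chunks: Iterable[bytes], dest: Path, max_bytes: int
) -> Optional[int]:
    """Write chunks to `dest` atomically; return bytes written or None.

    Hypothetical helper. The cap is checked per chunk, before the chunk
    is written, so an oversized body is abandoned as soon as it is seen.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    exceeded = False
    with open(tmp, "wb") as fh:
        for chunk in chunks:
            written += len(chunk)
            if written > max_bytes:
                exceeded = True
                break
            fh.write(chunk)
    if exceeded:
        # Cap exceeded: remove the partial file, never publish it.
        tmp.unlink(missing_ok=True)
        return None
    # os.replace is atomic on the same filesystem, so readers see either
    # no file or the complete file.
    os.replace(tmp, dest)
    return written
```

With `requests`, the `chunks` argument could be `response.iter_content(chunk_size=65536)` on a response opened with `stream=True`.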

### Added — `markcrawl/filters.py`
Reusable best-effort filters for the `download_filter` callback.

- `DownloadCandidate` dataclass — passed to filters at discovery
  time; carries URL, anchor text, parent-page URL/title, extension.
- **`is_likely_resume`** — positive signals (resume, cv, template,
  sample) AND not legal-boilerplate (privacy, terms, policy, ...).
- **`is_likely_paper`** — positive signals (paper, preprint,
  research, study) AND not legal-boilerplate.
- **`exclude_legal_boilerplate`** — pure negative selector,
  composes with positive filters via `lambda c: positive(c) and
  exclude_legal_boilerplate(c)`.
- Filters run **pre-fetch**: rejected URLs are never fetched, so
  zero HTTP bytes are transferred for filtered candidates. They are
  documented as "best-effort heuristics, not classifiers"; users
  should test them against their real corpus.
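
To illustrate how a positive selector composes with a negative one, here is a self-contained sketch. It does not import `markcrawl.filters`: the `Candidate` dataclass, its field names, and both functions are hypothetical stand-ins, and the keyword sets only include the signals named above.

```python
import re
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Candidate:
    """Stand-in for DownloadCandidate; these field names are assumed."""
    url: str
    anchor_text: str = ""
    parent_title: str = ""


POSITIVE_SIGNALS = {"resume", "cv", "template", "sample"}
LEGAL_SIGNALS = {"privacy", "terms", "policy"}


def _tokens(candidate: Candidate) -> Set[str]:
    # Tokenize on non-alphanumerics so "cv" does not match inside "cvs".
    text = " ".join(
        (candidate.url, candidate.anchor_text, candidate.parent_title)
    ).lower()
    return set(re.findall(r"[a-z0-9]+", text))


def not_legal_boilerplate(candidate: Candidate) -> bool:
    """Pure negative selector: True unless a legal keyword appears."""
    return not (_tokens(candidate) & LEGAL_SIGNALS)


def looks_like_resume(candidate: Candidate) -> bool:
    """Positive signal AND not legal boilerplate."""
    return bool(_tokens(candidate) & POSITIVE_SIGNALS) and not_legal_boilerplate(
        candidate
    )
```

Since filters like these are plain callables over a candidate, they can be combined with `and` / `or` inside a lambda before any HTTP request is made.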

### Added — new `CrawlResult` fields
- `downloads_count: int` — files saved
- `downloads_bytes: int` — total bytes saved
- `downloads_size_skipped: List[str]` — URLs that exceeded the size cap
- `downloads_type_skipped: List[str]` — URLs whose content-type didn't match

### New `crawl()` kwargs

```python
crawl(
    base_url=...,
    download_types=["pdf", "docx"],          # None disables (default, no behavior change)
    download_max_files=200,                  # cap per crawl
    download_max_size_mb=25,                 # per-file cap
    download_filter=is_likely_resume,        # optional pre-fetch filter
)
```
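
On the consuming side, a JSONL reader needs to tolerate rows without the `downloads` field, since it is omitted for pages that had no downloads. The sketch below is one way to do that; `iter_downloads` is a hypothetical helper, and the row-level `url` key for the page is an assumption. The per-download keys (`url`, `path`, `size_bytes`, `content_type`) are the ones listed in this release.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_downloads(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (page_url, download_entry) pairs from JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # `downloads` is absent on pages without downloads, so default to [].
        for entry in row.get("downloads", []):
            # The page-level "url" key is assumed for this illustration.
            yield row.get("url", ""), entry
```

Passing an open file handle works because file objects iterate line by line.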

### Empirical guarantees
- 45 new tests in `tests/test_v011_binary_downloads.py` covering
  every SC from `specs/binary-downloads.md` plus all `On failure`
  paths. Mocked HTTP for the streaming / cap / dedup paths and an
  end-to-end mocked-fetch test proving the discover→queue→drain→
  JSONL flow works.
- Streaming + size cap empirically validated against `httpbin.org`
  during spec confidence review.

### Migration
No breaking changes. Default `download_types=None` preserves v0.10.6
behavior exactly. Users who set `download_types` should:
1. Pair with a `download_filter` for non-trivial use cases (the
   default "any PDF on the host" is rarely what you actually want;
   `is_likely_resume` and `exclude_legal_boilerplate` are starting
   points).
2. Tune `download_max_files` and `download_max_size_mb` to your
   bandwidth budget — defaults (200 files × 25 MB = 5 GB worst case)
   are conservative for one-shot crawls but should be lowered for
   high-cadence schedules.

### Deferred
- **Live-network smoke harness case** for an ATS-template aggregator
  is deferred to v0.11.1 (stable target URLs are hard to lock without
  scouting). The mocked end-to-end test in `test_v011_binary_downloads.py`
  covers the regression surface for the v0.11.0 gate.
- **Format-specific text extraction** (PDF/DOCX → Markdown) remains
  out of scope. Use `pypdf`, `python-docx`, `mammoth`, or
  `unstructured` downstream of the saved files.

611 tests passing (was 566 on v0.10.6; +45 in
`tests/test_v011_binary_downloads.py`).

## [0.10.6] - 2026-05-05

### Added
- **Opt-in robots.txt bypass via `respect_robots`.** New
  `crawl(..., respect_robots: bool = True)` kwarg. The default preserves
  the historical behavior: robots.txt Disallow rules are honored.
  Setting `respect_robots=False` bypasses Disallow but **still
  honors Crawl-delay** (politeness intact). Caller takes
  responsibility for legality, ethics, and downstream consequences.
- **Loud, non-silenceable warning** at `setup_robots()` when bypass
  is active. Both progress callback and Python `logger.warning`. No
  env-var override; the choice has to be made deliberately in code.
- **`CrawlResult.robots_respected: bool`** mirrors the kwarg the
  caller passed. Surfaced for audit / governance pipelines.
- **`CrawlResult.robots_bypassed_count: int`** — number of unique
  URLs that robots.txt Disallowed but were fetched anyway. Always 0
  when `robots_respected` is True. Lets callers see the impact of
  their override — small numbers mean robots wasn't constraining you.
- **End-of-crawl bypass summary** when `respect_robots=False`. One of
  two messages is emitted: "had no effect this run" (count=0) or
  "fetched N URL(s) that robots.txt Disallowed" (count>0).
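
The split between Disallow (bypassable) and Crawl-delay (always honored) can be sketched with the standard library's `urllib.robotparser`. The `robots_decision` function and its return shape are hypothetical; MarkCrawl's own `setup_robots()` logic is not shown in this changelog.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser


def robots_decision(
    parser: RobotFileParser,
    user_agent: str,
    url: str,
    respect_robots: bool = True,
) -> Tuple[bool, bool, Optional[float]]:
    """Return (fetch_allowed, counted_as_bypass, crawl_delay_seconds)."""
    allowed_by_robots = parser.can_fetch(user_agent, url)
    # Crawl-delay is read in both modes: bypassing Disallow does not
    # bypass politeness.
    delay = parser.crawl_delay(user_agent)
    if allowed_by_robots:
        return True, False, delay
    if respect_robots:
        return False, False, delay
    # Disallowed but fetched anyway: this is what a bypass counter tracks.
    return True, True, delay


parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /private/", "Crawl-delay: 2"])
```

In this sketch a URL only counts as bypassed when robots.txt would have blocked it, which is why a count of 0 means the override had no effect.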

### Why this design
- robots.txt is the only widely-deployed mechanism site owners have
  to express preferences about automated access. We default to
  respecting it. But forks / monkey-patches that ignore robots
  already exist in the wild; an explicit, audited flag is more
  honest than letting users hack around the constraint silently.
- Crawl-delay is preserved unconditionally. We disregard *Disallow*,
  not *politeness*. Bypassing rate limits would be DoS-shaped.
- The flag is set in code, not from CLI or environment. Forces a
  deliberate, traceable choice.

### Migration
No breaking changes. Default behavior unchanged. Use the flag for:
your own site (forgotten/misconfigured robots), authorized
pen-testing, internal/intranet docs you own, or RAG ingestion of docs
the site owner explicitly wants ingested but forgot to whitelist.

566 tests passing (was 549 on v0.10.5; +17 in
`tests/test_v0106_respect_robots.py` covering both modes, the
loud-warning behavior, Crawl-delay preservation, the audit fields,
and end-to-end bypass).

## [0.10.5] - 2026-05-04

### Added
- **Adaptive scope broadening.** When a crawl exhausts its narrow
  auto-derived scope (e.g. `/docs/concepts/*` from a kubernetes seed)
  with budget remaining, the engine now attempts one-level broadening
  (`/docs/concepts/*` → `/docs/*`) before giving up. URLs filtered
  under the previous scope are stashed and replayed through the
  broader scope. Triggers only when:
  - Scope was auto-derived (user-explicit `include_paths` is
    respected as intent and never mutated).
  - The current scope's leftmost segment is in `_DOCS_HUB_MARKERS`
    (`docs`, `book`, `learn`, `tutorial`, `guide`, `reference`,
    `manual`, `handbook`, `api`, etc.) **or** the seed classifies as
    `docs`/`apiref` by hostname.
  - One-level broadening doesn't land at whole-host (`/*`).
  - Cap of `_DEFAULT_MAX_BROADEN_EVENTS = 2` per crawl.
- **`CrawlResult.scope_history: List[List[str]]`** — sequence of
  scope patterns the crawl traversed. Empty if no scope was set;
  one entry per scope state. Auditable.
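
The one-level broadening rule and its guardrails can be sketched as a pure function. This is an illustration under assumptions: `broaden_once` is a hypothetical name, the marker set is only the subset listed above, and the hostname-based `docs`/`apiref` classification is left out.

```python
from typing import Optional

# Subset of the markers named above; the real _DOCS_HUB_MARKERS has more.
DOCS_HUB_MARKERS = {
    "docs", "book", "learn", "tutorial", "guide",
    "reference", "manual", "handbook", "api",
}


def broaden_once(scope: str) -> Optional[str]:
    """Return the scope one level up, or None if broadening is blocked."""
    segments = [s for s in scope.strip("/").split("/") if s and s != "*"]
    if not segments or segments[0] not in DOCS_HUB_MARKERS:
        return None  # leftmost segment is not a docs hub marker
    parent = segments[:-1]
    if not parent:
        return None  # one level up would be whole-host (`/*`)
    return "/" + "/".join(parent) + "/*"
```

Under these rules `/docs/concepts/*` broadens to `/docs/*`, while a single-segment scope such as `/book/*` returns `None` because its parent would be the whole host.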

### Empirical proof (real network, 2026-05-04)
| Site | v0.10.4 | v0.10.5 | Delta |
|---|---|---|---|
| kubernetes-docs (max=400) | 195/400 | **400/400** | **+105%** |
| rust-book (max=150) | 111 | 111 | unchanged (single-segment guardrail) |
| postgres-docs (max=80) | 80 | 80 | unchanged |

The kubernetes seed `https://kubernetes.io/docs/concepts/` exhausts
its narrow scope at 195 pages; v0.10.5 broadens to `/docs/*`, replays
~200 stashed URLs from `/docs/tasks/`, `/docs/reference/`,
`/docs/setup/`, etc., and fills the full 400 budget — all in 28 s.

Rust-book is **deliberately unchanged**: its Tier 0 single-segment
scope `/book/*` cannot broaden short of whole-host, which the
guardrail blocks. We don't auto-pull `/std/`, `/cargo/`, `/nomicon/`
even though crawl4ai-raw does — those are different publications,
and our scope honors the seed's intent.

### Fixed
- The run loops in `CrawlEngine` and `AsyncCrawlEngine` now attempt
  scope broadening at *both* exit paths (queue empty AND every URL
  in the queue filtered out), not just one.

### Migration
No breaking changes. Behavior preserved exactly when the user passes
`include_paths` explicitly. For default crawls on docs sites,
expect more pages and the same (or better) signal-to-noise — the
broadening guardrail is intentionally tight (docs hub markers only,
no whole-host fallback).

549 tests passing (was 528 on v0.10.4; +21 in
`tests/test_v0105_adaptive_scope.py`).

## [0.10.4] - 2026-05-04

### Fixed
- **Idle timeout now resets on any meaningful progress.** v0.10.3 reset
  the `idle_timeout_s` clock only on `save_page`, which mis-fired on
  bursty crawls where the engine was successfully fetching pages but
  most were being deduped or under `min_words` (e.g.
  huggingface-transformers: ~21 pages saved before the 120 s timer
  fired, vs ~200 reachable). The reset signal is now wider: a
  successful HTTP 2xx response, a save, or a `discover_links` call
  that adds at least one new URL to the queue each bumps the activity
  clock. Net effect: the timer now functions as a true deadlock
  detector, not a save-rate guard. Sites that legitimately produce
  pages slowly continue to run; truly idle engines still get killed
  cleanly.

### Added
- `CrawlResult.first_status: Optional[int]` — first observable HTTP
  status. Lets callers distinguish engine bugs from external
  WAF/anti-bot blocks without scraping logs.
- `CrawlResult.stalled: bool` — True when the run was terminated by
  the idle-timeout watchdog rather than running out of work.
- `bench/local_replica/release_smoke.py` — pre-release coverage
  harness. Runs `crawl()` against ~4 real sites with per-site
  baselines, treats `first_status` ≥ 400 + 0 pages as `BLOCKED` (skip,
  not fail). Catches stall-detection regressions, coverage
  regressions, and anti-bot diagnostic regressions in 5-10 min vs
  the 8-hour public benchmark.

### Internal
- Engine field renamed `_last_save_time` → `_last_activity_time`.
- New `_mark_activity()` helper on both `CrawlEngine` and
  `AsyncCrawlEngine` — single source of truth for the timer reset.
- 4xx / 5xx responses do **not** reset the clock (anti-bot loops can
  still be detected).
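
A minimal sketch of the widened reset rule, assuming a single activity timestamp. The `ActivityClock` class and its `on_*` method names are hypothetical; only `_mark_activity`, `_last_activity_time`, and the 120 s default are names and values taken from this changelog.

```python
import time
from typing import Callable


class ActivityClock:
    """Tracks the last meaningful progress for idle-timeout detection."""

    def __init__(
        self,
        idle_timeout_s: float = 120.0,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._now = now
        self.idle_timeout_s = idle_timeout_s
        self._last_activity_time = now()

    def _mark_activity(self) -> None:
        # Single place where the timer is reset.
        self._last_activity_time = self._now()

    def on_response(self, status: int) -> None:
        # Only 2xx counts; 4xx / 5xx leave the clock alone so that
        # anti-bot loops can still be detected.
        if 200 <= status < 300:
            self._mark_activity()

    def on_save(self) -> None:
        self._mark_activity()

    def on_links_discovered(self, new_urls: int) -> None:
        if new_urls > 0:
            self._mark_activity()

    def stalled(self) -> bool:
        if self.idle_timeout_s <= 0:
            return False  # a timeout of 0 disables the watchdog
        return self._now() - self._last_activity_time >= self.idle_timeout_s
```

Injecting the `now` callable keeps the sketch testable without sleeping.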

### Tests
528 passing (was 521 on v0.10.3). New tests in
`tests/test_v0103_resilience.py` cover all five signal paths (2xx, save,
and new-link discovery reset the clock; 4xx and 5xx do not) and the
no-op cases.

### Migration
No breaking changes. Public API surface (`idle_timeout_s` kwarg, env
var, default of 120 s) unchanged. Users who set
`MARKCRAWL_IDLE_TIMEOUT_S=300` to work around the v0.10.3 mis-fire can
now drop that override — 120 s is correct again.

## [0.10.3] - 2026-05-04

Three generalizable resilience fixes surfaced by the `llm-crawler-benchmarks`
v1.3 cycle. None are site-specific — each applies to any site exhibiting
the symptom.

### Fixed
- **Partial-write recovery (`pages.jsonl` is now line-buffered).** Both
  `_crawl_sync` and `_crawl_async` open the JSONL with `buffering=1`,
  and `save_page` flushes after every row. A SIGKILL / external
  watchdog termination now leaves a complete, readable JSONL on disk
  instead of an empty file. Previously, rows were buffered in
  user-space Python and lost on subprocess kill.
- **Discovery-exhaustion stall detection (`idle_timeout_s`).** Crawls
  where reachable pages < `max_pages` (e.g. HF docs with ~200
  reachable, max_pages=300) used to spin indefinitely on duplicate /
  out-of-scope link-discovery without producing new saves. The engine
  now tracks `_last_save_time` and terminates gracefully when no new
  page has been saved for `idle_timeout_s` seconds (default 120 s,
  overridable per call or via the `MARKCRAWL_IDLE_TIMEOUT_S` env var;
  set to 0 to disable). Generalizes to any site whose link graph
  yields lots of duplicates relative to fresh content.
- **0-page diagnostic logging.** When a crawl finishes with
  `pages_saved == 0`, the engine surfaces the first observed HTTP
  status so users can distinguish "blocked by 403" (anti-bot) from
  "200 but no extractable content" (JS-rendered or `min_words` too
  high) from "no response at all" (DNS error / unreachable seed).
  Catches the newegg-style anti-bot case generically.
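
The partial-write fix relies on standard Python file buffering. The sketch below shows the general technique with hypothetical helper names (`open_jsonl`, `save_row`); it is not the body of `_crawl_sync` or `save_page`.

```python
import json
from typing import IO, Dict


def open_jsonl(path: str) -> IO[str]:
    # buffering=1 selects line buffering in text mode: each completed
    # line is handed to the OS instead of sitting in a user-space buffer.
    return open(path, "a", encoding="utf-8", buffering=1)


def save_row(fh: IO[str], row: Dict) -> None:
    fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    # Explicit flush after every row, in addition to line buffering.
    fh.flush()
```

After each `save_row` call the row is readable from another process even though the writer's handle is still open, which is what makes a killed crawl leave complete lines behind. Note that `flush()` hands data to the operating system; it does not force it to disk the way `os.fsync` would.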

### Added
- `idle_timeout_s` kwarg on the public `crawl()` API plus both
  `CrawlEngine` and `AsyncCrawlEngine` constructors. `None` →
  fall through to the env var, then to `DEFAULT_IDLE_TIMEOUT_S = 120.0`.
- `MARKCRAWL_IDLE_TIMEOUT_S` env var.
- 21 new tests in `tests/test_v0103_resilience.py` covering all three
  fixes (line-buffer guard, idle-timeout firing & disable semantics,
  diagnostic for 200/403/503/no-response, end-to-end 0-page repro).
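
The kwarg, then env var, then default fallthrough can be sketched like this. `resolve_idle_timeout` is a hypothetical helper, and raising `ValueError` on a malformed env value is a choice made for the sketch; how MarkCrawl treats malformed values is not stated here.

```python
import os
from typing import Mapping, Optional

DEFAULT_IDLE_TIMEOUT_S = 120.0
ENV_VAR = "MARKCRAWL_IDLE_TIMEOUT_S"


def resolve_idle_timeout(
    idle_timeout_s: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Resolve the effective timeout: explicit kwarg > env var > default."""
    if idle_timeout_s is not None:
        # An explicit 0 is a real value (disable), not "unset".
        return float(idle_timeout_s)
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is not None:
        return float(raw)  # raises ValueError on malformed input
    return DEFAULT_IDLE_TIMEOUT_S
```

Checking `is not None` rather than truthiness matters here, because `idle_timeout_s=0` must disable the timer instead of falling through to the env var.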

### Migration
- No breaking changes. Default `idle_timeout_s=120` is generous and
  fires only on genuine stalls; for users intentionally running
  long-blocked crawls (e.g. waiting on a slow render), pass
  `idle_timeout_s=0` or set the env var to `0`.

521 tests passing (was 500 on v0.10.2; +21 for the resilience suite).

## [0.10.2] - 2026-05-03

### Fixed
- **Sitemap pre-enumeration deadline.** Recursive `parse_sitemap_xml`
  / `parse_sitemap_xml_async` (`markcrawl.robots`) and the call sites
  in `markcrawl.core` now share a 60 s wallclock budget for the whole
  sitemap-discovery phase. Retailer-style sitemap-indexes that fan out
  into thousands of locale shards (ikea: 2,113) used to consume 200+ s
  before any page got crawled, tripping the zero-output watchdog in
  benchmark harnesses (`llm-crawler-benchmarks` heartbeat fires at
  120 s with 0 pages saved). Once the deadline fires, the parser
  returns whatever URLs it has collected so far and the crawl
  proceeds normally. Async path uses `asyncio.as_completed` so
  pending child-sitemap tasks are cancelled rather than awaited.
- New `time_budget_s` kwarg on both `parse_sitemap_xml` variants
  (default 60.0) and a 2-test addition in `tests/test_sitemap_parallel.py`
  covering the short-circuit and the no-op default.
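
A shared wallclock budget for nested sitemaps can be sketched as below. This is a simplified synchronous illustration, not `parse_sitemap_xml`: the traversal is iterative rather than recursive, `collect_sitemap_urls` and the `fetch` callable are hypothetical, and a fetch that starts before the deadline is allowed to finish.

```python
import time
from typing import Callable, List, Tuple

# fetch(url) returns (child_sitemap_urls, page_urls) for one sitemap.
FetchFn = Callable[[str], Tuple[List[str], List[str]]]


def collect_sitemap_urls(
    root: str,
    fetch: FetchFn,
    time_budget_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Walk a sitemap tree, returning whatever was collected by the deadline."""
    # One deadline for the whole discovery phase, shared by all children.
    deadline = clock() + time_budget_s
    pending = [root]
    urls: List[str] = []
    while pending and clock() < deadline:
        children, pages = fetch(pending.pop())
        urls.extend(pages)
        pending.extend(children)
    # Sitemaps still in `pending` are dropped; the crawl proceeds with
    # the partial URL list.
    return urls
```

Computing the deadline once at the top is the key design point: a per-sitemap timeout would still let thousands of shards add up to minutes of pre-enumeration.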

### Verified locally
| Site                     | v0.10.1                    | v0.10.2                     |
|--------------------------|----------------------------|-----------------------------|
| ikea                     | 0 pages (heartbeat fired)  | 30 pages saved in 49.7 s    |
| huggingface-transformers | regression on benchmark CI | 30 pages saved in 36.2 s    |

500 tests passing (was 498; +2 for the new sitemap-deadline tests).

## [0.10.1] - 2026-05-03

### Changed
- **Local embedder is now the default.** The full ML stack
  (`torch`, `transformers`, `sentence-transformers`, `sentencepiece`)
  ships in the base `pip install markcrawl` so `chunk_semantic` and
  the bake-off-winning `mixedbread-ai/mxbai-embed-large-v1` embedder
  work out of the box — **zero API cost** for embedding at any scale
  (replaces the previous OpenAI 3-small default at $4,505/yr per 100K
  pages).
- **`markcrawl[ml]` is kept as a no-op alias** for backward compat.
  Existing `pip install markcrawl[ml]` invocations continue to work
  identically.
- **`markcrawl.upload.upload(...)`** picks the embedder via
  `markcrawl.embedder.make_default_embedder()`. Override with
  `embedder=<Embedder>`, `embedding_model="text-embedding-3-small"`
  (or any spec `make_embedder` accepts), or the
  `MARKCRAWL_EMBEDDER` env var. Lean install: `pip install --no-deps
  markcrawl beautifulsoup4 lxml markdownify requests certifi tenacity`
  (factory falls back to OpenAI 3-small).

### Added
- **`markcrawl.embedder.make_default_embedder()`** — returns mxbai
  when sentence-transformers is importable, else OpenAI 3-small.
- **`DEFAULT_EMBEDDER_SPEC = "mixedbread-ai/mxbai-embed-large-v1"`** —
  single source of truth for the production default.
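
The fallback rule ("mxbai when sentence-transformers is importable, else OpenAI 3-small") can be sketched as a spec-selection function. This does not reproduce `make_default_embedder()`: `pick_default_embedder_spec` and `FALLBACK_EMBEDDER_SPEC` are hypothetical names, and the sketch returns a spec string instead of constructing an embedder.

```python
import importlib.util
from typing import Callable

DEFAULT_EMBEDDER_SPEC = "mixedbread-ai/mxbai-embed-large-v1"
# Hypothetical constant name; the model string appears in this changelog.
FALLBACK_EMBEDDER_SPEC = "text-embedding-3-small"


def _importable(module: str) -> bool:
    # find_spec returns None for a missing top-level module and does not
    # import the (heavy) package itself.
    return importlib.util.find_spec(module) is not None


def pick_default_embedder_spec(
    is_importable: Callable[[str], bool] = _importable,
) -> str:
    """Prefer the local embedder when its dependency is installed."""
    if is_importable("sentence_transformers"):
        return DEFAULT_EMBEDDER_SPEC
    return FALLBACK_EMBEDDER_SPEC
```

Probing with `find_spec` instead of a `try: import` keeps the check cheap, since importing `sentence_transformers` pulls in `torch`.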

### Migration
- No code changes required for callers using `upload(...)` with
  default kwargs — they automatically pick up the local embedder
  and stop incurring OpenAI charges. To stay on OpenAI, pass
  `embedding_model="text-embedding-3-small"`.

## [0.10.0] - 2026-05-01

### Added
- **Tenacity-backed HTTP retry policy** in the new module `markcrawl.retry`.
  Full-jitter exponential backoff: 5 attempts, 2 s starting delay, 30 s cap.
  Honors the server's `Retry-After` header on 429 responses (clamped to the
  30 s ceiling). Emits one structured INFO log line per retry — `[retry]
  attempt=N status_code=… url=… sleep=Xs elapsed=Ys detail=…`. Applied
  uniformly to both the `httpx` (`_fetch_httpx`, `fetch_async`) and
  `requests` (`_fetch_requests`) code paths so both transports follow the
  same policy.
- **`tests/test_retry.py`** — 36 new unit tests covering header parsing,
  retryable-status detection, the wait strategy, end-to-end retry behavior,
  and policy-constant invariants.
- **CI source-vs-published parity check** at
  `.github/workflows/cli-flag-parity.yml`. Triggers on push to `main` and on
  every `v*` tag. Installs the local source as a wheel, captures `markcrawl
  --help`, force-reinstalls the latest published wheel, captures `--help`
  again, and diffs. Hard-fails on any mismatch when the source version is
  already on PyPI; soft-warns when source is ahead (expected pre-release
  window). Catches the source-vs-PyPI divergence class that produced bug
  fe6f3c39.
- **`tenacity>=8.0,<10.0`** declared in `pyproject.toml` and
  `requirements.txt` install requirements.
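
For readers unfamiliar with full-jitter backoff, the sketch below shows the general formula using only the standard library. It is not the code in `markcrawl.retry` (which uses tenacity): the function names are hypothetical, the doubling schedule from a 2 s base is an assumption about how the stated constants combine, and only the delta-seconds form of `Retry-After` is handled.

```python
import random
from typing import Callable, Optional

MAX_ATTEMPTS = 5      # from the policy described above
BASE_DELAY_S = 2.0    # starting delay
MAX_DELAY_S = 30.0    # cap, also used to clamp Retry-After


def backoff_ceiling(attempt: int) -> float:
    """Upper bound of the sleep before retry `attempt` (1-based)."""
    return min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))


def full_jitter_delay(
    attempt: int, rng: Callable[[], float] = random.random
) -> float:
    # Full jitter: sleep a uniform random time in [0, ceiling).
    return rng() * backoff_ceiling(attempt)


def retry_sleep(
    attempt: int,
    retry_after: Optional[str] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Prefer the server's Retry-After (seconds form), clamped to the cap."""
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), MAX_DELAY_S)
    return full_jitter_delay(attempt, rng)
```

Under these assumptions the ceilings for attempts 1 through 5 are 2, 4, 8, 16, and 30 seconds. Randomizing over the whole interval spreads out clients that were rate-limited at the same moment.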

### Changed
- **`markcrawl/throttle.py` no longer reacts to 429 responses.** Rate-limit
  backoff is now owned exclusively by the retry layer. `AdaptiveThrottle`
  continues to manage inter-request pacing (response-time proportional and
  `robots.txt` Crawl-delay floor) — both layers now compose cleanly without
  double-waiting. The previous uncapped doubling-from-1 s branch in
  `throttle.update()` (lines 46–50 in 0.9.x) was removed; an explicit
  early-return on 429 keeps the response-time signal clean. Tests updated:
  `tests/test_core.py::test_update_throttle_429_is_ignored` and
  `test_update_throttle_429_does_not_disturb_pacing` codify the new contract.
- **`markcrawl/fetch.py::_build_requests_session` no longer mounts a
  `urllib3.util.Retry` adapter.** Transport-level retry was conflicting with
  the new request-level retry layer; consolidating to the tenacity layer
  removes the double-retry surface and the silent transport-level
  no-jitter behavior.
- **`fetch_async` rewritten** to use `tenacity.AsyncRetrying` via
  `markcrawl.retry.with_retry_async` for a single source of truth on the
  retry policy across sync and async paths.

### Documentation
- **README.md** — new "Installation / Upgrading" section near the top with
  `pip install --upgrade` guidance, an explanation of the stale-install
  failure mode, and `head -1 $(which markcrawl)` as the canonical
  diagnostic for "which Python owns my binary".
- **specs/v3-landscape/** — three design docs from the v3 landscape stage
  (`root-cause-diagnosis.md`, `backoff-strategy-design.md`, `fix-plan.md`)
  document the bug investigation, library comparison, and operator runbook.

### Migration notes for downstream consumers
- No CLI-flag changes — every flag in 0.9.x remains in 0.10.0 with identical
  semantics. The retry behavior change is internal and transparent.
- Anyone subclassing `AdaptiveThrottle` and overriding `update()` should be
  aware that `_backoff_count` is now permanently 0 (kept as a public
  attribute for backward compatibility).
- Library consumers calling `markcrawl.fetch.fetch()` directly will see the
  same return contract: a response object on success / exhausted-soft-fail,
  or `None` when the underlying transport raises a transient error five
  times in a row.

### MRR + cost (Track D + Track B from the speed-recovery campaign, merged into v0.10.0)
- **`chunk_markdown` defaults flipped** to the Track D winner: `min_words=250`,
  `section_overlap_words=40`, `strip_markdown_links=True`. Multi-trial validated
  +14% MRR on `all-MiniLM-L6-v2` (6 trials, all positive) and +15% on OpenAI
  3-small (3 trials, all positive). Halves chunks/page (20.3 → 10.49 in the
  local replica), so the index is also smaller.
- **`markcrawl.embedder`** ships an `Embedder` ABC + `OpenAIEmbedder` +
  `LocalSentenceTransformerEmbedder` (with model-specific instruction
  prefixes for asymmetric retrieval). `make_embedder("…")` accepts string
  specs that route to the right backend. The bake-off across 4 embedders
  on the canonical 11-site pool found `mixedbread-ai/mxbai-embed-large-v1`
  the Pareto winner ($0/yr cost, MRR within ±0.020 of OpenAI 3-small).
- **`markcrawl.retrieval.CrossEncoderReranker`** ships as opt-in
  infrastructure (off by default — failed the +0.030 MRR bar on this
  distribution; lift was concentrated on tutorial-class sites only).

### Local-replica benchmark (11-site canonical pool, Track-D chunks, mxbai embedder)
| Metric                  | v0.9.9-rc1   | v0.10.0       | Δ                |
|-------------------------|-------------:|--------------:|-----------------:|
| Mean MRR                | 0.3461       | **0.3859**    | **+0.040 (+11.5%)** |
| Cost at 50 M pages      | $10,152      | **$0**        | **−$10,152/yr**  |
| Chunks per page         | 20.3         | 10.49         | −48% (smaller index) |

## [0.9.3] - 2026-04-26

Last release before the v3 retry overhaul. See git log for the 0.5.0 → 0.9.3
release history (multi-site discovery, screenshot pipeline, image download,
smart-sample, dry-run, etc.).

[0.11.1]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.11.1
[0.11.0]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.11.0
[0.10.6]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.6
[0.10.5]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.5
[0.10.4]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.4
[0.10.3]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.3
[0.10.2]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.2
[0.10.1]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.1
[0.10.0]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.10.0
[0.9.3]: https://github.com/AIMLPM/markcrawl/releases/tag/v0.9.3