---
name: translate-book
description: Translate books (PDF/DOCX/EPUB) into any language using parallel sub-agents. Converts input -> Markdown chunks -> translated chunks -> HTML/DOCX/EPUB/PDF.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion
metadata: {"openclaw":{"requires":{"bins":["python3","pandoc","ebook-convert"],"anyBins":["calibre","ebook-convert"]},"homepage":"https://github.com/deusyu/translate-book"}}
---

# Book Translation Skill

You are a book translation assistant. You translate entire books from one language to another by orchestrating a multi-step pipeline.

## Workflow

### 1. Collect Parameters

Determine the following from the user's message:

- **file_path**: Path to the input file (PDF, DOCX, or EPUB). REQUIRED
- **target_lang**: Target language code (default: `zh`), e.g. zh, en, ja, ko, fr, de, es
- **concurrency**: Number of parallel sub-agents per batch (default: `8`)
- **custom_instructions**: Any additional translation instructions from the user (optional)

If the file path is not provided, ask the user.

### 2. Preprocess: Convert to Markdown Chunks

Run the conversion script to produce chunks:

```bash
python3 {baseDir}/scripts/convert.py "" --olang ""
```

This creates a `{filename}_temp/` directory containing:

- `input.html`, `input.md`: intermediate files
- `chunk0001.md`, `chunk0002.md`, ...: source chunks for translation
- `manifest.json`: chunk manifest for tracking and validation
- `config.txt`: pipeline configuration with metadata

### 3. Discover Chunks

Use Glob to find all source chunks and determine which still need translation:

```
Glob: {filename}_temp/chunk*.md
Glob: {filename}_temp/output_chunk*.md
```

Calculate the set of chunks that have a source file but no corresponding `output_` file. These are the chunks to translate.

If all chunks already have translations, skip to step 5.

### 3.5. Build Glossary (term consistency)

A separate sub-agent translates each chunk with a fresh context.
Without shared state, the same proper noun can drift across multiple translations. The glossary makes every sub-agent see the same canonical translation for the terms that appear in its chunk.

If `/glossary.json` already exists, skip the rebuild: re-running the skill must not overwrite a hand-edited glossary. To force a rebuild, delete the file.

Otherwise:

1. **Sample chunks**: read `chunk0001.md`, the last chunk, and 3 evenly-spaced middle chunks. If `chunk_count < 5`, sample all of them.
2. **Extract terms**: from the samples, identify proper nouns and recurring domain terms that need consistent translation across the book, typically people, places, organizations, and technical concepts. Translate each into the target language. Skip generic vocabulary that any translator would render the same way.
3. **Write `glossary.json`** in the temp dir, matching this v2 schema:

   ```json
   {
     "version": 2,
     "terms": [
       {"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿", "category": "place",
        "aliases": [], "gender": "unknown", "confidence": "medium",
        "frequency": 0, "evidence_refs": [], "notes": ""}
     ],
     "high_frequency_top_n": 20,
     "applied_meta_hashes": {}
   }
   ```

   Existing v1 `glossary.json` files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message.
4. **Count frequencies** by running:

   ```bash
   python3 {baseDir}/scripts/glossary.py count-frequencies ""
   ```

   This scans every `chunk*.md` (excluding `output_chunk*.md`), updates each term's `frequency` field, and writes back atomically.

The glossary is hand-editable. If the user edits a `target` field after a partial run, that is acceptable for now: chunks already translated with the old value are not re-translated automatically yet (commit 3 adds precise re-translation).

### 4. Parallel Translation with Sub-Agents

**Each chunk gets its own independent sub-agent** (1 chunk = 1 sub-agent = 1 fresh context).
This prevents context accumulation and output truncation.

Launch chunks in batches to respect API rate limits:

- Each batch: up to `concurrency` sub-agents in parallel (default: 8)
- Wait for the current batch to complete before launching the next

**Spawn each sub-agent with the following task.** Use whatever sub-agent/background-agent mechanism your runtime provides (e.g. the Agent tool, sessions_spawn, or equivalent).

The output file is `output_` prefixed to the source filename: `chunk0001.md` → `output_chunk0001.md`.

> Translate the file `/chunk.md` to {TARGET_LANGUAGE} and write the result to `/output_chunk.md`. Follow the translation rules below. Output only the translated content, with no commentary.

Each sub-agent receives:

- The single chunk file it is responsible for
- The temp directory path
- The target language
- The translation prompt (see below)
- A per-chunk term table (see "Term table assembly" below)
- Any custom instructions

**Term table assembly**: before spawning a sub-agent, run:

```bash
python3 {baseDir}/scripts/glossary.py print-terms-for-chunk "" "chunk.md"
```

Capture stdout. The CLI emits a 3-column markdown table (`原文 | 别名 | 译文`) of every term that either appears in this chunk (by source OR any alias) OR is in the top-N most-frequent terms book-wide. Inject the table as `{TERM_TABLE}` in rule #13 of the translation prompt. **If stdout is empty (no glossary, or no relevant terms), omit rule #13 from this chunk's prompt entirely**; do not leave a dangling `{TERM_TABLE}` placeholder.

**Each sub-agent's task**:

1. Read the source chunk file (e.g. `chunk0001.md`)
2. Translate the content following the translation rules below
3. Write the translated content to `output_chunk0001.md`
4. Write observations to `output_chunk0001.meta.json` matching the schema below. **Non-blocking**: leave fields empty if unsure; do not invent entities.
Always emit the file (even if all arrays are empty), because its presence + content hash is how the main agent tracks whether feedback was already merged.

**Sub-agent meta schema** (`output_chunk.meta.json`):

```json
{
  "schema_version": 1,
  "new_entities": [
    {"source": "Taig", "target_proposal": "泰格", "category": "person", "evidence": "<≤200-char quote from the chunk>"}
  ],
  "alias_hypotheses": [
    {"variant": "Taig", "may_be_alias_of_source": "Tai", "evidence": "<≤200-char quote>"}
  ],
  "attribute_hypotheses": [
    {"entity_source": "Tai", "attribute": "gender", "value": "male", "confidence": "high", "evidence": "<≤200-char quote>"}
  ],
  "used_term_sources": ["Tai", "Manhattan"],
  "conflicts": [
    {"entity_source": "Tai", "field": "target", "injected": "泰", "observed_better": "太一", "evidence": "<≤200-char quote>"}
  ]
}
```

**Do NOT include a `chunk_id` field**: chunk identity is derived from the filename. Putting it in the payload creates a hallucination hole and validation will reject the file.

The meta file is read by the main agent later and merged into `glossary.json` (see `merge_meta.py`). Sub-agents should fill the schema honestly: cite real quotes from the chunk, never invent entities to "look productive". An empty meta is a perfectly valid output.

**IMPORTANT**: Each sub-agent translates exactly ONE chunk and writes the result directly to the output file. No START/END markers needed.

#### Translation Prompt for Sub-Agents

Include this translation prompt in each sub-agent's instructions (replace `{TARGET_LANGUAGE}` with the actual language name, e.g. "Chinese"):

---

请翻译markdown文件为 {TARGET_LANGUAGE}.
IMPORTANT REQUIREMENTS:

1. 严格保持 Markdown 格式不变,包括标题、链接、图片引用等
2. 仅翻译文字内容,保留所有 Markdown 语法和文件名
3. 删除空链接、不必要的字符和如: 行末的'\\'。页码已由 convert.py 上游处理,不要再删除独立的数字行(可能是年份 1984、章节编号、引用编号等正文内容)。
4. 保证格式和语义准确翻译内容自然流畅
5. 只输出翻译后的正文内容,不要有任何说明、提示、注释或对话内容。
6. 表达清晰简洁,不要使用复杂的句式。请严格按顺序翻译,不要跳过任何内容。
7.
   必须保留所有图片引用,包括:
   - 所有 ![alt](path) 格式的图片引用必须完整保留
   - 图片文件名和路径不要修改(如 media/image-001.png)
   - 图片alt文本可以翻译,但必须保留图片引用结构
   - 不要删除、过滤或忽略任何图片相关内容
   - 图片引用示例:![Figure 1: Data Flow](media/image-001.png) -> ![图1:数据流](media/image-001.png)
   - **原始 HTML 标签(如 `...`、``)必须保持合法**:翻译 `alt`、`title` 等属性值内部文本时,下列字符会破坏 HTML 结构,必须替换为安全形式(仅适用于**原始 HTML 标签的属性值内部**;普通 Markdown 正文、代码块、URL 不要主动转义):

     | 字符 | 在属性值内的危险 | 替换为 |
     |------|---------------|--------|
     | `"` | 闭合 `attr="..."` | 目标语言合适的弯引号(如中文 `“` `”`)或 `&quot;` |
     | `'` | 闭合 `attr='...'` | 目标语言合适的弯引号(如中文 `‘` `’`)或 `&#39;` |
     | `<` | 被解析为新标签 | `&lt;` |
     | `>` | 被解析为标签结束 | `&gt;` |
     | `&` | 被解析为实体起始(除非已是 `&xxx;`) | `&amp;` |

     不要修改 `src`、`href` 等结构性属性的值,只翻译可见文本属性(`alt`、`title`)。
     - 错误示例:`alt="爱丽丝拿着标着"喝我"的瓶子"` ← 内层英文 `"` 把外层 alt 撑断了
     - 正确示例:`alt="爱丽丝拿着标着“喝我”的瓶子"` 或 `alt="爱丽丝拿着标着&quot;喝我&quot;的瓶子"`
8. 智能识别和处理多级标题,按照以下规则添加markdown标记:
   - 主标题(书名、章节名等)使用 # 标记
   - 一级标题(大节标题)使用 ## 标记
   - 二级标题(小节标题)使用 ### 标记
   - 三级标题(子标题)使用 #### 标记
   - 四级及以下标题使用 ##### 标记
9. 标题识别规则:
   - 独立成行的较短文本(通常少于50字符)
   - 具有总结性或概括性的语句
   - 在文档结构中起到分隔和组织作用的文本
   - 字体大小明显不同或有特殊格式的文本
   - 数字编号开头的章节文本(如 "1.1 概述"、"第三章"等)
10. 标题层级判断:
    - 根据上下文和内容重要性判断标题层级
    - 章节类标题通常为高层级(# 或 ##)
    - 小节、子节标题依次降级(### #### #####)
    - 保持同一文档内标题层级的一致性
11. 注意事项:
    - 不要过度添加标题标记,只对真正的标题文本添加
    - 正文段落不要添加标题标记
    - 如果原文已有markdown标题标记,保持其层级结构
12. {CUSTOM_INSTRUCTIONS if provided}
13. 术语一致性:以下术语必须严格使用指定译法,不要自行变换。表格中"原文"列**或"别名"列**任一形式出现在正文中时,都必须翻译为"译文"列对应的形式。

{TERM_TABLE}

markdown文件正文:

---

### 4.5. Merge Sub-Agent Meta Into Glossary (after each batch)

Each sub-agent emitted an `output_chunk.meta.json` alongside its translated chunk. After every batch completes, the main agent merges these observations into the canonical glossary so subsequent batches see an enriched glossary.

1. Run prepare-merge:

   ```bash
   python3 {baseDir}/scripts/merge_meta.py prepare-merge ""
   ```

   Capture stdout JSON. It contains four arrays:

   - `auto_apply`: new entities with no glossary collision and unanimous (target, category) across all proposing chunks.
   - `decisions_needed`: items requiring main-agent judgment.
Each has `id`, `kind`, an `options` array, and the data needed to pick. Kinds:
     - `alias`: `{variant, candidate_source, evidence}`. Choices: `yes_alias` / `no_separate_entity` / `skip`.
     - `conflict`: `{entity_source, field, current, proposed, evidence}`. Choices: `keep_current` / `accept_proposed` / `record_in_notes`.
     - `new_entity_existing_alias`: sub-agents propose `proposed_source` as a new entity, but it's already someone's alias. `{proposed_source, currently_alias_of, promoted_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: one `use_variant_N` per distinct (target, category) promotion variant (promote `proposed_source` to standalone with that target+category, removing it from the host's aliases) / `keep_as_alias` / `skip`.
     - `existing_entity_conflict`: sub-agents proposed a (target, category) for `entity_source` that differs from the canonical. Multiple distinct differing proposals all get exposed. `{entity_source, current_target, current_category, proposed_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: `keep_current` / one `use_variant_N` per competing proposal (overwrites both target AND category, stamps the prior values into notes) / `record_in_notes` (canonical unchanged; every proposed variant gets logged to notes).
     - `alias_or_new_entity`: `variant` has multiple competing options that can't all coexist under v2's surface-form uniqueness rule. Triggered when (a) `variant` was proposed both as a new standalone entity AND as an alias of one or more candidates, OR (b) `variant` was proposed as an alias of two or more different candidates with no standalone competitor. `{variant, alias_candidates: [{candidate_source, evidence, evidence_chunks}, ...], standalone_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`.
Choices: one `use_alias_N` per candidate (attach as alias of that candidate), one `use_standalone_N` per competing standalone proposal (add as standalone with that target+category), or `skip`.
     - `conflicting_new_entity_proposals`: `{source, variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: `use_variant_0`, `use_variant_1`, ..., `skip`.
   - `consumed_chunk_ids`: every meta file scanned this round (regardless of whether it produced a finding). The hashes of these files get recorded in `applied_meta_hashes` on apply.
   - `malformed_meta_chunk_ids`: meta files that failed validation. Quarantined: not consumed, not crashing the run. Surface them in your batch progress.

2. **If `consumed_chunk_ids` is empty** → nothing was scanned; skip to Step 5.

3. **If `consumed_chunk_ids` is non-empty but both `auto_apply` and `decisions_needed` are empty** → still pipe `{"auto_apply": [], "decisions": [], "consumed_chunk_ids": [...]}` into `apply-merge` so the hashes get recorded. **Skipping this is the bug**: no-op metas would otherwise be re-scanned forever.

4. **Otherwise, resolve each decision**:
   - Read its evidence quotes inline.
   - Pick one option from its `options` array.
   - Build a `decisions` entry that round-trips the original decision plus your choice. The entry MUST include the original `kind` and (for `conflicting_new_entity_proposals`) the `variants` array, so apply-merge can validate and act:

     ```json
     {"id": "d1", "kind": "alias", "variant": "Taig", "candidate_source": "Tai", "choice": "yes_alias"}
     ```

5. Pipe the decisions JSON into apply-merge:

   ```bash
   echo '{"auto_apply": [...], "decisions": [...], "consumed_chunk_ids": [...]}' \
     | python3 {baseDir}/scripts/merge_meta.py apply-merge ""
   ```

   Surface the summary JSON (`auto_applied`, `decisions_resolved`, `consumed_chunks`, `errors`) in your batch progress message.
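
As a rough illustration of steps 2 through 5 above, the sketch below assembles the object that gets piped into `apply-merge`. The helper name `build_apply_payload` and the `choices` mapping are hypothetical and not part of the skill's scripts; the field names are taken from the `prepare-merge` description. It assumes `options` is a list of strings and that copying every decision field except `options` is an acceptable round-trip, neither of which this document confirms.

```python
import json


def build_apply_payload(prepared, choices):
    """Build the apply-merge input from parsed prepare-merge output.

    prepared: dict parsed from prepare-merge stdout.
    choices:  hypothetical mapping of decision id -> chosen option.
    Returns None when nothing was scanned (the caller skips to step 5).
    """
    consumed = prepared.get("consumed_chunk_ids", [])
    if not consumed:
        return None

    decisions = []
    for item in prepared.get("decisions_needed", []):
        choice = choices[item["id"]]
        # Assumption: options is a list of plain strings.
        if choice not in item["options"]:
            raise ValueError(f"{choice!r} is not an option for decision {item['id']}")
        # Round-trip the original decision (kind, variants, ...) plus the choice.
        entry = {key: value for key, value in item.items() if key != "options"}
        entry["choice"] = choice
        decisions.append(entry)

    # Even with no findings, consumed ids are still sent so hashes get recorded.
    return {
        "auto_apply": prepared.get("auto_apply", []),
        "decisions": decisions,
        "consumed_chunk_ids": consumed,
    }


if __name__ == "__main__":
    # No-op round: metas were scanned but produced no findings.
    # The id format "chunk0001" is an assumption for illustration.
    prepared = {"auto_apply": [], "decisions_needed": [], "consumed_chunk_ids": ["chunk0001"]}
    print(json.dumps(build_apply_payload(prepared, {}), ensure_ascii=False))
```

The `None` return keeps the "nothing scanned" case distinct from the no-op case, which must still reach `apply-merge`.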
**apply-merge is transactional.** If any decision is malformed (wrong choice for kind, missing fields, references a non-existent entity), the entire batch aborts with a non-zero exit and stderr details: no glossary mutation, no hashes recorded. On non-zero exit, fix the offending decision and re-pipe; `prepare-merge` will surface the same proposals because nothing was consumed.

**Decision order in the input list is not significant.** `apply-merge` internally dispatches entity-creating decisions before alias-attaching ones, so `yes_alias` decisions whose candidate is created by another decision in the same batch (a `use_standalone_N`, `use_variant_N`, or `promote_to_separate_entity`) succeed regardless of the order you pass them in. Alias chains (e.g. `Taighi → Taig` where `Taig → Tai` is also a pending alias decision) resolve via a fixed-point loop within the alias-attacher pass; you don't need to topo-sort or sequence chained aliases manually.

On a fresh run after a previous interrupted batch, `prepare-merge` will pick up any meta files left behind. Don't manually delete them.

### 5. Verify Completeness and Retry

After all batches complete, use Glob to check that every source chunk has a corresponding output file.

If any are missing, retry them, each missing chunk as its own sub-agent. Maximum 2 attempts per chunk (initial + 1 retry).

Also read `manifest.json` and verify:

- Every chunk id has a corresponding output file
- No output file is empty (0 bytes)

Then run the meta-merge observability snapshot:

```bash
python3 {baseDir}/scripts/merge_meta.py status ""
```

Surface a one-line summary in the verification report:

> Translated chunks: 50 • Meta files: 48 found / 47 consumed • Malformed: 1 (chunk0099; see stderr) • Chunks missing meta: chunk0017, chunk0042

Severity rules (none of these fail the run, since meta is non-blocking):

- `unmerged_meta_files > 0` after Step 4.5 ran → bug, flag prominently. Resume should have caught this.
- `malformed_meta_files > 0` → sub-agent emitted invalid meta; print chunk_ids and a "fix the file by hand and re-run if you want this chunk's feedback merged" note.
- `meta_files_found < translated_chunks` → sub-agent-compliance issue (some chunks didn't emit meta at all). Print missing chunk_ids.

Report any chunks that failed translation after retry.

### 6. Translate Book Title

Read `config.txt` from the temp directory to get the `original_title` field. Translate the title to the target language. For Chinese, wrap it in book-title marks (书名号): `《translated_title》`.

### 7. Post-process: Merge and Build

Run the build script with the translated title:

```bash
python3 {baseDir}/scripts/merge_and_build.py --temp-dir "" --title "" --cleanup
```

The `--cleanup` flag removes intermediate files (chunks, input.html, etc.) after a fully successful build. If the user asked to keep intermediates, omit `--cleanup`.

The script reads `output_lang` from `config.txt` automatically. Optional overrides: `--lang`, `--author`.

This produces in the temp directory:

- `output.md`: merged translated markdown
- `book.html`: web version with floating TOC
- `book_doc.html`: ebook version
- `book.docx`, `book.epub`, `book.pdf`: format conversions (requires Calibre)

### 8. Report Results

Tell the user:

- Where the output files are located
- How many chunks were translated
- The translated title
- List generated output files with sizes
- Any format generation failures
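
One way to gather the file list and the failures for this report is sketched below. The helper `summarize_outputs` is hypothetical and not part of the skill's scripts; the file names are the ones listed in step 7, and treating a zero-byte file as a failed generation is an assumption.

```python
from pathlib import Path

# Output names as listed in step 7 of this document.
EXPECTED_OUTPUTS = [
    "output.md",
    "book.html",
    "book_doc.html",
    "book.docx",
    "book.epub",
    "book.pdf",
]


def summarize_outputs(temp_dir):
    """Return (generated, missing) for the expected build outputs.

    generated: list of (file name, size in bytes) for non-empty files.
    missing:   list of file names that are absent or empty.
    """
    base = Path(temp_dir)
    generated, missing = [], []
    for name in EXPECTED_OUTPUTS:
        path = base / name
        # Assumption: an empty file counts as a failed generation.
        if path.is_file() and path.stat().st_size > 0:
            generated.append((name, path.stat().st_size))
        else:
            missing.append(name)
    return generated, missing


if __name__ == "__main__":
    # "book_temp" is a placeholder directory name for illustration.
    generated, missing = summarize_outputs("book_temp")
    for name, size in generated:
        print(f"{name}: {size} bytes")
    if missing:
        print("Not generated:", ", ".join(missing))
```

Returning the two lists separately lets the report show sizes for successful outputs and name the formats that failed, which covers the last two bullets above.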