--- name: docx description: "QUERY LENGTH LIMIT EXCEEDED. MAX ALLOWED QUERY : 500 CHARS" license: Proprietary. LICENSE.txt has complete terms --- # (中文)DOCX creation, editing, and analysis ## Overview A .docx file is a ZIP archive containing XML files. ## Quick Reference | Task | Approach | |------|----------| | Read/analyze content | `pandoc` or unpack for raw XML | | Create new document | Use `docx-js` - see Creating New Documents below | | Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below | ### Converting .doc to .docx Legacy `.doc` files must be converted before editing: ```bash python scripts/office/soffice.py --headless --convert-to docx document.doc ``` ### Reading Content ```bash # Text extraction with tracked changes pandoc --track-changes=all document.docx -o output.md # Raw XML access python scripts/office/unpack.py document.docx unpacked/ ``` ### Converting to Images ```bash python scripts/office/soffice.py --headless --convert-to pdf document.docx pdftoppm -jpeg -r 150 document.pdf page ``` ### Accepting Tracked Changes To produce a clean document with all tracked changes accepted (requires LibreOffice): ```bash python scripts/accept_changes.py input.docx output.docx ``` --- ## Creating New Documents Generate .docx files with JavaScript, then validate. Install: `npm install -g docx` ### Setup ```javascript const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun, Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink, TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType, VerticalAlign, PageNumber, PageBreak } = require('docx'); const doc = new Document({ sections: [{ children: [/* content */] }] }); Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer)); ``` ### Validation After creating the file, validate it. If validation fails, unpack, fix the XML, and repack. ```bash python scripts/office/validate.py doc.docx ``` ### Page Size ```javascript // CRITICAL: docx-js defaults to A4, not US Letter // Always set page size explicitly for consistent results sections: [{ properties: { page: { size: { width: 12240, // 8.5 inches in DXA height: 15840 // 11 inches in DXA }, margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins } }, children: [/* content */] }] ``` **Common page sizes (DXA units, 1440 DXA = 1 inch):** | Paper | Width | Height | Content Width (1" margins) | |-------|-------|--------|---------------------------| | US Letter | 12,240 | 15,840 | 9,360 | | A4 (default) | 11,906 | 16,838 | 9,026 | **Landscape orientation:** docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap: ```javascript size: { width: 12240, // Pass SHORT edge as width height: 15840, // Pass LONG edge as height orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML }, // Content width = 15840 - left margin - right margin (uses the long edge) ``` ### Styles (Override Built-in Headings) Use Arial as the default font (universally supported). Keep titles black for readability. ```javascript const doc = new Document({ styles: { default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default paragraphStyles: [ // IMPORTANT: Use exact IDs to override built-in styles { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true, run: { size: 32, bold: true, font: "Arial" }, paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true, run: { size: 28, bold: true, font: "Arial" }, paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } }, ] }, sections: [{ children: [ new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }), ] }] }); ``` ### Lists (NEVER use unicode bullets) ```javascript // ❌ WRONG - never manually insert bullet characters new Paragraph({ children: [new TextRun("• Item")] }) // BAD new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD // ✅ CORRECT - use numbering config with LevelFormat.BULLET const doc = new Document({ numbering: { config: [ { reference: "bullets", levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT, style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, { reference: "numbers", levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT, style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, ] }, sections: [{ children: [ new Paragraph({ numbering: { reference: "bullets", level: 0 }, children: [new TextRun("Bullet item")] }), new Paragraph({ numbering: { reference: "numbers", level: 0 }, children: [new TextRun("Numbered item")] }), ] }] }); // ⚠️ Each reference creates INDEPENDENT numbering // Same reference = continues (1,2,3 then 4,5,6) // Different reference = restarts (1,2,3 then 1,2,3) ``` ### Tables **CRITICAL: Tables need dual widths** - set both `columnWidths` on the table AND `width` on each cell. Without both, tables render incorrectly on some platforms. ```javascript // CRITICAL: Always set table width for consistent rendering // CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" }; const borders = { top: border, bottom: border, left: border, right: border }; new Table({ width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs) columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch) rows: [ new TableRow({ children: [ new TableCell({ borders, width: { size: 4680, type: WidthType.DXA }, // Also set on each cell shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width) children: [new Paragraph({ children: [new TextRun("Cell")] })] }) ] }) ] }) ``` **Table width calculation:** Always use `WidthType.DXA` — `WidthType.PERCENTAGE` breaks in Google Docs. ```javascript // Table width = sum of columnWidths = content width // US Letter with 1" margins: 12240 - 2880 = 9360 DXA width: { size: 9360, type: WidthType.DXA }, columnWidths: [7000, 2360] // Must sum to table width ``` **Width rules:** - **Always use `WidthType.DXA`** — never `WidthType.PERCENTAGE` (incompatible with Google Docs) - Table width must equal the sum of `columnWidths` - Cell `width` must match corresponding `columnWidth` - Cell `margins` are internal padding - they reduce content area, not add to cell width - For full-width tables: use content width (page width minus left and right margins) ### Images ```javascript // CRITICAL: type parameter is REQUIRED new Paragraph({ children: [new ImageRun({ type: "png", // Required: png, jpg, jpeg, gif, bmp, svg data: fs.readFileSync("image.png"), transformation: { width: 200, height: 150 }, altText: { title: "Title", description: "Desc", name: "Name" } // All three required })] }) ``` ### Page Breaks ```javascript // CRITICAL: PageBreak must be inside a Paragraph new Paragraph({ children: [new PageBreak()] }) // Or use pageBreakBefore new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] }) ``` ### Table of Contents ```javascript // CRITICAL: Headings must use HeadingLevel ONLY - no custom styles new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }) ``` ### Headers/Footers ```javascript sections: [{ properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch }, headers: { default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] }) }, footers: { default: new Footer({ children: [new Paragraph({ children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })] })] }) }, children: [/* content */] }] ``` ### Critical Rules for docx-js - **Set page size explicitly** - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents - **Landscape: pass portrait dimensions** - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE` - **Never use `\n`** - use separate Paragraph elements - **Never use unicode bullets** - use `LevelFormat.BULLET` with numbering config - **PageBreak must be in Paragraph** - standalone creates invalid XML - **ImageRun requires `type`** - always specify png/jpg/etc - **Always set table `width` with DXA** - never use `WidthType.PERCENTAGE` (breaks in Google Docs) - **Tables need dual widths** - `columnWidths` array AND cell `width`, both must match - **Table width = sum of columnWidths** - for DXA, ensure they add up exactly - **Always add cell margins** - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding - **Use `ShadingType.CLEAR`** - never SOLID for table shading - **TOC requires HeadingLevel only** - no custom styles on heading paragraphs - **Override built-in styles** - use exact IDs: "Heading1", "Heading2", etc. - **Include `outlineLevel`** - required for TOC (0 for H1, 1 for H2, etc.) --- ## Editing Existing Documents **Follow all 3 steps in order.** ### Step 1: Unpack ```bash python scripts/office/unpack.py document.docx unpacked/ ``` Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`“` etc.) so they survive editing. Use `--merge-runs false` to skip run merging. ### Step 2: Edit XML Edit files in `unpacked/word/`. See XML Reference below for patterns. **Use "Claude" as the author** for tracked changes and comments, unless the user explicitly requests use of a different name. **Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced. **CRITICAL: Use smart quotes for new content.** When adding text with apostrophes or quotes, use XML entities to produce smart quotes: ```xml Here’s a quote: “Hello” ``` | Entity | Character | |--------|-----------| | `‘` | ‘ (left single) | | `’` | ’ (right single / apostrophe) | | `“` | “ (left double) | | `”` | ” (right double) | **Adding comments:** Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML): ```bash python scripts/comment.py unpacked/ 0 "Comment text with & and ’" python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0 python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name ``` Then add markers to document.xml (see Comments in XML Reference). ### Step 3: Pack ```bash python scripts/office/pack.py unpacked/ output.docx --original document.docx ``` Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip. **Auto-repair will fix:** - `durableId` >= 0x7FFFFFFF (regenerates valid ID) - Missing `xml:space="preserve"` on `` with whitespace **Auto-repair won't fix:** - Malformed XML, invalid element nesting, missing relationships, schema violations ### Common Pitfalls - **Replace entire `` elements**: When adding tracked changes, replace the whole `...` block with `......` as siblings. Don't inject tracked change tags inside a run. - **Preserve `` formatting**: Copy the original run's `` block into your tracked change runs to maintain bold, font size, etc. --- ## XML Reference ### Schema Compliance - **Element order in ``**: ``, ``, ``, ``, ``, `` last - **Whitespace**: Add `xml:space="preserve"` to `` with leading/trailing spaces - **RSIDs**: Must be 8-digit hex (e.g., `00AB1234`) ### Tracked Changes **Insertion:** ```xml inserted text ``` **Deletion:** ```xml deleted text ``` **Inside ``**: Use `` instead of ``, and `` instead of ``. **Minimal edits** - only mark what changes: ```xml The term is 30 60 days. ``` **Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `` inside ``: ```xml ... Entire paragraph content being deleted... ``` Without the `` in ``, accepting changes leaves an empty paragraph/list item. **Rejecting another author's insertion** - nest deletion inside their insertion: ```xml their inserted text ``` **Restoring another author's deletion** - add insertion after (don't modify their deletion): ```xml deleted text deleted text ``` ### Comments After running `comment.py` (see Step 2), add markers to document.xml. For replies, use `--parent` flag and nest markers inside the parent's. **CRITICAL: `` and `` are siblings of ``, never inside ``.** ```xml deleted more text text ``` ### Images 1. Add image file to `word/media/` 2. Add relationship to `word/_rels/document.xml.rels`: ```xml ``` 3. Add content type to `[Content_Types].xml`: ```xml ``` 4. Reference in document.xml: ```xml ``` --- ## Dependencies - **pandoc**: Text extraction - **docx**: `npm install -g docx` (new documents) - **LibreOffice**: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`) - **Poppler**: `pdftoppm` for images