# Technical Design

[English](./technical-design.md) | [中文](./zh/technical-design.md)

---

## Design Philosophy — AI as Your Designer, Not Your Finisher

The generated PPTX is a **design draft**, not a finished product. Think of it like an architect's rendering: the AI handles visual design, layout, and content structure — delivering a high-quality starting point. For truly polished results, **expect to do your own finishing work** in PowerPoint: swapping shapes, refining charts, adjusting colors, replacing placeholder graphics with native objects. The goal is to eliminate 90% of the blank-page work, not to replace human judgment in the final mile. Don't expect one AI pass to do everything — that's not how good presentations are made.

**A tool's ceiling is your ceiling.** PPT Master amplifies the skills you already have — if you have a strong sense of design and content, it helps you execute faster. If you don't know what a great presentation looks like, the tool won't know either. The output quality is ultimately a reflection of your own taste and judgment.


---

## System Architecture

```
User Input (PDF/DOCX/XLSX/URL/Markdown)
    ↓
[Source Content Conversion] → source_to_md/pdf_to_md.py / doc_to_md.py / excel_to_md.py / ppt_to_md.py / web_to_md.py
    ↓
[Create Project] → project_manager.py init --format
    ↓
[Template (optional)] — default: skip, proceed with free design
    User names a template: copy template files into the project
    Need a new global template: use /create-template workflow separately
    ↓
[Strategist] - Eight Confirmations & Design Specifications → design_spec.md + spec_lock.md
    ↓
[Image Acquisition] (when any row in the resource list needs AI generation or web search)
    ↓
[Executor]
    ├── Visual construction: generate all SVG pages → svg_output/
    ├── [Quality Check] svg_quality_checker.py (mandatory — must pass with 0 errors)
    └── Notes generation: complete speaker notes → notes/total.md
    ↓
[Chart calibration (optional)] → verify-charts workflow (for decks containing data charts)
    ↓
[Visual self-check (optional, opt-in)] → visual-review workflow (only when the user explicitly requests it)
    ↓
[Post-processing] → total_md_split.py (split notes) → finalize_svg.py → svg_to_pptx.py
    ↓
Output:
    exports/
    ├── presentation_.pptx       ← Native shapes (DrawingML) — canonical output, edit & deliver from here
    └── presentation__svg.pptx   ← SVG snapshot pptx — pixel-perfect visual reference (opt-in via --svg-snapshot)
    # Always written in default-flow mode (no -o)
    backup//
    └── svg_output/              ← Archived Executor SVG source (rerun finalize_svg → svg_to_pptx to rebuild)
```

---

## Technical Pipeline

**The pipeline: AI generates SVG → post-processing converts to DrawingML (PPTX).**

The full flow breaks into three stages:

**Stage 1 — Content Understanding & Design Planning**
Source documents (PDF/DOCX/URL/Markdown) are converted to structured text. The Strategist role analyzes the content, plans the slide structure, and confirms the visual style, producing a complete design specification.

**Stage 2 — AI Visual Generation**
The Executor role generates each slide as an SVG file. The output of this stage is a **design draft**, not a finished product.

**Stage 3 — Engineering Conversion**
Post-processing scripts convert SVG to DrawingML. Every shape becomes a real native PowerPoint object — clickable, editable, recolorable — not an embedded image.

---

## Why SVG?

SVG sits at the center of this pipeline. The choice was made by elimination.

**Direct DrawingML generation** looks like the shortest path — skip the intermediate format and have AI output PowerPoint's underlying XML. But DrawingML is extremely verbose; a simple rounded rectangle requires dozens of lines of nested XML. AI has far less training data for it than for SVG, so output is unreliable, and debugging by eye is nearly impossible.

**HTML/CSS** is one of the formats AI knows best. But HTML and PowerPoint have fundamentally different world views. HTML describes a *document* — headings, paragraphs, lists — where element positions are determined by content flow. PowerPoint describes a *canvas* — every element is an independent, absolutely positioned object with no flow and no context. This isn't just a layout calculation problem; it's a structural mismatch. Even if you solved the browser layout engine problem (what Chromium does in millions of lines of code), an HTML `` still has no natural mapping to a set of independent shapes on a slide.

**WMF/EMF** (Windows Metafile) is Microsoft's own native vector graphics format and shares direct ancestry with DrawingML — the conversion loss would be minimal. But AI has essentially no training data for it, so this path is dead on arrival. Notably, even Microsoft's own format loses to SVG here.

**SVG as embedded images** is the simplest path — render each slide as an image and embed it. But this destroys editability entirely: shapes become pixels, text cannot be selected, colors cannot be changed. It is no different from a screenshot.

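
For contrast with the image-embedding route, the native-shape route can be sketched for a single element. The snippet below is a minimal illustration only, not the project's `svg_to_pptx.py`: the helper names `emu` and `rect_to_drawingml` are hypothetical, it assumes SVG units are pixels at 96 DPI (9525 EMU per pixel), and it handles only `#RRGGBB` fills. It also shows how much XML one rounded rectangle expands into.

```python
import xml.etree.ElementTree as ET

# Assumption: SVG user units are pixels at 96 DPI, so 914400 EMU/inch / 96 = 9525.
EMU_PER_PX = 9525
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
ET.register_namespace("a", A_NS)
ET.register_namespace("p", P_NS)


def emu(px: float) -> str:
    """Convert a pixel length to an EMU integer string."""
    return str(round(px * EMU_PER_PX))


def rect_to_drawingml(svg_rect: str, shape_id: int = 2) -> str:
    """Hypothetical helper: translate one SVG <rect> into a native <p:sp> shape."""
    rect = ET.fromstring(svg_rect)
    x, y, w, h = (float(rect.get(k, "0")) for k in ("x", "y", "width", "height"))
    rx = float(rect.get("rx", "0"))
    fill = rect.get("fill", "#000000").lstrip("#").upper()  # #RRGGBB only
    rounded = rx > 0 and min(w, h) > 0

    sp = ET.Element(f"{{{P_NS}}}sp")
    nv = ET.SubElement(sp, f"{{{P_NS}}}nvSpPr")
    ET.SubElement(nv, f"{{{P_NS}}}cNvPr", id=str(shape_id), name=f"Rect {shape_id}")
    ET.SubElement(nv, f"{{{P_NS}}}cNvSpPr")
    ET.SubElement(nv, f"{{{P_NS}}}nvPr")

    sp_pr = ET.SubElement(sp, f"{{{P_NS}}}spPr")
    xfrm = ET.SubElement(sp_pr, f"{{{A_NS}}}xfrm")
    ET.SubElement(xfrm, f"{{{A_NS}}}off", x=emu(x), y=emu(y))
    ET.SubElement(xfrm, f"{{{A_NS}}}ext", cx=emu(w), cy=emu(h))

    geom = ET.SubElement(sp_pr, f"{{{A_NS}}}prstGeom",
                         prst="roundRect" if rounded else "rect")
    av_lst = ET.SubElement(geom, f"{{{A_NS}}}avLst")
    if rounded:
        # "adj" is the corner radius as a fraction of the shorter side,
        # expressed in 1/100000 units and capped at 50000.
        adj = round(min(rx / min(w, h), 0.5) * 100000)
        ET.SubElement(av_lst, f"{{{A_NS}}}gd", name="adj", fmla=f"val {adj}")

    solid = ET.SubElement(sp_pr, f"{{{A_NS}}}solidFill")
    ET.SubElement(solid, f"{{{A_NS}}}srgbClr", val=fill)
    return ET.tostring(sp, encoding="unicode")


if __name__ == "__main__":
    print(rect_to_drawingml(
        '<rect x="100" y="50" width="200" height="100" rx="12" fill="#4472c4"/>'
    ))
```

The one-line SVG input becomes a nested shape tree, but every field in it maps from an SVG attribute by simple arithmetic, which is what keeps the result editable in PowerPoint.
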
SVG wins because it shares the same world view as DrawingML: both are absolute-coordinate 2D vector graphics formats built around the same concepts:

| SVG | DrawingML |
|---|---|
| `` | `` |
| `` | `` |
| `` / `` | `` |
| `transform="translate/scale/rotate"` | `` |
| `linearGradient` / `radialGradient` | `` |
| `fill-opacity` / `stroke-opacity` | `` |

The conversion is a translation between two dialects of the same idea — not a format mismatch.

SVG is also the only format that simultaneously satisfies every role in the pipeline: **AI can reliably generate it, humans can preview and debug it in any browser, and scripts can precisely convert it** — all before a single line of DrawingML is written.

---

## Source Content Conversion

Source documents (PDF / DOCX / EPUB / XLSX / PPTX / web pages) are normalized into Markdown before the pipeline starts — this is the source of truth Strategist reads from. Two design choices shape the converters:

**Native-Python first, external binaries as fallback.** Common formats are handled by pure-Python wheels; pandoc is only invoked for the long tail of niche formats. Forcing every user to install system binaries they may not have permission for is a usability tax that doesn't pay off when 95% of inputs are docx / pdf / html.

**TLS fingerprint impersonation for high-security sites.** Web fetching impersonates a Chrome TLS fingerprint by default. WeChat Official Accounts and several CDNs block Python's default `requests` handshake outright, and a single dependency that handles them is preferable to maintaining a parallel Node.js fetcher as the primary path.

---

## Project Structure & Lifecycle

The non-obvious bit of the project layout is `import-sources`'s **asymmetric default**: files outside the repo are *copied* in (preserving the user's original), files inside the repo are *moved* in (so intermediate artifacts don't get committed by accident).
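
The copy-versus-move rule can be sketched in a few lines. This is an illustration under assumptions, not the actual `import-sources` code: the function name `import_source` and its parameters are hypothetical, and "inside the repo" is taken to mean "the resolved path lies under the repo root".

```python
import shutil
from pathlib import Path


def import_source(src: Path, repo_root: Path, dest_dir: Path) -> str:
    """Hypothetical sketch: copy outside-repo files, move inside-repo files."""
    src = src.resolve()
    repo_root = repo_root.resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name

    if src.is_relative_to(repo_root):  # Python 3.9+
        # Inside the repo: treat as a transient artifact and move it.
        shutil.move(str(src), str(dest))
        return "moved"

    # Outside the repo: leave the user's original untouched.
    shutil.copy2(src, dest)
    return "copied"
```
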
The asymmetry tracks the natural risk profile — outside-repo files are typically user assets we shouldn't disturb, inside-repo files are typically transient artifacts that should be cleaned up. A single uniform default would get one or the other case wrong every time.

---

## Canvas Format System

PPT Master is not PPT-only — the same SVG → DrawingML pipeline produces square posters, 9:16 stories, A4 prints. Format-specific conventions (ratios, safe zones, brand areas) live in [`references/canvas-formats.md`](../skills/ppt-master/references/canvas-formats.md).

The architectural choice worth flagging: **viewBox is in pixels, not absolute units.** Pixel space makes layout reasoning unambiguous for the AI Executor (`x="100"` is unambiguously left + 100px) and inspectable in any browser. Conversion to PowerPoint's EMU happens once at export — picking pixels means the rest of the pipeline (Strategist, Executor, quality checker, post-processing) never thinks in EMU, which would be hostile both to AI generation and to human debugging.

---

## Template System & Optional Path

Templates are **opt-in, not default**. The default Strategist flow is free design — AI invents the visual system from the source content alone. The template path activates only on explicit user trigger.

**Why default to free design.** Templates are floors that easily become ceilings: they lock the deck into the template's visual idioms regardless of how the content actually wants to be presented. Free-design layouts derive structure from the source content rather than imposing it from a fixed grammar, so the visual rhythm tracks the content rather than fighting it. Constrained mode is genuinely better in narrow cases (brand-locked decks, strongly-typed scenarios like academic defense or government report), so it stays available — but the AI doesn't proactively reach for it; the user does.

**No proactive matching.** The AI does not suggest, hint at, or auto-map content to a template.
Even when a deck looks like an obvious fit for an existing template, the AI stays silent and proceeds with free design unless the user has named the template. The reason is reliability over discoverability: matching content to templates is a judgment call that drifts as the library evolves, and a wrong "you might want X" pushes the user toward a commitment the AI cannot reliably make. Discoverability is handed to docs (the three `templates/{brands,layouts,decks}/README.md` per-kind indexes) and to the explicit query path ("what templates are available?"), not the runtime prompt.

**Layouts are opt-in; charts and icons are not.** The asymmetry isn't an inconsistency — *layout* is what locks visual idiom (the floor/ceiling problem above), while charts and icons are reusable primitives that don't impose deck-wide style. Same `templates/` directory, different role in the visual contract.

---

## Role System: Three Specialized Agents in a Single Pipeline

PPT Master uses **role switching within one main agent** rather than parallel sub-agents. The choice has three connected reasons:

**Why one agent, not parallel sub-agents.** Page design depends on the full upstream context — Strategist's color choices, the image resources that actually got acquired (vs failed and substituted), prior pages' visual rhythm. Sub-agents would start with a stale partial snapshot of that context and produce visually drifting decks. The same logic forbids batched page generation (e.g., five pages per turn): batching accelerates context compression and the deck's visual consistency degrades faster than the speed gain is worth.

**Why role-specialized references, not one mega prompt.** Strategist runs in "negotiate with user" mode (open-ended, conversational, willing to back up); Executor runs in "produce strict XML" mode (no improvisation, no missing attributes).
Mixing both into one prompt forces the model to hold incompatible discipline in the same turn — every prompt-engineering pathology of mode-mixing shows up. Splitting into per-role files lets each role load only what it needs and discard the rest.

**Eight Confirmations as the only blocking gate.** Strategist ends with eight bundled user confirmations (canvas / page count / audience / style / color / icon / typography / image) presented as one blocking decision point. After confirmation, the pipeline runs to completion without further interrupts. The reason it's bundled and singular: design choices are correlated (color affects icon library affects typography), so resolving them together produces coherent decisions, while spreading confirmations across phases would invite contradictory user inputs and force backtracking.

**User-provided image analysis goes through metadata, not pixels.** When the user supplies images, Strategist runs an extractor that summarizes dimensions, EXIF orientation, dominant color, and subject — and reasons over that text. Opening image bytes directly is forbidden because the LLM doesn't need pixels to make layout decisions; it needs facts that fit on a page (aspect ratio for placement, color tone for palette compatibility, subject for slide assignment). Pixel reading would burn context for no decision-quality gain.

**Per-page spec_lock re-read** is the long-deck anti-drift mechanism — full rationale in § Spec Propagation below.

---

## Execution Discipline

The pipeline is enforced by an 8-rule set in [`SKILL.md` § Global Execution Discipline](../skills/ppt-master/SKILL.md) — that file is authoritative; the rules live there. They look bureaucratic but exist because LLMs default to "let me solve the whole problem in this turn", which is exactly the wrong shape for a serial pipeline where each step's output is bounded, checkpointed, and consumed by the next.
The rules collectively close failure modes that surfaced repeatedly in practice: out-of-order execution, AI proxying user design decisions, cross-phase bundling, missing prerequisites, speculative pre-work, sub-agent context loss, page-batching drift, and long-deck color/font drift.

The Role Switching Protocol (mandated read of `references/.md` before mode change) serves two reinforcing purposes: forcing fresh role instructions into context overrides drift from the previous mode, and the visible marker in the conversation transcript creates an audit trail so the user can see when the agent moved between modes — critical when reviewing why a particular decision was made.

---

## Spec Propagation: spec_lock.md as Execution Contract

The Strategist phase produces two artifacts that look redundant but serve different masters:

- `design_spec.md` — human-readable narrative; the "why" of the design (target audience, style objective, color rationale, page outline)
- `spec_lock.md` — machine-readable execution contract; the "what" Executor must literally use (HEX colors, exact font family string, icon library choice, image resource list with status)

Why both? Without `spec_lock.md`, the Executor would re-read `design_spec.md` per page during long decks and the LLM's context-compression drift would gradually mutate colors and fonts mid-deck. `spec_lock.md` is the **anti-drift mechanism** — the SKILL.md mandates `read_file /spec_lock.md` before every page, so values stay verbatim across 20+ slides.

`update_spec.py` propagates a post-generation change in two coordinated steps: write the new value to `spec_lock.md`, then literal-replace it across every `svg_output/*.svg`. The tool's scope is deliberately narrow — only `colors.*` (HEX values, case-insensitive replacement) and `typography.font_family` (attribute-scoped).
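
The two-step color propagation can be sketched as follows. This is a simplified illustration, not the real `update_spec.py`: the function names are hypothetical, and it assumes the old HEX value appears literally in `spec_lock.md`, so the same case-insensitive replacement is applied to the lock file and to each SVG.

```python
import re
from pathlib import Path

HEX_RE = re.compile(r"#[0-9a-fA-F]{6}")


def replace_hex(text: str, old_hex: str, new_hex: str) -> tuple[str, int]:
    """Case-insensitive literal replacement; returns (new_text, match_count)."""
    pattern = re.compile(re.escape(old_hex), re.IGNORECASE)
    return pattern.subn(new_hex, text)


def propagate_color(project_dir: Path, old_hex: str, new_hex: str) -> int:
    """Hypothetical sketch: update spec_lock.md, then every svg_output/*.svg.

    Returns the number of SVG files that were rewritten.
    """
    for value in (old_hex, new_hex):
        if not HEX_RE.fullmatch(value):
            raise ValueError(f"not a #RRGGBB color: {value!r}")

    # Step 1: write the new value into the execution contract.
    lock = project_dir / "spec_lock.md"
    lock_text, _ = replace_hex(lock.read_text(encoding="utf-8"), old_hex, new_hex)
    lock.write_text(lock_text, encoding="utf-8")

    # Step 2: literal-replace across the generated pages.
    changed = 0
    for svg in sorted((project_dir / "svg_output").glob("*.svg")):
        text, count = replace_hex(svg.read_text(encoding="utf-8"), old_hex, new_hex)
        if count:
            svg.write_text(text, encoding="utf-8")
            changed += 1
    return changed
```

Like the tool described here, the sketch keeps no backup of its own and would rely on version control for reverts.
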
Other fields (font sizes, icons, images, canvas) are intentionally **not supported** because their replacements would need attribute-scoped or semantic awareness whose risk/benefit doesn't justify bulk propagation. For those, edit `spec_lock.md` and re-author the affected pages. The tool deliberately makes no backups; it relies on git for reverts. Adding a backup mechanism would just duplicate git's job and create stale snapshots.

---

## Image Acquisition & Embedding

Four architectural decisions shape this phase:

**Provider-specific config keys, not a generic `IMAGE_API_KEY`.** Every backend takes its own `OPENAI_API_KEY` / `MINIMAX_API_KEY` / etc. and the active one is selected by an explicit `IMAGE_BACKEND=`. A unified `IMAGE_API_KEY` field looks tidier at first glance but causes silent confusion when a user has multiple providers configured at once and isn't sure which one is active — the kind of fault that surfaces only as "image generation gives weird results" with no clear failure point. Forcing per-provider keys makes "which backend am I using" a config-readable fact, not an inference.

**Permissive-by-default license filter, with strict mode for credit-incompatible layouts.** Web image search defaults to allowing CC BY / CC BY-SA images with inline attribution — most slides have visual room for a credit element. `--strict-no-attribution` is the escape hatch for full-bleed hero images and tight composition where there's no place to put a credit without breaking the design. Non-commercial (CC BY-NC*) and no-derivatives (CC BY-ND*) licenses are auto-rejected because the typical PPT Master output is shared in commercial or modified contexts; a permissive default with that floor is the failure mode users actually want.

**External refs during development, two divergent embedding strategies for delivery.** While editing in `svg_output/`, images are external file references — fast iteration, single-source-of-truth replacement.
The two delivery artifacts then diverge: `svg_final/` Base64-inlines (a folder of self-contained SVGs that IDE preview, browser, and the preview pptx can all open without missing the bitmap dependencies); native pptx instead copies bitmaps into the PPTX media folder and uses `` to express the cropping. The split exists because Base64 inside DrawingML works but bloats file size 3-4×, while file-referenced bitmaps are PowerPoint's native idiom for which `` is the canonical crop expression — wrong tool in either direction would cost editability or file size.

**Three-dimensional AI image lock at Strategist time.** When the deck includes AI-generated images, Strategist decides three orthogonal dimensions up front — `rendering` (visual style family: vector-illustration / editorial / 3d-isometric / sketch-notes / …), `palette` (how the deck's HEX values are *used*: proportion + role + temperament), `type` (per-image internal composition: background / hero / framework / comparison / …). The first two are deck-wide and written into `spec_lock.md`; Image_Generator then assembles every per-image prompt from the single locked rendering + palette plus a per-image type, instead of re-deciding style per image. Without this, every image gets its own style drift and the deck reads as a stack of unrelated illustrations. This is the visual-cohesion dual of `spec_lock`'s typography/color anti-drift mechanism, just one level upstream of pixels. Strategist surfaces ≥3 candidate `rendering × palette` combinations to the user during the Eight Confirmations — never auto-locking a single combination silently, because the choice has far-reaching deck-wide consequences and the user's taste is the only oracle for it.

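
The locked-style prompt assembly can be sketched as below. The class `ImageLock`, the function `build_image_prompt`, and the prompt wording are all hypothetical; the real Image_Generator prompt format is not shown in this document. The point is only that `rendering` and `palette` come from one frozen deck-wide object while `type` varies per image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageLock:
    """Deck-wide values, decided once by Strategist (hypothetical shape)."""
    rendering: str  # e.g. "vector-illustration"
    palette: str    # how the deck's HEX values are used


def build_image_prompt(lock: ImageLock, image_type: str, subject: str) -> str:
    """Combine the single locked style with a per-image type and subject."""
    return "; ".join([
        f"subject: {subject}",
        f"type: {image_type}",            # varies per image
        f"rendering: {lock.rendering}",   # identical for every image
        f"palette: {lock.palette}",       # identical for every image
    ])


if __name__ == "__main__":
    lock = ImageLock(
        rendering="vector-illustration",
        palette="#1A73E8 dominant, #FF5722 accent, calm",
    )
    print(build_image_prompt(lock, "hero", "city skyline at dawn"))
    print(build_image_prompt(lock, "comparison", "old vs new workflow"))
```
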

---

## Image-Text Layout: Primary Structures + Modifier Layers

The catalog of *how an image is placed on a slide* (full vocabulary in [`references/image-layout-patterns.md`](../skills/ppt-master/references/image-layout-patterns.md)) splits 72 numbered techniques into two layers that compose freely:

- **Primary Structures** (container layouts / image-as-canvas + native overlay / multi-image compositions) — the page's bones. One or more per page; cross-Primary combinations like *side-by-side comparison + image-as-canvas annotation* are legitimate.
- **Modifier Layers** (non-rectangular clips / overlays & masks / texture / special techniques) — finish. Any number per page, stacked on top of the Primary.

**Why explicit composition, not "one primary per page".** The AI failure mode this catalog fights isn't *over-combining*, it's *under-using*: defaulting every image page to bare `#2 left-third` or `#48 side-by-side` with no Modifier on top, producing visually flat, "AI-default" layouts. The earlier rule "one primary layout per page; modifiers compose" sounded principled but reinforced the under-use — the AI read it as permission to skip the Modifier layer entirely. The current framing flips the encouragement: combining is normal, single-Primary-no-Modifier is the case that needs justification.

**Why the layers are physically separated, not just tagged.** Patterns are reorganized so all Primary structures appear first, then all Modifiers — a Strategist or Executor reading the file once internalizes the two-layer mental model from the table of contents alone. Numbers are stable identifiers (`#38` is still image-as-canvas + annotation cards regardless of where it sits in the file), so existing references across `spec_lock.md`, `design_spec.md §VIII`, executor logs, and historical examples all keep resolving.

**Why composition flows through Strategist's resource list, not just Executor's improvisation.** The `Layout pattern` column in `§VIII Image Resource List` accepts a `# + # ...` expression — Primary id plus optional Modifier ids — so the composition is declared *before* SVG generation, audited by `svg_quality_checker`, and survives session re-entry. Pushing composition onto Executor alone would lose it on context compression in long decks; encoding it in the spec_lock-adjacent resource list makes it a piece of the design contract.

**Why true hard constraints stay upstream.** Cross-cutting technical constraints (`` only on ``, `fill-opacity` instead of `rgba()`, no ``, alpha-effect routing) live exclusively in [`shared-standards.md`](../skills/ppt-master/references/shared-standards.md). The layout patterns file points at them with one-line references rather than restating — so when a constraint relaxes (e.g., a new DrawingML feature becomes reliable), only one file changes, and a stale duplicate in patterns can't silently keep enforcing the old rule.

---

## SVG Constraints: Banned Features and Conditional Allowances

PowerPoint's DrawingML can represent only a subset of what SVG can express. The Executor operates inside an empirically-grown blacklist (mask, style/class, `@font-face`, foreignObject, symbol+use, textPath, animate*, script/iframe …) plus narrow conditional allowances for `marker-start`/`marker-end` and image-only `clip-path`. The authoritative list and exact per-feature constraints — including the substitute-effect routing table for `` (gradient overlays, clipPath, filter shadow, source-image bake-in) — live in [`references/shared-standards.md`](../skills/ppt-master/references/shared-standards.md).

The architectural reasons worth knowing here:

- **Why a blacklist, not a whitelist.** SVG is a wide spec; enumerating allowed features would force constant maintenance as the Executor finds new useful constructs.
  The blacklist captures the narrow set whose semantics have no DrawingML representation, leaving everything else implicitly available.
- **Why empirical, not derived from spec.** The list grew from real PPT export failures, not from reading the OOXML spec. Several features (e.g., ``) are theoretically expressible in DrawingML but practically unreliable across PowerPoint versions; the blacklist reflects the actually-shippable subset.
- **XML well-formedness traps.** Two cross-cutting gotchas independent of DrawingML: typography must use raw Unicode (`—`, `→`, `©`, NBSP) since HTML named entities (`&mdash;`) are XML-illegal in SVG, and reserved XML chars (`& < >`) must be entity-escaped or `R&D` will abort the export. These bite often enough to flag at the architecture level.
- **The blacklist runs before post-processing.** `svg_quality_checker.py` enforces it on `svg_output/`; post-processing rewrites SVG and would mask source-level violations. Fixes are always re-authoring in the Executor — there is intentionally no auto-fix mode (see Quality Gate).

---

## Quality Gate

**Why a checker exists at all.** SVG generated by an LLM is not deterministic — banned features creep in over long decks and only surface when `svg_to_pptx` aborts mid-conversion or PowerPoint silently drops elements. The checker turns "PowerPoint export failed at page 14" into "the Executor used `