--- name: corpus-analysis description: Gap detection and knowledge mapping techniques for comparing BRD requirements against corpus coverage. Includes SurrealQL queries for analyzing sources, entities, and topic coverage, plus prioritization frameworks for research task generation. --- # Corpus Analysis and Gap Detection This skill provides methods for analyzing corpus coverage, detecting knowledge gaps, and generating prioritized research tasks. ## Coverage Analysis Methods ### 1. Source Distribution Analysis Analyze how research sources map to planned chapters/sections. **Questions to ask**: - Which chapters have the most/least source support? - Are sources evenly distributed or clustered? - Which topics have only one source (single point of failure)? **SurrealQL query**: ```surql -- Count sources per chapter SELECT chapter, count() as source_count FROM section->cites->source GROUP BY chapter ORDER BY source_count DESC; ``` ### 2. Entity Coverage Analysis Identify which characters, locations, concepts, events are well-documented vs underrepresented. **Questions to ask**: - Which entities appear in only one source? - Which entities lack descriptive detail? - Which relationships are missing supporting evidence? **SurrealQL queries**: ```surql -- Entity mention frequency SELECT name, count(<-supports<-source) as source_count FROM concept ORDER BY source_count ASC; -- Characters with sparse descriptions SELECT name, description, count(<-appears_in<-section) as appearances FROM character WHERE length(description) < 100 ORDER BY appearances DESC; -- Locations not yet introduced SELECT name, description FROM location WHERE introduced = false; ``` ### 3. Topic Coverage Analysis Map topics from BRD against corpus to find coverage gaps. **Questions to ask**: - Which BRD topics have zero corpus representation? - Which plot points lack factual grounding? - For nonfiction: which thesis components lack evidence? **SurrealQL queries**: ```surql -- Topics mentioned in sources SELECT name, count() as mentions FROM concept<-related_to<-source GROUP BY name ORDER BY mentions DESC; -- Timeline gaps (missing events) SELECT * FROM event ORDER BY sequence; -- Uncited knowledge gaps SELECT question, context, created_at FROM knowledge_gap WHERE resolved = false ORDER BY created_at ASC; ``` ### 4. Source Quality Analysis Evaluate reliability distribution across the corpus. **Questions to ask**: - What percentage of sources are high reliability? - Are critical claims supported by high-quality sources? - Which topics rely primarily on low-reliability sources? **SurrealQL queries**: ```surql -- Source reliability distribution SELECT reliability, count() as count FROM source GROUP BY reliability ORDER BY reliability DESC; -- Sources by type SELECT source_type, count() as count FROM source GROUP BY source_type; -- Low-reliability sources supporting key concepts SELECT <-supports<-source.title as source_title, <-supports<-source.reliability as reliability, name as concept FROM concept WHERE <-supports<-source.reliability IN ['low', 'very low']; ``` ## Knowledge Gap Detection Patterns ### Pattern 1: BRD Requirements Without Corpus Support **Method**: Compare BRD sections to corpus entities and sources. **Steps**: 1. Extract key requirements from each BRD section 2. Query corpus for matching concepts, characters, locations 3. Flag requirements with zero or low matches **Example**: - BRD mentions "wireless operator training protocols at Beaulieu 1942" - Query: `SELECT * FROM concept WHERE name CONTAINS 'wireless' OR name CONTAINS 'Beaulieu'` - If no results: FLAG as high-priority gap ### Pattern 2: Shallow Coverage (Single Source) **Method**: Identify topics mentioned in only one source. **Why it matters**: Single-source claims are fragile and hard to verify. **SurrealQL**: ```surql -- Topics with only one supporting source SELECT name, count(<-supports<-source) as source_count FROM concept WHERE count(<-supports<-source) = 1; ``` ### Pattern 3: Missing Relationships **Method**: Check for expected but missing graph edges. **Examples**: - Character mentioned but no `->knows->` relationships - Location exists but never `->located_in->` any section - Event with no `->precedes->` or `->follows->` temporal links **SurrealQL**: ```surql -- Characters with no relationships SELECT name FROM character WHERE count(->knows->character) = 0; -- Events with no temporal ordering SELECT name FROM event WHERE count(->precedes->event) = 0 AND count(->follows->event) = 0; ``` ### Pattern 4: Timeline Inconsistencies **Method**: Detect chronological gaps or conflicts. **SurrealQL**: ```surql -- Events without dates SELECT name, description FROM event WHERE date IS NONE; -- Sequence gaps (e.g., 1, 2, 5, 6 — missing 3 and 4) SELECT sequence FROM event ORDER BY sequence; ``` ### Pattern 5: Uncited Sections **Method**: Find written sections without source citations. **SurrealQL**: ```surql -- Sections with no citations SELECT * FROM section WHERE count(->cites->source) = 0; ``` ## Prioritization Framework Use this framework to prioritize research tasks based on impact and urgency. ### High Priority (Blocks Multiple Sections) **Criteria**: - Gap affects 3+ planned chapters/sections - Core to the BRD thesis/premise - Required for major plot point or key argument - Timeline-critical (early chapters need it) **Examples**: - "SOE training protocols" (affects multiple training scenes) - "Lyon resistance network structure" (entire middle section depends on it) - "Protagonist's historical timeline" (affects chronological consistency) **Research task template**: ``` Priority: HIGH Blocking: [list chapter/section IDs] Query: [specific research question] Context: [why this is needed, what we already know] Success criteria: [what would resolve this gap] ``` ### Medium Priority (Blocks One Section) **Criteria**: - Gap affects 1-2 sections - Adds depth but isn't critical to plot/argument - Can be worked around if research fails - Later chapters (writing not imminent) **Examples**: - "Daily life details in Lyon 1943" (enriches setting but not critical) - "German counter-intelligence methods" (adds realism to one scene) - "Specific wireless equipment specs" (detail-level enhancement) **Research task template**: ``` Priority: MEDIUM Blocking: [section ID] Query: [specific research question] Fallback: [how to proceed if research fails] ``` ### Low Priority (Nice to Have) **Criteria**: - Doesn't block any section - Enhances detail or authenticity - Can be added in editing pass - Background/contextual knowledge **Examples**: - "Period-accurate slang terms" - "Weather patterns in occupied France" - "Secondary character backstory details" **Research task template**: ``` Priority: LOW Enhancement for: [section or theme] Query: [research question] ``` ## Research Task Generation Templates ### Template 1: Factual Gap ```markdown **Task**: Research [specific topic] **Priority**: [HIGH/MEDIUM/LOW] **Blocks**: [chapter/section IDs] **Context**: - BRD requires: [what the BRD says] - Corpus has: [what we currently know] - Gap: [what's missing] **Research questions**: 1. [Specific question 1] 2. [Specific question 2] **Success criteria**: - [ ] Found 2+ reliable sources on [topic] - [ ] Extracted key facts: [list expected facts] - [ ] Resolved knowledge_gap:[id] **Search strategy**: - Academic databases: [keywords] - Primary sources: [archives, documents] - Web search: [specific queries] ``` ### Template 2: Character/Entity Gap ```markdown **Task**: Research [character/entity name] **Priority**: [HIGH/MEDIUM/LOW] **Blocks**: [section IDs] **Context**: - Mentioned in: [where entity appears in BRD/outline] - Current knowledge: [what corpus has] - Needed: [missing details] **Research questions**: 1. Background/history: [specifics] 2. Relationships: [who/what they connect to] 3. Timeline: [when they appear, key dates] **Success criteria**: - [ ] CREATE/UPDATE entity with full description - [ ] Establish relationships via RELATE statements - [ ] Add timeline anchors (dates, sequence) **Sources to check**: - [Specific books, archives, websites] ``` ### Template 3: Thematic/Conceptual Gap ```markdown **Task**: Research [theme/concept] **Priority**: [HIGH/MEDIUM/LOW] **Blocks**: [section IDs] **Context**: - BRD theme: [core theme/argument] - Current support: [sources that touch on this] - Gap: [missing evidence, examples, or depth] **Research questions**: 1. [Theoretical/conceptual question] 2. [Evidence/example question] 3. [Counter-argument/complexity question] **Success criteria**: - [ ] Found diverse perspectives on [concept] - [ ] Identified concrete examples/case studies - [ ] Created concept entity with supporting sources **Expected outcomes**: - 3+ sources with varied reliability levels - Clear link to BRD thesis ``` ## SurrealQL Queries for Gap Analysis ### Comprehensive Coverage Report ```surql -- Get overview of corpus completeness LET $total_sources = (SELECT count() FROM source)[0].count; LET $total_characters = (SELECT count() FROM character)[0].count; LET $total_concepts = (SELECT count() FROM concept)[0].count; LET $open_gaps = (SELECT count() FROM knowledge_gap WHERE resolved = false)[0].count; RETURN { sources: $total_sources, characters: $total_characters, concepts: $total_concepts, open_gaps: $open_gaps, source_reliability: (SELECT reliability, count() as count FROM source GROUP BY reliability), chapters_with_citations: (SELECT chapter, count() as cites FROM section->cites->source GROUP BY chapter) }; ``` ### Gap Detection by Section ```surql -- Find sections with weak source support SELECT id, chapter, sequence, count(->cites->source) as citation_count, word_count FROM section WHERE count(->cites->source) < 2 ORDER BY chapter, sequence; ``` ### Entity Relationship Completeness ```surql -- Characters without sufficient context SELECT name, count(->knows->character) as relationships, count(<-appears_in<-section) as appearances, length(description) as desc_length FROM character WHERE count(->knows->character) = 0 OR length(description) < 50 ORDER BY appearances DESC; ``` ## Example Gap Detection Workflow 1. **Load BRD**: Read BRD requirements for next chapter 2. **Query corpus**: Run coverage analysis queries 3. **Identify gaps**: Compare BRD needs vs corpus results 4. **Prioritize**: Apply HIGH/MEDIUM/LOW framework 5. **Generate tasks**: Use templates to create research tasks 6. **Store gaps**: `CREATE knowledge_gap SET question=..., context=..., resolved=false` 7. **Report**: Summarize findings with specific task IDs ## Output Format When performing corpus analysis, provide: ```markdown ## Corpus Analysis Report ### Coverage Summary - Total sources: [count] - Source reliability: [high: X, medium: Y, low: Z] - Entities extracted: [characters: X, locations: Y, events: Z] - Open knowledge gaps: [count] ### Gaps by Priority #### High Priority (Blocking) 1. [Gap description] — Blocks: [sections] — Research: [topic] 2. ... #### Medium Priority 1. [Gap description] — Blocks: [sections] — Research: [topic] 2. ... #### Low Priority 1. [Gap description] — Enhancement for: [context] 2. ... ### Recommended Next Action [Research these high-priority gaps / Continue to plan-write / etc.] ```