---
name: corpus-investigation
description: Systematically investigate large corpus sections (100GB+) using stratified sampling, pattern recognition, and computational verification. Produces comprehensive section analyses with metadata schemas, chunking strategies, and RAG integration recommendations. Use when analyzing large datasets, investigating archive structures, studying corpus organization, conducting investigation and study tasks, or documenting dataset characteristics for RAG pipeline design.
allowed-tools: Read, Grep, Glob, Bash, Write
---

# Corpus Investigation Skill

**Purpose**: Enable systematic, reproducible, token-efficient investigation of large corpus sections to inform RAG architecture design.

**Methodology**: Based on the proven investigation framework used to analyze the 121GB Marxists Internet Archive, achieving 95% token reduction through computational verification and stratified sampling.

---

## When to Use This Skill

Activate this skill when the user requests:

- "investigate corpus section"
- "analyze archive structure"
- "study dataset organization"
- "document corpus characteristics"
- "investigation and study" (Maoist reference to systematic research)
- Analysis of large file collections (>1GB)
- Metadata schema extraction from document sets
- Chunking strategy recommendations for RAG
- Dataset preparation for knowledge base ingestion

---

## Core Investigation Framework

Follow this **5-phase methodology** for all corpus investigations:

### Phase 1: Reconnaissance (10% of effort)

**Goal**: Understand section scope without deep reading

**Tasks**:

1. **Read the index page** for the section (if it exists)
2. **Run directory structure analysis**:

```bash
cd /path/to/section

# Get directory tree with sizes (3 levels deep)
find . -type d -maxdepth 3 | head -50
du -h --max-depth=2 | sort -h | tail -20

# Count files by type
find . -type f -name "*.html" | wc -l
find . -type f -name "*.htm" | wc -l
find . -type f -name "*.pdf" | wc -l

# Get size distribution by subdirectory
du -sh */ 2>/dev/null | sort -h
```

3. **Identify subsections** and document the hierarchy
4. **Calculate total size** and file counts

**Output**: Section overview with statistics in markdown format
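The Phase 1 overview can also be assembled programmatically. The following is a minimal Python sketch for producing the markdown summary; the path handling, function name, and output layout are illustrative assumptions, not a required interface:

```python
from collections import Counter
from pathlib import Path
import sys

def summarize_section(section: Path) -> str:
    """Build a small markdown overview: file counts by type, size by subdirectory."""
    ext_counts = Counter()     # files per extension
    subdir_bytes = Counter()   # total bytes per top-level subdirectory
    for path in section.rglob("*"):
        if not path.is_file():
            continue
        ext_counts[path.suffix.lower() or "(none)"] += 1
        rel = path.relative_to(section)
        top = rel.parts[0] if len(rel.parts) > 1 else "."
        subdir_bytes[top] += path.stat().st_size

    lines = [f"# Section overview: {section.name}", "", "## Files by type"]
    lines += [f"- `{ext}`: {count}" for ext, count in ext_counts.most_common()]
    lines += ["", "## Size by subdirectory (MB)"]
    lines += [f"- {sub}: {size / 1e6:.1f}" for sub, size in subdir_bytes.most_common()]
    return "\n".join(lines)

if __name__ == "__main__":
    print(summarize_section(Path(sys.argv[1])))
```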
---

### Phase 2: Stratified Sampling (40% of effort)

**Goal**: Sample representative files across key dimensions

**Stratification Dimensions**:

1. **Size-based**: Large (>1GB), medium (100MB-1GB), small (<100MB) subsections
2. **Temporal**: Early, mid, late periods (if time-based organization)
3. **Type-based**: HTML vs PDF, different file naming patterns
4. **Depth-based**: Index pages, category pages, content pages

**Sampling Strategy**:

```bash
# Sample large subsections (>1GB) - prioritize
# Read 10-15 files from largest sections

# Sample medium subsections (100MB-1GB)
# Read 5-10 files from mid-size sections

# Sample small subsections (<100MB)
# Read 3-5 files total from small sections

# Sample different time periods (if applicable)
# Find files with year patterns
find /path/to/section -name "*19[0-9][0-9]*" -o -name "*20[0-9][0-9]*" | head -20

# Sample different file depths
find /path/to/section -name "index.htm*" | head -5              # Index pages
find /path/to/section -type f -name "*.htm*" | shuf | head -10  # Random content
```

**Target Sample Size**: 15-25 files total across all dimensions

**For each sampled file**:

- Extract structure (headings, meta tags, first paragraph)
- Do NOT read full content unless necessary
- Document patterns observed

**Token Optimization**: Extract structure only, not full content

```python
# When reading HTML, extract structure not content:
# - DOCTYPE and charset
# - All meta tags (name and content)
# - All heading tags (h1-h6)
# - CSS classes used
# - First paragraph only
# - Link patterns (internal, external, anchors)
# - Total word count estimate
```
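A minimal, standard-library-only sketch of that structure-only pass is shown below; the class and function names are illustrative assumptions rather than part of the skill:

```python
from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    """Collect structure only: meta tags, headings, first paragraph, link/word counts."""

    def __init__(self):
        super().__init__()
        self.meta = []            # (name, content) pairs from meta tags
        self.headings = []        # (tag, text) for h1-h6
        self.first_paragraph = None
        self.link_count = 0
        self.word_count = 0
        self._capture = None      # tag currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            self.meta.append((attrs.get("name"), attrs.get("content")))
        elif tag == "a":
            self.link_count += 1
        elif tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self._capture, self._buffer = tag, []
        elif tag == "p" and self.first_paragraph is None:
            self._capture, self._buffer = tag, []

    def handle_data(self, data):
        self.word_count += len(data.split())
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())
        if tag == "p":
            self.first_paragraph = text
        else:
            self.headings.append((tag, text))
        self._capture = None

def extract_structure(path: str) -> StructureExtractor:
    """Parse one HTML file and return only its structural summary."""
    parser = StructureExtractor()
    with open(path, encoding="utf-8", errors="replace") as fh:
        parser.feed(fh.read())
    return parser
```

For example, `extract_structure("page.htm").meta` yields the metadata pairs without ever holding the full document text in context.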

---

### Phase 3: Pattern Verification (30% of effort)

**Goal**: Verify that patterns observed in samples hold across the entire section

**Use computational tools (grep/find), NOT exhaustive reading**

**Verification Commands**:

```bash
# 1. Meta tag consistency
grep -roh '<meta name="[^"]*"[^>]*>' /path/to/section | sort | uniq -c

# 2. Character encoding
grep -roh 'charset=[^"]*' /path/to/section | sort | uniq -c | sort -rn
```

**Document Confidence Levels**:

- 100% = Universal pattern (all files)
- 90-99% = Standard pattern (rare exceptions)
- 75-89% = Common pattern (notable exceptions)
- 50-74% = Frequent pattern (not standard)
- <50% = Occasional pattern
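When a verification count needs to be mapped onto these confidence levels, a small helper can do the bucketing. This is a sketch under the assumption that patterns are plain regular expressions; the function and constant names are illustrative:

```python
import re
from pathlib import Path

# Thresholds mirror the confidence levels listed above
CONFIDENCE_BANDS = [
    (100, "Universal pattern"),
    (90, "Standard pattern"),
    (75, "Common pattern"),
    (50, "Frequent pattern"),
    (0, "Occasional pattern"),
]

def pattern_confidence(section: str, pattern: str, glob: str = "*.htm*"):
    """Return (percent, label) for how many files in the section match a regex."""
    regex = re.compile(pattern)
    files = [p for p in Path(section).rglob(glob) if p.is_file()]
    if not files:
        return 0.0, "no files found"
    hits = sum(
        bool(regex.search(p.read_text(encoding="utf-8", errors="replace")))
        for p in files
    )
    percent = 100 * hits / len(files)
    label = next(label for cutoff, label in CONFIDENCE_BANDS if percent >= cutoff)
    return percent, label

# Example (illustrative pattern):
# pattern_confidence("/path/to/section", r'<meta name="author"')
```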

---

### Phase 4: Edge Case Analysis (10% of effort)

**Goal**: Identify exceptions and unusual patterns

**Sample These Outliers**:

```bash
# 1. Largest files (top 5)
find /path/to/section -type f \( -name "*.htm*" -o -name "*.pdf" \) | xargs ls -lh | sort -k5 -hr | head -5

# 2. Smallest files (bottom 5)
find /path/to/section -type f \( -name "*.htm*" -o -name "*.pdf" \) | xargs ls -lh | sort -k5 -h | head -5

# 3. Files with unusual names (no standard patterns)
find /path/to/section -type f -name "*.htm*" | grep -v 'index\|chapter\|ch[0-9]\|[0-9]\{4\}'

# 4. Deepest nested files
find /path/to/section -type f -name "*.htm*" | awk '{print gsub(/\//,"/"), $0}' | sort -rn | head -10

# 5. Files without meta tags (if meta tags expected)
for file in $(find /path/to/section -name "*.htm*" | head -100); do
  grep -q '<meta' "$file" || echo "$file"
done
```
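These checks can also be batched in a single pass. The sketch below flags empty files, non-UTF-8 content, and missing meta tags; the specific checks and names are illustrative assumptions, not an exhaustive edge-case catalog:

```python
from pathlib import Path

def find_suspect_files(section: str, glob: str = "*.htm*", limit=None):
    """Return [(path, [issues])] for files that look like edge cases."""
    suspects = []
    for i, path in enumerate(Path(section).rglob(glob)):
        if limit is not None and i >= limit:
            break
        raw = path.read_bytes()
        issues = []
        if not raw:
            issues.append("empty file")
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode("latin-1")   # decode permissively, but record the issue
            issues.append("not valid UTF-8")
        if "<meta" not in text.lower():
            issues.append("no meta tags")
        if issues:
            suspects.append((str(path), issues))
    return suspects
```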

---

### Phase 5: Synthesis and Documentation (10% of effort)

**Goal**: Consolidate findings into a comprehensive section analysis

**Output**: Section analysis documenting the metadata schema, chunking strategy, and RAG integration recommendations

---

## Token Efficiency Guidelines

### 1. Extract Structure, Not Content

- Check the `<head>` section for meta tags
- Extract only `<h1>`-`<h6>` tags for structure
- Read only the first `<p>` for a content sample
- Count links/images, don't read them

### 2. Aggregate Statistics Over Individual Analysis

**DON'T**: Describe each file individually

**DO**: Compute aggregate statistics and describe patterns

**Example**:

```
Instead of:
  "file1.htm has 3 meta tags"
  "file2.htm has 3 meta tags"
  "file3.htm has 2 meta tags"

Write:
  "95% of files have 3 meta tags (author, description, classification)"
  "5% are missing description tag"
```

### 3. Reference Examples, Don't Reproduce

**DON'T**: Include full HTML of 10 example files (50k tokens)

**DO**: Include 1-2 representative examples + reference paths (5k tokens)

### 4. Use Shell Commands for Verification

**Always prefer**:

- `grep -r` for pattern extraction
- `find` for file discovery
- `wc -l` for counting
- `sort | uniq -c` for frequency analysis
- `du -sh` for size calculations

**Over**:

- Reading files individually
- Manual counting
- Exhaustive sampling

---

## Metadata Extraction Protocol

**Extract metadata in layers** for comprehensive documentation:

### Layer 1: File System Metadata

Extract from file paths using regex:

```python
# Example patterns to extract:
# - Section from /path/{section}/...
# - Author from /archive/{author}/...
# - Year from .../{year}/... or filename
# - Work slug from path structure
# - Chapter from ch##.htm filenames
```

### Layer 2: HTML Meta Tags

Parse `<meta>` tags in `<head>`:

```bash
# Extract all meta tag names
grep -roh '<meta name="[^"]*"' /path | sort | uniq -c | sort -rn
```

### Layer 3: Structural Metadata

Extract navigation context (breadcrumbs in `<p class="breadcrumb">` or similar):

```bash
# Find breadcrumb patterns
grep -r 'class="breadcrumb"' /path | head -10
grep -r 'class="path"' /path | head -10
```

### Layer 4: Computed Metrics

- Word count
- Heading count (number of heading tags)
- Reading time (word_count / 250 words per minute)

---

## Sample Selection Algorithm

Use this strategy to select representative samples (a Python sketch follows the block below):

```markdown
For a section with N total files:

1. ALWAYS read top-level index.htm (if exists)

2. Identify subsections by size:
   - Large (>1GB): Sample 10-15 files
   - Medium (100MB-1GB): Sample 5-10 files
   - Small (<100MB): Sample 3-5 files total

3. Within each subsection, stratify by:
   - File type (HTML vs PDF)
   - Depth (index vs category vs content pages)
   - Time period (if applicable)

4. Use random sampling within strata:
   find /path -name "*.htm*" | shuf | head -10

5. Include edge cases:
   - Largest file
   - Smallest file
   - Unusual naming pattern
   - Deepest nested file

Target: 15-25 files total for sections <10GB
Target: 25-40 files total for sections >10GB
```
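A rough Python implementation of that selection strategy is sketched below. The size thresholds come from the algorithm above, while the function names, the fixed per-stratum counts, and the seeding are illustrative assumptions:

```python
import random
from pathlib import Path

def target_sample_count(size_bytes: int) -> int:
    """Per-subsection targets from the algorithm above (midpoints of each range)."""
    if size_bytes > 1e9:       # large (>1GB): 10-15 files
        return 12
    if size_bytes > 100e6:     # medium (100MB-1GB): 5-10 files
        return 7
    return 4                   # small (<100MB): 3-5 files

def select_samples(section: str, glob: str = "*.htm*", seed: int = 0):
    """Pick a stratified, reproducible sample of files from each subsection."""
    random.seed(seed)                          # reproducible sampling
    root = Path(section)
    samples = []
    index = root / "index.htm"
    if index.exists():
        samples.append(index)                  # always read the top-level index
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        files = [p for p in sub.rglob(glob) if p.is_file()]
        if not files:
            continue
        size = sum(p.stat().st_size for p in files)
        picked = random.sample(files, min(target_sample_count(size), len(files)))
        samples.extend(picked)
        # include size extremes from each subsection as edge cases
        by_size = sorted(files, key=lambda p: p.stat().st_size)
        for extreme in (by_size[0], by_size[-1]):
            if extreme not in samples:
                samples.append(extreme)
    return samples
```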
---

## Verification Without Exhaustive Reading

**Principle**: Trust but verify

After identifying a pattern from samples, verify that it holds using computational tools.

### Confidence Level Calculation

```bash
# Pattern: "All files have author meta tag"

# Count total files
total=$(find /path -name "*.htm*" | wc -l)

# Count files with pattern
with_pattern=$(grep -rl '<meta name="author"' /path --include="*.htm*" | wc -l)

# Confidence = percentage of files matching
echo "$((with_pattern * 100 / total))% of files match"
```

**For sections >10GB**, verify against a random sample instead of scanning every file:

```bash
# Get random sample of 1000 files
find /path -name "*.htm*" | shuf -n 1000 > sample_files.txt

# Check pattern in sample
sampled=$(wc -l < sample_files.txt)
matches=0
while read file; do
  grep -q '<meta name="author"' "$file" && matches=$((matches + 1))
done < sample_files.txt
echo "$((matches * 100 / sampled))% of sampled files match"
```

### Statistics

```bash
# Count files by extension
find /path -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn

# Size distribution by directory
du -sh /path/*/ | sort -h

# Average file size
find /path -type f -name "*.htm*" | xargs ls -l | awk '{sum+=$5; count++} END {print sum/count/1024 " KB average"}'

# Word count distribution
find /path -name "*.htm*" | head -100 | xargs wc -w | sort -n
```

### Sampling

```bash
# Random sample of 10 files
find /path -name "*.htm*" | shuf | head -10

# Stratified sample (first, middle, last)
files=$(find /path -name "*.htm*" | sort)
total=$(echo "$files" | wc -l)
echo "$files" | sed -n "1p; $(($total/2))p; ${total}p"

# Sample by size (smallest, median, largest)
find /path -name "*.htm*" | xargs ls -lh | sort -k5 -h | awk '{lines[NR]=$0} END {print lines[1]; print lines[int(NR/2)]; print lines[NR]}'
```

---

## Investigation Tips

### Do's

1. **Start with reconnaissance** - Understand scope before sampling
2. **Use stratified sampling** - Cover all dimensions (size, time, type, depth)
3. **Verify patterns computationally** - grep/find are your friends
4. **Document confidence levels** - Specify % of files matching patterns
5. **Extract structure, not content** - Save tokens
6. **Reference file paths** - Use actual paths, enable reproduction
7. **Include bash commands** - Document what you ran
8. **Identify edge cases** - Sample outliers explicitly
9. **Be specific in recommendations** - "Chunk by paragraph" not "chunk appropriately"
10. **Ask clarifying questions** - If scope is unclear, ask the user

### Don'ts

1. **Don't read exhaustively** - Sample strategically instead
2. **Don't include full HTML** - Extract structure only
3. **Don't describe individual files** - Aggregate statistics instead
4. **Don't make assumptions** - Verify patterns with grep/find
5. **Don't skip edge cases** - Outliers reveal important variations
6. **Don't use vague language** - Be specific and actionable
7. **Don't forget verification** - Every pattern needs a confidence level
8. **Don't ignore errors** - Document broken files, encoding issues
9. **Don't reinvent chunking** - Use proven strategies from the methodology
10. **Don't exceed the token budget** - If approaching the limit, increase aggregation

---

## Example Investigation Workflow

Here's a concrete example workflow:

```markdown
USER: "Investigate the /data/archive/philosophy section"

AGENT RESPONSE:

I'll systematically investigate the philosophy section using the 5-phase methodology.
## Phase 1: Reconnaissance

[Run directory structure commands]
[Document size, file counts, hierarchy]

Section overview:
- Size: 2.3GB
- Files: 1,847 HTML, 234 PDFs
- Structure: Organized by author (45 authors)
- Time span: 1840-1950 (from filenames)

## Phase 2: Stratified Sampling

[Select 20 representative files across:]
- Large authors: Hegel (400MB), Kant (300MB), Marx (250MB)
- Medium authors: Nietzsche (150MB), Spinoza (120MB)
- Small authors: Feuerbach (40MB), Stirner (30MB)

[For each sample, extract structure:]
- Meta tags: author, title, classification
- Headings: h1 (title), h2 (chapters), h3 (sections)
- CSS classes: "title", "quoteb", "context"
- First paragraph: [sample text]
- Links: Internal chapter links, footnote anchors

## Phase 3: Pattern Verification

[Run verification commands:]

```bash
# Meta tag coverage
grep -r '