---
name: corpus-investigation
description: Systematically investigate large corpus sections (100GB+) using stratified sampling, pattern recognition, and computational verification. Produces comprehensive section analyses with metadata schemas, chunking strategies, and RAG integration recommendations. Use when analyzing large datasets, investigating archive structures, studying corpus organization, conducting investigation and study tasks, or documenting dataset characteristics for RAG pipeline design.
allowed-tools: Read, Grep, Glob, Bash, Write
---

# Corpus Investigation Skill

**Purpose**: Enable systematic, reproducible, token-efficient investigation of large corpus sections to inform RAG architecture design.

**Methodology**: Based on the proven investigation framework used to analyze the 121GB Marxists Internet Archive, achieving 95% token reduction through computational verification and stratified sampling.

---

## When to Use This Skill

Activate this skill when the user requests:

- "investigate corpus section"
- "analyze archive structure"
- "study dataset organization"
- "document corpus characteristics"
- "investigation and study" (Maoist reference to systematic research)
- Analysis of large file collections (>1GB)
- Metadata schema extraction from document sets
- Chunking strategy recommendations for RAG
- Dataset preparation for knowledge base ingestion

---

## Core Investigation Framework

Follow this **5-phase methodology** for all corpus investigations:

### Phase 1: Reconnaissance (10% of effort)

**Goal**: Understand section scope without deep reading

**Tasks**:

1. **Read the index page** for the section (if one exists)
2. **Run directory structure analysis**:

```bash
cd /path/to/section

# Get directory tree with sizes (3 levels deep)
find . -maxdepth 3 -type d | head -50
du -h --max-depth=2 | sort -h | tail -20

# Count files by type
find . -type f -name "*.html" | wc -l
find . -type f -name "*.htm" | wc -l
find . \
  -type f -name "*.pdf" | wc -l

# Get size distribution by subdirectory
du -sh */ 2>/dev/null | sort -h
```

3. **Identify subsections** and document the hierarchy
4. **Calculate total size** and file counts

**Output**: Section overview with statistics in markdown format

---

### Phase 2: Stratified Sampling (40% of effort)

**Goal**: Sample representative files across key dimensions

**Stratification Dimensions**:

1. **Size-based**: Large (>1GB), medium (100MB-1GB), small (<100MB) subsections
2. **Temporal**: Early, mid, late periods (if time-based organization)
3. **Type-based**: HTML vs PDF, different file naming patterns
4. **Depth-based**: Index pages, category pages, content pages

**Sampling Strategy**:

```bash
# Sample large subsections (>1GB) - prioritize
# Read 10-15 files from the largest sections

# Sample medium subsections (100MB-1GB)
# Read 5-10 files from mid-size sections

# Sample small subsections (<100MB)
# Read 3-5 files total from small sections

# Sample different time periods (if applicable)
# Find files with year patterns
find /path/to/section -name "*19[0-9][0-9]*" -o -name "*20[0-9][0-9]*" | head -20

# Sample different file depths
find /path/to/section -name "index.htm*" | head -5              # Index pages
find /path/to/section -type f -name "*.htm*" | shuf | head -10  # Random content
```

**Target Sample Size**: 15-25 files total across all dimensions

**For each sampled file**:

- Extract structure (headings, meta tags, first paragraph)
- Do NOT read full content unless necessary
- Document patterns observed

**Token Optimization**: Extract structure only, not full content

```python
# When reading HTML, extract structure, not content:
# - DOCTYPE and charset
# - All meta tags (name and content)
# - All heading tags (h1-h6)
# - CSS classes used
# - First paragraph only
# - Link patterns (internal, external, anchors)
# - Total word count estimate
```

---

### Phase 3: Pattern Verification (30% of effort)

**Goal**: Verify that patterns observed in samples hold across the entire
section.

**Use computational tools (grep/find), NOT exhaustive reading.**

**Verification Commands**:

```bash
# 1. Meta tag consistency
grep -r '<meta name=[^>]*>' /path/to/section | sort | uniq -c

# 2. Title tag presence
grep -rL '<title>' /path/to/section --include='*.htm*' | wc -l

# 3. Heading structure
grep -roh '<h[1-6]' /path/to/section | sort | uniq -c

# 4. CSS classes in use
grep -roh 'class="[^"]*"' /path/to/section | sort | uniq -c | sort -rn | head -20

# 5. Internal link patterns
grep -roh 'href="[^"]*"' /path/to/section | sort | uniq -c | sort -rn | head -20

# 6. File naming patterns
find /path/to/section -type f -name "*.htm*" | sed 's|.*/||' | sort | uniq -c | sort -rn | head -20

# 7. Character encoding
grep -roh 'charset=[^"]*' /path/to/section | sort | uniq -c | sort -rn
```

**Document Confidence Levels**:

- 100% = Universal pattern (all files)
- 90-99% = Standard pattern (rare exceptions)
- 75-89% = Common pattern (notable exceptions)
- 50-74% = Frequent pattern (not standard)
- <50% = Occasional pattern

---

### Phase 4: Edge Case Analysis (10% of effort)

**Goal**: Identify exceptions and unusual patterns

**Sample These Outliers**:

```bash
# 1. Largest files (top 5)
find /path/to/section -type f \( -name "*.htm*" -o -name "*.pdf" \) | xargs ls -lh | sort -k5 -hr | head -5

# 2. Smallest files (bottom 5)
find /path/to/section -type f \( -name "*.htm*" -o -name "*.pdf" \) | xargs ls -lh | sort -k5 -h | head -5

# 3. Files with unusual names (no standard patterns)
find /path/to/section -type f -name "*.htm*" | grep -v 'index\|chapter\|ch[0-9]\|[0-9]\{4\}'

# 4. Deepest nested files
find /path/to/section -type f -name "*.htm*" | awk '{print gsub(/\//,"/"), $0}' | sort -rn | head -10

# 5. Files without meta tags (if meta tags expected)
for file in $(find /path/to/section -name "*.htm*" | head -100); do
  grep -q '<meta' "$file" || echo "missing meta: $file"
done
```

---

### Phase 5: Synthesis & Documentation (10% of effort)

**Goal**: Compile findings into a comprehensive section analysis

**Deliverables**:

- Section overview with aggregate statistics
- Metadata schema with confidence levels
- Chunking strategy recommendation
- RAG integration notes and documented edge cases

---

## Token Optimization Techniques

### 1. Computational Verification Over Exhaustive Reading

**DON'T**: Read every file to confirm a pattern
**DO**: Verify patterns across the whole section with grep/find

### 2. Stratified Sampling Over Full Coverage

**DON'T**: Read files at random from a single subdirectory
**DO**: Sample deliberately across size, time, type, and depth dimensions

### 3. Extract Structure, Not Full Content

- Parse only the `<head>` section for meta tags
- Extract only the first `<p>
` for content sample
- Count links/images, don't read them

### 4. Aggregate Statistics Over Individual Analysis

**DON'T**: Describe each file individually
**DO**: Compute aggregate statistics and describe patterns

**Example**:

```
Instead of:
  "file1.htm has 3 meta tags"
  "file2.htm has 3 meta tags"
  "file3.htm has 2 meta tags"

Write:
  "95% of files have 3 meta tags (author, description, classification)"
  "5% are missing the description tag"
```

### 5. Reference Examples, Don't Reproduce

**DON'T**: Include full HTML of 10 example files (50k tokens)
**DO**: Include 1-2 representative examples + reference paths (5k tokens)

### 6. Use Shell Commands for Verification

**Always prefer**:

- `grep -r` for pattern extraction
- `find` for file discovery
- `wc -l` for counting
- `sort | uniq -c` for frequency analysis
- `du -sh` for size calculations

**Over**:

- Reading files individually
- Manual counting
- Exhaustive sampling

---

## Metadata Extraction Protocol

**Extract metadata in 5 layers** for comprehensive documentation:

### Layer 1: File System Metadata

Extract from file paths using regex:

```python
# Example patterns to extract:
# - Section from /path/{section}/...
# - Author from /archive/{author}/...
# - Year from .../{year}/... or the filename
# - Work slug from path structure
# - Chapter from ch##.htm filenames
```

### Layer 2: HTML Meta Tags

Parse `<meta>` tags in `<head>
`:

```bash
# Extract all meta tag names
grep -roh '<meta name="[^"]*"' /path/to/section | sort | uniq -c | sort -rn
```

### Layer 3: Navigation Context

Extract from breadcrumb elements (`<div class="breadcrumb">` or similar):

```bash
# Find breadcrumb patterns
grep -r 'class="breadcrumb"' /path | head -10
grep -r 'class="path"' /path | head -10
```
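The Layer 1 path patterns above can be sketched as a small extractor. This is a minimal sketch assuming a hypothetical `/archive/{author}/works/{year}/{slug}/ch{NN}.htm` layout; the layout, field names, and `path_metadata` helper are illustrative, not taken from the archive itself:

```python
import re

# Hypothetical path layout: /archive/{author}/works/{year}/{slug}/ch{NN}.htm
PATH_RE = re.compile(
    r"/archive/(?P<author>[^/]+)/works/(?P<year>\d{4})/(?P<slug>[^/]+)/"
    r"(?:ch(?P<chapter>\d+)|index)\.html?$"
)

def path_metadata(path: str) -> dict:
    """Extract file-system metadata (Layer 1) from a corpus path."""
    m = PATH_RE.search(path)
    if not m:
        return {}
    meta = {k: v for k, v in m.groupdict().items() if v is not None}
    if "chapter" in meta:
        meta["chapter"] = int(meta["chapter"])
    return meta

print(path_metadata("/archive/marx/works/1867/capital-v1/ch01.htm"))
# {'author': 'marx', 'year': '1867', 'slug': 'capital-v1', 'chapter': 1}
```

Running the extractor over every path from `find` gives the Layer 1 fields for the whole section without opening a single file.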
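The structure-only extraction described in Phase 2 and Layer 2 can be sketched with the standard library alone (no parser dependency); `StructureExtractor` and the returned field names are illustrative assumptions, not part of the skill:

```python
from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    """Collect structure (meta tags, headings, classes) without keeping body text."""
    def __init__(self):
        super().__init__()
        self.meta = {}          # meta name -> content
        self.headings = []      # (tag, text) pairs, h1-h6
        self.classes = set()    # CSS classes seen
        self.word_count = 0     # rough size estimate
        self._in_heading = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "name" in a:
            self.meta[a["name"]] = a.get("content", "")
        if a.get("class"):
            self.classes.update(a["class"].split())
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._in_heading = tag
            self.headings.append((tag, ""))

    def handle_endtag(self, tag):
        if self._in_heading == tag:
            self._in_heading = None

    def handle_data(self, data):
        self.word_count += len(data.split())
        if self._in_heading and self.headings:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, (text + data).strip())

def extract_structure(html: str) -> dict:
    p = StructureExtractor()
    p.feed(html)
    return {"meta": p.meta, "headings": p.headings,
            "classes": sorted(p.classes), "words": p.word_count}
```

Feeding a sampled page through `extract_structure` yields the meta names, heading outline, and class inventory at a small fraction of the tokens of the raw HTML.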