---
name: notebooklm-knowledge-base-organizer
description: >
  Use when preparing files for NotebookLM, organizing documents into a knowledge
  base, converting formats for NotebookLM compatibility, or reducing a large
  document collection to fit NotebookLM's 50-source limit. Scores and prioritizes
  sources, performs strategic merging (time-series, topic-based, format
  consolidation), converts unsupported formats (PPTX to PDF, XLSX to CSV), applies
  flat structure with descriptive snake_case names, and optimizes for RAG
  retrieval performance.
---

# NotebookLM Knowledge Base Organizer

Prepares files for optimal use in NotebookLM by selecting and consolidating sources, converting formats, organizing structure, and ensuring compatibility. The primary constraint is NotebookLM's 50-source limit per notebook. When a collection exceeds this limit, systematic scoring, prioritization, and strategic merging reduce the source count without losing valuable information.

## When to Use This Skill

- You have 50+ files and need to optimize for NotebookLM's limit
- Preparing documents for a new NotebookLM notebook
- Converting a messy folder into NotebookLM-ready sources
- Files are in unsupported formats (PPTX, XLSX, complex PDFs)
- Documents exceed 500k words or 200MB per file
- Building a knowledge base for research, projects, or learning
- Large document collections (100-300 files) need intelligent prioritization

## What This Skill Does

1. **Scores and Prioritizes Sources** (when >50 detected) using Relevance, Recency, Uniqueness, and Information Density (each 0-10, for a 0-40 total)
2. **Strategic Merging** via time-series (daily to monthly), topic-based (related papers to comprehensive guides), and format consolidation (slides + transcript to unified PDF)
3. **Converts to Supported Formats** (PPTX to PDF, XLSX to CSV, scanned to OCR)
4. **Applies Flat Structure** with descriptive snake_case naming
5. **Removes Duplicates** across formats
6. **Splits Large Files** exceeding 500k words into parts
7. **Optimizes for RAG** with smaller, focused documents for better retrieval

## NotebookLM Supported Formats

**Supported:**
- PDF (text-selectable, not scanned images)
- Google Docs, Sheets (<100k tokens), Slides (<100 slides)
- Microsoft Word (.docx)
- Text files (.txt, .md)
- Images (PNG, JPEG, TIFF, WEBP)
- Audio (MP3, WAV, AAC, OGG with clear speech)
- URLs (websites, YouTube, Google Drive links)
- Copy-pasted text

**Convert These:**
- PPTX to PDF
- XLSX to CSV or Google Sheets
- Scanned PDFs to OCR text-selectable PDF
- Large Sheets to CSV (<100k tokens)

## File Limits

**Per Source:**
- 500,000 words max
- 200MB file size max
- No page limit (word limit matters)

**Per Notebook (Free):**
- 50 sources maximum -- HARD LIMIT
- 100 notebooks total

Prefer many smaller, focused documents over a few large ones for better RAG retrieval. The 50-source limit is the primary optimization constraint.

IMPORTANT: Preserve original file timestamps during all operations. Timestamps are essential for understanding latest additions, recent meeting minutes, and key decisions. Use `touch -r original converted` after conversions. Include dates in ISO order (year, month, day) in filenames, written as `YYYY_MM_DD` to match the snake_case naming pattern.

## How to Use

```
Prepare these files for NotebookLM - convert formats and organize with descriptive names
```

```
Convert all PPTX and XLSX files to NotebookLM-compatible formats
```

```
Check if any files exceed NotebookLM's 500k word or 200MB limits
```

```
Organize this research folder for a NotebookLM knowledge base
```

```
Find duplicate content across different file formats
```

```
Split this large PDF into NotebookLM-compatible chunks
```

## Instructions

When a user requests NotebookLM organization, follow these steps.

### Step 1: Assess and Prioritize Sources

Count and evaluate before proceeding with any organization.

```bash
total_sources=$(find . -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.txt" -o -name "*.md" -o -name "*.csv" \) | wc -l)
echo "Total sources found: $total_sources"
```

If the total exceeds 50:

1. **Score all sources** using the 4-dimension rubric (Relevance, Recency, Uniqueness, Density, each 0-10). See `references/scoring-system.md` for the full rubric, assessment commands, and batch scoring script.
2. **Rank and select top candidates** using the decision matrix. Target 35-40 auto-keep sources initially. See `references/prioritization-strategy.md` for the selection process and space-based adjustments.
3. **Identify merge candidates** -- find time-series patterns, topic clusters, and multi-format duplicates:

   ```bash
   # Time-series opportunities
   find . -name "*_20[0-9][0-9]_[0-9][0-9]_*" | \
     sed 's/_20[0-9][0-9]_[0-9][0-9]_[0-9][0-9]//' | sort | uniq -c | sort -rn

   # Topic clusters
   find . -type f -name "*.pdf" | xargs -I {} basename {} .pdf | \
     sed 's/_part_[0-9]*//;s/_[0-9][0-9]*$//' | sort | uniq -c | sort -rn | awk '$1 > 2'
   ```

4. **Execute strategic merges** using appropriate patterns. See `references/merging-strategies.md` for time-series, topic-based, and format consolidation scripts. Preserve timestamps on all merged outputs.
5. **Recount and validate** that the final total is at or below 50 (ideally 48 to reserve slots for future additions).

### Step 2: Understand the Scope

Ask clarifying questions:
- What is the topic/purpose of this knowledge base?
- Which directory contains the source materials?
- Target: single notebook or multiple related notebooks?
- Any files that must stay in original format?
- Is this for research, learning, project documentation, or reference?

### Step 3: Analyze Current State

Review files for NotebookLM compatibility:

```bash
find . -type f -exec file {} \;                               # Detect file types
find . -type f -exec du -h {} \; | sort -rh                   # Sizes, largest first
find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn   # Count by extension
for f in *.pdf; do printf '%s\t%s\n' "$(pdftotext "$f" - | wc -w)" "$f"; done  # Word count per PDF
```

Categorize findings:
- **Compatible as-is**: PDF, DOCX, TXT, MD, images
- **Needs conversion**: PPTX, XLSX, XLS, PPT, scanned PDFs
- **Too large**: Files >500k words or >200MB
- **Duplicates**: Same content in different formats
- **Merge candidates**: Sources identified for consolidation in Step 1

### Step 4: Convert Unsupported Formats

**PowerPoint to PDF:**

```bash
for f in *.pptx; do
  soffice --headless --convert-to pdf "$f"
  touch -r "$f" "${f%.pptx}.pdf"   # Preserve timestamp
done
```

**Excel to CSV:**

```bash
for f in *.xlsx; do
  soffice --headless --convert-to csv:"Text - txt - csv (StarCalc)":44,34,UTF8 "$f"
  touch -r "$f" "${f%.xlsx}.csv"   # Preserve timestamp
done
```

This filter typically exports only the first sheet of a workbook, so multi-sheet files need one export per sheet.

**Scanned PDF to Searchable:**

```bash
ocrmypdf input.pdf output_searchable.pdf
touch -r input.pdf output_searchable.pdf   # Preserve timestamp
pdftotext output_searchable.pdf - | wc -w  # Verify text extraction
```

WARNING: Always run `touch -r original converted` after every conversion to preserve the original file timestamp.

### Step 5: Apply Naming

Use this pattern: `category_topic_descriptor_YYYY_MM_DD.ext`

Examples:
- `research_quantum_computing_basics_2025.pdf`
- `meeting_notes_project_kickoff_2026_01_15.txt`
- `client_proposal_acme_corp_final.docx`
- `reference_api_documentation_v2.md`
- `data_sales_figures_q4_2025.csv`

See `references/organization-scripts.md` for the automated naming script. Preserve timestamps when renaming: use `mv` (preserves by default) and verify with `stat`.
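The naming pattern above can be sketched as a small shell helper. This is only an illustration, not the script from `references/organization-scripts.md`: the function name `nlm_name` and its argument order are hypothetical, and it assumes the date is supplied as `YYYY-MM-DD`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill's reference scripts).
# Usage: nlm_name <category> <free-form title> <YYYY-MM-DD> <ext>
nlm_name() {
  local category=$1 title=$2 date=$3 ext=$4
  local slug
  # Lowercase, collapse every run of non-alphanumerics into one underscore,
  # then trim leading and trailing underscores.
  slug=$(printf '%s_%s' "$category" "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/_/g; s/^_+//; s/_+$//')
  # Write the date with underscores so it matches the snake_case pattern.
  printf '%s_%s.%s\n' "$slug" "${date//-/_}" "$ext"
}
```

For example, `mv -n -- notes.txt "$(nlm_name meeting "Notes: Project Kickoff" 2026-01-15 txt)"` would rename the file to `meeting_notes_project_kickoff_2026_01_15.txt` without overwriting an existing file of that name.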
### Step 6: Split Large Documents

For files >500k words or >200MB:

```bash
pdftotext document.pdf - | wc -w                      # Check word count
pdftk large.pdf cat 1-500 output large_part_1.pdf
pdftk large.pdf cat 501-1000 output large_part_2.pdf
touch -r large.pdf large_part_1.pdf large_part_2.pdf  # Preserve timestamps
```

Name parts by content, not arbitrary numbers:
- `annual_report_2025_part_1_executive_summary.pdf`
- `annual_report_2025_part_2_financials.pdf`
- `annual_report_2025_part_3_appendices.pdf`

### Step 7: Consolidation Pass

Perform strategic merging to optimize source count. This step is critical when merge candidates were identified in Step 1 or the collection is near the 50-source limit. Merging is a primary optimization strategy, not a last resort.

Three patterns apply:
- **Time-series**: Combine chronological documents into period summaries (daily to monthly, weekly to quarterly)
- **Topic-based**: Combine related papers/docs into comprehensive guides with chapter markers
- **Format consolidation**: Combine slides + transcript + notes for the same event into a single PDF

See `references/merging-strategies.md` for full merge patterns, scripts (time-series merger, topic-based PDF merger), decision trees, and quality checks.

IMPORTANT: Preserve chronological timestamps in merged content. Add clear date headers within merged files so temporal context is not lost. Log all merge decisions for inclusion in the organization plan.

### Step 8: Implement Flat Structure

NotebookLM works best with flat source lists, no nested folders.

**Before:**

```
docs/
  project/
    planning/
      requirements.pdf
    research/
      background.pdf
  reference/
    api_docs.pdf
```

**After:**

```
notebooklm_sources/
  project_requirements_2026.pdf
  project_background_research.pdf
  reference_api_documentation.pdf
```

See `references/organization-scripts.md` for the implementation script. Preserve timestamps when copying: use `cp -p` to maintain original dates.

### Step 9: Find and Remove Duplicates

```bash
find . -type f -exec md5sum {} + | sort | uniq -w32 -D               # Exact duplicates (GNU tools)
find . -type f -printf '%f\n' | sed 's/\.[^.]*$//' | sort | uniq -d  # Same base name, different format
# Same extracted text: emit "<hash>  <file>" so sorting keeps each pair together
for pdf in *.pdf; do printf '%s  %s\n' "$(pdftotext "$pdf" - | md5sum | cut -d' ' -f1)" "$pdf"; done | sort
```

Decision matrix:
- Same content, different formats: keep PDF (best for NotebookLM)
- Same content, different names: keep most descriptive name
- Slight variations: merge into single document if <500k words
- Truly duplicate: delete older version (check timestamps first)

### Step 10: Optimize for RAG

NotebookLM uses RAG, which works best with focused documents:
- Split 100-page documents into 3-5 topic-focused files
- Separate chapters/sections into individual sources
- Keep each source focused on one topic/subtopic
- Prefer 20-50 pages per PDF over 200+ page megadocs

```
Instead of:
  company_handbook_500_pages.pdf

Create:
  handbook_code_of_conduct.pdf
  handbook_benefits_overview.pdf
  handbook_time_off_policy.pdf
  handbook_remote_work_guidelines.pdf
  handbook_career_development.pdf
```

### Step 11: Propose Organization Plan

Present a plan to the user before making changes. The plan should cover current state, source selection strategy (if >50 sources), proposed structure, changes to make, and a compatibility check.

See `references/organization-plan-template.md` for the full template with sections for prioritization results, merge decisions, and final source count verification.

### Step 12: Execute Organization

After user approval, execute all conversions, merges, renames, and structural changes. Log all operations.

See `references/organization-scripts.md` for the complete execution script with logging and limit verification. Run `touch -r` after every file operation to preserve original timestamps.

### Step 13: Provide Upload Instructions

Provide the user with a summary of organized sources and upload instructions for NotebookLM (direct upload and Google Drive options).

See `references/upload-guide.md` for the full upload instructions template including maintenance guidance.
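Before handing over upload instructions, the source-count and file-size limits from the File Limits section can be re-checked with a sketch like the one below. The function name `check_nlm_limits` is hypothetical, and it is not the script in `references/organization-scripts.md`. It assumes a flat output folder and does not check the 500k-word limit, which needs text extraction.

```shell
#!/usr/bin/env bash
# Hypothetical checker for a flat output folder such as notebooklm_sources/.
# Limits are taken from the File Limits section: 50 sources, 200MB per file.
check_nlm_limits() {
  local dir=$1 max_sources=50 max_mb=200
  local count oversized
  # Count regular files directly inside the folder (flat structure assumed).
  count=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
  # -size +200M matches files above 200 MiB.
  oversized=$(find "$dir" -maxdepth 1 -type f -size +"${max_mb}M" | wc -l | tr -d ' ')
  echo "sources=$count oversized=$oversized"
  # Succeed only when both limits are respected.
  [ "$count" -le "$max_sources" ] && [ "$oversized" -eq 0 ]
}
```

A call such as `check_nlm_limits notebooklm_sources || echo "Fix limits before upload"` prints the counts and flags a folder that still exceeds either limit.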
## Examples

### Example 1: Research Paper Collection

**User**: "Prepare my PhD research papers folder for NotebookLM"

**Process**:
1. Finds 35 PDFs, 12 DOCX, 8 PPTX across nested folders
2. Converts 8 PPTX to PDF (preserves timestamps)
3. Identifies 2 papers >500k words, splits into parts
4. Renames: `smith_2024.pdf` to `research_quantum_entanglement_smith_2024.pdf`
5. Creates flat structure in `phd_research_sources/`
6. Result: 48 sources ready for upload

### Example 2: Company Knowledge Base

**User**: "Convert our company wiki exports to NotebookLM format"

Split single 145-page PDF by section into 7 focused sources:
- `company_overview_history_mission.pdf` (8 pages)
- `company_policies_hr_guidelines.pdf` (28 pages)
- `company_product_documentation.pdf` (45 pages)
- (4 more topic-focused files)

Result: 7 focused sources instead of 1 large doc. Better RAG retrieval.

### Example 3: Excel Data

**User**: "I have 10 Excel files with research data"

Convert each sheet to a separate CSV. Name descriptively: `data_survey_responses_2025.csv`. Create overview doc: `data_overview_methodology.txt`. Preserve timestamps on all conversions.

Result: 10 XLSX to 23 CSV files + 1 overview doc.

### Example 4: Conference Materials

**User**: "Organize my conference materials for a knowledge base"

Input: 12 MP3 recordings, 8 PPTX decks, 15 JPG notes, 5 PDFs.

Keep MP3 as-is (NotebookLM transcribes on upload). Convert PPTX to PDF. Keep JPGs (NotebookLM reads handwriting via OCR). Apply naming: `conf_session_title_speaker_date.ext`. Preserve all timestamps.

Result: 40 sources in flat folder.

### Example 5: Large Collection (200+ Sources)

For a complete workflow handling 200+ sources (e.g., reducing 237 sources to 48 with strategic merging), see `references/large-collection-workflow.md`.
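For a large collection like the one in Example 5, it can help to estimate how many sources a daily-to-monthly time-series merge would fold together before scoring anything. The sketch below is not the workflow from `references/large-collection-workflow.md`. The function name `monthly_merge_groups` is hypothetical, and it assumes snake_case filenames that end in `_YYYY_MM_DD.ext`.

```shell
#!/usr/bin/env bash
# Hypothetical estimator: reads filenames on stdin and prints
# "<count> <series>_<YYYY>_<MM>" for every month that holds more than one file.
monthly_merge_groups() {
  # Keep only names ending in _YYYY_MM_DD.ext and cut them down to series + month.
  sed -nE 's/^(.*)_(20[0-9]{2})_([0-9]{2})_[0-9]{2}\.[^.]+$/\1_\2_\3/p' \
    | sort | uniq -c | awk '$1 > 1 {print $1, $2}'
}
```

Running `ls notebooklm_sources | monthly_merge_groups` lists each month that could collapse into a single merged source, with the number of files it would absorb.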
## Common Patterns

### Academic Research

```
research_[topic]_[author]_[year].pdf
notes_[course]_[topic]_[date].md
textbook_[subject]_chapter_[n]_[title].pdf
```

### Business Projects

```
project_[name]_requirements.pdf
project_[name]_timeline.csv
meeting_[project]_[date]_notes.txt
client_[name]_proposal_final.docx
```

### Learning/Courses

```
course_[name]_lecture_[n]_[topic].pdf
course_[name]_readings_week_[n].pdf
course_[name]_assignment_[n].docx
```

### Personal Knowledge Base

```
article_[topic]_[author]_[date].pdf
book_notes_[title]_[author].md
tutorial_[skill]_[topic].pdf
reference_[tool]_documentation.pdf
```

## Pro Tips

1. **Optimize for Search**: Use descriptive names with search keywords. Good: `tutorial_python_async_programming_advanced.pdf`. Bad: `tutorial_5.pdf`.
2. **Topic-Based Splitting**: Split large docs by topic, not arbitrary page count. Good: `handbook_benefits.pdf`, `handbook_policies.pdf`. Bad: `handbook_part_1.pdf`, `handbook_part_2.pdf`.
3. **Date Formatting**: Use ISO date order (year, month, day) for sortability, written with underscores to match snake_case names. Good: `meeting_notes_2026_02_04.txt`. Bad: `meeting_notes_feb_4_2026.txt`.
4. **Preserve Source Timestamps**: Always maintain original file creation/modification dates. These enable accurate recency scoring and help NotebookLM's RAG weight recent meeting notes, decisions, and additions appropriately. Use `touch -r original converted` after every conversion.
5. **Extract Text from Scans**: Scanned PDFs do not work in NotebookLM. Test with `pdftotext test.pdf - | head`. If the output is blank, run `ocrmypdf input.pdf output.pdf`.
6. **Use Prefixes for Ordering**: Add numeric prefixes for logical ordering: `01_project_overview.pdf`, `02_project_requirements.pdf`.
7. **Test Before Bulk Upload**: Upload 2-3 files first to verify processing, summaries, and search accuracy. Then upload the rest.
## Best Practices Summary

**Source Selection and Optimization:**
- Always assess total source count first before organizing
- Use scoring rubric for objective prioritization (>50 sources)
- Merge strategically as primary optimization, not last resort
- Prefer quality over quantity: 48 great sources over 50 mediocre ones
- Reserve 2-3 slots for future additions
- Do not merge high-value unique sources (score 35+)
- Do not combine unrelated topics just to hit limits

**File Naming:**
- Descriptive snake_case with searchable terms and ISO dates
- Keep under 100 characters, no spaces or special characters
- Use dates instead of version numbers

**Format Selection:**
- PDF for presentations and mixed content
- CSV for spreadsheet data
- DOCX/TXT/MD for text documents
- Always convert PPTX and XLSX before upload

**Timestamp Preservation:**
- Run `touch -r original converted` after every conversion
- Use `cp -p` when copying files to preserve modification dates
- Include ISO dates in filenames for explicit temporal context
- Timestamps drive recency scoring and RAG relevance weighting

**Organization Structure:**
- Flat structure (one folder, all files)
- Descriptive names include folder context
- Stay under 50 sources per notebook

## Implementation Checklist

**Phase 1: Assessment and Prioritization**
- [ ] Identify target notebook topic/purpose
- [ ] Locate all source files and count total
- [ ] If >50: run scoring rubric for all sources
- [ ] If >50: identify and execute strategic merges
- [ ] If >50: select top sources using decision matrix (target 48)
- [ ] Check file formats, note conversions needed
- [ ] Estimate word counts for large files

**Phase 2: Conversion and Organization**
- [ ] Convert unsupported formats (preserve timestamps)
- [ ] Apply descriptive snake_case naming
- [ ] Split large documents by topic
- [ ] Remove duplicates
- [ ] Create flat output directory
- [ ] Verify all files <200MB and <500k words
- [ ] Verify final source count is at or below 50
- [ ] Verify timestamps preserved on all converted/moved files

**Phase 3: Upload and Verification**
- [ ] Document selection strategy in organization plan
- [ ] Test upload 2-3 files
- [ ] Upload remaining sources
- [ ] Verify NotebookLM processing and summaries
- [ ] Test search functionality
- [ ] Confirm all key topics covered despite any source reduction