---
name: docx-advanced-patterns
description: Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts.
version: 1.0.0
dependencies:
  - python>=3.8
  - python-docx>=0.8.11
---

# DOCX Advanced Patterns Skill

Specialized patterns for python-docx that handle complex document structures not covered by basic `.text` extraction.

## When to Use This Skill

Invoke this skill when working with DOCX files that have:

- Nested tables within table cells
- Forms with checkbox options
- Complex multi-row cell layouts
- Checklists with embedded options
- Cell content that doesn't appear with the `.text` property

**Use alongside** the official `docx` skill for comprehensive document handling.

## Core Pattern: Nested Table Extraction

### Problem

python-docx's `cell.text` property returns only the text of paragraphs that sit directly in the cell - it **does not** traverse nested tables within cells.

**Symptom:**

```python
cell.text  # Returns: '' or '\n'
# But cell visually contains content!
```

### Detection

Check if a cell contains nested tables:

```python
if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction
```

### Solution (Simple)

```python
def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)

    return '\n'.join(text_parts) if text_parts else ''
```

### Solution (Recursive for Deep Nesting)

For documents with multiple levels of table nesting:

```python
def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.
    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['', '☐', '☑', '☒']:
                text_parts.append(para_text)

        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
```

## Usage Examples

### Example 1: Extracting Form Checkbox Options

**Document Structure:**

```
Table Cell contains:
  Nested Table:
    Row 1: "High potential"     | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential"      | ☐
```

**Extraction:**

```python
from docx import Document

doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]

# Wrong way - returns empty
basic_text = cell.text
print(basic_text)  # Output: '' or '\n'

# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential
```

### Example 2: Processing All Cells in a Table

```python
def process_table_with_nested_content(table):
    """Process all cells, handling nested tables"""
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)

            if content:
                # Process content (translate, analyze, etc.)
                # do_something_with is a placeholder for your own processing step
                processed = do_something_with(content)
                print(f"Cell content: {processed}")
```

### Example 3: Detecting Nested Tables

```python
from docx import Document


def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })

    return nested_cells


# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)
for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
```

## Common Use Cases

### 1. Government Forms

Forms often use nested tables for checkbox grids:

```python
def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}

    for table in doc.tables:
        for row in table.rows:
            # Skip rows that have no second cell to hold options
            if len(row.cells) < 2:
                continue

            # First cell = question
            question = row.cells[0].text.strip()

            # Second cell = checkbox options (nested table)
            if row.cells[1].tables:
                options = extract_cell_content_with_nested_tables(row.cells[1])
                responses[question] = options.split('\n')

    return responses
```

### 2. Evaluation Forms

Extract rating scales and options:

```python
def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []

    for table in doc.tables:
        # Skip the header row
        for row in table.rows[1:]:
            # Skip rows that have no second cell to hold ratings
            if len(row.cells) < 2:
                continue

            # Get criterion
            criterion = row.cells[0].text.strip()

            # Get rating options (often nested)
            rating_cell = row.cells[1]
            rating_options = extract_cell_content_with_nested_tables(rating_cell)

            evaluations.append({
                'criterion': criterion,
                'options': rating_options.split('\n')
            })

    return evaluations
```

### 3. Complex Data Tables

Extract structured data from cells with nested layouts:

```python
def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }

    # Direct paragraphs: keep the first non-empty one as the main content
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break

    # Nested table data: one list of cell texts per nested row
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)

    return data
```

## Integration with Official docx Skill

This skill **complements** the official docx skill:

**Official docx skill provides:**

- Document creation (docx-js)
- Basic text extraction (pandoc)
- Tracked changes workflows
- Comment handling
- XML access for complex cases

**This skill provides:**

- Nested table extraction
- Complex cell content handling
- Form and checklist processing
- Advanced content extraction patterns

**Use together:**

```python
# For basic operations: use official skill
from docx import Document

# For nested table handling: use this skill
# (assumes the helpers above are saved in a local module named docx_advanced)
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')  # Official
for table in doc.tables:  # Official
    for row in table.rows:  # Official
        for cell in row.cells:  # Official
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)
```

## Performance Considerations

**For Large Documents:**

Cache nested table checks:

```python
def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)
    return cache


# Usage
cache = build_nested_table_cache(doc)
for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text
```

## Troubleshooting

### Issue: Extraction returns empty despite visible content

**Diagnosis:**

```python
cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")

if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")
```

**Fix:** Use `extract_cell_content_with_nested_tables(cell)`

### Issue: Checkbox characters (such as ☐) appear in output

**Fix:** Filter them out:

```python
text = cell.text.strip()
# Remove checkbox unicode characters
clean_text = text.replace('☐', '').replace('☑', '').replace('☒', '')
```

### Issue: Multi-line content not preserved

**Fix:** Join with newlines:

```python
'\n'.join(text_parts)  # Preserves line structure
```

## Best Practices

1. **Always check for nested tables first:**

   ```python
   if cell.tables:
       content = extract_cell_content_with_nested_tables(cell)
   else:
       content = cell.text
   ```

2. **Handle checkbox characters:**

   ```python
   CHECKBOX_CHARS = ['', '☐', '☑', '☒']
   if text not in CHECKBOX_CHARS:
       ...  # Process text
   ```

3. **Preserve structure:**

   ```python
   # Use newlines to maintain line breaks
   '\n'.join(lines)
   ```

4. **Test with sample documents:**

   ```python
   def test_extraction():
       doc = Document('sample_form.docx')
       cell = doc.tables[0].rows[1].cells[0]
       extracted = extract_cell_content_with_nested_tables(cell)
       assert 'High potential' in extracted
       assert 'Moderate potential' in extracted
   ```

## Reference Implementation

See `REFERENCE.md` for:

- Complete working examples
- Integration patterns
- Advanced recursive extraction
- Performance optimization techniques

## Contributing to Anthropic Skills

This pattern is not currently in the official `docx` skill. If you find it useful, consider contributing:

1. Fork https://github.com/anthropics/skills
2. Add to `document-skills/docx/SKILL.md`
3. Submit pull request with:
   - Pattern description
   - Code examples
   - Use cases

## Success Criteria

Pattern is working if:

- [ ] Cells with nested tables return full content
- [ ] Checkbox options are extracted correctly
- [ ] Form fields are readable
- [ ] No content is lost during extraction
- [ ] Structure is preserved (line breaks maintained)
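
The criteria above can also be checked without a sample `.docx` file. The following is a minimal sketch, not part of the original skill: `extract_text` is a condensed restatement of the recursive extractor, and `fake_cell` and `fake_table` are hypothetical stand-ins that expose only the attributes the extractor reads (`paragraphs`, `tables`, `rows`, `cells`, `text`). It assumes real python-docx cells behave the same way for those attributes, so a passing run here is a quick sanity check of the logic rather than proof against a real document.

```python
from types import SimpleNamespace

CHECKBOX_CHARS = {'☐', '☑', '☒'}


def extract_text(cell):
    """Collect paragraph text, then descend into nested tables (any depth)."""
    parts = []
    for para in cell.paragraphs:
        text = para.text.strip()
        if text and text not in CHECKBOX_CHARS:
            parts.append(text)
    for table in cell.tables:
        for row in table.rows:
            for nested_cell in row.cells:
                nested = extract_text(nested_cell)
                if nested:
                    parts.append(nested)
    return '\n'.join(parts)


# Hypothetical stand-ins (not python-docx classes) for cells and tables.
def fake_cell(*texts, tables=()):
    paragraphs = [SimpleNamespace(text=t) for t in texts]
    return SimpleNamespace(paragraphs=paragraphs, tables=list(tables))


def fake_table(*rows):
    return SimpleNamespace(rows=[SimpleNamespace(cells=list(r)) for r in rows])


# One level of nesting: a checkbox list inside an otherwise empty cell.
options = fake_table(
    (fake_cell('High potential'), fake_cell('☐')),
    (fake_cell('Moderate potential'), fake_cell('☐')),
    (fake_cell('Low potential'), fake_cell('☐')),
)
form_cell = fake_cell(tables=[options])
form_lines = extract_text(form_cell).split('\n')

# Two levels of nesting: a table inside a cell of another nested table.
inner = fake_table((fake_cell('Inner label'), fake_cell('☑')))
middle = fake_table((fake_cell('Outer label', tables=[inner]),))
deep_cell = fake_cell('Heading', tables=[middle])
deep_lines = extract_text(deep_cell).split('\n')

# Full content returned, checkbox glyphs dropped, one item per line.
assert form_lines == ['High potential', 'Moderate potential', 'Low potential']
assert deep_lines == ['Heading', 'Outer label', 'Inner label']
```

Because `extract_text` relies only on attribute access, the same function can be pointed at a real `_Cell` once a sample form is available.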