---
name: docx-advanced-patterns
description: Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts.
version: 1.0.0
dependencies:
  - python>=3.8
  - python-docx>=0.8.11
---

# DOCX Advanced Patterns Skill

Specialized patterns for python-docx that handle complex document structures not covered by basic `.text` extraction.

## When to Use This Skill

Invoke this skill when working with DOCX files that have:
- Nested tables within table cells
- Forms with checkbox options
- Complex multi-row cell layouts
- Checklists with embedded options
- Cell content that the `.text` property does not return

**Use alongside** the official `docx` skill for comprehensive document handling.

## Core Pattern: Nested Table Extraction

### Problem

python-docx's `cell.text` property joins the text of the cell's own paragraphs only; it **does not** descend into tables nested inside the cell.

**Symptom:**
```python
cell.text  # Returns: '' or '\n'
# But cell visually contains content!
```

### Detection

Check if a cell contains nested tables:

```python
if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction
```

### Solution (Simple)

```python
def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['⁮', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)

    return '\n'.join(text_parts) if text_parts else ''
```

### Solution (Recursive for Deep Nesting)

For documents with multiple levels of table nesting:

```python
def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.

    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['⁮', '☐', '☑', '☒']:
                text_parts.append(para_text)

        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
```

## Usage Examples

### Example 1: Extracting Form Checkbox Options

**Document Structure:**
```
Table Cell contains:
  Nested Table:
    Row 1: "High potential" | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential" | ☐
```

**Extraction:**
```python
from docx import Document

doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]

# Wrong way - returns empty
basic_text = cell.text
print(basic_text)  # Output: '' or '\n'

# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential
```

### Example 2: Processing All Cells in a Table

```python
def process_table_with_nested_content(table):
    """Process all cells, handling nested tables"""
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)

            if content:
                # Process content (translate, analyze, etc.).
                # do_something_with is a placeholder for your own function.
                processed = do_something_with(content)
                print(f"Cell content: {processed}")
```
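
One caveat when looping over `row.cells`: python-docx reports a merged cell once for every grid position it spans, so a loop like the one above can process the same content more than once. The following is a minimal sketch of one way to skip the repeats. It assumes that the private `_tc` attribute (the underlying `<w:tc>` element) is an acceptable identity key; `iter_unique_cells` is a hypothetical helper name, not part of python-docx.

```python
def iter_unique_cells(table):
    """Yield each distinct cell of a table once, in row-major order.

    Assumes python-docx-style objects: `table.rows`, `row.cells`, and a
    private `cell._tc` element shared by every proxy of a merged cell.
    """
    seen = {}  # id(element) -> element; storing the element keeps its id stable
    for row in table.rows:
        for cell in row.cells:
            tc = cell._tc
            if id(tc) in seen:
                continue  # same merged cell reported at another grid position
            seen[id(tc)] = tc
            yield cell
```

With this helper, the nested loops in `process_table_with_nested_content` could be replaced by `for cell in iter_unique_cells(table):`. Because `_tc` is private, this may need adjusting if a future python-docx release changes its internals.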

### Example 3: Detecting Nested Tables

```python
def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })

    return nested_cells

# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)

for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
```
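
`analyze_document_structure` inspects only the cells of top-level tables, so a table nested two or more levels deep is not reported. A possible recursive variant is sketched below. The function name `find_nested_tables` and the `path` and `depth` fields are illustrative choices, not part of python-docx.

```python
def find_nested_tables(tables, depth=0, path=()):
    """Recursively list every cell that contains nested tables.

    Assumes python-docx-style objects (`table.rows`, `row.cells`,
    `cell.tables`). Each entry records:
      path  - one (table, row, col) index triple per nesting level
      depth - nesting level of the tables found (1 = inside a top-level table)
    """
    found = []
    for t_idx, table in enumerate(tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if not cell.tables:
                    continue
                here = path + ((t_idx, r_idx, c_idx),)
                found.append({
                    'path': here,
                    'depth': depth + 1,
                    'nested_count': len(cell.tables),
                })
                # Descend into the nested tables of this cell
                found.extend(find_nested_tables(cell.tables, depth + 1, here))
    return found
```

Call it as `find_nested_tables(doc.tables)`. A large maximum `depth` in the result suggests that the recursive extractor is the safer choice for that document.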

## Common Use Cases

### 1. Government Forms

Forms often use nested tables for checkbox grids:

```python
def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}

    for table in doc.tables:
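        for row in table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # need a question cell and an options cell

            # First cell = question
            question = cells[0].text.strip()

            # Second cell = checkbox options (nested table)
            if cells[1].tables:
                options = extract_cell_content_with_nested_tables(cells[1])
                # ''.split('\n') would give [''], so use [] when nothing was found
                responses[question] = options.split('\n') if options else []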

    return responses
```
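
`extract_form_responses` collects the option labels but not which option is ticked. Here is a minimal sketch of reading the state as well. It assumes the layout described earlier (label in column 0, checkbox in column 1) and that the state is stored as a literal glyph in the cell text. Checkboxes built from form fields or content controls may not expose their state this way. `extract_checkbox_states` and the two mark sets are hypothetical names.

```python
CHECKED_MARKS = {'☑', '☒'}   # assumption: either glyph means "ticked"
UNCHECKED_MARKS = {'☐'}

def extract_checkbox_states(cell):
    """Return (label, checked) pairs from label/checkbox tables nested in a cell.

    `checked` is True, False, or None when the glyph is not recognised.
    """
    results = []
    for nested_table in cell.tables:
        for nested_row in nested_table.rows:
            cells = nested_row.cells
            if len(cells) < 2:
                continue  # not a label/checkbox row
            label = cells[0].text.strip()
            mark = cells[1].text.strip()
            if not label:
                continue
            if mark in CHECKED_MARKS:
                checked = True
            elif mark in UNCHECKED_MARKS:
                checked = False
            else:
                checked = None  # unknown glyph, e.g. a symbol-font character
            results.append((label, checked))
    return results
```

Returning `None` for unrecognised glyphs keeps "unknown" distinct from "unticked", which is useful when a form uses a symbol font whose characters do not map to the standard box glyphs.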

### 2. Evaluation Forms

Extract rating scales and options:

```python
def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []

    for table in doc.tables:
        for row in table.rows[1:]:  # assumes the first row is a header
            cells = row.cells
            if len(cells) < 2:
                continue  # need a criterion cell and a rating cell

            # Get criterion
            criterion = cells[0].text.strip()

            # Get rating options (often nested)
            rating_options = extract_cell_content_with_nested_tables(cells[1])

            evaluations.append({
                'criterion': criterion,
                # ''.split('\n') would give [''], so use [] when nothing was found
                'options': rating_options.split('\n') if rating_options else []
            })

    return evaluations
```

### 3. Complex Data Tables

Extract structured data from cells with nested layouts:

```python
def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }

    # Direct paragraphs
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break

    # Nested table data
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)

    return data
```

## Integration with Official docx Skill

This skill **complements** the official docx skill:

**Official docx skill provides:**
- Document creation (docx-js)
- Basic text extraction (pandoc)
- Tracked changes workflows
- Comment handling
- XML access for complex cases

**This skill provides:**
- Nested table extraction
- Complex cell content handling
- Form and checklist processing
- Advanced content extraction patterns

**Use together:**
```python
# For basic operations: use official skill
from docx import Document

# For nested table handling: use this skill
# (docx_advanced is a placeholder for the local module holding these functions)
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')  # Official
for table in doc.tables:            # Official
    for row in table.rows:          # Official
        for cell in row.cells:      # Official
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)
```

## Performance Considerations

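**For Large Documents:**

If the same document is queried more than once, pre-compute which cells contain nested tables and reuse the result. For a single pass the cache saves nothing, because building it already walks every cell: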

```python
def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)

    return cache

# Usage
cache = build_nested_table_cache(doc)

for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text
```

## Troubleshooting

### Issue: Extraction returns empty despite visible content

**Diagnosis:**
```python
cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")

if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")
```

**Fix:** Use `extract_cell_content_with_nested_tables(cell)`

### Issue: Checkbox characters (⁮, ☐) appear in output

**Fix:** Filter them out:
```python
text = cell.text.strip()
# Remove checkbox unicode characters
clean_text = text.replace('⁮', '').replace('☐', '').replace('☑', '').replace('☒', '')
```
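
The chained `replace` calls can also be expressed with a translation table, which removes all listed glyphs in one pass and is easier to extend. This is a sketch only; `strip_checkbox_chars` is a hypothetical helper name, and the character set simply mirrors the one used throughout this skill.

```python
# Characters treated as checkbox glyphs (same set as in the filters above)
CHECKBOX_CHARS = '⁮☐☑☒'

# Translation table that deletes every character in CHECKBOX_CHARS
_STRIP_TABLE = str.maketrans('', '', CHECKBOX_CHARS)

def strip_checkbox_chars(text):
    """Remove checkbox glyphs anywhere in `text`, then trim whitespace."""
    return text.translate(_STRIP_TABLE).strip()
```

Unlike the `not in [...]` membership test used in the extractors, this also cleans labels where the glyph shares a cell with the text, such as `'☐ High potential'`.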

### Issue: Multi-line content not preserved

**Fix:** Join with newlines:
```python
'\n'.join(text_parts)  # Preserves line structure
```

## Best Practices

1. **Always check for nested tables first:**
   ```python
   if cell.tables:
       content = extract_cell_content_with_nested_tables(cell)
   else:
       content = cell.text
   ```

2. **Handle checkbox characters:**
   ```python
   CHECKBOX_CHARS = ['⁮', '☐', '☑', '☒']
   if text not in CHECKBOX_CHARS:
       text_parts.append(text)  # keep real labels, skip lone checkbox glyphs
   ```

3. **Preserve structure:**
   ```python
   # Use newlines to maintain line breaks
   '\n'.join(lines)
   ```

4. **Test with sample documents:**
   ```python
   def test_extraction():
       doc = Document('sample_form.docx')
       cell = doc.tables[0].rows[1].cells[0]

       extracted = extract_cell_content_with_nested_tables(cell)
       assert 'High potential' in extracted
       assert 'Moderate potential' in extracted
   ```

## Reference Implementation

See `REFERENCE.md` for:
- Complete working examples
- Integration patterns
- Advanced recursive extraction
- Performance optimization techniques

## Contributing to Anthropic Skills

This pattern is not currently in the official `docx` skill. If you find it useful, consider contributing:

1. Fork https://github.com/anthropics/skills
2. Add to `document-skills/docx/SKILL.md`
3. Submit pull request with:
   - Pattern description
   - Code examples
   - Use cases

## Success Criteria

Pattern is working if:
- [ ] Cells with nested tables return full content
- [ ] Checkbox options are extracted correctly
- [ ] Form fields are readable
- [ ] No content is lost during extraction
- [ ] Structure is preserved (line breaks maintained)