---
name: docx-advanced-patterns
description: Advanced python-docx patterns for nested tables, complex cells, and content extraction beyond .text property. Techniques for forms, checklists, and complex layouts.
---

# DOCX Advanced Patterns Skill

Specialized patterns for python-docx that handle complex document structures not covered by basic `.text` extraction.

## When to Use This Skill

Invoke this skill when working with DOCX files that have:
- Nested tables within table cells
- Forms with checkbox options
- Complex multi-row cell layouts
- Checklists with embedded options
- Cell content that doesn't appear with `.text` property

**Use alongside** the official `docx` skill for comprehensive document handling.

## Core Pattern: Nested Table Extraction

### Problem

python-docx's `cell.text` property only extracts direct paragraph text - it **does not** traverse nested tables within cells.

**Symptom:**
```python
cell.text  # Returns: '' or '\n'
# But cell visually contains content!
```

### Detection

Check if a cell contains nested tables:

```python
if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction
```

### Solution (Simple)

```python
def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['⁮', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)

    return '\n'.join(text_parts) if text_parts else ''
```

### Solution (Recursive for Deep Nesting)

For documents with multiple levels of table nesting:

```python
def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.

    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['⁮', '☐', '☑', '☒']:
                text_parts.append(para_text)

        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
```

## Usage Examples

### Example 1: Extracting Form Checkbox Options

**Document Structure:**
```
Table Cell contains:
  Nested Table:
    Row 1: "High potential" | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential" | ☐
```

**Extraction:**
```python
from docx import Document

doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]

# Wrong way - returns empty
basic_text = cell.text
print(basic_text)  # Output: '' or '\n'

# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential
```

### Example 2: Processing All Cells in a Table

```python
def process_table_with_nested_content(table):
    """Process all cells, handling nested tables"""
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)

            if content:
                # Process content (translate, analyze, etc.)
                processed = do_something_with(content)
                print(f"Cell content: {processed}")
```

### Example 3: Detecting Nested Tables

```python
def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })

    return nested_cells

# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)

for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
```

## Common Use Cases

### 1. Government Forms

Forms often use nested tables for checkbox grids:

```python
def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}

    for table in doc.tables:
        for row in table.rows:
            # First cell = question
            question = row.cells[0].text.strip()

            # Second cell = checkbox options (nested table)
            if row.cells[1].tables:
                options = extract_cell_content_with_nested_tables(row.cells[1])
                responses[question] = options.split('\n')

    return responses
```

### 2. Evaluation Forms

Extract rating scales and options:

```python
def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []

    for table in doc.tables:
        for row_idx, row in enumerate(table.rows[1:], 1):
            # Get criterion
            criterion = row.cells[0].text.strip()

            # Get rating options (often nested)
            rating_cell = row.cells[1]
            rating_options = extract_cell_content_with_nested_tables(rating_cell)

            evaluations.append({
                'criterion': criterion,
                'options': rating_options.split('\n')
            })

    return evaluations
```

### 3. Complex Data Tables

Extract structured data from cells with nested layouts:

```python
def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }

    # Direct paragraphs
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break

    # Nested table data
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)

    return data
```

## Integration with Official docx Skill

This skill **complements** the official docx skill:

**Official docx skill provides:**
- Document creation (docx-js)
- Basic text extraction (pandoc)
- Tracked changes workflows
- Comment handling
- XML access for complex cases

**This skill provides:**
- Nested table extraction
- Complex cell content handling
- Form and checklist processing
- Advanced content extraction patterns

**Use together:**
```python
# For basic operations: use official skill
from docx import Document

# For nested table handling: use this skill
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')  # Official
for table in doc.tables:            # Official
    for row in table.rows:          # Official
        for cell in row.cells:      # Official
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)
```

## Performance Considerations

**For Large Documents:**

Cache nested table checks:

```python
def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)

    return cache

# Usage
cache = build_nested_table_cache(doc)

for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text
```

## Troubleshooting

### Issue: Extraction returns empty despite visible content

**Diagnosis:**
```python
cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")

if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")
```

**Fix:** Use `extract_cell_content_with_nested_tables(cell)`

### Issue: Checkbox characters (⁮, ☐) appear in output

**Fix:** Filter them out:
```python
text = cell.text.strip()
# Remove checkbox unicode characters
clean_text = text.replace('⁮', '').replace('☐', '').replace('☑', '').replace('☒', '')
```

### Issue: Multi-line content not preserved

**Fix:** Join with newlines:
```python
'\n'.join(text_parts)  # Preserves line structure
```

## Best Practices

1. **Always check for nested tables first:**
   ```python
   if cell.tables:
       content = extract_cell_content_with_nested_tables(cell)
   else:
       content = cell.text
   ```

2. **Handle checkbox characters:**
   ```python
   CHECKBOX_CHARS = ['⁮', '☐', '☑', '☒']
   if text not in CHECKBOX_CHARS:
       # Process text
   ```

3. **Preserve structure:**
   ```python
   # Use newlines to maintain line breaks
   '\n'.join(lines)
   ```

4. **Test with sample documents:**
   ```python
   def test_extraction():
       doc = Document('sample_form.docx')
       cell = doc.tables[0].rows[1].cells[0]

       extracted = extract_cell_content_with_nested_tables(cell)
       assert 'High potential' in extracted
       assert 'Moderate potential' in extracted
   ```

## Reference Implementation

See `REFERENCE.md` for:
- Complete working examples
- Integration patterns
- Advanced recursive extraction
- Performance optimization techniques

## Contributing to Anthropic Skills

This pattern is not currently in the official `docx` skill. If you find it useful, consider contributing:

1. Fork https://github.com/anthropics/skills
2. Add to `document-skills/docx/SKILL.md`
3. Submit pull request with:
   - Pattern description
   - Code examples
   - Use cases

## Success Criteria

Pattern is working if:
- [ ] Cells with nested tables return full content
- [ ] Checkbox options are extracted correctly
- [ ] Form fields are readable
- [ ] No content is lost during extraction
- [ ] Structure is preserved (line breaks maintained)