--- name: xlsx-editor description: Specialized guidance for editing EXISTING Excel files with emphasis on preserving formulas, formatting, and structure across multiple tabs. Use this skill when modifying, updating, or adding data to existing .xlsx files where maintaining integrity is critical—particularly for multi-tab workbooks, complex formulas, formatted tables, and data that must maintain sort order and relationships across sheets. --- # Excel Editor - Editing Existing Files ## Overview This skill provides specialized workflows for editing existing Excel files while preserving their integrity. It addresses the unique challenges of modifying live workbooks: maintaining formula relationships across tabs, preserving formatting and structure, ensuring data insertion maintains sort order, and guaranteeing completeness of multi-tab updates. ## When to Use This Skill Use this skill when: - Modifying existing Excel files (not creating new ones) - Working with multi-tab workbooks where changes must cascade correctly - Adding data that needs to integrate with existing tables while maintaining sort order - Updating formulas that reference cells across multiple sheets - Preserving complex formatting, merged cells, or conditional formatting - Ensuring all related tables across tabs are updated completely ## Critical Principles for Editing Existing Files ### 1. Always Analyze Before Editing Never dive directly into making changes. First understand the file's structure, dependencies, and relationships. ### 2. Preserve, Don't Replace The existing file has established patterns. Maintain its structure, formulas, and formatting conventions rather than imposing new ones. ### 3. Multi-Tab Completeness is Mandatory If data appears in multiple tabs, ALL instances must be updated. Missing one tab is a critical failure. ### 4. Validate Exhaustively After edits, verify formulas, check for blank rows, confirm sort order, and ensure multi-tab consistency. ## Pre-Edit Analysis Workflow **MANDATORY: Always perform this analysis before making any edits.** ```python from openpyxl import load_workbook import pandas as pd # Load file without modifying it wb = load_workbook('existing.xlsx', data_only=False) # 1. INVENTORY ALL SHEETS print("Sheets in workbook:", wb.sheetnames) for sheet_name in wb.sheetnames: sheet = wb[sheet_name] print(f"\n{sheet_name}:") print(f" Dimensions: {sheet.dimensions}") print(f" Max row: {sheet.max_row}, Max col: {sheet.max_column}") # 2. IDENTIFY DATA STRUCTURES # For each sheet, identify: # - Table boundaries (first/last row and column) # - Header rows # - Sort keys (which columns determine order) # - Merged cells for sheet_name in wb.sheetnames: sheet = wb[sheet_name] print(f"\nMerged cells in {sheet_name}:", list(sheet.merged_cells)) # 3. MAP FORMULAS AND DEPENDENCIES # Find all formulas and note cross-sheet references for sheet_name in wb.sheetnames: sheet = wb[sheet_name] formulas = [] for row in sheet.iter_rows(): for cell in row: if cell.value and isinstance(cell.value, str) and cell.value.startswith('='): formulas.append({ 'cell': cell.coordinate, 'formula': cell.value, 'cross_sheet': '!' in cell.value }) if formulas: print(f"\n{sheet_name} has {len(formulas)} formulas") cross_sheet = [f for f in formulas if f['cross_sheet']] if cross_sheet: print(f" {len(cross_sheet)} have cross-sheet references") for f in cross_sheet[:3]: # Show first 3 examples print(f" {f['cell']}: {f['formula']}") # 4. UNDERSTAND SORT ORDER # Read data to understand current ordering for sheet_name in wb.sheetnames: df = pd.read_excel('existing.xlsx', sheet_name=sheet_name) if not df.empty: print(f"\n{sheet_name} current sort:") print(f" First 5 rows of first column: {df.iloc[:5, 0].tolist()}") print(f" Last 5 rows of first column: {df.iloc[-5:, 0].tolist()}") ``` **Document findings before proceeding:** - Which sheets contain related data? - What formulas reference other sheets? - What determines sort order? - Where are merged cells or complex formatting? - What is the established naming/structure pattern? ## Core Editing Workflows ### Workflow 1: Updating Cell Values While Preserving Formulas **Use case:** Changing data values without breaking formula dependencies. ```python from openpyxl import load_workbook # Load with formulas intact wb = load_workbook('existing.xlsx', data_only=False) sheet = wb['SheetName'] # CORRECT: Update value, preserve formula context cell = sheet['A5'] if cell.value and isinstance(cell.value, str) and cell.value.startswith('='): print(f"WARNING: A5 contains formula: {cell.value}") print("Updating this will DELETE the formula. Proceed? (update dependent formulas instead)") else: cell.value = new_value # For cells with formulas, update the formula itself or its dependencies sheet['B10'] = '=SUM(A1:A9)' # Updates formula, preserving calculation logic wb.save('existing.xlsx') ``` **Critical checks:** - Verify you're not overwriting formula cells with values - Check if other cells reference the cell you're updating - Test a few formulas to ensure they still work after the change ### Workflow 2: Adding Rows While Maintaining Structure **Use case:** Inserting new data into existing tables. ```python from openpyxl import load_workbook import pandas as pd # 1. UNDERSTAND CURRENT STRUCTURE df = pd.read_excel('existing.xlsx', sheet_name='Data') # Identify sort key(s) # Example: sorted by date (column A), then by category (column B) # 2. DETERMINE INSERTION POINT # Find where new row belongs in sort order new_data = {'Date': '2024-06-15', 'Category': 'Sales', 'Amount': 1500} insertion_row = None for idx, row in df.iterrows(): if row['Date'] > new_data['Date']: insertion_row = idx + 2 # +2 because: +1 for header, +1 for 1-based indexing break if insertion_row is None: insertion_row = len(df) + 2 # Append at end # 3. INSERT ROW wb = load_workbook('existing.xlsx') sheet = wb['Data'] sheet.insert_rows(insertion_row) # 4. POPULATE WITH CORRECT STRUCTURE # CRITICAL: Copy formatting from adjacent row source_row = insertion_row - 1 if insertion_row > 2 else insertion_row + 1 for col in range(1, sheet.max_column + 1): # Copy formatting source_cell = sheet.cell(source_row, col) target_cell = sheet.cell(insertion_row, col) if source_cell.has_style: target_cell.font = source_cell.font.copy() target_cell.border = source_cell.border.copy() target_cell.fill = source_cell.fill.copy() target_cell.number_format = source_cell.number_format target_cell.alignment = source_cell.alignment.copy() # 5. SET VALUES sheet.cell(insertion_row, 1, new_data['Date']) sheet.cell(insertion_row, 2, new_data['Category']) sheet.cell(insertion_row, 3, new_data['Amount']) # 6. UPDATE FORMULAS THAT REFERENCE THIS RANGE # If there are SUM formulas like =SUM(A2:A100), they may need adjustment # Check all sheets for formulas referencing this sheet for ws_name in wb.sheetnames: ws = wb[ws_name] for row in ws.iter_rows(): for cell in row: if cell.value and isinstance(cell.value, str) and cell.value.startswith('='): if 'Data!' in cell.value: # References the sheet we modified print(f"REVIEW: {ws_name}!{cell.coordinate} formula may need adjustment: {cell.value}") wb.save('existing.xlsx') ``` **Critical checks:** - [ ] New row is in correct sort position - [ ] Formatting matches surrounding rows - [ ] No blank rows introduced - [ ] Formulas referencing the range are still valid ### Workflow 3: Adding Columns While Preserving Formulas **Use case:** Adding new data columns to existing tables. ```python from openpyxl import load_workbook wb = load_workbook('existing.xlsx') sheet = wb['Data'] # 1. IDENTIFY INSERTION POINT # Example: Insert after column C (before column D) insert_col = 4 # Column D # 2. INSERT COLUMN sheet.insert_cols(insert_col) # 3. COPY FORMATTING FROM ADJACENT COLUMN source_col = insert_col - 1 for row_idx in range(1, sheet.max_row + 1): source_cell = sheet.cell(row_idx, source_col) target_cell = sheet.cell(row_idx, insert_col) if source_cell.has_style: target_cell.font = source_cell.font.copy() target_cell.border = source_cell.border.copy() target_cell.fill = source_cell.fill.copy() target_cell.number_format = source_cell.number_format target_cell.alignment = source_cell.alignment.copy() # 4. SET HEADER sheet.cell(1, insert_col, 'New Column Header') # 5. ADD FORMULAS OR DATA for row_idx in range(2, sheet.max_row + 1): # Example: Formula referencing other columns in same row sheet.cell(row_idx, insert_col, f'=B{row_idx}*C{row_idx}') wb.save('existing.xlsx') ``` **Critical checks:** - [ ] Formulas in other columns still reference correct cells - [ ] Cross-sheet references are not broken - [ ] Column widths are appropriate ### Workflow 4: Multi-Tab Data Updates (CRITICAL WORKFLOW) **Use case:** Updating data that appears in multiple sheets (THE most common source of errors). **THE PROBLEM:** When the same entity (customer, product, date, etc.) appears across multiple tabs, ALL instances must be updated. Missing even one tab creates data inconsistency. **THE SOLUTION:** Systematic multi-tab update workflow. ```python from openpyxl import load_workbook import pandas as pd # STEP 1: IDENTIFY ALL OCCURRENCES # Find which sheets contain the data to be updated target_value = "Customer X" # Example: updating customer name occurrences = {} wb = load_workbook('existing.xlsx', data_only=True) # Read values for sheet_name in wb.sheetnames: df = pd.read_excel('existing.xlsx', sheet_name=sheet_name) # Search for target value in all cells found_locations = [] for col in df.columns: mask = df[col].astype(str).str.contains(target_value, case=False, na=False) if mask.any(): locations = df[mask].index.tolist() found_locations.extend([(col, idx+2) for idx in locations]) # +2 for header and 1-based if found_locations: occurrences[sheet_name] = found_locations print(f"Found '{target_value}' in {sheet_name}: {len(found_locations)} locations") if not occurrences: print(f"ERROR: '{target_value}' not found in any sheet!") else: print(f"\nTotal sheets containing '{target_value}': {len(occurrences)}") # STEP 2: UPDATE ALL OCCURRENCES wb = load_workbook('existing.xlsx', data_only=False) # Now load with formulas new_value = "Customer X - Updated" update_count = 0 for sheet_name, locations in occurrences.items(): sheet = wb[sheet_name] for col_name, row_idx in locations: # Convert column name to index df = pd.read_excel('existing.xlsx', sheet_name=sheet_name, nrows=0) col_idx = list(df.columns).index(col_name) + 1 cell = sheet.cell(row_idx, col_idx) # Check if it's a formula if cell.value and isinstance(cell.value, str) and cell.value.startswith('='): print(f"WARNING: {sheet_name}!{cell.coordinate} is a formula. Skipping.") else: cell.value = new_value update_count += 1 print(f"Updated {sheet_name}!{cell.coordinate}") print(f"\nTotal cells updated: {update_count}") # STEP 3: VERIFY ALL SHEETS WERE UPDATED print("\nVerifying updates...") wb.save('existing_updated.xlsx') wb_verify = load_workbook('existing_updated.xlsx', data_only=True) for sheet_name in wb_verify.sheetnames: df = pd.read_excel('existing_updated.xlsx', sheet_name=sheet_name) old_count = df.astype(str).apply(lambda x: x.str.contains(target_value, case=False, na=False)).sum().sum() new_count = df.astype(str).apply(lambda x: x.str.contains(new_value, case=False, na=False)).sum().sum() if old_count > 0: print(f"❌ {sheet_name}: Still has {old_count} instances of old value!") else: print(f"✅ {sheet_name}: All instances updated (found {new_count} new values)") ``` **Critical checks:** - [ ] ALL sheets searched, not just obvious ones - [ ] Update count matches expected number of occurrences - [ ] Verification confirms zero instances of old value remain - [ ] No formulas were accidentally overwritten ### Workflow 5: Cascading Formula Updates Across Tabs **Use case:** Changing a formula that must be propagated to other sheets. ```python from openpyxl import load_workbook wb = load_workbook('existing.xlsx') # EXAMPLE: Update formula pattern across all sheets # Old formula: =SUM(B2:B10) # New formula: =SUMIF(B2:B10,">0") # Only sum positive values for sheet_name in wb.sheetnames: sheet = wb[sheet_name] changes_in_sheet = 0 for row in sheet.iter_rows(): for cell in row: if cell.value and isinstance(cell.value, str): if cell.value.startswith('=SUM(') and not 'SUMIF' in cell.value: # Extract range from SUM formula old_formula = cell.value # Example: =SUM(B2:B10) -> =SUMIF(B2:B10,">0") range_part = old_formula[5:-1] # Extract "B2:B10" new_formula = f'=SUMIF({range_part},">0")' cell.value = new_formula changes_in_sheet += 1 print(f"{sheet_name}!{cell.coordinate}: {old_formula} -> {new_formula}") if changes_in_sheet > 0: print(f" {sheet_name}: {changes_in_sheet} formulas updated") wb.save('existing.xlsx') ``` **Critical checks:** - [ ] Formula pattern correctly identified across all sheets - [ ] New formula syntax is valid - [ ] All sheets were checked (not just the first few) - [ ] Edge cases handled (formulas with different ranges, nested functions) ### Workflow 6: Handling Merged Cells **Use case:** Working with tables that have merged cells (common in headers/formatting). ```python from openpyxl import load_workbook wb = load_workbook('existing.xlsx') sheet = wb['Data'] # 1. IDENTIFY MERGED CELLS merged_ranges = list(sheet.merged_cells.ranges) print(f"Found {len(merged_ranges)} merged cell ranges:") for merged_range in merged_ranges: print(f" {merged_range}") # 2. WHEN INSERTING ROWS/COLUMNS NEAR MERGED CELLS # Merged cells will NOT expand automatically. Must handle manually. # Example: Inserting row in middle of merged region insert_row = 5 # Check if insertion point is in a merged range for merged_range in merged_ranges: if merged_range.min_row <= insert_row <= merged_range.max_row: print(f"WARNING: Row {insert_row} is within merged range {merged_range}") print("Unmerging, inserting, then re-merging with expanded range") # Unmerge sheet.unmerge_cells(str(merged_range)) # Insert row sheet.insert_rows(insert_row) # Re-merge with expanded range new_range = f"{merged_range.min_col}{merged_range.min_row}:{merged_range.max_col}{merged_range.max_row+1}" sheet.merge_cells(new_range) wb.save('existing.xlsx') ``` **Critical checks:** - [ ] Merged cells are preserved or appropriately expanded - [ ] Values in merged cells are maintained - [ ] Formatting of merged cells is preserved ## Post-Edit Validation Checklist **MANDATORY: Complete this checklist after ANY edits.** ### 1. Formula Integrity Check ```python # Run recalc.py to identify formula errors import subprocess import json result = subprocess.run(['python', 'recalc.py', 'existing.xlsx'], capture_output=True, text=True) recalc_result = json.loads(result.stdout) if recalc_result['status'] == 'errors_found': print(f"❌ ERRORS FOUND: {recalc_result['total_errors']} formula errors") for error_type, details in recalc_result['error_summary'].items(): print(f"\n{error_type}: {details['count']} occurrences") for location in details['locations'][:5]: # Show first 5 print(f" - {location}") else: print("✅ All formulas calculated successfully (zero errors)") ``` - [ ] Zero formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?) - [ ] All formulas return expected values - [ ] Cross-sheet references still work ### 2. Structural Integrity Check ```python from openpyxl import load_workbook import pandas as pd wb = load_workbook('existing.xlsx') # Check for blank rows in data tables for sheet_name in wb.sheetnames: df = pd.read_excel('existing.xlsx', sheet_name=sheet_name) # Identify blank rows (all NaN) blank_rows = df[df.isnull().all(axis=1)] if not blank_rows.empty and len(blank_rows) > 0: print(f"⚠️ {sheet_name}: Found {len(blank_rows)} blank rows at indices: {blank_rows.index.tolist()}") # Check for unexpected blank cells in key columns (first column) if not df.empty and len(df.columns) > 0: blank_cells = df[df.iloc[:, 0].isnull()] if not blank_cells.empty and len(blank_cells) > 0: print(f"⚠️ {sheet_name}: First column has {len(blank_cells)} blank cells") ``` - [ ] No unexpected blank rows in data tables - [ ] No blank cells in key identifier columns - [ ] Table boundaries are clean (no trailing empty rows/columns) ### 3. Sort Order Verification ```python # Verify data maintains expected sort order df = pd.read_excel('existing.xlsx', sheet_name='Data') # Example: Should be sorted by Date ascending, then Category ascending expected_sort_cols = ['Date', 'Category'] # Check if sorted is_sorted = df[expected_sort_cols].equals(df[expected_sort_cols].sort_values(expected_sort_cols)) if is_sorted: print("✅ Data is in correct sort order") else: print("❌ Data is NOT sorted correctly") print("First 10 rows of sort columns:") print(df[expected_sort_cols].head(10)) ``` - [ ] Data maintains expected sort order - [ ] New rows inserted at correct position - [ ] Sort keys (if any) are consistent ### 4. Multi-Tab Consistency Check ```python # Verify related data is consistent across tabs # Example: Check that customer names are consistent across Summary and Detail sheets df_summary = pd.read_excel('existing.xlsx', sheet_name='Summary') df_detail = pd.read_excel('existing.xlsx', sheet_name='Detail') customers_summary = set(df_summary['Customer'].dropna().unique()) customers_detail = set(df_detail['Customer'].dropna().unique()) in_summary_not_detail = customers_summary - customers_detail in_detail_not_summary = customers_detail - customers_summary if in_summary_not_detail: print(f"⚠️ Customers in Summary but not Detail: {in_summary_not_detail}") if in_detail_not_summary: print(f"⚠️ Customers in Detail but not Summary: {in_detail_not_summary}") if not in_summary_not_detail and not in_detail_not_summary: print("✅ Customer lists are consistent across tabs") ``` - [ ] Related data exists in ALL expected sheets - [ ] Counts/totals are consistent across sheets - [ ] No orphaned data (exists in one sheet but not related sheets) ### 5. Formatting Preservation Check ```python # Visual inspection is often needed for formatting # But can programmatically check some basics from openpyxl import load_workbook wb = load_workbook('existing.xlsx') sheet = wb['Data'] # Check that new rows have formatting for row_idx in [5, 10, 15]: # Example rows you inserted cell = sheet.cell(row_idx, 1) has_font = cell.font is not None and cell.font != sheet.cell(1, 1).font has_fill = cell.fill is not None and cell.fill.start_color.index != '00000000' print(f"Row {row_idx}: Font={has_font}, Fill={has_fill}") ``` - [ ] New rows/columns match formatting of surrounding cells - [ ] Borders are continuous (no breaks in table borders) - [ ] Merged cells are intact - [ ] Conditional formatting still applies correctly ## Common Pitfalls and How to Avoid Them ### Pitfall 1: Incomplete Multi-Tab Updates **Problem:** Updating data in Sheet1 but forgetting it also appears in Sheet2 and Sheet3. **Solution:** Always search ALL sheets before updating. Use the Multi-Tab Data Updates workflow. **Prevention:** ```python # Before ANY update, run this search search_term = "value_to_update" for sheet_name in wb.sheetnames: df = pd.read_excel('file.xlsx', sheet_name=sheet_name) found = df.astype(str).apply(lambda x: x.str.contains(search_term, case=False, na=False)).any().any() if found: print(f"'{search_term}' found in {sheet_name}") ``` ### Pitfall 2: Breaking Formulas When Inserting Rows **Problem:** Inserting a row causes formulas like =SUM(A2:A10) to break or not include the new row. **Solution:** After inserting rows, check if range-based formulas need adjustment. **Prevention:** - Insert rows WITHIN existing ranges (not before or after) - Use entire column references when possible: =SUM(A:A) instead of =SUM(A2:A10) - After insertion, verify formula ranges programmatically ### Pitfall 3: Inserting Data Out of Sort Order **Problem:** New row appears at the end instead of in alphabetical/chronological order. **Solution:** Always determine the correct insertion point based on sort keys BEFORE inserting. **Prevention:** ```python # Determine insertion point based on sort logic insertion_row = None for idx, row in df.iterrows(): if should_come_before(new_data, row): # Define your comparison logic insertion_row = idx + 2 break ``` ### Pitfall 4: Introducing Blank Rows **Problem:** Editing operations create unexpected blank rows in tables. **Solution:** Never use append() for mid-table insertions. Always use insert_rows() with calculated position. **Prevention:** Always run blank row check in post-edit validation. ### Pitfall 5: Overwriting Formulas with Values **Problem:** Updating a cell that contains a formula, replacing it with a hardcoded value. **Solution:** Always check if cell contains a formula before updating. **Prevention:** ```python if cell.value and isinstance(cell.value, str) and cell.value.startswith('='): raise ValueError(f"{cell.coordinate} contains formula. Cannot overwrite with value.") ``` ### Pitfall 6: Not Recalculating After Formula Changes **Problem:** Formulas are updated but values shown are stale (from before the edit). **Solution:** ALWAYS run recalc.py after any formula modifications. **Prevention:** Make recalc.py part of your standard workflow. Never skip it. ### Pitfall 7: Losing Formatting on New Rows/Columns **Problem:** New rows have default formatting instead of matching the table style. **Solution:** Always copy formatting from adjacent rows/columns. **Prevention:** Use the formatting copy code from Workflow 2 and 3 above. ## Summary: Editing Workflow Template For any editing task, follow this sequence: 1. **Analyze** (Pre-Edit Analysis Workflow) - Map all sheets - Identify dependencies - Document structure 2. **Plan** - Determine which sheets need changes - Identify formula impacts - Calculate insertion points 3. **Execute** - Use appropriate workflow - Update ALL relevant sheets - Preserve formatting 4. **Validate** (Post-Edit Validation Checklist) - Check formulas (zero errors) - Verify structure (no blank rows) - Confirm sort order - Ensure multi-tab consistency - Validate formatting 5. **Recalculate** - Run recalc.py - Fix any errors found - Run recalc.py again until zero errors **Never skip steps.** Each step catches different classes of errors.