--- name: xlsx description: "Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas" --- # Requirements for Outputs ## All Excel files ### Zero Formula Errors - Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?) ### Preserve Existing Templates (when updating templates) - Study and EXACTLY match existing format, style, and conventions when modifying files - Never impose standardized formatting on files with established patterns - Existing template conventions ALWAYS override these guidelines ## Financial models ### Color Coding Standards Unless otherwise stated by the user or existing template #### Industry-Standard Color Conventions - **Blue text (RGB: 0,0,255)**: Hardcoded inputs, and numbers users will change for scenarios - **Black text (RGB: 0,0,0)**: ALL formulas and calculations - **Green text (RGB: 0,128,0)**: Links pulling from other worksheets within same workbook - **Red text (RGB: 255,0,0)**: External links to other files - **Yellow background (RGB: 255,255,0)**: Key assumptions needing attention or cells that need to be updated ### Number Formatting Standards #### Required Format Rules - **Years**: Format as text strings (e.g., "2024" not "2,024") - **Currency**: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)") - **Zeros**: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-") - **Percentages**: Default to 0.0% format (one decimal) - **Multiples**: Format as 0.0x for valuation multiples (EV/EBITDA, P/E) - **Negative numbers**: Use parentheses (123) not minus -123 ### Formula Construction Rules #### Assumptions Placement - Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells - Use cell references instead of hardcoded values in formulas - Example: Use =B5*(1+$B$6) instead of =B5*1.05 #### Formula Error Prevention - Verify all cell references are correct - Check for off-by-one errors in ranges - Ensure consistent formulas across all projection periods - Test with edge cases (zero values, negative numbers) - Verify no unintended circular references #### Documentation Requirements for Hardcodes - Comment or in cells beside (if end of table). Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]" - Examples: - "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]" - "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]" - "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity" - "Source: FactSet, 8/20/2025, Consensus Estimates Screen" # XLSX creation, editing, and analysis ## Overview A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks. ## Important Requirements **LibreOffice Required for Formula Recalculation**: You can assume LibreOffice is installed for recalculating formula values using the `recalc.py` script. The script automatically configures LibreOffice on first run ## Reading and analyzing data ### Data analysis with pandas For data analysis, visualization, and basic operations, use **pandas** which provides powerful data manipulation capabilities: ```python import pandas as pd # Read Excel df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict # Analyze df.head() # Preview data df.info() # Column info df.describe() # Statistics # Write Excel df.to_excel('output.xlsx', index=False) ``` ## Excel File Workflows ## CRITICAL: Use Formulas, Not Hardcoded Values **Always use Excel formulas instead of calculating values in Python and hardcoding them.** This ensures the spreadsheet remains dynamic and updateable. ### ❌ WRONG - Hardcoding Calculated Values ```python # Bad: Calculating in Python and hardcoding result total = df['Sales'].sum() sheet['B10'] = total # Hardcodes 5000 # Bad: Computing growth rate in Python growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue'] sheet['C5'] = growth # Hardcodes 0.15 # Bad: Python calculation for average avg = sum(values) / len(values) sheet['D20'] = avg # Hardcodes 42.5 ``` ### ✅ CORRECT - Using Excel Formulas ```python # Good: Let Excel calculate the sum sheet['B10'] = '=SUM(B2:B9)' # Good: Growth rate as Excel formula sheet['C5'] = '=(C4-C2)/C2' # Good: Average using Excel function sheet['D20'] = '=AVERAGE(D2:D19)' ``` This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes. ## Common Workflow 1. **Choose tool**: pandas for data, openpyxl for formulas/formatting 2. **Create/Load**: Create new workbook or load existing file 3. **Modify**: Add/edit data, formulas, and formatting 4. **Save**: Write to file 5. **Recalculate formulas (MANDATORY IF USING FORMULAS)**: Use the recalc.py script ```bash python recalc.py output.xlsx ``` 6. **Verify and fix any errors**: - The script returns JSON with error details - If `status` is `errors_found`, check `error_summary` for specific error types and locations - Fix the identified errors and recalculate again - Common errors to fix: - `#REF!`: Invalid cell references - `#DIV/0!`: Division by zero - `#VALUE!`: Wrong data type in formula - `#NAME?`: Unrecognized formula name ### Creating new Excel files ```python # Using openpyxl for formulas and formatting from openpyxl import Workbook from openpyxl.styles import Font, PatternFill, Alignment wb = Workbook() sheet = wb.active # Add data sheet['A1'] = 'Hello' sheet['B1'] = 'World' sheet.append(['Row', 'of', 'data']) # Add formula sheet['B2'] = '=SUM(A1:A10)' # Formatting sheet['A1'].font = Font(bold=True, color='FF0000') sheet['A1'].fill = PatternFill('solid', start_color='FFFF00') sheet['A1'].alignment = Alignment(horizontal='center') # Column width sheet.column_dimensions['A'].width = 20 wb.save('output.xlsx') ``` ### Editing existing Excel files ```python # Using openpyxl to preserve formulas and formatting from openpyxl import load_workbook # Load existing file wb = load_workbook('existing.xlsx') sheet = wb.active # or wb['SheetName'] for specific sheet # Working with multiple sheets for sheet_name in wb.sheetnames: sheet = wb[sheet_name] print(f"Sheet: {sheet_name}") # Modify cells sheet['A1'] = 'New Value' sheet.insert_rows(2) # Insert row at position 2 sheet.delete_cols(3) # Delete column 3 # Add new sheet new_sheet = wb.create_sheet('NewSheet') new_sheet['A1'] = 'Data' wb.save('modified.xlsx') ``` ## Recalculating formulas Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided `recalc.py` script to recalculate formulas: ```bash python recalc.py [timeout_seconds] ``` Example: ```bash python recalc.py output.xlsx 30 ``` The script: - Automatically sets up LibreOffice macro on first run - Recalculates all formulas in all sheets - Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.) - Returns JSON with detailed error locations and counts - Works on both Linux and macOS ## Formula Verification Checklist Quick checks to ensure formulas work correctly: ### Essential Verification - [ ] **Test 2-3 sample references**: Verify they pull correct values before building full model - [ ] **Column mapping**: Confirm Excel columns match (e.g., column 64 = BL, not BK) - [ ] **Row offset**: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6) ### Common Pitfalls - [ ] **NaN handling**: Check for null values with `pd.notna()` - [ ] **Far-right columns**: FY data often in columns 50+ - [ ] **Multiple matches**: Search all occurrences, not just first - [ ] **Division by zero**: Check denominators before using `/` in formulas (#DIV/0!) - [ ] **Wrong references**: Verify all cell references point to intended cells (#REF!) - [ ] **Cross-sheet references**: Use correct format (Sheet1!A1) for linking sheets ### Formula Testing Strategy - [ ] **Start small**: Test formulas on 2-3 cells before applying broadly - [ ] **Verify dependencies**: Check all cells referenced in formulas exist - [ ] **Test edge cases**: Include zero, negative, and very large values ### Interpreting recalc.py Output The script returns JSON with error details: ```json { "status": "success", // or "errors_found" "total_errors": 0, // Total error count "total_formulas": 42, // Number of formulas in file "error_summary": { // Only present if errors found "#REF!": { "count": 2, "locations": ["Sheet1!B5", "Sheet1!C10"] } } } ``` ## Best Practices ### Library Selection - **pandas**: Best for data analysis, bulk operations, and simple data export - **openpyxl**: Best for complex formatting, formulas, and Excel-specific features ### Working with openpyxl - Cell indices are 1-based (row=1, column=1 refers to cell A1) - Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)` - **Warning**: If opened with `data_only=True` and saved, formulas are replaced with values and permanently lost - For large files: Use `read_only=True` for reading or `write_only=True` for writing - Formulas are preserved but not evaluated - use recalc.py to update values ### Working with pandas - Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})` - For large files, read specific columns: `pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])` - Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])` ## Code Style Guidelines **IMPORTANT**: When generating Python code for Excel operations: - Write minimal, concise Python code without unnecessary comments - Avoid verbose variable names and redundant operations - Avoid unnecessary print statements **For Excel files themselves**: - Add comments to cells with complex formulas or important assumptions - Document data sources for hardcoded values - Include notes for key calculations and model sections ## Data Analysis Patterns ### Reading Multiple Sheets Process all sheets efficiently with ExcelFile: ```python import pandas as pd excel_file = pd.ExcelFile("workbook.xlsx") for sheet_name in excel_file.sheet_names: df = pd.read_excel(excel_file, sheet_name=sheet_name) print(f"{sheet_name}: {len(df)} rows") ``` ### Pivot Tables ```python import pandas as pd df = pd.read_excel("sales_data.xlsx") pivot = pd.pivot_table( df, values="sales", index="region", columns="product", aggfunc="sum", fill_value=0 ) pivot.to_excel("pivot_report.xlsx") ``` ### Group By and Aggregate ```python df = pd.read_excel("sales.xlsx") # Group and sum sales_by_region = df.groupby("region")["sales"].sum() # Multiple aggregations summary = df.groupby("region").agg({ "sales": "sum", "quantity": "mean", "profit": ["min", "max"] }) ``` ### Filtering ```python # Simple filter high_sales = df[df["sales"] > 10000] # Multiple conditions filtered = df[(df["region"] == "West") & (df["sales"] > 5000)] # Calculate new columns df["profit_margin"] = (df["revenue"] - df["cost"]) / df["revenue"] # Sort df_sorted = df.sort_values("sales", ascending=False) ``` ## Data Cleaning ```python import pandas as pd df = pd.read_excel("messy_data.xlsx") # Remove duplicates df = df.drop_duplicates() # Handle missing values df = df.fillna(0) # Fill with value df = df.dropna() # Drop rows with missing values df = df.dropna(subset=["important_col"]) # Drop only if specific column is null # Remove whitespace from strings df["name"] = df["name"].str.strip() # Convert data types df["date"] = pd.to_datetime(df["date"]) df["amount"] = pd.to_numeric(df["amount"], errors="coerce") # Save cleaned data df.to_excel("cleaned_data.xlsx", index=False) ``` ## Merging and Joining ```python import pandas as pd # Concatenate files vertically (stack rows) df1 = pd.read_excel("sales_q1.xlsx") df2 = pd.read_excel("sales_q2.xlsx") combined = pd.concat([df1, df2], ignore_index=True) # Merge on common column (like SQL JOIN) customers = pd.read_excel("customers.xlsx") sales = pd.read_excel("sales.xlsx") merged = pd.merge(sales, customers, on="customer_id", how="left") merged.to_excel("merged_data.xlsx", index=False) ``` ## Charts and Visualization Generate charts from Excel data using matplotlib: ```python import pandas as pd import matplotlib.pyplot as plt df = pd.read_excel("data.xlsx") # Bar chart df.plot(x="category", y="value", kind="bar") plt.title("Sales by Category") plt.xlabel("Category") plt.ylabel("Sales") plt.tight_layout() plt.savefig("bar_chart.png") plt.close() # Pie chart df.set_index("category")["value"].plot(kind="pie", autopct="%1.1f%%") plt.title("Market Share") plt.ylabel("") plt.savefig("pie_chart.png") plt.close() # Line chart df.plot(x="date", y="revenue", kind="line") plt.savefig("trend.png") plt.close() ``` ## Conditional Formatting Apply formatting programmatically based on cell values: ```python import pandas as pd from openpyxl import load_workbook from openpyxl.styles import PatternFill, Font df = pd.DataFrame({ "Product": ["A", "B", "C"], "Sales": [100, 200, 150] }) df.to_excel("formatted.xlsx", index=False) wb = load_workbook("formatted.xlsx") ws = wb.active # Define fills red_fill = PatternFill(start_color="FF0000", end_color="FF0000", fill_type="solid") green_fill = PatternFill(start_color="00FF00", end_color="00FF00", fill_type="solid") # Apply conditional formatting for row in range(2, len(df) + 2): cell = ws[f"B{row}"] if cell.value < 150: cell.fill = red_fill else: cell.fill = green_fill # Bold headers for cell in ws[1]: cell.font = Font(bold=True) wb.save("formatted.xlsx") ``` ## Performance Tips For large Excel files: ```python import pandas as pd # Read only specific columns df = pd.read_excel("large.xlsx", usecols=["A", "C", "E"]) # Read in chunks for very large files for chunk in pd.read_excel("huge.xlsx", chunksize=10000): # Process each chunk process(chunk) # Specify dtypes to avoid inference overhead df = pd.read_excel("data.xlsx", dtype={"id": str, "amount": float}) # For openpyxl with large files from openpyxl import load_workbook wb = load_workbook("large.xlsx", read_only=True) # Read-only mode ``` ## Utilities ### Auto-Adjust Column Widths ```python import pandas as pd df = pd.DataFrame({"Product": ["Widget A", "Widget B"], "Sales": [100, 200]}) writer = pd.ExcelWriter("output.xlsx", engine="openpyxl") df.to_excel(writer, sheet_name="Sales", index=False) worksheet = writer.sheets["Sales"] for column in worksheet.columns: max_length = 0 column_letter = column[0].column_letter for cell in column: try: if len(str(cell.value)) > max_length: max_length = len(str(cell.value)) except: pass worksheet.column_dimensions[column_letter].width = max_length + 2 writer.close() ```