--- name: "pandas-construction-analysis" description: "Comprehensive Pandas toolkit for construction data analysis. Filter, group, aggregate BIM elements, calculate quantities, merge datasets, and generate reports from structured construction data." homepage: "https://datadrivenconstruction.io" metadata: {"openclaw": {"emoji": "🐼", "os": ["darwin", "linux", "win32"], "homepage": "https://datadrivenconstruction.io", "requires": {"bins": ["python3"]}}} --- # Pandas Construction Data Analysis ## Overview Based on DDC methodology (Chapter 2.3), this skill provides comprehensive Pandas operations for construction data processing. Pandas is the Swiss Army knife for data analysts - handling everything from simple data filtering to complex aggregations across millions of rows. **Book Reference:** "Pandas DataFrame и LLM ChatGPT" / "Pandas DataFrame and LLM ChatGPT" > "Используя Pandas, вы можете управлять и анализировать наборы данных, намного превосходящие возможности Excel. В то время как Excel способен обрабатывать до 1 миллиона строк данных, Pandas может без труда работать с наборами данных, содержащими десятки миллионов строк." > — DDC Book, Chapter 2.3 ## Quick Start ```python import pandas as pd # Read construction data df = pd.read_excel("bim_export.xlsx") # Basic operations print(df.head()) # First 5 rows print(df.info()) # Column types and memory print(df.describe()) # Statistics for numeric columns # Filter structural elements structural = df[df['Category'] == 'Structural'] # Calculate total volume total_volume = df['Volume'].sum() print(f"Total volume: {total_volume:.2f} m³") ``` ## DataFrame Fundamentals ### Creating DataFrames ```python import pandas as pd # From dictionary (construction elements) elements = pd.DataFrame({ 'ElementId': ['E001', 'E002', 'E003', 'E004'], 'Category': ['Wall', 'Floor', 'Wall', 'Column'], 'Material': ['Concrete', 'Concrete', 'Brick', 'Steel'], 'Volume_m3': [45.5, 120.0, 32.0, 8.5], 'Level': ['Level 1', 'Level 1', 'Level 2', 'Level 1'] }) # From CSV df_csv = pd.read_csv("construction_data.csv") # From Excel df_excel = pd.read_excel("project_data.xlsx", sheet_name="Elements") # From multiple Excel sheets all_sheets = pd.read_excel("project.xlsx", sheet_name=None) # Dict of DataFrames ``` ### Data Types in Construction ```python # Common data types for construction df = pd.DataFrame({ 'element_id': pd.Series(['W001', 'W002'], dtype='string'), 'quantity': pd.Series([10, 20], dtype='int64'), 'volume': pd.Series([45.5, 32.0], dtype='float64'), 'is_structural': pd.Series([True, False], dtype='bool'), 'created_date': pd.to_datetime(['2024-01-15', '2024-01-16']), 'category': pd.Categorical(['Wall', 'Slab']) }) # Check data types print(df.dtypes) # Convert types df['quantity'] = df['quantity'].astype('float64') df['volume'] = pd.to_numeric(df['volume'], errors='coerce') ``` ## Filtering and Selection ### Basic Filtering ```python # Single condition walls = df[df['Category'] == 'Wall'] # Multiple conditions (AND) large_concrete = df[(df['Material'] == 'Concrete') & (df['Volume_m3'] > 50)] # Multiple conditions (OR) walls_or_floors = df[(df['Category'] == 'Wall') | (df['Category'] == 'Floor')] # Using isin for multiple values structural = df[df['Category'].isin(['Wall', 'Column', 'Beam', 'Foundation'])] # String contains insulated = df[df['Description'].str.contains('insulated', case=False, na=False)] # Null value filtering incomplete = df[df['Cost'].isna()] complete = df[df['Cost'].notna()] ``` ### Advanced Selection ```python # Select columns volumes = df[['ElementId', 'Category', 'Volume_m3']] # Query syntax (SQL-like) result = df.query("Category == 'Wall' and Volume_m3 > 30") # Loc and iloc specific_row = df.loc[0] # By label range_rows = df.iloc[0:10] # By position specific_cell = df.loc[0, 'Volume_m3'] # Row and column subset = df.loc[0:5, ['Category', 'Volume_m3']] # Range with columns ``` ## Grouping and Aggregation ### GroupBy Operations ```python # Basic groupby by_category = df.groupby('Category')['Volume_m3'].sum() # Multiple aggregations summary = df.groupby('Category').agg({ 'Volume_m3': ['sum', 'mean', 'count'], 'Cost': ['sum', 'mean'] }) # Named aggregations (cleaner output) summary = df.groupby('Category').agg( total_volume=('Volume_m3', 'sum'), avg_volume=('Volume_m3', 'mean'), element_count=('ElementId', 'count'), total_cost=('Cost', 'sum') ).reset_index() # Multiple grouping columns by_level_cat = df.groupby(['Level', 'Category']).agg({ 'Volume_m3': 'sum', 'Cost': 'sum' }).reset_index() ``` ### Pivot Tables ```python # Create pivot table pivot = pd.pivot_table( df, values='Volume_m3', index='Level', columns='Category', aggfunc='sum', fill_value=0, margins=True, # Add totals margins_name='Total' ) # Multiple values pivot_detailed = pd.pivot_table( df, values=['Volume_m3', 'Cost'], index='Level', columns='Category', aggfunc={'Volume_m3': 'sum', 'Cost': 'mean'} ) ``` ## Data Transformation ### Adding Calculated Columns ```python # Simple calculation df['Cost_Total'] = df['Volume_m3'] * df['Unit_Price'] # Conditional column df['Size_Category'] = df['Volume_m3'].apply( lambda x: 'Large' if x > 50 else ('Medium' if x > 20 else 'Small') ) # Using np.where for binary conditions import numpy as np df['Is_Large'] = np.where(df['Volume_m3'] > 50, True, False) # Using cut for binning df['Volume_Bin'] = pd.cut( df['Volume_m3'], bins=[0, 10, 50, 100, float('inf')], labels=['XS', 'S', 'M', 'L'] ) ``` ### String Operations ```python # Extract from strings df['Level_Number'] = df['Level'].str.extract(r'(\d+)').astype(int) # Split and expand df[['Building', 'Floor']] = df['Location'].str.split('-', expand=True) # Clean strings df['Category'] = df['Category'].str.strip().str.lower().str.title() # Replace values df['Material'] = df['Material'].str.replace('Reinforced Concrete', 'RC') ``` ### Date Operations ```python # Parse dates df['Start_Date'] = pd.to_datetime(df['Start_Date']) # Extract components df['Year'] = df['Start_Date'].dt.year df['Month'] = df['Start_Date'].dt.month df['Week'] = df['Start_Date'].dt.isocalendar().week df['DayOfWeek'] = df['Start_Date'].dt.day_name() # Calculate duration df['Duration_Days'] = (df['End_Date'] - df['Start_Date']).dt.days # Filter by date range recent = df[df['Start_Date'] >= '2024-01-01'] ``` ## Merging and Joining ### Merge DataFrames ```python # Elements data elements = pd.DataFrame({ 'ElementId': ['E001', 'E002', 'E003'], 'Category': ['Wall', 'Floor', 'Column'], 'Volume_m3': [45.5, 120.0, 8.5] }) # Unit prices prices = pd.DataFrame({ 'Category': ['Wall', 'Floor', 'Column', 'Beam'], 'Unit_Price': [150, 80, 450, 200] }) # Inner join (only matching) merged = elements.merge(prices, on='Category', how='inner') # Left join (keep all elements) merged = elements.merge(prices, on='Category', how='left') # Join on different column names result = df1.merge(df2, left_on='elem_id', right_on='ElementId') ``` ### Concatenating DataFrames ```python # Vertical concatenation (stacking) all_floors = pd.concat([floor1_df, floor2_df, floor3_df], ignore_index=True) # Horizontal concatenation combined = pd.concat([quantities, costs, schedule], axis=1) # Append new rows new_elements = pd.DataFrame({'ElementId': ['E004'], 'Category': ['Beam']}) df = pd.concat([df, new_elements], ignore_index=True) ``` ## Construction-Specific Analyses ### Quantity Take-Off (QTO) ```python def generate_qto_report(df): """Generate Quantity Take-Off summary by category""" qto = df.groupby(['Category', 'Material']).agg( count=('ElementId', 'count'), total_volume=('Volume_m3', 'sum'), total_area=('Area_m2', 'sum'), avg_volume=('Volume_m3', 'mean') ).round(2) # Add percentage column qto['volume_pct'] = (qto['total_volume'] / qto['total_volume'].sum() * 100).round(1) return qto.sort_values('total_volume', ascending=False) # Usage qto_report = generate_qto_report(df) qto_report.to_excel("qto_report.xlsx") ``` ### Cost Estimation ```python def calculate_project_cost(elements_df, prices_df, markup=0.15): """Calculate total project cost with markup""" # Merge with prices df = elements_df.merge(prices_df, on='Category', how='left') # Calculate base cost df['Base_Cost'] = df['Volume_m3'] * df['Unit_Price'] # Apply markup df['Total_Cost'] = df['Base_Cost'] * (1 + markup) # Summary by category summary = df.groupby('Category').agg( volume=('Volume_m3', 'sum'), base_cost=('Base_Cost', 'sum'), total_cost=('Total_Cost', 'sum') ).round(2) return df, summary, summary['total_cost'].sum() # Usage detailed, summary, total = calculate_project_cost(elements, prices) print(f"Project Total: ${total:,.2f}") ``` ### Material Summary ```python def material_summary(df): """Summarize materials across project""" summary = df.groupby('Material').agg({ 'Volume_m3': 'sum', 'Weight_kg': 'sum', 'ElementId': 'nunique' }).rename(columns={'ElementId': 'Element_Count'}) summary['Volume_Pct'] = (summary['Volume_m3'] / summary['Volume_m3'].sum() * 100).round(1) return summary.sort_values('Volume_m3', ascending=False) ``` ### Level-by-Level Analysis ```python def analyze_by_level(df): """Analyze construction quantities by building level""" level_summary = df.pivot_table( values=['Volume_m3', 'Cost'], index='Level', columns='Category', aggfunc='sum', fill_value=0 ) level_summary['Total_Volume'] = level_summary['Volume_m3'].sum(axis=1) level_summary['Total_Cost'] = level_summary['Cost'].sum(axis=1) return level_summary ``` ## Data Export ### Export to Excel with Multiple Sheets ```python def export_to_excel_formatted(df, summary, filepath): """Export with multiple sheets""" with pd.ExcelWriter(filepath, engine='openpyxl') as writer: df.to_excel(writer, sheet_name='Details', index=False) summary.to_excel(writer, sheet_name='Summary') pivot = pd.pivot_table(df, values='Volume_m3', index='Level', columns='Category') pivot.to_excel(writer, sheet_name='By_Level') # Usage export_to_excel_formatted(elements, qto_summary, "project_report.xlsx") ``` ### Export to CSV ```python # Basic export df.to_csv("output.csv", index=False) # With encoding for special characters df.to_csv("output.csv", index=False, encoding='utf-8-sig') # Specific columns df[['ElementId', 'Category', 'Volume_m3']].to_csv("volumes.csv", index=False) ``` ## Performance Tips ```python # Use categories for string columns with few unique values df['Category'] = df['Category'].astype('category') # Read only needed columns df = pd.read_csv("large_file.csv", usecols=['ElementId', 'Category', 'Volume']) # Use chunking for very large files chunks = pd.read_csv("huge_file.csv", chunksize=100000) result = pd.concat([chunk[chunk['Category'] == 'Wall'] for chunk in chunks]) # Check memory usage print(df.memory_usage(deep=True).sum() / 1024**2, "MB") ``` ## Quick Reference | Operation | Code | |-----------|------| | Read Excel | `pd.read_excel("file.xlsx")` | | Read CSV | `pd.read_csv("file.csv")` | | Filter rows | `df[df['Column'] == 'Value']` | | Select columns | `df[['Col1', 'Col2']]` | | Group and sum | `df.groupby('Cat')['Vol'].sum()` | | Pivot table | `pd.pivot_table(df, values='Vol', index='Level')` | | Merge | `df1.merge(df2, on='key')` | | Add column | `df['New'] = df['A'] * df['B']` | | Export Excel | `df.to_excel("out.xlsx", index=False)` | ## Resources - **Book**: "Data-Driven Construction" by Artem Boiko, Chapter 2.3 - **Website**: https://datadrivenconstruction.io - **Pandas Docs**: https://pandas.pydata.org/docs/ ## Next Steps - See `llm-data-automation` for generating Pandas code with AI - See `qto-report` for specialized QTO calculations - See `cost-estimation-resource` for detailed cost calculations