---
name: synthdata-compute
description: >
  Compute derived, aggregated, or transformed tables from existing datasets.
  Use this skill when the user needs to "compute monthly scores", "aggregate by month",
  "create a summary table", "derive risk scores", "compute percentile ranks",
  "roll up events", "create benchmarks from raw data", "add a computed column",
  or bridge the gap between raw generated tables and downstream analytics.
  Works on xlsx, csv, or json input. Claude writes the computation logic;
  the script handles data I/O.
version: 0.1.0
allowed-tools: Read Bash Glob Write
---

# Synthdata Compute

Derive aggregated, scored, or transformed tables from existing datasets. This skill bridges the gap between raw synthetic data (produced by `synthdata-generate`) and the derived metrics that downstream tools expect — monthly rollups, composite scores, percentile ranks, segment summaries.

## Prerequisites

```bash
pip install openpyxl pandas numpy --break-system-packages
```

## Workflow

### Step 1: Identify input and desired output

Ask the user:

- **Input:** What file(s)? (xlsx workbook, csv directory, json files)
- **Computation:** What do you need? (natural language — e.g., "monthly risk scores from event tables")
- **Output:** Where should results go? (default: adds sheets/files alongside input)

### Step 2: Inspect the data

Run the script in inspect mode to understand what we're working with:

```bash
python3 scripts/compute.py --inspect --input data.xlsx
```

This prints each table's name, column names with dtypes, row count, and a 3-row sample. Use this to understand the schema before writing computation logic.

### Step 3: Write the computation

Write a Python file that operates on pre-loaded DataFrames. The script provides:

- `tables` — dict mapping table names to pandas DataFrames (all input tables pre-loaded)
- `pd` — pandas
- `np` — numpy
- `result` — empty dict; assign `{name: DataFrame}` pairs for each output table

Example computation file (`monthly_risk.py`):

```python
import pandas as pd
import numpy as np

users = tables['users']
threats = tables['threat_events']
sims = tables['phishing_sims']
training = tables['training']
dlp = tables['dlp_events']
abuse = tables['abuse_mailbox']

# Extract month from event dates
threats['month'] = pd.to_datetime(threats['event_date']).dt.to_period('M')
sims['month'] = pd.to_datetime(sims['sim_date']).dt.to_period('M') if 'sim_date' in sims else ...

# Group and aggregate per user per month
# ... (Claude writes the full logic based on the user's requirements)

result['monthly_risk'] = monthly_risk_df
```

Save the computation to a `.py` file, then execute it.
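The elided group-and-aggregate step is where the real logic goes, and it depends entirely on the user's requirements. As one possible shape only (this sketch assumes the threat and sim tables carry `event_id`, `sim_id`, and boolean `clicked` columns, which may not match the actual schema), the body of `monthly_risk.py` could continue:

```python
# Sketch only: continues the example above; adjust to the real columns
# reported by --inspect.
threat_monthly = threats.groupby(['user_id', 'month']).agg(
    threat_events=('event_id', 'count'),   # assumed id column
    real_clicks=('clicked', 'sum'),        # assumed boolean flag
).reset_index()

sim_monthly = sims.groupby(['user_id', 'month']).agg(
    sims_sent=('sim_id', 'count'),         # assumed id column
    sim_clicks=('clicked', 'sum'),
).reset_index()

# One row per (user_id, month), with 0 where a user had no events that month
monthly_risk_df = (
    threat_monthly
    .merge(sim_monthly, on=['user_id', 'month'], how='outer')
    .fillna(0)
)
```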
### Step 4: Execute

```bash
python3 scripts/compute.py --input data.xlsx --code monthly_risk.py --output data_with_scores.xlsx
```

**CLI flags:**

| Flag | Description |
|------|-------------|
| `--input` | Source dataset (xlsx/csv-dir/json) |
| `--output` | Output path (default: `<input>_computed.<ext>`) |
| `--inspect` | Print table schemas and exit |
| `--code` | Path to a Python file with computation logic |
| `--expr` | Inline Python expression (for simple one-liners) |
| `--append` | Add computed tables to a copy of the input file (xlsx only) |

**Inline mode** for simple computations:

```bash
python3 scripts/compute.py --input data.xlsx --output scored.xlsx \
  --expr "result['summary'] = tables['events'].groupby('category').size().reset_index(name='count')"
```

### Step 5: Validate

After execution, the script prints:

- Each output table's name, row count, column names
- A 5-row sample of each output table
- Confirmation of output path

Review the output with the user before proceeding.

---

## Common Computation Recipes

Use these patterns as starting points when writing computation logic.

### Monthly event rollup

```python
events = tables['threat_events']
events['month'] = pd.to_datetime(events['event_date']).dt.to_period('M').astype(str)

monthly = events.groupby(['user_id', 'month']).agg(
    threat_events=('event_id', 'count'),
    real_clicks=('clicked', 'sum'),
    credentials_entered=('credentials_entered', 'sum'),
).reset_index()

result['monthly_threats'] = monthly
```

### Composite score from weighted sub-scores

```python
# Weights must sum to 1.0
WEIGHTS = {'threat': 0.35, 'sim': 0.25, 'dlp': 0.20, 'training': 0.10, 'reporting': 0.10}

df['composite'] = (
    df['threat_score'] * WEIGHTS['threat']
    + df['sim_score'] * WEIGHTS['sim']
    + df['dlp_score'] * WEIGHTS['dlp']
    + df['training_score'] * WEIGHTS['training']
    + df['reporting_score'] * WEIGHTS['reporting']
)
```

### Risk tier assignment

```python
def assign_tier(score):
    if score >= 75: return 'Critical'
    if score >= 50: return 'High'
    if score >= 25: return 'Medium'
    return 'Low'

df['risk_tier'] = df['composite_risk_score'].apply(assign_tier)
```

### Percentile rank

```python
df['percentile'] = df['composite_risk_score'].rank(pct=True) * 100
```

### Segment summary (peer benchmarks)

```python
peers = tables['peers_detail']
metrics = ['sim_click_rate', 'training_completion', 'composite_risk_score']

rows = []
for segment_col in ['segment_healthcare', 'segment_geo', 'segment_size']:
    cohort = peers[peers[segment_col] == 1]
    for m in metrics:
        rows.append({
            'segment': segment_col,
            'metric': m,
            'mean': cohort[m].mean(),
            'median': cohort[m].median(),
            'p25': cohort[m].quantile(0.25),
            'p75': cohort[m].quantile(0.75),
        })

result['segment_summaries'] = pd.DataFrame(rows)
```

### Department breakdown

```python
users = tables['users']
risk = tables['monthly_risk']

merged = risk.merge(users[['user_id', 'department']], on='user_id')

dept = merged.groupby('department').agg(
    avg_risk=('composite_risk_score', 'mean'),
    critical_count=('risk_tier', lambda x: (x == 'Critical').sum()),
    user_count=('user_id', 'nunique'),
).sort_values('avg_risk', ascending=False).reset_index()

result['department_risk'] = dept
```

---

## When to use this skill vs others

| Need | Skill |
|------|-------|
| Generate new rows from a schema/template | `synthdata-generate` |
| Extract Excel → JSON | `synthdata-extract` |
| Add rows/columns to existing data | `synthdata-extend` |
| Replace PII with fake values | `synthdata-anonymize` |
| **Aggregate, score, rank, or derive new tables from existing data** | **synthdata-compute** |
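Putting the steps together, a typical session looks like this (file names and the `--expr` one-liner are illustrative, reusing the examples above):

```bash
# Step 2: see which tables and columns the workbook actually contains
python3 scripts/compute.py --inspect --input data.xlsx

# Steps 3-4: write monthly_risk.py, then execute it against the same input
python3 scripts/compute.py --input data.xlsx --code monthly_risk.py --output data_with_scores.xlsx

# Simple one-off aggregations can skip the code file and use --expr instead
python3 scripts/compute.py --input data.xlsx --output scored.xlsx \
  --expr "result['summary'] = tables['events'].groupby('category').size().reset_index(name='count')"
```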