---
name: synthdata-compute
description: >
Compute derived, aggregated, or transformed tables from existing datasets. Use this skill when
the user needs to "compute monthly scores", "aggregate by month", "create a summary table",
"derive risk scores", "compute percentile ranks", "roll up events", "create benchmarks from
raw data", "add a computed column", or bridge the gap between raw generated tables and
downstream analytics. Works on xlsx, csv, or json input. Claude writes the computation logic;
the script handles data I/O.
version: 0.1.0
allowed-tools: Read Bash Glob Write
---
# Synthdata Compute
Derive aggregated, scored, or transformed tables from existing datasets. This skill bridges the
gap between raw synthetic data (produced by `synthdata-generate`) and the derived metrics that
downstream tools expect — monthly rollups, composite scores, percentile ranks, segment summaries.
## Prerequisites
```bash
pip install openpyxl pandas numpy --break-system-packages
```
## Workflow
### Step 1: Identify input and desired output
Ask the user:
- **Input:** What file(s)? (xlsx workbook, csv directory, json files)
- **Computation:** What do you need? (natural language — e.g., "monthly risk scores from event tables")
- **Output:** Where should results go? (default: adds sheets/files alongside input)
### Step 2: Inspect the data
Run the script in inspect mode to understand what we're working with:
```bash
python3 scripts/compute.py --inspect --input data.xlsx
```
This prints each table's name, column names with dtypes, row count, and a 3-row sample. Use this
to understand the schema before writing computation logic.
### Step 3: Write the computation
Write a Python file that operates on pre-loaded DataFrames. The script provides:
- `tables` — dict mapping table names to pandas DataFrames (all input tables pre-loaded)
- `pd` — pandas
- `np` — numpy
- `result` — empty dict; assign `{name: DataFrame}` pairs for each output table
Example computation file (`monthly_risk.py`):
```python
import pandas as pd
import numpy as np
users = tables['users']
threats = tables['threat_events']
sims = tables['phishing_sims']
training = tables['training']
dlp = tables['dlp_events']
abuse = tables['abuse_mailbox']
# Extract month from event dates
threats['month'] = pd.to_datetime(threats['event_date']).dt.to_period('M')
# Guard against missing columns before deriving the month
if 'sim_date' in sims.columns:
    sims['month'] = pd.to_datetime(sims['sim_date']).dt.to_period('M')
# Group and aggregate per user per month
# ... (Claude writes the full logic based on the user's requirements)
result['monthly_risk'] = monthly_risk_df
```
Save the computation to a `.py` file, then execute it.
### Step 4: Execute
```bash
python3 scripts/compute.py --input data.xlsx --code monthly_risk.py --output data_with_scores.xlsx
```
**CLI flags:**
| Flag | Description |
|------|-------------|
| `--input` | Source dataset (xlsx/csv-dir/json) |
| `--output` | Output path (default: input filename with a `_computed` suffix) |
| `--inspect` | Print table schemas and exit |
| `--code` | Path to a Python file with computation logic |
| `--expr` | Inline Python expression (for simple one-liners) |
| `--append` | Add computed tables to a copy of the input file (xlsx only) |
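**Append mode** writes the computed tables into a copy of the input workbook instead of a separate output file (a sketch, assuming `--append` is used in place of `--output`):
```bash
python3 scripts/compute.py --input data.xlsx --code monthly_risk.py --append
```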
**Inline mode** for simple computations:
```bash
python3 scripts/compute.py --input data.xlsx --output scored.xlsx \
--expr "result['summary'] = tables['events'].groupby('category').size().reset_index(name='count')"
```
### Step 5: Validate
After execution, the script prints:
- Each output table's name, row count, column names
- A 5-row sample of each output table
- Confirmation of output path
Review the output with the user before proceeding.
---
## Common Computation Recipes
Use these patterns as starting points when writing computation logic.
### Monthly event rollup
```python
events = tables['threat_events']
events['month'] = pd.to_datetime(events['event_date']).dt.to_period('M').astype(str)
monthly = events.groupby(['user_id', 'month']).agg(
threat_events=('event_id', 'count'),
real_clicks=('clicked', 'sum'),
credentials_entered=('credentials_entered', 'sum'),
).reset_index()
result['monthly_threats'] = monthly
```
### Composite score from weighted sub-scores
```python
# Weights must sum to 1.0
WEIGHTS = {'threat': 0.35, 'sim': 0.25, 'dlp': 0.20, 'training': 0.10, 'reporting': 0.10}
df['composite'] = (
df['threat_score'] * WEIGHTS['threat']
+ df['sim_score'] * WEIGHTS['sim']
+ df['dlp_score'] * WEIGHTS['dlp']
+ df['training_score'] * WEIGHTS['training']
+ df['reporting_score'] * WEIGHTS['reporting']
)
```
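The sub-scores are assumed to already share a common 0-100 scale; a quick guard on the weights catches drift if the weight table is edited later:
```python
# Sanity check: the weighted sum is only a valid composite if the weights total 1.0.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "WEIGHTS must sum to 1.0"
```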
### Risk tier assignment
```python
def assign_tier(score):
if score >= 75: return 'Critical'
if score >= 50: return 'High'
if score >= 25: return 'Medium'
return 'Low'
df['risk_tier'] = df['composite_risk_score'].apply(assign_tier)
```
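An equivalent vectorized form uses `pd.cut` with the same thresholds; the bin edges below mirror the function above:
```python
# Same tiers assigned in one vectorized pass (left-inclusive bins: [25, 50), [50, 75), ...).
df['risk_tier'] = pd.cut(
    df['composite_risk_score'],
    bins=[-np.inf, 25, 50, 75, np.inf],
    labels=['Low', 'Medium', 'High', 'Critical'],
    right=False,
)
```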
### Percentile rank
```python
df['percentile'] = df['composite_risk_score'].rank(pct=True) * 100
```
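To rank within a cohort rather than across the whole population (a sketch, assuming a `department` column is present on the same frame):
```python
# Percentile rank computed within each department instead of company-wide.
df['dept_percentile'] = df.groupby('department')['composite_risk_score'].rank(pct=True) * 100
```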
### Segment summary (peer benchmarks)
```python
peers = tables['peers_detail']
metrics = ['sim_click_rate', 'training_completion', 'composite_risk_score']
rows = []
for segment_col in ['segment_healthcare', 'segment_geo', 'segment_size']:
cohort = peers[peers[segment_col] == 1]
for m in metrics:
rows.append({
'segment': segment_col, 'metric': m,
'mean': cohort[m].mean(), 'median': cohort[m].median(),
'p25': cohort[m].quantile(0.25), 'p75': cohort[m].quantile(0.75),
})
result['segment_summaries'] = pd.DataFrame(rows)
```
### Department breakdown
```python
users = tables['users']
risk = tables['monthly_risk']
merged = risk.merge(users[['user_id', 'department']], on='user_id')
dept = merged.groupby('department').agg(
avg_risk=('composite_risk_score', 'mean'),
critical_count=('risk_tier', lambda x: (x == 'Critical').sum()),
user_count=('user_id', 'nunique'),
).sort_values('avg_risk', ascending=False).reset_index()
result['department_risk'] = dept
```
---
## When to use this skill vs others
| Need | Skill |
|------|-------|
| Generate new rows from a schema/template | `synthdata-generate` |
| Extract Excel → JSON | `synthdata-extract` |
| Add rows/columns to existing data | `synthdata-extend` |
| Replace PII with fake values | `synthdata-anonymize` |
| **Aggregate, score, rank, or derive new tables from existing data** | **`synthdata-compute`** |