---
name: ai-data-analyst
description: Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Use when you need to analyze datasets, perform statistical tests, create visualizations, or build predictive models with reproducible, code-based workflows.
---

# Skill: AI data analyst

## Purpose

Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Generate publication-quality charts, statistical reports, and actionable insights from data files or databases.

## When to use this skill

- You need to **analyze datasets** to understand patterns, trends, or relationships.
- You want to perform **statistical tests** or build predictive models.
- You need **data visualizations** (charts, graphs, dashboards) to communicate findings.
- You're doing **exploratory data analysis** (EDA) to understand data structure and quality.
- You need to **clean, transform, or merge** datasets for analysis.
- You want **reproducible analysis** with documented methodology and code.
- You are performing **Convex Backend Engineering** (schema design, query optimization, log analysis).

## Key capabilities

Unlike point-solution data analysis tools:

- **Convex Engineering Integration**: Native support for Convex MCP tools (`mcp_convex`) and CLI.
- **Full Python ecosystem**: Access to pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn, plotly, and more.
- **Runs locally**: Your data stays on your machine; no uploads to third-party services.
- **Reproducible**: All analysis is code-based and version controllable.
- **Customizable**: Extend with any Python library or custom analysis logic.
- **Publication-quality output**: Generate professional charts and reports.
- **Statistical rigor**: Access to comprehensive statistical and ML libraries.

## Inputs

- **Data sources**: CSV files, Excel files, JSON, Parquet, or database connections.
- **Analysis goals**: Questions to answer or hypotheses to test.
- **Variables of interest**: Specific columns, metrics, or dimensions to focus on.
- **Output preferences**: Chart types, report format, statistical tests needed.
- **Context**: Business domain, data dictionary, or known data quality issues.

## Out of scope

- Real-time streaming data analysis (use appropriate streaming tools).
- Extremely large datasets requiring distributed computing (use Spark/Dask instead).
- Production ML model deployment (use ML ops tools and infrastructure).
- Live dashboarding (use BI tools like Tableau/Looker for operational dashboards).

## Conventions and best practices

### Python environment

- Use **virtual environments** to isolate dependencies.
- Install only necessary packages for the specific analysis.
- Document all dependencies in `requirements.txt` or `environment.yml`.

### Code structure

- Write **self-contained scripts** that can be re-run by others.
- Use **clear variable names** and add comments for complex logic.
- **Separate concerns**: data loading, cleaning, analysis, visualization.
- Save **intermediate results** to files when analysis is multi-stage.

### Data handling

- **Never modify source data files** – work on copies or in-memory dataframes.
- **Document data transformations** clearly in code comments.
- **Handle missing values** explicitly and document the approach.
- **Validate data quality** before analysis (check for nulls, outliers, duplicates), as shown in the sketch below.
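To make the data-quality convention concrete, here is a minimal sketch of an upfront validation pass using pandas. The file name `data.csv` is illustrative; the same checks apply to any tabular source.

```python
import pandas as pd

# Work on an in-memory copy; the source file itself is never modified
df = pd.read_csv("data.csv")

# Missing values per column
missing = df.isnull().sum()
print("Missing values:\n", missing[missing > 0])

# Exact duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")

# Simple IQR-based outlier count for each numeric column
for col in df.select_dtypes(include="number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    print(f"{col}: {mask.sum()} potential outliers (IQR rule)")
```

Record whatever this pass finds in the analysis report's data-quality section rather than silently fixing it in code.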
### Visualization best practices

- Choose **appropriate chart types** for the data and question.
- Use **clear labels, titles, and legends** on all charts.
- Apply **appropriate color schemes** (colorblind-friendly when possible).
- Include **sample sizes and confidence intervals** where relevant.
- Save visualizations in **high-resolution formats** (PNG at 300 DPI, SVG for vector graphics).

### Statistical analysis

- **State assumptions** for statistical tests clearly.
- **Check assumptions** before applying tests (normality, homoscedasticity, etc.).
- **Report effect sizes**, not just p-values.
- **Use appropriate corrections** for multiple comparisons (see the sketch after the implementation checklist).
- **Explain practical significance** in addition to statistical significance.

## Required behavior

1. **Understand the question**: Clarify what insights or decisions the analysis should support.
2. **Explore the data**: Check structure, types, missing values, distributions, outliers.
3. **Clean and prepare**: Handle missing data, outliers, and transformations appropriately.
4. **Analyze systematically**: Apply appropriate statistical methods or ML techniques.
5. **Visualize effectively**: Create clear, informative charts that answer the question.
6. **Generate insights**: Translate statistical findings into actionable business insights.
7. **Document thoroughly**: Explain methodology, assumptions, limitations, and conclusions.
8. **Make reproducible**: Ensure others can re-run the analysis and get the same results.

## Required artifacts

- **Analysis script(s)**: Well-documented Python code performing the analysis.
- **Visualizations**: Charts saved as high-quality image files (PNG/SVG).
- **Analysis report**: Markdown or text document summarizing:
  - Research question and methodology
  - Data description and quality assessment
  - Key findings with supporting statistics
  - Visualizations with interpretations
  - Limitations and caveats
  - Recommendations or next steps
- **Requirements file**: `requirements.txt` with all dependencies.
- **Sample data** (if appropriate and non-sensitive): Small sample for reproducibility.

## Implementation checklist

### 1. Data exploration and preparation

- [ ] Load data and inspect structure (shape, columns, types)
- [ ] Check for missing values, duplicates, outliers
- [ ] Generate summary statistics (mean, median, std, min, max)
- [ ] Visualize distributions of key variables
- [ ] Document data quality issues found

### 2. Data cleaning and transformation

- [ ] Handle missing values (impute, drop, or flag)
- [ ] Address outliers if needed (cap, transform, or document)
- [ ] Create derived variables if needed
- [ ] Normalize or scale variables for modeling
- [ ] Split data if doing train/test analysis

### 3. Analysis execution

- [ ] Choose appropriate analytical methods
- [ ] Check statistical assumptions
- [ ] Execute analysis with proper parameters
- [ ] Calculate confidence intervals and effect sizes
- [ ] Perform sensitivity analyses if appropriate

### 4. Visualization

- [ ] Create exploratory visualizations
- [ ] Generate publication-quality final charts
- [ ] Ensure all charts have clear labels and titles
- [ ] Use appropriate color schemes and styling
- [ ] Save in high-resolution formats

### 5. Reporting

- [ ] Write clear summary of methods used
- [ ] Present key findings with supporting evidence
- [ ] Explain practical significance of results
- [ ] Document limitations and assumptions
- [ ] Provide actionable recommendations

### 6. Reproducibility

- [ ] Test that script runs from clean environment
- [ ] Document all dependencies
- [ ] Add comments explaining non-obvious code
- [ ] Include instructions for running analysis
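The statistical-analysis conventions above call for correcting p-values when many comparisons are run. A minimal sketch using `statsmodels`; the p-values here are illustrative placeholders for the results of your own tests:

```python
from statsmodels.stats.multitest import multipletests

# p-values from several independent hypothesis tests (illustrative values)
raw_pvalues = [0.001, 0.012, 0.034, 0.21, 0.047]

# Benjamini-Hochberg false discovery rate correction
reject, corrected, _, _ = multipletests(raw_pvalues, alpha=0.05, method="fdr_bh")

for p, p_adj, significant in zip(raw_pvalues, corrected, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={significant}")
```

Report both raw and adjusted p-values alongside effect sizes so readers can judge practical significance; `method="bonferroni"` is the more conservative alternative when false positives are especially costly.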
## Convex Engineering Workflow

When working with Convex (backend, database, schemas), you **MUST** follow this specialized workflow:

### 1. Protocols & Rules

- **READ FIRST**: Always read `resources/convex_rules.md` before writing any Convex code.
  - Command: `view_file(AbsolutePath=".../resources/convex_rules.md")`
- **MCP Integration**: Use `mcp_convex` tools to inspect CURRENT state before proposing changes.
  - `mcp_convex_tables`: Check table schemas.
  - `mcp_convex_functionSpec`: Check existing functions.
  - `mcp_convex_logs`: Analyze recent failures.

### 2. Implementation & Fix

- **CLI first**: Use `bunx convex` for all operations.
  - DO NOT use generic SQL or other DB commands.
  - Example: `bunx convex run serena/actions:doSomething`
- **Log analysis**:
  - When debugging, pull logs via `bunx convex logs --prod --failure` or `mcp_convex_logs`.
  - Analyze stack traces with Python scripts if plain text inspection is insufficient (see the log-parsing sketch further below).

### 3. Code Generation

- **Schema**: Define in `convex/schema.ts` using `defineSchema` and `defineTable`.
- **Functions**: Use `query`, `mutation`, `action` from `_generated/server`.
- **Validation**: Ensure `args` and `returns` validators (e.g., `v.string()`, `v.id()`) are strictly typed.

## Verification

Run the following to verify the analysis:

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Run analysis script
python analysis.py

# Check outputs generated
ls -lh outputs/
```

The skill is complete when:

- Analysis script runs without errors from a clean environment.
- All required visualizations are generated in high quality.
- Report clearly explains methodology, findings, and limitations.
- Results are interpretable and actionable.
- Code is well-documented and reproducible.
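Referring back to the log-analysis step in the Convex workflow above: when the raw output of `bunx convex logs --prod --failure` is too noisy to read directly, a short Python pass can summarize it. This is a minimal sketch that assumes the logs have been saved to a local text file (`failures.log`) and that failing lines mention a function path; the exact log format depends on your deployment, so adjust the pattern accordingly.

```python
import re
from collections import Counter

# Assumes the failing log lines were saved with:
#   bunx convex logs --prod --failure > failures.log
# The pattern is illustrative (e.g. "serena/actions:doSomething"); adapt it to the real format.
FUNCTION_PATTERN = re.compile(r"[\w/]+:\w+")

counts = Counter()
with open("failures.log", encoding="utf-8") as f:
    for line in f:
        match = FUNCTION_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1

# Most frequently failing functions first
for function_path, n in counts.most_common(10):
    print(f"{n:4d}  {function_path}")
```

The same approach extends to grouping by error message or loading the counts into pandas to plot failure rates over time.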
## Common analysis patterns

### Exploratory Data Analysis (EDA)

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and inspect data
df = pd.read_csv('data.csv')
df.info()
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Visualize distributions
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.savefig('distributions.png', dpi=300)

# Check correlations (numeric columns only)
plt.figure(figsize=(10, 8))
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.savefig('correlations.png', dpi=300)
```

### Time series analysis

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load time series data
df = pd.read_csv('timeseries.csv', parse_dates=['date'])
df.set_index('date', inplace=True)

# Decompose time series
decomposition = seasonal_decompose(df['value'], model='additive', period=30)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.savefig('decomposition.png', dpi=300)

# Calculate rolling statistics
df['rolling_mean'] = df['value'].rolling(window=7).mean()
df['rolling_std'] = df['value'].rolling(window=7).std()

# Plot with trends
plt.figure(figsize=(12, 6))
plt.plot(df['value'], label='Original')
plt.plot(df['rolling_mean'], label='7-day Moving Avg', linewidth=2)
plt.fill_between(df.index,
                 df['rolling_mean'] - df['rolling_std'],
                 df['rolling_mean'] + df['rolling_std'],
                 alpha=0.3)
plt.legend()
plt.savefig('trends.png', dpi=300)
```

### Statistical hypothesis testing

```python
from scipy import stats
import numpy as np

# Compare two groups
group_a = df[df['group'] == 'A']['metric']
group_b = df[df['group'] == 'B']['metric']

# Check normality
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Choose appropriate test
if p_norm_a > 0.05 and p_norm_b > 0.05:
    # Parametric test (t-test)
    statistic, p_value = stats.ttest_ind(group_a, group_b)
    test_used = "Independent t-test"
else:
    # Non-parametric test (Mann-Whitney U)
    statistic, p_value = stats.mannwhitneyu(group_a, group_b)
    test_used = "Mann-Whitney U test"

# Calculate effect size (Cohen's d)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)
cohens_d = (group_a.mean() - group_b.mean()) / pooled_std

print(f"Test used: {test_used}")
print(f"Test statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Effect size (Cohen's d): {cohens_d:.4f}")
```

### Predictive modeling

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

# Feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importance['feature'][:10], importance['importance'][:10])
plt.xlabel('Feature Importance')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
```
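The predictive-modeling pattern above relies on a single train/test split, which can be noisy on smaller datasets. A minimal sketch of k-fold cross-validation for the same model, assuming the `X`, `y`, and `RandomForestRegressor` setup from the previous block:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# X and y as prepared in the predictive modeling example above
model = RandomForestRegressor(n_estimators=100, random_state=42)

# 5-fold cross-validated R² scores on the full dataset
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f"R² per fold: {np.round(scores, 4)}")
print(f"Mean R²: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Reporting the spread across folds alongside the held-out metrics gives a more honest picture of how the model is likely to generalize.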
## Recommended Python libraries

### Data manipulation

- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computing
- **polars**: High-performance DataFrame library (alternative to pandas)

### Visualization

- **matplotlib**: Foundational plotting library
- **seaborn**: Statistical visualizations
- **plotly**: Interactive charts
- **altair**: Declarative statistical visualization

### Statistical analysis

- **scipy.stats**: Statistical functions and tests
- **statsmodels**: Statistical modeling
- **pingouin**: Statistical tests with clear output

### Machine learning

- **scikit-learn**: ML algorithms and tools
- **xgboost**: Gradient boosting
- **lightgbm**: Fast gradient boosting

### Time series

- **statsmodels.tsa**: Time series analysis
- **prophet**: Forecasting tool
- **pmdarima**: Auto ARIMA

### Specialized

- **networkx**: Network analysis
- **geopandas**: Geospatial data analysis
- **textblob** / **spacy**: Natural language processing

## Safety and escalation

- **Data privacy**: Never analyze or share data containing PII without proper authorization.
- **Statistical validity**: If sample sizes are too small for reliable inference, call this out explicitly.
- **Causal claims**: Avoid implying causation from correlational analysis; be explicit about limitations.
- **Model limitations**: Document when models may not generalize or when predictions should not be trusted.
- **Data quality**: If data quality issues could materially affect conclusions, flag this prominently.

## Integration with other skills

This skill can be combined with:

- **Internal data querying**: To fetch data from warehouses or databases for analysis.
- **Web app builder**: To create interactive dashboards displaying analysis results.
- **Internal tools**: To build analysis tools for non-technical stakeholders.