---
name: senior-data-scientist
description: Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.
license: MIT + Commons Clause
metadata:
  version: 1.0.0
  author: borghei
  category: engineering
  domain: data-science
  updated: 2026-03-31
tags: [data-science, ml, statistics, experimentation, python, mlops]
---

# Senior Data Scientist

Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.

## Keywords

data-science, machine-learning, statistics, a-b-testing, causal-inference, feature-engineering, mlops, experiment-design, model-deployment, python, scikit-learn, pytorch, tensorflow, spark, airflow

---

## Quick Start

```bash
# Design an experiment with power analysis
python scripts/experiment_designer.py --input data/ --output results/

# Run feature engineering pipeline
python scripts/feature_engineering_pipeline.py --target project/ --analyze

# Evaluate model performance
python scripts/model_evaluation_suite.py --config config.yaml --deploy

# Statistical analysis
python scripts/statistical_analyzer.py --data input.csv --test ttest --output report.json
```

---

## Tools

| Script | Purpose |
|--------|---------|
| `scripts/experiment_designer.py` | A/B test design, power analysis, sample size calculation |
| `scripts/feature_engineering_pipeline.py` | Automated feature generation, correlation analysis, feature selection |
| `scripts/statistical_analyzer.py` | Hypothesis testing, causal inference, regression analysis |
| `scripts/model_evaluation_suite.py` | Model comparison, cross-validation, deployment readiness checks |

---

## Tech Stack

| Category | Tools |
|----------|-------|
| Languages | Python, SQL, R, Scala |
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| Data Processing | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS SageMaker, GCP Vertex AI |
| Experiment Tracking | MLflow, Weights & Biases |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |

---

## Workflow 1: Design and Analyze an A/B Test

1. **Define hypothesis** -- State the null and alternative hypotheses. Identify the primary metric (e.g., conversion rate, revenue per user).
2. **Calculate sample size** -- `python scripts/experiment_designer.py --input data/ --output results/`
   - Specify minimum detectable effect (MDE), significance level (alpha=0.05), and power (0.80).
   - Example: with a 5% baseline conversion rate and a 10% relative-lift MDE, you need roughly 31,000 users per variant (reproduced in the power-analysis sketch after this workflow).
3. **Randomize assignment** -- Use hash-based assignment on user ID for deterministic, reproducible splits (see the assignment sketch after this workflow).
4. **Run experiment** -- Monitor for sample ratio mismatch (SRM) daily. Flag if the observed ratio deviates more than 1% from the expected split.
5. **Analyze results:**

   ```python
   from statsmodels.stats.proportion import proportions_ztest

   # Observed counts from the experiment (example values)
   control_successes, control_total = 1_520, 31_000
   treatment_successes, treatment_total = 1_705, 31_000

   # Observed conversion rates, for reporting
   control_conv = control_successes / control_total
   treatment_conv = treatment_successes / treatment_total

   # Two-proportion z-test (proportions_ztest lives in statsmodels, not scipy)
   z_stat, p_value = proportions_ztest(
       [treatment_successes, control_successes],
       [treatment_total, control_total],
       alternative='two-sided'
   )
   # Reject H0 if p_value < 0.05
   ```

6. **Validate** -- Check for novelty effects, Simpson's paradox across segments, and pre-experiment balance on covariates.
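The per-variant figure in step 2 can be reproduced with statsmodels' power utilities. A minimal sketch, assuming a two-sided two-proportion z-test at alpha=0.05 and power 0.80; this inline snippet is illustrative and independent of `experiment_designer.py`:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for 5% baseline vs. 5.5% (a 10% relative lift)
effect_size = proportion_effectsize(0.055, 0.05)

# Solve for the per-variant sample size (nobs1) at alpha=0.05, power=0.80
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided',
)
print(f"Users per variant: {n_per_variant:,.0f}")  # roughly 31,000
```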
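And for the deterministic assignment in step 3, a minimal sketch using a salted SHA-256 hash; the salt string, bucket count, and 50/50 split are illustrative assumptions:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "exp_checkout_v1") -> str:
    """Deterministically assign a user to control/treatment via a salted hash."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 100 buckets -> percent-level allocation
    return "treatment" if bucket < 50 else "control"

# The same user always lands in the same variant, with no assignment table needed
assert assign_variant("user_42") == assign_variant("user_42")
```

A per-experiment salt keeps assignments independent across concurrent experiments; reusing a salt would correlate variants between tests.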
## Workflow 2: Build a Feature Engineering Pipeline

1. **Profile raw data** -- `python scripts/feature_engineering_pipeline.py --target project/ --analyze`
   - Identify null rates, cardinality, distributions, and data types.
2. **Generate candidate features:**
   - Temporal: day-of-week, hour, recency, frequency, monetary (RFM)
   - Aggregation: rolling means/sums over 7d/30d/90d windows
   - Interaction: ratio features, polynomial combinations
   - Text: TF-IDF, embedding vectors
3. **Select features** -- Remove features with >95% null rate, near-zero variance, or >0.95 pairwise correlation. Use recursive feature elimination or SHAP importance.
4. **Validate** -- Confirm no target leakage (no features derived from post-outcome data). Check train/test distribution alignment.
5. **Register** -- Store features in a feature store with versioning and lineage metadata.

## Workflow 3: Train and Evaluate a Model

1. **Split data** -- Stratified train/validation/test split (70/15/15). For time series, use a temporal split (no future leakage).
2. **Train baseline** -- Start with a simple model (logistic regression, gradient-boosted trees) to establish a benchmark.
3. **Tune hyperparameters** -- Use Optuna or cross-validated grid search. Log all runs to MLflow.
4. **Evaluate on the held-out test set:**

   ```python
   from sklearn.metrics import classification_report, roc_auc_score

   y_pred = model.predict(X_test)
   y_prob = model.predict_proba(X_test)[:, 1]
   print(classification_report(y_test, y_pred))
   print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
   ```

5. **Validate** -- Check calibration (predicted probabilities match observed rates). Evaluate fairness metrics across protected groups. Confirm no overfitting (train-vs-test gap <5%).

## Workflow 4: Deploy a Model to Production

1. **Containerize** -- Package the model with its inference dependencies in Docker:

   ```bash
   docker build -t model-service:v1 .
   ```

2. **Set up serving** -- Deploy behind a REST API with health checks, input validation, and structured error responses.
3. **Configure monitoring** (see the drift-detection sketch after Workflow 5):
   - Input drift: compare incoming feature distributions to the training baseline (KS test, PSI)
   - Output drift: monitor prediction distribution shifts
   - Performance: track latency against P50/P95/P99 targets (<50ms / <100ms / <200ms)
4. **Enable canary deployment** -- Route 5% of traffic to the new model and compare metrics against the baseline for 24-48 hours.
5. **Validate** -- `python scripts/model_evaluation_suite.py --config config.yaml --deploy` confirms serving latency, error rate <0.1%, and that model outputs match the offline evaluation.

## Workflow 5: Perform Causal Inference

1. **Assess assignment mechanism** -- Determine whether treatment was randomized (use experiment analysis) or observational (use the causal methods below).
2. **Select method** based on data structure:
   - **Propensity Score Matching**: when treatment is binary and many covariates are available
   - **Difference-in-Differences**: when pre/post data are available for treatment and control groups
   - **Regression Discontinuity**: when treatment is assigned by a threshold on a running variable
   - **Instrumental Variables**: when unobserved confounding is present but a valid instrument exists
3. **Check assumptions** -- Parallel trends (DiD), overlap/positivity (PSM), continuity (RDD).
4. **Estimate the treatment effect** and compute confidence intervals (see the DiD sketch below).
5. **Validate** -- Run placebo tests (apply the method to the pre-treatment period and expect a null effect). Perform sensitivity analysis for unobserved confounding.
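For step 4, a minimal difference-in-differences sketch using the statsmodels formula API on synthetic panel data; the column names and the simulated +0.5 effect are illustrative assumptions. The coefficient on the `treated:post` interaction is the DiD estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic two-group, two-period panel with a known treatment effect of +0.5
rng = np.random.default_rng(42)
n = 4000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # 1 = treatment group
    "post": rng.integers(0, 2, n),     # 1 = after the intervention
})
df["outcome"] = (
    1.0 + 0.3 * df["treated"] + 0.2 * df["post"]
    + 0.5 * df["treated"] * df["post"]
    + rng.normal(0, 1, n)
)

# The interaction term captures the difference-in-differences estimate
model = smf.ols("outcome ~ treated * post", data=df).fit()
print(model.params["treated:post"])          # point estimate, ~0.5
print(model.conf_int().loc["treated:post"])  # 95% confidence interval
```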
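And for the input-drift monitoring in Workflow 4 step 3, a minimal sketch pairing scipy's two-sample KS test with a hand-rolled PSI; the `psi` helper and the 0.2 alert threshold are common conventions, not part of this skill's scripts:

```python
import numpy as np
from scipy import stats

def psi(baseline: np.ndarray, incoming: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of live traffic vs. the training baseline."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    expected, _ = np.histogram(baseline, bins=edges)
    actual, _ = np.histogram(incoming, bins=edges)
    # Convert counts to proportions; epsilon guards against empty bins
    e = np.clip(expected / expected.sum(), 1e-6, None)
    a = np.clip(actual / actual.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 10_000)  # training-time feature distribution
incoming = rng.normal(0.3, 1.0, 10_000)  # shifted live distribution

ks_stat, ks_p = stats.ks_2samp(baseline, incoming)
# Common rule of thumb: PSI > 0.2 signals actionable drift
print(f"KS p-value: {ks_p:.2e}, PSI: {psi(baseline, incoming):.3f}")
```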
---

## Performance Targets

| Metric | Target |
|--------|--------|
| P50 latency | < 50ms |
| P95 latency | < 100ms |
| P99 latency | < 200ms |
| Throughput | > 1,000 req/s |
| Availability | 99.9% |
| Error rate | < 0.1% |

---

## Common Commands

```bash
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py
```

---

## Reference Documentation

| Document | Path |
|----------|------|
| Statistical Methods | [references/statistical_methods_advanced.md](references/statistical_methods_advanced.md) |
| Experiment Design Frameworks | [references/experiment_design_frameworks.md](references/experiment_design_frameworks.md) |
| Feature Engineering Patterns | [references/feature_engineering_patterns.md](references/feature_engineering_patterns.md) |
| Automation Scripts | `scripts/` directory |

---

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Sample size calculation returns unreasonably large numbers | Minimum detectable effect (MDE) is set too small relative to baseline variance | Increase the MDE to a practically meaningful threshold or accept a longer experiment duration |
| Feature pipeline reports high null rates across all generated features | Source data contains upstream ingestion gaps or schema drift | Validate raw data completeness before running the pipeline; check ETL logs for failed loads |
| Model AUC drops significantly on validation vs. training set | Overfitting due to high-cardinality features or insufficient regularization | Apply stronger regularization, reduce the feature set, or increase training data volume |
| Experiment shows significant results but large confidence intervals | Insufficient sample size or high metric variance | Extend the experiment runtime, increase traffic allocation, or switch to a variance-reduction technique (CUPED) |
| Deployed model latency exceeds P95 targets | Model complexity too high for serving infrastructure, or missing batching | Quantize the model, reduce the input feature count, or enable request batching on the serving layer |
| Feature importance scores are unstable across cross-validation folds | Correlated features cause importance to shift between redundant predictors | Remove highly correlated features (>0.95) before training, or use permutation importance with repeated runs |
| Causal inference estimates show implausible treatment effects | Violation of the parallel-trends assumption (DiD) or poor covariate overlap (PSM) | Run diagnostic tests (placebo checks, overlap histograms) and consider alternative identification strategies |

---

## Success Criteria

- **Model discrimination**: AUC-ROC above 0.85 on the held-out test set for classification tasks
- **Calibration quality**: Brier score below 0.15; predicted probabilities within 5% of observed rates across decile bins (see the calibration sketch below)
- **Feature coverage**: Feature importance analysis accounts for at least 90% of cumulative model importance
- **Experiment power**: All A/B tests designed with statistical power of 0.80 or higher at the specified MDE
- **Deployment readiness**: Model serving latency under 100ms at P95; error rate below 0.1%
- **Reproducibility**: All experiments and training runs logged with full parameter tracking; results reproducible from logged artifacts
- **Data quality**: Input feature pipelines maintain less than 2% null rate on critical features after imputation
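A minimal sketch of the calibration check behind the Brier-score criterion, using scikit-learn; the dataset and model here are synthetic stand-ins:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for a real pipeline
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Brier score: {brier_score_loss(y_test, y_prob):.3f}")  # criterion: < 0.15

# Observed vs. predicted rate per decile bin; gaps beyond 5% fail the criterion
obs, pred = calibration_curve(y_test, y_prob, n_bins=10, strategy="quantile")
for p, o in zip(pred, obs):
    flag = "OK" if abs(p - o) <= 0.05 else "MISCALIBRATED"
    print(f"predicted {p:.2f}  observed {o:.2f}  {flag}")
```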
---

## Scope & Limitations

**This skill covers:**

- End-to-end experiment design including power analysis, randomization, and post-hoc analysis
- Feature engineering pipelines with profiling, generation, selection, and validation
- Model training and evaluation including cross-validation, calibration, and fairness checks
- Production model deployment with monitoring, drift detection, and canary rollouts

**This skill does NOT cover:**

- Data engineering infrastructure (ETL orchestration, pipeline scheduling, data lake management) -- see `senior-data-engineer`
- Deep learning model architecture design and training at scale (distributed GPU training, custom layers) -- see `senior-ml-engineer`
- Prompt engineering, RAG systems, and LLM fine-tuning workflows -- see `senior-prompt-engineer`
- Computer vision pipelines (object detection, segmentation, video processing) -- see `senior-computer-vision`

---

## Integration Points

| Skill | Integration | Data Flow |
|-------|-------------|-----------|
| `senior-data-engineer` | Feature pipeline ingests data from ETL outputs; shares data quality validation patterns | Raw data stores --> feature engineering pipeline --> feature store |
| `senior-ml-engineer` | Trained models handed off for MLOps deployment; shares model registry and serving configs | Evaluated model artifacts --> deployment pipeline --> production serving |
| `senior-prompt-engineer` | Embedding features from LLMs feed into ML pipelines; experiment frameworks apply to prompt A/B tests | LLM embeddings --> feature vectors; experiment designs --> prompt evaluation |
| `senior-architect` | Model serving architecture reviewed for scalability; data platform design aligned with training infrastructure | Architecture specs --> deployment topology --> monitoring dashboards |
| `senior-backend` | Model inference endpoints integrated into backend services; API contracts defined for prediction requests | REST/gRPC model API --> backend service layer --> client applications |
| `senior-devops` | CI/CD pipelines extended for model retraining triggers; containerized model images deployed via infrastructure-as-code | Docker images --> Kubernetes manifests --> production clusters |

---

## Tool Reference

### experiment_designer.py

**Purpose:** A/B test design, statistical power analysis, and sample size calculation. Validates experiment configuration and produces structured results with timestamps.

**Usage:**

```bash
python scripts/experiment_designer.py --input data/ --output results/
```

**Flags/Parameters:**

| Flag | Short | Required | Description |
|------|-------|----------|-------------|
| `--input` | `-i` | Yes | Input path (directory or file containing experiment data) |
| `--output` | `-o` | Yes | Output path (directory for results) |
| `--config` | `-c` | No | Path to configuration file (YAML or JSON) |
| `--verbose` | `-v` | No | Enable verbose (DEBUG-level) logging output |

**Example:**

```bash
python scripts/experiment_designer.py -i data/experiment_config/ -o results/power_analysis/ -c config.yaml -v
```

**Output Format:** JSON to stdout with the following structure:

```json
{
  "status": "completed",
  "start_time": "2026-03-21T10:00:00.000000",
  "processed_items": 0,
  "end_time": "2026-03-21T10:00:01.000000"
}
```
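Because the scripts in this reference emit this same JSON envelope on stdout, they can be driven programmatically. A minimal sketch, assuming the command is run from the repository root; the `subprocess` wrapper is illustrative, not part of the toolkit:

```python
import json
import subprocess

# Run the designer and capture its documented JSON envelope from stdout
result = subprocess.run(
    ["python", "scripts/experiment_designer.py", "-i", "data/", "-o", "results/"],
    capture_output=True,
    text=True,
    check=True,  # raises CalledProcessError on a non-zero exit code
)
report = json.loads(result.stdout)

if report["status"] != "completed":
    raise RuntimeError(f"experiment design failed: {report}")
print(f"Processed {report['processed_items']} items "
      f"between {report['start_time']} and {report['end_time']}")
```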
---

### feature_engineering_pipeline.py

**Purpose:** Automated feature generation, correlation analysis, and feature selection. Profiles raw data, generates candidate features, and validates for target leakage.

**Usage:**

```bash
python scripts/feature_engineering_pipeline.py --input data/ --output features/
```

**Flags/Parameters:**

| Flag | Short | Required | Description |
|------|-------|----------|-------------|
| `--input` | `-i` | Yes | Input path (directory or file containing raw data) |
| `--output` | `-o` | Yes | Output path (directory for generated features) |
| `--config` | `-c` | No | Path to configuration file (YAML or JSON) |
| `--verbose` | `-v` | No | Enable verbose (DEBUG-level) logging output |

**Example:**

```bash
python scripts/feature_engineering_pipeline.py -i data/raw/ -o features/v2/ -v
```

**Output Format:** JSON to stdout with the following structure:

```json
{
  "status": "completed",
  "start_time": "2026-03-21T10:00:00.000000",
  "processed_items": 0,
  "end_time": "2026-03-21T10:00:01.000000"
}
```

---

### model_evaluation_suite.py

**Purpose:** Model comparison, cross-validation, and deployment readiness checks. Validates serving latency and error rates, and confirms model outputs match the offline evaluation.

**Usage:**

```bash
python scripts/model_evaluation_suite.py --input models/ --output evaluation/
```

**Flags/Parameters:**

| Flag | Short | Required | Description |
|------|-------|----------|-------------|
| `--input` | `-i` | Yes | Input path (directory or file containing model artifacts) |
| `--output` | `-o` | Yes | Output path (directory for evaluation results) |
| `--config` | `-c` | No | Path to configuration file (YAML or JSON) |
| `--verbose` | `-v` | No | Enable verbose (DEBUG-level) logging output |

**Example:**

```bash
python scripts/model_evaluation_suite.py -i models/xgb_v3/ -o evaluation/report/ -c prod_config.yaml
```

**Output Format:** JSON to stdout with the following structure:

```json
{
  "status": "completed",
  "start_time": "2026-03-21T10:00:00.000000",
  "processed_items": 0,
  "end_time": "2026-03-21T10:00:01.000000"
}
```

> **Note:** The Tools table references `scripts/statistical_analyzer.py`, but this script does not yet exist in the repository. The statistical analysis workflows described in this SKILL.md can be performed using inline Python (scipy, statsmodels), as shown in Workflow 1 and Workflow 5 and in the sketch below.
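As one example of that inline approach, a minimal Welch's t-test sketch with scipy; the two metric samples are synthetic placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Placeholder metric samples for two user groups
group_a = rng.normal(10.0, 2.0, 500)
group_b = rng.normal(10.4, 2.0, 500)

# Welch's t-test (unequal variances) -- a safe default for metric comparisons
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject H0 at p < 0.05
```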