---
name: ai-ml-data-science
description: "End-to-end data science and ML engineering workflows: problem framing, data/EDA, feature engineering (feature stores), modelling, evaluation/reporting, plus SQL transformations with SQLMesh. Use for dataset exploration, feature design, model selection, metrics and slice analysis, model cards/eval reports, experiment reproducibility, and production handoff (monitoring and retraining)."
---

# Data Science Engineering Suite - Quick Reference

This skill turns **raw data and questions** into **validated, documented models** ready for production:

- **EDA workflows**: Structured exploration with drift detection
- **Feature engineering**: Reproducible feature pipelines with leakage prevention and train/serve parity
- **Model selection**: Baselines first; strong tabular defaults; escalate complexity only when justified
- **Evaluation & reporting**: Slice analysis, uncertainty, model cards, production metrics
- **SQL transformation**: SQLMesh for staging/intermediate/marts layers
- **MLOps**: CI/CD, CT (continuous training), CM (continuous monitoring)
- **Production patterns**: Data contracts, lineage, feedback loops, streaming features

**Modern emphasis (2026):** Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.

---

## Quick Reference

| Task | Tool/Framework | Command | When to Use |
|------|----------------|---------|-------------|
| EDA & Profiling | Pandas, Great Expectations | `df.describe()`, `ge.validate()` | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | `df.transform()`, Feast materialization | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | `lgb.train()`, `model.fit()` | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | `optuna.create_study()`, `tune.Tuner(...).fit()` | Optimizing model parameters |
| SQL Transformation | SQLMesh | `sqlmesh plan`, `sqlmesh run` | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | `mlflow.log_metric()`, `wandb.log()` | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | `metrics.roc_auc_score()`, slice analysis | Validating model performance |

---

## Data Lake & Lakehouse

For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see **[data-lake-platform](../data-lake-platform/SKILL.md)**:

- **Table formats:** Apache Iceberg, Delta Lake, Apache Hudi
- **Query engines:** ClickHouse, DuckDB, Apache Doris, StarRocks
- **Alternative transformation:** dbt (in place of SQLMesh)
- **Ingestion:** dlt, Airbyte (connectors)
- **Streaming:** Apache Kafka patterns
- **Orchestration:** Dagster, Airflow

This skill focuses on **ML feature engineering and modeling**. Use data-lake-platform for general-purpose data infrastructure.

---

## Related Skills

For adjacent topics, reference:

- **[ai-mlops](../ai-mlops/SKILL.md)** - APIs, batch jobs, monitoring, drift, data ingestion (dlt)
- **[ai-llm](../ai-llm/SKILL.md)** - LLM prompting, fine-tuning, evaluation
- **[ai-rag](../ai-rag/SKILL.md)** - RAG pipelines, chunking, retrieval
- **[ai-llm-inference](../ai-llm-inference/SKILL.md)** - LLM inference optimization, quantization
- **[ai-ml-timeseries](../ai-ml-timeseries/SKILL.md)** - Time series forecasting, backtesting
- **[qa-testing-strategy](../qa-testing-strategy/SKILL.md)** - Test-driven development, coverage
- **[data-sql-optimization](../data-sql-optimization/SKILL.md)** - SQL optimization, index patterns (complements SQLMesh)
- **[data-lake-platform](../data-lake-platform/SKILL.md)** - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)

---

## Decision Tree: Choosing Data Science Approach

```text
User needs ML for: [Problem Type]
  - Tabular data?
    - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
    - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
    - High-dim sparse (text, counts)? -> Linear models, then shallow NN

  - Time series?
    - Seasonality? -> LightGBM, then see ai-ml-timeseries
    - Long-term dependencies? -> Transformers (see ai-ml-timeseries)

  - Text or mixed modalities?
    - LLMs/Transformers -> See ai-llm

  - SQL transformations?
    - SQLMesh (staging/intermediate/marts layers)
```

**Rule of thumb:** For tabular data, tree-based gradient boosting is a strong baseline, but validate it against simpler alternatives and your project constraints (for example latency or interpretability) before committing to it.

---

## Core Concepts (Vendor-Agnostic)

- **Problem framing**: define success metrics, baselines, and decision thresholds before modeling.
- **Leakage prevention**: ensure all features are available at prediction time; split by time/group when appropriate.
- **Uncertainty**: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics.
- **Reproducibility**: version code/data/features, fix seeds, and record the environment.
- **Operational handoff**: define monitoring, retraining triggers, and rollback criteria with MLOps.
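
The uncertainty point above can be made concrete with a percentile bootstrap. This is a minimal, library-free sketch; the function names (`accuracy`, `bootstrap_ci`), the toy data, and the defaults (1000 resamples, 95% interval) are illustrative assumptions, not part of this skill's templates.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric=accuracy, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over paired (label, prediction) rows."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(alpha / 2 * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return metric(y_true, y_pred), lower, upper

# Hypothetical toy data: 80 of 100 predictions are correct.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10
point, lower, upper = bootstrap_ci(y_true, y_pred)
print(f"accuracy={point:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Report the interval next to the point estimate. For grouped or temporal data, resampling individual rows may understate the variance, so resampling whole groups or time blocks is usually the safer choice.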

## Implementation Practices (Tooling Examples)

- Track experiments and artifacts (run id, commit hash, data version).
- Add data validation gates in pipelines (schema + distribution + freshness).
- Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).
- Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).

## Do / Avoid

**Do**
- Do start with baselines and a simple model to expose leakage and data issues early.
- Do run slice analysis and document failure modes before recommending deployment.
- Do keep an immutable eval set; refresh training data without contaminating evaluation.

**Avoid**
- Avoid random splits for temporal or user-correlated data.
- Avoid "metric gaming" (optimizing the number without validating business impact).
- Avoid training on labels created after the prediction timestamp (silent future leakage).
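
The split-related items above can be sketched without any ML library. The example assumes each row is a dict with a `predicted_at` timestamp and a `user_id`; those field names, the `label_delay` parameter, and both helper functions are hypothetical. The time split drops rows whose labels would not yet have been observed at the cutoff, and the group split keeps all rows of a user on one side.

```python
import hashlib
from datetime import datetime, timedelta

def time_split(rows, cutoff, label_delay):
    """Train on rows whose label was observable by `cutoff`; test on later rows.

    Rows predicted shortly before the cutoff are dropped because their labels
    would only arrive after the cutoff.
    """
    train = [r for r in rows if r["predicted_at"] + label_delay <= cutoff]
    test = [r for r in rows if r["predicted_at"] > cutoff]
    return train, test

def group_split(rows, group_key, test_fraction=0.2):
    """Send every row of a group to the same side using a stable hash."""
    train, test = [], []
    for r in rows:
        digest = hashlib.sha256(str(r[group_key]).encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 2**32  # deterministic value in [0, 1)
        (test if bucket < test_fraction else train).append(r)
    return train, test

# Hypothetical data: one prediction per day, seven distinct users.
start = datetime(2025, 1, 1)
rows = [{"user_id": i % 7, "predicted_at": start + timedelta(days=i)} for i in range(10)]
cutoff = start + timedelta(days=6)

train, test = time_split(rows, cutoff, label_delay=timedelta(days=2))
g_train, g_test = group_split(rows, "user_id")
print(len(train), len(test), len(g_train), len(g_test))
```

Hashing the group key keeps the assignment stable when the dataset is refreshed, which a seeded random shuffle does not guarantee once rows are added.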

# Core Patterns (Overview)

## Pattern 1: End-to-End DS Project Lifecycle

**Use when:** Starting or restructuring any DS/ML project.

**Stages:**

1. **Problem framing** - Business objective, success metrics, baseline
2. **Data & feasibility** - Sources, coverage, granularity, label quality
3. **EDA & data quality** - Schema, missingness, outliers, leakage checks
4. **Feature engineering** - Per data type with feature store integration
5. **Modelling** - Baselines first, then LightGBM, then complexity as needed
6. **Evaluation** - Offline metrics, slice analysis, error analysis
7. **Reporting** - Model evaluation report + model card
8. **MLOps** - CI/CD, CT (continuous training), CM (continuous monitoring)

**Detailed guide:** [EDA Best Practices](references/eda-best-practices.md)

---

## Pattern 2: Feature Engineering

**Use when:** Designing features before modelling or during model improvement.

**By data type:**

- **Numeric:** Scale or standardize, handle outliers, transform skewed distributions
- **Categorical:** One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)
  - **Feature Store Integration:** Store encoders, mappings, statistics centrally
- **Text:** Cleaning, TF-IDF, embeddings, simple stats
- **Time:** Calendar features, recency, rolling/lag features
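
For the lag and rolling features in the list above, the main risk is letting the current value leak into its own feature. A minimal sketch, assuming a plain list ordered oldest first; `add_lag_and_rolling` and the sample series are hypothetical:

```python
def add_lag_and_rolling(values, lag=1, window=3):
    """For each position t, build features only from values strictly before t."""
    features = []
    for t in range(len(values)):
        lag_value = values[t - lag] if t - lag >= 0 else None
        past = values[max(0, t - window):t]  # slice excludes values[t] itself
        rolling_mean = sum(past) / len(past) if past else None
        features.append({"lag": lag_value, "rolling_mean": rolling_mean})
    return features

daily_sales = [10, 12, 11, 15, 14]  # hypothetical series, oldest first
feats = add_lag_and_rolling(daily_sales)
for value, f in zip(daily_sales, feats):
    print(value, f)
```

In pandas the equivalent precaution is to shift the series before applying the rolling window, so the window never covers the row being predicted.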

**Key Modern Practice:** Use a feature store (Feast, Tecton, Databricks Feature Store) for versioning, sharing, and train-serve parity.

**Detailed guide:** [Feature Engineering Patterns](references/feature-engineering-patterns.md)

---

## Pattern 3: Data Contracts & Lineage

**Use when:** Building production ML systems with data quality requirements.

**Components:**

- **Contracts:** Schema + ranges/nullability + freshness SLAs
- **Lineage:** Track source -> feature store -> train -> serve
- **Feature store hygiene:** Materialization cadence, backfill/replay, encoder versioning
- **Schema evolution:** Backward/forward-compatible migrations with shadow runs
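
A contract of the kind listed above (schema, ranges/nullability, freshness SLA) can be sketched as a small validation gate. Everything here is illustrative: `ColumnContract`, `check_contract`, the column names, and the 24-hour staleness limit are assumptions, and production pipelines would more likely use a validation library such as Great Expectations or Pandera.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class ColumnContract:
    name: str
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None

def check_contract(rows, columns, newest_event_time, max_staleness, now=None):
    """Return a list of human-readable violations (empty list means pass)."""
    now = now or datetime.now(timezone.utc)
    violations = []
    if now - newest_event_time > max_staleness:
        violations.append(f"stale data: newest event is {now - newest_event_time} old")
    for i, row in enumerate(rows):
        for col in columns:
            value = row.get(col.name)
            if value is None:
                if not col.nullable:
                    violations.append(f"row {i}: {col.name} is null")
                continue
            if not isinstance(value, col.dtype):
                violations.append(f"row {i}: {col.name} has type {type(value).__name__}")
                continue
            if col.min_value is not None and value < col.min_value:
                violations.append(f"row {i}: {col.name}={value} below {col.min_value}")
            if col.max_value is not None and value > col.max_value:
                violations.append(f"row {i}: {col.name}={value} above {col.max_value}")
    return violations

# Hypothetical contract and batch.
columns = [
    ColumnContract("user_id", int),
    ColumnContract("age", int, nullable=True, min_value=0, max_value=120),
]
now = datetime(2025, 1, 2, tzinfo=timezone.utc)
rows = [{"user_id": 1, "age": 34}, {"user_id": None, "age": -1}, {"user_id": 3, "age": None}]
violations = check_contract(
    rows, columns,
    newest_event_time=now - timedelta(hours=30),
    max_staleness=timedelta(hours=24),
    now=now,
)
print("\n".join(violations))
```

Returning all violations, rather than failing on the first, makes the gate more useful as a pipeline report.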

**Detailed guide:** [Data Contracts & Lineage](references/data-contracts-lineage.md)

---

## Pattern 4: Model Selection & Training

**Use when:** Picking model families and starting experiments.

**Decision guide:**

- **Tabular:** Start with a **strong baseline** (linear/logistic, then gradient boosting) and iterate based on error analysis
- **Baselines:** Always implement simple baselines first (majority class, mean, naive forecast)
- **Train/val/test splits:** Time-based (forecasting), group-based (user/item leakage), or random (IID)
- **Hyperparameter tuning:** Start manual, then Bayesian optimization (Optuna, Ray Tune)
- **Overfitting control:** Regularization, early stopping, cross-validation
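
The three simple baselines named above (majority class, mean, naive forecast) fit in a few lines. This is a sketch with hypothetical function names and toy data; any candidate model should be compared against whichever baseline matches the task.

```python
from collections import Counter

def majority_class_predict(y_train, n):
    """Classification baseline: always predict the most frequent training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n

def mean_predict(y_train, n):
    """Regression baseline: always predict the training mean."""
    return [sum(y_train) / len(y_train)] * n

def naive_forecast(history, horizon):
    """Forecasting baseline: repeat the last observed value."""
    return [history[-1]] * horizon

# Hypothetical toy data.
y_train_cls = [0, 0, 0, 1, 0, 1]
y_test_cls = [0, 1, 0, 0]
preds = majority_class_predict(y_train_cls, len(y_test_cls))
baseline_accuracy = sum(p == t for p, t in zip(preds, y_test_cls)) / len(y_test_cls)

print(baseline_accuracy)
print(mean_predict([2.0, 4.0, 6.0], 2))
print(naive_forecast([100, 105, 103], horizon=3))
```

On imbalanced data the majority-class baseline can look deceptively strong on accuracy, which is one reason to pick the primary metric before modelling.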

**Detailed guide:** [Modelling Patterns](references/modelling-patterns.md)

---

## Pattern 5: Evaluation & Reporting

**Use when:** Finalizing a model candidate or handing over to production.

**Key components:**

- **Metric selection:** Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)
- **Threshold selection:** ROC/PR curves, cost-sensitive, F1 maximization
- **Slice analysis:** Performance by geography, user segments, product categories
- **Error analysis:** Collect high-error examples, cluster by error type, identify systematic failures
- **Uncertainty:** Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks
- **Evaluation report:** 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)
- **Model card:** Documentation for stakeholders (intended use, data, performance, ethics, operations)
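
Slice analysis from the list above can start as a per-segment metric table with a support flag. A minimal sketch, assuming records are dicts with `y_true`, `y_pred`, and a slice column; `slice_report`, the `region` field, and the `min_count=30` cut-off are illustrative assumptions.

```python
from collections import defaultdict

def slice_report(records, slice_key, min_count=30):
    """Accuracy per slice, with the gap to overall accuracy and a low-support flag."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[slice_key]].append(1.0 if r["y_true"] == r["y_pred"] else 0.0)
    total_hits = sum(sum(hits) for hits in buckets.values())
    total_rows = sum(len(hits) for hits in buckets.values())
    overall = total_hits / total_rows
    report = []
    for name, hits in sorted(buckets.items()):
        acc = sum(hits) / len(hits)
        report.append({
            "slice": name,
            "n": len(hits),
            "accuracy": acc,
            "gap_vs_overall": acc - overall,
            "low_support": len(hits) < min_count,  # too few rows to trust the estimate
        })
    return overall, report

def make_rows(region, n_correct, n_wrong):
    """Build hypothetical evaluation records for one region."""
    return ([{"region": region, "y_true": 1, "y_pred": 1}] * n_correct
            + [{"region": region, "y_true": 1, "y_pred": 0}] * n_wrong)

records = make_rows("EU", 36, 4) + make_rows("US", 45, 5) + make_rows("APAC", 5, 5)
overall, report = slice_report(records, "region")
for line in report:
    print(line)
```

Slices flagged as low support are candidates for targeted labeling rather than conclusions about model quality.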

**Detailed guide:** [Evaluation Patterns](references/evaluation-patterns.md)

---

## Pattern 6: Reproducibility & MLOps

**Use when:** Ensuring experiments are reproducible and production-ready.

**Modern MLOps (CI/CD/CT/CM):**

- **CI (Continuous Integration):** Automated testing, data validation, code quality
- **CD (Continuous Delivery):** Environment-specific promotion (dev -> staging -> prod), canary deployment
- **CT (Continuous Training):** Drift-triggered and scheduled retraining
- **CM (Continuous Monitoring):** Real-time data drift, performance, system health
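
The CT item above combines two triggers, drift and schedule. One way to express that decision is a small pure function that an orchestrator could call. The function name, the drift threshold of 0.2, and the 30-day maximum model age are placeholder assumptions to be replaced with values agreed for the model in question.

```python
from datetime import datetime, timedelta

def should_retrain(last_trained_at, now, drift_score,
                   drift_threshold=0.2, max_model_age=timedelta(days=30)):
    """Return (decision, reason); drift takes priority over the schedule."""
    if drift_score >= drift_threshold:
        return True, "drift"
    if now - last_trained_at >= max_model_age:
        return True, "schedule"
    return False, "none"

# Hypothetical timestamps and drift scores.
now = datetime(2025, 3, 1)
print(should_retrain(datetime(2025, 2, 20), now, drift_score=0.35))
print(should_retrain(datetime(2025, 1, 15), now, drift_score=0.05))
print(should_retrain(datetime(2025, 2, 20), now, drift_score=0.05))
```

Returning the reason alongside the decision makes it easy to log why each retraining run was started.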

**Versioning:**
- Code (git commit), data (DVC, lakeFS), features (feature store), models (MLflow Registry)
- Seeds (reproducibility), hyperparameters (experiment tracker)
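
The versioning items above can be captured in a single run manifest. The sketch below uses only the standard library; the manifest fields and helper names are assumptions, and the file hash merely stands in for a real data version from DVC or lakeFS. An experiment tracker such as MLflow would normally store this record.

```python
import hashlib
import json
import platform
import random
import subprocess
import sys
import tempfile
from pathlib import Path

def file_sha256(path):
    """Content hash of a data file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def git_commit():
    """Current commit hash, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def build_manifest(data_path, params, seed):
    """Seed the stdlib RNG and record what is needed to rerun the experiment."""
    random.seed(seed)
    return {
        "git_commit": git_commit(),
        "data_sha256": file_sha256(data_path),
        "params": params,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

# Hypothetical data file written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"
    data_path.write_bytes(b"id,label\n1,0\n2,1\n")
    manifest = build_manifest(data_path, {"learning_rate": 0.05}, seed=42)

print(json.dumps(manifest, indent=2))
```

Libraries with their own random state (NumPy, PyTorch, LightGBM) need to be seeded separately; the sketch only seeds Python's `random` module.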

**Detailed guide:** [Reproducibility Checklist](references/reproducibility-checklist.md)

---

## Pattern 7: Feature Freshness & Streaming

**Use when:** Managing real-time features and streaming pipelines.

**Components:**

- **Freshness contracts:** Define freshness SLAs per feature, monitor lag, alert on breaches
- **Batch + stream parity:** Same feature logic across batch/stream, idempotent upserts
- **Schema evolution:** Version schemas, add forward/backward-compatible parsers, backfill with rollback
- **Data quality gates:** PII/format checks, range checks, distribution drift (KL divergence, Kolmogorov-Smirnov test, PSI)
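
Of the drift measures in the list above, PSI (Population Stability Index) is simple enough to sketch directly. The version below assumes equal-width bins taken from the reference sample and a small floor for empty bins; both are common but not universal choices, and the function name and data are hypothetical.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), n_bins - 1)] += 1  # clamp out-of-range values to edge bins
        return [max(c / len(values), eps) for c in counts]  # floor avoids log(0)

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical reference sample and a shifted copy of it.
reference = [i / 100 for i in range(1000)]
shifted = [v + 3 for v in reference]

psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
print(psi_same, round(psi_shifted, 2))
```

Alert thresholds for PSI are conventions rather than guarantees, so calibrate them per feature against historical windows known to be stable.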

**Detailed guide:** [Feature Freshness & Streaming](references/feature-freshness-streaming.md)

---

## Pattern 8: Production Feedback Loops

**Use when:** Capturing production signals and implementing continuous improvement.

**Components:**

- **Signal capture:** Log predictions + user edits/acceptance/abandonment (scrub PII)
- **Labeling:** Route failures/edge cases to human review, create balanced sets
- **Dataset refresh:** Periodic refresh (weekly/monthly) with lineage, protect eval set
- **Online eval:** Shadow/canary new models, track solve rate, calibration, cost, latency
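
The dataset refresh step above depends on keeping evaluation examples out of training data. A minimal sketch, assuming every example carries a stable `id` and the eval set is tracked as a set of ids; `refresh_training_set` and the sample rows are hypothetical.

```python
def refresh_training_set(current_train, new_rows, eval_ids, id_key="id"):
    """Merge new labelled rows into training data without touching the eval set."""
    merged = {r[id_key]: r for r in current_train}
    skipped = 0
    for r in new_rows:
        if r[id_key] in eval_ids:
            skipped += 1  # never let eval examples leak into training
            continue
        merged[r[id_key]] = r  # a newer label replaces the older one for the same id
    return list(merged.values()), skipped

# Hypothetical data: id 2 gets a corrected label, id 9 belongs to the eval set.
current_train = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
new_rows = [{"id": 2, "label": 0}, {"id": 3, "label": 1}, {"id": 9, "label": 1}]
eval_ids = {9, 10}

refreshed, skipped = refresh_training_set(current_train, new_rows, eval_ids)
print(refreshed, skipped)
```

Logging the skipped count on each refresh gives a cheap signal that production traffic overlaps the eval set, which may call for a review of how eval examples were sampled.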

**Detailed guide:** [Production Feedback Loops](references/production-feedback-loops.md)

---

## Resources (Detailed Guides)

For comprehensive operational patterns and checklists, see:

- [EDA Best Practices](references/eda-best-practices.md) - Structured workflow for exploratory data analysis
- [Feature Engineering Patterns](references/feature-engineering-patterns.md) - Operational patterns by data type
- [Data Contracts & Lineage](references/data-contracts-lineage.md) - Data quality, versioning, feature store ops
- [Modelling Patterns](references/modelling-patterns.md) - Model selection, hyperparameter tuning, train/test splits
- [Evaluation Patterns](references/evaluation-patterns.md) - Metrics, slice analysis, evaluation reports, model cards
- [Reproducibility Checklist](references/reproducibility-checklist.md) - Experiment tracking, MLOps (CI/CD/CT/CM)
- [Feature Freshness & Streaming](references/feature-freshness-streaming.md) - Real-time features, schema evolution
- [Production Feedback Loops](references/production-feedback-loops.md) - Online learning, labeling, canary deployment

---

## Templates

Use these as copy-paste starting points:

### Project & Workflow Templates

- **Standard DS project template:** `assets/project/template-standard.md`
- **Quick DS experiment template:** `assets/project/template-quick.md`

### Feature Engineering & EDA

- **Feature engineering template:** `assets/features/template-feature-engineering.md`
- **EDA checklist & notebook template:** `assets/eda/template-eda.md`

### Evaluation & Reporting

- **Model evaluation report:** `assets/evaluation/template-evaluation-report.md`
- **Model card:** `assets/evaluation/template-model-card.md`
- **ML experiment review:** `assets/review/experiment-review-template.md`

### SQL Transformation (SQLMesh)

For SQL-based data transformation and feature engineering:

- **SQLMesh project setup:** `../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md`
- **SQLMesh model types:** `../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md` (FULL, INCREMENTAL, VIEW)
- **Incremental models:** `../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md`
- **DAG and dependencies:** `../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md`
- **Testing and data quality:** `../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md`

**Use SQLMesh when:**
- Building SQL-based feature pipelines
- Managing incremental data transformations
- Creating staging/intermediate/marts layers
- Testing SQL logic with unit tests and audits

**For data ingestion (loading raw data), use:**
- [ai-mlops](../ai-mlops/SKILL.md) skill (dlt templates for REST APIs, databases, warehouses)

## Navigation

**Resources**
- [references/reproducibility-checklist.md](references/reproducibility-checklist.md)
- [references/evaluation-patterns.md](references/evaluation-patterns.md)
- [references/feature-engineering-patterns.md](references/feature-engineering-patterns.md)
- [references/modelling-patterns.md](references/modelling-patterns.md)
- [references/feature-freshness-streaming.md](references/feature-freshness-streaming.md)
- [references/eda-best-practices.md](references/eda-best-practices.md)
- [references/data-contracts-lineage.md](references/data-contracts-lineage.md)
- [references/production-feedback-loops.md](references/production-feedback-loops.md)

**Templates**
- [assets/project/template-standard.md](assets/project/template-standard.md)
- [assets/project/template-quick.md](assets/project/template-quick.md)
- [assets/features/template-feature-engineering.md](assets/features/template-feature-engineering.md)
- [assets/eda/template-eda.md](assets/eda/template-eda.md)
- [assets/evaluation/template-evaluation-report.md](assets/evaluation/template-evaluation-report.md)
- [assets/evaluation/template-model-card.md](assets/evaluation/template-model-card.md)
- [assets/review/experiment-review-template.md](assets/review/experiment-review-template.md)
- [template-sqlmesh-project.md](../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md)
- [template-sqlmesh-model.md](../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md)
- [template-sqlmesh-incremental.md](../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md)
- [template-sqlmesh-dag.md](../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md)
- [template-sqlmesh-testing.md](../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md)

**Data**
- [data/sources.json](data/sources.json) - Curated external references

---

## External Resources

See [data/sources.json](data/sources.json) for curated foundational and implementation references:

- **Core ML/DL**: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX
- **Data processing**: pandas, NumPy, Polars, DuckDB, Spark, Dask
- **SQL transformation**: SQLMesh, dbt (staging/marts/incremental patterns)
- **Feature stores**: Feast, Tecton, Databricks Feature Store (centralized feature management)
- **Data validation**: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)
- **Visualization**: Matplotlib, Seaborn, Plotly, Streamlit, Dash
- **MLOps**: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)
- **Hyperparameter tuning**: Optuna, Ray Tune, Hyperopt
- **Model serving**: BentoML, FastAPI, TorchServe, Seldon, Ray Serve
- **Orchestration**: Kubeflow, Metaflow, Prefect, Airflow, ZenML
- **Cloud platforms**: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake

Use this skill to **execute data science projects end-to-end**: concrete checklists, patterns, and templates, not theory.