---
name: data-warehouse
description: Data warehouse (大数据/数仓) ops skill for designing and operating analytical data platforms. Use for tasks like defining source-of-truth, building ETL/ELT pipelines, dimensional modeling (star schema), data quality checks, partitioning strategies, cost/performance tuning, governance, lineage, and SLA monitoring.
---

# data-warehouse

Use this skill for 大数据/数仓（DW）建设与运维：从数据接入到指标交付与治理。

## Defaults / assumptions to confirm

- Warehouse tech: BigQuery/Snowflake/Redshift/Hive/ClickHouse/etc.
- Orchestration: Airflow/Dagster/Argo/dbt
- Ingestion: CDC vs batch, streaming (Kafka) vs file
- Data consumers: BI dashboards, product analytics, ML features

## Core outputs

- Warehouse architecture (sources → staging → warehouse → marts)
- Data model (facts/dimensions) + metric definitions
- Pipeline plan (DAGs, schedules, dependencies, SLAs)
- Data quality plan (checks, thresholds, alerts)
- Cost/performance plan (partitioning, clustering, materialization)
- Governance plan (access control, PII handling, retention)

## Workflow

1) Understand the business questions
- What decisions will this warehouse support?
- Define critical metrics and their definitions (single source of truth).

2) Source & ingestion design
- Identify systems of record and ownership.
- Choose ingestion: CDC for mutable OLTP, append-only logs for events.
- Define late-arriving data strategy and backfills.

3) Modeling (practical)
- Prefer star schema for BI: Fact tables + Dimension tables.
- Keep grain explicit (one row represents what?).
- Separate raw/staging from curated models; avoid mixing.

4) ETL/ELT pipelines
- Define DAGs with clear inputs/outputs and idempotency.
- Handle retries, partial failures, and reprocessing windows.
- Provide backfill procedures and runbook.

5) Data quality & observability
- Validate freshness, volume, schema drift, null ratios, referential consistency (logical).
- Add anomaly detection for key metrics.
- Track pipeline success rate, duration, and SLA misses.

6) Performance & cost
- Partition by date/time for large facts; cluster by common filters/joins.
- Materialize expensive queries (summary tables, incremental models).
- Control scan cost with column pruning and predicate pushdown.

7) Governance & security
- PII classification, masking, and retention.
- RBAC/ABAC for datasets; audit logging.
- Document lineage and ownership for each table/model.

## Templates

### Table spec
- Name:
- Grain:
- Partition/cluster keys:
- Key columns:
- Sources:
- Refresh cadence:
- Consumers:
- Quality checks:
- Owner:

### Metric spec
- Name:
- Definition:
- Numerator/denominator:
- Filters:
- Time window:
- Known caveats: