---
name: data-engineering-orchestration
description: "Pipeline orchestration and workflow management with Prefect, Dagster, and dbt. Covers scheduling, dependency management, retries, and integration patterns."
dependsOn: ["@data-engineering-core"]
---

# Pipeline Orchestration

Workflow orchestration tools for data pipelines: Prefect, Dagster, and dbt. These tools handle scheduling, dependency resolution, retries, monitoring, and state management for production data pipelines.

## Quick Comparison

| Tool | Paradigm | Best For | Learning Curve |
|------|----------|----------|----------------|
| **Prefect** | Flow-based | Pythonic workflows, quick prototypes, cloud-first | Moderate |
| **Dagster** | Asset-based | Data asset lineage, reproducibility, type checking | Steeper |
| **dbt** | SQL transformations | Analytics engineering, ELT, data warehouses | Low (SQL-focused) |
| **FlowerPower** | Hamilton DAGs | Lightweight batch ETL, configuration-driven pipelines | Low-Moderate |

## When to Use Which?

- **Prefect**: You want Python code flexibility, Prefect Cloud UI, and quick setup. Good for general-purpose data pipelines, ETL, and API integrations.

- **Dagster**: You care about data asset observability, type safety, and reproducibility. Good for complex data platforms with clear asset dependencies.

- **dbt**: Your transformations are primarily SQL and you're building analytics marts in a data warehouse. Great for analytics engineering teams.

## Skill Dependencies

Assumes familiarity with:
- `@data-engineering-core` - Polars, DuckDB, PyArrow
- `@data-engineering-storage-remote-access` - Cloud storage for intermediate data

Related:
- `@data-engineering-quality` - Data validation integrated into orchestration
- `@data-engineering-observability` - Monitoring and tracing
- `@data-engineering-storage-lakehouse` - Delta/Iceberg for state management

---

## Detailed Guides

### Prefect
See: `@data-engineering-orchestration/prefect.md`

- Flows and tasks with decorators
- Retries, caching, and parameters
- Prefect Cloud (serverless) vs Prefect Server (self-hosted)
- Deployment patterns

### Dagster
See: `@data-engineering-orchestration/dagster.md`

- Asset-based programming model
- Materialization and partitions
- Type checking with Dagster types
- Sensors and schedules
- Integration with data platforms

### dbt (Data Build Tool)
See: `@data-engineering-orchestration/dbt.md`

- Projects, models, tests, snapshots, seeds
- Jinja templating and macros
- Data testing (schema, cardinality, custom)
- Documentation generation
- Package management (dbt packages)
- Adapters (DuckDB, Postgres, Snowflake, BigQuery, Spark)

### FlowerPower (Lightweight Alternative)

**FlowerPower** is a lightweight DAG orchestration framework built on Apache Hamilton, ideal for batch ETL and data transformation scripts without the overhead of full orchestrators.

**Key characteristics:**
- **Hamilton-based**: Define pipelines as Python functions; DAG auto-constructed
- **Configuration-driven**: YAML files for parameters and execution settings
- **Lightweight**: No database, no scheduler, no state persistence (batch-only)
- **Multiple executors**: synchronous, threadpool, processpool, ray, dask
- **I/O plugins**: Delta Lake, DuckDB, Polars, Pandas, S3, PostgreSQL, and more

**When to choose FlowerPower over Prefect/Dagster:**
- Simple batch pipelines (daily/Hourly ETL)
- Quick prototyping that can grow
- Teams that prefer code-first (Python functions) over YAML/UI
- No need for sophisticated scheduling, SLA tracking, or long-running state

**When NOT to use:**
- Production 24/7 workflows requiring reliability guarantees
- Complex dependency graphs with cross-dependencies
- Need for built-in retry policies with circuit breakers
- Workflows requiring checkpoints and state recovery
- Multi-team orchestration with fine-grained permissions

**FlowerPower limitations vs. Prefect/Dagster:**
| Feature | Prefect/Dagster | FlowerPower |
|---------|----------------|-------------|
| Scheduling | Native (cron, intervals) | External (cron/systemd) |
| State persistence | Database/cloud | None (ephemeral) |
| Retry policies | Configurable per task | Per-pipeline via YAML |
| Observability | Rich UI, lineage | Basic Hamilton UI |
| Production readiness | High | Moderate (batch jobs) |

**Integration with data-engineering stack:**
- Uses **Polars/DuckDB** for DataFrame operations (`@data-engineering-core`)
- **Delta Lake** for ACID table formats (`@data-engineering-storage-lakehouse`)
- **fsspec/S3** for cloud storage (`@data-engineering-storage-remote-access`)
- **Pandera** for data validation (`@data-engineering-quality`)
- Follows **medallion architecture** (`@data-engineering-best-practices`)

**Skill reference:** `@flowerpower` - Complete guide to FlowerPower with advanced production patterns (watermarks, data quality, incremental loads, cloud deployment).

---

### Cloud Storage Integration
See: `@data-engineering-orchestration/integrations/cloud-storage.md`

- dbt + S3/GCS via HTTPFS (DuckDB), aws_s3 extension (Postgres)
- Configuration patterns for profiles.yml
- Credential management best practices

---

## Common Patterns

### Retry Pattern (All Orchestrators)
```python
# Prefect: @task(retries=3, retry_delay_seconds=60)
# Dagster: @asset(retry_policy=RetryPolicy(...))
# dbt: --fail-fast flag + custom macro retry logic
```

### Idempotency
All orchestrators assume idempotent operations - running twice should produce identical results. Design your `INSERT`, `UPDATE`, `MERGE` operations to be idempotent.

### State Management
- **Prefect**: Flow run state persisted to database/cloud
- **Dagster**: Asset materialization events tracked
- **dbt**: Model run status in `dbt_run_results.json`; uses `SELECT` + `INSERT` by default

### Dependency Management
- **Prefect**: Explicit task dependencies (`task1 >> task2`)
- **Dagster**: Asset dependencies (`@asset(depends_on=[other_asset])`)
- **dbt**: DAG built from DAG from `ref()` calls in models

---

## Production Recommendations

1. **Version control everything**: Code, configs, dbt models, Prefect/Dagster definitions
2. **Test locally first**: Use unit tests for transformation logic, integration tests for pipeline runs
3. **Use environment variables** for credentials (never hardcode)
4. **Monitor pipeline runs**: Prefect Cloud UI, Dagster Dagit, dbt Cloud or custom alerts
5. **Alert on failures**: Configure email/Slack/webhook notifications
6. **Log aggregation**: Send orchestrator logs to centralized system (Datadog, CloudWatch)
7. **Idempotent writes**: Avoid duplicate data on retries
8. **Schema evolution**: Handle schema changes gracefully (additive only preferred)

---

## References

- [Prefect Documentation](https://docs.prefect.io/)
- [Dagster Documentation](https://docs.dagster.io/)
- [dbt Documentation](https://docs.getdbt.com/)
- [dbt-Labs/dbt-duckdb adapter](https://github.com/dbt-labs/dbt-duckdb)