---
name: data-engineering
description: "Comprehensive data engineering skill suite covering core libraries (Polars, DuckDB, PyArrow), lakehouse formats, cloud storage, orchestration, streaming, quality, observability, and AI/ML pipelines."
---

# Data Engineering Hub

Welcome to the comprehensive data engineering skill suite. This hub organizes all data engineering knowledge into logical, non-overlapping domains.

## Skill Map

| Domain | Skills | When to Use |
|--------|--------|-------------|
| **Core** | `@data-engineering-core` | Polars, DuckDB, PyArrow fundamentals; ETL patterns; error handling; performance optimization |
| **Storage** | `@data-engineering-storage-lakehouse` | Delta Lake, Apache Iceberg, Apache Hudi |
| | `@data-engineering-storage-remote-access` | fsspec, pyarrow.fs, obstore; cloud access patterns |
| | `@data-engineering-storage-authentication` | AWS, GCP, Azure auth - IAM roles, managed identity, secrets management |
| | `@data-engineering-storage-formats` | Parquet optimizations, Lance, Zarr, Avro, ORC |
| **Orchestration** | `@data-engineering-orchestration` | Prefect, Dagster, dbt, workflow scheduling |
| **Streaming** | `@data-engineering-streaming` | Kafka, MQTT, NATS JetStream for real-time data |
| **Quality** | `@data-engineering-quality` | Great Expectations, Pandera for data validation |
| **Observability** | `@data-engineering-observability` | OpenTelemetry, Prometheus for pipeline monitoring |
| **AI/ML** | `@data-engineering-ai-ml` | Embeddings, vector databases, RAG pipelines |
| **Best Practices** | `@data-engineering-best-practices` | Medallion architecture, partitioning, file sizing, incremental loads, schema evolution, testing |
| **Catalogs** | `@data-engineering-catalogs` | Data catalog systems: Iceberg catalogs, DuckDB multi-source, Amundsen/DataHub/OpenMetadata |

## Quick Reference: Core Stack

| Task | Recommended Tool |
|------|------------------|
| DataFrame operations | **Polars** (10-50x faster than pandas) |
| SQL analytics | **DuckDB** (embedded OLAP, zero-copy Arrow integration) |
| Data interchange | **PyArrow** (Arrow format, zero-copy transfers) |
| Cloud storage access | **fsspec** (universal), **pyarrow.fs** (Arrow-native), **obstore** (high-performance) |
| Lakehouse format | **Delta Lake** (Spark ecosystem), **Iceberg** (engine-agnostic), **Hudi** (streaming CDC) |
| Orchestration | **Prefect** (Pythonic flows), **Dagster** (asset-based), **dbt** (SQL transformations) |
| Validation | **Pandera** (lightweight), **Great Expectations** (enterprise) |

## Getting Started

### New to Data Engineering?

Start with `@data-engineering-core` to learn the foundational libraries and patterns.

### Working with Cloud Storage?

Go to `@data-engineering-storage-remote-access` for fsspec, pyarrow.fs, and obstore.

### Building Data Lakes?

Explore `@data-engineering-storage-lakehouse` for ACID table formats.

### Choosing a Data Catalog?

Check `@data-engineering-catalogs` for Iceberg catalogs, DuckDB multi-source patterns, and tool comparisons.

### Production-Grade Pipelines?

Read `@data-engineering-best-practices` for medallion architecture, partitioning, schema evolution, and testing strategies.

### Orchestrating Pipelines?

Check `@data-engineering-orchestration` for Prefect, Dagster, and dbt.

### Production Monitoring?

See `@data-engineering-observability` for tracing and metrics.

### AI/ML Data Pipelines?

Visit `@data-engineering-ai-ml` for embeddings, vector databases, and RAG.

## Principles

1. **Lazy evaluation**: Use Polars lazy frames and DuckDB query planning for performance
2. **Zero-copy data transfer**: Leverage Arrow format for memory efficiency
3. **Pushdown optimization**: Filter at storage layer to minimize data transfer
4. **Type safety**: Use explicit schemas and type hints
5. **Resilience**: Implement retries, circuit breakers, and proper error handling
6. **Observability**: Instrument pipelines with traces and metrics
7. **Security**: Never hardcode credentials; use IAM roles and environment variables

## Migration from Legacy Skills

This restructured suite replaces the previous split organization (`data-engineering-*` and `remote-filesystems-*`). All content has been consolidated to eliminate duplication and clarify ownership.

**Legacy skill replacements:**

- `data-engineering-core` → `@data-engineering-core` (plus specific integrations)
- `data-engineering-lakehouse` → `@data-engineering-storage-lakehouse`
- `data-engineering-orchestration` → `@data-engineering-orchestration`
- `data-engineering-streaming` → `@data-engineering-streaming`
- `data-engineering-quality` → `@data-engineering-quality`
- `data-engineering-observability` → `@data-engineering-observability`
- `data-engineering-llm-pipelines` → `@data-engineering-ai-ml`
- `remote-filesystems-*` → `@data-engineering-storage-remote-access` and integrations

All legacy skills remain functional but are deprecated. New content should be added to the new structure only.
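
The resilience principle listed under Principles (retries and proper error handling) can be illustrated with a small standard-library sketch. This is not taken from any of the skills above: the names `with_retries` and `make_flaky_read`, the default attempt count, and the choice of exponential backoff with jitter are all assumptions made for illustration. The flaky reader is a hypothetical stand-in for a remote storage call.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying transient failures with exponential backoff.

    Hypothetical helper: only exceptions listed in `retry_on` are retried;
    anything else propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # The delay doubles on each attempt; up to 10% jitter is added
            # so that parallel workers do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")


def make_flaky_read(failures: int) -> Callable[[], str]:
    """Hypothetical stand-in for a remote read that fails `failures` times."""
    state = {"calls": 0}

    def read() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("transient network error")
        return "payload"

    return read
```

In a pipeline, `task` would typically wrap a single idempotent step, such as one object read, so that a retry cannot duplicate writes. The `sleep` parameter is injectable mainly so tests can record delays instead of waiting.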