---
name: data-engineering-storage-remote-access-integrations-delta-lake
description: "Delta Lake integration with cloud storage (S3, GCS, Azure). Covers storage_options, PyArrow filesystem, time travel, and partitioned writes."
dependsOn: ["@data-engineering-storage-lakehouse", "@data-engineering-storage-authentication"]
---

# Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the pure-Python `deltalake` package (the delta-rs bindings).

## Installation

```bash
pip install deltalake pyarrow
```

## Configuration Patterns

### Method 1: storage_options (Recommended)

The simplest approach: pass credentials and settings as a dictionary:

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1",
}
# Alternatively, use environment variables (preferred for production):
# os.environ["AWS_ACCESS_KEY_ID"], etc.

# Sample data to write
pa_table = pa.table({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "value": [1, 2, 3],
})

# Write Delta table
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"],
)

# Read Delta table
dt = DeltaTable(
    "s3://bucket/delta-table",
    storage_options=storage_options,
)
df = dt.to_pandas()
```

**GCS configuration:**

```python
storage_options = {
    "GOOGLE_SERVICE_ACCOUNT": "/path/to/key.json"  # path to the service-account key file
    # Or set the GOOGLE_APPLICATION_CREDENTIALS env var instead
}
```

**Azure configuration:**

```python
storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "myaccount",
    "AZURE_STORAGE_ACCOUNT_KEY": "...",
    # Or a SAS token / connection string; see @data-engineering-storage-authentication
}
```

### Method 2: PyArrow Filesystem (Advanced)

Use PyArrow filesystem objects when you need more control over I/O. `DeltaTable` itself still resolves the transaction log through the URI and `storage_options`; the filesystem handles the Parquet data files. Note that recent `deltalake` releases removed the PyArrow write engine (and with it the `filesystem` parameter on writes), so treat this as a legacy pattern:

```python
import pyarrow.fs as fs
from deltalake import DeltaTable, write_deltalake

# Create a filesystem rooted at the table path
raw_fs, subpath = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Write (PyArrow engine only; the filesystem must be rooted at the table path)
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    filesystem=filesystem,
    mode="append",
)

# Read: pass the filesystem to the PyArrow-based readers
dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
df = dt.to_pandas(filesystem=filesystem)
```

## Time Travel

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)

# Load a specific version
dt.load_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp
dt.load_with_datetime("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()
# Newer releases replace both calls with load_as_version(), which
# accepts ints, ISO strings, and datetime objects

# history() returns a list of dicts, one per commit
for commit in dt.history():
    print(commit["version"], commit["timestamp"], commit["operation"])
```

## Maintenance Operations

```python
# Vacuum old files (retention in hours). dry_run defaults to True, and
# retention below the 168-hour default must be explicitly allowed.
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Compact small files
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # List of Parquet files in the table

# Get table metadata
print(dt.metadata())
```

## Incremental Processing

For change-data-capture (CDC) style patterns, track the table version between runs. Note that `history()` returns commit metadata, not row-level changes; for a true change feed, newer `deltalake` releases expose `DeltaTable.load_cdf()` on tables with change data feed enabled.

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)

# Get commits since the last checkpoint
last_version = get_checkpoint()  # your checkpoint tracking (sketch below)

# history() returns a list of dicts; keep commits newer than the checkpoint
changes = [c for c in dt.history() if c["version"] > last_version]

# Or read the full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())
```
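## Checkpoint Tracking

The `get_checkpoint()` / `save_checkpoint()` calls in the incremental example are placeholders for whatever checkpoint store you use. As a minimal sketch (assuming a local JSON file is an acceptable store; a database row or an object-store key would be the production equivalent), they could look like:

```python
import json
from pathlib import Path

# Hypothetical checkpoint location; swap for a durable store in production
CHECKPOINT_PATH = Path("checkpoint.json")

def get_checkpoint() -> int:
    """Return the last processed Delta table version, or -1 if none recorded."""
    if CHECKPOINT_PATH.exists():
        return json.loads(CHECKPOINT_PATH.read_text())["version"]
    return -1

def save_checkpoint(version: int) -> None:
    """Persist the last processed Delta table version."""
    CHECKPOINT_PATH.write_text(json.dumps({"version": version}))
```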
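## Reading with Partition Filters

Partitioned writes (the `partition_by=["date"]` examples above) pay off at read time: partition filters let `deltalake` skip whole partition directories instead of scanning the table. A minimal sketch, assuming credentials come from environment variables and the table is partitioned by `date` as above:

```python
from deltalake import DeltaTable

# Credentials resolved from env vars (or pass storage_options as in Method 1)
dt = DeltaTable("s3://bucket/delta-table")

# DNF-style partition filters: only matching partitions are read
df = dt.to_pandas(partitions=[("date", "=", "2024-01-01")])

# The same filter works when exposing the table as a PyArrow dataset
ds = dt.to_pyarrow_dataset(partitions=[("date", "=", "2024-01-01")])
```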
## Best Practices

1. ✅ **Use environment variables** for credentials in production (never hardcode them)
2. ✅ **Partition tables** by date or region for efficient querying (see the partition-filter example above)
3. ✅ **Vacuum regularly** to clean up old files, but retain enough history for your time-travel needs
4. ✅ **Optimize** periodically to compact small files
5. ✅ **Track versions** for incremental processing using `dt.version()` and `dt.history()`
6. ⚠️ **Don't** disable vacuum entirely; unreferenced files will accumulate and bloat storage
7. ⚠️ **Don't** vacuum too aggressively; you'll lose time-travel capability

## Authentication

See `@data-engineering-storage-authentication` for detailed cloud auth patterns. For S3:

- Environment: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`
- IAM roles (EC2, ECS, Lambda) are picked up automatically when no explicit credentials are set; explicit env vars take precedence over them
- For S3-compatible stores (MinIO): set `AWS_ENDPOINT_URL` in the environment or in `storage_options`; plain-HTTP endpoints also need `AWS_ALLOW_HTTP: "true"`

## Related

- `@data-engineering-storage-lakehouse/delta-lake` - Delta Lake concepts and API
- `@data-engineering-core` - Using Delta with DuckDB
- `@data-engineering-storage-lakehouse` - Comparisons with Iceberg, Hudi

---

## References

- [deltalake Python API](https://delta-io.github.io/delta-rs/python/quickstart.html)
- [Delta Lake Documentation](https://docs.delta.io/latest/index.html)