---
name: data-engineering-storage-remote-access-integrations-iceberg
description: "Apache Iceberg catalog configuration for cloud storage (S3, GCS, Azure). Covers AWS Glue and REST catalogs, table scanning, and append/overwrite operations."
dependsOn: ["@data-engineering-storage-lakehouse", "@data-engineering-storage-authentication"]
---

# Apache Iceberg with Cloud Storage

Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.

## Installation

```bash
pip install "pyiceberg[pyarrow,pandas,glue]"  # AWS Glue catalog
# or
pip install "pyiceberg[pyarrow]"              # REST catalog (support is built in)
```

## Catalog Configuration

### AWS Glue Catalog

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",        # Optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",
    }
)
```

When the `s3.access-key-id` and `s3.secret-access-key` properties are omitted, credentials are resolved from environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or the attached IAM role. Pass them explicitly only when necessary.
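One way to keep hard-coded keys out of application code is to build the property dictionary conditionally, so explicit credentials are included only when a caller supplies them. The sketch below is a minimal illustration under that assumption; `glue_properties` is a hypothetical helper, not part of PyIceberg, and it reuses the property keys from the example above.

```python
from typing import Dict, Optional


def glue_properties(
    region: str,
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
) -> Dict[str, str]:
    """Build Glue catalog properties (hypothetical helper, not a PyIceberg API)."""
    # A lone key without its partner is almost certainly a configuration mistake.
    if (access_key_id is None) != (secret_access_key is None):
        raise ValueError("Provide both access_key_id and secret_access_key, or neither")

    props = {"type": "glue", "s3.region": region}
    if access_key_id is not None and secret_access_key is not None:
        props["s3.access-key-id"] = access_key_id
        props["s3.secret-access-key"] = secret_access_key
    return props


# No explicit keys: the default credential chain (env vars, IAM role) applies.
default_props = glue_properties("us-east-1")
print(default_props)
```

The result can be passed straight through, for example `load_catalog("glue", **glue_properties("us-east-1"))`.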

### REST Catalog (Tabular, custom REST service)

```python
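from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "type": "rest",
        "uri": "https://iceberg-catalog.example.com",
        # S3-compatible object store (MinIO here) with placeholder credentials
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    }
)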
```

### Hive Metastore

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "hive",
    **{
        "type": "hive",
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    }
)
```

### Local Development (In-Memory Catalog)

```python
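# Module path in recent PyIceberg releases; older versions may differ
from pyiceberg.catalog.memory import InMemoryCatalog

# Catalog state is held in process memory and is lost on exit (tests and experiments only).
# `warehouse` sets where table data and metadata files are written.
catalog = InMemoryCatalog("local", warehouse="file:///tmp/iceberg-warehouse")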
```

## Table Operations

```python
import pyarrow as pa

# Load an existing table
table = catalog.load_table("db.my_table")

# Scan with filter pushdown and column pruning
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp"),
)
df = scan.to_pandas()  # or .to_arrow(), .to_polars()

# Append data (the Arrow schema must be compatible with the table schema)
new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024],
})
table.append(new_data)

# Overwrite (with no filter, replaces all existing rows)
table.overwrite(new_data)
```

## Schema Evolution

```python
from pyiceberg.types import LongType, StringType

# Add an optional column (non-breaking)
with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)

# Widen a column type (e.g., int → long)
with table.update_schema() as update:
    update.update_column("population", field_type=LongType())
```
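Not every type change counts as a safe widening. As a rough pre-flight check, the sketch below encodes the primitive promotions listed in the Iceberg spec for format versions 1 and 2 (int to long, float to double, and decimal precision increases with unchanged scale). `is_safe_promotion` and its string type names are hypothetical and for illustration only; newer format versions may allow more, and PyIceberg performs its own validation when the schema update commits.

```python
import re

# Primitive widenings permitted by the Iceberg v1/v2 spec (assumption: no v3 additions).
_WIDENING = {("int", "long"), ("float", "double")}
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_safe_promotion(current: str, target: str) -> bool:
    """Return True if changing `current` to `target` is a safe widening (hypothetical helper)."""
    if current == target:
        return True
    if (current, target) in _WIDENING:
        return True
    cur, tgt = _DECIMAL.fullmatch(current), _DECIMAL.fullmatch(target)
    if cur and tgt:
        cur_precision, cur_scale = int(cur.group(1)), int(cur.group(2))
        tgt_precision, tgt_scale = int(tgt.group(1)), int(tgt.group(2))
        # Precision may grow, but the scale has to stay the same.
        return tgt_precision > cur_precision and tgt_scale == cur_scale
    return False


for pair in [("int", "long"), ("long", "int"), ("decimal(9,2)", "decimal(18,2)")]:
    print(pair, is_safe_promotion(*pair))
```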

## Cloud Storage Authentication

See `@data-engineering-storage-authentication` for:
- AWS: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, IAM roles
- GCS: `GOOGLE_APPLICATION_CREDENTIALS`
- Azure: `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`

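The underlying cloud SDKs and filesystem clients generally resolve these variables through their default credential chains, so catalog properties can usually omit credentials. Provide explicit credentials only for local development or non-standard setups.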
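Besides the cloud SDK variables listed above, PyIceberg can read catalog properties from its own `PYICEBERG_CATALOG__...` environment variables. The sketch below derives the variable name for a given catalog and property, assuming the convention described in the PyIceberg configuration docs (a double underscore for nesting, dashes converted to underscores). `pyiceberg_env_var` is a hypothetical helper, so check the result against the documentation for your version.

```python
def pyiceberg_env_var(catalog_name: str, property_key: str) -> str:
    """Map a catalog property to its environment variable name (hypothetical helper)."""
    # "." separates nested fields and becomes "__"; "-" becomes "_".
    key = property_key.replace(".", "__").replace("-", "_")
    return f"PYICEBERG_CATALOG__{catalog_name.upper()}__{key.upper()}"


for prop in ("uri", "s3.access-key-id", "s3.secret-access-key"):
    print(pyiceberg_env_var("default", prop))
```

Setting these variables in the deployment environment keeps secrets out of source code while still letting `load_catalog("default")` find its configuration.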

## Best Practices

1. ✅ **Use a catalog** - Never manage Iceberg tables without catalog metadata
2. ✅ **Leverage partition evolution** - Change partition specs without rewriting data
3. ✅ **Archive old snapshots** - Run `expire_snapshots()` to limit metadata growth
4. ✅ **Schema evolution over schema enforcement** - Iceberg is designed for evolving schemas
5. ⚠️ **Monitor table metadata size** - Large histories slow operations
6. ⚠️ **Don't use local filesystem for production** - Use a shared catalog (Glue, Hive, REST)
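To get a feel for how much history a table carries before expiring anything, it can help to list the snapshots older than a retention window. The sketch below is read-only and assumes each snapshot object exposes `snapshot_id` and `timestamp_ms`, as the objects returned by PyIceberg's `table.snapshots()` do. `snapshots_older_than` is a hypothetical helper, and the stand-in snapshots exist only to make the example runnable without a catalog.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace
from typing import Iterable, List, Optional


def snapshots_older_than(
    snapshots: Iterable, max_age: timedelta, now: Optional[datetime] = None
) -> List[int]:
    """Return IDs of snapshots committed before `now - max_age` (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    cutoff_ms = int((now - max_age).timestamp() * 1000)
    return [s.snapshot_id for s in snapshots if s.timestamp_ms < cutoff_ms]


def _ms(dt: datetime) -> int:
    """Convert a datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


# Stand-in snapshots; with a real table, pass table.snapshots() instead.
fake_snapshots = [
    SimpleNamespace(snapshot_id=1, timestamp_ms=_ms(datetime(2024, 1, 1, tzinfo=timezone.utc))),
    SimpleNamespace(snapshot_id=2, timestamp_ms=_ms(datetime(2024, 5, 25, tzinfo=timezone.utc))),
]
stale = snapshots_older_than(
    fake_snapshots,
    max_age=timedelta(days=30),
    now=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
print(stale)
```

A steadily growing list of stale IDs is a signal that snapshot expiration is not running often enough.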

## Performance

- ✅ **Predicate pushdown**: Use `row_filter` in `scan()` to skip irrelevant files
- ✅ **Column pruning**: Use `selected_fields` to read only needed columns
- ✅ **Batch operations**: Append multiple records at once for better throughput
- ✅ **PyArrow backend**: `append()` and `overwrite()` take PyArrow tables, so build data in Arrow directly rather than converting from pandas
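Each `append()` call commits a new snapshot, so writing row by row inflates table metadata as well as runtime. One common approach is to group incoming records into fixed-size batches and append each batch as a single Arrow table. The sketch below shows only the grouping step; `chunk_rows` is a hypothetical helper and the batch size is arbitrary.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def chunk_rows(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `batch_size` rows (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


rows = [{"id": i, "value": float(i)} for i in range(5)]
batches = list(chunk_rows(rows, batch_size=2))
print([len(batch) for batch in batches])
```

With a real table, each batch could be written as `table.append(pa.Table.from_pylist(batch))`, provided the resulting Arrow schema is compatible with the table schema.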

## Related Skills

- `@data-engineering-storage-lakehouse/iceberg.md` - Iceberg concepts and detailed API
- `@data-engineering-storage-lakehouse` - Delta Lake vs Iceberg comparison
- `@data-engineering-storage-remote-access/libraries/pyarrow-fs` - PyArrow filesystem for direct S3/GCS access

---

## References

- [PyIceberg Documentation](https://pyiceberg.readthedocs.io/)
- [Apache Iceberg Specification](https://iceberg.apache.org/spec/)
- [Iceberg Catalog Configurations](https://iceberg.apache.org/docs/latest/catalog/)