---
name: data-engineering-storage-remote-access-integrations-polars
description: "Integrating Polars with remote filesystems (S3, GCS, Azure). Covers native cloud support, fsspec integration, PyArrow dataset scanning, and partitioned writes."
dependsOn: ["@data-engineering-core", "@data-engineering-storage-authentication"]
---

# Polars Integration with Remote Storage

Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.

## Native Cloud Access (object_store)

Polars uses the Rust `object_store` crate internally for direct cloud URI access:

```python
import polars as pl

# Read from cloud URIs directly (s3://, gs://, az://)
df = pl.read_parquet("s3://bucket/data/file.parquet")
df = pl.read_parquet("gs://bucket/data/file.parquet")
df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)

# Lazy scanning with predicate and projection pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed down into the scan
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)

# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")

# Hive-style partitioned write (native in recent Polars versions)
df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"],
)
```

**Supported protocols:** `s3://`, `gs://`, `az://`, `file://`

## Via fsspec

Use fsspec for broader compatibility and protocol chaining:

```python
import fsspec
import polars as pl

# Create an fsspec filesystem (s3fs); the region goes in client_kwargs
fs = fsspec.filesystem("s3", client_kwargs={"region_name": "us-east-1"})

# Open a file through fsspec and hand the file object to Polars
with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)

# Use fsspec's caching wrapper: remote files are cached locally on first read
cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False},
)
with cached_fs.open("bucket/cached.parquet") as f:
    df = pl.read_parquet(f)
```

## Via PyArrow Dataset (Advanced)

For Hive-partitioned datasets with complex pushdown:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs
import polars as pl

s3_fs = fs.S3FileSystem(region="us-east-1")

# Load a partitioned dataset (path has no s3:// scheme because the
# filesystem is passed explicitly)
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

# Convert to a Polars lazy frame
lazy_df = pl.scan_pyarrow_dataset(dataset)

# Query with full pushdown
result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)
```

## Authentication

Native Polars cloud access inherits credentials from:

- **AWS**: Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), `~/.aws/credentials`, IAM roles
- **GCP**: `GOOGLE_APPLICATION_CREDENTIALS`, gcloud CLI, metadata server
- **Azure**: `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`, managed identity

For explicit credentials, pass `storage_options` to the native readers (sketched below), or use fsspec/PyArrow filesystem constructors.
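When credentials can't come from the environment, the native readers accept a `storage_options` dict that is forwarded to `object_store`. A minimal sketch, assuming S3 with static keys; the bucket path and credential values are placeholders:

```python
import polars as pl

# Key names follow object_store's S3 option names; values are placeholders
storage_options = {
    "aws_access_key_id": "<ACCESS_KEY>",
    "aws_secret_access_key": "<SECRET_KEY>",
    "aws_region": "us-east-1",
}

df = pl.scan_parquet(
    "s3://bucket/data/file.parquet",  # hypothetical path
    storage_options=storage_options,
).collect()
```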
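Before collecting an authenticated remote scan, it can also be worth confirming that filters and column selections actually reach the scan rather than running after a full download. A small sketch against a hypothetical bucket; `LazyFrame.explain()` prints the optimized plan:

```python
import polars as pl

lf = (
    pl.scan_parquet("s3://bucket/dataset/**/*.parquet")  # hypothetical path
    .filter(pl.col("date") > "2024-01-01")
    .select(["id", "value"])
)

# Pushed-down predicates and the projected column subset appear on the
# Parquet scan node of the optimized plan, not as separate later steps.
print(lf.explain())
```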
## Performance Tips

- ✅ **Use native `s3://` URIs** for best performance (direct `object_store` access)
- ✅ **Lazy evaluation** with predicates, so filters are pushed into the scan
- ✅ **Partitioned writes** for large datasets (avoid huge single files)
- ✅ **Column selection** in lazy queries to read only the needed data
- ⚠️ For complex authentication (SSO, temporary credentials), use fsspec/PyArrow constructors
- ⚠️ For caching, use fsspec's `simplecache::` or `filecache::` wrappers

## Common Patterns

### Incremental Load from Partitioned Data

```python
from datetime import datetime, timedelta

import polars as pl

# Only read recent partitions; with Hive-partitioned data the date filter
# prunes partitions instead of scanning the whole dataset
lazy_df = pl.scan_parquet("s3://bucket/events/**/*.parquet")
last_month = datetime.now() - timedelta(days=30)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
```

### Cross-Cloud Copy

```python
# Read from S3, write to GCS (Polars doesn't support mixed URIs directly).
# Use the PyArrow bridge:
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()

# Paths are passed without a scheme when a filesystem is given explicitly
dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()

with gcs.open_output_stream("bucket/output.parquet") as out:
    pq.write_table(table, out)
```

---

## References

- [Polars Cloud Storage Guide](https://pola.rs/posts/polars_cloud_storage/)
- [Polars File System Backends](https://pola.rs/posts/polars_file_format_backends/)
- `@data-engineering-storage-remote-access/libraries/fsspec` - fsspec usage
- `@data-engineering-storage-remote-access/libraries/pyarrow-fs` - PyArrow filesystem