---
name: data-engineering-storage-remote-access-integrations-pandas
description: "Reading and writing data with Pandas from/to cloud storage (S3, GCS, Azure) using fsspec and PyArrow filesystems."
dependsOn: ["@data-engineering-core", "@data-engineering-storage-authentication"]
---

# Pandas Integration with Remote Storage

Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.

## Auto-Detection (Simplest)

Pandas automatically uses fsspec for cloud URIs:

```python
import pandas as pd

# Read CSV/Parquet directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```

**Note:** Auto-detection uses default credentials. For explicit auth, see below.

## Explicit Filesystem (More Control)

```python
import fsspec
import pandas as pd

# Create fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses default credentials chain

# Open file through filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],  # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")]  # Row group filtering
)
```

## PyArrow Filesystem Backend

For better Arrow integration and zero-copy transfers:

```python
import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"]
)

# Write to cloud storage
df.to_parquet(
    "s3://bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"]  # Partitioned write
)
```

## Partitioned Writes

Write partitioned datasets efficiently:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0]
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs
)
# Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet
```

## Authentication

- **Auto-detection**: Uses default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)
- **Explicit**: Pass `key=`, `secret=` to `fsspec.filesystem()` constructor
- **For S3-compatible** (MinIO, Ceph):
  ```python
  fs = fsspec.filesystem("s3", client_kwargs={
      "endpoint_url": "http://minio.local:9000"
  })
  ```

See `@data-engineering-storage-authentication` for detailed patterns.

## Performance Tips

1. **Column pruning**: `pd.read_parquet(columns=[...])` only reads needed columns
2. **Row group filtering**: Use `filters=` parameter for partitioned data
3. **Cache results**: Wrap filesystem with `simplecache::` or `filecache::`
   ```python
   cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
   df = pd.read_parquet("simplecache::s3://bucket/data.parquet", filesystem=cached_fs)
   ```
4. **Use Parquet, not CSV**: Parquet supports pushdown, compression, and typed storage
5. **For large datasets**: Consider PySpark or Dask instead of pandas (pandas loads everything into memory)

## Limitations

- pandas loads entire DataFrame into memory - not suitable for datasets larger than RAM
- For lazy evaluation and better performance with large files, use `@data-engineering-core` (Polars)
- Multi-file reads require manual iteration (use `fs.glob()` + list comprehension)

## Alternatives

- **Polars** (`@data-engineering-core`): Faster, memory-mapped, lazy evaluation
- **Dask**: Parallel pandas for out-of-core computation
- **PySpark**: Distributed processing for big data

---

## References

- [pandas I/O documentation](https://pandas.pydata.org/docs/user_guide/io.html)
- [fsspec documentation](https://filesystem-spec.readthedocs.io/)
- `@data-engineering-storage-remote-access/libraries/fsspec`