---
name: data-engineering-storage-remote-access-integrations-pyarrow
description: "Using PyArrow's parquet and dataset modules with remote filesystems (S3, GCS, Azure). Covers native filesystems, fsspec bridge, and obstore wrapper."
dependsOn: ["@data-engineering-core", "@data-engineering-storage-authentication"]
---

# PyArrow Remote Storage Integration

PyArrow's parquet and dataset modules work with cloud storage through the native `pyarrow.fs` filesystem abstraction and through fsspec-compatible filesystems.

## Native PyArrow Filesystem

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Create S3 filesystem
s3_fs = fs.S3FileSystem(region="us-east-1")

# Read single file with column filtering
table = pq.read_table(
    "bucket/file.parquet",  # Note: no s3:// prefix when a filesystem is passed
    filesystem=s3_fs,
    columns=["id", "value"]  # Column pruning
)

# Dataset with filtering and partitioning
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover()
)

# Filter pushdown (only reads matching files/row groups)
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value", "timestamp"]
)

# Batch scanning for large datasets
scanner = dataset.scanner(
    filter=ds.field("value") > 0,
    batch_size=65536,
    use_threads=True
)
for batch in scanner.to_batches():
    process(batch)  # placeholder for downstream processing
```

## fsspec Integration

PyArrow accepts fsspec filesystems anywhere a filesystem is expected and wraps them internally:

```python
import fsspec
import pyarrow.parquet as pq

s3 = fsspec.filesystem("s3")

# Open a file object via fsspec
with s3.open("s3://bucket/file.parquet", "rb") as f:
    table = pq.read_table(f)

# Or pass the fsspec filesystem explicitly
table = pq.read_table("bucket/file.parquet", filesystem=s3)
```

Note that a bare URI such as `pq.read_table("s3://bucket/file.parquet")` resolves to PyArrow's *native* `S3FileSystem`, not to fsspec; pass `filesystem=` explicitly to route I/O through fsspec.

## obstore fsspec Wrapper

Use obstore's high-performance fsspec wrapper for concurrent operations:

```python
from obstore.fsspec import FsspecStore
import pyarrow.parquet as pq

# Create an obstore-backed fsspec filesystem for the s3 protocol
fs = FsspecStore("s3", region="us-east-1")

# Use with PyArrow; the bucket is part of the path
table = pq.read_table("my-bucket/data/file.parquet", filesystem=fs)
```

## Dataset Scanning Patterns

See `@data-engineering-storage-remote-access/patterns.md` for advanced patterns including:

- Incremental loading with checkpoint tracking
- Partitioned writes with Hive partitioning (a minimal sketch appears in the appendix below)
- Cross-cloud copying
- Performance optimizations (predicate pushdown, column pruning)

## Authentication

See `@data-engineering-storage-authentication` for S3, GCS, and Azure credential configuration with PyArrow filesystems.

## Performance Tips

1. **Column pruning**: Always specify `columns=[...]` to reduce data transfer
2. **Filter pushdown**: Use `dataset.scanner(filter=...)` for predicate pushdown
3. **Row group pruning**: Parquet row groups enable partial file reads (see the appendix below)
4. **Threading**: Enable `use_threads=True` in the scanner for CPU-bound work
5. **Batch size**: Tune `batch_size` to match downstream processing needs
6. **File format**: Prefer Parquet over CSV/JSON for compression and pushdown

---

## References

- [PyArrow Filesystems Guide](https://arrow.apache.org/docs/python/filesystems.html)
- [PyArrow Dataset Guide](https://arrow.apache.org/docs/python/dataset.html)
- `@data-engineering-storage-remote-access/libraries/pyarrow-fs` - PyArrow.fs library details
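
## Appendix: Partitioned Write Sketch

A minimal sketch of the Hive-partitioned write mentioned in the patterns list, using `pyarrow.dataset.write_dataset`. The table contents, partition columns (`year`, `month`), and bucket path are illustrative assumptions, not part of this skill:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3_fs = fs.S3FileSystem(region="us-east-1")

# Illustrative table; `year` and `month` become partition directories
table = pa.table({
    "id": [1, 2, 3],
    "value": [10.0, 20.0, 30.0],
    "year": [2023, 2024, 2024],
    "month": [12, 1, 2],
})

# Produces a Hive layout: bucket/dataset/year=2024/month=1/...
ds.write_dataset(
    table,
    "bucket/dataset",  # no s3:// prefix when a filesystem is passed
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("year", pa.int64()), ("month", pa.int64())]),
        flavor="hive",
    ),
    existing_data_behavior="overwrite_or_ignore",
)
```

A dataset written this way can be read back with `ds.dataset(..., partitioning="hive")`, and filters on `year`/`month` prune entire directories before any file I/O.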
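
## Appendix: Row-Group Pruning Sketch

A minimal sketch of row-group-level reads (performance tip 3), assuming the same `s3_fs` filesystem and an illustrative file path. `pyarrow.parquet.ParquetFile` exposes row groups individually, so only the requested groups and columns are transferred:

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3_fs = fs.S3FileSystem(region="us-east-1")

with s3_fs.open_input_file("bucket/file.parquet") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.num_row_groups, "row groups")

    # Read a single row group with column pruning
    first = pf.read_row_group(0, columns=["id", "value"])

    # Or stream record batches without materializing the whole file
    for batch in pf.iter_batches(batch_size=65536, columns=["id", "value"]):
        ...  # process each batch
```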