---
name: data-engineering-storage-remote-access
description: "Cloud storage access in Python: fsspec, pyarrow.fs, obstore libraries, plus integrations with Polars, DuckDB, PyArrow, Delta Lake, and Iceberg."
dependsOn: ["@data-engineering-core", "@data-engineering-storage-authentication", "@data-engineering-storage-formats"]
---

# Remote Storage Access

Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - **fsspec**, **pyarrow.fs**, and **obstore** - and their integration with data engineering tools.

## Quick Comparison

| Feature | fsspec | pyarrow.fs | obstore |
|---------|--------|------------|---------|
| **Best For** | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| **Backends** | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| **Performance** | Good (with caching) | Excellent for Parquet | **Up to 9x faster** for concurrent ops (project benchmarks) |
| **Dependencies** | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | **Zero Python deps** (Rust) |
| **Async Support** | Yes (aiohttp) | Limited | Native sync/async |
| **DataFrame Integration** | Universal | PyArrow-native | Via fsspec wrapper |
| **Maturity** | Very mature (2018+) | Mature | New (2025), rapidly evolving |

## When to Use Which?

### Use fsspec when:

- You need broad ecosystem compatibility (pandas, xarray, Dask)
- Working with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features (see the first sketch below)
- Your workflow involves diverse data formats beyond Parquet

### Use pyarrow.fs when:

- Your pipeline is Arrow/Parquet-native
- You need zero-copy integration with PyArrow datasets
- Predicate pushdown and column pruning are critical (see the second sketch below)
- Working with partitioned Parquet datasets

### Use obstore when:

- Performance is paramount (many small files, high concurrency)
- You need async/await support for concurrent operations (see the third sketch below)
- You want minimal dependencies (Rust-based)
- Working with large-scale data ingestion and egress
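For instance, fsspec's protocol chaining can layer a local on-disk cache in front of S3 so repeated reads hit disk instead of the network. A minimal sketch, assuming a hypothetical bucket/key and the default AWS credential chain (requires `s3fs` installed):

```python
import fsspec
import polars as pl

# "filecache::" chains a local file cache in front of the s3 protocol;
# per-protocol options are passed as dicts keyed by protocol name.
with fsspec.open(
    "filecache::s3://my-bucket/data.parquet",          # hypothetical bucket/key
    mode="rb",
    filecache={"cache_storage": "/tmp/fsspec-cache"},  # where cached blocks land
    s3={"anon": False},                                # use default AWS credentials
) as f:
    df = pl.read_parquet(f)  # a second run reads from /tmp/fsspec-cache, not S3
```

Swapping `filecache::` for `simplecache::` behaves similarly but keeps no cache metadata between sessions.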
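Predicate pushdown and column pruning with pyarrow.fs look like the following sketch, which assumes a hypothetical hive-partitioned dataset at `my-bucket/events/` with a `dt` partition column:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical layout: my-bucket/events/dt=2024-01-01/part-0.parquet, ...
s3 = fs.S3FileSystem(region="us-east-1")
dataset = ds.dataset(
    "my-bucket/events/",
    filesystem=s3,
    format="parquet",
    partitioning="hive",
)

# Filter on the partition column (predicate pushdown) and project two columns
# (column pruning): only matching files and column chunks are fetched.
table = dataset.to_table(
    columns=["user_id", "amount"],
    filter=ds.field("dt") == "2024-01-01",
)
```

Because the filter references the partition column, non-matching partitions are never downloaded, and the column projection limits reads to the required Parquet column chunks.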
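And obstore's native async API makes fanning out over many small objects straightforward. A sketch, assuming hypothetical object keys and ambient AWS credentials (obstore is young, so check the current API against its docs):

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

# Hypothetical bucket; credentials resolve via the standard AWS chain.
store = S3Store("my-bucket", region="us-east-1")

async def fetch(path: str) -> bytes:
    # get_async returns a GetResult; bytes_async() pulls the body into memory.
    result = await obs.get_async(store, path)
    return bytes(await result.bytes_async())

async def main() -> list[bytes]:
    paths = [f"raw/part-{i:05d}.parquet" for i in range(100)]  # hypothetical keys
    # Many small GETs in flight at once, driven by the Rust runtime.
    return await asyncio.gather(*(fetch(p) for p in paths))

blobs = asyncio.run(main())
```

In a real pipeline you would typically bound the fan-out (e.g., with an `asyncio.Semaphore`) rather than issuing every request at once.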
## Skill Dependencies

Prerequisites:

- `@data-engineering-core` - Polars, DuckDB, PyArrow basics
- `@data-engineering-storage-authentication` - AWS, GCP, Azure auth patterns
- `@data-engineering-storage-formats` - Parquet, Arrow, Lance, Zarr, Avro, ORC

Related:

- `@data-engineering-storage-lakehouse` - Delta Lake, Iceberg on cloud storage
- `@data-engineering-orchestration` - dbt with cloud storage

---

## Detailed Guides

### Library Deep Dives

- `@data-engineering-storage-remote-access-libraries-fsspec` - Universal filesystem interface
- `@data-engineering-storage-remote-access-libraries-pyarrow-fs` - Native Arrow integration
- `@data-engineering-storage-remote-access-libraries-obstore` - High-performance Rust library

### DataFrame Integrations

- `@data-engineering-storage-remote-access-integrations-polars` - Polars + cloud URIs
- `@data-engineering-storage-remote-access-integrations-duckdb` - DuckDB HTTPFS extension
- `@data-engineering-storage-remote-access-integrations-pandas` - Pandas + remote files
- `@data-engineering-storage-remote-access-integrations-pyarrow` - PyArrow datasets
- `@data-engineering-storage-remote-access-integrations-delta-lake` - Delta on S3/GCS/Azure
- `@data-engineering-storage-remote-access-integrations-iceberg` - Iceberg with cloud catalogs

### Infrastructure Patterns

- `@data-engineering-storage-authentication` - AWS, GCP, Azure auth patterns, IAM roles, service principals
- See `performance.md` in this skill - Caching, concurrency, async
- See `patterns.md` in this skill - Incremental loading, partitioned writes, cross-cloud copy

### Storage Formats

- `@data-engineering-storage-formats` - Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC

---

## Quick Start Example

```python
import fsspec
import obstore as obs
import polars as pl
import pyarrow.fs as fs
import pyarrow.parquet as pq
from obstore.store import S3Store

# Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://bucket/data.parquet', 'rb') as f:
    df = pl.read_parquet(f)

# Method 2: pyarrow.fs (Arrow-native)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table('bucket/data.parquet', filesystem=s3_pa)

# Method 3: obstore (high-performance)
store = S3Store(bucket='my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()

# All three approaches work - choose based on your performance and ecosystem needs
```

---

## Authentication

All three libraries follow the standard cloud credential-resolution order: explicit credentials → environment variables → config files → IAM roles/Managed Identities.

**See:** `@data-engineering-storage-authentication`

## Performance Optimization

Key strategies:

- **Caching**: fsspec's `simplecache`/`filecache` protocols for repeated access
- **Concurrency**: obstore's async API for many small files
- **Predicate pushdown**: Filter at the storage layer using partitioning
- **Column pruning**: Read only required columns

**See:** `@data-engineering-storage-remote-access/performance.md`

---

## References

- [fsspec Documentation](https://filesystem-spec.readthedocs.io/)
- [PyArrow Filesystems](https://arrow.apache.org/docs/python/filesystems.html)
- [obstore Documentation](https://developmentseed.org/obstore/)
- [s3fs Documentation](https://s3fs.readthedocs.io/)
- [gcsfs Documentation](https://gcsfs.readthedocs.io/)