---
name: data-engineering-storage-remote-access-libraries-fsspec
description: "Comprehensive guide to fsspec: the universal filesystem interface for Python. Covers S3, GCS, Azure via s3fs, gcsfs, adlfs; protocol chaining, caching, async operations, and integration with the data ecosystem."
dependsOn: ["@data-engineering-core", "@data-engineering-storage-authentication"]
---

# fsspec: Universal Filesystem Interface

fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.

## Installation

```bash
# Core only (no remote support)
pip install fsspec

# With specific backends
pip install fsspec[s3]           # S3 via s3fs
pip install fsspec[gcs]          # GCS via gcsfs
pip install fsspec[s3,gcs,azure] # Multiple backends

# Or install backends directly
pip install s3fs gcsfs adlfs
```

## Basic Usage

```python
import fsspec
import pandas as pd

# List available protocols
print(fsspec.available_protocols())
# ['file', 'memory', 'http', 'https', 's3', 's3a', 'gcs', 'gs', 'abfss', ...]

# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False)  # Uses boto3 credentials
gcs_fs = fsspec.filesystem('gcs')            # Uses GCP credentials

# Basic operations
s3_fs.ls('my-bucket/data/')              # List files
s3_fs.exists('my-bucket/data/file.csv')  # Check existence
s3_fs.mkdir('my-bucket/new-folder')      # Create directory

# Read file as bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()

# Read a gzipped CSV directly into pandas
with s3_fs.open('s3://my-bucket/data/large.csv.gz', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')
```

## Protocol Chaining & Caching

```python
# SimpleCache: cache remote files locally for faster repeated access
import fsspec

# First read downloads; subsequent reads use the cache
local_path = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache'}
)

# Chain caching with a remote protocol:
# read from HTTPS, cache locally, decompress on the fly
with fsspec.open(
    "simplecache::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)

# Other useful chainable protocols:
# - "filecache::" - persistent disk cache
# - "zip::"       - access members of a zip archive
# Note: decompression is not a chained protocol; use the
# compression= argument, or let it be inferred from the extension.
```

## Advanced S3 Features

```python
import s3fs

# Detailed S3 configuration
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',  # Temporary session token
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',  # MinIO, etc.
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True  # Don't reuse a cached filesystem instance
)

# Async operations
import asyncio

async def read_multiple():
    fs = s3fs.S3FileSystem(asynchronous=True)
    await fs.set_session()  # Establish async session

    # Concurrent reads (use _cat_file for bytes)
    data = await asyncio.gather(
        fs._cat_file('bucket/file1.parquet'),
        fs._cat_file('bucket/file2.parquet'),
        fs._cat_file('bucket/file3.parquet')
    )
    return data

# S3-specific features
fs.find('my-bucket', prefix='data/2024')  # List with key prefix
fs.du('my-bucket/data')                   # Disk usage
fs.rm('my-bucket/temp/', recursive=True)  # Recursive delete
```

## Authentication

fsspec backends follow standard cloud credential resolution:

1. Explicit credentials (passed to the constructor)
2. Environment variables (`AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`, etc.)
3. Config files (`~/.aws/credentials`, gcloud CLI)
4. IAM roles / managed identities

See `@data-engineering-storage-authentication` for detailed patterns.
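To make the resolution order concrete, here is a minimal sketch contrasting explicit and ambient credentials; the key/secret values and bucket name are placeholders:

```python
import fsspec
import pandas as pd

# 1. Explicit credentials take highest precedence (placeholder values)
fs_explicit = fsspec.filesystem('s3', key='AKIA...', secret='...')

# 2-4. With no arguments, s3fs falls back to environment variables,
# then ~/.aws/credentials, then IAM role / instance metadata
fs_ambient = fsspec.filesystem('s3')

# The same options flow through tools that accept storage_options
df = pd.read_csv(
    's3://my-bucket/data.csv',
    storage_options={'key': 'AKIA...', 'secret': '...'}
)
```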
## When to Use fsspec

Choose fsspec when:

- You need broad ecosystem compatibility (pandas, xarray, Dask)
- You work with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features
- Your workflow involves diverse data formats beyond Parquet

## Performance Considerations

- ✅ Use `filecache::` instead of `simplecache::` for persistent caching across sessions
- ✅ Increase `max_pool_connections` for high concurrency
- ✅ Use the async API for many concurrent small-file operations
- ⚠️ For pure Parquet workflows with high throughput, consider `pyarrow.fs` instead
- ⚠️ For maximum performance on large concurrent operations, consider `obstore`

## Integration with Data Engineering Tools

- **Polars**: `pl.read_parquet("s3://bucket/file.parquet", storage_options={...})`
- **DuckDB**: `duckdb.register_filesystem(fsspec.filesystem('s3'))`
- **Pandas**: `pd.read_csv("s3://bucket/file.csv")` (dispatches to fsspec automatically)
- **PyArrow**: wrap an fsspec filesystem with `pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fsspec_fs))`

A combined sketch of the DuckDB and PyArrow patterns follows the references below.

For detailed integration patterns, see:

- `@data-engineering-storage-remote-access/integrations/polars`
- `@data-engineering-storage-remote-access/integrations/duckdb`
- `@data-engineering-storage-remote-access/integrations/pandas`

---

## References

- [fsspec Documentation](https://filesystem-spec.readthedocs.io/)
- [s3fs Documentation](https://s3fs.readthedocs.io/)
- [gcsfs Documentation](https://gcsfs.readthedocs.io/)
- [adlfs Documentation](https://github.com/fsspec/adlfs)
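As noted above, a minimal sketch combining the DuckDB and PyArrow integration patterns; the bucket path is a placeholder, and this assumes `duckdb`, `pyarrow`, and `s3fs` are installed:

```python
import duckdb
import fsspec
import pyarrow.dataset as ds
import pyarrow.fs

s3 = fsspec.filesystem('s3')

# DuckDB: register the fsspec filesystem, then query s3:// URLs directly
con = duckdb.connect()
con.register_filesystem(s3)
con.sql("SELECT COUNT(*) FROM read_parquet('s3://my-bucket/data/*.parquet')").show()

# PyArrow: wrap the fsspec filesystem so Arrow dataset APIs can use it
# (paths are given without the s3:// scheme)
arrow_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3))
dataset = ds.dataset('my-bucket/data/', format='parquet', filesystem=arrow_fs)
print(dataset.schema)
```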