ETL & Data Pipelines ==================== SQLSpec works well for ETL (Extract-Transform-Load) workflows. Use multiple database configs to move data between systems, and leverage Arrow-based methods for high-performance bulk transfers. Multi-Database ETL ------------------ Register source and target databases on a single ``SQLSpec`` instance. Extract from one, transform in Python, and load into the other. .. literalinclude:: /examples/patterns/etl_pipeline.py :language: python :caption: ``multi-database ETL pipeline`` :start-after: # start-example :end-before: # end-example :dedent: 4 :no-upgrade: Arrow-Based Bulk Transfer ------------------------- For large datasets, use ``select_to_arrow()`` to get results as Apache Arrow tables. This avoids per-row Python object overhead and enables zero-copy transfers between databases that support native Arrow (ADBC, DuckDB, BigQuery). .. code-block:: python # Extract as Arrow table arrow_result = await source_session.select_to_arrow( "SELECT * FROM large_table WHERE updated > :since", {"since": last_sync}, ) # Arrow table can be converted to pandas, polars, or written to Parquet df = arrow_result.to_pandas() # Or use native Arrow paths for zero-copy with DuckDB/ADBC arrow_result = await session.select_to_arrow( "SELECT * FROM events", return_format="table", # "table", "batch", "batches", or "reader" batch_size=10000, # rows per batch ) Supported ``return_format`` values: - ``"table"`` -- single ``pyarrow.Table`` (default) - ``"batch"`` -- single ``RecordBatch`` - ``"batches"`` -- iterator of ``RecordBatch`` objects - ``"reader"`` -- ``RecordBatchReader`` for streaming DuckDB as Staging Layer ----------------------- DuckDB excels as an ETL staging layer because it can read Parquet, CSV, and JSON files natively and attach to external PostgreSQL databases. .. code-block:: python from sqlspec import SQLSpec from sqlspec.adapters.duckdb import DuckDBConfig spec = SQLSpec() config = spec.add_config( DuckDBConfig(connection_config={"database": "/tmp/staging.db"}) ) with spec.provide_session(config) as session: # Read directly from Parquet files session.execute( "CREATE TABLE staging AS SELECT * FROM read_parquet('data/*.parquet')" ) # Transform and aggregate session.execute( "CREATE TABLE summary AS " "SELECT date, COUNT(*) as events " "FROM staging GROUP BY date" ) # Export results result = session.select("SELECT * FROM summary ORDER BY date") Related Guides -------------- - :doc:`configuration` for multi-database setup. - :doc:`drivers_and_querying` for the full query API. - :doc:`../reference/adapters` for adapter-specific Arrow capabilities.