Bulk Ingest =========== SQLSpec exposes native bulk-ingest fast paths through a small, adapter-agnostic storage-bridge API. High-volume writes use each driver's database primitive -- ``COPY``, ``LOAD DATA LOCAL INFILE``, direct path load, ``BulkCopy``, Arrow ingest, load jobs, or mutations -- instead of generic row-by-row execution. The API ------- Three methods cover the common shapes. They share return type ``StorageBridgeJob`` (its ``telemetry`` dict reports ``rows_processed``): - ``load_from_arrow(table, source, *, overwrite=False)`` -- load an Arrow table (or anything coercible to one) using the adapter's native ingest path. - ``load_from_storage(table, source, *, file_format, overwrite=False)`` -- load a staged artifact (a local path or cloud URI) into a table. - ``load_from_records(table, records, *, columns=None, overwrite=False)`` -- load in-memory rows. ``records`` may be mappings (columns derived from the keys) or positional sequences (``columns`` required). Records are normalized and routed through the adapter's native ``load_from_arrow`` path, so every adapter that supports Arrow ingest supports records too. .. code-block:: python # dict records -- columns inferred from keys driver.load_from_records("orders", [{"id": 1, "total": 9.99}, {"id": 2, "total": 4.50}]) # positional records -- columns required driver.load_from_records("orders", [(3, 1.0), (4, 2.0)], columns=["id", "total"]) Empty input, mismatched mapping keys, or a positional/column width mismatch raise :class:`~sqlspec.exceptions.ImproperConfigurationError`. Capability matrix ----------------- .. list-table:: :header-rows: 1 :widths: 18 42 22 18 * - Adapter - Native ingest path - Transactionality - Gate / opt-in * - asyncpg - ``COPY`` (``copy_records_to_table``) - Atomic; exact row counts - Always on * - psycopg (sync/async) - ``COPY`` streaming ``write_row`` - Atomic; exact row counts - Always on * - psqlpy - Binary ``COPY`` with ``INSERT`` fallback - Atomic - Always on * - adbc - ``adbc_ingest`` (append/replace) - Driver-dependent; FlightSQL falls back to per-row - Always on * - duckdb - ``register`` + ``INSERT ... SELECT`` - Single connection transaction - Always on * - sqlite / aiosqlite - ``executemany`` inside one ``BEGIN IMMEDIATE`` - Atomic when the driver owns the transaction; rolls back on error - Always on * - oracledb - direct path load (Thin mode, default); ``executemany`` fallback - Per ``execute_many``; array-DML row counts available - ``enable_direct_path_load=False`` to force fallback; ``oracle_batch_errors`` / ``oracle_array_dml_row_counts`` execution args * - MySQL family (pymysql, asyncmy, aiomysql, mysql-connector) - ``executemany`` (default); ``LOAD DATA LOCAL INFILE`` (opt-in) - Server-managed - ``enable_local_infile_bulk_load`` + connection ``local_infile`` / ``allow_local_infile`` * - bigquery - Parquet load job (default); Arrow Storage Write API (opt-in) - All-or-nothing load job / PENDING write stream - ``enable_storage_write_api``; load retry/timeout via job-control features * - spanner - ``Transaction.insert_or_update`` mutations (upsert); Batch Write API (opt-in) - In-transaction (default); independently committed groups (Batch Write) - Always on; ``enable_batch_write_api`` for high-throughput groups * - mssql-python - ``cursor.bulkcopy()`` via ``load_from_arrow`` - Driver-managed - Always on * - arrow_odbc - ``bulk_insert_arrow`` via ``load_from_arrow`` - Driver-managed - Always on Security and opt-in paths ------------------------- Some fast paths are opt-in because they read local files or change semantics: - **MySQL ``LOAD DATA LOCAL INFILE``** requires both the adapter feature ``enable_local_infile_bulk_load`` and the connection's local-infile setting (``local_infile=True`` for pymysql/aiomysql/asyncmy, ``allow_local_infile=True`` for mysql-connector). Enabling the feature without the connection gate raises :class:`~sqlspec.exceptions.ImproperConfigurationError` at config construction. The MySQL server must also have ``local_infile`` enabled. mysql-connector additionally honors ``allow_local_infile_in_path`` -- the staged temp file must live under that directory when it is set. - **Oracle direct path load** is the default bulk-ingest transport in Thin mode. Set ``enable_direct_path_load=False`` to force ``executemany``. Connections that do not expose the Direct Path Load API, including Thick-mode connections, silently fall back to ``executemany``. - **BigQuery Storage Write API** (``enable_storage_write_api``) streams Arrow rows for ``load_from_arrow`` appends and falls back to the Parquet load job when the Storage client is unavailable; ``overwrite=True`` always uses a Parquet ``WRITE_TRUNCATE`` load job. - **Spanner Batch Write API** (``enable_batch_write_api``) routes ``load_from_arrow`` through ``Database.mutation_groups().batch_write()`` for high-throughput, independently committed ``insert_or_update`` groups instead of a single in-transaction flush. The upsert semantics keep each group idempotent on replay. Examples -------- MySQL ``LOAD DATA LOCAL INFILE``: .. code-block:: python from sqlspec.adapters.pymysql import PyMysqlConfig config = PyMysqlConfig( connection_config={"host": "localhost", "local_infile": True}, driver_features={"enable_local_infile_bulk_load": True}, ) with config.provide_session() as driver: driver.load_from_arrow("orders", arrow_table) Oracle per-call batch error and array-DML row-count reporting: .. code-block:: python statement_config = driver.statement_config.replace( execution_args={"oracle_batch_errors": True, "oracle_array_dml_row_counts": True} ) result = driver.execute_many( "INSERT INTO orders (id, total) VALUES (:1, :2)", rows, statement_config=statement_config ) failures = result.metadata["oracle_batch_errors"] # list of {offset, code, message} row_counts = result.metadata["oracle_dml_row_counts"] # per-statement affected rows .. note:: Spanner ``load_from_arrow`` uses ``insert_or_update`` mutations, so re-running the same rows is an idempotent upsert rather than a primary-key collision.