--- github_repository: https://github.com/duckdb/duckdb-avro layout: docu title: Avro Extension --- The `avro` extension enables DuckDB to read [Apache Avro](https://avro.apache.org) files. ## The `read_avro` Function The extension adds a single DuckDB function, `read_avro`. This function can be used like so: ```sql FROM read_avro('⟨some_file⟩.avro'); ``` This function will expose the contents of the Avro file as a DuckDB table. You can then use any arbitrary SQL constructs to further transform this table. ## File IO The `read_avro` function is integrated into DuckDB's file system abstraction, meaning you can read Avro files directly from, e.g., HTTP or S3 sources. For example: ```sql FROM read_avro('http://blobs.duckdb.org/data/userdata1.avro'); FROM read_avro('s3://⟨your_bucket⟩/⟨some_file⟩.avro'); ``` should "just" work. You can also *glob* multiple files in a single read call or pass a list of files to the functions: ```sql FROM read_avro('some_file_*.avro'); FROM read_avro(['some_file_1.avro', 'some_file_2.avro']); ``` If the filenames somehow contain valuable information (as is unfortunately all-too-common), you can pass the `filename` argument to `read_avro`: ```sql FROM read_avro('some_file_*.avro', filename=true); ``` This will result in an additional column in the result set that contains the actual filename of the Avro file. ## Schema Conversion This extension automatically translates the Avro Schema to the DuckDB schema. *All* Avro types can be translated, except for *recursive type definitions*, which DuckDB does not support. The type mapping is very straightforward except for Avro's "unique" way of handling `NULL`. Unlike other systems, Avro does not treat `NULL` as a possible value in a range of e.g. `INTEGER` but instead represents `NULL` as a union of the actual type with a special `NULL` type. This is different to DuckDB, where any value can be `NULL`. Of course DuckDB also supports `UNION` types, but this would be quite cumbersome to work with. This extension *simplifies* the Avro schema where possible: An Avro union of any type and the special null type is simplified to just the non-null type. For example, an Avro record of the union type `["int","null"]` becomes a DuckDB `INTEGER`, which just happens to be `NULL` sometimes. Similarly, an Avro union that contains only a single type is converted to the type it contains. For example, an Avro record of the union type `["int"]` also becomes a DuckDB `INTEGER`. The extension also "flattens" the Avro schema. Avro defines tables as root-level "record" fields, which are the same as DuckDB `STRUCT` fields. For more convenient handling, this extension turns the entries of a single top-level record into top-level columns. ## Implementation Internally, this extension uses the "official" [Apache Avro C API](https://avro.apache.org/docs/++version++/api/c/), albeit with some minor patching to allow reading of Avro files from memory. ## Limitations and Future Plans * This extension currently does not make use of **parallelism** when reading either a single (large) Avro file or when reading a list of files. Adding support for parallelism in the latter case is on the roadmap. * There is currently no support for neither projection nor filter **pushdown**, but this is also planned at a later stage. * There is currently no support for the WASM or the Windows-MinGW builds of DuckDB due to issues with the Avro library dependency (sigh again). We plan to fix this eventually. * As mentioned above, DuckDB cannot express recursive type definitions that Avro has, this is unlikely to ever change. * There is no support to allow users to provide a separate Avro schema file. This is unlikely to change, all Avro files we have seen so far had their schema embedded. * There is currently no support for the `union_by_name` flag that other readers in DuckDB support. This is planned for the future.