name: Table Format Vocabulary
description: >-
  Domain vocabulary and taxonomy for open table formats used in data
  lakehouses, covering Apache Iceberg, Delta Lake, Apache Hudi, and related
  catalog and storage concepts.
created: '2026-05-03'
modified: '2026-05-03'
tags:
  - Data Lakehouse
  - Open Table Format
  - Apache Iceberg
  - Delta Lake
  - Apache Hudi
terms:
  - term: Open Table Format
    definition: >-
      A specification that defines how tabular data is organized, tracked, and
      accessed in object storage. Open table formats add ACID transactions,
      schema evolution, time travel, and efficient query planning to data lakes
      without requiring a dedicated storage engine. The three most widely used
      formats are Apache Iceberg, Delta Lake, and Apache Hudi.
    category: Core Concept
    tags:
      - Data Lakehouse
      - ACID
  - term: Apache Iceberg
    definition: >-
      An open table format specification originally developed at Netflix, now
      an Apache Software Foundation top-level project. Iceberg uses a
      hierarchical tree of metadata files (metadata.json → manifest list →
      manifests → data files) to provide snapshot isolation, schema evolution,
      partition evolution, and hidden partitioning. It is widely supported
      across query engines and cloud platforms.
    category: Table Format
    related:
      - Delta Lake
      - Apache Hudi
    tags:
      - Apache Iceberg
  - term: Delta Lake
    definition: >-
      An open-source storage layer, originally developed by Databricks and now
      hosted by the Linux Foundation, that adds ACID transactions to data
      lakes. Delta Lake maintains a delta log (a JSON-based transaction log)
      that records every change to the table. It is deeply integrated with
      Apache Spark and supports both batch and streaming workloads.
    category: Table Format
    related:
      - Apache Iceberg
      - Apache Hudi
    tags:
      - Delta Lake
      - Databricks
  - term: Apache Hudi
    definition: >-
      An open table format, whose name stands for Hadoop Upserts Deletes and
      Incrementals, optimized for record-level upserts and deletes. Hudi
      supports two table types: Copy-on-Write (COW) for read-optimized
      workloads and Merge-on-Read (MOR) for write-optimized workloads. It is
      popular for CDC-based data pipelines.
    category: Table Format
    related:
      - Apache Iceberg
      - Delta Lake
    tags:
      - Apache Hudi
      - CDC
  - term: Snapshot
    definition: >-
      An immutable, point-in-time view of an Iceberg table's state. Each
      committed write creates a new snapshot. Snapshots enable time travel
      queries and concurrent reads without locking. Each snapshot references a
      manifest list file, which points to the manifest files that track all
      data files in that version of the table.
    category: Apache Iceberg
    tags:
      - Time Travel
      - ACID
  - term: Manifest File
    definition: >-
      An Avro file that tracks a subset of the data files in an Iceberg table
      snapshot. Manifests record file-level statistics (row counts, min/max
      values per column), enabling partition pruning and predicate-based file
      skipping during query planning without opening the data files.
    category: Apache Iceberg
    tags:
      - Apache Iceberg
      - Metadata
  - term: Manifest List
    definition: >-
      An Avro file referenced by each Iceberg snapshot that lists all manifest
      files for that snapshot, along with partition-range summaries for each
      one. The manifest list lets engines skip whole manifests during planning
      and lets new snapshots reuse manifests that did not change.
    category: Apache Iceberg
    tags:
      - Apache Iceberg
      - Metadata
  - term: Catalog
    definition: >-
      A service that tracks tables and maps each table name to the location of
      its current metadata file. Swapping that pointer atomically is how an
      Iceberg commit becomes visible. Common Iceberg catalog implementations
      include Apache Polaris, Project Nessie, AWS Glue, Hive Metastore, and
      JDBC catalogs.
    category: Architecture
    tags:
      - Catalog
      - Metadata
  - term: REST Catalog
    definition: >-
      A standard HTTP/REST API specification for Iceberg catalog operations,
      defined as an OpenAPI spec in the Apache Iceberg project. Any catalog
      service (for example Polaris, Nessie, BigLake, or Glue) can expose this
      common interface, so an engine needs only one client implementation to
      work with all of them.
    category: Apache Iceberg
    tags:
      - REST API
      - Catalog
      - Interoperability
  - term: Schema Evolution
    definition: >-
      The ability to add, drop, rename, or reorder columns in a table without
      rewriting all data files.
      Iceberg, Delta Lake, and Hudi all support schema evolution, though the
      supported operations and their prerequisites differ by format (for
      example, Delta Lake requires column mapping to rename or drop columns
      without rewriting data).
    category: Core Concept
    tags:
      - Schema Management
  - term: Partition Evolution
    definition: >-
      The ability to change a table's partitioning strategy without rewriting
      existing data. Apache Iceberg supports partition evolution, allowing old
      data to keep its original partitioning while new data uses the updated
      partitioning scheme.
    category: Apache Iceberg
    tags:
      - Partitioning
      - Apache Iceberg
  - term: Hidden Partitioning
    definition: >-
      Apache Iceberg's approach where partition transforms (e.g.,
      day(timestamp), bucket(16, user_id)) are applied automatically by the
      engine, without requiring users to add separate partition columns.
      Queries filter on the source column and Iceberg derives the partition
      filter, which avoids mistakes such as computing a partition value
      incorrectly or forgetting to filter on the partition column.
    category: Apache Iceberg
    tags:
      - Partitioning
      - Apache Iceberg
  - term: Time Travel
    definition: >-
      The ability to query historical versions of a table by specifying a
      version identifier (such as an Iceberg snapshot ID or a Delta Lake
      version number) or a timestamp. All three major table formats support
      time travel, enabling auditing, rollback, and reproducible analytics.
    category: Core Concept
    tags:
      - Time Travel
      - ACID
  - term: Copy-on-Write (COW)
    definition: >-
      A table storage strategy where updates and deletes rewrite entire data
      files, producing clean base files (typically Parquet) optimized for read
      performance. Used in Hudi and Iceberg, and suited to read-heavy workloads
      with infrequent updates.
    category: Storage Strategy
    tags:
      - Apache Hudi
      - Performance
  - term: Merge-on-Read (MOR)
    definition: >-
      A table storage strategy where updates and deletes are recorded as delta
      files (log files in Hudi, delete files in Iceberg) that are merged with
      base files during reads. It optimizes write performance at the cost of
      additional read-time merging.
    category: Storage Strategy
    tags:
      - Apache Hudi
      - Apache Iceberg
      - Performance
  - term: Delta Log
    definition: >-
      The transaction log used by Delta Lake, stored as an ordered sequence of
      JSON commit files in the _delta_log/ directory, with periodic Parquet
      checkpoints.
      Each log entry records the files added and removed, along with any
      metadata changes such as schema updates, in a single transaction.
    category: Delta Lake
    tags:
      - Delta Lake
      - Transaction Log
  - term: Z-Order
    definition: >-
      A data layout optimization technique (used in Delta Lake and Iceberg)
      that sorts rows along a space-filling Z-order curve computed from several
      columns, so that rows with similar values in those columns land in the
      same set of files. This improves file skipping for queries with
      multi-dimensional filters.
    category: Performance
    tags:
      - Performance
      - Data Layout
  - term: Lakehouse
    definition: >-
      An architecture that combines the low-cost, scalable storage of data
      lakes with the transactional reliability and governance features of data
      warehouses. Open table formats are the foundational layer enabling the
      lakehouse paradigm.
    category: Architecture
    tags:
      - Data Architecture
      - Data Lakehouse
  - term: ACID Transactions
    definition: >-
      Atomicity, Consistency, Isolation, and Durability guarantees applied to
      table operations. Open table formats bring ACID transactions to object
      storage by committing changes through atomic metadata updates, typically
      with optimistic concurrency control, rather than row-level locking.
    category: Core Concept
    tags:
      - ACID
      - Data Integrity