Apache Iceberg: Adopt
Overview
Apache Iceberg is an open table format for large analytic datasets that brings SQL-table reliability to data lakes. The Apache project describes it as a high-performance format for huge analytic tables that lets engines such as Spark, Trino, Flink, Presto, Hive, and Impala work safely on the same tables at the same time (Apache Iceberg). Its core value is decoupling table semantics from any single compute engine or warehouse, so shared analytical data products can remain open, versioned, and reproducible.
Iceberg's table specification is mature enough for production adoption. Versions 1, 2, and 3 of the spec are complete and adopted by the community, while version 4 is under active development and has not been formally adopted (Apache Iceberg Spec). The spec defines snapshot-based metadata, optimistic concurrency for commits, serializable isolation as a design goal, schema evolution, partition evolution, position and equality deletes, deletion vectors (added in v3), and compatibility rules across format versions (Apache Iceberg Spec).
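To make the snapshot-and-commit model concrete, here is a minimal sketch of an optimistic commit loop. It is a toy in-memory model, not Iceberg's implementation: `ToyCatalog`, `TableMetadata`, `CommitConflict`, and `append_snapshot` are hypothetical names introduced for illustration. The idea it tries to capture is that a writer builds new metadata from the version it read, and the commit succeeds only if the table pointer still refers to that version.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Immutable table state: a metadata version plus its snapshot list."""
    version: int
    snapshots: tuple = ()


class CommitConflict(Exception):
    """Raised when the table changed after the writer read its base metadata."""


class ToyCatalog:
    """Hypothetical catalog that holds the pointer to the current metadata."""

    def __init__(self):
        self._current = TableMetadata(version=0)
        self._lock = threading.Lock()

    def load(self):
        return self._current

    def swap(self, base, new):
        # Compare-and-swap: succeed only if nobody committed after `base` was read.
        with self._lock:
            if self._current is not base:
                raise CommitConflict(
                    f"expected v{base.version}, found v{self._current.version}"
                )
            self._current = new


def append_snapshot(catalog, snapshot_id, max_retries=3):
    """Optimistic loop: read base, build new metadata, swap, retry on conflict."""
    for _ in range(max_retries):
        base = catalog.load()
        new = TableMetadata(base.version + 1, base.snapshots + (snapshot_id,))
        try:
            catalog.swap(base, new)
            return new
        except CommitConflict:
            continue  # Another writer won; re-read the table and try again.
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
append_snapshot(catalog, "snap-a")
append_snapshot(catalog, "snap-b")
print(catalog.load())
```

Because old metadata objects are never mutated, a reader that loaded version 1 keeps a consistent view while later commits land. This mirrors, in simplified form, why snapshot-based metadata supports reproducible reads.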
The strongest current adoption driver is open lakehouse interoperability. Iceberg's REST catalog specification defines a common OpenAPI-based API for interacting with any Iceberg catalog, enabling new languages and engines to support catalogs with one client implementation, while supporting secure table sharing through credential vending or remote signing (Apache Iceberg REST Catalog Spec). This makes Iceberg especially relevant for data platforms that need governed access from multiple engines, warehouses, and ML/AI workloads without copying datasets into each platform.
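As a rough illustration of what "one client implementation" means, the sketch below builds (but does not send) a load-table request. The route shape `/v1/{prefix}/namespaces/{namespace}/tables/{table}`, the unit-separator encoding for multi-level namespaces, and the `X-Iceberg-Access-Delegation` header reflect my reading of the REST catalog OpenAPI definition; check them against the spec version your catalog implements. The host, prefix, and table names are hypothetical, and a real client would also handle authentication and fetch server configuration first.

```python
from urllib.parse import quote

# Multi-level namespace parts are joined with the unit separator (0x1F).
NAMESPACE_SEPARATOR = "\x1f"


def load_table_request(base_url, prefix, namespace, table,
                       delegation="vended-credentials"):
    """Build (method, url, headers) for a load-table call without sending it."""
    encoded_namespace = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    parts = ["v1"]
    if prefix:
        # The prefix is optional and normally supplied by the catalog's config.
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    url = base_url.rstrip("/") + "/" + "/".join(parts)

    headers = {"Accept": "application/json"}
    if delegation:
        # Ask the catalog for "vended-credentials" or "remote-signing".
        headers["X-Iceberg-Access-Delegation"] = delegation
    return "GET", url, headers


# Hypothetical catalog endpoint and table, for illustration only.
method, url, headers = load_table_request(
    "https://catalog.example.com/api", "prod", ["sales", "eu"], "orders"
)
print(method, url)
```

Any engine that can issue this request and parse the response can locate the table's metadata, which is why the REST catalog works as a shared boundary between engines.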
Adoption Signals
- Apache Iceberg's latest listed release is 1.11.0, and the project states that Iceberg 1.0.0 officially guaranteed API stability after the API had already been integrated with many processing engines (Apache Iceberg Releases).
- Snowflake supports Iceberg tables across all accounts, cloud platforms, and regions, with read/write support, Snowflake-managed and external catalog options, Snowflake Open Catalog integration, and support for Iceberg spec versions 1, 2, and 3 with caveats (Snowflake Documentation).
- Databricks announced public preview support for managed Iceberg tables in Unity Catalog, including read and write access from Databricks and external Iceberg engines through Unity Catalog's Iceberg REST Catalog API, plus governance for Iceberg tables managed by foreign catalogs such as AWS Glue, Hive Metastores, and Snowflake Horizon Catalog (Databricks).
- AWS Prescriptive Guidance describes native Iceberg support across Amazon EMR, AWS Glue, Amazon Athena, and Amazon Redshift for building transactional data lakes on Amazon S3, and states that the next-generation Amazon SageMaker lakehouse is fully compatible with Iceberg and can query data in place using the Iceberg REST API (AWS Prescriptive Guidance).
- The REST catalog has become a de facto interoperability focus: Snowflake supports remote Iceberg REST catalogs including AWS Glue and Snowflake Open Catalog, while Databricks exposes Unity Catalog through Iceberg REST Catalog APIs for compatible clients such as Spark, Flink, Trino, PyIceberg, Kafka Connect, and Redpanda (Snowflake Documentation, Databricks).
Risks
- Operational maintenance is not optional in practice. Iceberg's recommended maintenance is expiring snapshots, removing old metadata files, and deleting orphan files, with compacting data files and rewriting manifests as optional steps; without this work, tables accumulate metadata, small files, stale snapshots, and unreferenced objects that hurt performance and storage cost (Apache Iceberg Maintenance).
- Maintenance can be dangerous if misconfigured. Iceberg warns that deleting orphan files with a retention interval shorter than the expected write duration can corrupt a table by deleting in-progress files, and that path-string mismatches on some file systems can lead to data loss during orphan-file removal (Apache Iceberg Maintenance).
- Catalog strategy is a platform decision, not an implementation detail. Snowflake distinguishes between Snowflake-managed Iceberg tables with full platform support and externally managed Iceberg tables with limited platform support, and notes that Snowflake does not sync remote catalog access control for users or roles in catalog-linked databases (Snowflake Documentation).
- Engine support is uneven. Snowflake supports some Iceberg v2/v3 features but not equality delete files, has restrictions around external query-engine writes, and documents numerous caveats for external catalogs, row-level deletes, private connectivity, metadata consistency, replication, streams, and fine-grained access control policies (Snowflake Documentation).
- An open format does not automatically mean open governance. Teams still need a chosen catalog, access-control model, lineage model, data-quality checks, lifecycle policies, cost controls, and ownership conventions across engines; otherwise Iceberg can become another unmanaged data lake layout rather than a governed data-product foundation.
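The orphan-file risk described above can be sketched as a guard that refuses to run cleanup when the retention interval is too close to the longest expected write. This is a toy model under stated assumptions, not Iceberg's maintenance action: `safe_orphan_cutoff`, `select_orphans`, and the `safety_factor` margin are hypothetical, and the three-day value mirrors the default retention interval described in Iceberg's maintenance documentation.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=3)  # assumed default; confirm for your engine


def safe_orphan_cutoff(now, retention, longest_expected_write, safety_factor=2.0):
    """Return the timestamp before which unreferenced files may be deleted.

    Raises if the retention interval could overlap an in-progress write.
    """
    minimum = longest_expected_write * safety_factor
    if retention < minimum:
        raise ValueError(
            f"retention {retention} is shorter than the safe minimum {minimum}"
        )
    return now - retention


def select_orphans(files, referenced_paths, cutoff):
    """Pick deletable paths from (path, last_modified) pairs.

    A file is deletable only if no metadata references it AND it is older
    than the cutoff. Paths are compared as exact strings here, which is the
    step where scheme or authority mismatches can cause data loss.
    """
    return sorted(
        path
        for path, modified in files
        if path not in referenced_paths and modified < cutoff
    )


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
cutoff = safe_orphan_cutoff(now, DEFAULT_RETENTION, timedelta(hours=6))
files = [
    ("data/a.parquet", datetime(2025, 1, 1, tzinfo=timezone.utc)),      # referenced
    ("data/b.parquet", datetime(2025, 1, 1, tzinfo=timezone.utc)),      # old orphan
    ("data/c.parquet", datetime(2025, 1, 9, 23, tzinfo=timezone.utc)),  # maybe in flight
]
orphans = select_orphans(files, {"data/a.parquet"}, cutoff)
print(orphans)
```

The recently written file is left alone even though nothing references it yet, because it may belong to a commit that has not finished.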
Pros & Cons
Advantages
- Provides an open, engine-neutral table format for large analytical datasets on object storage.
- Supports ACID-style table updates, schema evolution, partition evolution, hidden partitioning, snapshots, time travel, rollback, and row-level deletes.
- Improves interoperability across Spark, Flink, Trino, Presto, Hive, Impala, cloud warehouses, catalogs, and lakehouse platforms.
Disadvantages
- Requires operational ownership for compaction, snapshot expiration, metadata cleanup, orphan-file removal, and manifest maintenance.
- Catalog choice can create governance, interoperability, and lock-in trade-offs even when the table format itself is open.
- Feature support differs across engines, especially around writes, equality deletes, v3 features, external catalogs, and fine-grained access policies.
Recommendation
Adopt Apache Iceberg as the default open table format for shared analytical datasets that need open storage, multi-engine access, reproducibility, governance, and long-lived interoperability. It is especially relevant for AI and ML data foundations because feature pipelines, RAG ingestion, analytics, model evaluation, lineage, and backtesting all depend on consistent, versioned, high-quality data that can be accessed by different engines without duplicating storage.
Adoption should be platform-led rather than project-by-project. Standardize catalog strategy, table naming, ownership, access-control integration, snapshot retention, compaction, orphan-file cleanup, metadata cleanup, branch/tag usage, schema evolution rules, and compatibility expectations across engines. Treat the Iceberg REST catalog as a key architectural boundary, and test real read/write interoperability among the engines that matter before declaring a table "open."
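One way to make these platform standards checkable is to record them per table and validate them automatically. The sketch below is a hypothetical policy record, not part of Iceberg or any catalog: `TablePolicy`, `policy_violations`, and the field names are assumptions chosen for illustration. It flags a table as not ready when it lacks an owner, targets a format version that is not yet completed, or has a reader engine with no passing interoperability test against the authoritative writer.

```python
from dataclasses import dataclass
from datetime import timedelta

COMPLETED_SPEC_VERSIONS = (1, 2, 3)  # v4 is still under development


@dataclass(frozen=True)
class TablePolicy:
    """Hypothetical platform-level contract for one Iceberg table."""
    table: str
    owner: str
    catalog: str
    write_authority: str  # the single engine or service allowed to commit
    reader_engines: tuple
    snapshot_retention: timedelta
    orphan_file_retention: timedelta
    format_version: int = 2


def policy_violations(policy, verified_pairs):
    """Return a list of problems; an empty list means the policy passes.

    verified_pairs is a set of (writer, reader) engine pairs that have a
    passing read/write interoperability test for this table.
    """
    problems = []
    if not policy.owner.strip():
        problems.append("table has no owner")
    if policy.format_version not in COMPLETED_SPEC_VERSIONS:
        problems.append(
            f"format version {policy.format_version} is not a completed spec version"
        )
    for reader in policy.reader_engines:
        if (policy.write_authority, reader) not in verified_pairs:
            problems.append(
                f"no verified interoperability test: {policy.write_authority} -> {reader}"
            )
    return problems


# Hypothetical table, engines, and catalog name.
policy = TablePolicy(
    table="sales.orders",
    owner="orders-team",
    catalog="rest-prod",
    write_authority="spark",
    reader_engines=("trino", "snowflake"),
    snapshot_retention=timedelta(days=7),
    orphan_file_retention=timedelta(days=3),
)
problems = policy_violations(policy, {("spark", "trino")})
print(problems)
```

Running a check like this in CI keeps the claim that a table is open tied to evidence from the engines that actually read it.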
Use managed table services where they reduce operational burden, but keep ownership of the portability contract. Validate which engine or catalog is authoritative for writes, which platform performs maintenance, how access policies are enforced across engines, and what feature subset is safe for production. Move workloads to Iceberg when data products require cross-engine use; avoid adopting it solely as a file-layout change without governance and maintenance automation.