

Despite Uniform and Apache XTable, your choice of Table Format still matters (Apache Iceberg, Apache Hudi, and Delta Lake)


🔗 Source: dev.to

The concept of a data lakehouse aims to operationalize data lakes so they can function as data warehouses, delivering data faster and more efficiently to data consumers. This is achieved by deconstructing the individual components of a data warehouse into modular, composable pieces: storage (object storage, Hadoop), storage format (Apache Parquet), table metadata format (Apache Iceberg, Apache Hudi, Delta Lake), catalog of data assets (Nessie, Polaris, Unity, Gravitino), and data processing (Dremio, Apache Spark, Apache Flink, Snowflake, Trino, StarRocks). One of the most consequential and foundational choices among these layers is the table format, which specifies how metadata is tracked for your tables, enabling ACID transactions, schema evolution, and time travel.
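To make these layers concrete, here is a minimal PySpark sketch of the table format layer in action, assuming the Apache Iceberg Spark runtime is on the classpath; the catalog name, warehouse path, and table are illustrative choices, not details from this article.

```python
from pyspark.sql import SparkSession

# Spark is the processing layer; Iceberg provides the table metadata layer
# over plain Parquet files in storage (a local path here for simplicity).
spark = (
    SparkSession.builder
    .appName("lakehouse-layers")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/warehouse")
    .getOrCreate()
)

# The table format provides ACID transactions and schema evolution...
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")

# ...and time travel over committed snapshots.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at LIMIT 1"
).first()[0]
spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {first_snapshot}").show()
```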

Efforts Toward Interoperability

Any data processing tool or library can, in principle, support all three table formats, but doing so requires significant development effort. This creates tension: implementing table operations across three formats divides the development resources of each data processing platform. Consequently, some tools may not support all formats, or may not support them equally, given limited engineering hours and competing platform needs. This poses a conundrum for those architecting their lakehouse: which table format provides the most certainty of support and ecosystem compatibility?

Recent developments have positioned Apache Iceberg as the clear frontrunner. However, Apache Hudi and Delta Lake also offer unique features that make them desirable for certain use cases. To address this challenge, two projects have emerged to facilitate table format interoperability.

  • Apache XTable: Originating from Onehouse, the company founded by the creators of Apache Hudi, Apache XTable provides true bidirectional metadata conversion among the three formats. This means you can convert:

    • Apache Iceberg -> Delta Lake
    • Apache Iceberg -> Apache Hudi
    • Apache Hudi -> Delta Lake
    • Apache Hudi -> Apache Iceberg
    • Delta Lake -> Apache Iceberg
    • Delta Lake -> Apache Hudi

This capability is extremely useful for translating metadata into formats that tools may not natively support. For example, you can stream data into Hudi but read it as Apache Iceberg with Dremio, a pattern sketched below.
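Here is a hedged sketch of that pattern, assuming the Hudi Spark bundle is on the classpath; the table name, record key, and paths are illustrative, and the XTable invocation appears only as a comment since it runs out of band.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xtable-pattern").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# 1. Write (or stream) the data as an Apache Hudi table.
(df.write.format("hudi")
   .option("hoodie.table.name", "events")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "val")
   .mode("append")
   .save("s3://my-bucket/lake/events"))  # hypothetical path

# 2. Run Apache XTable out of band to translate the Hudi metadata into
#    Iceberg metadata alongside the same Parquet files, e.g. (assumed CLI):
#    java -jar xtable-utilities-bundled.jar --datasetConfig config.yaml

# 3. Engines that speak Iceberg (Dremio, Spark, Trino, ...) can now read
#    the same data files through the generated Iceberg metadata.
iceberg_df = spark.read.format("iceberg").load("s3://my-bucket/lake/events")
```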

  • Delta Lake Uniform: Introduced with Delta Lake 3.0, the Uniform feature allows you to write primarily to Delta Lake and have that table's metadata synced into Apache Iceberg or Apache Hudi form, registered in Unity Catalog on Databricks or a Hive catalog in open-source deployments, as sketched below.
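As a rough sketch, enabling Uniform is a table property set at creation time. The property names below are the ones documented for Delta 3.x, but the table itself is illustrative, and you should check the Delta docs for your version's prerequisites.

```python
# Assumes a Spark session configured with the Delta Lake extensions.
# Uniform also requires column mapping, and early versions required
# deletion vectors to be disabled (a trade-off discussed below).
spark.sql("""
    CREATE TABLE sales (id BIGINT, amount DOUBLE)
    USING delta
    TBLPROPERTIES (
        'delta.universalFormat.enabledFormats' = 'iceberg',
        'delta.columnMapping.mode' = 'name',
        'delta.enableDeletionVectors' = 'false'
    )
""")
```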

While both tools enable the use of the full "read ecosystem," they do not address the significant distinctions between these formats on the write side. Therefore, your initial choice of table format remains crucial.

What You Lose On The Write Side

Each format's metadata includes mechanisms to provide file-level statistics for file skipping, resulting in faster queries. However, each format's metadata structure is optimized for the specific way it expects the underlying Parquet data files to be organized and written. This optimization is lost when converting metadata from one format to another.
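For a concrete look at such statistics, Iceberg exposes them through metadata tables; this sketch assumes the demo.db.events table from the first example. Hudi and Delta Lake keep comparable statistics in their own metadata structures.

```python
# Per-file row counts and column value bounds are what drive file
# skipping at query planning time.
spark.sql("""
    SELECT file_path, record_count, lower_bounds, upper_bounds
    FROM demo.db.events.files
""").show(truncate=False)
```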

Iceberg Partitioning

In Apache Iceberg, a feature called hidden partitioning reduces the complexity of partitioning. This feature cannot be used if the table is written in another format and converted, meaning extra partition columns may need to be stored in your data files, and your analysts may need extra predicates in their queries. Apache Iceberg's partition evolution is another feature you would miss out on by writing in another format.
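A minimal sketch of both features, assuming the Iceberg-configured Spark session from the first example (the SQL extensions are required for partition evolution); the table is illustrative.

```python
# Hidden partitioning: the partition is derived from a transform on an
# existing column, so no extra partition column is stored in the files.
spark.sql("""
    CREATE TABLE demo.db.logs (id BIGINT, ts TIMESTAMP, msg STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Analysts filter on the natural column; Iceberg maps the predicate onto
# the day partitions for pruning, with no extra predicate required.
spark.sql("SELECT * FROM demo.db.logs WHERE ts >= TIMESTAMP '2024-01-01 00:00:00'")

# Partition evolution: change the spec without rewriting existing data
# ('ts_day' is the default name Iceberg gives the days(ts) field).
spark.sql("ALTER TABLE demo.db.logs REPLACE PARTITION FIELD ts_day WITH hours(ts)")
```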

Hudi Delta Streamer

In Apache Hudi, an ingestion utility called the "DeltaStreamer" can run table services such as compaction, clustering, and cleaning as data is streamed into Hudi tables. This is one of the primary reasons to use Hudi. If you write in another format and then translate to Hudi metadata, you lose the benefits of this feature.
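For contrast, here is a hedged sketch of the kind of inline table services a plain Hudi write can run, using standard Hudi option keys; the session, table name, and path are illustrative, and the DeltaStreamer utility itself is a separate spark-submit application rather than this datasource API.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is on the classpath.
spark = SparkSession.builder.appName("hudi-write-services").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])

(df.write.format("hudi")
   .option("hoodie.table.name", "events")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "val")
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
   # Run compaction inline after a number of delta commits...
   .option("hoodie.compact.inline", "true")
   .option("hoodie.compact.inline.max.delta.commits", "5")
   # ...and clean up old file versions automatically.
   .option("hoodie.clean.automatic", "true")
   .mode("append")
   .save("s3://my-bucket/lake/events"))  # hypothetical path
```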

Delta Lake Uniform

When using Delta Uniform, not only do you lose the write-side benefits of the other formats, but you also lose some of the benefits of using Delta Lake itself, including features like deletion vectors (Delta's version of merge-on-read). Additionally, the converted metadata is written asynchronously, and multiple transactions are sometimes batched into a single conversion, so the destination format does not provide a true mirror of the table's history for time travel.

While these trade-offs may become less significant in the future, interoperability today means giving up many of the benefits of choosing a format and fully leveraging its data management features. These tools are great for bridging gaps in the read ecosystem, but write-time data management is where each format differs most in approach and value proposition. Each format makes slightly different trade-offs that are, in some ways, mutually exclusive.

Conclusion

We can all agree that efforts to build robust interoperability are worthwhile and beneficial, but they don't address some of the fundamental trade-offs between the table formats. And this doesn't even take into account catalog-level benefits, such as the data versioning that catalogs like Nessie provide for Iceberg tables, which enables zero-copy environments and multi-table transactions.
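As a rough sketch of what that looks like with the Nessie Spark SQL extensions (assuming they are enabled on an Iceberg session with a catalog named "nessie"; the branch name is illustrative):

```python
# A branch is a zero-copy environment: it references the same data files
# until changes are committed to it.
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")

# ...modify one or more tables on the branch, then merge the changes back
# atomically, giving a multi-table transaction.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```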

As the conversation begins to shift towards data lakehouse catalogs, driven by the continued standardization of the open lakehouse stack, the discussion about table formats is far from over. The nuances and trade-offs inherent in each format still play a crucial role in the architecture and performance of data lakehouses.

...
