The amount of data produced today is staggering. Statista's research department estimates that global data creation will reach more than 180 zettabytes by 2025. Companies, both large and small, collect an ever-growing number of data points. This is aided by the rise of distributed file stores such as the Hadoop Distributed File System (HDFS), as well as cloud storage solutions like AWS Simple Storage Service (S3), Azure Blob Storage, and Google Cloud Storage (GCS). These object stores provide a simple way of storing and retrieving vast amounts of information at a relatively low price point and with little operational overhead. However, collecting data is just the first step. Apache Hive was initially released in 2010 to provide a SQL interface on top of raw data files.

Apache Iceberg's write path starts with two high-level data writer abstractions, used when the partitioning information is, respectively, missing and present. Neither performs the physical row writing, though. Instead, they act as wrappers adding optional partitioning logic, and the physical writing is delegated to a specialized data writer.

The FanoutDataWriter is a specialization of the PartitionedDataWriter, and its usage is controlled by the flag explained before. When the fanout remains disabled, the Apache Iceberg partitioning writer uses the ClusteredDataWriter instance. This writer is more efficient than the FanoutDataWriter in terms of memory usage since it only keeps the writer for one partition open at a time. On the other hand, it's more demanding in CPU since it requires the data to be sorted before writing. Both strategies are sketched at the end of this post.

At the physical level, a rolling writer creates a new data file whenever the buffered records reach the targeted threshold from write.target-file-size-bytes (also sketched below). It's used for all data formats but ORC when the table is not partitioned, and it also serves as the physical writing layer in the FanoutDataWriter. A dedicated writer is used when the table format is set to ORC. Since there is no 1-1 relationship between the high-level and low-level writers, I prepared a summary.

Metadata delete

Apart from the writing classes, another point from the documentation that caught my attention was the DELETE operation, and more precisely, its version working on whole partitions. Apache Iceberg comes with a set of logical rules, and one of them is RewriteDelete. Why is it so interesting? Because Apache Iceberg can handle such a delete without overwriting the data files; it only rewrites the metadata.
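To make the metadata delete concrete, here is a minimal sketch using Spark SQL from Java. The catalog name, table, warehouse path, and partition column are all hypothetical; the assumption is that the DELETE predicate covers whole partitions, which is what lets Iceberg drop the matching data files from the new snapshot's metadata instead of rewriting them.

```java
import org.apache.spark.sql.SparkSession;

public class MetadataDeleteExample {
    public static void main(String[] args) {
        // Hypothetical local Hadoop catalog; adjust names and paths to your setup.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-metadata-delete")
            .master("local[*]")
            .config("spark.sql.extensions",
                    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.demo.type", "hadoop")
            .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
            .getOrCreate();

        // Hypothetical table partitioned by event_date.
        spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events " +
                  "(id BIGINT, event_date DATE) USING iceberg PARTITIONED BY (event_date)");

        // The predicate aligns with partition boundaries, so Iceberg can satisfy
        // the DELETE by removing whole data files from the table metadata rather
        // than rewriting their contents.
        spark.sql("DELETE FROM demo.db.events WHERE event_date = DATE '2023-01-01'");

        spark.stop();
    }
}
```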
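Going back to the writer hierarchy, the clustered-versus-fanout trade-off can be modeled with a short sketch. The classes below are illustrative stand-ins, not the real org.apache.iceberg.io ClusteredDataWriter and FanoutDataWriter: the fanout variant keeps one open appender per partition, while the clustered variant assumes partition-grouped input and keeps a single appender open at a time.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: it models the partition-handling strategies of the
// Iceberg writers, not their actual API.
public class WriterStrategies {

    interface Appender {
        void add(String row);
        void close();
    }

    static Appender newAppender(String partition) {
        return new Appender() {
            public void add(String row) { System.out.println(partition + " <- " + row); }
            public void close() { System.out.println(partition + " closed"); }
        };
    }

    // Fanout style: one appender per partition stays open, so unsorted input
    // is fine, at the price of memory proportional to the partition count.
    static class FanoutStyle {
        private final Map<String, Appender> open = new HashMap<>();

        void write(String partition, String row) {
            open.computeIfAbsent(partition, WriterStrategies::newAppender).add(row);
        }

        void close() { open.values().forEach(Appender::close); }
    }

    // Clustered style: expects rows grouped by partition; only the current
    // partition's appender is open, which saves memory but demands the CPU
    // cost of sorting the data upfront.
    static class ClusteredStyle {
        private String current;
        private Appender appender;

        void write(String partition, String row) {
            if (!partition.equals(current)) {
                if (appender != null) appender.close();
                current = partition;
                appender = newAppender(partition);
            }
            appender.add(row);
        }

        void close() { if (appender != null) appender.close(); }
    }

    public static void main(String[] args) {
        FanoutStyle fanout = new FanoutStyle();
        fanout.write("p=1", "a");
        fanout.write("p=2", "b");
        fanout.write("p=1", "c"); // revisiting a partition is fine here
        fanout.close();

        ClusteredStyle clustered = new ClusteredStyle();
        clustered.write("p=1", "a");
        clustered.write("p=1", "c");
        clustered.write("p=2", "b"); // input must already be grouped by partition
        clustered.close();
    }
}
```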
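The rolling behavior driven by write.target-file-size-bytes can be sketched in the same spirit: track the bytes written to the current file and open a fresh one once the target is crossed. Again, this is a simplified byte-based model, not the actual Iceberg rolling writer.

```java
// Simplified model of a rolling writer: only the idea behind
// write.target-file-size-bytes, not the real Iceberg class.
public class RollingWriterSketch {
    private final long targetFileSizeBytes;
    private long bytesInCurrentFile;
    private int fileIndex;

    public RollingWriterSketch(long targetFileSizeBytes) {
        this.targetFileSizeBytes = targetFileSizeBytes;
        openNewFile();
    }

    private void openNewFile() {
        fileIndex++;
        bytesInCurrentFile = 0;
        System.out.println("opened data file #" + fileIndex);
    }

    public void write(byte[] row) {
        // Roll over to a fresh data file once the current one reaches the target.
        if (bytesInCurrentFile > 0 && bytesInCurrentFile + row.length > targetFileSizeBytes) {
            System.out.println("closed data file #" + fileIndex);
            openNewFile();
        }
        bytesInCurrentFile += row.length;
    }

    public static void main(String[] args) {
        RollingWriterSketch writer = new RollingWriterSketch(1024); // 1 KiB target for the demo
        for (int i = 0; i < 100; i++) {
            writer.write(new byte[64]); // each "row" is 64 bytes here
        }
    }
}
```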