Data Engineering Annotated Monthly – April 2022

Long time no see! Sorry about the silence, but luckily we’re back.

Hi, I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on Twitter and suggest a topic, link, or anything else you want to see. And please feel free to subscribe to this newsletter to get it in your email inbox every month.

News

A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

Airflow 2.3.0 – This popular orchestrator got a new release. Some say it’s “almost 3.0”, and yes, it does bring a lot of changes. Take the new dynamic tasks, for example. Based on the “map-reduce” paradigm, they allow a DAG to create a variable number of tasks at run time, depending on the output of upstream tasks – a very useful feature, which incidentally has been available in Luigi for a while. Additionally, the Tree view has been replaced by the Grid view, which, in my opinion, is much more informative.
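To make the map-reduce idea concrete, here is a minimal sketch in plain Python. It is not Airflow code: it only imitates the pattern, where one upstream step returns a list whose length is known only at run time, one "mapped" task instance runs per element, and a final step reduces the results. The names list_files, count_rows, total, and run_mapped are hypothetical and exist only for illustration. In Airflow 2.3 itself, the fan-out is expressed by calling expand() on a task.

```python
from typing import Callable, Iterable, List


def list_files() -> List[str]:
    # Hypothetical upstream task: the number of items is only known at run time.
    return ["a.csv", "b.csv", "c.csv"]


def count_rows(path: str) -> int:
    # Hypothetical mapped task: the name length stands in for real per-file work.
    return len(path)


def total(counts: Iterable[int]) -> int:
    # Hypothetical reduce task: combines the results of all mapped instances.
    return sum(counts)


def run_mapped(task: Callable[[str], int], inputs: Iterable[str]) -> List[int]:
    # Fan out: one task instance per element produced upstream.
    return [task(item) for item in inputs]


files = list_files()
counts = run_mapped(count_rows, files)
result = total(counts)
```

The point of the pattern is that nothing in the "DAG definition" (the last three lines) hard-codes how many files there are; the fan-out width follows the upstream output.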

Apache Hudi 0.11.0 – This release of the well-known data lake platform brings many interesting changes. First, they’ve implemented asynchronous indexing. Second, they’ve significantly improved Spark integration. Third, Google BigQuery now has support for Hudi as an external source. I could go on and on. Now’s a good time to update your Hudi!

YuniKorn 1.0.0 – If you’ve been anxiously waiting for Kubernetes to come to data engineering, your wishes have been granted. A top-level ASF project, YuniKorn 1.0 is a scheduler targeting big data and ML workflows, and of course, it is cloud-native.

Kyuubi 1.5.1 – Kyuubi is a JDBC server built over Apache Spark, but as of version 1.5.0, it supports two more SQL engines, Flink and Trino/Presto. The team has also added the ability to run Scala for the SparkSQL engine.

Apache Pulsar 2.10.0 – No fewer than 14 PIPs (Pulsar Improvement Proposals) were implemented in this version! Notably, cluster failover is now supported on the client side. Read more about Pulsar 2.10.0 here.

RocketMQ Streams 1.0.1 preview – I’ve mentioned RocketMQ before, in the November Annotated, but here’s a good reason to write about it again. Virtually every technology seems to be adding some kind of streaming API these days. Kafka was the first, and soon enough, everybody was trying to grab their own share of the market. In the case of RocketMQ, their attempt is very interesting because, unlike Kafka and Pulsar, RocketMQ is closer to traditional MQs like ActiveMQ (which isn’t really surprising, seeing how it’s based on ActiveMQ).

Flink 1.15.0 – What I like about this release of Flink, a top framework for streaming data processing, is that it comes with quality documentation. The docs clarify the semantics of checkpoints and savepoints, making them much easier to understand. The release isn’t short on technical improvements, either, such as elastic scaling with reactive mode and an adaptive scheduler.

Future improvements

Data engineering tools are evolving every day. This section covers updates that are still in the works and that you may want to keep an eye on.

Kafka: Shareable State Stores – This improvement in Kafka looks very interesting. The authors promise that, under certain conditions, it will be possible to share data between topics without copying it between nodes. However, this improvement depends on the implementation of tiered storage support, which hasn’t yet landed.

Kafka: Add support for different unix precisions in TimestampConverter SMT – Have you ever been in a situation where the timestamp is of type Long and you can’t understand what it represents? This is an inherent issue with the Unix timestamp. Some systems think that it should be in milliseconds, and some think that it should be in seconds. This KIP promises to resolve the issue by making the TimestampConverter class support the precision of Unix timestamps.
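The ambiguity is easy to reproduce. The sketch below is plain Python, not Kafka Connect code: it reads the same Long value under two assumed precisions and gets dates more than fifty years apart. The to_datetime helper and the precision names it accepts are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone

# How many units of each assumed precision make up one second.
UNITS_PER_SECOND = {
    "seconds": 1,
    "milliseconds": 1_000,
    "microseconds": 1_000_000,
}


def to_datetime(raw: int, precision: str) -> datetime:
    # Hypothetical helper: interpret a Unix timestamp at the given precision.
    per_second = UNITS_PER_SECOND[precision]
    seconds, remainder = divmod(raw, per_second)
    # Convert the sub-second remainder to microseconds for datetime.
    micros = remainder * 1_000_000 // per_second
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


raw = 1_650_000_000  # the same Long value, read two ways
as_seconds = to_datetime(raw, "seconds")      # a date in April 2022
as_millis = to_datetime(raw, "milliseconds")  # a date in January 1970
```

Without knowing the producer's convention, both readings are valid, which is why an explicit precision setting on the converter is helpful.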

Flink: Introduce Flink Kubernetes Operator – Since Kubernetes is dominating virtually everywhere, other data engineering tools are having to catch up and introduce k8s integration. It’s true that there is already a k8s scheduler for data engineering, YuniKorn, but some would prefer to run Flink ad hoc, and that requires Flink to provide its own k8s operator.

Spark: Add support for forwarding Spark History requests to a live running driver when present – While a Spark job is running, we can see it on a Spark history server, but that doesn’t provide us with full and up-to-date information. This enhancement, which has already been implemented, automatically redirects us from a history server to the live driver, where we can find the complete information. Neat!

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

Analyzing the Panama Papers With Neo4j: Data Models, Queries, and More – Graph databases are extremely useful, but few of us have a lot of experience with them. Most of us have some difficulty identifying whether problems could be better solved with the help of a graph database. In addition, typical examples of using graph databases are oversimplified. That is why this example from the creators of Neo4j was so insightful for me. It provides a great view into the different aspects of using graph databases in general, and it covers the specifics of Neo4j in detail, as well.

Architecture for High-Throughput Low-Latency Big Data Pipeline on Cloud – The title of the article speaks for itself. The premise might sound familiar, but this isn’t some boring repetition of what we all know already. There’s at least one interesting twist that goes like this: “A data pipeline has five stages grouped into three heads.” If that sounds intriguing, read the article to find out more.

Corrections in data lakehouse table format comparisons – Quasi-mutable (a.k.a. data lake) formats are improving almost at the speed of thought. Sooner or later, we data engineers will have to choose which one to make our standard! This live document is a big and growing set of corrections to the original and very well-known comparison by Dremio. Check back often if you want to keep up with what’s new in the world of Hudi, Iceberg, and DeltaLake.


That wraps up April’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!
