Data Engineering Annotated Monthly – May 2022

It’s the start of June. That means it’s time to start taking summer vacations and enjoying some fresh juice alongside your fresh news! Hi, I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter here.

News

A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

DataHub 0.8.36 – Metadata management is a big and complicated topic. There are several solutions. Some of them are free, some of them are paid, but none of them are particularly easy to use. I’ve had some experience with Apache Atlas, and even with the help of my colleagues, I wasn’t able to make it do what I wanted it to. On top of that, it’s a part of the Hadoop platform, which created additional work that we otherwise would not have had to do. DataHub is a completely independent product by LinkedIn, and the folks there definitely know what metadata is and how important it is. If you haven’t found your perfect metadata management system just yet, maybe it’s time to try DataHub! This new release brings exciting features like support for Apache Iceberg!

Feathr 0.4.0 – This feature store by LinkedIn is developing quickly. I know that many companies have not been able to find a suitable feature store on the market and have had to write their own. This task is not easy, and it takes a very long time and significant engineering resources to do properly. Meanwhile, it looks like LinkedIn has the necessary resources and is even ready to open up its solution to external contributors! The most notable change in the latest release is support for streaming, which means you can now ingest data from streaming sources.

Pulsar Manager 0.3.0 – Lots of enterprise systems lack a nice management interface. They need to be configured with configuration files or via the command line. I am an old-school guy. I adore the command line, vim, and so on, but I also understand that sometimes configuration is such a complex task that it would really be easier to do it once with a UI and then not have to think about it again. Apache Pulsar takes a step in this direction and adds an official management UI! In this release, there are some improvements to the dashboard, as well as several bug fixes.

BookKeeper 4.15.0 – And while we’re on the subject of Pulsar, we should not forget to mention the engine behind Pulsar: BookKeeper. BookKeeper is usually perceived as exclusively a backend for Pulsar, but the truth is that nothing can stop you from using it in your own systems. BookKeeper’s team presents it as a “fault-tolerant and low-latency storage service optimized for append-only workloads”, so if you need to store something in a distributed manner, you may not need a traditional database. Perhaps BookKeeper would suit your needs better! In the latest version, “BP-46: Running without a journal” has been implemented, along with several other features.

Impala 4.1.0 – While almost all data engineering SQL query engines are written in JVM languages, Impala is written in C++. This means that the Impala authors had to go above and beyond to integrate it with different Java/Python-oriented systems. And yet it is still compatible with different clouds, storage formats (including Kudu, Ozone, and many others), and storage engines. It shouldn’t come as a surprise that Cloudera managed to achieve this, as they know how to create on-premises data engineering products. I don’t know how this happened, but at the time of writing there is not even an official changelog yet. However, you can find a diff with the 4.0.0 version on GitHub.

RocksDB 7.2.2 – We often forget that certain data engineering products only work so well because they have other powerful tools under the hood. For proof of this, look no further than systems like Flink and Camunda, which rely on RocksDB. RocksDB is a storage engine written as a C++ library. It exposes a key/value interface in which keys and values are arbitrary byte streams. It can store data virtually everywhere, for example in memory or on any kind of permanent storage device. And yes, it pays attention to correctness and efficiency when storing data.
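To make the idea of a byte-oriented key/value interface a bit more concrete, here is a minimal in-memory sketch in Python. It is not the RocksDB API: the class ToyByteStore and its methods put, get, delete, and scan are hypothetical names made up for illustration. The only things it tries to show are that keys and values are plain bytes and that keys can be iterated in sorted byte order.

```python
import bisect


class ToyByteStore:
    """Hypothetical in-memory stand-in for a byte-oriented key/value store."""

    def __init__(self):
        self._keys = []    # keys kept in sorted (bytewise) order
        self._values = {}  # key -> value, both arbitrary bytes

    def put(self, key: bytes, value: bytes) -> None:
        # Insert the key into the sorted list only the first time we see it.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        # Returns None when the key is absent.
        return self._values.get(key)

    def delete(self, key: bytes) -> None:
        if key in self._values:
            del self._values[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def scan(self, start: bytes = b""):
        # Yield (key, value) pairs in key order, beginning at `start`.
        first = bisect.bisect_left(self._keys, start)
        for key in self._keys[first:]:
            yield key, self._values[key]


store = ToyByteStore()
store.put(b"user:2", b"\x00\x01")  # the value does not have to be text
store.put(b"user:1", b"alice")
for key, value in store.scan(b"user:"):
    print(key, value)
```

Because the store only ever sees bytes, the application decides how to serialize its own keys and values, which is part of what makes an engine like this easy to embed in very different systems.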

Future improvements

Data engineering tools are evolving every day. This section is about updates that are in the works for technologies and that you may want to keep an eye on.

Kafka: Mark KRaft as Production Ready – One of the most interesting changes to Kafka in recent years is that it now works without ZooKeeper. This is possible thanks to the implementation of KRaft, a Raft-based consensus protocol designed specifically for the needs of Kafka. The goal of this Kafka Improvement Proposal is to declare KRaft production-ready, which should make supporting and operating Kafka clusters much easier.

Flink: Support Advanced Function DDL – SQL query engines like Hive and Spark have supported external functions in SQL for quite some time. This allows developers and data engineers to enrich traditional SQL with their own extensions, which can be useful when you need to perform business-specific operations inside a regular query. Hopefully with the implementation of this Flink Improvement Proposal, Flink will support them too.

Spark: Use Parquet in predicate for Spark In filter – Though it is usually hidden behind the scenes, one of the most popular storage formats – Parquet – is evolving too. At this point in time, filters have been implemented on the storage level in Parquet, and Spark needs to catch up by adding support for native filtering. This improvement can make our queries dramatically faster in some cases!
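Why can storage-level filtering help so much? A rough sketch of the general idea, written in Python: a columnar file keeps minimum and maximum statistics for each chunk of rows, so a reader evaluating an IN filter can skip every chunk whose value range cannot contain any of the requested values. Everything here (RowGroup, might_contain, scan_in) is a hypothetical name for illustration, and the real Parquet and Spark implementations may work differently in the details.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical chunk of rows with min/max statistics for one column."""
    min_value: int
    max_value: int
    rows: list


def might_contain(group: RowGroup, wanted: set) -> bool:
    # A group can only match if some wanted value lies inside its range.
    return any(group.min_value <= v <= group.max_value for v in wanted)


def scan_in(groups, wanted):
    """Return matching rows and the number of groups that had to be read."""
    wanted = set(wanted)
    matches = []
    groups_read = 0
    for group in groups:
        if not might_contain(group, wanted):
            continue  # skipped using statistics alone, no rows touched
        groups_read += 1
        matches.extend(row for row in group.rows if row in wanted)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_in(groups, [5, 25])
print(matches, groups_read)
```

In this toy example, the middle group is never read, because neither 5 nor 25 falls between 11 and 20. The savings grow with the number of chunks that can be ruled out this way.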

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

RocksDB Is Eating the Database World – Continuing on the topic of RocksDB, here is an older, but still very interesting, article on what RocksDB is and how it works. It also provides some insight into why its popularity is growing rapidly.

Replicated Log – Here’s a relatively long and detailed article about replicated logs. A replicated log is a way to synchronize data among nodes in a distributed system. There are multiple ways to implement a replicated log, and most of them are somehow related to what are called consensus protocols, for example, Paxos and Raft.
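As a very rough illustration of the idea, here is a toy replicated log in Python. It assumes a single fixed leader that treats an entry as committed once a majority of the cluster has stored it, in the spirit of Raft-like protocols. The classes Follower and Leader are hypothetical. There is no networking, no leader election, and no catch-up for followers that fall behind, all of which a real consensus protocol has to handle.

```python
class Follower:
    """Hypothetical follower that stores entries in order."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable  # simulates a node being up or down
        self.log = []

    def append(self, index, entry):
        # Accept the entry only if it extends the local log contiguously.
        if not self.reachable or index != len(self.log):
            return False
        self.log.append(entry)
        return True


class Leader:
    """Hypothetical leader that commits an entry on majority acknowledgement."""

    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.commit_index = -1  # index of the last committed entry

    def replicate(self, entry):
        index = len(self.log)
        self.log.append(entry)
        acks = 1  # the leader counts its own copy
        for follower in self.followers:
            if follower.append(index, entry):
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            self.commit_index = index
            return True
        return False  # stored locally, but not committed


a = Follower("a")
b = Follower("b", reachable=False)
leader = Leader([a, b])
print(leader.replicate("x=1"))  # two of three nodes have the entry
a.reachable = False
print(leader.replicate("x=2"))  # only the leader has the entry
```

The first call succeeds because two of the three nodes hold the entry. The second one does not, because only the leader holds it. Deciding what happens to such uncommitted entries is exactly where protocols like Paxos and Raft come in.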

Events

Current 2022: The Next Generation of Kafka Summit – The most popular Kafka-related conference is organized by Confluent, one of Kafka’s main maintainers. Of course, the main topic is data streaming.

Big Data Event: London – Thousands of attendees are expected to participate in this big data event in London. They’ve already booked a large number of speakers from a wide range of companies, including the widely known Aerospike, StackOverflow, and Snowflake.

That wraps up May’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!
