Data Engineering Annotated Monthly – January 2022

Due to the public holidays in Russia and my own vacation time, I didn’t get a chance to write an Annotated for December. Waiting a little longer might not be such a bad thing in this case, because now we have even more interesting releases to talk about! Hi, I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering sector and highlight new ideas from the wider community. If you think I missed something worthwhile, you can find me on Twitter and suggest a topic, link, or anything else you want to see. If you would prefer to receive this news in email form, you can subscribe to the newsletter here.

News

Learning new things and keeping a finger on the pulse of new technologies are major aspects of engineering. Here’s what’s happening in the world of data engineering right now.

Ambari is dead — This came as quite a shock to me, and it looks like free distributions of Hadoop do not exist anymore. It is almost impossible to set up a production-grade Hadoop without managers like Ambari. Theoretically, all of the components may be available, but the setup process is just a pain. The one remaining free tool I’m aware of is Arenadata Cluster Manager, but the free version doesn’t allow the user to do certain things, like deploy HA name nodes. R.I.P. Ambari – we love you.

Apache Hop 1.1 – The number of no-code tools is snowballing. We all know Apache NiFi, a stream processing tool with its own processing engine and a web interface for building the pipelines you need. Apache Hop is different in many ways. For one, it can use Apache Beam as an engine. Furthermore, its interface is not web-based but a desktop application written in Java (though with a native look and feel). When the workflow is ready, it is deployed to a dedicated Hop server and executed there.

DolphinScheduler 2.0.3 – Apache DolphinScheduler is described on its own website as a “distributed and easy-to-extend visual workflow scheduler system.” It is another example of an orchestrator, this time written in Java. One nice feature is that it can be deployed to Kubernetes out of the box from the Bitnami Helm repository. It also supports a different set of connectors compared to Airflow. More info is available in their documentation.

SeaTunnel 1.5.7 – This is a tiny release, but I wanted to introduce you to one more new tool: SeaTunnel (formerly “WaterDrop”). Honestly, I was unaware of it and I wish I’d heard about it sooner. SeaTunnel addresses one of the pain points our plugin solves: data synchronization between different sources. While our plugin does this in the UI right inside the IDE, SeaTunnel works on a different scale and gives users a way to describe the synchronization configuration declaratively.

Future improvements

Data engineering tools are evolving every day. This section is about technological updates that are in the works and that you may want to keep an eye on.

Kafka: Add range and scan query over kv-store in IQv2 – The name of this KIP speaks for itself. Currently, key-value stores do not support range queries through IQv2. Demand was high enough and the implementation simple enough that the PR has already been accepted, and it will hopefully be released soon (the current target release is 3.2.0).
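
To make the difference between a point lookup, a range query, and a full scan concrete, here is a minimal toy sketch in Python. It is not the Kafka Streams API: the `SortedKVStore` class and its methods are hypothetical names invented for illustration, and the inclusive bounds are an assumption of this sketch rather than a statement about the KIP.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, Iterator, List, Optional, Tuple


class SortedKVStore:
    """Toy in-memory key-value store with sorted keys (illustration only)."""

    def __init__(self) -> None:
        self._keys: List[str] = []        # keys kept in sorted order
        self._values: Dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        # Insert a new key at its sorted position; overwrite existing values.
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def get(self, key: str) -> Optional[int]:
        # Point lookup: exactly one key.
        return self._values.get(key)

    def range(
        self, lower: Optional[str] = None, upper: Optional[str] = None
    ) -> Iterator[Tuple[str, int]]:
        # Range query: all keys between lower and upper (both inclusive here).
        # A missing bound means "unbounded on that side".
        start = 0 if lower is None else bisect_left(self._keys, lower)
        end = len(self._keys) if upper is None else bisect_right(self._keys, upper)
        for key in self._keys[start:end]:
            yield key, self._values[key]

    def scan(self) -> Iterator[Tuple[str, int]]:
        # Scan: a range query with no bounds at all.
        return self.range()


store = SortedKVStore()
for k, v in [("b", 2), ("d", 4), ("a", 1), ("c", 3)]:
    store.put(k, v)

in_range = list(store.range("b", "c"))
everything = list(store.scan())
```

In this toy, a scan is just a range query without bounds, which is presumably why the two features fit naturally into a single proposal.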

Kafka: Add session and window query over kv-store in IQv2 – A complement to the previous KIP, but this time it’s about session and window queries. This pair of KIPs gives us an idea of the direction the Kafka developers are taking the KV store: they’re aiming to make it more analytics-friendly.

Flink: Incremental savepoints – The current Flink savepoint mechanism has been proven to work, but it is slow when the state is big. This change is intended to improve the situation. The Flink Improvement Proposal page says it best: “It will be possible to request each savepoint independently (via CLI) to be either in the canonical format (the current behavior) or the native format. When native format is selected, the state.backend.incremental setting will decide the type of native format snapshot and will take effect for both checkpoints and savepoints (with native type).”

Spark: Ability to turn off auto commit in JDBC source for read only operations – In read-only operations, Spark may currently read a huge amount of data in a single request, even if the fetch size is limited, because some JDBC drivers ignore the fetch size while autocommit is enabled. For example, this is the case for PostgreSQL, and this behavior is even described in the docs. This change will add something like an autocommit flag to the JDBC source.
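
For context, this is roughly how a fetch size is passed to Spark's JDBC source today. The sketch below only builds the options; `url`, `dbtable`, and `fetchsize` are existing Spark JDBC option names, while the helper `jdbc_read_options`, the connection URL, and the table name are hypothetical. The proposed autocommit option is deliberately left out, because its final name is not given here.

```python
from typing import Dict


def jdbc_read_options(url: str, table: str, fetch_size: int) -> Dict[str, str]:
    """Build options for Spark's JDBC source (hypothetical helper)."""
    if fetch_size <= 0:
        # This helper insists on an explicit, positive fetch size.
        raise ValueError("fetch_size must be positive")
    return {
        "url": url,
        "dbtable": table,
        # Hint for how many rows the driver should fetch per round trip.
        "fetchsize": str(fetch_size),
    }


options = jdbc_read_options(
    url="jdbc:postgresql://db.example.com:5432/shop",  # hypothetical URL
    table="public.orders",                             # hypothetical table
    fetch_size=10_000,
)

# With a SparkSession available as `spark`, the read would look like:
#   df = spark.read.format("jdbc").options(**options).load()
```

Even with `fetchsize` set like this, the PostgreSQL driver only fetches in batches when autocommit is off, which is the gap the proposed change is meant to close.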

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

How I started out with dbt® – For some time now, I’ve noticed that dbt® is gaining popularity. I’ve been seeing more questions and more success stories, so a couple of days ago I decided to try it out. In this blog post I describe what dbt is and how it can be used while providing readers with several examples of usage.

Apache Spark Performance Boosting – Spark’s performance is one of the hottest topics in the data engineering community. Not because its performance is bad, but because the tool is extremely popular and there are always lots of corner cases. This post is essentially a checklist of what can be done to potentially improve performance in each application.

7 Must-Know Data Buzzwords in 2022 – It’s important to know the trends. Doing so allows us to maintain – or even increase – our value as experts in the job market. The simplest way to understand trends is to read articles dedicated to them. Here is an article about 7 trends with links to relevant posts!

HelloFresh Journey to the Data Mesh — Data Mesh is another buzzword! It’s not an easy thing to do, especially when you’re a big company like HelloFresh. It was a lot of work and a long journey to adopt data mesh, and now they’re sharing their experience.

Podcast

DEbrief — Recently, my friend, Dr. Igor Mosyagin, and I started a podcast called “DEbrief”. It overlaps with this digest to some extent, but always offers a little extra something of its own, too. So if you prefer listening to reading, check it out! It may be perfect for you!

That wraps up January’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!
