Data Engineering Annotated Monthly – November 2021

The holiday season is almost upon us! And what better time than the holidays to catch up on the latest news and read about other interesting topics? Hi, I’m Pasha Finkelshteyn, and I’ll be your guide today through this month’s installment of the Data Engineering Annotated Monthly. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something noteworthy, you can catch me on Twitter and suggest a topic, link, or anything else you want to see. Also, if you would prefer to get this as an email, you can subscribe to the newsletter here.

News

A lot of what we do in engineering involves learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

Apache Arrow 6.0.1 – Apache Arrow presents itself as a cross-language development platform for in-memory analytics. You probably already know that if you’re doing data engineering in Python, and now Go developers have a reason to look at it too: the 6.0 release of Apache Arrow brings much better support for the Go language! It’s an exciting time right now, as tons of features are being implemented in the C++ library, which several of the other language bindings build on. That means more and more languages are coming to data engineering!

Apache Geode – Was anyone even thinking about data engineering 19 years ago, back when Apache Geode first came on the scene? Well, I know I definitely wasn’t. And while 1.12.5 is not a huge release, just a regular update to the 1.12 branch, I want to take this opportunity to tell you a bit about what Apache Geode is. It’s basically a distributed cache with transaction support, which is somewhat surprising in the data engineering world! Nevertheless, caches have their uses. For example, they can be used to store data for further processing. Typically we use tools like Redis to cache things. But Redis clustering leaves room for improvement, while Geode is built as a distributed-first transactional solution.

Apache Pinot 0.9.0 is a real-time distributed OLAP datastore, designed to answer OLAP queries with low latency. It’s developed by LinkedIn, which means it has very tight integrations with other LinkedIn tools, like Apache Kafka! This release brings 2 big features: Segment Merge and Rollup, both of which can be used for better (i.e. more beautiful) aggregation on tree-like structures (for example, cities in states in countries).

Apache RocketMQ released version 4.9.2. It’s just a patch release, but I’d still like to say a few words about what RocketMQ is. RocketMQ is a message queue broker that Alibaba created after outgrowing its Apache ActiveMQ-based messaging system. In contrast to ActiveMQ, RocketMQ can form a cluster. This means you can scale it horizontally. It also supports transactions, which may be important if preserving messages is crucial for you. And, unlike Kafka, it doesn’t need ZooKeeper and it supports message scheduling!

NATS 2.6.5 is another mostly-patch release of a message broker for us to include in this month’s Annotated. NATS is interesting because for a long time its main goal was performance and availability, but since version 2.2 it has included JetStream, which is actually a persistent message queue system inside NATS. But at its core, it is still the same highly available, self-healing cluster of NATS servers.

Future improvements

Data engineering technologies are evolving every day. This section is about updates for technologies that you may want to keep your eye on.

Pulsar: PIP 106: Negative acknowledgment backoff – I think that one day Apache Pulsar will replace Apache Kafka. This hasn’t happened yet, but they are trying hard. This PIP is another step in that direction, I hope. It proposes to add a nice way to configure negative acknowledgment backoff, which is almost like configuring a backoff “strategy”, but not that sophisticated.
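
To give a feel for what a negative acknowledgment backoff means in practice, here is a minimal Python sketch of an exponential redelivery delay. It is not Pulsar's API: the function name, the parameter names, and the default values are all hypothetical and chosen only for illustration.

```python
def redelivery_delay_ms(redelivery_count, min_delay_ms=1_000, max_delay_ms=60_000):
    """Delay before redelivering a negatively acknowledged message.

    Hypothetical helper, not part of any Pulsar client: the delay doubles
    with every redelivery and is capped at max_delay_ms.
    """
    if redelivery_count < 0:
        raise ValueError("redelivery_count must be non-negative")
    delay = min_delay_ms * (2 ** redelivery_count)
    return min(delay, max_delay_ms)


if __name__ == "__main__":
    for attempt in range(8):
        print(attempt, redelivery_delay_ms(attempt))
```

The point of the cap is that a message that keeps failing is retried less and less often, but never waits longer than a known upper bound.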

BookKeeper: BP-42: New Client API – list ledgers – Pulsar itself is built on a technology called BookKeeper. It is a reinvention of what Kafka did to store data, but as a separate component that can be used on its own. When this enhancement proposal is implemented, it will provide users with a new API for listing ledgers, which are just sequences of records (read more about BookKeeper’s concepts and architecture). The implementation of this proposal will improve transparency and give us more control over what goes on inside BookKeeper.

Articles

This section is just filled with inspiration. Here you will find some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

How the Modern Data Stack is Reshaping Data Engineering – In this article, Max Beauchemin offers a very interesting overview of how data engineering has changed over the last 4 years. It also looks at the new tools and aspects of work we data engineers have discovered in our own sphere of responsibilities.

Dealing with nulls in Spark – When I was getting started with Spark, I found nulls to be so painful that eventually I came to JetBrains hoping to improve the situation. Now I know that I’m not the only one who suffered!

Python Performance Tuning: 20 Simple Tips – Some say that Python is slow. Of course, that’s not entirely accurate, but in some circumstances it can be easy to misuse some functionality and get less than stellar results. And since we typically use Python in data engineering, it may be worthwhile to have a look at these 20 tips, which cover common pitfalls and can potentially save us an enormous amount of time.

Tools

November’s tool of the month is VisiData

Usually, I say that data engineering starts when there is too much data for Excel to handle. But frankly, sometimes that’s just not true. Sometimes we need to explore data and not build some heavy-weight process on top of it. And VisiData is a tool that helps us do so. It’s sort of like MS Excel, but it has a TUI rather than a GUI and it also supports Python instead of whatever Excel has (what actually is it, by the way?).

Some of our work with VisiData can be performed with the arrow keys, but many of the more advanced shortcuts are not so intuitive. Luckily, there is a manual on their site, which is also built into the application.

This tool has the best benefits of a TUI: responsiveness and feature-richness. I’ve prepared a video that demonstrates these benefits in one use case. You can see what I’m pressing by looking at the very bottom line of the terminal in this video:

And here is a step-by-step description of what I do in the video.

  • First I set the type of the movieId column to integer by pressing #.
  • Now I want to extract the year from the title: I press ; and then type the regular expression \((\d{4})\)\s*$.
  • I use ^ to rename the newly-created column year.
  • If no year is given for some of the movies, I should delete their rows. I will do this by typing [ (sort ascending), then , (select all rows with the same value in this column as in the current cell), and finally gd (delete all selected rows).
  • I need to do the same for movies with genre “(no genre provided)”.
  • Now I want to split the genres (currently delimited by |) into an array. This is where we can make use of Python. In the genres column, I type = and then the expression genres.split('|'). Note how I use column names in auto-completion to work with the columns in Python.
  • To expand (or explode in Spark’s terms, or flatMap in functional terms) the array into rows, I use zM, which is the shortcut to expand the current column of lists. After that, I use - to remove all the unnecessary columns and then rename the new column genre.
  • Next I want to build a table that will show me which genre was most popular in each year. For that, I’m marking the year column as a key (with !). Then I say that the title column should be aggregated as count (by pressing + and selecting count from the list). Finally, I press Shift+W (the shortcut for pivot) on the genre column. And that’s it!
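
For readers who think more easily in code than in keystrokes, the steps above can be sketched in plain Python. This is only an illustration of the same logic, not what VisiData runs internally: the sample rows and the function name are hypothetical, and only the column names, the regular expression, and the “(no genre provided)” marker come from the steps.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows; only the column names come from the steps above.
ROWS = [
    {"movieId": "1", "title": "Toy Story (1995)", "genres": "Adventure|Animation|Children"},
    {"movieId": "2", "title": "Heat (1995)", "genres": "Action|Crime"},
    {"movieId": "3", "title": "GoldenEye (1995)", "genres": "Action|Adventure"},
    {"movieId": "4", "title": "Untitled Project", "genres": "Drama"},
    {"movieId": "5", "title": "Mystery Reel (2001)", "genres": "(no genre provided)"},
    {"movieId": "6", "title": "Memento (2000)", "genres": "Mystery|Thriller"},
]

YEAR_RE = re.compile(r"\((\d{4})\)\s*$")  # the same regex that is typed after ;
NO_GENRE = "(no genre provided)"


def genre_counts_by_year(rows):
    """Return {year: Counter({genre: number of titles})}."""
    table = defaultdict(Counter)
    for row in rows:
        match = YEAR_RE.search(row["title"])
        if match is None or row["genres"] == NO_GENRE:
            continue  # drop rows without a year or without a genre
        year = int(match.group(1))
        # Split the genres and count each one separately (the "expand" step).
        for genre in row["genres"].split("|"):
            table[year][genre] += 1
    return dict(table)


if __name__ == "__main__":
    for year, counts in sorted(genre_counts_by_year(ROWS).items()):
        print(year, dict(counts))
```

The nested dictionary plays the role of the pivot table: years are the keys, genres are the columns, and each cell holds the number of titles.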

Yes, it does take some time to memorize all these commands, but using them can make you so much faster! And, even better, you can save all your actions to a separate file and just reproduce them when the data is changed to obtain a predictable result (unless the format has changed of course). VisiData can also open lots of different file formats, including SQLite and Excel files.

That wraps up November’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to hear about any other interesting data engineering articles you come across!
