Worth reading for data engineers - part 2

Welcome to the 2nd part of the series, with summaries of great blog posts about streaming and project organization!

Building Scalable Real Time Event Processing with Kafka and Flink by Allen Wang

In his blog post, Allen Wang shares DoorDash's journey of building a real-time event processing system named Iguazu. The goal was to simplify data delivery and availability for various consumers (ML, data warehouse, ...). How? Let's take a quick look.

  1. Kafka REST Proxy for event publishing. Instead of configuring the data generation part on each client, the new architecture exposes a RESTful API for data ingestion. Besides providing a high-level, easy-to-manage abstraction, the proxy also integrates with the Schema Registry for better input validation and supports asynchronous data writing (see the first sketch after this list).
  2. Apache Flink jobs factory. There is a whole standardized process to create Apache Flink jobs. The engineers don't start from scratch. Instead, the factory provides ready-to-use templates for the jobs and infrastructure.
  3. Dynamic job generation. Relying only on the Apache Flink programmatic API limits the system's usage to engineers. That's why Iguazu also has a way to generate jobs from SQL statements included in a YAML template (see the second sketch after this list).
  4. Data validation. There are 2 types of events: internal and external. The former come from fully controlled data producers, so they can be validated against the Schema Registry. If the validation fails, the producer never sends them. On the other hand, the external events come from web or mobile apps and are defined in JSON. A dedicated pipeline validates them and transforms them into the expected format.
  5. Schema evolution. It's part of the CI/CD pipeline. Whenever a schema is updated, the validation process begins. It verifies whether the evolution follows the schema compatibility strategy and, if not, rejects the merge.
  6. Event-driven data warehouse loading. The data gets processed in real time, and one of the supported destinations is S3, where Apache Flink writes Apache Parquet files. The object creation event is later sent to SQS and handled by Snowpipe to load the data into the data warehouse layer (Snowflake). The files are created with Apache Flink's StreamingFileSink to guarantee strong consistency and exactly-once delivery (see the third sketch after this list).
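
To give an idea of what the first point changes for a producer, here is a minimal sketch of publishing an event through such an HTTP proxy. The endpoint, topic name and payload are made up, since the summary doesn't describe Iguazu's actual API; only the standard java.net.http client is real.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EventPublisher {

    public static void main(String[] args) throws Exception {
        // Hypothetical payload; in practice it has to match the schema registered for the topic.
        String event = "{\"type\": \"order_created\", \"orderId\": \"1234\", \"createdAt\": \"2022-08-02T10:15:00Z\"}";

        HttpRequest request = HttpRequest.newBuilder()
                // Hypothetical proxy endpoint exposing the topic as a REST resource.
                .uri(URI.create("https://events-proxy.internal.example.com/v1/topics/order-events"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(event))
                .build();

        // The proxy validates the payload and writes it to Kafka asynchronously;
        // the client only deals with a plain HTTP call.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Proxy answered with HTTP " + response.statusCode());
    }
}
```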
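
For the third point, here is a minimal sketch of how SQL statements stored in a YAML template could be turned into a Flink job. The template layout and field names are hypothetical; the SnakeYAML and Flink TableEnvironment calls are the standard APIs.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.yaml.snakeyaml.Yaml;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import java.util.Map;

public class SqlJobFromYamlTemplate {

    // Assumed (hypothetical) template shape:
    // sql:
    //   - "CREATE TABLE orders (...) WITH ('connector' = 'kafka', ...)"
    //   - "CREATE TABLE order_stats (...) WITH ('connector' = 'filesystem', ...)"
    //   - "INSERT INTO order_stats SELECT ... FROM orders"
    public static void main(String[] args) throws Exception {
        try (InputStream template = new FileInputStream(args[0])) {
            Map<String, Object> job = new Yaml().load(template);

            TableEnvironment tableEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inStreamingMode().build());

            // Every statement from the template becomes part of the generated job,
            // so non-engineers only have to write SQL and YAML.
            for (Object statement : (List<?>) job.get("sql")) {
                tableEnv.executeSql(statement.toString());
            }
        }
    }
}
```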
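
And for the last point, a minimal sketch of a Flink job writing Parquet files to S3 with StreamingFileSink. The event type, bucket path and checkpoint interval are made up; the SQS notification and the Snowpipe load into Snowflake happen outside of Flink.

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class OrderEventsToS3Job {

    // Hypothetical event type; the real job would consume the validated event streams.
    public static class OrderEvent {
        public String orderId;
        public long eventTimeMillis;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what gives the sink its exactly-once behavior:
        // Parquet part files are only committed when a checkpoint completes.
        env.enableCheckpointing(60_000);

        DataStream<OrderEvent> events = env.fromElements(new OrderEvent()); // placeholder source

        StreamingFileSink<OrderEvent> parquetSink = StreamingFileSink
                .forBulkFormat(
                        new Path("s3://made-up-bucket/events/order_events"), // hypothetical bucket
                        ParquetAvroWriters.forReflectRecord(OrderEvent.class))
                .build();

        // Once the files land on S3, the object creation notifications (SQS + Snowpipe)
        // take over to load them into Snowflake.
        events.addSink(parquetSink);
        env.execute("order-events-to-s3");
    }
}
```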

Link to the blog post: https://doordash.engineering/2022/08/02/building-scalable-real-time-event-processing-with-kafka-and-flink.

You can reach out to the author on LinkedIn: Allen (Xiaozhong) Wang.

O Kafka, Where Art Thou? by Gunnar Morling

In his blog post, Gunnar Morling analyzes the impact of an unreliable Apache Kafka cluster on data delivery. Instead of giving a firm answer on what to use, he walks through various scenarios that call for different implementations.

To conclude, let me quote Gunnar. The quote highlights pretty well the challenges we face and the fact that there is often no perfect solution!

In the end it's all about trade-offs, probabilities and acceptable risks. For instance, would you receive and acknowledge that purchase order request as long as you can store it in a replicated database in the local availability zone, or would you rather reject it, as long as you cannot safely persist it in a multi-AZ Kafka cluster?
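
To make the trade-off more tangible, here is a minimal sketch (not taken from Gunnar's post) of an ingestion path that has to decide what to do when the cluster cannot persist the record in time. The topic name, timeouts and fallback store are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class OrderIngestion {

    private final KafkaProducer<String, String> producer;

    public OrderIngestion() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Only consider the write durable once all in-sync replicas acknowledged it.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Fail fast instead of retrying forever when the cluster is unreachable.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 3_000);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 5_000);
        this.producer = new KafkaProducer<>(props);
    }

    public void acceptOrder(String orderId, String payload) {
        try {
            // Block until the broker acknowledged the record (or the delivery timeout expired).
            producer.send(new ProducerRecord<>("purchase-orders", orderId, payload)).get();
        } catch (InterruptedException | ExecutionException e) {
            // The trade-off from the quote: either reject the request here...
            // throw new IllegalStateException("Kafka unavailable, rejecting the order", e);
            // ...or acknowledge it anyway after persisting it in a local fallback store for later replay.
            saveToLocalOutbox(orderId, payload);
        }
    }

    private void saveToLocalOutbox(String orderId, String payload) {
        // Hypothetical: write to a replicated database in the local availability zone.
    }
}
```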

Link to the blog post: https://www.morling.dev/blog/kafka-where-art-thou/.

You can reach out to the author on LinkedIn: Gunnar Morling.

Capturing late data by Immerok

Even though this article comes from the documentation, it is a great example of... how to write good documentation examples! Immerok specializes in Apache Flink, and Apache Flink snippets are part of its documentation under the "Apache Flink Cookbook" section.

The first recipe I found there was the one about late data handling in Apache Flink. It shows how to capture the events that arrive too late for their window and send them to a separate sink instead of silently dropping them.
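
Under the hood, the recipe relies on Flink's side outputs. Roughly, the pattern looks like the sketch below; the input type, key, and window size are made up here, while sideOutputLateData and getSideOutput are the actual Flink API calls.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class CaptureLateDataJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: (sensorId, eventTimestampMillis) pairs with event-time watermarks.
        DataStream<Tuple2<String, Long>> measurements = env
                .fromElements(
                        Tuple2.of("sensor-1", 10_000L),
                        Tuple2.of("sensor-1", 70_000L),
                        Tuple2.of("sensor-1", 5_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, timestamp) -> event.f1));

        // Records arriving after the watermark already passed their window go to this side output.
        final OutputTag<Tuple2<String, Long>> lateTag =
                new OutputTag<Tuple2<String, Long>>("late-measurements") {};

        SingleOutputStreamOperator<Tuple2<String, Long>> perMinute = measurements
                .keyBy(measurement -> measurement.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sideOutputLateData(lateTag)
                .reduce((a, b) -> Tuple2.of(a.f0, Math.max(a.f1, b.f1)));

        // The captured late records form a regular stream that can go to any separate sink.
        perMinute.getSideOutput(lateTag).print("LATE");
        perMinute.print("ON-TIME");

        env.execute("capture-late-data");
    }
}
```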

Link to the blog post: https://docs.immerok.cloud/docs/how-to-guides/development/capturing-late-data-and-send-to-separate-sink-with-apache-flink/.

You can reach out to the author on LinkedIn: Immerok.

How we built our Lakeless Data Warehouse by Iliana Iankoulova

In her blog post, Iliana shares quite interesting lessons learned from her experience building a data warehouse at Picnic. It's not a deep technical dive but a great presentation of iterative product building and the small victories leading to the final success:

  1. Start small but think big. Iliana shares her feelings about listing 500 KPIs and trying to choose the ones to work on. Well, it was not obvious. Starting with a small and simple project is always better and easier to build.
    Amid this waterfall process to get all our wishes down in writing, we started a small project with one of the founders and an analyst to generate daily stats posted in Slack. Nothing fancy: # orders, % of completed orders, % on-time deliveries, # active customers, order rating. However, I enjoyed this project very much, as it was hands-on. This is how the first tables on the DWH were born, from the need of powering stats in Slack.
  2. This one deserves a full quote because it's so true.
    The pattern that resonated with me was that the further downstream the data structuring responsibility is placed, the more disputed the results of the data is, as everyone calculated metrics a bit differently.
  3. "perfect is an enemy of the good" or another variant I also like to mention, "done is better than perfect". Iliana gives an example of the data orchestration for the ETL process that had been done initially on her computer. The scheduling later moved to a CronJob and only in the end to a more advanced data orchestrator. Despite that, the team could get insight from the data very early in the project lifecycle.
  4. Read and do research! There is probably somebody who has already solved the problem you're facing. This reflex is not always obvious in greenfield projects moving at a fast pace.
  5. A taxonomy glossary and standardized naming patterns are very useful. They help avoid misuse of the exposed data.
  6. Be aware of the Master Data Management (MDM) changes and their impact on the data warehouse layer.
  7. Reduce the risk. Iliana and her team started with a... PostgreSQL database for the data warehouse layer. Even though it's not a data warehouse technology per se, they were able to provide business value immediately and at low risk, since they knew the tool very well.
  8. Data contracts! The article is 2 years old, but the backend team had already implemented a contract-based API at that time. It's amazing how far ahead of their time they were, when only a handful of people were thinking about data contracts.
  9. If it's a timestamp, it should be in UTC! Storing temporal values in a single, unified timezone helps avoid many conversion issues, as well as daylight saving time changes (see the short sketch after this list).
  10. Data Vault. Iliana recommends 2 resources if you want to implement this pattern in your data warehouse: https://www.ukdatavaultusergroup.co.uk/multiple-speakers-my-first-data-vault-project-download-dec-2020-pop-up/ and https://blog.picnic.nl/data-vault-new-weaponry-in-your-data-science-toolkit-e21278c9c60a.
  11. Lean process, including:
    • No single person is a bottleneck for production releases.
    • Minimize the tools, languages, and environments needed for a feature implementation.
    • Ease peer review. Obviously, low-code tools don't make it easy.
    • Enable quick prototyping by business users with a high-level abstraction, such as SQL.
    • Give DataOps control and responsibility with independently deployed and scheduled jobs.
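
A small illustration of the UTC point (item 9), with made-up names and an arbitrary display timezone:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class TimestampHandling {

    public static void main(String[] args) {
        // Store: always capture the instant in UTC (e.g. an epoch-based or TIMESTAMP column).
        Instant orderPlacedAt = Instant.now();
        System.out.println("Stored value (UTC): " + orderPlacedAt);

        // Display: convert to a local timezone only at the presentation layer.
        ZonedDateTime shownToTheUser = orderPlacedAt.atZone(ZoneId.of("Europe/Amsterdam"));
        System.out.println("Shown to the user:  " + shownToTheUser);

        // This way, daylight saving time changes only affect the conversion, never the stored value.
    }
}
```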

Link to the blog post: https://blog.picnic.nl/how-we-built-our-lakeless-data-warehouse-38178f6cee12.

You can reach out to the author on LinkedIn: Iliana Iankoulova.

I know, some of the shared blog posts are 2 years old. However, they prove that even in our ever-changing data domain, some concepts have a long shelf life!

