Data Engineering Digest

Event time skew in stream processing

Waitingforcode

APRIL 24, 2024

As a data engineer, you're certainly familiar with data skew: the bad phenomenon where one task takes considerably more input than the others, often causing unexpected latency or failures. It turns out that stream processing has its own skew too, one related to time.

Stopping a Structured Streaming query

Waitingforcode

APRIL 18, 2024

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. After all, sometimes you may need to release a new job package with upgraded dependencies or improved business logic. What happens then?

Data enrichment strategies in Apache Flink

Waitingforcode

APRIL 11, 2024

Data enrichment is a crucial step in making data more usable for business users. Doing that in a batch job is relatively easy due to the static nature of the dataset. When it comes to streaming, the task is more challenging.

Rolling history logs in Spark History UI

Waitingforcode

APRIL 5, 2024

Stream processing is great but it brings some gotchas that are not obvious. Logs are one of them.

Schema tracking in Delta Lake

Waitingforcode

MARCH 27, 2024

Streaming from Delta tables is slightly different from streaming from native streaming sources, such as Apache Kafka topics. One of the significant differences is schema enforcement, which makes the job fail when the schema of the streamed table changes.

StreamingQueryListener, from states to questions

Waitingforcode

MARCH 22, 2024

Apache Spark leverages the observer design pattern for framework-to-code communication. One of the consumer implementations is StreamingQueryListener.

Processing time trigger, to be or not to be?

Waitingforcode

MARCH 12, 2024

That's the question. The lack of a processing time trigger means more reactive micro-batch triggering, but it cannot be considered the one true best practice. Let's see why.

Apache Flink and the input data reading

Waitingforcode

MARCH 5, 2024

I'm writing this unexpected blog post because I got stuck with watermarks and checkpoints and felt that I was missing some basics. Even though this introduction is a bit negative, exploring the data reading led me to other discoveries.

Anatomy of a Structured Streaming job

Waitingforcode

FEBRUARY 27, 2024

Apache Spark Structured Streaming relies on the micro-batch pattern, which evaluates the same query in each execution. That's only a high-level view, though. Under the hood, many other interesting things happen.

Min rate limits for Apache Kafka

Waitingforcode

FEBRUARY 20, 2024

I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Did you know that you can also control the lower limit, at least for Apache Kafka?

What's new on the cloud for data engineers - part 12 (10.2023-02.2024)

Waitingforcode

FEBRUARY 13, 2024

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 5 months.

Table file formats - streaming writer: Delta Lake

Waitingforcode

FEBRUARY 6, 2024

In the previous blog post of the series, we discovered the streaming reader. However, an end-to-end streaming Delta Lake pipeline also requires a writer, which will be our focus today.

Apache Flink and cluster components deep dive

Waitingforcode

JANUARY 30, 2024

Previously you could read about the transformation of a user job definition into an executable stream graph. Since this explanation was relatively high-level, I decided to take a deep dive into the final step, executing the code.

Static enrichment dataset with Delta Lake

Waitingforcode

JANUARY 23, 2024

Data enrichment is one of the common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

Table file formats - streaming reader: Delta Lake

Waitingforcode

JANUARY 17, 2024

Even though I'm into streaming these days, I haven't really covered streaming in Delta Lake yet. I only briefly blogged about Change Data Feed but completely missed the fundamentals. Hopefully, this and the next blog posts will change this!

Files streaming is quite a challenge

Waitingforcode

JANUARY 9, 2024

It's technically possible to process files in a continuous way from a streaming job. However, if you are expecting a latency-sensitive job, this will always be slower than processing data directly from a streaming broker. Why?

Stream processing models

Waitingforcode

JANUARY 2, 2024

If you're interested in stream processing, I bet your thinking is technology-based. It's not wrong; after all, the ability to use a tool gives you and me a job. However, in the long term it's better to reason in terms of patterns or models. Being aware of a more general vision helps assimilate new tools.

2023 retrospective on waitingforcode.com

Waitingforcode

JANUARY 2, 2024

This is one of my favorite blog posts, the yearly retrospective. Every year I summarize what happened in the past 12 months and share with you my future plans. It's time for the 2023 Edition!

Streamhouse, the next house to move into?

Waitingforcode

DECEMBER 26, 2023

I must admit it: if you want to catch my attention, you can use some keywords. One of them is "stream". Knowing that, the topic of my new blog post shouldn't surprise you.

Order is king for the performance

Waitingforcode

DECEMBER 19, 2023

Even though data processing frameworks and data stores nowadays have smart query planners, they don't take away our responsibility to correctly design the job logic.

Data+AI Summit 2023, retrospective part 2

Waitingforcode

DECEMBER 12, 2023

One week later than initially announced, but here it is, the second part of the Data+AI Summit 2023 retrospective. I don't know how, but I managed to include some streaming-related talks here too!

Vertical autoscaling for data processing on the cloud

Waitingforcode

DECEMBER 5, 2023

"Vertical scaling" has caught my attention a few times already while I was reading about cloud updates. I've always considered horizontal scaling the one true scaling policy for elastic data processing pipelines. Have I been wrong?

Accumulators and reliability

Waitingforcode

NOVEMBER 28, 2023

In March I wrote a blog post showing how to use accumulators to track the application of each filter statement. It turns out the solution may not be perfect, as Aravind mentioned in one of the comments. I bet you already have an idea, but if not, keep reading. Everything will be clear in the end!

Data+AI Summit 2023, retrospective part 1 - streaming

Waitingforcode

NOVEMBER 21, 2023

Even though you may be thinking now about Data+AI Summit 2024, I still owe you my retrospective for the 2023 edition. Let's start with the first part covering stream processing talks!

Apache Flink - anatomy of a job

Waitingforcode

NOVEMBER 14, 2023

Have you written your first successful Apache Flink job and are still wondering how the high-level API translates into the executable details? I did, and decided to answer the question in the new blog post.

Table file formats - checkpoints: Delta Lake

Waitingforcode

NOVEMBER 7, 2023

Checkpoints are a well-known fault-tolerance mechanism in stream processing. But what do they have to do with Delta Lake?

What's new in Apache Spark 3.5.0 - watermark propagation

Waitingforcode

NOVEMBER 1, 2023

Watermark, or rather multiple watermarks management, has been a thorn in the side of Apache Spark Structured Streaming. It improved in the previous release (3.4.0) but still had some room for improvement. Well, it did, because the 3.5.0 release brought a serious fix for the multiple watermarks scenario.

What's new in Apache Spark 3.5.0 - Structured Streaming

Waitingforcode

OCTOBER 24, 2023

It's time to start the series covering Apache Spark 3.5.0 features. As the first topic, I'm going to cover Structured Streaming, which got a lot of RocksDB improvements and some major API changes.

Watermark and input data filtering in Apache Spark Structured Streaming

Waitingforcode

OCTOBER 17, 2023

I've already written about watermarks in a few places in the blog but despite that, I still find things to refresh. One of them is the watermark used to filter out the late data, which will be the topic of this blog post.

Table file formats - vacuum: Delta Lake

Waitingforcode

OCTOBER 10, 2023

If you have some experience with RDBMS (and who doesn't?), you have probably run a VACUUM command to reclaim the storage space occupied by deleted or obsolete rows. If you're now working with Delta Lake, you can do the same!

Making applyInPandasWithState less painful

Waitingforcode

OCTOBER 4, 2023

Do not get the title wrong! Having applyInPandasWithState in the PySpark API is huge! However, due to Python duck typing, some operations are more difficult and riskier to express in the code than in the strongly typed Scala API.

Arbitrary stateful processing in PySpark with applyInPandasWithState

Waitingforcode

SEPTEMBER 27, 2023

It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0 you can even write arbitrary stateful processing jobs! But since the API is a little bit different from the one available on the Scala side, I wanted to take a deeper look.

What's new on the cloud for data engineers - part 11 (06-09.2023)

Waitingforcode

SEPTEMBER 20, 2023

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 4 months.

Apache Flink best practices - Flink Forward lessons learned

Waitingforcode

SEPTEMBER 14, 2023

I won't hide it, I'm still a newcomer in the Apache Flink world and, despite my past streaming experiences with Apache Spark Structured Streaming and GCP Dataflow, I need to learn. And to learn a new tool or concept, there is nothing better than watching some conference talks!

ETL vs. ELT?

Waitingforcode

SEPTEMBER 6, 2023

In our social media and marketing-driven era, it's quite hard to get things right. For me there is one common misconception brought by the Modern Data Stack idea: that everything should now be ELT. In fact, no, it shouldn't; it only can.

Table file formats - isolation levels: Delta Lake

Waitingforcode

AUGUST 29, 2023

If Delta Lake implemented only commits, I could stop exploring this transactional part after the previous article. But like an RDBMS, Delta Lake implements other ACID-related concepts. One of these is isolation levels.

Table file formats - commits: Delta Lake

Waitingforcode

AUGUST 22, 2023

One of the great features of modern table file formats is the ability to handle write conflicts. It wouldn't be possible without commits, which are the topic of this new blog post.

Don't sleep when you code. About sleep issue in KPL

Waitingforcode

AUGUST 17, 2023

Lessons learned on why it's always worth checking the code implementation to avoid surprises later, even for vendor-supported solutions.

_spark_metadata in Apache Spark Structured Streaming issue is no more!

Waitingforcode

AUGUST 10, 2023

There are probably fewer people working with flat files in Structured Streaming today than 5 years ago, thanks to the table file formats. However, if you are in this group and are still generating CSVs or JSONs with the streaming sink, brace yourself: memory problems are coming if you don't take action!

The first state in Apache Spark Structured Streaming arbitrary stateful processing

Waitingforcode

AUGUST 2, 2023

When you wrote your first arbitrary stateful processing pipelines, the state expiration was maybe the first tricky point you had to deal with. Why is that? After all, it's just about setting the timeout, isn't it? Most of the time, yes, but there is an exception.

State expiration in stream-to-stream joins with event time range condition

Waitingforcode

JULY 25, 2023

You certainly know it: the watermark (aka GC Watermark) is responsible for cleaning the state store in Apache Spark Structured Streaming. But you may not know that it's not the only time-based condition. There is a different one involved in the stream-to-stream joins.

How to initialize state in Apache Spark Structured Streaming stateful jobs?

Waitingforcode

JULY 21, 2023

Starting from Apache Spark 3.2.0, it is now possible to load an initial state for arbitrary stateful pipelines. Even though the feature is easy to use, it hides some interesting implementation details!

Berlin Buzzwords 2023 - notes for data engineers

Waitingforcode

JULY 13, 2023

That's a conference I heard about only recently. What a huge mistake! Despite the lack of the word "data" in the name, it covers many interesting data topics, and before I share with you my notes from this year's Data+AI Summit, let me do the same for Berlin Buzzwords!

Multiple queries running in Apache Spark Structured Streaming

Waitingforcode

JULY 6, 2023

It's often a dilemma: should we put multiple sinks working on the same data source in the same or in different Apache Spark Structured Streaming applications? Both solutions may be valid depending on your use case, but let's focus here on the former, putting multiple sinks together.
