Waitingforcode

Event time skew in stream processing

As a data engineer, you're certainly familiar with data skew. Yes, that bad phenomenon where one task takes considerably more input than the others and often causes unexpected latency or failures. It turns out stream processing also has its own skew, but one related to time.

Stopping a Structured Streaming query

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. After all, sometimes you need to release a new job package with upgraded dependencies or improved business logic. What happens then?

Data enrichment strategies in Apache Flink

Data enrichment is a crucial step in making data more usable for business users. Doing that in a batch job is relatively easy thanks to the static nature of the dataset. When it comes to streaming, the task is more challenging.

Rolling history logs in Spark History UI

Stream processing is great but it brings some gotchas that are not obvious. Logs are one of them.

Schema tracking in Delta Lake

Streaming a Delta table is slightly different from streaming native streaming sources, such as Apache Kafka topics. One of the significant differences is schema enforcement, which makes the job fail when the schema of the streamed table changes.

StreamingQueryListener, from states to questions

Apache Spark leverages the observer design pattern for the framework-to-code communication. One of the consumers' implementations is StreamingQueryListener.

Processing time trigger, to be or not to be?

That's the question. Omitting the processing time trigger makes micro-batch triggering more reactive, but it cannot be considered the one true best practice. Let's see why.
