Gotchas of Streaming Pipelines: Profiling & Performance Improvements

Rakesh Kumar
Published in Lyft Engineering
Jun 6, 2023 · 9 min read


Discover how Lyft identified and fixed performance issues in our streaming pipelines.


Background

Every streaming pipeline is unique. When reviewing a pipeline’s performance, we ask the following questions: “Is there a bottleneck?”, “Is the pipeline performing optimally?”, “Will it continue to scale with increased load?”
Regularly asking these questions is vital to avoid scrambling to fix performance issues at the last minute. By doing so, one can tune pipelines to perform optimally, consistently satisfy SLAs, and reduce resource waste.

This article will cover the following topics:

  • Performance improvement process
  • Strategies to profile streaming pipelines
  • Common performance problems
  • General guidelines to improve performance

Performance Improvement Process

The performance improvement of any software system is not an independent and isolated task but an iterative process. It entails the following steps:

  1. Measure / profile performance
  2. Identify root cause
  3. Fix
  4. Go to step one

The process is repeated until you get the desired performance (at the targeted scale) or you exhaust all performance measures.

Profiling Your Pipeline

Identifying a performance issue without any profiling tool is a shot in the dark. Profiling is the first step of the process, and requires the right tools. Tools help you identify the most impactful performance issues faster. They can also be integrated with development environments to provide a comprehensive report early in the development lifecycle.

Memory & CPU Profiler

It is critical to have memory and CPU profiling integrated with the execution environment. Even the most trivial code can hide severe performance bottlenecks. Eyeballing the code will not help identify the issue; a profiler gives you a more accurate picture and helps find the underlying cause swiftly.

Since most of our pipelines are based on Apache Beam with pipeline code in Python, we use Pyflame and async-profiler to generate a heat map of CPU utilization. This helps us understand the performance characteristics and identify offending code. At the time of writing, Pyflame has been deprecated and is no longer supported, though similar tools, such as flameprof, exist.

Figure 1: PyFlame Graph Example
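Since Pyflame is no longer maintained, here is a minimal sketch of the same idea using the standard-library cProfile together with flameprof (assumed to be installed in the execution environment); process_batch is a hypothetical stand-in for a pipeline's hot path:

import cProfile

def process_batch(batch):
    # Hypothetical stand-in for the pipeline's hot path.
    return [record for record in batch if record.get("valid")]

profiler = cProfile.Profile()
profiler.enable()
process_batch([{"valid": True}] * 1000)
profiler.disable()
profiler.dump_stats("pipeline.prof")
# Render a flame graph from the captured profile, e.g.:
#   flameprof pipeline.prof > pipeline.svg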

If your pipeline is JVM based, you can use various JVM profilers to identify bottlenecks.

Flink Dashboard

Lately, the Flink dashboard has added visual tools that show various performance data points to help identify bottlenecks. For example, you can see CPU utilization for an individual operator that may be causing back-pressure. This data is presented at the operator level and provides the information needed to narrow your search area.

Figure 2: High CPU Utilization

As you can see in Figure 2, one operator is 100% busy. This can potentially result in back-pressure in the pipeline. It also provides valuable insight into which pipeline operators require closer scrutiny to identify any bottlenecks.
The dashboard provides operator-level record throughput which helps to identify any data skewness (hot shard) issues in the pipeline.
The Flink dashboard also has an inbuilt flamegraph (Figure 3) which is useful for identifying bottlenecks in JVM based pipelines. This flamegraph can be enabled by setting a flag in the config:

rest.flamegraph.enabled: true
Figure 3: Flink Flame Graph

Flink’s Metrics System

Flink also generates various system performance metrics at the task and operator level:

  • Throughput
  • Latency (watermark)
  • JVM metrics (Heap, direct memory used, GC etc.)

These metrics can be emitted to a separate metrics monitoring system in order to create a dashboard (Figure 4) to continually monitor performance.

Figure 4: JVM Monitoring Dashboard

Apart from the above mentioned metrics, alerts are set on the following metrics:

  • Checkpoint size
  • Checkpoint failure
  • Pipeline restart
  • Pipeline output per minute

These metrics are important from a stability point of view and can surface pipeline issues immediately.
The application owner can also define business-related metrics and SLAs that are important to the overall health of the application. In some cases, system metrics may be within range, but business metrics tell a different story. For example, end-to-end latency can increase significantly; in our case, this was caused by two factors: 1) data skewness and 2) upstream event generation issues.
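For Python (Apache Beam) pipelines, such an application-level metric can be emitted with Beam's Metrics API. A hedged sketch follows; the event field names and the 30-second threshold are hypothetical:

import apache_beam as beam
from apache_beam.metrics import Metrics

class TrackSlaFn(beam.DoFn):
    # Counts events whose end-to-end latency exceeds a hypothetical 30-second SLA.
    def __init__(self):
        self.sla_breaches = Metrics.counter("business", "sla_latency_breaches")

    def process(self, event):
        if event["processed_at"] - event["created_at"] > 30:  # seconds
            self.sla_breaches.inc()
        yield event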

Common Performance Problems

After extensive experience operating pipelines and scaling them to process hundreds of thousands of events per second, we have realized that there are four categories of performance issues common across pipelines. They are listed below in order of severity:

  1. Data skewness (hot shard)
  2. Large window size
  3. Interaction with low speed services
  4. Serialization & de-serialization

Data Skewness (hot shard)

Data skewness, or hot sharding, is the most notorious of all performance issues. In 80% of cases, we have identified it as the major culprit behind performance problems. It is primarily due to a higher concentration of data on individual keys. Its side effects include an increase in end-to-end latency and under-utilization of resources.
The knee-jerk reaction to such an issue is to throw more computing power at the pipeline, but that doesn't solve the problem and only leads to further under-utilization of resources. Skew can be identified with the help of the Flink dashboard by carefully looking for operators with a nonuniform distribution of input records. For more details, check out our earlier blog post on data skewness.

Large Window Size

A large window size causes a huge in-flight data volume at the time of window materialization and increases the checkpoint size. Side effects include increased end-to-end latency and a nonuniform (e.g. sawtooth) resource utilization pattern. This gets worse for sliding windows, because Flink stores a copy of the data for each window, which increases the overall data size as well as memory and CPU utilization.

This can be identified by looking at the system dashboard and finding the affected operator with a large window size.
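As a rough illustration in Beam's Python SDK (the window sizes are hypothetical and events is an assumed upstream PCollection): a one-hour window sliding every 60 seconds keeps roughly 60 overlapping copies of each element in flight, so widening the slide period or switching to a fixed window can shrink state and checkpoint size considerably.

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows, FixedWindows

# Roughly 60 overlapping copies of every element are retained:
heavy = events | "WideSliding" >> beam.WindowInto(SlidingWindows(size=3600, period=60))

# One copy per element; far smaller state and checkpoints:
light = events | "Fixed" >> beam.WindowInto(FixedWindows(60))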

Interaction With Higher Latency Services

Some of the use cases require a pipeline to interact with external services, either to store output or hydrate the incoming record with additional data.

External service interaction comes with the cost of uncontrolled latency or nondeterministic behavior when the external service cannot keep up. The effects of this issue are back-pressure in upstream operators and higher end-to-end latency. Ideally, such operations should be avoided in pipelines. If they cannot be avoided, their effect can be minimized by setting the right parallelism for the operators involved and performing the operations in batches.
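A hedged sketch of the batching approach in the Beam Python SDK: events is an assumed upstream PCollection, and build_client / lookup_many stand in for a hypothetical bulk endpoint of the external service.

import apache_beam as beam

class BulkEnrichFn(beam.DoFn):
    # Enrich a whole batch with a single external call instead of one call per record.
    def setup(self):
        self.client = build_client()  # hypothetical client factory

    def process(self, batch):
        extras = self.client.lookup_many([e["id"] for e in batch])  # hypothetical bulk lookup
        for event, extra in zip(batch, extras):
            yield {**event, **extra}

enriched = (
    events
    | "Batch" >> beam.BatchElements(min_batch_size=10, max_batch_size=500)
    | "Enrich" >> beam.ParDo(BulkEnrichFn())
)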

Serialization & De-serialization

In most cases, serialization and de-serialization performance issues hide in plain sight. They can be identified by looking at the flame graph. In some extreme cases, we have seen ~20% of resource utilization attributable to serialization costs, which is compounded when huge numbers of events are being ingested. Ideally, such operations should be minimized and pushed to the pipeline boundary.

General Guideline

Every pipeline is different. There are no one-size-fits-all performance improvements, but there are general guidelines available that may bring to light some quick wins.

Avoid Duplicate Operations

More often than not, duplicate operations sneak into pipeline code, so check for them regularly. For example, a pipeline may have parallel branches that each deserialize the same data again and again; this work can be pushed to an upstream operator so it is performed only once.
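A minimal sketch in the Beam Python SDK, assuming raw_events is a PCollection of JSON strings: parse once upstream and let both branches reuse the parsed records.

import json
import apache_beam as beam

# Parse once in a shared upstream step...
parsed = raw_events | "ParseOnce" >> beam.Map(json.loads)

# ...then branch; neither branch re-deserializes the payload.
rides = parsed | "Rides" >> beam.Filter(lambda e: e.get("type") == "ride")
payments = parsed | "Payments" >> beam.Filter(lambda e: e.get("type") == "payment")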

Avoid Unnecessary Shuffling

Shuffling of data is one of the most common causes of performance degradation. Having a good understanding of operators and how they cause shuffling helps you design a better pipeline. The general guideline is to minimize the use of such operators; doing so can significantly increase performance and reduce end-to-end latency.

It is possible to identify shuffling using the Flink dashboard. Whenever an edge between two operators is denoted by Hash or Rebalance (Figure 5), that means the data is being reshuffled by the upstream operator.

Generally, Flink provides the names of the methods in which shuffling occurs. If the reshuffling is unnecessary, then it should be removed from the pipeline.
One of the high-level goals of the pipeline design is to minimize the number of shuffles in the pipeline.

Figure 5: Reshuffle or Rebalance
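One common way to cut shuffled data volume in the Beam Python SDK, assuming keyed is a PCollection of (key, value) pairs: prefer a combiner over a raw GroupByKey so the runner can pre-aggregate before the shuffle.

import apache_beam as beam

# GroupByKey ships every record across the network before the sum:
#   totals = keyed | beam.GroupByKey() | beam.MapTuple(lambda k, vs: (k, sum(vs)))

# CombinePerKey pre-aggregates on the sending side, so far less data is shuffled:
totals = keyed | "SumPerKey" >> beam.CombinePerKey(sum)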

Enable Cython

This is specifically applicable to pipelines written in Python (Apache Beam). In our case, Cython netted a performance gain of 5%. Enabling Cython in Beam is easy: install Cython in the build & execution environment (ensuring the Cython version is ≥ 0.28.1).

Furthermore, you can use Pyflame profiling to identify slow functions and cythonize them.
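One lightweight way to try this is pyximport, which compiles Cython modules on first import. This is only a sketch, assuming the slow function has been moved into a hypothetical hot_path.pyx module:

import pyximport
pyximport.install()  # compiles .pyx modules on first import (requires Cython in the environment)

import hot_path  # hypothetical module containing the cythonized function

def process(record):
    return hot_path.score(record)  # hypothetical cythonized hot function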

Drop Unnecessary Data Early

Streaming pipelines are generally meant to process huge amounts of data, but processing unnecessary data still hurts CPU, memory, and network utilization. Always follow the principle of least data processing. We aggressively drop unnecessary data and attributes at the very beginning of the pipeline. This has increased our performance by 7% on average.
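A sketch of the idea in the Beam Python SDK (events is an assumed upstream PCollection and the field names are hypothetical): filter out irrelevant event types and project only the attributes downstream operators actually need, as the very first steps.

import apache_beam as beam

KEEP_FIELDS = {"ride_id", "user_id", "status", "ts"}  # hypothetical allowlist of attributes

slim = (
    events
    | "DropOtherTypes" >> beam.Filter(lambda e: e.get("type") == "ride")
    | "ProjectFields" >> beam.Map(lambda e: {k: v for k, v in e.items() if k in KEEP_FIELDS})
)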

Use Protobuf

As you may be aware, latency increases when operators send data across the network. We reduced the data size by converting the data into protobuf objects. Protobuf is a binary format with a well-known schema, resulting in much smaller payloads over the wire. We use a single protobuf message type to pass data between the operators in a pipeline.

Figure 6: Final size comparison (Credit to OAuth)
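A rough sketch of the size difference, assuming ride_event_pb2 is a module generated by protoc from a hypothetical ride_event.proto schema:

import json
from ride_event_pb2 import RideEvent  # hypothetical generated protobuf module

event = {"ride_id": "abc123", "lat": 37.7749, "lng": -122.4194, "ts": 1686000000}
json_bytes = json.dumps(event).encode("utf-8")

pb_bytes = RideEvent(ride_id="abc123", lat=37.7749, lng=-122.4194, ts=1686000000).SerializeToString()

print(len(json_bytes), len(pb_bytes))  # the binary protobuf payload is typically several times smaller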

Avoid Data Skewness

As mentioned in the previous post (Gotchas of Streaming Pipelines: Data Skewness), avoid data skewness as much as possible. It is the most common culprit that hinders optimal performance and prevents optimal usage of resources. Check that post for ideas on identifying and fixing data skewness.
The gist of the strategy is to identify a key which distributes the data as uniformly as possible. For example, sharding based on a region or city could lead to data skewness (some cities are larger than others), whereas sharding based on the user ID gives a higher chance of uniformly distributing the data and avoiding skew.
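In Beam Python terms (with events as an assumed upstream PCollection and hypothetical field names), the difference is simply which field becomes the key:

import apache_beam as beam

# Keying by city concentrates traffic on a few hot keys:
#   by_city = events | beam.Map(lambda e: (e["city"], e))

# Keying by user ID spreads records far more uniformly across workers:
by_user = events | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], e))
per_key_counts = by_user | "CountPerKey" >> beam.combiners.Count.PerKey()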

Checkpointing

Checkpointing can also be a source of performance degradation if checkpoint alignment takes more than several seconds; overly long checkpoints are automatically canceled by Flink. Frequent checkpointing also reduces performance, hence it is imperative to find the optimal interval between successive checkpoints. Otherwise, the pipeline will end up spending most of its time checkpointing rather than doing the actual work. Test out various configurations, measure the performance, and choose the configuration that most closely fits your SLA requirements. You can also try unaligned checkpoints, which have not shown any negative effects in our tests.
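For Python (Beam-on-Flink) pipelines, the interval can be set through the FlinkRunner's checkpointing_interval pipeline option (in milliseconds). The 60-second value below is only a starting point to measure against your SLA, not a recommendation:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--checkpointing_interval=60000",  # milliseconds between checkpoints; tune and re-measure
])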

Additionally, monitor checkpoint size (Figure 7). Avoid storing unnecessary data in state to reduce the overall checkpoint size.

Figure 7: Flink dashboard for checkpoint size

Smaller State Size

Sometimes it is necessary to use stateful processing, but it is always recommended to keep the state size small. This helps to keep the checkpoint size smaller and minimize data transfer.

In order to keep the state size small, you need to ensure that state is purged regularly if it is associated with a global window. A global window is unbounded and never materializes; thus the state is kept in memory forever. This could lead to out of memory (OOM) crashes. It is better to have some logic that purges the state data regularly.
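A hedged sketch of such purge logic with the Beam Python state and timer API (keyed (key, value) input is assumed; the 10-minute retention is hypothetical):

import time
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.transforms.timeutil import TimeDomain

class PurgingCountFn(beam.DoFn):
    COUNTS = BagStateSpec("counts", VarIntCoder())
    PURGE = TimerSpec("purge", TimeDomain.REAL_TIME)

    def process(self, element,
                counts=beam.DoFn.StateParam(COUNTS),
                purge=beam.DoFn.TimerParam(PURGE)):
        key, value = element
        counts.add(value)
        purge.set(time.time() + 600)  # clear this key's state ~10 minutes from now
        yield key, sum(counts.read())

    @on_timer(PURGE)
    def expire(self, counts=beam.DoFn.StateParam(COUNTS)):
        counts.clear()  # keeps the state, and hence the checkpoint, from growing forever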

Moreover, a longer sliding window tends to have a bigger state size because every window keeps a copy of the data, which increases the overall data volume and hence the state size.

Network

Most services are deployed on commodity hardware across multiple regions and availability zones (AZs) for better availability. However, a streaming pipeline is different: it can't be partially down. Network speed is also a vital part of pipeline performance, so a pipeline should be deployed with locality (e.g. within a single AZ) to keep all the instances/task managers close together and minimize network latency.
We monitor the pipelines on three axes: latency, availability, and cost, optimizing each according to business requirements.

Closing Notes

There are various strategies available to improve performance, though mileage may vary based on data and pipeline design. While building a pipeline, do not prematurely optimize it. Rather, treat performance improvement as an iterative process that is followed throughout the lifecycle of the pipeline, gradually improving performance.

In case you are interested in reading the previous blog posts in this series, you can find them on the Lyft Engineering blog.

Special thanks to Jim Chiang, Ravi Kiran Magham, and Seth Saperstein for contributing to this post.
