A Gentle Introduction to Analytical Stream Processing

Building a Mental Model for Engineers and Anyone in Between

Scott Haines
Towards Data Science
17 min read · Mar 31, 2023


Streams naturally coalescing into a waterfall: gentle before things begin to move faster, much like data networks and streaming environments. Stream processing can be handled gently and with care, or wildly and almost out of control! You be the judge of which future you'd rather embrace. Photo credit: @psalms

Introduction

In many cases, processing data in-stream, or as it becomes available, can help reduce an enormous data problem (due to the volume and scale of the flow of data) into a more manageable one. By processing a smaller set of data, more often, you effectively divide and conquer a data problem that may otherwise be cost and time prohibitive.

Transitioning from a batch mindset to a streaming mindset can be tricky, though, so let's start small and build up.

From Enormous Data back to Big Data

Say you are tasked with building an analytics application that must process around 1 billion events (1,000,000,000) a day. While this might feel far-fetched at first, due to the sheer size of the data, it often helps to step back and think about the intention of the application (what does it do?) and what you are processing (what does the data look like?). Ask yourself whether the event data can be broken down (divided and partitioned) and processed in parallel as a streaming operation (aka in-stream), or whether you must process things in series, across multiple steps. In either case, if you modify the perspective of the application to look at bounded windows of time, you now only need to create an application that can ingest and process a mere 11.5 thousand (k) events a second (or around 695k events a minute if the event stream is constant), which is an easier number to rationalize.
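If you want to sanity-check that arithmetic yourself, here is a minimal back-of-the-envelope sketch (plain Scala, nothing framework-specific):

// Back-of-the-envelope: convert a daily event total into per-minute and per-second rates.
object EventRates {
  def main(args: Array[String]): Unit = {
    val eventsPerDay: Long = 1000000000L // 1 billion events

    val eventsPerMinute = eventsPerDay / (24 * 60)      // ~694,444 (~695k) events per minute
    val eventsPerSecond = eventsPerDay / (24 * 60 * 60) // ~11,574 (~11.5k) events per second

    println(s"per minute: $eventsPerMinute, per second: $eventsPerSecond")
  }
}

The point is not the exact numbers but the change in perspective: a daunting daily total becomes a per-second rate that a well-partitioned streaming job can keep up with.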

While these numbers may still seem out of reach, this is where distributed stream processing can really shine. Essentially, you are reducing the perspective, or scope, of the problem, to accomplish a goal over time, across a partitioned data set. While not all problems can be handled in-stream, a surprising number of problems do lend themselves to this processing pattern.

Note: This chapter is part of my book "Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications". The book takes you on a journey from simple scripting, to composing applications, and finally to deploying and monitoring your mission-critical Apache Spark applications.

What You Will Learn in This Chapter

This chapter acts as a gentle introduction to stream processing, making room for us to jump directly into building our own end-to-end Structured Streaming application in Chapter 10 without needing to backtrack and discuss much of the theory behind the decision-making process.

By the end of the chapter, you should understand the following (at a high level):

  1. How to Reduce Streaming Data Problems into Data Problems over Time
  2. The Trouble with Time, Timestamps, and Event Perspective
  3. The Different Processing Modes for Shifting from a Batch to Streaming Mental Model

Stream Processing

Streaming data is not stationary. In fact, you can think of it as being alive (if even for a short while). This is because streaming data encapsulates the now: it records events and actions as they occur, in flight. Let's look at a practical, albeit theoretical, example that begins with a simple event stream of sensor data. Fix into your mind's eye the last parking lot (or parking garage) you visited.

Use Case: Real-Time Parking Availability

Parking is a Nightmare: The problem with most parking infrastructure, or a common pain point for the customer, is more often than not finding an available spot while still being able to get places on time. Photo via Unsplash and @ryansearle

Imagine you just found a parking spot, all thanks to some helpful signs that pointed you to an open space. Now let's say that this was all because of the data being emitted from a connected network of local parking sensors, each operating with the sole purpose of identifying the number of parking spaces available at that precise moment in time.

This is a real-time data problem where the real-time accuracy is both measurable and physically noticeable by a user of the parking structure. Enabling these capabilities all began with the declaration of the system scenario.

Product Pitch: “We’d like to create a system that keeps track of the status of all available parking spaces, that identifies when a car parks, how long the car remains in a given spot, and lastly this process should be automated as much as possible”

Optimizing a system like this can begin with a simple sensor located in each parking spot (associated with a sensor.id / spot.id reference). Each sensor would be responsible for emitting data in the form of an event with a spot identifier, timestamp, and simple bit (0 or 1), to denote if a spot is empty or occupied. This data can then be encoded into a compact message format, like the example from Listing 9–1, and be efficiently sent periodically from each parking spot.

Listing 9–1. An example sensor event (encapsulated in the Google Protocol Buffer message format) is shown for clarity.

message ParkingSensorStatus {
  uint32 sensor_id = 1;
  uint32 space_id = 2;
  uint64 timestamp = 3;
  bool available = 4;
}

During the normal flow of traffic throughout the day, the state reported by each sensor (the availability of a parking spot) would flip on or off (binary states) as cars arrive at, or leave, each spot. This behavior is unpredictable due to the dynamic schedule of each individual driver, but patterns always emerge at scale.

Using the real-time state provided by the collected sensor data, it is entirely feasible to build real-time, real-life (IRL) "reporting" that updates drivers on the active state of the parking structure: is the parking infrastructure full or not, and if it isn't full, how many spots are currently available in the garage.
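Since the rest of the book builds toward Spark Structured Streaming, here is a minimal sketch of what that running count might look like. Everything here is an assumption for illustration: the Kafka topic name (parking-sensor-status), the broker address, the console sink, and, for simplicity, JSON-encoded events rather than the protobuf encoding from Listing 9–1.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch: a running count of available parking spots from a stream of sensor events.
// Topic name, broker address, JSON encoding, and console sink are illustrative assumptions.
object ParkingAvailability {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("parking-availability")
      .getOrCreate()

    // Mirror of the fields in Listing 9-1 (assuming JSON for simplicity).
    val schema = new StructType()
      .add("sensor_id", IntegerType)
      .add("space_id", IntegerType)
      .add("timestamp", LongType) // epoch milliseconds
      .add("available", BooleanType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "parking-sensor-status")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("event"))
      .select("event.*")

    // Count how many sensors reported an open spot within each one-minute window.
    // (A production job would also deduplicate to the latest reading per space.)
    val availableSpots = events
      .withColumn("event_time", (col("timestamp") / 1000).cast("timestamp"))
      .withWatermark("event_time", "2 minutes")
      .groupBy(window(col("event_time"), "1 minute"))
      .agg(sum(when(col("available"), 1).otherwise(0)).alias("available_spots"))

    availableSpots.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}

The key idea is simply a bounded window over an unbounded stream: the garage signage only needs the most recent window's answer, not the full history.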

What the Sensor Data Achieves

This data can help to automate the human decision-making process for drivers and could even be made available online, through a simple web service, for real-time status tracking, since ultimately drivers just want to park already and not waste time! Additionally, this data can be used to track when each sensor last checked in (refreshed), which helps diagnose faulty sensors and even track how often sensors go offline or fail.
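As a small, hedged example of that second point, a periodic health check could flag any sensor that has gone quiet. The table name and the 10-minute threshold below are illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch: flag sensors that have not reported recently (possible faults or outages).
object SensorHealthCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sensor-health-check").getOrCreate()

    // Latest check-in per sensor, from an assumed table of historic sensor events.
    val lastSeen = spark.table("parking_sensor_events")
      .groupBy("sensor_id")
      .agg(max(col("timestamp")).alias("last_seen_ms"))

    // Anything quiet for more than 10 minutes becomes a candidate for a faulty sensor.
    val cutoffMs = unix_timestamp(current_timestamp()) * 1000 - lit(10 * 60 * 1000)
    val staleSensors = lastSeen.where(col("last_seen_ms") < cutoffMs)

    staleSensors.show(truncate = false)
  }
}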

Nowadays, more technologically advanced garages even go so far as to direct the driver (via directional signs and cues) to the available spots within the structure. This acts to reduce both traffic and congestion within the garage, which in turn raises customer satisfaction, all by simply capturing a live stream of sensor data and processing it in near-real-time.

Surge Pricing and Data Driven Decision Making

Given the temporal (timestamp) information gathered from these streams of sensor events, a savvy garage operation could even use prior trends to increase or decrease daily or hourly prices based on the demand for parking spots with respect to current availability (the number of spots left) in real time. By optimizing the pricing (within realistic limits), an operator could find the threshold where the price per hour or per day fills the garage more often than not. In other words, "at what price will most people park, so that spots don't go unused?"
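To make the idea concrete, here is one naive way such a rule could look; the rates and the linear interpolation are purely hypothetical, not something prescribed by the chapter:

// Hypothetical dynamic-pricing rule: nudge the hourly rate up as the garage fills
// and back down as it empties, within fixed bounds. All numbers are illustrative.
object SurgePricing {
  val minRate = 1.00 // dollars per hour when the garage is empty
  val maxRate = 6.00 // dollars per hour when the garage is full

  def hourlyRate(availableSpots: Int, totalSpots: Int): Double = {
    val occupancy = 1.0 - (availableSpots.toDouble / totalSpots)
    // Linear interpolation between the minimum and maximum rate based on occupancy.
    minRate + (maxRate - minRate) * occupancy
  }

  def main(args: Array[String]): Unit = {
    println(hourlyRate(availableSpots = 180, totalSpots = 200)) // mostly empty -> cheaper
    println(hourlyRate(availableSpots = 10, totalSpots = 200))  // nearly full -> pricier
  }
}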

This is an example of an optimization problem that stems from the collection of real-time sensor data. It is becoming more common for organizations to look at how they can reuse data to solve multiple problems at the same time. Internet of Things (IoT) use cases are just one of the many possible sources of streaming data you could be working with when writing streaming applications.

Earlier in the book we discussed "creating a system that could take information about Coffee store occupancy, which would inform folk what shop nearest to them has seating for a party of their size". At that point in the story we simply created a synthetic table that could be joined to showcase the example, but this is another problem that can be solved with sensors, or with something as simple as a check-in system, that emits relevant event data to be passed reliably downstream via our friend the streaming data pipeline.

Both examples discussed here (parking infrastructure and coffee empire expansion) employ basic analytics (statistics) and can benefit from simple machine learning to uncover new patterns of behavior that lead to more optimal operations. Before we get too far ahead of ourselves, let's take a short break to dive deeper into the capabilities streaming data networks provide.

Time Series Data and Event Streams

Moving from a stationary data mindset (a fixed view or moment in time) to one that interprets data as it flows over time (streams of unbounded data across many views and moments in time) is an exercise in perspective, and one that can be challenging to adopt at first. Often when you think about streaming systems, the notion of streams of continuous events bubbles to the surface. This is one of the more common use cases and serves as a gentle introduction to the concept of streaming data. Take, for example, the abstract time series shown in Figure 9–1.

Figure 9–1: Time is unbounded, but how we perceive it is bounded to a scope: a contiguous line with windows over time (w1) broken into finite moments (t1->t4). Events occur at precise moments of time and can be collected and processed individually (t1->t4), or aggregated across windows of time (w1). Image Credit: Author (Scott Haines)

As you can see, data itself exists across various states depending on the perspective, or vantage point, applied by a given system (or application). Each event (T1->T4) individually understands only what has occurred within its narrow pane of reference; to put that differently, events capture a limited (relative) perspective of time. When a series of events is processed together in a bounded collection (window), then you have a series of data points (events) that encapsulate either fully realized ideas or partially realized ideas. When you zoom out and look at the entire timeline, you can paint a more accurate story of what happened from the first event to the last.

Let’s take this idea one step further.

Do Events Stand Alone?

Consider this simple truth. Your event data exists as a complete idea, or as partial ideas or thoughts. I have found that thinking of data as a story over time helps to give life to these bytes of data. Each data point is therefore responsible for helping to compose a complete story, as a series of interwoven ideas and thoughts that assemble or materialize over time.

Data composition is a useful lens to look through as you work on adopting a distributed view of data. I also find it lends itself well to building up and defining new distributed data models, as well as to working on real-world data networks (fabrics) at scale. Viewed as a composition, these events come together to tell a specific story, whose event-based breadcrumbs can inform you of the order in which something came to be, and this is greatly enhanced by the timestamp of each occurrence. Events without time paint a flat view of how something occurred, while the addition of time grants you the notion of momentum or speed, or of a slowing down and stretching of the time between events or across a full series of data points. Understanding the behavior of the data flowing through the many pipelines and data channels is essential to data operations and requires reliable monitoring to keep data flowing at optimal speeds.

Let’s look at a use case where the dimension of time helps paint a better story of a real-world scenario.

Use Case: Tracking Customer Satisfaction

A quiet coffee shop pouring love with every cup. Photo by Nafinia Putra on Unsplash

Put yourself in the shoes of a data engineer working with the data applications feature teams at a fictional coffee empire named "CoffeeCo". The conversation is about which data paints a good story of customer satisfaction over time (time series analysis).

What if I told you two customers came into our coffee shop, ordered drinks, and left the store with their drinks? You might ask me why I bothered to tell you that, since that is what happens in coffee shops. What if I told you that the two coffee orders were made around the same time, and that the first customer in the story was in and out of the coffee shop in under five minutes? What if I told you it was a weekday, and this story took place during the morning rush hour? What if I told you that the second customer, who happened to be next in line (right after the first customer), was in the coffee shop for thirty minutes? You might ask if the customer stayed to read the paper or maybe use the facilities. Both are valid questions.

If I told you that the second customer was waiting around because of an error that occurred between steps 3 and 4 of a four-step coffee pipeline, then we'd have a better understanding of how to streamline the customer experience in the future.

The four steps are:

1. Customer Orders: {customer.order:initialized}
2. Payment Made: {customer.order:payment:processed}
3. Order Queued: {customer.order:queued}
4. Order Fulfilled: {customer.order:fulfilled}

Whether the error was in the automation, or because of a breakdown in the real-world system (printer jam, barista missed an order, or any other reason), the result here is that the customer needed to step in (human in the loop) and inform the operation (coffee pipeline) that “it appears that someone forgot to make my drink”.

At this point the discussion could turn towards how to handle the customer's emotional response, which could swing widely across both positive and negative reactions: from happy to help (1), to mild frustration (4), all the way to outright anger (10) at the delay and breakdown of the coffee pipeline. But by walking through a hypothetical use case, we are all now more familiar with how the art of capturing good data can be leveraged for all kinds of things.

The Event Time, Order of Events Captured, and the Delay Between Events All Tell a Story

Without knowing how much time elapsed from the first event (customer.order:initialized) until the terminal event (customer.order:fulfilled), or how long each step typically takes to accomplish, we'd have no way to score the experience or really understand what happened, essentially creating a blind spot to abnormal delays or faults in the system. It pays to know the statistics (average, median, and 99th percentile) of the time a customer typically waits for a variable-sized order, as these historic data points can be used via automation to step in and fix a problem preemptively when, for example, an order is taking longer than expected. It can literally mean the difference between an annoyed customer and a lifetime customer.
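As a rough sketch of how those statistics might be computed with Spark, assume a table of historic order events with order_id, event_type, and timestamp (epoch milliseconds) columns; the table and column names are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch: measure how long orders take from initialized to fulfilled,
// then summarize the wait times (average, median, 99th percentile).
object OrderLatencyStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("order-latency-stats").getOrCreate()

    val events = spark.table("coffee_order_events") // order_id, event_type, timestamp (epoch ms)

    // One row per order: when it started and when it was fulfilled.
    val latencies = events
      .groupBy("order_id")
      .agg(
        min(when(col("event_type") === "customer.order:initialized", col("timestamp"))).alias("started"),
        max(when(col("event_type") === "customer.order:fulfilled", col("timestamp"))).alias("fulfilled")
      )
      .where(col("started").isNotNull && col("fulfilled").isNotNull)
      .withColumn("wait_seconds", (col("fulfilled") - col("started")) / 1000)

    latencies.select(
      avg("wait_seconds").alias("avg_wait"),
      expr("approx_percentile(wait_seconds, 0.5)").alias("median_wait"),
      expr("approx_percentile(wait_seconds, 0.99)").alias("p99_wait")
    ).show()
  }
}

With baselines like these in hand, an automated check can step in whenever a live order drifts past, say, the 99th percentile.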

This kind of insight is one of the big reasons why companies solicit feedback from their customers: a thumbs up or thumbs down on an experience, rewards for application-based participation (spend your points on free goods and services), and real-time feedback like "your order is taking longer than expected, here is $2 off your next coffee. Just use the app to redeem". This data, collected and captured through real-world interactions, encoded as events, and processed for your benefit, is worth it in the end if it positively affects the operations and reputation of the company. Just be sure to follow data privacy rules and regulations, and ultimately don't creep out your customers.

This little thought experiment was intended to shed light on the fact that the details captured within your event data (as well as the lineage of the data story over time) can be a game changer and furthermore that time is the dimension that gives these journeys momentum or speed. There is just one problem with time.

The Trouble with Time

While events occur at precise moments in time, the trouble with time is that it is also subject to the problems of time and space (location). Einstein used his theory of relativity to explain this problem on a cosmic scale, but it is a problem on a more localized scale as well. For example, I have family living in different parts of the United States. It can be difficult to coordinate a time where everyone's schedules sync up. This happens for simple events like catching up with everyone over video (remotely) or meeting up in the real world for reunions (locally). Even when everything is coordinated, people have a habit of just running a little bit late.

Zooming out from the perspective of my family, or people in general, with respect to the central coordination of events, you will start to see that the problem isn't just an issue of synchronization across time zones (east, central, or west coast); if you look closer, you can see that time, relative to our local / physical space, is subject to some amount of temporal drift or clock skew.

Take the modern digital clock. It runs as a process on your smartphone, watch, or any number of "smart" connected devices. What remains constant is that time stays largely in sync (even if the drift is on the order of milliseconds). Many people still have analog, non-digital clocks. These devices run the full spectrum from incredibly accurate, in the case of high-end watches ("timepieces"), to cheap clocks that sometimes need to be reset every few days.

The bottom line here is that it is rare for two systems to agree on the precise time, in much the same way that two or more people have trouble coordinating in both time and space. Therefore, a central reference (point of view) must be used to synchronize the time across systems running in many time zones.

Correcting Time

Servers running in modern cloud infrastructure use the Network Time Protocol (NTP) to correct the problem of time drift. The ntp process is charged with synchronizing the local server clock against a reliable central time server, correcting the local time to within a few milliseconds of Coordinated Universal Time (UTC). This is an important concept to keep in mind, since an application running within a large network, producing event data, will be responsible for creating timestamps, and these timestamps need to be precise in order for distributed events to line up. There is also the sneaky problem of daylight saving time (gaining or losing an hour every six months), so coordinating data from systems across time zones, as well as across local datetime semantics (globally), requires time to be viewed from this central, synchronized perspective.
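A small illustration of that last point, using only the standard java.time API: record event time as a zone-independent instant (epoch milliseconds), and only render it into local time zones at the edges. The zone names below are arbitrary examples.

import java.time.{Instant, ZoneId, ZonedDateTime}

// Why producers record UTC / epoch-based timestamps: the same instant rendered
// in two local time zones looks different, but the epoch milliseconds agree.
object EventTimestamps {
  def main(args: Array[String]): Unit = {
    val now: Instant = Instant.now()          // a single, zone-independent instant
    val epochMillis: Long = now.toEpochMilli  // what you would put in the event (e.g., Listing 9-1)

    val losAngeles = ZonedDateTime.ofInstant(now, ZoneId.of("America/Los_Angeles"))
    val newYork = ZonedDateTime.ofInstant(now, ZoneId.of("America/New_York"))

    println(s"epoch millis: $epochMillis")
    println(s"Los Angeles : $losAngeles") // different wall-clock rendering...
    println(s"New York    : $newYork")    // ...of the exact same moment in time
  }
}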

We've looked at time as it theoretically relates to event-based data, but to round out the background we should also look at time as it relates to the priority with which data needs to be captured and processed within a system (streaming or otherwise).

Priority Ordered Event Processing Patterns

You may be familiar with the phrase "time is of the essence". It is a way of saying something is important and a top priority; the speed to resolution matters. This sense of priority can be used as an instrument, or defining metric, to make the case for real-time, near-real-time, batch, or eventual (on-demand) processing when processing critical data. These four processing patterns each handle time in a different way by creating a narrow or wide focus on the data problem at hand. The scope here is based on the speed at which a process must complete, which in turn limits the complexity of the job as a factor of time. Think of these styles of processing as being deadline driven: there is only a certain amount of time in which to complete an action.

Real-Time Processing

The expectation of real-time systems is that the end-to-end latency, from the time an upstream system emits an event until the time that event is processed and available to be used for analytics and insights, is in the range of milliseconds to low seconds. These events are emitted (written) directly to an event stream processing service, like Apache Kafka, which under normal circumstances enables listeners (consumers) to use an event immediately once it is written. There are many typical use cases for true real-time systems, including logistics (like the parking space example, as well as finding a table at a coffee shop), and then processes that impact a business on a whole new level, like fraud detection, active network intrusion detection, or other bad-actor detection, where a longer mean time to detection (average milliseconds or seconds to detect) can lead to devastating consequences, reputational, financial, or both.

Given that answering tough problems requires time, real-time decision-making requires a performant, pre-computed, or low-latency answer to the questions it will ask. This really is pure in-memory stream processing. For other systems, it is more than acceptable to run in near real-time.

Near Real-Time Processing

Near real-time is what most people think of when they consider real-time. A similar pattern occurs here as the one you just read about under Real-Time; the only difference is that the expectations around end-to-end latency are relaxed to anywhere from a high number of seconds to a handful of minutes. For most systems, there is no real reason to react immediately to every event as it arrives, so while time is still of the essence, the SLA for data availability is extended.

Operational dashboards and metric systems that are kept up to date (refreshing graphs and checking monitors every 30 seconds to 5 minutes) are usually fast enough to catch problems and give a close representation of the world. For all other data systems, you have the notion of batch or on-demand processing.

Batch Processing

We covered batch processing and recurring scheduling in the last two chapters, but for clarity: having periodic jobs that push data from a reliable source of truth (a data lake or database) into other connected systems has been, and continues to be, how much of the world's data is processed.

The simple reason for this is cost, which factors down to both the cost of operations and the human cost of maintaining large streaming systems.

Streaming systems demand full-time access to a variable number of resources, from CPUs and GPUs to network IO and RAM, with the expectation that these resources won't be scarce, since delays (blockage) in stream processing can pile up quickly. Batch, on the other hand, can be easier to maintain in the long run, assuming the consumers of the data understand that there will always be a gap from the time data is first emitted upstream until the data becomes available for use downstream.

The last consideration to keep in mind is on-demand processing (or just-in-time processing).

On-Demand or Just-In-Time Processing

Let's face it: some questions (aka queries) are asked only rarely, or in a way that just doesn't suit any predefined pattern.

For example, custom reporting jobs and exploratory data analysis are two styles of data access that lend themselves nicely to this paradigm. Most of the time, the backing data needed to answer these queries is loaded directly from the data lake and then processed using shared compute resources or isolated compute clusters. The data made available for these queries may be the by-product of other real-time or near-real-time systems, processed and stored for batch or historic analysis.

Using this pattern, data can be defrosted and loaded on demand by importing records from slower commodity object storage like Amazon S3 into memory or onto fast-access solid-state drives (SSDs), or, depending on the size, format, and layout of the data, it can be queried directly from the cloud object store. This pattern can easily be delegated to Apache Spark using Spark SQL, which enables ad-hoc analysis via tools like Apache Zeppelin, or directly in-app through JDBC bindings using the Apache Spark Thrift Server and the Apache Hive Metastore.
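Here is a minimal sketch of that defrost-and-query flow; the S3 bucket path, table name, and columns are illustrative assumptions carried over from the parking example:

import org.apache.spark.sql.SparkSession

// Sketch: an on-demand (just-in-time) query that reads historic events straight
// from object storage and answers an ad-hoc question with Spark SQL.
object AdHocParkingReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("adhoc-parking-report").getOrCreate()

    // "Defrost" the cold data: load Parquet files directly from S3 into a temp view.
    spark.read
      .parquet("s3a://example-bucket/warehouse/parking_sensor_events/")
      .createOrReplaceTempView("parking_sensor_events")

    // Ad-hoc question: how does spot availability vary by hour of day?
    spark.sql(
      """
        |SELECT hour(from_unixtime(timestamp / 1000)) AS hour_of_day,
        |       sum(CASE WHEN available THEN 1 ELSE 0 END) AS available_readings,
        |       count(*) AS total_readings
        |FROM parking_sensor_events
        |GROUP BY hour(from_unixtime(timestamp / 1000))
        |ORDER BY hour_of_day
      """.stripMargin
    ).show(24)
  }
}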

The differentiator between these four flavors of processing is time.

Circling back to the notion of views and perspective, each approach or pattern has its time and place. Stream processing deals with events captured at specific moments in time, and as we've discussed during the first half of this chapter, how we associate time and how we capture and measure a series of events (as data) all come together to paint a picture of what is happening now, or what has happened in the past. As we move through this gentle introduction to stream processing, it is important to also talk about the foundations of stream processing. In the next section, we'll walk through some of the common problems and solutions for dealing with continuous, unbounded streams of data. It therefore only makes sense to discuss data as a central pillar and expand outward from there.
