Data Observability for Analytics and ML teams

Principles, practices, and examples for ensuring high quality data flows

Sam Stone
Towards Data Science


Nearly 100% of companies today rely on data to power business opportunities, and 76% use data as an integral part of forming a business strategy. In today’s age of digital business, more and more of the decisions companies make about delivering customer experience, building trust, and shaping strategy begin with accurate data. Poor data quality not only makes it difficult for companies to understand what customers want; it turns decision-making into a guessing game when it doesn’t have to be. Data quality is critical to delivering good customer experiences.

Data observability is a set of principles that can be implemented in tools to ensure data is accurate, up-to-date, and complete. If you’re looking to improve data quality at your organization, here is why data observability may be your answer and how to implement it.

How to know if you need data observability

Data observability is increasingly necessary, especially as traditional approaches to software monitoring fall short for high-volume, high-variety data. Unit tests, which assess small pieces of code for performance on discrete, deterministic tasks, get overwhelmed by the variety of acceptable shapes and values that real-world data can take. For example, a unit test can verify that a column intended to be a boolean is indeed a boolean, but it says nothing about whether the share of “true” values in that column shifted dramatically, or even subtly, from one day to the next. Alternatively, end-to-end tests, which assess a full system stretching across repos and services, get overwhelmed by the cross-team complexity of dynamic data pipelines. Unit tests and end-to-end tests are necessary but insufficient to ensure high data quality in organizations with complex data needs and complex tables.
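To make that gap concrete, here is a minimal sketch (with a hypothetical is_active column) of a type-level unit test that passes on two daily snapshots even though the share of true values has shifted dramatically between them:

```python
import pandas as pd

# Two daily snapshots of the same hypothetical table
day_1 = pd.DataFrame({"is_active": [True] * 90 + [False] * 10})
day_2 = pd.DataFrame({"is_active": [True] * 55 + [False] * 45})

def test_is_active_is_boolean(df: pd.DataFrame) -> None:
    # A typical unit test: the column exists and has the expected dtype
    assert df["is_active"].dtype == bool

# Both snapshots pass the unit test...
test_is_active_is_boolean(day_1)
test_is_active_is_boolean(day_2)

# ...even though the share of True values fell from 90% to 55% overnight,
# exactly the kind of shift a distribution check would flag.
print(day_1["is_active"].mean(), day_2["is_active"].mean())  # 0.9 0.55
```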

There are three main signs your organization needs data observability — and it’s not only related to ML:

  • Upstream data changes regularly break downstream applications, despite upstream teams’ prophylactic efforts
  • Data issues are regularly discovered by customers (internal or external) rather than the team that owns the table in question
  • You’re moving towards a centralized data team

I’ve worked at Opendoor — an e-commerce platform for residential real estate transactions and a large buyer and seller of homes — for the past four years, and the data we use to assess home values is rich but often self-contradicting. We use hundreds of data feeds and maintain thousands of tables — including public data, third-party data, and proprietary data — which often disagree with one another. For instance, a home may have one square footage on a recent MLS listing and a different one on its public tax assessment: homeowners may have stated the highest possible square footage when selling the home, but the lowest possible area when dealing with tax authorities. Getting to the “ground truth” is not always easy, but we improve data accuracy by synthesizing across multiple sources — and that’s where data observability comes in.

Home data example, highlighting source system disagreements. Source: Opendoor, with permission.
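As a simplified illustration of that kind of synthesis (with hypothetical column names, not Opendoor’s actual pipeline), the sketch below reconciles disagreeing square-footage values by taking the median across available sources and tracking how far the sources diverge:

```python
import pandas as pd

# Hypothetical per-source square footage for the same homes; NaN means the source is missing
homes = pd.DataFrame({
    "home_id": [1, 2, 3],
    "sqft_mls": [2100.0, None, 1550.0],
    "sqft_tax_assessment": [1950.0, 1800.0, None],
    "sqft_prior_listing": [2080.0, 1790.0, 1600.0],
})
source_cols = ["sqft_mls", "sqft_tax_assessment", "sqft_prior_listing"]

# One simple synthesis rule: take the median of whatever sources are available,
# and record how far the sources disagree so large conflicts can be reviewed.
homes["sqft_synthesized"] = homes[source_cols].median(axis=1)
homes["sqft_source_spread"] = homes[source_cols].max(axis=1) - homes[source_cols].min(axis=1)

print(homes[["home_id", "sqft_synthesized", "sqft_source_spread"]])
```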

Define a healthy table

Data observability, put simply, means applying frameworks that quantify the health of dynamic tables. To check if the rows and columns of your table are what you expect them to be, consider these factors and questions:

Rows

  • Freshness — when was the data last updated?
  • Volume — how many rows were added or updated recently?
  • Duplicates — are any rows redundant?

Columns:

  • Schema — are all the columns you expect present (and are there columns you don’t expect)?
  • Distributions — how have statistics that describe the data changed?

Freshness, volume, duplicate, and schema checks are all relatively easy to implement with deterministic checks (that is, if you expect the shape of your data to be stable over time).

Alternatively, you can assess these with simple time-series models that adjust the deterministic check parameters over time, if the shape of your data changes in a gradual and predictable way. For example, if you’re growing customer volume by X%, you can set the row-volume check to an acceptable window that moves up over time in line with X. At Opendoor, we know that very few real estate transactions occur on holidays, so we’ve been able to set rules that adjust alerting windows on those days.
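Here is a minimal sketch of what such checks can look like, with illustrative parameters: a deterministic freshness check, plus a volume check whose acceptable window drifts upward with an assumed growth rate and relaxes on holidays.

```python
from datetime import date, timedelta

# Illustrative parameters for a hypothetical daily-updated table
EXPECTED_BASE_ROWS = 10_000   # expected rows per day at the start of the period
DAILY_GROWTH_RATE = 0.001     # assumed ~0.1% growth per day
VOLUME_TOLERANCE = 0.30       # accept +/- 30% around the expected volume
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}

def check_freshness(last_updated: date, today: date, max_staleness_days: int = 1) -> bool:
    """Deterministic freshness check: was the table updated recently enough?"""
    return (today - last_updated) <= timedelta(days=max_staleness_days)

def check_volume(rows_added: int, today: date, start: date = date(2024, 1, 1)) -> bool:
    """Volume check whose expected value drifts up with growth and widens on holidays."""
    days_elapsed = (today - start).days
    expected = EXPECTED_BASE_ROWS * (1 + DAILY_GROWTH_RATE) ** days_elapsed
    tolerance = VOLUME_TOLERANCE * (3 if today in HOLIDAYS else 1)  # relax the window on holidays
    return abs(rows_added - expected) <= tolerance * expected

# Example: 9,200 rows arrived on a non-holiday, 30 days into the period
print(check_freshness(last_updated=date(2024, 1, 31), today=date(2024, 1, 31)))  # True
print(check_volume(rows_added=9_200, today=date(2024, 1, 31)))                   # True
```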

Column distribution checks are where most of the complexity, and most of the focus, ends up. They tend to be the hardest to get right, but they provide the highest reward when done well (a sketch follows the list below). Types of column distribution checks include the following:

  • Numerical — mean, median, Xth percentile, …
  • Categorical — column cardinality, most common value, 2nd most common value, …
  • Percent null
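
The sketch below shows one way to compute these statistics for a hypothetical listings table and flag the ones that have drifted too far from a baseline snapshot; the column names, tolerance, and statistics tracked are illustrative:

```python
import pandas as pd

def column_profile(df: pd.DataFrame, numeric_col: str, categorical_col: str) -> dict:
    """Compute the column-level statistics worth tracking from day to day."""
    return {
        "p25": df[numeric_col].quantile(0.25),
        "p50": df[numeric_col].quantile(0.50),
        "p75": df[numeric_col].quantile(0.75),
        "pct_null": df[numeric_col].isna().mean(),
        "cardinality": df[categorical_col].nunique(),
        "most_common": df[categorical_col].mode().iloc[0],
    }

def drifted_stats(baseline: dict, current: dict, rel_tol: float = 0.10) -> list:
    """Flag statistics that moved more than rel_tol (or changed at all, for categoricals)."""
    flagged = []
    for key in ("p25", "p50", "p75", "pct_null", "cardinality"):
        base, cur = baseline[key], current[key]
        if base != 0 and abs(cur - base) / abs(base) > rel_tol:
            flagged.append(key)
    if baseline["most_common"] != current["most_common"]:
        flagged.append("most_common")
    return flagged

# Hypothetical baseline snapshot vs. today's snapshot of a listings table
baseline_df = pd.DataFrame({"sqft": [900, 1500, 2100, None], "home_type": ["sfr", "sfr", "condo", "sfr"]})
today_df = pd.DataFrame({"sqft": [1400, 2300, 3200, 3800], "home_type": ["condo", "condo", "sfr", "condo"]})

baseline = column_profile(baseline_df, "sqft", "home_type")
today = column_profile(today_df, "sqft", "home_type")
print(drifted_stats(baseline, today))  # ['p25', 'p50', 'p75', 'pct_null', 'most_common']
```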

When your tables are healthy, analytics and product teams can be confident that downstream uses and data-driven insights are solid and that they are building on a reliable foundation. When tables are not healthy, all downstream applications require a critical eye.

Anomaly detection

Having a framework for data health is a helpful first step, but it’s critical to be able to turn that framework into code that runs reliably, generates useful alerts, and is easy to configure and maintain. Here are several things to consider as you go from data quality abstractions to launching a live anomaly detection system:

  • Detection logic: If it’s easy to define in advance what constitutes row- or column-level violations, a system focused on deterministic checks (where the developer manually writes those out) is probably best. If you know an anomaly when you see it (but can’t describe it in advance via deterministic rules), then a system focused on probabilistic detection is likely better (see the sketch after this list). The same is true if the number of key tables requiring checks is so great that manually writing out the logic is infeasible.
  • Integrations: Your system should integrate with the core systems you already have, including databases, alerting (e.g., PagerDuty), and — if you have one — a data catalog (e.g., SelectStar).
  • Cost: If you have a small eng team but budget is no barrier, skew towards a third-party solution. If you have a small budget but a large engineering team — and highly unique needs — skew towards a first-party solution built in-house.
  • Data types: Anomaly detection looks different depending on if the data is structured, semi-structured, or unstructured, so it’s important to know what you’re working with.
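
To make the deterministic-versus-probabilistic distinction concrete, here is a minimal sketch of a probabilistic check: instead of a hand-written rule, the acceptable range is inferred from the metric’s trailing history (the window length and threshold are illustrative).

```python
import statistics

def is_anomalous(history: list, current: float, z_threshold: float = 3.0) -> bool:
    """Probabilistic check: flag values that sit far outside the trailing distribution.

    No acceptable range is written out by hand; it is inferred from recent history.
    """
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return current != mean
    return abs(current - mean) / std > z_threshold

# Daily row counts for the past two weeks (hypothetical), then two candidate values for today
history = [10_120, 9_980, 10_340, 10_050, 9_870, 10_210, 10_400,
           10_150, 9_930, 10_280, 10_090, 10_310, 9_960, 10_180]
print(is_anomalous(history, 10_240))  # False: well within the usual range
print(is_anomalous(history, 6_100))   # True: worth an alert
```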

When it comes to detecting anomalies in unstructured data (e.g., text, images, video, audio), it’s difficult to calculate meaningful column-level descriptive statistics. Unstructured data is high dimensional — for instance, even a small 100x100 pixel image has 30,000 values (10,000 pixels x three colors). Rather than checking for shifts across tens of thousands of raw pixel-value columns, you can instead translate each image into a small number of dimensions and apply column-level checks to those. This dimensionality-reduction process is called embedding the data, and it can be applied to any unstructured data format.

Here’s an example we’ve encountered at Opendoor: we receive 100,000 images on Day 1, and 20% are labeled “is_kitchen_image=True”. The next day, we receive 100,000 images and 50% are labeled “is_kitchen_image=False”, meaning the share of kitchen images jumped from 20% to 50%. That’s possibly correct, but a distributional shift of that size should definitely trigger an anomaly alert!

If your team is focused on unstructured data, consider anomaly detection that has built-in embeddings support.
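
Here is a minimal sketch of that idea, assuming a hypothetical embed_image() step has already mapped each day’s images to fixed-length vectors (for example, via the penultimate layer of a pretrained vision model); ordinary column-level checks are then applied per embedding dimension.

```python
import numpy as np

def drifted_dimensions(day_1: np.ndarray, day_2: np.ndarray, shift_threshold: float = 0.5) -> np.ndarray:
    """Compare per-dimension means of two batches of image embeddings.

    day_1, day_2: arrays of shape (n_images, embedding_dim), as produced by applying a
    hypothetical embed_image() to each day's images. Returns the embedding dimensions
    whose mean moved by more than shift_threshold day-1 standard deviations.
    """
    mean_1, mean_2 = day_1.mean(axis=0), day_2.mean(axis=0)
    std_1 = day_1.std(axis=0) + 1e-9  # guard against division by zero
    shift = np.abs(mean_2 - mean_1) / std_1
    return np.where(shift > shift_threshold)[0]

# Hypothetical embeddings: 1,000 images per day, 64 dimensions each
rng = np.random.default_rng(0)
day_1 = rng.normal(0.0, 1.0, size=(1000, 64))
day_2 = rng.normal(0.0, 1.0, size=(1000, 64))
day_2[:, 7] += 2.0  # simulate a real shift in one dimension (e.g., far more kitchen photos)

print(drifted_dimensions(day_1, day_2))  # [7]
```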

Automated data catalogs

Automating your data catalog makes data more accessible to developers, analysts, and non-technical teammates, which leads to better, data-driven decision-making. As you build out your data catalog, here are a few key questions to ask:

Table documentation

  • What does each row represent?
  • What does each column represent?
  • Table ownership — when there is a problem with the table, who in the organization do I call?

Table lineage (code relationships)

  • What tables are upstream? How are they queried or transformed?
  • What tables, dashboards, or reports are downstream?

Real-world use

  • How popular is this table?
  • How are this table and its columns commonly used in queries?
  • Who in my organization uses this table?

At Opendoor, we’ve found that table documentation is challenging to automate, and the key to success has been a clear delineation of responsibility amongst our engineering and analytics teams for filling out these definitions in a well-defined place. On the other hand, we’ve found that automatically detecting table lineage and real-world use (via parsing of SQL code, both code checked into Github and more “ad hoc” SQL powering dashboards) has given us high coverage and accuracy for these pieces of metadata, without the need for manual metadata annotations.
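As a simplified illustration of lineage extraction (not the Opendoor implementation), the sketch below uses the open-source sqlglot parser to pull upstream table references out of a query:

```python
# pip install sqlglot
from sqlglot import exp, parse_one

def upstream_tables(sql: str) -> set:
    """Return the qualified tables referenced by a SQL statement (its upstream lineage)."""
    tables = set()
    for table in parse_one(sql).find_all(exp.Table):
        schema = table.db  # empty string when the table is unqualified
        tables.add(f"{schema}.{table.name}" if schema else table.name)
    return tables

query = """
SELECT l.home_id, l.list_price, a.assessed_value
FROM analytics.listings AS l
JOIN public_records.tax_assessments AS a ON l.home_id = a.home_id
"""
print(upstream_tables(query))  # {'analytics.listings', 'public_records.tax_assessments'}
```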

The result is that people know where to find data, what data to use (and not use) and they better understand what they are using.

An ML-specific strategy

ML data is different when it comes to data observability for two reasons. First, ML code paths are often ripe for subtle bugs. ML systems often have two code paths that do similar but slightly different things: model training, focused on parallel computation and tolerating high latency, and model serving, focused on low latency computation and often done sequentially. These dual code paths present opportunities for bugs to reach serving, especially if testing is focused just on the training path. This challenge can be addressed with two strategies:

  • Assess serving inferences using a “golden set” (or “testing in prod”). Start by assembling a set of inputs where the correct output is known in advance, or at least known within reasonably tight bounds (e.g., a set of homes where Opendoor has high confidence in the sale prices). Next, query your production system for these inputs and compare its outputs with the “ground truth.”
  • Apply distribution checks to serving inputs. Let’s say Opendoor trains a model on data where home square footage has a 25th percentile of 1,000 square feet, a 50th percentile of 2,000 square feet, and a 75th percentile of 3,000 square feet. We would establish bounds based on this distribution — for instance, the 25th percentile should be 1,000 square feet +/- 10% — then collect calls to the serving system in batches and run the checks on each batch (a sketch follows below).
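
Here is a minimal sketch of the second strategy, using the percentile bounds from the example above; the batch contents and tolerance are illustrative.

```python
import numpy as np

# Percentile bounds taken from the training distribution in the example above, +/- 10%
TRAINING_PERCENTILES = {25: 1_000, 50: 2_000, 75: 3_000}  # square feet
TOLERANCE = 0.10

def serving_batch_ok(sqft_inputs: list) -> bool:
    """Check that a batch of serving-time inputs matches the training-time distribution."""
    for pct, expected in TRAINING_PERCENTILES.items():
        observed = np.percentile(sqft_inputs, pct)
        if abs(observed - expected) > TOLERANCE * expected:
            return False
    return True

# Hypothetical batches of square footages collected from serving calls
healthy_batch = [700, 850, 1_000, 1_500, 2_000, 2_500, 3_000, 3_400, 3_900]
drifted_batch = [1_800, 2_200, 2_700, 3_100, 3_500, 3_900, 4_300, 4_800, 5_200]
print(serving_batch_ok(healthy_batch))  # True
print(serving_batch_ok(drifted_batch))  # False: inputs skew much larger than the training data
```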

The other way that ML data differs in terms of data observability is that the “correct” output is not always obvious. Oftentimes, users won’t know what is a bug, or they may not be incentivized to report it. To address this, analytics and ML teams can proactively solicit user feedback, aggregate it, and analyze the trends separately for external users and for internal users and domain experts.

Whether you focus on ML data or your entire data repository, data observability can make your life easier. It helps analytics and ML teams gain insight into system performance and health, improve end-to-end visibility and monitoring across disconnected tools, and quickly identify issues no matter where they come from. As digital businesses continue to evolve, grow, and transform, establishing this healthy foundation will make all the difference.
