A Step-by-Step Guide to Building an Effective Data Quality Strategy from Scratch

How to build an interpretable data quality framework based on user expectations

David Rubio
Towards Data Science


Photo by Rémi Müller on Unsplash

As data engineers, we are (or should be) responsible for the quality of the data we provide. This is nothing new, but every time I join a data project I ask myself the same questions:

  • When should I start working on data quality?
  • How much should I worry about data quality?
  • What aspects of data quality should I focus on?
  • Where do I start?
  • When is my data good enough for consumption?
  • How can I highlight my data quality to stakeholders?

Perfection does not exist, and you do not want to lose the momentum needed to show all the value your data can bring to the business. You need to find a balance between quality and time spent, and answering these questions is key to finding that balance.

The goal of this article is to share a step-by-step guide to get all the answers you need for building an effective data quality strategy that fulfils the needs of the business. This process involves collaboration among stakeholders, product owners and developers, as well as sharing data quality metrics with potential users.

Additionally, I will showcase practical artefacts developed for a data product that would provide data for a marketing campaign reporting tool, demonstrating how the strategy finally translates into business value.

To finish, I will go through how data products within a data-mesh implementation help us share the quality level of our data with users even before they access it.

Let’s start with the first question.

When should I start working on data quality?

I think we all have an inner voice with the answer to the first question: since day zero. Working on and understanding data quality expectations from the beginning is key to ensuring trust and early user adoption. This leads to early feedback, helping us build improvements as we develop. And as data producers, we do not want to end up in a situation where our data’s credibility is damaged by an initial quality issue.

How much should I worry about data quality?

This question is use-case specific. To answer it, your team must understand the nature of the need the data is meant to solve. The starting point is knowing how the data will be used.

We can conduct a session with stakeholders and business owners to gain insights into how they intend to use the data. Through this collaboration, we will set data quality standards that are aligned with the actual needs and expectations of our users.

This is the artefact for our practical example: data consolidation for a marketing campaign reporting tool.

Example of Data Usage Pattern outcome (image by the author)

From this example we got the following (which we can capture as a first structured draft, as sketched after the list):

  • How often and by how many people our data is going to be accessed, so we understand what level of performance we need to provide
  • How complete and accurate our data needs to be, so we understand what type of controls we need to add to our data pipeline. Providing a high level of accuracy and completeness involves checks for uniqueness, completeness and inconsistency management.
  • How fresh our data needs to be, so we understand how often we need to run our transformations to refresh the data
  • When our data needs to be accessible, so we understand the availability we need to provide
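
To make these outcomes tangible inside the team, they can be captured as a first structured draft of requirements. The sketch below is one way of doing that in Python; every name and threshold in it is an illustrative assumption for the marketing campaign reporting example, not a figure taken from the actual session.

```python
# A minimal sketch: Usage Pattern outcomes captured as a structured draft of
# future Service Level Objectives. All values are illustrative assumptions.
usage_pattern_slos = {
    "performance": {
        "expected_consumers": 30,          # analysts using the reporting tool
        "access_frequency": "daily",
    },
    "accuracy_and_completeness": {
        "max_null_rate": 0.01,             # at most 1% missing values in key columns
        "duplicate_records_allowed": False,
    },
    "timeliness": {
        "max_data_age_hours": 6,           # data recalculated at least every 6 hours
    },
    "availability": {
        "accessible_hours": "business hours (Mon-Fri)",
    },
}
```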

Service Level Objectives

The final outcome of this practice is to draw the baseline for our Service Level Objectives. In data quality, a Service Level Objective (SLO) is a specific and measurable goal that defines the expected level of data quality for a particular data service or process. SLOs set quantifiable metrics and thresholds to ensure that data meets predefined quality standards and aligns with the needs and expectations of users and stakeholders.

In our scenario, one of the SLOs we can define is that our data should be recalculated every 6 hours. In case the data is older than this threshold, it does not fulfil this specific SLO.
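
A minimal sketch of how such a timeliness SLO could be checked in code, assuming the data’s last refresh timestamp is available; the function name and threshold constant are only illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_DATA_AGE = timedelta(hours=6)  # the timeliness SLO from our example


def meets_timeliness_slo(last_refresh: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the data is no older than the 6-hour threshold."""
    now = now or datetime.now(timezone.utc)
    return (now - last_refresh) <= MAX_DATA_AGE


# Data refreshed 4 hours ago fulfils the SLO; data refreshed 8 hours ago does not.
print(meets_timeliness_slo(datetime.now(timezone.utc) - timedelta(hours=4)))  # True
print(meets_timeliness_slo(datetime.now(timezone.utc) - timedelta(hours=8)))  # False
```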

What aspects of data quality should I focus on?

Now we are in a position within the team to lower the abstraction level to data quality dimensions. A data quality dimension represents a specific facet of data quality with its own characteristics. Each dimension focuses on a particular aspect of the data and helps identify areas that may require improvement.

Some of these dimensions are:

  • Accuracy: The degree to which data values reflect reality and are free from errors.
  • Completeness: The measure to which all required data elements are present without missing values.
  • Consistency: The level of harmony and conformity of data across different sources or within the same dataset.
  • Timeliness: The measure of how up-to-date the data is.
  • Uniqueness: The degree to which each record is distinct and not duplicated in the dataset.

By understanding the usage pattern for our data and our SLOs, we identify the dimensions we should work on and attach each of them to the real value they bring to our scenario. This helps us identify the most relevant aspects of data quality to work on and start thinking about specific actions.
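
To make the link between dimensions and actions concrete, here is a minimal sketch of how each dimension could translate into a measurable check on a pandas DataFrame. The column names (campaign_id, spend, updated_at) and thresholds are assumptions made for the marketing example, not part of the original framework.

```python
import pandas as pd


def dimension_checks(df: pd.DataFrame) -> dict:
    """Translate each data quality dimension into a simple, measurable check.

    Assumes a campaign-level table with columns campaign_id, spend and a
    timezone-aware UTC timestamp updated_at (illustrative assumptions).
    """
    return {
        # Completeness: no missing values in the key reporting columns
        "completeness": bool(df[["campaign_id", "spend"]].notna().all().all()),
        # Uniqueness: one row per campaign identifier, no duplicates
        "uniqueness": not df["campaign_id"].duplicated().any(),
        # Accuracy: spend figures must be non-negative to reflect reality
        "accuracy": bool((df["spend"] >= 0).all()),
        # Timeliness: the newest record is no older than 6 hours
        "timeliness": (pd.Timestamp.now(tz="UTC") - df["updated_at"].max())
        <= pd.Timedelta(hours=6),
    }
```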

Data quality dimensions directly linked to business real value in our example (image by the author)

Following our example, we linked the data quality dimensions identified in the Usage Pattern session with the business value they directly provide.

Where do I start?

Once the data quality dimensions are identified along with the corresponding business value they provide, we will run a collaborative session within the team to set specific, measurable, and achievable goals for effectively fulfilling each dimension. These goals will serve as the foundation for defining actionable tasks, such as adding data quality tests in the transformation phase, performing gap analysis or incorporating robust data cleaning processes. By aligning our data quality efforts with these well-defined goals, we ensure that our actions directly address business needs and enhance overall data quality.

All the actions identified in the process will be added to our backlog and prioritised by the team. The final outcome is a tailored data quality framework, adapted to business needs, that allows us to track our progress.
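
One lightweight way to keep such a framework trackable is to record each dimension together with the business value it protects, its goal and the backlog actions. The sketch below only illustrates that structure; the entries are invented for the marketing example.

```python
from dataclasses import dataclass, field


@dataclass
class QualityGoal:
    """One entry of the data quality framework: a dimension, the business
    value it protects, a measurable goal and the actions in the backlog."""
    dimension: str
    business_value: str
    goal: str
    actions: list = field(default_factory=list)
    done: bool = False


framework = [
    QualityGoal(
        dimension="Timeliness",
        business_value="Campaign reports always show recent spend",
        goal="Data in production is never older than 6 hours",
        actions=["Schedule transformations every 6 hours",
                 "Alert when a run is delayed"],
    ),
    QualityGoal(
        dimension="Uniqueness",
        business_value="No double-counted campaigns in reports",
        goal="Zero duplicated campaign identifiers",
        actions=["Add a uniqueness test on campaign_id in the transformation step"],
    ),
]
```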

Data quality framework with our goals and actions to ensure business value of our data in our example (image by the author)

Having a data quality framework that is visible and easy to interpret for stakeholders has some benefits:

  • It provides clarity on how data quality is managed, monitored, and improved within the organisation.
  • It promotes trust and transparency in data management practices.
  • It reduces the chances of misinterpreting data quality standards.
  • It demonstrates the team’s and organisation’s commitment to data quality and its importance in driving business success.

When is my data good enough for consumption?

Your framework will answer this. Once you have achieved all the goals that prepare your data to fulfil business expectations, you can be confident enough to hand it over to users and seek their feedback for further improvements.

Remember that the input for your work was the Service Level Objectives identified in the Usage Pattern session. As long as your data is aligned with these objectives, there is no reason to hold it back for fear that it does not yet meet the requirements.

What to do once your data is released?

Monitoring

All the actions and goals defined in your data quality strategy need to be actively monitored. Using monitoring tools that can raise alerts and communicate through various channels is essential for early detection of issues.

It is also crucial to log your incidents and categorise them by the dimensions they impact. This practice allows you to focus your attention on specific areas and identify potential gaps in your strategy. Even better, maintaining an incident report enables you to reflect on how your work in specific areas contributes to reducing the number of incidents over time.
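
As a rough illustration, an incident log categorised by dimension can be as simple as a list of records that we aggregate per month; the incidents below are invented examples.

```python
from collections import Counter
from datetime import date

# A minimal sketch of an incident log. In practice this would live in a
# ticketing tool or a table; these entries are invented for illustration.
incidents = [
    {"date": date(2023, 5, 3), "dimension": "timeliness",
     "description": "Nightly run delayed, data older than 6 hours"},
    {"date": date(2023, 5, 17), "dimension": "completeness",
     "description": "Missing spend figures for one ad platform"},
    {"date": date(2023, 6, 2), "dimension": "timeliness",
     "description": "Upstream API outage delayed the refresh"},
]

# Count incidents per month and per impacted dimension to spot gaps in the
# strategy and to see whether our actions reduce incidents over time.
per_month_and_dimension = Counter(
    (i["date"].strftime("%Y-%m"), i["dimension"]) for i in incidents
)
print(per_month_and_dimension)
```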

Incident log by month and by data quality dimension. Each sticky note would hold a brief description of the incident (image by the author)

Periodical revisions of the framework

Your team must review the incident log periodically and update your data quality framework accordingly to fill the identified gaps. This ensures your actions and goals reflect reality and are up to date.

Service Level Indicators and Transparency

It is essential to measure the fulfilment of your Service Level Objectives. For every SLO, you should have a Service Level Indicator (SLI) that shows its degree of fulfilment. For instance, in our example you could have an SLI that shows the percentage of time over the last X days in which the data in production was no older than 6 hours (timeliness dimension). This helps users understand how the data behaves and builds trust in its quality.
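
A minimal sketch of how such a timeliness SLI could be computed, assuming we keep the boolean results of periodic freshness checks; the numbers are invented for illustration.

```python
def timeliness_sli(check_results: list) -> float:
    """Percentage of successful freshness checks (data younger than 6 hours)."""
    if not check_results:
        return 0.0
    return 100 * sum(check_results) / len(check_results)


# e.g. hourly freshness checks collected over the last 7 days (168 checks)
last_7_days_checks = [True] * 160 + [False] * 8
print(f"Timeliness SLI: {timeliness_sli(last_7_days_checks):.1f}%")  # 95.2%
```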

Service Level Indicators for our data quality dimensions (image by the author)

Transparency in practice is key to increasing user adoption, and Service Level Indicators are what provide this transparency.

Sharing our Data Quality Metrics

For sharing our data quality metrics (SLIs), I really like embracing the data product concept within a data-mesh implementation.

Our data quality strategy has these characteristics:

  • It is business-specific, as the objectives come from a business need
  • It is transparent, as we can and want to share it with users
  • It is understandable, as our data quality framework is easy to interpret

This aligns perfectly with the way data mesh defines data products. I thoroughly recommend a data-mesh approach that encapsulates data and its quality metrics into data products to enhance transparency.

Why data products for sharing our data quality metrics?

By definition, a data product in data mesh is a self-contained, business-specific unit of data capabilities. Data products encapsulate data, processing logic and data quality checks, promoting decentralised data ownership and seamless integration into the broader data ecosystem. They are designed to serve specific business needs, and they are easily findable and transparent. As integral components of our data quality framework, data products ensure that our strategy aligns precisely with the unique business requirements, providing clarity and transparency for business-specific data quality.

One of the key advantages of data products in the context of data quality is their ability to hold their own SLIs. By integrating data quality indicators directly into the data products and making them visible through a user-friendly catalog, we empower users to search, request access, and explore data with full knowledge of its quality. This transparency and clarity enhance user confidence and encourage greater adoption.
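
As an illustration, a data product could publish its SLIs as part of its metadata so a catalog can display them next to the data. The schema and values below are assumptions, not a standard data-mesh specification.

```python
import json

# A minimal sketch of data product metadata with embedded SLOs and SLIs.
# All names and figures are invented for the marketing example.
data_product = {
    "name": "marketing_campaign_performance",
    "owner": "marketing-data-team",
    "description": "Consolidated campaign data for the reporting tool",
    "slos": {
        "timeliness": "data no older than 6 hours",
        "completeness": "no missing values in key columns",
    },
    "slis": {
        "timeliness_last_30_days": "99.2%",
        "completeness_last_30_days": "99.8%",
    },
}

# Published alongside the data so users can judge its quality before access.
print(json.dumps(data_product, indent=2))
```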

Conclusion

Throughout this step-by-step guide, we’ve learned how to set measurable Service Level Objectives that cover business needs, identify data quality dimensions, and align our actions with goals to fulfil expectations defined by SLOs. Embracing the transparency and understandability offered by data products, we can share our data quality metrics effectively to build trust and increase user adoption. Remember, perfection does not exist. Continuous monitoring, incident logging, and periodical revisions help us keep our data quality framework up-to-date.

Following these steps, you will be able to create a robust data quality framework and build a set of artefacts that serve as a shareable knowledge base for data quality, easy to interpret for stakeholders and team members. Even better, your data quality framework strikes a balance between effort and needs that will enable your team to release your data as soon as it is ready to cover the business requirements.

Happy data engineering!
