Data Engineering

How to Keep Track of Data Versions Using Versatile Data Kit

Learn about slowly changing dimensions (SCD) and how to implement SCD Type 2 in VDK

Angelica Lo Duca
Towards Data Science
7 min read · May 3, 2023

Photo by Joshua Sortino on Unsplash

Data is the backbone of any organization, and in today’s fast-paced world, it is crucial to keep track of its versions. As businesses grow and evolve, data undergoes numerous changes that can quickly become overwhelming without a streamlined system. Technology has come to our rescue with various tools that help us manage data efficiently.

One such tool is the Versatile Data Kit (VDK), which offers a comprehensive solution for your data versioning needs. VDK helps you perform complex operations, such as data ingestion and processing from different sources, using SQL or Python. You can use VDK to build data lakes and ingest raw data extracted from different sources, including structured, semi-structured, and unstructured data.

In this article, we will explore how VDK can make your life easier by providing an intuitive interface and robust features to keep track of all changes made to your critical business information.

We’ll cover the following topics:

  • What are Slowly Changing Dimensions (SCD)?
  • SCD Types
  • How to implement SCD Type 2 in VDK.

Note for images: all the images are by the author, except where explicitly stated.

What are Slowly Changing Dimensions (SCD)?

In a data warehouse, a dimension is a structure that categorizes facts and measures to enable users to answer business questions. Commonly used dimensions are people, products, places, and time.

Slowly changing dimensions (SCD) help you track changes in dimension data over time. They store and manage current and historical data in a data warehouse. The word slowly means that the data changes occasionally rather than on a regular basis. Examples of slowly changing attributes are a person's address, email, and salary.

We can use SCDs to track customers, products, or any other information that might change over time. Keeping historical records of how the data has changed is valuable for analyzing trends or answering questions about how specific data has evolved.

SCD Types

This article covers the three most common types of SCD: Type 1, Type 2, and Type 3. To illustrate how each type works, consider the following customer dimension table:

Figure 1 — The customer dimension table of the example

What happens if John Smith changes his phone number to 555-5668?

SCD Type 1

This SCD type does not track changes in data; new data overwrites the old data. This type is appropriate when the data history is unimportant and you only need to know the current state of the data. Type 1 is the simplest type of SCD.

Consider the example in Figure 1. In SCD Type 1, the new phone number replaces the old one, and there is no record of the previous one, as shown in the table below.

Figure 2 — The customer dimension table after applying SCD Type 1
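
As a rough illustration of the overwrite behavior described above, here is a minimal sketch in plain Python. It does not use VDK. The table contents (the customer_id values, the second customer, and the old phone number 555-1234) and the helper apply_scd_type1 are hypothetical, introduced only for this example.

```python
from copy import deepcopy

# Hypothetical stand-in for the customer dimension table.
# Only John Smith and the new number 555-5668 come from the article;
# every other value is an assumption made for illustration.
customers = [
    {"customer_id": 1, "name": "John Smith", "phone_number": "555-1234"},
    {"customer_id": 2, "name": "Jane Doe", "phone_number": "555-9876"},
]


def apply_scd_type1(rows, customer_id, column, new_value):
    """Return a copy of rows in which the matching record is overwritten."""
    updated = deepcopy(rows)
    for row in updated:
        if row["customer_id"] == customer_id:
            row[column] = new_value  # the previous value is not kept anywhere
    return updated


type1_result = apply_scd_type1(customers, 1, "phone_number", "555-5668")
print(type1_result[0]["phone_number"])  # prints: 555-5668
```

The table still has the same number of rows after the update, and nothing in the result reveals what the previous number was.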

SCD Type 2

This type tracks historical changes by adding new records to the dimension table. The old data is still available but marked as superseded by the new. Each record contains a unique identifier, a start date, and an end date.

Consider the example in Figure 1 again. In SCD Type 2, a new record is created with the new telephone number, a new unique identifier, and a start date, while the previous record receives an end date. The start date is the date the new record takes effect, and the end date is the date the old record becomes obsolete.

Figure 3— The customer dimension table after applying SCD Type 2
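
The same change under SCD Type 2 can be sketched in plain Python as follows. This is only an illustration of the idea, not VDK code. The column names sid, start_date, and end_date, the helper apply_scd_type2, the old phone number, and the dates are all assumptions; an end_date of None is used here to mark the current record.

```python
from datetime import date

# Hypothetical starting table: one current record (end_date is None).
customers_scd2 = [
    {"sid": 1, "customer_id": 1, "name": "John Smith",
     "phone_number": "555-1234",
     "start_date": date(2020, 1, 1), "end_date": None},
]


def apply_scd_type2(rows, customer_id, column, new_value, change_date):
    """Close the current record and append a new one holding new_value."""
    result = [dict(r) for r in rows]  # work on copies, keep the input intact
    next_sid = max((r["sid"] for r in result), default=0) + 1
    for row in result:
        is_current = row["customer_id"] == customer_id and row["end_date"] is None
        if is_current and row[column] != new_value:
            row["end_date"] = change_date  # the old record becomes obsolete
            new_row = dict(row, sid=next_sid,
                           start_date=change_date, end_date=None)
            new_row[column] = new_value
            result.append(new_row)
            break
    return result


type2_result = apply_scd_type2(
    customers_scd2, 1, "phone_number", "555-5668", date(2023, 5, 3)
)
for record in type2_result:
    print(record["sid"], record["phone_number"], record["end_date"])
```

After the update the table holds two records for John Smith: the old one, now closed with an end date, and the new one, which is the current record.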

SCD Type 3

This type tracks only the most recent change by adding a column to the dimension table to hold the new value. Use this type when tracking the complete history of changes is not necessary.

Consider the example in Figure 1 again. In SCD Type 3, the record is updated with two new columns: new_phone_number, to hold the new value, and effective_date.

Figure 4— The customer dimension table after applying SCD Type 3
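
For comparison, a minimal plain-Python sketch of SCD Type 3 is shown below. The columns new_phone_number and effective_date follow the description above; the helper apply_scd_type3, the old phone number, and the date are hypothetical values chosen for illustration.

```python
from datetime import date

# Hypothetical record before the change.
customer_scd3 = {
    "customer_id": 1,
    "name": "John Smith",
    "phone_number": "555-1234",
}


def apply_scd_type3(row, new_phone_number, effective_date):
    """Return a copy of row extended with the two extra columns."""
    updated = dict(row)
    updated["new_phone_number"] = new_phone_number
    updated["effective_date"] = effective_date
    return updated


type3_result = apply_scd_type3(customer_scd3, "555-5668", date(2023, 5, 3))
print(type3_result["phone_number"], type3_result["new_phone_number"])
# prints: 555-1234 555-5668
```

Because there is only one extra column for the value, a second change would overwrite new_phone_number, which is why this type keeps only the most recent change.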

How to implement SCD Type 2 in VDK

VDK is a very powerful framework released by VMware as open source. Use VDK to build a data lake and merge multiple sources. If you are new to VDK, read its official documentation or this introductory article.

VDK supports SCD Type 1 and Type 2. To implement an SCD in VDK, use the SQL processing template. An SQL processing template is a data loading template, a conceptual structure provided by VDK. According to the VDK official documentation,

a data loading template consumes a source_view located in a source_schema and load the source data in a target_table located in a target_schema (extracted from VDK official documentation).

In practice, the template simplifies how data is extracted from a source and loaded into a target table. VDK provides the SQL data processing templates for Impala and Trino databases.

To illustrate how VDK manages SCD Type 2, consider the following scenario.

Figure 5 — How to use VDK to manage SCD Type 2

Starting from the left, there is the Data Source (e.g. the customer dimension table of Figure 1), which must define a source schema and a source view. VDK ingests data from the Data Source. To manage SCD Type 2, VDK uses the SCD2 Template, an SQL processing template. Through the vdk-trino plugin, VDK uses Trino DB to store the data in a Data Lake. The Data Lake must contain the Target Schema, which is the schema that will store the SCD Type 2 table.

VDK exposes the SCD2 Template through the execute_template method of the job input. The following code shows how to use the SCD2 Template:

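from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # ...
    # Run the SCD2 template: read customer_view from the source schema
    # and load it into customer_target_table in the target schema.
    job_input.execute_template(
        template_name='scd2',
        template_args={
            'source_schema': 'customer_schema',
            'source_view': 'customer_view',
            'target_schema': 'customer_target_schema',
            'target_table': 'customer_target_table',
            'id_column': 'customer_id',
            'sk_column': 'SID',
            'value_columns': ['name', 'address', 'phone_number'],
            # Only changes to these columns produce a new row.
            'tracked_columns': ['phone_number'],
        },
    )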

The code is adapted from the VDK documentation to fit the example in Figure 3, so refer to the documentation for further details. In practice, the template receives as input the source and target schemas, along with other parameters such as the ID column and the columns to track.

Thanks to VDK, you can easily manage SCD in your database! Refer to the VDK documentation for a complete and detailed example.

To explain how the SCD2 template works, consider the customer dimension table again. The previous code tracks only the phone_number column, so if that column changes, the system stores the change in a new row, as shown in the following figure:

Figure 6 — If a tracked column changes, the system stores it as a new row.

However, if an untracked column changes, the system overwrites it, as shown in the figure below:

Figure 7— If an untracked column changes, the system overwrites it.
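
The difference between tracked and untracked columns can be simulated in a few lines of plain Python. This is a sketch of the behavior described above, not VDK's actual implementation. The helper apply_scd2_change, the start_date and end_date columns, and the sample values are assumptions; the sketch also assumes that an untracked change overwrites only the current row.

```python
from datetime import date


def apply_scd2_change(rows, customer_id, changes, tracked_columns, change_date):
    """Add a new row if a tracked column changes, otherwise overwrite."""
    result = [dict(r) for r in rows]  # work on copies, keep the input intact
    current = next(
        r for r in result
        if r["customer_id"] == customer_id and r["end_date"] is None
    )
    tracked_changed = any(
        col in tracked_columns and current[col] != value
        for col, value in changes.items()
    )
    if tracked_changed:
        # Tracked change: close the current row and append a new version.
        current["end_date"] = change_date
        new_row = dict(current, SID=max(r["SID"] for r in result) + 1,
                       start_date=change_date, end_date=None)
        new_row.update(changes)
        result.append(new_row)
    else:
        # Untracked change: overwrite the current row, no history is kept.
        current.update(changes)
    return result


# Hypothetical current state of the target table.
table = [
    {"SID": 1, "customer_id": 1, "name": "John Smith",
     "address": "123 Main St", "phone_number": "555-1234",
     "start_date": date(2020, 1, 1), "end_date": None},
]

after_phone = apply_scd2_change(
    table, 1, {"phone_number": "555-5668"}, ["phone_number"], date(2023, 5, 3)
)
after_address = apply_scd2_change(
    table, 1, {"address": "456 Oak Ave"}, ["phone_number"], date(2023, 5, 3)
)
print(len(after_phone), len(after_address))  # prints: 2 1
```

Changing the phone number yields two rows, while changing the address leaves a single row with the new address.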

Summary

Congratulations! You have just learned what SCD is and how to implement it in VDK.

SCD is important for organizations that want to manage and analyze their data efficiently. By identifying which attributes of a dataset change infrequently, you can streamline your data processing and storage and reduce the likelihood of errors or inconsistencies in your analytics.

To manage SCD easily, you can use VDK, which helps you automate the tracking of changes to slowly changing attributes over time.

Overall, SCD may seem like a minor detail in the grand scheme of big data management. Still, it can significantly impact the accuracy and efficiency of analytics processes. By leveraging tools like VDK, you can stay ahead of the curve and maximize the value of your data assets.

You may also be interested in…

There are many related topics you might be interested in:

How to configure VDK to work with Trino DB

A complete example using the Versatile Data Kit and Trino DB.

How to build a web app using VDK

A step-by-step tutorial on how to build a web application, combining the Streamlit Python library and Versatile Data Kit.

How to use VDK plugins

A step-by-step tutorial on manipulating a table in your data lake by writing a VDK custom plugin.

How to handle missing values using

A tutorial on how to build data pipelines using VDK to handle missing values

Just one more word…

Don’t forget to join the VDK Slack channel to stay updated about VDK!

Researcher | +50k monthly views | I write on Data Science, Python, Tutorials, and, occasionally, Web Applications | Book Author of Comet for Data Science