Terms You Should Know If You’re Planning To Use Change Data Capture

April 29, 2024

If you’ve worked in data long enough, then you’ve likely come across the term change data capture.

Often called CDC, change data capture involves tracking and recording changes in a database as they happen, and then transmitting those changes to designated targets. This matters because some pipelines, batch pipelines in particular, don’t capture every change in a source system, and even when they do, they rarely deliver those changes in near real time.

That’s why change data capture is a popular choice to implement. As referenced above, it can capture every change to a record and stream those changes in near real time.

So, if your team is considering change data capture as a solution, it’s worth first going over some common terms, tools, and technologies you should know.

Terms You Should Know When Considering Change Data Capture

Data Latency

You’ll often hear the term latency, or data latency. In the CDC world, it simply means the time it takes for data changes in the source system to be reflected in the target system, and lower latency is often a key objective of CDC solutions. More broadly, latency describes the delay between the time data is requested and the time it is received and processed. As companies try to build real-time systems, they often need to push latency toward zero. This is also why some people dislike the term “real-time”: very few systems are truly real-time, and there is generally always some level of delay.
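To make that concrete, here’s a rough, tool-agnostic sketch of measuring replication lag by comparing the commit timestamp carried on a change event to the time it lands at the target. The event structure below is hypothetical; real CDC tools (Debezium, for example) attach their own source timestamps to each event.

```python
from datetime import datetime, timezone

def replication_lag_seconds(source_commit_ts: datetime) -> float:
    """Seconds between the commit at the source and the moment this
    runs at the target; a rough measure of data latency."""
    return (datetime.now(timezone.utc) - source_commit_ts).total_seconds()

# Hypothetical change event carrying its source commit timestamp.
event = {"op": "update", "committed_at": datetime.now(timezone.utc)}
# ... event travels through the pipeline ...
print(f"lag: {replication_lag_seconds(event['committed_at']):.3f}s")
```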

Debezium

Debezium is a robust, open-source distributed platform that simplifies the development of streaming CDC (change data capture) pipelines. Although it commonly integrates with the Apache Kafka streaming framework, its use is not limited to Kafka. Debezium streams database changes from sources like Postgres into Kafka with minimal manual programming.

Once captured, these changes, referred to as events, are quickly moved into Debezium and forwarded to Kafka in real time. A major advantage of Debezium is its capability to preserve the chronological order of these events exactly as they occur. This feature is essential in scenarios where transaction sequences are critical, such as in financial or inventory management systems.

Setting up Debezium for CDC can be challenging, as it typically requires operating Kafka, Kafka Connect, and ZooKeeper. Consequently, your data team may need the assistance of a DevOps team or an external service provider to deploy it successfully.
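As a concrete example, here’s a minimal, hedged sketch of registering a Postgres connector with Kafka Connect’s REST API. The hostnames, credentials, and connector name are placeholders, and exact option names vary by Debezium version, so check the documentation for yours.

```python
import json
import urllib.request

# Hypothetical Postgres connector config; all values are placeholders.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "inventory",
        # Prefix for the Kafka topics Debezium writes change events to
        # (older Debezium versions use database.server.name instead).
        "topic.prefix": "inventory",
    },
}

# Kafka Connect exposes a REST API (port 8083 by default) for
# registering connectors.
req = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```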

Log-based CDC

In my experience, I have seen multiple CDC implementations. The first uses the WAL (we’ll get to what WAL means later). In this approach, the CDC tool (or service) monitors the transaction or redo logs generated by the source database.

These logs contain a sequential record of all changes made to the data, including the specific modified data and the type of operation performed (insert, update, delete). The CDC tool continuously reads and analyzes these logs, identifying and capturing the changes as individual events.

Once it captures these changes, the CDC tool applies the necessary transformations and formatting and then propagates the changes to the target system or systems. This replication process can occur in real-time or near real-time, ensuring that the target systems are updated with the latest data changes from the source.
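To make the log-based approach concrete, here’s a minimal sketch using psycopg2’s logical replication support against Postgres. It assumes a logical replication slot named cdc_slot already exists (created with an output plugin such as wal2json); the connection details and names are placeholders.

```python
import psycopg2
import psycopg2.extras

# Replication-capable connection; the DSN is a placeholder.
conn = psycopg2.connect(
    "dbname=inventory user=cdc_user",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Stream decoded change events from an existing logical replication slot.
cur.start_replication(slot_name="cdc_slot", decode=True)

def consume(msg):
    # Each message is a change event read from the WAL: inserts,
    # updates, and deletes, delivered in commit order.
    print(msg.payload)
    # Acknowledge progress so Postgres can recycle old WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)  # blocks, delivering changes as they commit
```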

Trigger-Based CDC

Another common approach to implementing CDC is using triggers. If you’ve worked with databases, it’s probably clear why: when a new row of data is created or an existing row is modified, a trigger fires.

Boom. 

The data is now in your data warehouse or data lake. (In practice, the trigger usually writes each change to an audit or changelog table, which is then shipped to the target.)
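Here’s a minimal sketch of that idea against Postgres, executed from Python. The orders table, audit table, and trigger names are hypothetical, and this version only captures inserts and updates; a real setup would also handle deletes and the job that ships audit rows downstream.

```python
import psycopg2

# Hypothetical audit table plus a trigger that records every insert
# or update on orders; all names are placeholders.
DDL = """
CREATE TABLE IF NOT EXISTS orders_audit (
    audit_id   BIGSERIAL PRIMARY KEY,
    operation  TEXT        NOT NULL,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    row_data   JSONB       NOT NULL
);

CREATE OR REPLACE FUNCTION capture_order_change() RETURNS trigger AS $$
BEGIN
    -- TG_OP is 'INSERT' or 'UPDATE'; NEW is the row after the change.
    INSERT INTO orders_audit (operation, row_data)
    VALUES (TG_OP, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS orders_cdc ON orders;
CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_order_change();
"""

with psycopg2.connect("dbname=inventory user=cdc_user") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```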

Of course, both of these methods assume you have some system that can ensure every change processed from your source actually makes it to the destination, and that can scale as volume grows.

That’s easier said than done.

So, besides the what, the more important question is: why?

Write-Ahead Logs – WAL

Provide durability guarantee without the storage data structures to be flushed to disk, by persisting every state change as a command to the append only log.

Source: Martin Fowler

Or!

The main functionality of a write-ahead log can be summarized as:

Allow the page cache to buffer updates to disk-resident pages while ensuring durability semantics in the larger context of a database system.

Persist all operations on disk until the cached copies of pages affected by these operations are synchronized on disk. Every operation that modifies the database state has to be logged on disk before the contents on the associated pages can be modified.

Allow lost in-memory changes to be reconstructed from the operation log in case of a crash.

Source: Database Internals: A Deep Dive into How Distributed Data Systems Work

A write-ahead log (WAL) is a crucial component of many database systems that helps guarantee the ACID properties (Atomicity, Consistency, Isolation, Durability), in particular atomicity and durability.

The idea of the WAL is that changes to data are first recorded in this log before they are applied to the actual database. 

Why?

In case of a database crash or a power failure, the system can use the WAL to “replay” the recorded transactions upon restart, thereby recovering the database to a consistent state. 

It’s important to note that the specifics of how the write-ahead log works can depend on the particular database system. For example, in PostgreSQL, the WAL is a set of log files containing information about all changes made to the database’s data files. This can include changes to table data and to the database’s schema.
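If you’d like to poke at this yourself, Postgres exposes its current WAL position and any replication slots through built-in functions and views. A quick sketch (the connection string is a placeholder):

```python
import psycopg2

with psycopg2.connect("dbname=inventory user=cdc_user") as conn:
    with conn.cursor() as cur:
        # Current write position in the WAL, as a log sequence number.
        cur.execute("SELECT pg_current_wal_lsn();")
        print("WAL position:", cur.fetchone()[0])

        # Replication slots that log-based CDC tools consume from.
        cur.execute("SELECT slot_name, plugin, active FROM pg_replication_slots;")
        for slot in cur.fetchall():
            print(slot)
```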

One of the reasons I enjoy the concept of CDC is that it can be a great gateway into helping new engineers understand how some databases actually work.

Are You Looking To Implement Change Data Capture?

Now that you’ve learned some key terms for change data capture, as well as some methods for implementing it, you’ll likely want to consider which technology you’ll use to deploy your CDC data pipelines.

There are plenty of options, from self-hosting Debezium to out-of-the-box solutions. Our team has partnered with Estuary to help deliver change data capture, and we’re already processing terabytes upon terabytes a month. So feel free to reach out to our team, and we’ll help you deploy your CDC pipelines with ease!

Or, if you’d like to read more about how you can implement change data capture for MySQL, SQL Server, or MongoDB, the links for the different databases should help.

Thanks for reading! If you’d like to read more about data engineering, then check out the articles below.

Migrate Data From DynamoDB to MySQL – Two Easy Methods

Is Everyone’s Data A Mess – The Truth About Working As A Data Engineer

Normalization Vs. Denormalization – Taking A Step Back

What Is Change Data Capture – Understanding Data Engineering 101

Why Everyone Cares About Snowflake
