Unleashing the Power of CDC With Snowflake

Reading Time: 9 minutes

Picture a vibrant landscape of bustling tech hubs, where businesses thrive and decisions are made in the blink of an eye. In this dynamic realm of data engineering, a monumental challenge takes centre stage: efficiently managing the ever-changing tides of real-time data.

Data, the lifeblood of organisations, holds the key to unlocking untapped potential and propelling businesses forward. Timely analytics, the secret sauce that empowers companies to gain a competitive edge, are in high demand. Imagine businesses across diverse industries – from finance to healthcare, manufacturing to retail – feverishly generating vast volumes of data and information. The question then arises: how can they keep pace with this deluge of data?

Enter Change Data Capture (CDC), a transformative concept that seamlessly weaves its way into the very fabric of data engineering. Like a skilled conductor, CDC orchestrates a symphony of capturing and propagating data modifications effortlessly. It ensures that organisations stay at the forefront by capturing every twist and turn in the data landscape.

With CDC by their side, organisations unlock the power of informed decision-making, safeguard data integrity, and enable lightning-fast analytics. As the data ecosystem evolves, CDC becomes the compass that guides businesses in the right direction, equipping them to navigate through turbulent waters with ease.

Now, envision a world where data updates flow effortlessly, converging like tributaries into a mighty river. Organisations harness this stream of information, leveraging it to innovate, optimise, and thrive. The possibilities become boundless when data-driven insights are at your fingertips.

So, embrace the power of Change Data Capture, and embark on a captivating journey where the magic of real-time data awaits. It’s time to harness the pulse of the digital era and seize the endless opportunities that lie ahead.

In this blog, we will cover:

What Is CDC and Its Benefits?
- Where Is CDC Used and Who Uses It?
- Types of CDC
Hands-On
- Implement CDC in Snowflake Using Streams
Conclusion

What Is CDC and Its Benefits?

Change Data Capture (CDC) is a powerful technique that revolutionises data engineering by capturing and applying incremental changes to databases or data sources. It bridges gaps in data ecosystems, ensuring consistency and synchronisation across systems. In the past, capturing data changes was tedious and inefficient, leading to delays and discrepancies. CDC, however, propels data engineering forward by leveraging transaction logs to identify and extract changes in near real-time.

Implementing CDC offers numerous benefits. Some of them are discussed below:

Firstly, it enables precise data synchronisation, improving integrity and minimising inconsistencies. Secondly, CDC empowers organisations with near real-time integration and analytics, enhancing decision-making and competitiveness. Moreover, it facilitates the implementation of microservices architectures and event-driven systems, automating reactions to data changes without manual intervention.

With CDC, data engineering takes a leap forward, allowing organisations to stay ahead of the curve, respond swiftly to market dynamics, and unleash the full potential of their data resources.

Where Is CDC Used and Who Uses It?

CDC is widely used in data engineering across industries and organisations. It finds applications in database replication, where it captures changes from the source database and updates target databases, ensuring data consistency without manual updates.

CDC also plays a crucial role in data integration and ETL processes. It captures incremental changes from transactional databases or other sources, efficiently loading them into data warehouses or data lakes. This keeps analytical systems up to date for accurate reporting and analysis.

In real-time data streaming and event-driven architectures, CDC captures data changes to trigger actions or workflows. It enables automation in business processes, event-driven applications, and microservices, reacting to data changes in near real-time.

Data engineers, administrators, analysts, and various industries like finance, healthcare, e-commerce, and telecommunications rely on CDC. It facilitates data synchronisation, replication, real-time analytics, and event-driven processing, empowering data-driven decision-making and operational efficiency.

Types of CDC

Audit Columns: This method involves using designated columns within tables to track incremental changes. These additional columns store metadata like timestamps, user IDs, and change types, ensuring granular change tracking and auditability. It is commonly employed for compliance monitoring, data lineage, and maintaining historical records.
Log-Based: Log-based CDC captures and processes incremental changes by monitoring the transaction log of a database. The transaction log records all modifications made to the database, and log-based CDC reads and interprets the log to extract individual changes. It offers an efficient approach for data replication, real-time data warehousing, and event-driven architectures without placing additional strain on the database.
Trigger-Based: Trigger-based CDC utilises database triggers associated with specific tables to capture and track changes. These triggers automatically execute predefined actions when events such as inserts, updates, or deletes occur on the tables. The triggered actions capture the changed data, which can be processed and propagated to downstream systems for analysis, reporting, or integration with other applications.
Table Deltas: Table delta-based CDC compares two versions of a table to track incremental changes. It involves taking periodic snapshots of the table and comparing them to identify added, modified, or deleted records. By analysing the differences between snapshots, table delta-based CDC enables the extraction and propagation of changes to downstream systems for further processing or analysis.

Hands-On

In order to implement a CDC workflow we would be following the following steps:

Step 1: Create an OLTP transactional table

Start by creating a transactional database table where you can insert rows. This table will serve as the source of data, and any changes made here will be reflected in your data warehouse.

Query2:

Here, we have created a simple employee table and we are going to add a row:

The inserted data is as follows:

Step 2: Python script to load the data from our transactional database to Snowflake:

Now, we will write a Python script to load this table data from our transactional database to Snowflake. But first, we need to do certain things on Snowflake to get started.

Snowflake signup: Click on this link if you are new to Snowflake, the signup process is fairly simple. https://signup.snowflake.com/
After signing up, create a database by clicking data -> +Database -> create a database

After creating the database, go to worksheets and select the database and schema you have created and write the create table query which would create a target table for data migration. Also, grant necessary privileges to this table.

That’s it, now let’s switch to Python and write a script that would populate this snowflake table. We would be using Jupyter Notebook for Python.

1. Install the following libraries: psycopg2 and sqlalchemy

2. Create the database engine and database connection using the following commands:

3. Now, read the table data which you have just created using pandas’s read_sql_query function

4. Now we have to connect Python with Snowflake so in the next shell, run:
!pip install snowflake-connector-python==2.7.9

5. Create a connection between Snowflake and Python by following these steps and replace the placeholders with your credentials:

6. Next step is to create a cursor object and switch the role to sys admin to make databases, schemas, etc. Execute the following commands:

7. Execute the use command to use the database, warehouse, and schema.

8. Now, the data is ready to be inserted, we will use the write_pandas library to load this data frame onto Snowflake:

9. The success is true so let’s now verify this change in our snowflake table.

and we have now migrated the data from our transactional database to the Snowflake data warehouse. Now we can easily create data streams and create a CDC mechanism in Snowflake.

Implement CDC in Snowflake Using Streams

Snowflake Stream enables real-time data processing and continuous data integration, allowing for seamless and efficient data flow across various systems.

Snowflake streams act as a conduit for capturing and propagating changes made to tables, ensuring data consistency and enabling near-instantaneous updates. With their ability to efficiently handle high volumes of data and provide reliable, low-latency streaming capabilities, snowflake streams have become an indispensable tool for modern data engineering workflows. Their versatility and reliability make them a valuable asset in building robust and scalable data pipelines, facilitating timely and accurate data analysis and decision-making processes.

Table stream is an object that records the DML changes made to a source object (eg: tables). It returns additional columns like:

Metadata$Action: Insert or Delete (the operation that happened)
Metadata$Update: True or False (if the data was updated or not)
Metadata$RowId: to track changes on the row over time.

Creating a table stream

Run the following SQL command to create a stream:

create or replace stream employee_stream on table WORKFALL_DB.PUBLIC.EMPLOYEE;

We can see a stream has been created and additional audit columns have been added automatically. Let’s insert a new row and see the changes in the stream table.

After running the Python script :

As we can see, the new record is inserted as an insert into the stream table. We can now use this information to write just new rows and not existing rows each time.

Let’s create a final table and in that, we would merge the new rows.

Now execute this merge query and it will insert/update/delete records based on the operations performed on the source table.

Now let us insert a new record and update the name of the entry with id 2.

Source:

Records in the final table after updating and inserting new rows:

We can also create data workflows in the form of data pipelines using apache airflow to know more, you can check our blog at Easily Build ETL Pipeline Using Python And Airflow

Conclusion

In this blog, we demonstrated implementing a CDC workflow. In the realm of data engineering, CDC has transformed how organisations handle data modifications. By automating the capture and propagation of changes, CDC ensures consistent data and real-time integration. It’s a guiding light that keeps organisations synchronised in the ever-changing data landscape.

CDC allows data engineers to capture and use incremental changes, ensuring accurate data views. It’s essential for maintaining data integrity and enabling timely analytics, aiding informed decision-making. CDC is embraced by various industries, from finance to healthcare, e-commerce to telecommunications. It ensures granular data synchronisation, minimizing missed updates and inconsistencies.

Beyond data integrity, CDC empowers organisations with real-time integration and analytics. It enables agile decision-making in dynamic markets, where up-to-date insights are crucial for staying competitive. CDC also shapes the future of data engineering. It facilitates microservices architectures and event-driven systems, automating processes and eliminating manual interventions. As the volume and velocity of data grow, embracing CDC becomes increasingly critical for organisations striving to succeed in the data-driven era. We will come up with more such use cases in our upcoming blogs.

Meanwhile…

If you are an aspiring Snowflake enthusiast and want to explore more about the above topics, here are a few of our blogs for your reference:

Stay tuned to get all the updates about our upcoming blogs on the cloud and the latest technologies.

Keep Exploring -> Keep Learning -> Keep Mastering At Workfall, we strive to provide the best tech and pay opportunities to kickass coders around the world. If you’re looking to work with global clients, build cutting-edge products and make big bucks doing so, give it a shot at workfall.com/partner today!

Unleashing the Power of CDC With Snowflake