Data Management

Mastering Batch Data Processing with Versatile Data Kit (VDK)

A tutorial on how to use VDK to perform batch data processing

Angelica Lo Duca
Towards Data Science
5 min read · Nov 17, 2023


Photo by Mika Baumeister on Unsplash

Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities. While VDK can handle various data integration tasks, including real-time streaming, this article will focus on how to use it in batch data processing.

This article covers:

  • Introducing Batch Data Processing
  • Creating and Managing Batch Processing Pipelines in VDK
  • Monitoring Batch Data Processing in VDK

1 Introducing Batch Data Processing

Batch data processing is a method for processing large volumes of data at specified intervals. Batch data must be:

  • Time-independent: data doesn’t require immediate processing and is typically not sensitive to real-time requirements. Unlike streaming data, which needs instant processing, batch data can be processed at scheduled intervals or when resources become available.
  • Splittable in chunks: instead of processing an entire dataset in a single, resource-intensive operation, batch data can be divided into smaller, more manageable segments. These segments can then be processed sequentially or in parallel, depending on the capabilities of the data processing system.

In addition, batch data can be processed offline, meaning it doesn’t require a constant connection to data sources or external services. This characteristic is precious when data sources may be intermittent or temporarily unavailable.
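To make these characteristics concrete, here is a minimal, generic sketch (not taken from VDK; the dataset, chunk size, and process_chunk function are illustrative) of splitting a dataset into chunks and processing them one at a time:

from typing import Iterator, List

def split_into_chunks(records: List[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield successive fixed-size chunks of the dataset."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

def process_chunk(chunk: List[dict]) -> None:
    """Placeholder for the actual per-chunk work (cleaning, loading, ...)."""
    print(f"Processed {len(chunk)} records")

# Because the data is time-independent, each chunk can be processed
# on a schedule or whenever resources become available.
dataset = [{"id": i} for i in range(1000)]
for chunk in split_into_chunks(dataset, chunk_size=100):
    process_chunk(chunk)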

ELT (Extract, Load, Transform) is a typical use case for batch data processing. ELT comprises three main phases, outlined in a short code sketch after the list:

  • Extract (E): data is extracted from multiple sources in different formats, both structured and unstructured.
  • Load (L): data is loaded into a target destination, such as a data warehouse.
  • Transform (T): the extracted data typically requires preliminary processing, such as cleaning, harmonization, and transformations into a common format.
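Unlike classic ETL, the raw data is loaded first and only then transformed inside the destination. A minimal, generic outline of the three phases (the endpoint URL, table names, and the SQLite destination are purely illustrative, not part of the VDK example):

import sqlite3

import pandas as pd
import requests

def extract() -> pd.DataFrame:
    # E: pull raw records from a source, e.g. a REST API
    return pd.DataFrame(requests.get("https://example.com/api/items").json())

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # L: load the raw data as-is into the target destination
    df.to_sql("raw_items", conn, if_exists="replace", index=False)

def transform(conn: sqlite3.Connection) -> None:
    # T: clean and reshape the data inside the destination itself
    conn.execute("CREATE TABLE clean_items AS SELECT DISTINCT * FROM raw_items")

conn = sqlite3.connect("warehouse.db")
load(extract(), conn)
transform(conn)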

Now that you have learned what batch data processing is, let’s move on to the next step: creating and managing batch processing pipelines in VDK.

2 Creating and Managing Batch Processing Pipelines in VDK

VDK adopts a component-based approach, enabling you to build data processing pipelines quickly. For an introduction to VDK, refer to my previous article, An Overview of Versatile Data Kit. This article assumes that you have already installed VDK on your computer.

To explain how the batch processing pipeline works in VDK, we consider a scenario where you must perform an ELT task.

Imagine you want to ingest and process, in VDK, Vincent Van Gogh’s paintings available in Europeana, a well-known European aggregator for cultural heritage. Europeana provides all cultural heritage objects through its public REST API. Regarding Vincent Van Gogh, Europeana provides more than 700 works.

The following figure shows the steps for batch data processing in this scenario.

Image by Author

Let’s investigate each point separately. You can find the complete code to implement this scenario in the VDK GitHub repository.

2.1 Extract and Load

This phase includes VDK jobs calling the Europeana REST API to extract raw data. Specifically, it defines three jobs:

  • job1 — delete the existing table (if any)
  • job2 — create a new table
  • job3 — ingest table values directly from the REST API.

This example requires an active Internet connection to access the Europeana REST API. The operation is still a batch process because it downloads the data only once and does not require streaming.
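A minimal sketch of the first two jobs (job1 and job2), shown together for brevity, assuming they are written as Python steps that issue SQL through VDK's job_input.execute_query; the column list is illustrative, and the full version is available in the example in the VDK GitHub repository:

from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # job1: drop the table left over from a previous run, if any
    job_input.execute_query("DROP TABLE IF EXISTS assets")
    # job2: recreate the raw table that job3 will populate
    job_input.execute_query(
        """
        CREATE TABLE assets (
            country TEXT,
            edmPreview TEXT,
            provider TEXT,
            title TEXT,
            rights TEXT
        )
        """
    )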

We’ll store the extracted data in a table. The main difficulty of this task is building the mapping between the REST API response and the target table, which is done in job3.

Writing job3 simply means writing the Python code that performs this mapping; but instead of saving the extracted data to a local file, we call a VDK function (job_input.send_tabular_data_for_ingestion) to ingest it into VDK, as shown in the following snippet of code:

import inspect
import logging
import os

import pandas as pd
import requests
from vdk.api.job_input import IJobInput

log = logging.getLogger(__name__)


def run(job_input: IJobInput):
    """
    Download datasets required by the scenario and put them in the data lake.
    """
    log.info(f"Starting job step {__name__}")

    # The Europeana API key is read from the data job properties.
    api_key = job_input.get_property("api_key")

    start = 1
    rows = 100
    basic_url = f"https://api.europeana.eu/record/v2/search.json?wskey={api_key}&query=who:%22Vincent%20Van%20Gogh%22"
    url = f"{basic_url}&rows={rows}&start={start}"

    # First call: only needed to discover how many items are available.
    response = requests.get(url)
    response.raise_for_status()
    payload = response.json()
    n_items = int(payload["totalResults"])

    # Page through the results and ingest each page into the "assets" table.
    while start < n_items:
        if start > n_items - rows:
            rows = n_items - start + 1

        url = f"{basic_url}&rows={rows}&start={start}"
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()["items"]

        df = pd.DataFrame(payload)
        job_input.send_tabular_data_for_ingestion(
            df.itertuples(index=False),
            destination_table="assets",
            column_names=df.columns.tolist(),
        )
        start = start + rows
For the complete code, refer to the example in GitHub. Please note that you need a free API key to download data from Europeana.

The output produced during the extraction phase is a table containing the raw values.

2.2 Transform

This phase involves cleaning the data and extracting only the relevant information. We can implement it in VDK through two jobs, both shown below:

  • job4 — delete the existing table (if any)
  • job5 — create the cleaned table.
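Job4 amounts to a single cleanup statement; a minimal sketch, assuming it is written as a SQL job step:

DROP TABLE IF EXISTS cleaned_assets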

Job5 simply involves writing an SQL query, as shown in the following snippet of code:

CREATE TABLE cleaned_assets AS (
    SELECT
        SUBSTRING(country, 3, LENGTH(country)-4) AS country,
        SUBSTRING(edmPreview, 3, LENGTH(edmPreview)-4) AS edmPreview,
        SUBSTRING(provider, 3, LENGTH(provider)-4) AS provider,
        SUBSTRING(title, 3, LENGTH(title)-4) AS title,
        SUBSTRING(rights, 3, LENGTH(rights)-4) AS rights
    FROM assets
)

Running this job in VDK produces another table, named cleaned_assets, containing the processed values. Finally, we are ready to use the cleaned data: in our case, we can build a Web app that shows the extracted paintings. You can find the complete code to perform this task in the VDK GitHub repository.
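As a minimal sketch of that last step, assuming VDK is configured with its default local SQLite backend (the database file name below is illustrative), the cleaned table can be read back with pandas and rendered, for example, in a simple Web page:

import sqlite3

import pandas as pd

# Illustrative path: point this at the SQLite file VDK ingests into locally.
conn = sqlite3.connect("vdk-sqlite.db")
paintings = pd.read_sql_query(
    "SELECT title, provider, country, edmPreview FROM cleaned_assets",
    conn,
)

# edmPreview holds a thumbnail URL for each painting, so a Web app only
# needs to render these links next to the titles.
print(paintings.head())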

3 Monitoring Batch Data Processing in VDK

VDK provides the VDK UI, a graphical user interface to monitor data jobs. To install VDK UI, follow the official VDK video at this link. The following figure shows a snapshot of VDK UI.

Image by Author

There are two main pages:

  • Explore: This page lets you explore data jobs through metrics such as the job execution success rate, jobs with failed executions in the last 24 hours, and the most failed executions in the last 24 hours.
  • Manage: This page gives more job details. You can sort jobs by column, search by multiple parameters, filter by some of the columns, view the source of a specific job, add other columns, and so on.

Watch the following official VDK video to learn how to use VDK UI.

Summary

Congratulations! You have just learned how to implement batch data processing in VDK! It only requires ingesting raw data, manipulating it, and, finally, using it for your purposes! You can find many other examples in the VDK GitHub repository.

Stay up-to-date with the latest data processing developments and best practices in VDK. Keep exploring and refining your expertise!
