Start Data Engineering


How to test PySpark code with pytest


1. Introduction
2. Ensure the code’s logic is working as expected with tests
2.1. Test types for data pipelines
2.2. pytest: A powerful Python library for testing
2.2.1. Set context, run code, check results & clean up
2.2.2. Tests are identified by their name
2.2.3. Use fixture to create fake data for testing
2.2.4. Define items to be shared among tests with conftest.
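The outline above points at the core pytest ideas the article covers: fixtures, test naming conventions, and shared setup in conftest. As a rough sketch, not taken from the article itself, a session-scoped SparkSession fixture and a test that uses it could look like this:

```python
# conftest.py -- illustrative sketch, not code from the article
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Set context: one local SparkSession shared by every test in the session
    session = (
        SparkSession.builder.master("local[2]")
        .appName("pytest-pyspark")
        .getOrCreate()
    )
    yield session
    # Clean up after the last test finishes
    session.stop()


# test_transform.py -- pytest identifies tests by their test_ prefix
def test_keep_only_active_users(spark):
    # Run code: fake input data built with the fixture, then the logic under test
    input_df = spark.createDataFrame(
        [("u1", True), ("u2", False)], ["user_id", "is_active"]
    )
    result = input_df.filter("is_active")
    # Check results
    assert result.count() == 1
```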


Docker Fundamentals for Data Engineers


1. Introduction
2. Docker concepts
2.1. Define the OS and its configurations with an image
2.2. Use the image to run containers
2.2.1. Communicate between containers and local OS
2.2.2. Start containers with docker CLI or compose
3. Conclusion

1. Introduction

Docker can be overwhelming to start with. Most data projects use Docker to set up the data infra locally (and often in production).
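The article itself works with the docker CLI and compose; the snippet below is only a loose parallel in Python, using the third-party docker SDK (an assumption, not something the article uses), to show an image being turned into a running container with a port mapped back to the local OS:

```python
# Illustrative sketch only: the article uses the docker CLI and compose;
# this shows the same idea via the third-party "docker" Python SDK (pip install docker).
import docker

client = docker.from_env()  # talks to the local Docker daemon

# Run a container from an image, mapping a container port to the host
# and passing configuration through environment variables.
postgres = client.containers.run(
    "postgres:16",                            # the image defines the OS and its configuration
    name="local-warehouse",                   # hypothetical container name
    environment={"POSTGRES_PASSWORD": "example"},
    ports={"5432/tcp": 5432},                 # communicate between the container and the local OS
    detach=True,
)

print(postgres.status)
# Clean up when done:
# postgres.stop(); postgres.remove()
```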



Data Engineering Best Practices - #2. Metadata & Logging


1. Introduction 2. Setup & Logging architecture 3. Data Pipeline Logging Best Practices 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline 3.2. Obtain visibility into the code’s execution sequence using text logs 3.3. Understand resource usage by tracking Metrics 3.4. Monitoring UI & Traceability 3.5.


Uplevel your dbt workflow with these tools and techniques


1. Introduction
2. Setup
3. Ways to uplevel your dbt workflow
3.1. Reproducible environment
3.1.1. A virtual environment with Poetry
3.1.2. Use Docker to run your warehouse locally
3.2. Reduce feedback loop time when developing locally
3.2.1. Run only required dbt objects with selectors
3.2.2. Use prod datasets to build dev models with defer
3.2.3. Parallelize model building by increasing thread count
3.
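The selector, defer, and thread-count items above correspond to dbt CLI flags. A rough sketch, not taken from the article, of what those invocations look like when driven from Python; the model name and the state path are hypothetical:

```python
# Illustrative sketch: driving dbt CLI flags from Python; names and paths are hypothetical
import subprocess

# 3.2.1. Build only the required objects with a selector (model plus downstream nodes)
subprocess.run(["dbt", "run", "--select", "stg_orders+"], check=True)

# 3.2.2. Defer to prod artifacts so unbuilt upstream models resolve to prod relations
subprocess.run(
    ["dbt", "run", "--select", "stg_orders+", "--defer", "--state", "prod_artifacts/"],
    check=True,
)

# 3.2.3. Parallelize model building by raising the thread count
subprocess.run(["dbt", "run", "--threads", "8"], check=True)
```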


What is an Open Table Format? & Why to use one?


1. Introduction
2. What is an Open Table Format (OTF)
3. Why use an Open Table Format (OTF)
3.0. Setup
3.1. Evolve data and partition schema without reprocessing
3.2. See previous point-in-time table state, aka time travel
3.3. Git-like branches & tags for your tables
3.4. Handle multiple reads & writes concurrently
4. Conclusion
5. Further reading
6.
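As a small illustration of points 3.1 and 3.2, assuming Apache Iceberg as the open table format and a SparkSession already configured with an Iceberg catalog (the catalog and table names below are hypothetical), schema evolution and time travel look roughly like this:

```python
# Illustrative sketch only; assumes Apache Iceberg and a Spark catalog named "demo"
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current table state
spark.sql("SELECT COUNT(*) FROM demo.db.orders").show()

# 3.1. Evolve the schema without reprocessing existing data files
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN discount_pct DOUBLE")

# 3.2. Time travel: query the table as it existed at an earlier point in time
# (Iceberg's Spark integration supports TIMESTAMP AS OF / VERSION AS OF)
spark.sql(
    "SELECT COUNT(*) FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```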


6 Steps to Avoid Messy Data in Your Warehouse


1. Introduction
2. Six Steps for a Clean Data Warehouse
2.1. Understand the business
2.2. Make data easy to use with the appropriate data model
2.3. Good input data is necessary for a good data warehouse
2.4. Define Source of Truth (SOT) and trace its usage
2.5. Keep stakeholders in the loop for a more significant impact
2.6. Watch out for org-level red flags


Data Engineering Best Practices - #1. Data flow & Code


1. Introduction
2. Sample project
3. Best practices
3.1. Use standard patterns that progressively transform your data
3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
3.3. Avoid data duplicates with idempotent pipelines
3.4. Write DRY code & keep I/O separate from data transformation
3.5. Know the when, how, & what (aka metadata) of pipeline runs for easier debugging
3.
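Two of the items above, idempotent pipelines and keeping I/O separate from transformation, are easy to sketch. The example below is illustrative, not code from the article, and the paths and column names are hypothetical:

```python
# Illustrative sketch, not code from the article; paths and column names are hypothetical
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def transform_orders(orders: DataFrame) -> DataFrame:
    # Pure transformation: no reads or writes, so it is trivial to unit test
    return orders.withColumn("order_total", F.col("quantity") * F.col("unit_price"))


def run(spark: SparkSession, run_date: str) -> None:
    # I/O stays at the edges of the pipeline
    orders = spark.read.parquet(f"s3://raw-bucket/orders/date={run_date}")
    result = transform_orders(orders)
    # Overwriting the run_date partition makes a re-run produce the same output
    # (idempotent), instead of appending duplicates on every retry
    result.write.mode("overwrite").parquet(
        f"s3://warehouse-bucket/orders/run_date={run_date}"
    )
```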
