Start Data Engineering


How to test PySpark code with pytest


1. Introduction
2. Ensure the code’s logic is working as expected with tests
2.1. Test types for data pipelines
2.2. pytest: A powerful Python library for testing
2.2.1. Set context, run code, check results & clean up
2.2.2. Tests are identified by their name
2.2.3. Use fixture to create fake data for testing
2.2.4. Define items to be shared among tests with conftest.
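The outline above points at the core pytest ideas the article covers: fixtures, test naming conventions, and shared setup in conftest. As a rough sketch, not taken from the article itself, a session-scoped SparkSession fixture and a test that uses it could look like this:

```python
# conftest.py -- illustrative sketch, not code from the article
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Set context: one local SparkSession shared by every test in the session
    session = (
        SparkSession.builder.master("local[2]")
        .appName("pytest-pyspark")
        .getOrCreate()
    )
    yield session
    # Clean up after the last test finishes
    session.stop()


# test_transform.py -- pytest identifies tests by their test_ prefix
def test_keep_only_active_users(spark):
    # Run code: fake input data built with the fixture, then the logic under test
    input_df = spark.createDataFrame(
        [("u1", True), ("u2", False)], ["user_id", "is_active"]
    )
    result = input_df.filter("is_active")
    # Check results
    assert result.count() == 1
```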


Docker Fundamentals for Data Engineers


1. Introduction
2. Docker concepts
2.1. Define the OS and its configurations with an image
2.2. Use the image to run containers
2.2.1. Communicate between containers and local OS
2.2.2. Start containers with docker CLI or compose
3. Conclusion

1. Introduction

Docker can be overwhelming to start with. Most data projects use Docker to set up the data infra locally (and often in production).
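The article itself works with the docker CLI and compose; the snippet below is only a loose parallel in Python, using the third-party docker SDK (an assumption, not something the article uses), to show an image being turned into a running container with a port mapped back to the local OS:

```python
# Illustrative sketch only: the article uses the docker CLI and compose;
# this shows the same idea via the third-party "docker" Python SDK (pip install docker).
import docker

client = docker.from_env()  # talks to the local Docker daemon

# Run a container from an image, mapping a container port to the host
# and passing configuration through environment variables.
postgres = client.containers.run(
    "postgres:16",                            # the image defines the OS and its configuration
    name="local-warehouse",                   # hypothetical container name
    environment={"POSTGRES_PASSWORD": "example"},
    ports={"5432/tcp": 5432},                 # communicate between the container and the local OS
    detach=True,
)

print(postgres.status)
# Clean up when done:
# postgres.stop(); postgres.remove()
```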



Data Engineering Best Practices - #2. Metadata & Logging


1. Introduction 2. Setup & Logging architecture 3. Data Pipeline Logging Best Practices 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline 3.2. Obtain visibility into the code’s execution sequence using text logs 3.3. Understand resource usage by tracking Metrics 3.4. Monitoring UI & Traceability 3.5.


Uplevel your dbt workflow with these tools and techniques


1. Introduction
2. Setup
3. Ways to uplevel your dbt workflow
3.1. Reproducible environment
3.1.1. A virtual environment with Poetry
3.1.2. Use Docker to run your warehouse locally
3.2. Reduce feedback loop time when developing locally
3.2.1. Run only required dbt objects with selectors
3.2.2. Use prod datasets to build dev models with defer
3.2.3. Parallelize model building by increasing thread count
3.
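The selector, defer, and thread-count items above correspond to dbt CLI flags. A rough sketch, not taken from the article, of what those invocations look like when driven from Python; the model name and the state path are hypothetical:

```python
# Illustrative sketch: driving dbt CLI flags from Python; names and paths are hypothetical
import subprocess

# 3.2.1. Build only the required objects with a selector (model plus downstream nodes)
subprocess.run(["dbt", "run", "--select", "stg_orders+"], check=True)

# 3.2.2. Defer to prod artifacts so unbuilt upstream models resolve to prod relations
subprocess.run(
    ["dbt", "run", "--select", "stg_orders+", "--defer", "--state", "prod_artifacts/"],
    check=True,
)

# 3.2.3. Parallelize model building by raising the thread count
subprocess.run(["dbt", "run", "--threads", "8"], check=True)
```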


What is an Open Table Format? & Why to use one?


1. Introduction
2. What is an Open Table Format (OTF)
3. Why use an Open Table Format (OTF)
3.0. Setup
3.1. Evolve data and partition schema without reprocessing
3.2. See previous point-in-time table state, aka time travel
3.3. Git-like branches & tags for your tables
3.4. Handle multiple reads & writes concurrently
4. Conclusion
5. Further reading
6.
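As a small illustration of points 3.1 and 3.2, assuming Apache Iceberg as the open table format and a SparkSession already configured with an Iceberg catalog (the catalog and table names below are hypothetical), schema evolution and time travel look roughly like this:

```python
# Illustrative sketch only; assumes Apache Iceberg and a Spark catalog named "demo"
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current table state
spark.sql("SELECT COUNT(*) FROM demo.db.orders").show()

# 3.1. Evolve the schema without reprocessing existing data files
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN discount_pct DOUBLE")

# 3.2. Time travel: query the table as it existed at an earlier point in time
# (Iceberg's Spark integration supports TIMESTAMP AS OF / VERSION AS OF)
spark.sql(
    "SELECT COUNT(*) FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```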


6 Steps to Avoid Messy Data in Your Warehouse


1. Introduction
2. Six Steps for a Clean Data Warehouse
2.1. Understand the business
2.2. Make data easy to use with the appropriate data model
2.3. Good input data is necessary for a good data warehouse
2.4. Define Source of Truth (SOT) and trace its usage
2.5. Keep stakeholders in the loop for a more significant impact
2.6. Watch out for org-level red flags


Data Engineering Best Practices - #1. Data flow & Code


1. Introduction
2. Sample project
3. Best practices
3.1. Use standard patterns that progressively transform your data
3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
3.3. Avoid data duplicates with idempotent pipelines
3.4. Write DRY code & keep I/O separate from data transformation
3.5. Know the when, how, & what (aka metadata) of pipeline runs for easier debugging
3.
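Two of the items above, idempotent pipelines and keeping I/O separate from transformation, are easy to sketch. The example below is illustrative, not code from the article, and the paths and column names are hypothetical:

```python
# Illustrative sketch, not code from the article; paths and column names are hypothetical
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def transform_orders(orders: DataFrame) -> DataFrame:
    # Pure transformation: no reads or writes, so it is trivial to unit test
    return orders.withColumn("order_total", F.col("quantity") * F.col("unit_price"))


def run(spark: SparkSession, run_date: str) -> None:
    # I/O stays at the edges of the pipeline
    orders = spark.read.parquet(f"s3://raw-bucket/orders/date={run_date}")
    result = transform_orders(orders)
    # Overwriting the run_date partition makes a re-run produce the same output
    # (idempotent), instead of appending duplicates on every retry
    result.write.mode("overwrite").parquet(
        f"s3://warehouse-bucket/orders/run_date={run_date}"
    )
```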
