Data Engineering Weekly #123

Data Engineering Weekly

Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi

Uber

Uber writes a comprehensive guide on running incremental ETL using Apache Hudi. The blog discusses implementing Type-2 SCD modeling and strategies to generate surrogate keys and bridge tables to handle many-to-many relationships.
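
The post is Hudi-specific; as a rough illustration of the core idea, here is a minimal PySpark sketch of an incremental pull from a Hudi table. The table path and begin instant are placeholders, and it assumes a Spark session built with the Hudi Spark bundle on the classpath.

```python
# Minimal sketch: incremental pull from a Hudi table with PySpark.
# Assumes a SparkSession configured with the Hudi Spark bundle;
# the table path and begin instant below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-etl").getOrCreate()

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Only commits after this instant are returned, so each run
    # processes new changes only instead of rescanning the table.
    .option("hoodie.datasource.read.begin.instanttime", "20230101000000")
    .load("s3://bucket/path/to/hudi_table")  # placeholder path
)
incremental_df.createOrReplaceTempView("changes")
```

In practice, each run would persist the last instant it processed and use it as the begin time for the next run.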

Tips to Build a Robust Data Lake Infrastructure

DareData

In this blog post, we aim to share practical insights and techniques based on our real-world experience developing data lake infrastructures for our clients. Learn how we build data lake infrastructures and help organizations around the world achieve their data goals. Data Sources: How different are your data sources?

Enhancing Efficiency: Robinhood’s Batch Processing Platform

Robinhood

In this blog, we explore the evolution of our in-house batch processing infrastructure and how it helps Robinhood work smarter. Our V1 batch processing architecture was robust, anchored by Apache Spark on multiple Hadoop clusters (Spark is known for effectively handling large-scale data processing). Authored by: Grace L.,
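
Not Robinhood’s actual code, but a minimal sketch of the kind of Spark batch job such a platform schedules; the paths, dataset, and column names are invented for illustration.

```python
# Illustrative only: a generic daily Spark batch aggregation.
# Input/output paths and column names are placeholders, not Robinhood's schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch-aggregation").getOrCreate()

# Read one day's partition of raw records from the Hadoop cluster.
trades = spark.read.parquet("hdfs:///data/trades/ds=2023-06-01")

# Aggregate per account, then write the result back as a daily partition.
daily_totals = trades.groupBy("account_id").agg(
    F.sum("notional").alias("daily_notional"),
    F.count(F.lit(1)).alias("trade_count"),
)
daily_totals.write.mode("overwrite").parquet(
    "hdfs:///data/agg/daily_totals/ds=2023-06-01"
)
```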

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

At Lyft, we have used systems like ClickHouse and Apache Druid for near real-time and sub-second analytics. In this blog post, we explain how Druid has been used at Lyft and what led us to adopt ClickHouse for our sub-second analytics system. Written by Ritesh Varyani and Jeana Choi at Lyft.
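
For flavor, here is a hedged sketch of the kind of sub-second aggregation such a system serves, written against the clickhouse-connect Python client; the host, table, and columns are invented for the example, not Lyft’s actual schema.

```python
# Illustrative sketch using the clickhouse-connect client.
# Connection details and the ride_events table are assumptions for this example.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# A typical near-real-time slice: count recent events per minute.
result = client.query(
    """
    SELECT toStartOfMinute(event_time) AS minute, count() AS rides
    FROM ride_events
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute
    ORDER BY minute
    """
)
for minute, rides in result.result_rows:
    print(minute, rides)
```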

A Beginner’s Guide to Learning PySpark for Big Data Processing

ProjectPro

Did you know that, according to LinkedIn, over 24,000 Big Data jobs in the US list Apache Spark as a required skill? One of the most in-demand technical skills these days is analyzing large data sets, and Apache Spark and Python are two of the most widely used technologies to do this. This is where PySpark, the Python API for Apache Spark, comes in.
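
For the uninitiated, a first PySpark program usually looks like the classic word count below; the input file name is a placeholder.

```python
# A first PySpark program: word count over a text file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")  # placeholder input file
counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
    .map(lambda word: (word, 1))              # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)          # sum the counts per word
)
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```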

20 Best Open Source Big Data Projects to Contribute on GitHub

ProjectPro

This blog will walk through the most popular and fascinating open source big data projects. Apache Beam, for example, is an advanced, unified, open-source programming model launched in 2016.
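
As a taste of the model, here is a minimal Beam pipeline in the Python SDK. The point of the unified model is that the same code can run on different runners (the local DirectRunner, Dataflow, Flink, Spark) without changes.

```python
# Minimal Apache Beam pipeline: word count over an in-memory collection.
# Runs locally on the DirectRunner by default.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello beam", "hello big data"])
        | "Split" >> beam.FlatMap(str.split)
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```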

Achieving Insights and Savings with Cost Data

Airbnb Tech

Airbnb relies on a broad data stack (Apache Airflow, Apache Hive, Apache Spark) and extensive analytics infrastructure (Minerva, Apache Druid, DataPortal, Apache Superset, SLA monitoring) to make data-informed decisions. A foundation of robust and actionable data is essential for a successful efficiency program.
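
As a hypothetical illustration of the orchestration layer mentioned here, a small Airflow DAG of the kind that could drive a daily cost rollup; the DAG id, schedule, and task logic are all invented for the sketch.

```python
# Hypothetical sketch: a daily Airflow DAG for cost reporting.
# The task body is a placeholder; a real pipeline would join usage data
# with pricing and write per-team cost rows to a warehouse table.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def aggregate_costs():
    print("aggregating cost data")  # placeholder logic


with DAG(
    dag_id="daily_cost_rollup",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="aggregate_costs", python_callable=aggregate_costs)
```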
