
An AI Chat Bot Wrote This Blog Post …

DataKitchen

ChatGPT> DataOps, or data operations, is a set of practices and technologies that organizations use to improve the speed, quality, and reliability of their data analytics processes. The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes.


Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

LinkedIn Engineering

Authors: Bingfeng Xia and Xinyu Liu. At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
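As a rough illustration of the Beam model the post builds on (not LinkedIn's actual pipeline), a minimal Beam Python pipeline that counts events per type might look like this; the event schema and source are placeholders:

```python
# Minimal Apache Beam (Python SDK) sketch -- illustrative only, not LinkedIn's
# production pipeline. Counts events per type from a small in-memory source.
import apache_beam as beam

events = [
    {"type": "page_view", "member": 1},
    {"type": "page_view", "member": 2},
    {"type": "connection", "member": 1},
]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.Create(events)        # stand-in for a streaming source
        | "KeyByType" >> beam.Map(lambda e: (e["type"], 1))
        | "CountPerType" >> beam.CombinePerKey(sum)  # aggregate counts per event type
        | "Print" >> beam.Map(print)
    )
```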




3. Psyberg: Automated end to end catch up

Netflix Tech

By Abhinaya Shetty and Bharath Mummadisetty. This blog post covers how Psyberg helps automate the end-to-end catchup of different pipelines, including dimension tables. In the previous installments of this series, we introduced Psyberg and delved into its core operational modes: Stateless and Stateful Data Processing.
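Psyberg is a Netflix-internal framework, so the following is only a hypothetical sketch of the general idea of an automated catchup: reprocess every partition that landed since the last high-water mark, then advance the mark. The function and parameter names are invented for illustration.

```python
# Hypothetical sketch of an end-to-end catchup loop (not Psyberg's actual API):
# backfill every hourly partition since the stored checkpoint, then advance it.
from datetime import datetime, timedelta

def pending_partitions(last_processed: datetime, now: datetime):
    """Yield the hourly partitions between the stored checkpoint and now."""
    cursor = last_processed + timedelta(hours=1)
    while cursor <= now:
        yield cursor
        cursor += timedelta(hours=1)

def catch_up(run_pipeline, last_processed: datetime, now: datetime) -> datetime:
    for partition in pending_partitions(last_processed, now):
        run_pipeline(partition)        # backfill this partition (stateless or stateful)
        last_processed = partition     # advance the high-water mark only on success
    return last_processed
```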


Monitoring Cloudera DataFlow Deployments With Prometheus and Grafana

Cloudera

Cloudera DataFlow for the Public Cloud (CDF-PC) is a complete self-service streaming data capture and movement platform based on Apache NiFi. It allows developers to interactively design data flows in a drag-and-drop designer and deploy them as continuously running, auto-scaling flow deployments or event-driven serverless functions.
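For context on the Prometheus side of such a setup, metrics can be pulled programmatically through Prometheus's standard HTTP API. The endpoint, metric name, and labels below are placeholders, not CDF-PC's actual metric names:

```python
# Illustrative only: query a Prometheus server's HTTP API for a flow metric.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def query_metric(promql: str):
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# e.g. bytes received by a deployment over the last 5 minutes (placeholder metric)
for series in query_metric('rate(nifi_bytes_received_total{deployment="my-flow"}[5m])'):
    print(series["metric"], series["value"])
```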


Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

Sub-second query systems allow for near real-time data exploration and low-latency, high-throughput queries, which are particularly well suited to time-series data. For our customers, this means faster analytics and decision making on near real-time data. Written by Ritesh Varyani and Jeana Choi at Lyft.
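A rough sketch of the kind of time-series aggregation such a sub-second store serves, using the clickhouse-driver client; the table and column names are hypothetical, not Lyft's schema:

```python
# Illustrative ClickHouse time-series query via clickhouse-driver (hypothetical schema).
from clickhouse_driver import Client

client = Client(host="clickhouse.example.com")

rows = client.execute(
    """
    SELECT toStartOfMinute(event_time) AS minute,
           count() AS events
    FROM ride_events
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute
    ORDER BY minute
    """
)
for minute, events in rows:
    print(minute, events)
```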


Automating dead code cleanup

Engineering at Meta

In our last blog post on automatic product deprecation, we talked about the complexities of product deprecations and a solution Meta has built called the Systematic Code and Asset Removal Framework (SCARF). SCARF has a subsystem for identifying and removing dead code.
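SCARF itself is Meta-internal; as a toy illustration of the underlying idea only (not SCARF), a static pass can flag functions that are defined but never referenced within a source file:

```python
# Toy dead-code detection sketch (not SCARF): flag module-level functions that are
# defined but never referenced anywhere else in the same source file.
import ast

def find_unreferenced_functions(source: str) -> set[str]:
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    # Candidates for removal; real systems also check callers across the whole codebase.
    return defined - referenced

code = """
def used(): return 1
def unused(): return 2
print(used())
"""
print(find_unreferenced_functions(code))  # {'unused'}
```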


Mastering Model Retraining in MLOps

RandomTrees

Model retraining is a critical component of any robust MLOps stack, playing a fundamental role in ensuring the longevity and effectiveness of machine learning models. Model retraining, in essence, involves the creation of a new iteration of a machine learning model by rerunning the training pipeline with updated data.
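A minimal, generic sketch of that idea (plain scikit-learn, not any specific MLOps product): retrain on refreshed data and promote the new model only if it beats the current one on a holdout set.

```python
# Minimal retraining sketch: rerun training on updated data, promote on improvement.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, random_state=0)  # stand-in for refreshed data
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, random_state=0)

def retrain(current_model, X_train, y_train, X_holdout, y_holdout):
    candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    new_score = accuracy_score(y_holdout, candidate.predict(X_holdout))
    old_score = accuracy_score(y_holdout, current_model.predict(X_holdout))
    # Promote only on improvement; otherwise keep serving the existing model.
    return candidate if new_score >= old_score else current_model

current = LogisticRegression(max_iter=1000).fit(X_train[:500], y_train[:500])
current = retrain(current, X_train, y_train, X_holdout, y_holdout)
```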