Data Engineering Digest

Building ETL Pipelines With Generative AI

Data Engineering Podcast

OCTOBER 1, 2023

Summary: Artificial intelligence applications require substantial amounts of high-quality data, which is provided through ETL pipelines. In this episode, Jay Mishra shares his experiences and insights from building ETL pipelines with the help of generative AI.

Building BI SQL Machine Learning

Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

LinkedIn Engineering

OCTOBER 19, 2023

Authors: Bingfeng Xia and Xinyu Liu. Background: At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.

Process Lambda Architecture Kafka Machine Learning

Streaming Data Pipelines: What Are They and How to Build One

Precisely

DECEMBER 28, 2023

The concept of streaming data was born of necessity. But insights derived from day-old data don’t cut it. Business success is based on how we use continuously changing data. That’s where streaming data pipelines come into play. What is a streaming data pipeline?

Data Pipeline Building Kafka Big Data

Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

Data Engineering Podcast

SEPTEMBER 3, 2023

Summary: Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options.

Data Integration BI SQL Python

Our First Netflix Data Engineering Summit

Netflix Tech

DECEMBER 14, 2023

Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. The result was a series of talks which we are now sharing with the rest of the Data Engineering community!

Data Engineering Data Engineer Engineering Metadata

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

Learn data engineering, all the references (credits). This is a special edition of the Data News. But right now I'm on holiday, finishing a hiking week in Corsica 🥾, so I wrote this special edition about how to learn data engineering in 2024. Who are the data engineers?

Data Engineering Data Engineer Engineering Hadoop

Apache Spark Use Cases & Applications

Knowledge Hut

MAY 2, 2024

As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing.” Spark is a cluster computing framework, somewhat similar to MapReduce, but it offers many more capabilities and features, greater speed, and APIs for developers in many languages, such as Scala, Python, Java, and R.

Scala Hospitality Healthcare Retail

Building Real-time Machine Learning Foundations at Lyft

Lyft Engineering

JUNE 28, 2023

In early 2022, Lyft already had a comprehensive Machine Learning Platform called LyftLearn, composed of model serving, training, CI/CD, feature serving, and model monitoring systems. However, streaming data was not supported as a first-class citizen across many of the platform’s systems, such as training, complex monitoring, and others.

Machine Learning Building Metadata Kafka

What is Apache Airflow?

Marc Lamberti

SEPTEMBER 22, 2023

In this article, you will learn everything about what Airflow is, what it isn’t, and its core concepts and components. Stemming from this analogy, “you” is the orchestrator in data orchestration, and the recipe is the data pipeline. When you create a data pipeline, you create a DAG.

Data Pipeline Python Metadata Database

Snowflake’s AWS re:Invent Highlights for Fast-Tracking ML, Gen AI and Application Innovations

Snowflake

DECEMBER 5, 2023

Bring gen AI to your governed enterprise data with Snowflake Cortex. No matter your industry, department or role, you can leverage generative AI (gen AI) and large language models (LLMs) to increase efficiencies and uncover new solutions to business challenges.

AWS Amazon Web Services Government Cloud Computing

Adopting Real-Time Data At Organizations Of Every Size

Data Engineering Podcast

DECEMBER 4, 2022

Summary: The term "real-time data" brings with it a combination of excitement, uncertainty, and skepticism. In this episode, Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.

Data Lake MongoDB MySQL Data Warehouse

Run Your Applications Worldwide Without Worrying About The Database With Planetscale

Data Engineering Podcast

DECEMBER 11, 2022

Summary: One of the most critical aspects of a software project is managing its data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage.

Database MySQL Data Lake MongoDB

Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks

Data Engineering Podcast

OCTOBER 16, 2022

Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics.

Database Metadata MongoDB MySQL

2. Diving Deeper into Psyberg: Stateless vs Stateful Data Processing

Netflix Tech

NOVEMBER 14, 2023

By Abhinaya Shetty and Bharath Mummadisetty. In the inaugural blog post of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix’s Membership and Finance data engineering team.

Data Process Process Metadata Finance

Data News — Week 23.22

Christophe Blefari

JUNE 3, 2023

I wanted to write more about Microsoft Fabric and the states of data that were published last week, but I'll do it another time. The article says this fits with Japan's new strategy to become a leader in AI technologies: by removing barriers on training data, they hope to open doors. Or several needles.

Data Pipeline Data SQL Algorithm

Top 12 Data Engineering Project Ideas [With Source Code]

Knowledge Hut

JUNE 26, 2023

Welcome to the world of data engineering, where the power of big data unfolds. If you're aspiring to be a data engineer and seeking to showcase your skills or gain hands-on experience, you've landed in the right spot. Therefore, the greatest thing you can do as a novice is to work on some real-time data engineering initiatives.

Data Engineering Data Engineer Coding Project

Startup Spotlight: Patch Helps Devs Unblock Pipelines With Data Packages

Snowflake

DECEMBER 21, 2023

In this edition, Patch.tech Co-Founder and CPO Whelan Boyd talks about how frustration with clogged data pipelines sparked the idea for Patch’s code packages, which allow engineers to distribute data sets with all the built-in elements that analysts and developers need to create apps. What inspired you to start Patch?

Software Engineer Software Engineering Database Data Pipeline

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

NOVEMBER 29, 2023

Sub-second query systems allow for near real-time data explorations and low latency, high throughput queries, which are particularly well-suited for handling time-series data. For our customers, this means faster analytics on near real-time data and decision making. An example of how we use Druid rollup at Lyft.

Kafka Data Ingestion Datasets Architecture

Data Engineering Weekly #157

Data Engineering Weekly

FEBRUARY 4, 2024

RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more. What is Data Modeling, and what is not?

Data Engineering Data Engineer Engineering PostgreSQL

Data Engineering Weekly #164

Data Engineering Weekly

MARCH 24, 2024

Kai Waehner: The Data Streaming Landscape 2024. This is a comprehensive overview of the state of the data streaming landscape in 2024. As we predicted in the key trends of 2023, which named Apache Flink a clear winner among the stream processing frameworks, we see Confluent offering Flink as a service.

Data Engineering Data Engineer Engineering Metadata

IBM Technology Chooses Cloudera as its Preferred Partner for Addressing Real Time Data Movement Using Kafka

Cloudera

SEPTEMBER 26, 2023

Organizations increasingly rely on streaming data sources not only to bring data into the enterprise but also to perform streaming analytics that accelerate the process of being able to get value from the data early in its lifecycle.

Kafka Technology IT Government

Data Engineering Weekly #161

Data Engineering Weekly

MARCH 3, 2024

There will be food, networking, and real-world talks around data engineering. Nvidia: What Is Sovereign AI?

Data Engineering Data Engineer Pipeline-centric Engineering

4 Ways to Tackle Data Pipeline Optimization

Monte Carlo

FEBRUARY 8, 2024

Just as a watchmaker meticulously adjusts every tiny gear and spring in harmonious synchrony for flawless performance, modern data pipeline optimization requires a similar level of finesse and attention to detail. Learn how cost, processing speed, resilience, and data quality all contribute to effective data pipeline optimization.

Data Pipeline AWS Datasets Data

SNP Unlocks SAP Data for Advanced Analytics with Its Snowflake Native App

Snowflake

MARCH 14, 2024

As a cohesive ERP solution, SAP is often one of the largest data resources in an organization, containing everything from financial and transactional data to master information about customers, vendors, materials, facilities, planning and even HR. What’s the challenge with unlocking SAP data?

IT Data Ingestion Data AWS

Confluent named a leader in The Forrester Wave™: Cloud Data Pipelines, Q4 2023

Confluent

JANUARY 24, 2024

Learn why Confluent was named as a leader among cloud data pipelines, innovating, in our opinion, every industry with real-time stream processing and analytics, cloud-native Apache Kafka®, and robust developer tooling.

Data Pipeline Cloud Kafka Data

Top 20 Azure Data Engineering Projects in 2023 [Source Code]

Knowledge Hut

NOVEMBER 2, 2023

Azure Data engineering projects are complicated and require careful planning and effective team participation for a successful completion. While many technologies are available to help data engineers streamline their workflows and guarantee that each aspect meets its objectives, ensuring that everything works properly takes time.

Data Engineering Data Engineer Coding Project

One Big Cluster Stuck: The Right Tool for the Right Job

Cloudera

JUNE 26, 2023

Here are some tips and tricks of the trade to prevent well-intended yet inappropriate data engineering and data science activities from cluttering or crashing the cluster. For data engineering and data science teams, CDSW is highly effective as a comprehensive platform that trains, develops, and deploys machine learning models.

ETL Tools Programming Language Datasets Data Pipeline

1. Streamlining Membership Data Engineering at Netflix with Psyberg

Netflix Tech

NOVEMBER 14, 2023

By Abhinaya Shetty and Bharath Mummadisetty. At Netflix, our Membership and Finance Data Engineering team harnesses diverse data related to plans, pricing, membership life cycle, and revenue to fuel analytics, power various dashboards, and make data-informed decisions. We expect complete and accurate data at the end of each run.

Data Engineering Data Engineer Engineering Metadata

New Snowflake Features Released in May–July 2023

Snowflake

AUGUST 16, 2023

Applications: Snowflake Native App Framework now available in AWS (public preview). Snowflake Native Apps are an entirely new way to put data to work. Data comes in a continuous manner, and often a separate architecture is required to handle streaming data.

Scala Transportation Kafka Data Lake

5 Hard Truths About Generative AI for Technology Leaders

Monte Carlo

JANUARY 3, 2024

Data teams are scrambling to answer the call. That quick check of the box feels like a step forward, but if you aren’t already thinking about how to connect LLMs with your proprietary data and business context to actually drive differentiated value, you’re behind. Hard truth #4: Your data isn’t ready yet anyway.

Technology Database Data Governance Data Engineering

Data Enrichment in Existing Data Pipelines Using Confluent Cloud

Confluent

AUGUST 16, 2022

Learn how you can integrate data streams into your environment, and enrich data across your existing data pipelines using Confluent Cloud.

Data Pipeline Cloud Data

Top Confluent Alternatives

Striim

AUGUST 26, 2023

While Confluent is a well-known option for data streaming platforms, its complexity can pose significant challenges for businesses. This complexity not only has technical repercussions but also impacts business operations by leading to unreliable or missing data, which can be both costly and time-consuming to rectify.

MongoDB Google Cloud Kafka AWS

Auto-Diagnosis and Remediation in Netflix Data Platform

Netflix Tech

JANUARY 13, 2022

By Vikram Srivastava and Marcelo Mayworm. Netflix has one of the most complex data platforms in the cloud, on which our data scientists and engineers run batch and streaming workloads. And we can’t discount the productivity impact it causes on data platform users.

Kafka Big Data Data Machine Learning

Unleashing the Power of CDC With Snowflake

Workfall

JUNE 12, 2023

In this dynamic realm of data engineering, a monumental challenge takes centre stage: efficiently managing the ever-changing tides of real-time data. Data, the lifeblood of organisations, holds the key to unlocking untapped potential and propelling businesses forward. In this blog, we will cover: What Is CDC and Its Benefits?

Telecommunication Metadata Healthcare Finance

7 Essential Data Cleaning Best Practices

Monte Carlo

APRIL 1, 2024

But, for data engineers, there’s something else that comes pretty close to the top of that list: clean data. Data cleaning is an essential step to ensure your data is safe from the adage “garbage in, garbage out.” Table of Contents: 1. Define Clear Data Quality Standards 2. Implement Routine Data Audits

High Quality Data Datasets Data Data Pipeline

For your eyes only: improving Netflix video quality with neural networks

Netflix Tech

NOVEMBER 17, 2022

To do so, we continuously push the boundaries of streaming video quality and leverage the best video technologies. There are, roughly speaking, two steps to encode a video in our pipeline: Video preprocessing, which encompasses any transformation applied to the high-quality source video prior to encoding.

Media Architecture Algorithm Designing

Snowflake Expands Programmability to Bolster Support for AI/ML and Streaming Pipeline Development

Snowflake

JUNE 28, 2023

At Snowflake, we’re helping data scientists, data engineers, and application developers build faster and more efficiently in the Data Cloud. To make it even easier to process data with Snowpark Python UDFs and Stored Procedures, we have added support for Python 3.9 and for unstructured data, now in public preview.

Pipeline-centric Programming Language Government Python

Life of a Netflix Partner Engineer: The case of extra 40 ms

Netflix Tech

DECEMBER 14, 2020

The case of the extra 40 ms. By John Blair, Netflix Partner Engineering. The Netflix application runs on hundreds of smart TVs, streaming sticks, and pay TV set-top boxes. Meanwhile, a field engineer for the chip vendor had diagnosed the root cause: Netflix’s Android TV application, called Ninja, was not delivering audio data quickly enough.

Bytes Engineering Manufacturing Coding

Data Engineering Weekly #162

Data Engineering Weekly

MARCH 10, 2024

Editor’s Note: Chennai Meetup Wrap-Up & Preparation Work Started for DEWCon. I am so grateful for the enthusiastic participants who made our Chennai Data Heroes - Community for Data Folks meetup vibrant! Big thanks to our insightful speakers: Hareshkumar Selvakumar talked about his work on Data Products for PayPal.

Data Engineering Data Engineer Engineering Datasets

Data Engineering Weekly #151

Data Engineering Weekly

DECEMBER 3, 2023

The blog is an excellent read to understand late-arriving data, backfilling, and incremental processing complications.

Data Engineering Data Engineer Engineering Bytes

A Gentle Introduction to Analytical Stream Processing

Towards Data Science

APRIL 3, 2023

Building a Mental Model for Engineers and Anyone in Between. Stream processing can be handled gently and with care, or wildly, and almost out of control! By processing a smaller set of data, more often, you effectively divide and conquer a data problem that may otherwise be cost- and time-prohibitive.

Process Data Lake Systems Bytes

10 Essential Azure Data Engineer Skills to Improve in 2023

Knowledge Hut

NOVEMBER 17, 2023

Azure Data Engineers play an important role in building efficient, secure, and intelligent data solutions on Microsoft Azure's powerful platform. The position of Azure Data Engineers is becoming increasingly important as businesses attempt to use the power of data for strategic decision-making and innovation.

Data Engineering Data Engineer Engineering Data Lake

Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

JUNE 28, 2022

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. containing data that may have to be used to enrich the streaming data.

Process Kafka SQL Machine Learning

Striim Cloud on AWS: Unify your data with a fully managed change data capture and data streaming service

Striim

NOVEMBER 30, 2022

Businesses of all scales and industries have access to increasingly large amounts of data, which need to be harnessed effectively. According to an IDG Market Pulse survey, companies collect data from 400 sources on average. However, fragmented data can slow down the delivery of great product experiences and internal operations.

AWS Cloud Management MySQL
