How to Implement a Data Pipeline Using Amazon Web Services?

Analytics Vidhya

Introduction: The demand for data to feed machine learning models, data science research, and time-sensitive insights is higher than ever; as a result, processing that data has become complex. Data pipelines are necessary to make these processes efficient.
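As a rough illustration of what one step of such a pipeline can look like, here is a minimal sketch in Python using boto3. The bucket names, keys, and the cleaning rule are hypothetical, not from the article itself:

```python
import json

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical extract-transform-load step, e.g. the body of a Lambda:
    # read raw events from one S3 bucket, filter them, write them back out.
    obj = s3.get_object(Bucket="raw-events", Key="2024/04/events.json")
    records = json.loads(obj["Body"].read())

    # Keep only records that carry a user_id (illustrative cleaning rule).
    cleaned = [r for r in records if r.get("user_id")]

    s3.put_object(
        Bucket="clean-events",
        Key="2024/04/events.json",
        Body=json.dumps(cleaned).encode("utf-8"),
    )
    return {"records_in": len(records), "records_out": len(cleaned)}
```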


Data News — Week 24.15

Christophe Blefari

The fest we deserve (credits). I hope this Data News finds you well. This episode is in French, and we mainly discussed the possible end of the modern data stack. Build analytics at Hive.co: the journey Oleg and his team went through to implement a modern data stack.


An AI Chat Bot Wrote This Blog Post …

DataKitchen

ChatGPT> DataOps, or data operations, is a set of practices and technologies that organizations use to improve the speed, quality, and reliability of their data analytics processes. The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes.
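To make that definition concrete, one common DataOps practice is automated data testing run in CI before data ships downstream. A minimal sketch, assuming pandas and a hypothetical orders.parquet file with made-up column names:

```python
import pandas as pd

def test_orders_are_valid():
    # DataOps-style automated quality gate; fails the CI run if the
    # data breaks these expectations. Path and columns are hypothetical.
    df = pd.read_parquet("warehouse/orders.parquet")
    assert df["order_id"].notna().all(), "every order needs an ID"
    assert (df["amount"] >= 0).all(), "order amounts must be non-negative"
```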


A Notebook is all I want or Don't

Data Engineering Weekly

There is a lot of context missing in that tweet, so I decided to write a blog about it. People have reservations about using tools like Jupyter Notebook in production pipelines, for good reason. However, modern notebook platforms like Databricks integrate seamlessly with Git to support pull requests and code review processes.
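One common pattern for running a notebook as a production pipeline step is parameterized execution with papermill. A minimal sketch; the notebook names, paths, and parameters are hypothetical:

```python
import papermill as pm

# Run a parameterized notebook as a pipeline step; the executed copy,
# with all cell outputs preserved, doubles as an execution log.
pm.execute_notebook(
    "transform.ipynb",                  # source notebook, versioned in Git
    "runs/transform-2024-04-15.ipynb",  # executed copy kept for auditing
    parameters={"input_path": "s3://raw/events/", "run_date": "2024-04-15"},
)
```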


Building an Open Data Processing Pipeline for IoT

Cloudera

Last week, Cloudera introduced an open, end-to-end architecture for IoT, along with the components needed to satisfy today's enterprise needs across operational technology (OT), information technology (IT), data analytics and machine learning (ML), and both modern and traditional application development, deployment, and integration.
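A typical first hop in such an architecture bridges OT-side sensor traffic into an IT-side event bus. A minimal sketch assuming an MQTT broker and a Kafka cluster; the broker addresses and topic names are hypothetical, and this is an illustration of the pattern, not Cloudera's components:

```python
import paho.mqtt.client as mqtt
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")

def on_message(client, userdata, msg):
    # Forward each MQTT sensor reading into Kafka, keyed by its MQTT
    # topic, so downstream analytics and ML jobs can consume it.
    producer.send("iot-telemetry", key=msg.topic.encode(), value=msg.payload)

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker", 1883)
client.subscribe("sensors/#")   # all device telemetry topics
client.loop_forever()
```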


The Stream Processing Model Behind Google Cloud Dataflow

Towards Data Science

Balancing correctness, latency, and cost in unbounded data processing. Google Dataflow is a fully managed data processing service that provides serverless, unified stream and batch processing. Apache Beam lets users define processing logic based on the Dataflow model.
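In the Dataflow model, that correctness/latency/cost balance is expressed through windows, triggers, and accumulation modes. A minimal Apache Beam sketch in Python (the Pub/Sub topic is hypothetical, and a real run would need streaming pipeline options): fixed one-minute event-time windows, with early speculative results every 30 seconds of processing time.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    AfterWatermark,
)

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/demo/topics/events")
        # Fixed 60s event-time windows; emit early (speculative) results
        # every 30s of processing time, then a final result once the
        # watermark passes the end of the window.
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(early=AfterProcessingTime(30)),
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | "Count" >> beam.combiners.Count.Globally().without_defaults()
    )
```

Lower trigger intervals cut latency but raise cost (more firings); waiting only for the watermark favors correctness and cost over latency.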


An educational side project

The Pragmatic Engineer

… for the simulation engine; Go on the backend; PostgreSQL for the data layer; React and TypeScript on the frontend; Prometheus and Grafana for monitoring and observability. And if you were wondering how all of this was built, Juraj documented his process in an incredible 34-part blog series. You can read it here.
