Blog - Data Engineering Digest

2. Diving Deeper into Psyberg: Stateless vs Stateful Data Processing

Netflix Tech

NOVEMBER 14, 2023

By Abhinaya Shetty, Bharath Mummadisetty. In the inaugural blog post of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix’s Membership and Finance data engineering team.

Data Process

Data Process Process Metadata Finance

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

MAY 16, 2023

I won’t bore you with the importance of data quality in the blog. Speed vs. Correctness vs. Time [SCT theorem] Just like the CAP theorem, there's a balance to be struck between speed, correctness, and Time in a data pipeline. Let’s talk about the data processing types. Why is Data Quality Expensive?

Engineering

Engineering Kafka Data Pipeline Data Warehouse

Please Use Streaming Workload to Benchmark Vector Databases

Towards Data Science

DECEMBER 1, 2023

Why static workload is insufficient and what I learned by comparing HNSWLIB and DiskANN using streaming workload. Vector databases are built for high-dimensional vector retrieval. Many vector databases are now measuring their performance using this approach in their tech blogs. Streaming workload tells you a lot more.

Database

Database Algorithm Datasets Kafka

Cloudera Streaming Analytics 1.4: the unification of SQL batch and streaming

Cloudera

JUNE 7, 2021

In October of 2020 Cloudera acquired Eventador, and Cloudera Streaming Analytics (CSA) 1.3.0 was released early in 2021. It was the first release to incorporate SQL Stream Builder (SSB) from the acquisition, and brought rich SQL processing to the already robust Apache Flink offering. Why batch + streaming?

SQL

SQL Manufacturing Finance Architecture

One Big Cluster Stuck: The Right Tool for the Right Job

Cloudera

JUNE 26, 2023

Impala vs Spark Use Impala primarily for analytical workloads triggered by end users. That depends on the business use case, use case complexity, workflow complexity, and whether batch or streaming data is required.

ETL Tools

ETL Tools Programming Language Datasets Data Pipeline

DataOps Architecture: 5 Key Components and How to Get Started

Databand.ai

AUGUST 30, 2023

It aims to streamline data ingestion, processing, and analytics by automating and integrating various data workflows. It encompasses the systems, tools, and processes that enable businesses to manage their data more efficiently and effectively. As a result, they can be slow, inefficient, and prone to errors.

Architecture

Architecture Data Ingestion Data Governance Data Cleanse

Data Engineering Weekly #124

Data Engineering Weekly

MARCH 26, 2023

Last year around this time, Bundling vs. Unbundling was the talk of the town. The blog highlights that the job is not just writing SQL but providing a strategic business solution for an organization. The blog is very educative for me about measuring the lifetime value of a customer and segmentation on buying behavior.

Data Engineering

Data Engineering Data Engineer Engineering Lambda Architecture

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

Modernize their architecture to ingest data in real-time using the new streaming features available in CDP Private Cloud Base in order to make the data available to their users quickly. Support Kafka connectivity to HDFS, AWS S3 and Kafka Streams. New Features CDH to CDP. Identifying areas of interest for Customer A. Phase 1: Planning.

Cloud

Cloud Kafka Professional Services Metadata

How to Become Databricks Certified Apache Spark Developer?

ProjectPro

FEBRUARY 21, 2023

Apache Spark is the most efficient, scalable, and widely used in-memory data computation tool capable of performing batch-mode, real-time, and analytics operations. The next evolutionary shift in the data processing environment will be brought about by Spark due to its exceptional batch and streaming capabilities.

Scala

Scala Programming Language Java Hadoop

Introduction to Streaming Data

Cloud Academy

JULY 16, 2019

Designing a streaming data pipeline presents many challenges, particularly around specific technology requirements. In this blog post, we will walk through some initial scoping steps and walk through an example.

Manufacturing

Manufacturing MySQL Data Cloud

Simplify Metrics on Apache Druid With Rill Data and Cloudera

Cloudera

JULY 21, 2022

Cloudera users can securely connect Rill to a source of event stream data, such as Cloudera DataFlow, model data into Rill’s cloud-based Druid service, and share live operational dashboards within minutes via Rill’s interactive metrics dashboard or any connected BI solution. Efficient batch data processing. Apache Hive.

BI

BI Digital Media Data Warehouse Kafka

Striim Cloud on AWS: Unify your data with a fully managed change data capture and data streaming service

Striim

NOVEMBER 30, 2022

Companies that can’t process and analyze it to glean useful insights for their operations are falling behind. We are excited to launch Striim Cloud on AWS: a real-time data integration and streaming platform that connects clouds, data and applications with unprecedented speed and simplicity.

AWS

AWS Cloud Management Google Cloud

Towards a Reliable Device Management Platform

Netflix Tech

AUGUST 30, 2021

By Benson Ma, Alok Ahuja. Introduction: At Netflix, hundreds of different device types, from streaming sticks to smart TVs, are tested every day through automation to ensure that new software releases continue to deliver the quality of the Netflix experience that our customers enjoy. In this blog post, we will focus on the latter feature set.

Management

Management Kafka Transportation Cloud

PinCompute: A Kubernetes Backed General Purpose Compute Platform for Pinterest

Pinterest Engineering

OCTOBER 31, 2023

We leverage Custom Resources (CR) to define the kinds of workloads supported by the platform, and the platform offers a range of workload orchestration capabilities which supports both batch jobs and long running services in various forms. It provides three groups of APIs: workload APIs, operation APIs, and insight APIs. G5 family).

Architecture

Architecture Pipeline-centric Accessible Accessibility

Data Science Course Fees, Eligibility & Duration

Knowledge Hut

JANUARY 22, 2024

Online vs. In-Person Delivery The choice between online and in-person delivery for data science courses depends on your learning preferences and situational factors. Key topics include data collection, storage formats like databases, data warehouses, data lakes, transforming and processing data using ETL, batch and stream processing.

Data Science

Data Science Certification Education Data Lake

Change Data Capture: What It Is and How to Use It

Rockset

JUNE 7, 2021

Change data capture (CDC) is the process of recognising when data has been changed in a source system so a downstream process or system can action that change. Push vs Pull There are two main ways for change data capture systems to operate. What Is Change Data Capture?

IT

IT Kafka Database MongoDB

Complex Event Generation for Business Process Monitoring using Apache Flink

Zalando Engineering

JULY 12, 2017

While developing Zalando’s real-time business process monitoring solution, we encountered the need to generate complex events upon the detection of specific patterns of input events. In this blog post we describe the generation of such events using Apache Flink, and share our experiences and lessons learned in the process.

Process

Process Kafka AWS Architecture

Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework

LinkedIn Engineering

JANUARY 19, 2024

This multi-entity handover process involves huge amounts of data updating and cloning. Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers to ensuring business as usual (zero disruptions) and a cohesive customer experience. Push for eventual success of the request.

Recruitment

Recruitment Data Process Process Kafka

Tableau Operational Dashboards and Reporting on DynamoDB - Evaluating Redshift and Athena

Rockset

AUGUST 13, 2019

Organizations speak of operational reporting and analytics as the next technical challenge in improving business processes and efficiency. We consider several approaches, all of which use DynamoDB Streams but differ in how the dashboards are served: 1. DynamoDB Streams + Lambda + Kinesis Firehose + Redshift 2.

BI

BI NoSQL PostgreSQL AWS

Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite

Data Engineering Podcast

MARCH 25, 2023

Summary The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be.

MySQL

MySQL Python Architecture Machine Learning

Real-Time Analytics and Monitoring Dashboards with Apache Kafka and Rockset

Confluent

SEPTEMBER 26, 2019

The significant difference today is that companies use Apache Kafka as an event streaming platform for building mission-critical infrastructures and core operations platforms. Batch processing and reports after minutes or even hours are not sufficient. Apache Kafka as an event streaming platform for real-time analytics.

Kafka

Kafka BI SQL Datasets

Stream Rows and Kafka Topics Directly into Snowflake with Snowpipe Streaming

Snowflake

MARCH 2, 2023

That proves to be a difficult task for data engineering teams that have to manage separate infrastructure for batch data and streaming data. To address this challenge, we are happy to announce the public preview of Snowpipe Streaming as the latest addition to our Snowflake ingestion offerings. How does Snowpipe Streaming work?

Kafka

Kafka Data Ingestion Data Pipeline Cloud Storage

Stream Processing vs. Real-Time Analytics Databases

Rockset

MARCH 27, 2023

This is part two in Rockset’s Making Sense of Real-Time Analytics on Streaming Data series. In part 1, we covered the technology landscape for real-time analytics on streaming data. In this post, we’ll explore the differences between real-time analytics databases and stream processing frameworks.

Database

Database Process Scala SQL

The Future of Data Warehousing

Monte Carlo

JANUARY 16, 2024

In this blog post, we’ll look at six innovations that are shaping the future of data warehousing, as well as challenges and considerations that organizations should keep in mind. Easier to stream real-time data 3. Data lake and data warehouse convergence The data lake vs data warehouse question is constantly evolving.

Data Lake

Data Lake Data Warehouse Unstructured Data AWS

Lyft’s Reinforcement Learning Platform

Lyft Engineering

MARCH 12, 2024

This highlights another strength of RL models — they optimize for the whole decision making process towards a target metric in potentially changing environments. More typically, we perform batch updates anywhere from every 10 minutes to 24 hours. The metric can be fairly high level, for example conversion or revenue.

Algorithm

Algorithm Machine Learning Datasets Food

Data Engineering Weekly #116

Data Engineering Weekly

JANUARY 29, 2023

The conversation around data observability points out the growing gap in data observability [aka finding the things] vs. fixing the data quality. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query. Streaming plus batch unified in a single platform.

Data Engineering

Data Engineering Data Engineer Engineering Deep Learning

Data Engineering Weekly #112

Data Engineering Weekly

DECEMBER 18, 2022

The author writes an exciting blog, Modern data stack in a Box!! Confessions of a Data Guy: Dataframe Showdown – Polars vs. Spark vs. Pandas vs. DataFusion. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query. Streaming plus batch unified in a single platform.

Data Engineering

Data Engineering Data Engineer Engineering Relational Database

Real-Time Analytics on Kinesis Event Streams Using Rockset, Druid, Elasticsearch and Redshift

Rockset

FEBRUARY 24, 2022

Which databases are optimized for ingesting streaming events and analyzing them in real time? We’ll start by evaluating three options for running real-time analytics on AWS Kinesis event streams. I’ll focus on the use of events to help understand, analyze and diagnose problems using various OLAP databases and AWS Kinesis data streams.

AWS

AWS Amazon Web Services Kafka SQL

Incremental Processing using Netflix Maestro and Apache Iceberg

Netflix Tech

NOVEMBER 20, 2023

By Jun He, Yingyi Zhang, and Pawan Dixit. Incremental processing is an approach to process new or changed data in workflows. The key advantage is that it only incrementally processes data that are newly added or updated to a dataset, instead of re-processing the complete dataset.

Process

Process Data Pipeline Datasets SQL

The Good and the Bad of Apache Kafka Streaming Platform

AltexSoft

OCTOBER 21, 2022

Similar to Google in web browsing and Photoshop in image processing, it became a gold standard in data streaming, preferred by 70 percent of Fortune 500 companies. Apache Kafka is an open-source, distributed streaming platform for messaging, storing, processing, and integrating large data volumes in real time. Kafka APIs.

Kafka

Kafka Hadoop ETL Tools Big Data

Data Vault on Snowflake: Feature Engineering and Business Vault

Snowflake

MARCH 30, 2023

Just as a dimensional data model will transform data for human consumption, ML models need raw data transformed for ML model consumption through a process called “feature engineering.” Based on Tecton blog So is this similar to data engineering pipelines into a data lake/warehouse?

Engineering

Engineering Raw Data Data Science Scala

Data Movement in Netflix Studio via Data Mesh

Netflix Tech

JULY 26, 2021

Operational Reporting is a reporting paradigm specialized in covering high-resolution, low-latency data sets, serving detailed day-to-day activities¹ and processes of a business domain. Data Mesh is a fully managed, streaming data pipeline product used for enabling Change Data Capture (CDC) use cases. tactical) in nature.

Data

Data Data Pipeline MySQL Data Warehouse

Microservices, Apache Kafka, and Domain-Driven Design

Confluent

JUNE 26, 2019

In these projects, microservice architectures use Kafka as an event streaming platform. Domain-driven design is used to define the different bounded contexts which represent the various business processes that the application needs to perform. Apache Kafka – An event streaming platform for microservices. Microservices.

Kafka

Kafka Designing Architecture ETL Tools

Cloudera DataFlow for the Public Cloud: A technical deep dive

Cloudera

AUGUST 16, 2021

In this blog post we’re revisiting the challenges that come with running Apache NiFi at scale before we take a closer look at the architecture and core features of CDF-PC. Since it supports both structured and unstructured data for streaming and batch integrations, Apache NiFi is quickly becoming a core component of modern data pipelines.

Cloud

Cloud Unstructured Data Utilities Metadata

Data Engineering Weekly #115

Data Engineering Weekly

JANUARY 22, 2023

Editor’s Note: Update on our blog series One of the promises I made toward the end of 2022 is to publish more of my thoughts and industry observation of data engineering trends. Data Catalog - A broken promise A classic blog triggers a few conversations about Data Catalog and its future. Sign up free to test out the tool today.

Data Engineering

Data Engineering Data Engineer Engineering Data Pipeline

Real-Time CDC With Rockset And Confluent Cloud

Rockset

MARCH 26, 2023

To do this, Rockset has partnered with Confluent, the original creators of Kafka who provide the cloud-native data streaming platform Confluent Cloud. At Confluent I talked often about the fanciful sounding “Stream and Table Duality”. You can learn more about Confluent vs. Kafka over on Confluent’s site.

Cloud

Cloud PostgreSQL Kafka Database

How to Become an Azure Data Engineer in 2023?

ProjectPro

JANUARY 19, 2022

Read this blog till the end to learn more about the roles and responsibilities, necessary skillsets, average salaries, and various important certifications that will help you build a successful career as an Azure Data Engineer. Data engineers will be in high demand as long as there is data to process.

Data Engineering

Data Engineering Data Engineer Engineering Scala

What is AWS Data Pipeline?

ProjectPro

JUNE 16, 2022

Generally, it consists of three key elements: a source, processing step(s), and destination to streamline movement across digital platforms. This blog will teach you about AWS Data Pipeline, its architecture, components, and benefits. 2) What is AWS data pipeline vs. AWS glue?

Data Pipeline

Data Pipeline AWS Amazon Web Services Data Consolidation

Optimizing data warehouse storage

Netflix Tech

DECEMBER 21, 2020

There are several benefits of such optimizations like saving on storage, faster query time, cheaper downstream processing, and an increase in developer productivity by removing additional ETLs written only for query performance improvement. We will publish a follow-up blog post about AutoAnalyze in the future.

Data Warehouse

Data Warehouse Metadata Algorithm Data

20+ Data Engineering Projects for Beginners with Source Code

ProjectPro

AUGUST 24, 2021

And, out of these professions, this blog will discuss the data engineering job role. The data in Kafka is analyzed with Spark Streaming API, and the data is stored in a column store called HBase. This helps improve customer service, enhance customer loyalty, and generate new revenue streams for the airline.

Data Engineering

Data Engineering Data Engineer Coding Project

Hadoop MapReduce vs. Apache Spark Who Wins the Battle?

ProjectPro

NOVEMBER 11, 2014

Confused over which framework to choose for big data processing - Hadoop MapReduce vs. Apache Spark. This blog helps you understand the critical differences between two popular big data frameworks. Spark vs. Hadoop MapReduce Comparison -The Bottomline Will Apache Spark Eliminate Hadoop MapReduce?

Hadoop

Hadoop Scala Machine Learning Java

Cloudera Flow Management Continuous Delivery while Minimizing Downtime

Cloudera

JANUARY 19, 2021

Cloudera Flow Management, based on Apache NiFi and part of the Cloudera DataFlow platform, is used by some of the largest organizations in the world to facilitate an easy-to-use, powerful, and reliable way to distribute and process data at high velocity in the modern big data ecosystem. DataFlow Process Group. Flow Development.

Management

Management Big Data Ecosystem Kafka AWS

AWS Glue-Unleashing the Power of Serverless ETL Effortlessly

ProjectPro

FEBRUARY 8, 2023

Read this blog to understand everything about AWS Glue that makes it one of the most popular data integration solutions in the industry. For example, Finaccel, a leading tech company in Indonesia, leverages AWS Glue to easily load, process, and transform their enterprise data for further processing. billion by 2026?

AWS

AWS Scala Metadata Data Lake

Consistent caching mechanism in Titus Gateway

Netflix Tech

NOVEMBER 3, 2022

This blog post presents how our current iteration of Titus deals with high API call volumes by scaling out horizontally. We introduce a caching mechanism in the API gateway layer, allowing us to offload processing from singleton leader elected controllers without giving up strict data consistency and guarantees clients observe.

Systems

Systems Architecture Process AWS
