Blog - Data Engineering Digest

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. It allows multiple data processing engines, such as Flink, NiFi, Spark, Hive, and Impala, to access and analyze data in simple, familiar SQL tables. Currently, Iceberg support in CSP is in technical preview mode.

Process

Process SQL Kafka Database

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

NOVEMBER 29, 2023

Introduction: At Lyft, we have used systems like ClickHouse and Apache Druid for near real-time and sub-second analytics. This is crucial for use cases like market signaling and forecasting, which benefit from, and depend upon, the most up-to-date information. An example of how we use Druid rollup at Lyft.

Kafka

Kafka Data Ingestion Datasets Architecture

Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering

NOVEMBER 7, 2023

A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results.

Kafka

Kafka Java Software Engineer Software Engineering

Building Real-time Machine Learning Foundations at Lyft

Lyft Engineering

JUNE 28, 2023

However, streaming data was not supported as a first-class citizen across many of the platform’s systems — such as training, complex monitoring, and others. While several teams were using streaming data in their Machine Learning (ML) workflows, doing so was a laborious process, sometimes requiring weeks or months of engineering effort.

Machine Learning

Machine Learning Building Metadata Kafka

Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

JULY 18, 2022

In part 1 of this blog we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, can make it easy to acquire data from wherever it originates and move it efficiently to make it available to other applications in a streaming fashion. Use case recap.

Process

Process Kafka Scala SQL

Data Engineering Weekly #167

Data Engineering Weekly

APRIL 14, 2024

With the 1-bit LLM model, the researchers suggest that instead of FP16 (16-bit half-precision floating point) or FP32 (32-bit single-precision floating point), you can build an equally efficient model using the ternary digit set {-1, 0, 1}. GitHub shares some insights on how GitHub engineers use GitHub Copilot.
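To make the ternary idea concrete, here is a minimal Python sketch of mapping floating-point weights onto {-1, 0, 1}. It assumes an absolute-mean scaling step followed by rounding and clipping; the teaser does not say which quantization scheme the researchers use, so the function name `ternarize` and the scaling rule are illustrative assumptions rather than the paper's method.

```python
from typing import List, Tuple


def ternarize(weights: List[float]) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, 1} using an assumed absolute-mean scale.

    Returns the ternary values and the scale needed to approximately
    reconstruct the originals (ternary_value * scale).
    """
    if not weights:
        return [], 0.0
    # Assumed scaling rule: the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into the ternary set.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


if __name__ == "__main__":
    values, scale = ternarize([0.9, -0.8, 0.05, 0.2])
    print(values, scale)
```

Each weight then needs only one of three states plus a shared scale, which is where the storage saving over FP16 or FP32 comes from.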

Data Engineering

Data Engineering Data Engineer Engineering Business Intelligence

Lessons from debugging a tricky direct memory leak

Pinterest Engineering

SEPTEMBER 29, 2023

Sanchay Javeria | Software Engineer, Ads Data Infrastructure To support metrics reporting for ads from external advertisers and real-time ad budget calculations at Pinterest, we run streaming pipelines using Apache Flink. This was intentionally generous to buy us enough time to fix the issue.

Utilities

Utilities Coding Kafka Engineering

Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

JUNE 28, 2022

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. Building real-time streaming analytics data pipelines requires the ability to process data in the stream.

Process

Process Kafka SQL Machine Learning

SQL Streambuilder Data Transformations

Cloudera

FEBRUARY 21, 2023

SQL Stream Builder (SSB) is a versatile platform for data analytics using SQL as a part of Cloudera Streaming Analytics, built on top of Apache Flink. It enables users to easily write, run, and manage real-time continuous SQL queries on stream data with a smooth user experience. What is a data transformation?

SQL

SQL Kafka Raw Data Data

Your Parents Still Don’t Know What a Hashtag Is. Let’s Teach Them the Basics of Machine Learning and Streaming Data

Cloudera

OCTOBER 13, 2021

Quite often, the digital natives of the family — you — have to explain to the analog fans of the family what PDFs are, how to use a hashtag, a phone camera, or a remote. Imagine if you had to explain what machine learning is and how to use it. Using these books, you can answer questions such as: . Have you heard about streaming?

Machine Learning

Machine Learning Data Ingestion Algorithm Technology

Implementing and Using UDFs in Cloudera SQL Stream Builder

Cloudera

FEBRUARY 22, 2023

Cloudera’s SQL Stream Builder (SSB) is a versatile platform for data analytics using SQL. As a part of Cloudera Streaming Analytics, it enables users to easily write, run, and manage real-time SQL queries on streams with a smooth user experience, while it attempts to expose the full power of Apache Flink.

SQL

SQL Raw Data Programming Language Kafka

Data Engineering Weekly #109

Data Engineering Weekly

NOVEMBER 27, 2022

I have a long list of thoughts on this conversation, which might need a blog post on its own. Let’s take an example of Slack features, “Compose a DM,” “Channel Selection,” “Invite Members,” or “Invite Reminder”? It is a great overview of streaming infrastructure characteristics.

Data Engineering

Data Engineering Data Engineer Engineering SQL

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

MAY 16, 2023

I won’t bore you with the importance of data quality in the blog. The bias toward correctness will increase the processing time, which may not be feasible when speed is a priority. Let’s talk about the data processing types. Two-Phase WAP The Two-Phase WAP, as the name suggests, follows two copy processes.

Engineering

Engineering Kafka Data Pipeline Data Warehouse

Cloudera Streaming Analytics 1.4: the unification of SQL batch and streaming

Cloudera

JUNE 7, 2021

In October of 2020 Cloudera acquired Eventador and Cloudera Streaming Analytics (CSA) 1.3.0 It was the first release to incorporate SQL Stream Builder (SSB) from the acquisition, and brought rich SQL processing to the already robust Apache Flink offering. Why batch + streaming? A bit of Flink history.

SQL

SQL Manufacturing Finance Architecture

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

Summary Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. What are some of the main use cases for Spark? Who uses Spark?

Scala

Scala MySQL Kafka Hadoop

Addressing the Challenges of Sample Ratio Mismatch in A/B Testing

DoorDash Engineering

OCTOBER 17, 2023

For example, if two reasonably sized groups are expected to be split 50/50, but instead show a 55/45 split, the assignment process likely is compromised. Cautionary tales of faux gains and real losses Example 1: The $10 Million Mirage Imagine that your target is to improve weekly revenue per user.
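One common way to check a split like this is a chi-square goodness-of-fit test on the group counts. The Python sketch below is an illustration under stated assumptions, not DoorDash's implementation: the function name `srm_p_value`, the group sizes, and the 0.001 alert threshold are all hypothetical choices made for the example.

```python
import math


def srm_p_value(count_a: int, count_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-group split (1 degree of freedom)."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no external library is needed.
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # Hypothetical experiment: 10,000 users expected 50/50, observed 55/45.
    p = srm_p_value(5500, 4500)
    # Assumed alert threshold; pick one that suits your own experiments.
    print("possible sample ratio mismatch" if p < 0.001 else "split looks consistent")
```

With groups this large, a 55/45 split yields a vanishingly small p-value, which is why such a deviation points to a problem in the assignment process rather than chance.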

Education

Education Kafka Algorithm Data Warehouse

Data Engineering Annotated Monthly – April 2022

Big Data Tools

MAY 19, 2022

Take the new dynamic tasks, for example. Based on the “map-reduce” paradigm, they allow you to compute the next DAGs from the current state – a very useful feature, which incidentally has been available in Luigi for a while. Apache Hudi 0.11.0 – This release of the well-known data lake has added many interesting changes.

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

Data Engineers of Netflix - Interview with Pallavi Phadnis

Netflix Tech

OCTOBER 28, 2021

I developed many batch and real-time data pipelines using open source technologies for AOL Advertising and eBay. Over the years, I followed the big data open-source community and Netflix tech blogs closely, and learned a lot about Netflix’s innovative engineering solutions and active contributions to the open-source ecosystem.

Data Engineering

Data Engineering Data Engineer Engineering Software Engineer

Data Engineering Annotated Monthly – September 2022

Big Data Tools

OCTOBER 10, 2022

This time I learned about Brooklin, a LinkedIn service for streaming data in a heterogeneous environment. The official GitHub for the project says that it is characterized by high reliability and throughput, claiming that Brooklin can run hundreds of streaming pipelines simultaneously. This is no doubt very interesting.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Delta: A Data Synchronization and Enrichment Platform

Netflix Tech

OCTOBER 15, 2019

Part I: Overview Andreas Andreakis, Falguni Jhaveri, Ioannis Papapanagiotou, Mark Cho, Poorna Reddy, Tongliang Liu Overview It is a commonly observed pattern for applications to utilize multiple datastores where each is used to serve a specific need such as storing the canonical form of data (MySQL etc.), caching (Memcached etc.),

Transportation

Transportation MySQL Kafka Data

Top 20+ Big Data Certifications and Courses in 2023

Knowledge Hut

SEPTEMBER 6, 2023

This influx of data is handled by robust big data systems which are capable of processing, storing, and querying data at scale. Data Analysis : Strong data analysis skills will help you define ways and strategies to transform data and extract useful insights from the data set. Why Should You Take Big Data Certification?

Big Data

Big Data Certification Hadoop Scala

Data Engineering Annotated Monthly – September 2021

Big Data Tools

OCTOBER 5, 2021

Kafka 3.0.0 – The Apache Software Foundation needed less than one month to go from Kafka version 3.0.0-rc0 to the release of 3.0.0. Lots of happy customers are aware of Apache Camel, an integration framework that makes it possible to connect almost anything to everything. Burton the same person?

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

Complex Event Generation for Business Process Monitoring using Apache Flink

Zalando Engineering

JULY 12, 2017

While developing Zalando’s real-time business process monitoring solution, we encountered the need to generate complex events upon the detection of specific patterns of input events. In this blog post we describe the generation of such events using Apache Flink, and share our experiences and lessons learned in the process.

Process

Process Kafka AWS Architecture

15+ AWS Projects Ideas for Beginners to Practice in 2023

ProjectPro

JULY 23, 2021

AWS (Amazon Web Services) is the world’s leading and widely used cloud platform, with over 200 fully featured services available from data centers worldwide. This blog presents some of the most unique and innovative AWS projects from beginner to advanced levels. Mass Emailing using AWS Lambda 4. Customer Logic Workflow 8.

AWS

AWS Project Amazon Web Services Cloud Computing

Java vs Python for Data Science in 2023-What's your choice?

ProjectPro

JUNE 18, 2021

This blog aims to answer all questions on how Java vs Python compare for data science and which should be the programming language of your choice for doing data science in 2021. According to Popularity of Programming Languages (PYPL), Python and Java are two of the most popular programming languages in use as of June 2021.

Java

Java Data Science Python Programming Language

Gotchas of Streaming Pipelines: Profiling & Performance Improvements

Lyft Engineering

JUNE 6, 2023

Discover how Lyft identified and fixed performance issues in our streaming pipelines. Background Every streaming pipeline is unique. Profiling is the first step of the process, and requires the right tools. Figure 1: PyFlame Graph Example If your pipeline is JVM based, you can use various JVM profilers to identify bottlenecks.

Utilities

Utilities Coding Python Systems

Accelerating Deployments of Streaming Pipelines – Announcing Data in Motion on Kubernetes

Cloudera

MAY 7, 2024

Regardless of industry or use case, there are two key themes that always arise when executing on digital transformation strategies. Data needs to be shared in real time so it can be embedded deeper into everyday operational processes across the organization that are working from the same ground-truth.

Kafka

Kafka Data Lake Cloud Computing Cloud

Build AI-powered Recommendations with Confluent Cloud for Apache Flink® and Rockset

Rockset

MARCH 18, 2024

Today, Confluent announced the general availability of its serverless Apache Flink service. Flink is one of the most popular stream processing technologies, ranked as a top five Apache project and backed by a diverse committer community including Alibaba and Apple. What is RAG?

Cloud

Cloud Building Metadata Kafka

The Stream Processing Model Behind Google Cloud Dataflow

Towards Data Science

APRIL 30, 2024

Balancing correctness, latency, and cost in unbounded data processing Image created by the author. Intro Google Dataflow is a fully managed data processing service that provides serverless unified stream and batch data processing. If you want to learn more about stream processing, I strongly recommend this paper.

Google Cloud

Google Cloud Process Cloud Lambda Architecture

5 Key Takeaways from #Current2023

Cloudera

OCTOBER 17, 2023

With few conferences curating content specific to streaming developers, Current has historically been an important event for anyone trying to keep a pulse on what’s happening in the streaming space. Flink is here to stay. It makes perfect sense that Apache Flink has emerged as the standard.

Database-centric

Database-centric Kafka Pipeline-centric Database

Best Data Processing Frameworks That You Must Know

Knowledge Hut

JANUARY 18, 2024

“Big data analytics” is a phrase coined to refer to datasets so large that traditional data processing software simply can’t manage them. For example, big data is used to pick out trends in economics, and those trends and patterns are used to predict what will happen in the future.

Data Process

Data Process Process Hadoop Scala

Stream Processing vs. Real-Time Analytics Databases

Rockset

MARCH 27, 2023

This is part two in Rockset’s Making Sense of Real-Time Analytics on Streaming Data series. In part 1, we covered the technology landscape for real-time analytics on streaming data. In this post, we’ll explore the differences between real-time analytics databases and stream processing frameworks.

Database

Database Process Scala SQL

How to Use Kafka for Event Streaming in a Microservices Architecture?

Workfall

JUNE 27, 2023

It means that there is a high risk of data loss, but Apache Kafka solves this because it is distributed and can easily scale horizontally, so other servers can take over the workload seamlessly. A good real-world example will be a taxi app. This is where Apache Kafka comes in. Let’s get started!

Kafka

Kafka Architecture AWS Transportation

Getting Started with Cloudera Stream Processing Community Edition

Cloudera

AUGUST 10, 2022

Cloudera has a strong track record of providing a comprehensive solution for stream processing. Cloudera Stream Processing (CSP), powered by Apache Flink and Apache Kafka, provides a complete stream management and stateful processing solution.

Process

Process Kafka PostgreSQL MySQL

Streaming Market Data with Flink SQL Part II: Intraday Value-at-Risk

Cloudera

MAY 18, 2021

In case you missed it, part I starts with a simple case of calculating streaming VWAP. Event-driven and streaming architectures enable complex processing on market events as they happen, making them a natural fit for financial market applications. Value-at-Risk (VaR) is a widely used metric in risk management.

SQL

SQL Java Data Business Analyst

Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam

LinkedIn Engineering

MARCH 23, 2023

Co-Authors: Yuhong Cheng, Shangjin Zhang, Xinyu Liu, and Yi Pan Efficient data processing is crucial in reducing learning curves, simplifying maintenance efforts, and decreasing operational complexity. By unifying these pipelines, we have saved 94% of processing time. Samza, Spark and Apache Flink).

Process

Process Lambda Architecture Kafka Datasets

Large-scale User Sequences at Pinterest

Pinterest Engineering

MAY 2, 2023

This kind of signal plays a critical role in various ML applications, especially for large-scale sequential modeling applications (see example). At Pinterest, most of our streaming jobs are built on top of Apache Flink, because Flink is a mature streaming framework with a lot of adoption in the industry.

Lambda Architecture

Lambda Architecture Datasets Software Engineer Software Engineering

Streaming Market Data with Flink SQL Part I: Streaming VWAP

Cloudera

MAY 4, 2021

Event-driven and streaming architectures enable complex processing on market events as they happen, making them a natural fit for financial market applications. Flink SQL is a data processing language that enables rapid prototyping and development of event-driven and streaming applications. Streaming VWAP.
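For readers unfamiliar with the metric, VWAP (volume-weighted average price) is the total traded value divided by the total traded volume. The sketch below shows the running calculation in plain Python rather than Flink SQL; the class name `RunningVwap` and the sample trades are hypothetical and are not taken from the article.

```python
class RunningVwap:
    """Keeps a running volume-weighted average price as trades arrive."""

    def __init__(self) -> None:
        self.notional = 0.0  # running sum of price * volume
        self.volume = 0.0    # running sum of volume

    def update(self, price: float, volume: float) -> float:
        """Fold one trade into the running totals and return the current VWAP."""
        if volume <= 0:
            raise ValueError("trade volume must be positive")
        self.notional += price * volume
        self.volume += volume
        return self.notional / self.volume


if __name__ == "__main__":
    vwap = RunningVwap()
    # Hypothetical trades as (price, volume) pairs.
    for price, volume in [(10.0, 100), (11.0, 300), (10.5, 100)]:
        print(vwap.update(price, volume))
```

Because only two running sums are kept, the state per symbol stays constant in size no matter how many trades arrive, which is what makes the metric a convenient first streaming example.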

SQL

SQL Business Analyst Data Java

Where’s My Data - A Unique Encounter with Flink Streaming’s Kinesis Connector

Lyft Engineering

AUGUST 14, 2023

Where’s My Data — A Unique Encounter with Flink Streaming’s Kinesis Connector For years now, Lyft has not only been a proponent of but also a contributor to Apache Flink. Context While Lyft runs many streaming applications, the one specifically in question is a persistence job. Data Engineer : “Alert!

Data

Data Engineering Process Management

Incremental Processing using Netflix Maestro and Apache Iceberg

Netflix Tech

NOVEMBER 20, 2023

by Jun He, Yingyi Zhang, and Pawan Dixit Incremental processing is an approach to process new or changed data in workflows. The key advantage is that it only incrementally processes data that are newly added or updated to a dataset, instead of re-processing the complete dataset.
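The general idea can be sketched with a simple watermark: remember the latest change already processed, and on each run pick up only rows changed after it. This Python sketch does not use the Maestro or Iceberg APIs; the `updated_at` field, the function name `select_new_rows`, and the sample rows are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def select_new_rows(rows: List[Row], last_watermark: int) -> Tuple[List[Row], int]:
    """Return rows changed after last_watermark, plus the advanced watermark."""
    # Only rows newer than the stored watermark need processing.
    new_rows = [row for row in rows if row["updated_at"] > last_watermark]
    # Advance the watermark to the newest change seen; keep it if nothing is new.
    new_watermark = max((row["updated_at"] for row in new_rows), default=last_watermark)
    return new_rows, new_watermark


if __name__ == "__main__":
    # Hypothetical dataset: ids with the time (as an integer) of their last change.
    dataset = [
        {"id": 1, "updated_at": 100},
        {"id": 2, "updated_at": 205},
        {"id": 3, "updated_at": 310},
    ]
    batch, watermark = select_new_rows(dataset, last_watermark=200)
    print(batch, watermark)
```

A second run with the returned watermark selects nothing until new changes arrive, so the cost of each run tracks the volume of changes rather than the size of the full dataset.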

Process

Process Data Pipeline Datasets SQL

Data Engineering Annotated Monthly – January 2022

Big Data Tools

FEBRUARY 9, 2022

Waiting a little longer might not be such a bad thing in this case, because now we have even more interesting releases to talk about! Theoretically, all of the components may be available, but the setup process is just a pain. Apache Hop 1.1 — The number of no-code tools is snowballing. Apache Hop is different in many ways.

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools
