Blog - Data Engineering Digest

Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

APRIL 25, 2024

In this blog post, we will discuss such technologies. In the past, this data was too large and complex for traditional data processing tools to handle. However, advances in technology have now made it possible to store, process, and analyze big data quickly and effectively. It is especially true in the world of big data.

Big Data

Big Data Technology NoSQL Hadoop

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. It allows multiple data processing engines, such as Flink, NiFi, Spark, Hive, and Impala, to access and analyze data in simple, familiar SQL tables. The Catalog Type should be set to Hive.

Process

Process SQL Kafka Database

Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering

NOVEMBER 7, 2023

A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results.

Kafka

Kafka Java Software Engineer Software Engineering

Getting Started With Cloudera Open Data Lakehouse on Private Cloud

Cloudera

OCTOBER 16, 2023

Cloudera recently released a fully featured Open Data Lakehouse, powered by Apache Iceberg in the private cloud, in addition to what’s already been available for the Open Data Lakehouse in the public cloud since last year. to stream ingest data sets to Iceberg.

Cloud

Cloud Kafka SQL Data

IBM Technology Chooses Cloudera as its Preferred Partner for Addressing Real Time Data Movement Using Kafka

Cloudera

SEPTEMBER 26, 2023

Organizations increasingly rely on streaming data sources not only to bring data into the enterprise but also to perform streaming analytics that accelerate the process of being able to get value from the data early in its lifecycle.

Kafka

Kafka Technology IT Government

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

NOVEMBER 29, 2023

At Lyft, we have used systems like ClickHouse and Apache Druid for near real-time and sub-second analytics. In this particular blog post, we explain how Druid has been used at Lyft and what led us to adopt ClickHouse for our sub-second analytic system. An example of how we use Druid rollup at Lyft.

Kafka

Kafka Data Ingestion Datasets Architecture

Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

JULY 18, 2022

In part 1 of this blog we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, can make it easy to acquire data from wherever it originates and move it efficiently to make it available to other applications in a streaming fashion. Data decays!

Process

Process Kafka Scala SQL

Data Engineering Weekly #167

Data Engineering Weekly

APRIL 14, 2024

Alibaba: Building a Streaming Lakehouse: Performance Comparison Between Paimon and Hudi. I’m not a big fan of these comparison studies since it all depends on the nature of the data and the business use cases. GitHub: 4 ways GitHub engineers use GitHub Copilot. The impact of LLM on software development is undeniable.

Data Engineering

Data Engineering Data Engineer Engineering Business Intelligence

Building Real-time Machine Learning Foundations at Lyft

Lyft Engineering

JUNE 28, 2023

However, streaming data was not supported as a first-class citizen across many of the platform’s systems — such as training, complex monitoring, and others. While several teams were using streaming data in their Machine Learning (ML) workflows, doing so was a laborious process, sometimes requiring weeks or months of engineering effort.

Machine Learning

Machine Learning Building Metadata Kafka

Cloudera named a Strong Performer in The Forrester Wave™: Streaming Analytics, Q2 2021

Cloudera

JUNE 7, 2021

Cloudera has been named as a Strong Performer in the Forrester Wave for Streaming Analytics, Q2 2021. We are proud to have been named as one of “The 14 providers that matter most” in streaming analytics. CDF enables such enterprises to achieve successful digital transformations with streaming analytics. It’s too late.

Kafka

Kafka Data Ingestion Architecture Cloud

Auto-Diagnosis and Remediation in Netflix Data Platform

Netflix Tech

JANUARY 13, 2022

By Vikram Srivastava and Marcelo Mayworm. Netflix has one of the most complex data platforms in the cloud on which our data scientists and engineers run batch and streaming workloads. Pensive infrastructure comprises two separate systems to support batch and streaming workloads. In the future, we are looking to automate this process.

Kafka

Kafka Big Data Data Machine Learning

Lessons from debugging a tricky direct memory leak

Pinterest Engineering

SEPTEMBER 29, 2023

Sanchay Javeria | Software Engineer, Ads Data Infrastructure. To support metrics reporting for ads from external advertisers and real-time ad budget calculations at Pinterest, we run streaming pipelines using Apache Flink. Framework off-heap memory is reserved for Flink’s internal operations and data structures.

Utilities

Utilities Coding Kafka Engineering

Data Engineering Weekly #157

Data Engineering Weekly

FEBRUARY 4, 2024

The user journey, sales process, marketing campaign, everything falls under a state machine. Data modeling is a collaborative process across business units to capture state changes in business activity. The solution centered around Notebook opens a Flink Session for the Kafka stream and continues the exploration.

Data Engineering

Data Engineering Data Engineer Engineering PostgreSQL

Data Engineering Weekly #141

Data Engineering Weekly

AUGUST 6, 2023

🎉 Although we're still working on linking the payment processing with our dewcon.ai The first one that caught my eye is Astronomer published a blog post Introducing Cosmos 1.0: the best way to run dbt Core in Airflow.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

JUNE 28, 2022

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. Building real-time streaming analytics data pipelines requires the ability to process data in the stream.

Process

Process Kafka SQL Machine Learning

Data Engineering Weekly #151

Data Engineering Weekly

DECEMBER 3, 2023

GitHub writes an excellent blog to capture the current state of the LLM integration architecture. Netflix: Incremental Processing using Netflix Maestro and Apache Iceberg. Netflix writes about its incremental processing design with its orchestration engine Maestro on top of Iceberg.

Data Engineering

Data Engineering Data Engineer Engineering Bytes

SQL Streambuilder Data Transformations

Cloudera

FEBRUARY 21, 2023

SQL Stream Builder (SSB) is a versatile platform for data analytics using SQL as a part of Cloudera Streaming Analytics, built on top of Apache Flink. It enables users to easily write, run, and manage real-time continuous SQL queries on stream data with a smooth user experience. What is a data transformation?

SQL

SQL Kafka Raw Data Data

Building Real Time Applications On Streaming Data With Eventador

Data Engineering Podcast

APRIL 19, 2020

In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. How does it fit into an application architecture?

Building

Building PostgreSQL MongoDB SQL

Your Parents Still Don’t Know What a Hashtag Is. Let’s Teach Them the Basics of Machine Learning and Streaming Data

Cloudera

OCTOBER 13, 2021

Cloudera produced a series of ebooks — Production Machine Learning For Dummies, Apache NiFi For Dummies, and Apache Flink For Dummies (coming soon) — to help simplify even the most complex tech topics. Have you heard about streaming? Okay, what about streaming data? There’s no need to panic.

Machine Learning

Machine Learning Data Ingestion Algorithm Technology

The Advantages Of Live Data-Streaming In The Competitive Financial Services Sector (Part II)

Cloudera

AUGUST 26, 2020

Live data-streaming offers businesses exciting new opportunities to transform the way they operate, leveraging real-time insights to drive better decision making and enhance operational efficiency. Hello Dinesh, thank you for joining us for Part II of our Q&A on streaming data. How does that happen in near real-time?

Banking

Banking Data Ingestion Kafka Data Lake

Data Engineering Weekly #109

Data Engineering Weekly

NOVEMBER 27, 2022

I have a long list of thoughts on this conversation, which might need a blog post on its own. Maybe Slack is 1% of the company implementing data engineering effectively to drive the product feature, but that is the point of implementing data contract and shifting left for an efficient data creation process.

Data Engineering

Data Engineering Data Engineer Engineering SQL

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

MAY 16, 2023

I won’t bore you with the importance of data quality in the blog. The bias toward correctness will increase the processing time, which may not be feasible when speed is a priority. Let’s talk about the data processing types. Why is Data Quality Expensive? Ensuring correctness can slow down the pipeline.

Engineering

Engineering Kafka Data Pipeline Data Warehouse

Cloudera Streaming Analytics 1.4: the unification of SQL batch and streaming

Cloudera

JUNE 7, 2021

In October of 2020 Cloudera acquired Eventador and Cloudera Streaming Analytics (CSA) 1.3.0 It was the first release to incorporate SQL Stream Builder (SSB) from the acquisition, and brought rich SQL processing to the already robust Apache Flink offering. Why batch + streaming? A bit of Flink history.

SQL

SQL Manufacturing Finance Architecture

Implementing and Using UDFs in Cloudera SQL Stream Builder

Cloudera

FEBRUARY 22, 2023

Cloudera’s SQL Stream Builder (SSB) is a versatile platform for data analytics using SQL. As a part of Cloudera Streaming Analytics, it enables users to easily write, run, and manage real-time SQL queries on streams with a smooth user experience, while it attempts to expose the full power of Apache Flink.

SQL

SQL Raw Data Programming Language Kafka

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

Apache Spark is a popular and widely used tool for a variety of data oriented projects. How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm? What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?

Scala

Scala MySQL Kafka Hadoop

Using other CDP services with Cloudera Operational Database

Cloudera

FEBRUARY 16, 2021

In the previous blog post, we looked at some of the application development concepts for the Cloudera Operational Database (COD). In this blog post, we’ll see how you can use other CDP services with COD. COD is an operational database-as-a-service that brings ease of use and flexibility to Apache HBase. Cloudera DataFlow.

Database

Database Machine Learning Data Lake Kafka

What is Streaming Analytics?

Cloudera

APRIL 20, 2021

What is Streaming Analytics? Streaming Analytics is a type of data analysis that processes data streams for real-time analytics. It continuously processes data from multiple streams and performs simple calculations to complex event processing for delivering sophisticated use cases.
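As a rough illustration of the "simple calculations" end of that range, the sketch below merges several time-ordered streams and counts events per key in fixed (tumbling) windows. It is a minimal, in-memory sketch and not how any particular product implements this; the names `tumbling_counts`, `Event`, and `window_seconds` are hypothetical, and it assumes each input stream is already sorted by event time.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape for this sketch: (event time in seconds, key).
Event = Tuple[float, str]


def tumbling_counts(
    streams: Iterable[Iterable[Event]], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Merge time-ordered streams and yield (window start, per-key counts).

    Assumes every input stream is already sorted by event time.
    A window is emitted as soon as an event from a later window arrives.
    """
    current_start = None
    counts: Counter = Counter()
    # heapq.merge lazily interleaves the sorted inputs into one sorted stream.
    for ts, key in heapq.merge(*streams):
        window_start = (ts // window_seconds) * window_seconds
        if current_start is not None and window_start != current_start:
            yield current_start, dict(counts)
            counts = Counter()
        current_start = window_start
        counts[key] += 1
    # Flush the final, still-open window once the inputs are exhausted.
    if current_start is not None:
        yield current_start, dict(counts)
```

Calling `list(tumbling_counts([clicks, views], 10))` on two small lists of `(timestamp, key)` tuples returns one entry per 10-second window. A production engine would additionally handle late or out-of-order events, which this sketch ignores.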

Hospitality

Hospitality Kafka Retail Data Ingestion

Cloudera DataFlow’s key milestones and wins in 2020

Cloudera

FEBRUARY 17, 2021

Streaming data (or data-in-motion) is one such technology space that thrived during these times. Cloudera DataFlow (CDF), the industry’s leading real-time streaming data platform, was truly at the frontlines helping our customers find clarity with their data during these dark times.

Kafka

Kafka Food Manufacturing Healthcare

Data Engineering Weekly #124

Data Engineering Weekly

MARCH 26, 2023

The blog highlights that the job is not just writing SQL but providing a strategic business solution for an organization. Contribute to the Rudderstack Transformations Library, Win $1000 RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code.

Data Engineering

Data Engineering Data Engineer Engineering Lambda Architecture

Data Engineering Annotated Monthly – April 2022

Big Data Tools

MAY 19, 2022

Apache Hudi 0.11.0 – This release of the well-known data lake has added many interesting changes. Kyuubi 1.5.1 – Kyuubi is a JDBC server built over Apache Spark, but as of version 1.5.0, it supports two more SQL engines, Flink and Trino/Presto. RocketMQ Streams 1.0.1 Take the new dynamic tasks, for example.

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

Data Engineers of Netflix — Interview with Pallavi Phadnis

Netflix Tech

OCTOBER 28, 2021

Over the years, I followed the big data open-source community and Netflix tech blogs closely, and learned a lot about Netflix’s innovative engineering solutions and active contributions to the open-source ecosystem. CL provides an end-to-end solution for logging, processing, and analyzing user interactions on Netflix apps from all devices.

Data Engineering

Data Engineering Data Engineer Engineering Software Engineer

Data Engineering Annotated Monthly – September 2022

Big Data Tools

OCTOBER 10, 2022

This time I learned about Brooklin, a LinkedIn service for streaming data in a heterogeneous environment. The official GitHub for the project says that it is characterized by high reliability and throughput, claiming that Brooklin can run hundreds of streaming pipelines simultaneously. It’s been a very bustling two months in Berlin.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Happy Birthday, CDP Public Cloud

Cloudera

OCTOBER 13, 2020

Data Hub – has expanded to support all stages of the data lifecycle: Collect – Flow Management (Apache NiFi), Streams Management (Apache Kafka) and Streaming Analytics (Apache Flink). Enrich – Data Engineering (Apache Spark and Apache Hive). New Services.

Cloud

Cloud Data Warehouse AWS NoSQL

From Big Data to Better Data: Ensuring Data Quality with Verity

Lyft Engineering

OCTOBER 3, 2023

Streaming compute, however, empowers more complex window queries on semantic correctness. Finally, as the subject of this blog post, we can assess data quality via batch compute analytics on our data warehouse, providing a comprehensive albeit slower evaluation compared to the previously mentioned methods.

Big Data

Big Data Metadata Data Warehouse Data

Data Engineering Annotated Monthly – July 2021

Big Data Tools

AUGUST 3, 2021

Kotlin API for Apache Spark – A year after showing you the first preview, we released version 1.0. Apache Spark already has two official APIs for JVM – Scala and Java – but we’re hoping the Kotlin API will be useful as well, as we’ve introduced several unique features. Here’s what’s happening in data engineering right now. Cassandra 4.0

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

Addressing the Challenges of Sample Ratio Mismatch in A/B Testing

DoorDash Engineering

OCTOBER 17, 2023

For example, if two reasonably sized groups are expected to be split 50/50, but instead show a 55/45 split, the assignment process likely is compromised. The term itself conjures a sense of rigor, validity, and trust. Yet as powerful as experimentation is, its integrity can be compromised by overlooked details and unforeseen challenges.
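The 55/45 example can be checked with a little arithmetic. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom, a common choice for this kind of check but not necessarily the method the post describes. The function names and the `alpha=0.001` threshold are assumptions made for illustration.

```python
import math


def srm_p_value(count_a: int, count_b: int, expected_share_a: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test for a two-group split."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (
        (count_a - expected_a) ** 2 / expected_a
        + (count_b - expected_b) ** 2 / expected_b
    )
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(
    count_a: int,
    count_b: int,
    expected_share_a: float = 0.5,
    alpha: float = 0.001,  # assumed threshold, chosen for illustration
) -> bool:
    """Flag a sample ratio mismatch when the observed split is very unlikely."""
    return srm_p_value(count_a, count_b, expected_share_a) < alpha
```

With 5,500 versus 4,500 users the test statistic is 100 and `has_srm(5500, 4500)` flags a mismatch, while the same 55/45 ratio on 11 versus 9 users does not, which is why the group size matters.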

Education

Education Kafka Algorithm Data Warehouse

20 Latest AWS Glue Interview Questions and Answers for 2023

ProjectPro

JANUARY 24, 2023

With over 20 pre-built connectors and 40 pre-built transformers, AWS Glue is an extract, transform, and load (ETL) service that is fully managed and allows users to easily process and import their data for analytics. What is the process for adding metadata to the AWS Glue Data Catalog?

AWS

AWS Data Lake ETL Tools Scala

The Advantages Of Live Data-Streaming In The Competitive Financial Services Sector (Part I)

Cloudera

AUGUST 21, 2020

Live data-streaming offers businesses exciting new opportunities to transform the way they operate, leveraging real-time insights to drive better decision making and enhance operational efficiency. Data-in-motion is predominantly about streaming data so enterprises typically have two different ways or binary ways of looking at data.

Banking

Banking Kafka Cloud Storage Government

Data Engineering Annotated Monthly – June 2022

Big Data Tools

JULY 13, 2022

Apache Ambari: Resurrected – In February, Apache Ambari was moved to the Apache Attic. The process of returning to active maintenance is not even described in the docs. However, a miracle happened!

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Delta: A Data Synchronization and Enrichment Platform

Netflix Tech

OCTOBER 15, 2019

Another thread or process constantly polls events from the log table and writes them to one or more datastores, optionally removing events from the log table once all datastores have acknowledged them. providing advanced search capabilities (ElasticSearch etc.), caching (Memcached etc.), Deal Service, Talent Service and Vendor Service).
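The polling step described in the first sentence can be sketched as follows. This is a minimal illustration using an in-process SQLite table, not the implementation the post describes; the table name `event_log`, the `Sink` callback type, and the function names are all hypothetical.

```python
import sqlite3
from typing import Callable, Sequence, Tuple

Event = Tuple[int, str]         # (event id, payload); hypothetical shape
Sink = Callable[[Event], bool]  # returns True once the datastore acknowledges


def create_log_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) log table that producers append events to."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )


def poll_once(
    conn: sqlite3.Connection,
    sinks: Sequence[Sink],
    after_id: int = 0,
    batch_size: int = 100,
    delete_acked: bool = True,
) -> int:
    """Deliver one batch of events in id order; return the last fully acked id."""
    rows = conn.execute(
        "SELECT id, payload FROM event_log WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, batch_size),
    ).fetchall()
    last_acked = after_id
    for event_id, payload in rows:
        # Stop at the first event that any sink fails to acknowledge, so
        # ordering is preserved and the event is retried on the next poll.
        # Earlier sinks may see that event again, so sinks should be idempotent.
        if not all(sink((event_id, payload)) for sink in sinks):
            break
        if delete_acked:
            conn.execute("DELETE FROM event_log WHERE id = ?", (event_id,))
        last_acked = event_id
    conn.commit()
    return last_acked
```

A caller would run `poll_once` in a loop, passing the returned id back as `after_id` when `delete_acked` is off. Delivery in this sketch is at-least-once rather than exactly-once.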

Transportation

Transportation MySQL Kafka Data
