
Transactional Machine Learning at Scale with MAADS-VIPER and Apache Kafka

Confluent

This blog post shows how transactional machine learning (TML) integrates data streams with automated machine learning (AutoML), using Apache Kafka® as the data backbone, to create a frictionless machine learning […].


Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

In part 1 of this blog we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, can make it easy to acquire data from wherever it originates and move it efficiently to make it available to other applications in a streaming fashion.



Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. This is what we call the first-mile problem. This blog will be published in two parts.


Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

In this blog post, we will discuss such technologies. There are a variety of big data processing technologies available, including Apache Hadoop, Apache Spark, and MongoDB. In general, Hadoop and Spark are good choices for batch processing, while Kafka and Storm are better suited for streaming applications.


A Closer Look at The Next Phase of Cloudera’s Hybrid Data Lakehouse

Cloudera

Cloudera is now the only provider to offer an open data lakehouse with Apache Iceberg for cloud and on-premises. Apache Ozone: as AI and other advanced analytics continue to grow in scale, performance and scalable data storage will need to expand right along with them. ZDU (zero downtime upgrades) gives organizations a more convenient means of upgrading.


Deploying Data Pipelines using the Saga pattern

Picnic Engineering

In our previous blog, Dima Kalashnikov explained how we configure our Internal services pipeline in the Analytics Platform. The steps of this story are actually local transactions distributed across the various components, all coordinated to achieve our larger goal. A true heroic story indeed! How does it work?
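The coordinated local transactions the excerpt describes are the core of the Saga pattern. A minimal sketch in Python, under the assumption that each step pairs an action with a compensating action that undoes it on failure (all function names here are hypothetical illustrations, not Picnic's actual code):

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure,
    undo the already-completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()  # a local transaction in one component
            completed.append(compensate)
        except Exception:
            # A step failed: apply compensating transactions
            # for everything that already committed.
            for undo in reversed(completed):
                undo()
            return False
    return True
```

In a real pipeline each action would be a call to a separate service, and compensations would be durable (e.g. emitted as events) so a crashed coordinator can resume the rollback.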


What is Apache Kafka Used For?

ProjectPro

Did you know thousands of businesses, including over 80% of the Fortune 100, use Apache Kafka to modernize their data strategies? Apache Kafka is the most widely used open-source stream-processing solution for gathering, processing, storing, and analyzing large amounts of data. What is Apache Kafka Used For?
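The "gathering, processing, storing" the excerpt mentions all rest on Kafka's core abstraction: an append-only log per topic partition, read by offset. A toy in-memory model of that abstraction (this is an illustration only, not the real Kafka client API, which requires a running broker):

```python
from collections import defaultdict

class ToyLog:
    """Toy model of Kafka's partitioned, append-only log.
    Producers append records and receive offsets; consumers
    read forward from any offset."""

    def __init__(self):
        self._logs = defaultdict(list)  # (topic, partition) -> records

    def produce(self, topic, value, partition=0):
        log = self._logs[(topic, partition)]
        log.append(value)
        return len(log) - 1  # offset of the appended record

    def consume(self, topic, offset, partition=0):
        # Records from the given offset onward; consumers track
        # their own position, so the same data can be re-read.
        return self._logs[(topic, partition)][offset:]
```

Because reads do not delete data, many independent consumers can process the same stream at their own pace, which is what makes Kafka useful as a shared data backbone.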
