Blog - Data Engineering Digest

Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering

NOVEMBER 7, 2023

A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results.

Kafka

Kafka Java Software Engineer Software Engineering

Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

APRIL 25, 2024

What was once popular and in demand can quickly become outdated. In this blog post, we will discuss such technologies. What Are Big Data Technologies? There are a variety of big data processing technologies available, including Apache Hadoop, Apache Spark, and MongoDB.

Big Data

Big Data Technology NoSQL Hadoop

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

NOVEMBER 29, 2023

Introduction At Lyft, we have used systems like Apache ClickHouse and Apache Druid for near real-time and sub-second analytics. In this particular blog post, we explain how Druid has been used at Lyft and what led us to adopt ClickHouse for our sub-second analytic system. ioConfig: Kafka server info, topic names, etc.

Kafka

Kafka Data Ingestion Datasets Architecture

Analysis of Confluent Buying Immerok

Jesse Anderson

JANUARY 9, 2023

I’ve always been vocal about ksqlDB’s and Kafka Streams’ limitations. The Future of ksqlDB and Kafka Streams With this announcement, the future of primarily ksqlDB and, to a lesser extent, Kafka Streams comes into view. Since Kafka Streams is part of the Apache project, I don’t see it going away as quickly.

Kafka

Kafka Coding Technology SQL

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. The CSP engine is powered by Apache Flink, which is the best-in-class processing engine for stateful streaming pipelines. Iceberg is a high-performance open table format for huge analytic data sets.

Process

Process SQL Kafka Database

Streams Replication Manager Prefixless Replication

Cloudera

JANUARY 31, 2024

Streams Replication Manager (SRM) is an enterprise-grade replication solution that enables fault tolerant, scalable, and robust cross-cluster Kafka topic replication. SRM replicates data at high performance and keeps topic properties in sync across clusters. ACL and configuration changes are not synced across mirrored clusters.

Management

Management Kafka Big Data Cloud

A Closer Look at The Next Phase of Cloudera’s Hybrid Data Lakehouse

Cloudera

MARCH 5, 2024

Cloudera is now the only provider to offer an open data lakehouse with Apache Iceberg for cloud and on-premises. Apache Ozone As AI and other advanced analytics continue to grow in scale, performance and scalable data storage will need to expand right along with them. But even with its rise, AI is still a struggle for some enterprises.

Data Lake

Data Lake Data Storage Government Kafka

Data Engineering Weekly #141

Data Engineering Weekly

AUGUST 6, 2023

We've overcome some unexpected hiccups, and guess what? The first one that caught my eye is a blog post published by Astronomer, Introducing Cosmos 1.0: the blog narrates how to schedule dbt jobs in Airflow by parsing dbt's manifest.json file and automatically constructing Airflow tasks.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

JULY 18, 2022

In part 1 of this blog we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, can make it easy to acquire data from wherever it originates and move it efficiently to make it available to other applications in a streaming fashion.

Process

Process Kafka Scala SQL

Data News — Snowflake and Databricks summits

Christophe Blefari

JULY 3, 2023

Container Services & Nvidia partnership — Snowflake is slowly becoming a one-stop shop, with container services you will be able to run your own apps in a Kubernetes cluster managed by Snowflake. 2 summits ( credits I cropped the image) Hey, since I said I should try to send the newsletter at a specific schedule I did not.

SQL

SQL Data Kafka AWS

Scaling Kafka Brokers in Cloudera Data Hub

Cloudera

OCTOBER 4, 2022

This blog post will provide guidance to administrators currently using or interested in using Kafka nodes to maintain cluster changes as they scale up or down to balance performance and cloud costs in production deployments. Kafka brokers contained within host groups enable the administrators to more easily add and remove nodes.

Kafka

Kafka Data Cloud Big Data

Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

JUNE 28, 2022

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. This is what we call the first-mile problem. This blog will be published in two parts. The use case.

Process

Process Kafka SQL Machine Learning

#ClouderaLife Spotlight: Barnabas Maidics, Software Engineer

Cloudera

AUGUST 24, 2021

As a Software Engineer at Cloudera, Barnabas gets to experience rewarding work with emerging technologies like Apache Kafka. As he sees it, “Kafka is a famous and widely used project. The team is not just about Kafka, but other components that are built around Kafka too, so we can work on different projects, full of challenges.

Software Engineer

Software Engineer Software Engineering Engineering Kafka

Metadata Management And Integration At LinkedIn With DataHub

Data Engineering Podcast

AUGUST 24, 2020

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it?

Metadata

Metadata Management Kafka Data Engineering

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

Summary Apache Spark is a popular and widely used tool for a variety of data oriented projects. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. Can you start by explaining what Spark is? What are some of the main use cases for Spark? Who uses Spark?

Scala

Scala MySQL Kafka Hadoop

Top 20+ Big Data Certifications and Courses in 2023

Knowledge Hut

SEPTEMBER 6, 2023

What is Big Data Certification? It is a well-known fact that we inhabit a data-rich world. Businesses are generating, capturing, and storing vast amounts of data at an enormous scale. This influx of data is handled by robust big data systems which are capable of processing, storing, and querying data at scale.

Big Data

Big Data Certification Hadoop Scala

Generating and Viewing Lineage through Apache Ozone

Cloudera

AUGUST 10, 2021

With Apache Ozone on the Cloudera Data Platform (CDP), they can implement a scale-out model and build out their next generation storage architecture without sacrificing security, governance and lineage. In this article, we’ll focus on generating and viewing lineage that includes Ozone assets from Apache Atlas. With CDP 7.1.4

Hadoop

Hadoop Kafka Datasets Government

How to Connect KSQL to Confluent Cloud using Kubernetes with Helm

Confluent

JUNE 12, 2019

Confluent Cloud, a fully managed event cloud-native streaming service that extends the value of Apache Kafka®, is simple, resilient, secure, and performant, allowing you to focus on what is important—building contextual event-driven applications, not infrastructure. Next, click on the cluster name whose configuration you want.

Cloud

Cloud Kafka Healthcare Software Engineer

Building Real-time Machine Learning Foundations at Lyft

Lyft Engineering

JUNE 28, 2023

In this blog post, we will discuss what we built in support of that goal and some of the lessons we learned along the way. Capabilities of Real-time Machine Learning One of the first questions we asked ourselves is — what are the general use cases within the ML ecosystem that can leverage streaming data?

Machine Learning

Machine Learning Building Metadata Kafka

How to Become Databricks Certified Apache Spark Developer?

ProjectPro

FEBRUARY 21, 2023

With around 35k stars and over 26k forks on Github, Apache Spark is one of the most popular big data frameworks used by 22,760 companies worldwide. Apache Spark is the most efficient, scalable, and widely used in-memory data computation tool capable of performing batch-mode, real-time, and analytics operations.

Scala

Scala Programming Language Java Hadoop

Data Engineering Annotated Monthly – September 2022

Big Data Tools

OCTOBER 10, 2022

Here’s what’s happening in the world of data engineering right now. One of the use cases from the product page that stood out to me in particular was the effort to mirror multiple Kafka clusters in one Brooklin cluster! Apache Pegasus might be the alternative you are looking for, if not now, then in your next project.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

What is Streaming Analytics?

Cloudera

APRIL 20, 2021

What is Streaming Analytics? What are the business challenges with today’s data? IT teams tried solving the problem by adding more clusters but noticed the rising cost for infrastructure and struggled to hire the right talent to manage them. What are the advantages of Streaming Analytics?

Hospitality

Hospitality Kafka Retail Data Ingestion

Machine Learning with Python, Jupyter, KSQL and TensorFlow

Confluent

FEBRUARY 6, 2019

The blog posts How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and Using Apache Kafka to Drive Cutting-Edge Machine Learning describe the benefits of leveraging the Apache Kafka® ecosystem as a central, scalable and mission-critical nervous system.

Machine Learning

Machine Learning Python Kafka Java

Data Engineering Annotated Monthly – June 2022

Big Data Tools

JULY 13, 2022

Here’s what’s happening in the world of data engineering right now. Apache Ambari: Resurrected – In February, Apache Ambari was moved to the Apache Attic. There are also multiple improvements for streaming support (for Kafka and Kinesis), along with many other changes. However, a miracle happened!

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data Engineering Weekly #125

Data Engineering Weekly

APRIL 2, 2023

Meta: Presto - A Decade of SQL Analytics at Meta Presto and Kafka are the two systems that greatly impacted data infrastructure in the last decade. The cluster split approach to store real-time, protected, and archive tweets is an excellent reference model for designing enterprise search engines. I echoed a similar statement here.

Data Engineering

Data Engineering Data Engineer Engineering Pipeline-centric

Data Engineering Annotated Monthly – November 2021

Big Data Tools

DECEMBER 7, 2021

And what better time than the holidays to catch up on the latest news and read about other interesting topics? News A lot of what we do in engineering involves learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now. Apache Pinot 0.9.0

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

How to configure clients to connect to Apache Kafka Clusters securely – Part 2: LDAP

Cloudera

DECEMBER 10, 2020

In the previous post, we talked about Kerberos authentication and explained how to configure a Kafka client to authenticate using Kerberos credentials. In this post we will look into how to configure a Kafka client to authenticate using LDAP, instead of Kerberos. We use the kafka-console-consumer for all the examples below.

Kafka

Kafka Certification Management Accessible

KSQL Training for Hands-On Learning

Confluent

JULY 11, 2019

Reading, writing, and transforming data in Apache Kafka® using KSQL is an effective way to rapidly deliver event streaming applications for clients (e.g., For a KSQL newbie the practical exercises show you how to process data in Apache Kafka using an interactive SQL interface. streaming insurance events).

Kafka

Kafka Insurance SQL Architecture

Data Engineering Annotated Monthly – May 2022

Big Data Tools

JUNE 8, 2022

Here’s what’s happening in the world of data engineering right now. I’ve had some experience with Apache Atlas, and even with the help of my colleagues, I wasn’t able to make it do what I wanted it to. This new release brings exciting features like support for Apache Iceberg! There are several solutions.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data Engineering Annotated Monthly – April 2022

Big Data Tools

MAY 19, 2022

Here’s what’s happening in the world of data engineering right now. Apache Hudi 1.11.0 – This release of the well-known data lake has added many interesting changes. Kyuubi 1.5.1 – Kyuubi is a JDBC server built over Apache Spark, but as of version 1.5.0, Notably, cluster failover is now supported on the client-side.

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

20 Latest AWS Glue Interview Questions and Answers for 2023

ProjectPro

JANUARY 24, 2023

This blog will discuss some popular AWS Glue interview questions and answers to help you strengthen your AWS Glue knowledge and ace your big data engineer interview. What is the process for adding metadata to the AWS Glue Data Catalog? What client languages, data formats, and integrations does AWS Glue Schema Registry support?

AWS

AWS Data Lake ETL Tools Scala

How-to: Index Data from S3 Using CDP Data Hub

Cloudera

SEPTEMBER 9, 2020

This blog post will present a simple “hello world” kind of example on how to get data that is stored in S3 indexed and served by an Apache Solr service hosted in a Data Discovery and Exploration cluster in CDP. We will only cover AWS and S3 environments in this blog. You have a DDE cluster running.

AWS

AWS Data Unstructured Data Hadoop

Cloudera DataFlow’s key milestones and wins in 2020

Cloudera

FEBRUARY 17, 2021

Everyone was looking for real-time insights by analyzing what is going on currently within their businesses and taking corrective action pro-actively. Here is a recap of what we had delivered successfully with CDF in 2020, overcoming all obstacles the year threw at us. Overall, I think we had a great year from a product perspective.

Kafka

Kafka Food Manufacturing Healthcare

Data Engineering Annotated Monthly – September 2021

Big Data Tools

OCTOBER 5, 2021

Here’s what’s happening in data engineering right now. You have multiple sources of data and you have to define what is true and what is not. Kafka 3.0.0 – The Apache Software Foundation needed less than one month to go from Kafka version 3.0.0-rc0 Zingg 0.3.0 – MDM (Master Data Management) is tricky.

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

Access control for Azure ADLS cloud object storage

Cloudera

SEPTEMBER 15, 2020

introduces fine-grained authorization for access to Azure Data Lake Storage using Apache Ranger policies. Apache Ranger provides a centralized console to manage authorization and view audits of access to resources in a large number of services including Apache Hadoop’s HDFS, Apache Hive, Apache HBase, Apache Kafka, Apache Solr.

Accessible

Accessible Accessibility Cloud Cloud Storage

Data Engineering Annotated Monthly – July 2021

Big Data Tools

AUGUST 3, 2021

August is a good time to start new things – some people are on vacation and have more spare time to read than usual, while others are back and looking for a quick refresher on what’s new in data engineering. Here’s what’s happening in data engineering right now. And yes, at the time of writing “we” is just me, Pasha.

Data Engineering

Data Engineering Data Engineer Engineering Big Data Tools

Log Reduction Techniques with CFM

Cloudera

OCTOBER 28, 2020

Cloudera services logs offer a breadth of information to assist in cluster maintenance; from assisting in security checks, auditing tasks, and validation for performance tuning and testing tasks – to name a few. x cluster that has both Kerberos and TLS enabled. What does this Workflow Look Like? Benefits: Easy to implement.

Kafka

Kafka SQL Professional Services Consulting
