Blog - Data Engineering Digest

Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering

NOVEMBER 7, 2023

A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. […] years since our previous blog post, PSC has been battle-tested at large scale at Pinterest, with notably positive feedback and results.

Kafka

Kafka Java Software Engineer Software Engineering

Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

LinkedIn Engineering

OCTOBER 19, 2023

Authors: Bingfeng Xia and Xinyu Liu. Background: At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers. The release of Apache Beam in 2016 proved to be a game-changer for LinkedIn.

Process

Process Lambda Architecture Kafka Machine Learning

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

NOVEMBER 29, 2023

Introduction: At Lyft, we have used systems like ClickHouse and Apache Druid for near real-time and sub-second analytics. In this particular blog post, we explain how Druid has been used at Lyft and what led us to adopt ClickHouse for our sub-second analytic system. ioConfig: Kafka server info, topic names, etc. (ex.

Kafka

Kafka Data Ingestion Datasets Architecture

Data Engineering Weekly #168

Data Engineering Weekly

APRIL 21, 2024

The blog narrates how Chronon fits into Stripe’s online and offline requirements. Grab: Enabling near real-time data analytics on the data lake. Apache Hudi’s Merge On Read (MoR) is a game changer in developing low-latency analytics on top of the data lake.

Data Engineering

Data Engineering Data Engineer Engineering Medical

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

MAY 16, 2023

I won’t bore you with the importance of data quality in the blog. Data Testing vs. Data Observability: Data testing and data observability are two important aspects of data quality. Data testing ensures that data meets specific requirements. The Fronting Kafka pattern follows a two-cluster approach.

Engineering

Engineering Kafka Data Pipeline Data Warehouse

Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

JUNE 28, 2022

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. This blog will be published in two parts. This is what we call the first-mile problem.

Process

Process Kafka SQL Machine Learning

Advanced Testing Techniques for Spring Kafka

Confluent

NOVEMBER 13, 2020

Apache Kafka®. All of these share one thing in common: complexity in testing. This is the final blog […]. Asynchronous boundaries. Frameworks. Configuring frameworks. Now imagine them combined—it gets much harder.

Kafka

Kafka IT

Streams Replication Manager Prefixless Replication

Cloudera

JANUARY 31, 2024

Streams Replication Manager (SRM) is an enterprise-grade replication solution that enables fault-tolerant, scalable, and robust cross-cluster Kafka topic replication. Introduction: Kafka, as an event streaming component, can be applied to a wide variety of use cases. Replication can be dynamically enabled for topics and consumer groups.

Management

Management Kafka Big Data Cloud

Simplify Metrics on Apache Druid With Rill Data and Cloudera

Cloudera

JULY 21, 2022

Cloudera has partnered with Rill Data, an expert in metrics at any scale, as Cloudera’s preferred ISV partner to provide technical expertise and support services for Apache Druid customers. We want Cloudera customers that rely on Apache Druid to know that their clusters are secure and supported by the Cloudera partner ecosystem.

BI

BI Digital Media Data Warehouse Kafka

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

They have dev, test, and production clusters running critical workloads and want to upgrade their clusters to CDP Private Cloud Base. Support Kafka connectivity to HDFS, AWS S3 and Kafka Streams. Cluster management and replication support for Kafka clusters. The customer is a heavy user of Kafka for data ingestion.

Cloud

Cloud Kafka Professional Services Metadata

Generating and Viewing Lineage through Apache Ozone

Cloudera

AUGUST 10, 2021

With Apache Ozone on the Cloudera Data Platform (CDP), they can implement a scale-out model and build out their next generation storage architecture without sacrificing security, governance and lineage. In this article, we’ll focus on generating and viewing lineage that includes Ozone assets from Apache Atlas. With CDP 7.1.4

Hadoop

Hadoop Kafka Datasets Government

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

Cloudera

JULY 15, 2021

This blog post provides an overview of best practice for the design and deployment of clusters incorporating hardware and operating system configuration, along with guidance for networking and security as well as integration with existing enterprise infrastructure. Introduction and Rationale. Recommended deployment patterns.

Architecture

Architecture Cloud Kafka Hadoop

Machine Learning with Python, Jupyter, KSQL and TensorFlow

Confluent

FEBRUARY 6, 2019

The blog posts "How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka" and "Using Apache Kafka to Drive Cutting-Edge Machine Learning" describe the benefits of leveraging the Apache Kafka® ecosystem as a central, scalable and mission-critical nervous system.

Machine Learning

Machine Learning Python Kafka Java

Top 20+ Big Data Certifications and Courses in 2023

Knowledge Hut

SEPTEMBER 6, 2023

Finally, there is a test to certify successful completion of the course. Big Data Frameworks: Familiarity with popular Big Data frameworks used for data processing, such as Hadoop, Apache Spark, Apache Flink, or Kafka. I mentioned a few of the best big data training online courses in this blog.

Big Data

Big Data Certification Hadoop Scala

Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design

Cloudera

DECEMBER 9, 2022

Cloudera has been providing enterprise support for Apache NiFi since 2015, helping hundreds of organizations take control of their data movement pipelines on premises and in the public cloud. What if there was a way to not require developers to manage their own Apache NiFi installation without putting that burden on platform administrators?

Designing

Designing Coding Google Cloud AWS

How to Become Databricks Certified Apache Spark Developer?

ProjectPro

FEBRUARY 21, 2023

With around 35k stars and over 26k forks on GitHub, Apache Spark is one of the most popular big data frameworks, used by 22,760 companies worldwide. Apache Spark is the most efficient, scalable, and widely used in-memory data computation tool capable of performing batch-mode, real-time, and analytics operations.

Scala

Scala Programming Language Java Hadoop

Building A Real Time Event Data Warehouse For Sentry

Data Engineering Podcast

NOVEMBER 26, 2019

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode.

Data Warehouse

Data Warehouse Building PostgreSQL Kafka

Monitoring Data Replication in Multi-Datacenter Apache Kafka Deployments

Confluent

APRIL 10, 2019

Previously in 3 Ways to Prepare for Disaster Recovery in Multi-Datacenter Apache Kafka Deployments, we provided resources for multi-datacenter designs, centralized schema management, prevention of cyclic repetition of messages, and automatic consumer offset translation to automatically resume applications.

Kafka

Kafka Metadata Java Cloud

How to Connect KSQL to Confluent Cloud using Kubernetes with Helm

Confluent

JUNE 12, 2019

Confluent Cloud, a fully managed cloud-native event streaming service that extends the value of Apache Kafka®, is simple, resilient, secure, and performant, allowing you to focus on what is important: building contextual event-driven applications, not infrastructure. KSQL and Kafka Connect example. and Helm/Tiller 2.8.2+

Cloud

Cloud Kafka Healthcare Software Engineer

Metadata Management And Integration At LinkedIn With DataHub

Data Engineering Podcast

AUGUST 24, 2020

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help.

Metadata

Metadata Management Kafka Data Engineering

Scaling Kafka Brokers in Cloudera Data Hub

Cloudera

OCTOBER 4, 2022

This blog post will provide guidance to administrators currently using or interested in using Kafka nodes to maintain cluster changes as they scale up or down to balance performance and cloud costs in production deployments. Kafka brokers contained within host groups enable the administrators to more easily add and remove nodes.

Kafka

Kafka Data Cloud Big Data

Data Engineering Weekly #119

Data Engineering Weekly

FEBRUARY 19, 2023

Sign up free to test out the tool today. The blog discusses fairness in ML and demonstrates why high accuracy doesn’t mean the algorithm is fair. Foodpanda: Menu Ranking. Foodpanda, in a similar line of application as DoorDash, talks about optimizing menu ranking by applying A/B testing.

Data Engineering

Data Engineering Data Engineer Engineering Google Cloud

Building Real Time Applications On Streaming Data With Eventador

Data Engineering Podcast

APRIL 19, 2020

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. Closing Announcements: Thank you for listening!

Building

Building PostgreSQL MongoDB SQL

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

Summary: Apache Spark is a popular and widely used tool for a variety of data-oriented projects. How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm? With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it.

Scala

Scala MySQL Kafka Hadoop

New Snowflake Features Released in May–July 2023

Snowflake

AUGUST 16, 2023

Read our Summit recap blog for highlights across industries or watch Summit sessions now on-demand. Developers can now start building and testing Snowflake Native Apps in their accounts in AWS. The new Kafka connector, built with Snowpipe Streaming, now supports schema detection and evolution. If you missed out, not to worry!

Scala

Scala Transportation Kafka Data Lake

Addressing the Challenges of Sample Ratio Mismatch in A/B Testing

DoorDash Engineering

OCTOBER 17, 2023

SRM represents one of the most egregious data quality issues in A/B tests because it fundamentally compromises the basic assumption of random assignment. Statistical approaches for identifying imbalance The most common approach for identifying SRM is to use a chi-square test that can quickly detect when something is wrong.

Education

Education Kafka Algorithm Data Warehouse

Implementing and Using UDFs in Cloudera SQL Stream Builder

Cloudera

FEBRUARY 22, 2023

As a part of Cloudera Streaming Analytics, it enables users to easily write, run, and manage real-time SQL queries on streams with a smooth user experience, while it attempts to expose the full power of Apache Flink.

SQL

SQL Raw Data Programming Language Kafka

Happy Birthday, CDP Public Cloud

Cloudera

OCTOBER 13, 2020

Data Hub – has expanded to support all stages of the data lifecycle: Collect – Flow Management (Apache NiFi), Streams Management (Apache Kafka) and Streaming Analytics (Apache Flink). Enrich – Data Engineering (Apache Spark and Apache Hive). Predict – Data Engineering (Apache Spark).

Cloud

Cloud Data Warehouse AWS Machine Learning

Data Engineering Weekly #123

Data Engineering Weekly

MARCH 19, 2023

Uber: Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi. Uber writes a comprehensive guide on running incremental ETL using Apache Hudi. The blog discusses implementing Type-2 SCD modeling and strategies to generate surrogate keys and bridge tables to handle many-to-many relationships.

Data Engineering

Data Engineering Data Engineer Engineering Media

20 Latest AWS Glue Interview Questions and Answers for 2023

ProjectPro

JANUARY 24, 2023

If you are preparing for your ETL developer or data engineer interview, you must possess a solid fundamental knowledge of AWS Glue, as you’re likely to get asked questions that test your ability to handle complex big data ETL tasks. The Schema Registry supports Java client apps and the Apache Avro and JSON Schema data formats.

AWS

AWS Data Lake ETL Tools Scala

Data Engineering Annotated Monthly – September 2022

Big Data Tools

OCTOBER 10, 2022

One of the use cases from the product page that stood out to me in particular was the effort to mirror multiple Kafka clusters in one Brooklin cluster! Apache Pegasus might be the alternative you are looking for, if not now, then in your next project. Druid 24.0.0 – Apache Druid has made the leap from 0.23.0 to 24.0.0.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data Engineering Weekly #109

Data Engineering Weekly

NOVEMBER 27, 2022

Sign up free to test out the tool today. I have a long list of thoughts on this conversation, which might need a blog post on its own. Myntra: Quicksilver - Near Real Time Platform at Myntra. Myntra writes about its near-real-time streaming platform built on top of Kafka, Flink & Spark.

Data Engineering

Data Engineering Data Engineer Engineering SQL

Building Real-time Machine Learning Foundations at Lyft

Lyft Engineering

JUNE 28, 2023

In this blog post, we will discuss what we built in support of that goal and some of the lessons we learned along the way. That Python object is portable across all environments that we support—local test environment, notebook, staging, and production. register_feature(feature_definition).add_sink(feature_sink)

Machine Learning

Machine Learning Building Metadata Kafka

How to configure clients to connect to Apache Kafka Clusters securely – Part 2: LDAP

Cloudera

DECEMBER 10, 2020

In the previous post, we talked about Kerberos authentication and explained how to configure a Kafka client to authenticate using Kerberos credentials. In this post, we will look into how to configure a Kafka client to authenticate using LDAP instead of Kerberos. We use the kafka-console-consumer for all the examples below.

Kafka

Kafka Certification Management Accessible

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

MARCH 9, 2021

In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Apache Atlas as a fundamental part of SDX. The example 1_typedef-server.json describes the server typedef used in this blog.

Data Governance

Data Governance Government Metadata Datasets

Migrate to CDP Private Cloud Base – A Step by Step Guide

Cloudera

SEPTEMBER 30, 2021

Our recent blog discussed the four paths to get from legacy platforms to CDP Private Cloud Base. In this blog and accompanying video, we will deep dive into the mechanics of running an in-place upgrade from CDH5 or CDH6 to CDP Private Cloud Base. Exporting Sentry policies ready for Apache Ranger. Replication Manager checks.

Cloud

Cloud PostgreSQL Metadata MySQL

Data Architect: Role Description, Skills, Certifications and When to Hire

AltexSoft

FEBRUARY 11, 2023

It serves as a foundation for the entire data management strategy and consists of multiple components including data pipelines; on-premises and cloud storage facilities (data lakes, data warehouses, data hubs); data streaming and Big Data analytics solutions (Hadoop, Spark, Kafka, etc.);

Data Architect

Data Architect Certification Generalist Big Data

Fast Analytics On Semi-Structured And Structured Data In The Cloud

Data Engineering Podcast

OCTOBER 7, 2019

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode.

Structured Data

Structured Data Cloud SQL Programming Language

Deployment of Exabyte-Backed Big Data Components

LinkedIn Engineering

DECEMBER 19, 2023

Co-authors: Arjun Mohnot, Jenchang Ho, Anthony Quigley, Xing Lin, Anil Alluri, Michael Kuchenbecker. LinkedIn operates one of the world’s largest Apache Hadoop big data clusters. This process continuously sends metadata information to Kafka, including health reports and version data, among other details.

Big Data

Big Data Hadoop Metadata Data

Software Developer Salary in Singapore [2024 Market Overview]

Knowledge Hut

DECEMBER 27, 2023

They work their magic by testing, writing code, helping build new software, and managing a team of coders. Many software developers debug the code as they write it to ensure its efficiency in the testing stage. They run and test any programs before they are made public. What Does a Software Developer Do?

Medical

Medical Programming Language Amazon Web Services Entertainment

How AI may impact software architecture by Andrew Carr

Scott Logic

JUNE 6, 2023

Predictions around this are very hard to make, especially taking into account how fast this field is changing, so it will be interesting to revisit this blog in a couple of years to see how things are. In the rest of this blog post, I wish to consider the viability of this and its potential impact on how software architecture is designed.

Architecture

Architecture Coding Designing Systems

RocksDB Is Eating the Database World

Rockset

JANUARY 23, 2020

Going into the details of LSM trees, and RocksDB’s implementation of the same, is out of the scope of this blog, but suffice it to say that it’s an indexing structure optimized to handle high-volume—sequential or random—write workloads. Apache Cassandra is one of the most popular NoSQL databases. Who Uses RocksDB? trillion euros.

Database

Database MySQL Kafka NoSQL

Log Reduction Techniques with CFM

Cloudera

OCTOBER 28, 2020

Cloudera services logs offer a breadth of information to assist in cluster maintenance: assisting in security checks, auditing tasks, and validation for performance tuning and testing tasks, to name a few. Any records from Kafka, only IF the record represents users publishing to certain Kafka topics. Assumptions.

Kafka

Kafka SQL Professional Services Consulting
