Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

Sub-second query systems allow for near real-time data exploration and low-latency, high-throughput queries, which are particularly well-suited for handling time-series data. In this blog post, we explain how Druid has been used at Lyft and what led us to adopt ClickHouse for our sub-second analytics system.
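
The excerpt itself is narrative, but the kind of query such a system serves is easy to sketch. Here is a minimal sub-second-style time-series aggregation against ClickHouse using the open-source clickhouse-driver Python client; the host, the ride_events table, and its columns are hypothetical stand-ins, not Lyft's actual schema or tooling.

```python
# Minimal sketch of a time-series aggregation of the kind a sub-second
# analytics system serves. Host, table, and columns are hypothetical.
from clickhouse_driver import Client

client = Client(host="clickhouse.example.internal")

# Bucket the last hour of events into 1-minute windows.
rows = client.execute(
    """
    SELECT
        toStartOfMinute(event_time) AS minute,
        count() AS events
    FROM ride_events
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute
    ORDER BY minute
    """
)

for minute, events in rows:
    print(minute, events)
```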

Addressing the Challenges of Sample Ratio Mismatch in A/B Testing

DoorDash Engineering

Experimentation isn’t just a cornerstone of innovation and sound decision-making; it’s often referred to as the gold standard for problem-solving, thanks in part to its roots in the scientific method. The term itself conjures a sense of rigor, validity, and trust. At DoorDash, we constantly innovate and experiment.
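
Although the excerpt is narrative, sample ratio mismatch (SRM) detection is commonly framed as a chi-square goodness-of-fit test on assignment counts. A minimal sketch using SciPy, assuming an intended 50/50 split; the counts are invented for illustration and this is not DoorDash's data or internal implementation.

```python
# Detect sample ratio mismatch (SRM) with a chi-square goodness-of-fit
# test. Counts and the 50/50 expected split are illustrative assumptions.
from scipy.stats import chisquare

control, treatment = 50_321, 49_112     # observed assignment counts
total = control + treatment
expected = [total * 0.5, total * 0.5]   # intended 50/50 allocation

stat, p_value = chisquare(f_obs=[control, treatment], f_exp=expected)

# A very small p-value means the observed split is unlikely under the
# intended allocation, i.e., a possible SRM worth investigating.
if p_value < 0.001:
    print(f"Possible SRM: p = {p_value:.2e}")
else:
    print(f"No SRM detected: p = {p_value:.3f}")
```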

How Tenable Executes DataOps with Monte Carlo and Snowflake

Monte Carlo

Tenable is driven by a data platform that uses data from all of its vulnerability management, cloud security, identity exposure, web app scanning, and external attack surface management point products to provide cybersecurity leaders with a comprehensive and contextual view of their attack surface.

[Figure: Creating a SQL custom monitor in the Monte Carlo UI.]
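
In spirit, a SQL custom monitor is a rule query whose result signals a breach. A standalone sketch of that idea against Snowflake, using the snowflake-connector-python package; the connection details, the vuln_scans table, and the rule itself are hypothetical, and this is not Monte Carlo's API.

```python
# Sketch of the idea behind a SQL custom monitor: run a rule query and
# alert when it reports violations. All names here are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="monitor_user",
    password="...",  # use a secrets manager in practice
    warehouse="ANALYTICS_WH",
)

RULE_SQL = """
    SELECT COUNT(*) AS violations
    FROM vuln_scans
    WHERE scan_completed_at IS NULL
      AND created_at < DATEADD('hour', -24, CURRENT_TIMESTAMP())
"""

cur = conn.cursor()
try:
    cur.execute(RULE_SQL)
    (violations,) = cur.fetchone()
finally:
    cur.close()
    conn.close()

if violations > 0:
    print(f"Monitor breached: {violations} scans stuck for over 24 hours")
```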

Towards a Reliable Device Management Platform

Netflix Tech

Users effectively run tests by connecting their devices to the RAE in a plug-and-play fashion. The challenge, then, is to ingest and process these events in a scalable manner, i.e., scaling with the number of devices, and that is the focus of this blog post.
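
The scaling approach hinted at, consumption that grows with the device fleet, can be sketched with the open-source kafka-python client; the topic, broker address, and payload fields below are hypothetical, and this is not Netflix's internal stack.

```python
# Sketch of scalable device-event ingestion: events are keyed by device,
# so adding consumers to the same group spreads partitions (and therefore
# devices) across workers. Topic, broker, and fields are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "device-events",                     # hypothetical topic
    bootstrap_servers="kafka.example.internal:9092",
    group_id="device-event-processors",  # scale out by adding group members
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    enable_auto_commit=True,
)

for record in consumer:
    event = record.value
    # Each partition is owned by one consumer in the group, so per-device
    # ordering holds as long as producers key events by device ID.
    print(event.get("device_id"), event.get("event_type"))
```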

Fraud Detection with Cloudera Stream Processing Part 1

Cloudera

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need to reduce the latency between data generation/ingestion and the production of analytical results and insights from that data. This is what we call the first-mile problem. This blog will be published in two parts.
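
The first-mile problem is about getting events from where they are generated onto a stream with minimal delay. A minimal producer-side sketch with the open-source kafka-python client; the broker, topic, and transaction fields are hypothetical, not Cloudera Stream Processing's own tooling.

```python
# Sketch of the "first mile": publishing a transaction at generation time
# so downstream analytics can act on it quickly. All names hypothetical.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.internal:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

txn = {
    "txn_id": "t-1001",
    "account": "a-42",
    "amount": 1999.00,
    "ts": time.time(),  # capture time; downstream can measure latency from it
}

# Keying by account keeps each account's transactions in one partition,
# preserving order for per-account fraud rules downstream.
producer.send("transactions", key=txn["account"].encode("utf-8"), value=txn)
producer.flush()
```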

Striim Cloud on AWS: Unify your data with a fully managed change data capture and data streaming service

Striim

Businesses of all scales and industries have access to increasingly large amounts of data, which need to be harnessed effectively. With Striim, all your team needs to do is complete a few clicks of configuration, and an automated pipeline will be created between your source and AWS targets.
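
Under the hood, such a pipeline rests on change data capture: per-row change events from the source are replayed against the target. A toy illustration of that mechanic in plain Python; the event shape and handler are hypothetical, since Striim's pipelines are configured through its UI rather than written as code.

```python
# Toy illustration of change data capture: replay per-row change events
# against a target. The event shape and handler are hypothetical.
from typing import Any

def apply_change(target: dict[str, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one CDC event to an in-memory stand-in for an AWS target."""
    table = target.setdefault(event["table"], {})
    if event["op"] in ("insert", "update"):
        table[event["key"]] = event["row"]
    elif event["op"] == "delete":
        table.pop(event["key"], None)

target_store: dict[str, dict[str, Any]] = {}
changes = [
    {"table": "orders", "op": "insert", "key": "o1", "row": {"total": 25.0}},
    {"table": "orders", "op": "update", "key": "o1", "row": {"total": 30.0}},
    {"table": "orders", "op": "delete", "key": "o1", "row": None},
]
for event in changes:
    apply_change(target_store, event)

print(target_store)  # {'orders': {}} once the delete has been replayed
```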

Deployment of Exabyte-Backed Big Data Components

LinkedIn Engineering

Historically, deploying code changes to Hadoop big data clusters has been complex. As workloads and clusters grow, operational overhead becomes even more challenging, spanning the rack maintenance, hardware failures, OS upgrades, and configuration convergence issues that often arise in large-scale infrastructure.