
SQL Stream Builder Data Transformations

Cloudera

SQL Stream Builder (SSB) is a versatile platform for data analytics using SQL, part of Cloudera Streaming Analytics and built on top of Apache Flink. It enables users to easily write, run, and manage real-time continuous SQL queries on streaming data, with a smooth user experience.
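
For illustration only: since SSB runs on Flink, a continuous SQL query of the kind it manages can be sketched with Flink's Python Table API. The table name, schema, topic, and broker address below are hypothetical, and the Kafka SQL connector is assumed to be on the classpath; SSB itself authors and manages such queries through its own console.

```python
# Minimal sketch of a continuous SQL query, expressed via PyFlink.
# Topic, schema, and broker address are placeholders, not from the article.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Kafka-backed table as the unbounded streaming source.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# A continuous aggregation: per-minute order totals, updated as events arrive.
t_env.execute_sql("""
    SELECT window_start, window_end, SUM(amount) AS total
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end
""").print()
```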


Stream Rows and Kafka Topics Directly into Snowflake with Snowpipe Streaming

Snowflake

Managing separate infrastructure for batch data and streaming data proves to be a difficult task for data engineering teams. To address this challenge, we are happy to announce the public preview of Snowpipe Streaming as the latest addition to our Snowflake ingestion offerings. How does Snowpipe Streaming work?
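
As a rough sketch of the Kafka-to-Snowflake path the title describes: the Snowflake sink connector can be registered with Kafka Connect with Snowpipe Streaming selected as the ingestion method. The property names follow the Snowflake Kafka connector documentation, but the endpoint, credentials, topic, and database objects below are placeholders and should be verified against the docs.

```python
# Sketch: registering the Snowflake sink connector (Snowpipe Streaming mode)
# through the Kafka Connect REST API. All names and credentials are placeholders.
import json
import urllib.request

connector = {
    "name": "snowpipe-streaming-sink",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": "events",
        "snowflake.ingestion.method": "SNOWPIPE_STREAMING",  # rows land via the streaming API, not staged files
        "snowflake.url.name": "myaccount.snowflakecomputing.com",
        "snowflake.user.name": "INGEST_USER",
        "snowflake.private.key": "<private-key>",
        "snowflake.role.name": "INGEST_ROLE",
        "snowflake.database.name": "RAW",
        "snowflake.schema.name": "EVENTS",
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",           # local Kafka Connect REST endpoint (assumed)
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```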


Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

LinkedIn Engineering

Authors: Bingfeng Xia and Xinyu Liu. At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
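
For context, a Beam streaming pipeline of the general shape described (consume events from Kafka, window them, aggregate per key) might look like the sketch below. The topic, broker, and windowing are hypothetical and LinkedIn's production pipelines are of course far larger; ReadFromKafka is Beam's cross-language Kafka source and needs a Java environment available.

```python
# Minimal Beam streaming sketch: read from Kafka, window, count per key.
# Topic, broker, and window size are placeholders, not from the article.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(streaming=True)

with beam.Pipeline(options=opts) as p:
    (
        p
        | "Read" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "localhost:9092"},
            topics=["page-views"],
        )
        | "KeyOnly" >> beam.Map(lambda kv: kv[0])                    # keep only the record key
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # 1-minute fixed windows
        | "Count" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```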


Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

In part 1 of this blog, we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, makes it easy to acquire data from wherever it originates and move it efficiently so that it is available to other applications in a streaming fashion. Data decays!


An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

In the first part of this series, we talked about design patterns for data creation and the pros and cons of each system from the data contract perspective. In the second part, we will focus on architectural patterns for implementing data quality from a data contract perspective.
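
To make the data-contract angle concrete, one common pattern is to enforce the contract at the pipeline boundary and route violating rows aside rather than letting them flow downstream. The contract fields, the quarantine behavior, and the sample records below are illustrative assumptions, not taken from the article.

```python
# Illustrative sketch: validate incoming records against a simple data contract
# and quarantine violations instead of failing the whole pipeline.
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderContract:
    """Fields and types the producer promises to downstream consumers (hypothetical)."""
    order_id: str
    amount: float
    currency: str

def validate(record: dict) -> OrderContract:
    """Raise ValueError if a record violates the contract; otherwise return a typed row."""
    try:
        row = OrderContract(
            order_id=str(record["order_id"]),
            amount=float(record["amount"]),
            currency=str(record["currency"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"contract violation: {exc}") from exc
    if row.amount < 0:
        raise ValueError("contract violation: amount must be non-negative")
    return row

good, quarantined = [], []
for rec in [{"order_id": "1", "amount": 9.5, "currency": "EUR"}, {"order_id": "2"}]:
    try:
        good.append(validate(rec))
    except ValueError as err:
        quarantined.append((rec, str(err)))  # keep bad rows (and the reason) for inspection
```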


Best Practices for Data Ingestion with Snowflake: Part 3 

Snowflake

Welcome to the third blog post in our series highlighting Snowflake’s data ingestion capabilities, covering the latest on Snowpipe Streaming (currently in public preview) and how streaming ingestion can accelerate data engineering on Snowflake. What is Snowpipe Streaming?


1. Streamlining Membership Data Engineering at Netflix with Psyberg

Netflix Tech

In this three-part blog post series, we introduce you to Psyberg, our incremental data processing framework designed to tackle such challenges! Some techniques we used were: 1. Using fixed lookback windows to always reprocess data, assuming that most late-arriving events will occur within that window.
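
A minimal sketch of that fixed-lookback-window idea, assuming daily date partitions and a window size chosen so that most late-arriving events fall inside it; the window length, partition naming, and helper function are hypothetical, not Psyberg's actual implementation.

```python
# Sketch: always reprocess the last N days of partitions so late-arriving
# events within the lookback window are picked up on every run.
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # assumption: most late events arrive within this many days

def partitions_to_reprocess(run_date: date, lookback_days: int = LOOKBACK_DAYS) -> list[str]:
    """Return the date partitions an incremental run should (re)process."""
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback_days, -1, -1)
    ]

# Example: a run on 2023-11-10 reprocesses 2023-11-07 through 2023-11-10.
print(partitions_to_reprocess(date(2023, 11, 10)))
```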