
Data News — Week 24.08

Christophe Blefari

JVM vs. SQL data engineer — There's a big discussion in the community about what real data engineering is. Is it DataFrames or SQL? As you know, I still prefer SQL/Python data engineering. I only read the paper's introduction and the first schema, but it looks awesome. PyIceberg 0.6.0:


Data News — Week 23.16

Christophe Blefari

As an introduction, Tristan gives the original vision of dbt, which has since become mainstream. A lot of data teams have embraced dbt, or at least SQL with engineering practices, to transform data in cloud data warehouses. I guess this is a preamble to cross-project dependencies. Building a ChatGPT Plugin for Medium.




Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

Data Engineering Podcast

RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. Materialize is the only true SQL streaming database built from the ground up to meet the needs of modern data products. With Materialize, you can!


Our First Netflix Data Engineering Summit

Netflix Tech

Streaming SQL on Data Mesh using Apache Flink: Mark Cho, Guil Pires, and Sujay Jain, engineers from the Netflix Data Platform, talk about how managed Streaming SQL using Apache Flink can help unlock new stream processing use cases at Netflix.


Your Guide to Flink SQL: An In-Depth Exploration

Confluent

Get an in-depth introduction to Flink SQL. Learn how it relates to Flink's other APIs, explore its built-in functions and operations, find out which queries to try first, and see syntax examples.
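To give a flavor of the syntax before diving into the guide, here is a minimal sketch of a Flink SQL source table and a windowed query. All table, topic, and column names here are hypothetical, and the connector options are trimmed to the essentials:

```sql
-- Hypothetical source table backed by a Kafka topic (names are illustrative)
CREATE TABLE orders (
    order_id   STRING,
    amount     DECIMAL(10, 2),
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic'     = 'orders',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format'    = 'json'
);

-- Tumbling-window aggregation: revenue per one-minute window
SELECT window_start, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;
```

The `TUMBLE` table-valued function shown here is the windowing style Flink has favored since 1.13; the guide covers the alternatives as well.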


Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

Introduction: At Lyft, we have used systems like ClickHouse and Apache Druid for near-real-time and sub-second analytics. Real-time ingestion: events from our real-time analytics pipeline were configured to be sent into our internal Flink application, streamed to Kafka, and written into Druid. This was our main form of ingestion.


Enriching Streams with Hive tables via Flink SQL

Cloudera

Introduction: Flink SQL reads from a source, applies whatever functions you need to the data, and directs the results into a sink. There are therefore two common use cases for Hive tables with Flink SQL: a lookup table for enriching the data stream, and a sink for writing Flink results, using Flink DDL with the JDBC connector.
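The first use case, enriching a stream from a Hive lookup table, can be sketched with Flink SQL's temporal join syntax. This assumes a Hive catalog is already registered and uses hypothetical table and column names:

```sql
-- Enrich a stream of orders with customer data from a Hive table.
-- 'orders' is a streaming table with a processing-time attribute 'proc_time';
-- 'customers' lives in the registered Hive catalog. Names are illustrative.
SELECT
    o.order_id,
    o.amount,
    c.customer_name
FROM orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.customer_id;
```

The `FOR SYSTEM_TIME AS OF` clause is what makes this a lookup join: each incoming order row is matched against the Hive table's contents at processing time rather than joining the full history.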
