Data Engineering Digest

Building ETL Pipelines With Generative AI

Data Engineering Podcast

OCTOBER 1, 2023

Summary: Artificial intelligence applications require substantial high-quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. How can you get the best results for your use case?

Building BI SQL Machine Learning

Our First Netflix Data Engineering Summit

Netflix Tech

DECEMBER 14, 2023

Learn more about how batch and streaming data pipelines are built at Netflix. Streaming SQL on Data Mesh using Apache Flink: Mark Cho, Guil Pires and Sujay Jain, engineers from the Netflix Data Platform, talk about how managed Streaming SQL using Apache Flink can help unlock new stream processing use cases at Netflix.

Data Engineering Data Engineer Engineering Metadata

Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

Data Engineering Podcast

JULY 9, 2023

The major strategies in use today were created decades ago, when the software and hardware for warehouse databases were far more constrained. How does it compare to dimensional modeling strategies? What is the impact of the underlying compute engine on the modeling strategies used? [...] (e.g. dbt vs. Informatica vs. ETL scripts, etc.)

Database-centric Machine Learning SQL Data Engineering

2. Diving Deeper into Psyberg: Stateless vs Stateful Data Processing

Netflix Tech

NOVEMBER 14, 2023

Understanding the nature of the late-arriving data and processing requirements will help decide which pattern is most appropriate for a use case. In this case, the order of signups wouldn’t matter, and individual signup records are independent of each other.
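The stateless case described above can be illustrated with a minimal sketch. The record layout (`user`, `signup_date`) and the function name are assumptions made for this example and do not come from the Netflix article; the point is only that independent records can be processed in any order, or in separate batches, with the same result.

```python
from collections import Counter


def count_signups_by_day(records):
    """Count signups per day; each record is handled independently."""
    return Counter(record["signup_date"] for record in records)


# Hypothetical on-time and late-arriving batches.
on_time = [
    {"user": "a", "signup_date": "2023-11-13"},
    {"user": "b", "signup_date": "2023-11-14"},
]
late = [{"user": "c", "signup_date": "2023-11-13"}]

# Processing the late batch separately and merging the counts...
incremental = count_signups_by_day(on_time) + count_signups_by_day(late)
# ...matches processing everything at once, in a different order.
full = count_signups_by_day(late + on_time)
```

Because no record depends on another, the late batch can be folded in without revisiting the earlier data. A stateful computation would not have this property.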

Data Process Process Metadata Finance

Apache Spark Use Cases & Applications

Knowledge Hut

MAY 2, 2024

Spark also has out-of-the-box support for machine learning and graph processing through components called MLlib and GraphX, respectively. It also supports streaming data through Spark Streaming. Most production-grade and large clusters use YARN or Mesos as the resource manager.

Scala Hospitality Healthcare Retail

Startup Spotlight: Patch Helps Devs Unblock Pipelines With Data Packages

Snowflake

DECEMBER 21, 2023

We needed to combine them with data from our operational stores and event streams to deliver interactive billing reports, user notifications, AI-based services and programmatic data access. These interfaces are designed to make using Snowflake data in production as easy as importing a code library. Simply import and write code.

Software Engineer Software Engineering Database Data Pipeline

One Big Cluster Stuck: The Right Tool for the Right Job

Cloudera

JUNE 26, 2023

Over time, using the wrong tool for the job can wreak havoc on environmental health. Take care when using CDSW as an all-purpose workflow management and scheduling tool. Using CDSW primarily for scheduling and automating any type of workflow is a misuse of the service. Monitoring: should I use WXM or Cloudera Manager?

ETL Tools Programming Language Datasets Data Pipeline

ETL for Snowflake: Why You Need It and How to Get Started

Ascend.io

DECEMBER 19, 2023

If you’re working with Snowflake or just starting to explore its capabilities, you might be wondering: Do I really need ETL for Snowflake? Is it possible to rely solely on Snowflake’s own features, or is there a strong case for bringing ETL into the mix? If so, where do I get started? But first, a disclaimer.

ETL Tools IT Data Pipeline Data Warehouse

Build AI-driven near-real-time operational analytics with Amazon Aurora zero-ETL integration with Amazon Redshift and ThoughtSpot

ThoughtSpot

OCTOBER 27, 2023

Every business that analyzes its operational (or transactional) data needs to build a custom data pipeline involving several batch or streaming jobs to extract transactional data from relational databases, transform it, and load it into the data warehouse. Zero-ETL integration is set up between Amazon Aurora and Amazon Redshift.

Building MySQL Data Warehouse SQL

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

A quick overview of what everyone used for years (and what some of us still use). Obviously, as data is different from a "traditional product" (in terms of users, for instance), a data engineer uses other tools. This is close to what we also call ETL or ELT. It addresses different use cases.

Data Engineering Data Engineer Engineering Hadoop

SQL Streambuilder Data Transformations

Cloudera

FEBRUARY 21, 2023

SQL Stream Builder (SSB) is a versatile platform for data analytics using SQL as a part of Cloudera Streaming Analytics, built on top of Apache Flink. It enables users to easily write, run, and manage real-time continuous SQL queries on stream data with a smooth user experience. This might be OK for some cases.

SQL Kafka Raw Data Data

1. Streamlining Membership Data Engineering at Netflix with Psyberg

Netflix Tech

NOVEMBER 14, 2023

Some techniques we used were: 1. Using fixed lookback windows to always reprocess data, assuming that most late-arriving events will occur within that window. However, this approach usually leads to redundant data reprocessing, thereby increasing ETL processing time and compute costs. Psyberg: The Game Changer!
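The fixed lookback technique described above can be sketched as follows. This is a simplified illustration under assumed daily partitions; the function names and the dates are hypothetical and are not taken from the Netflix article.

```python
from datetime import date, timedelta


def partitions_to_reprocess(run_date: date, lookback_days: int) -> list[date]:
    """Return every daily partition a fixed-lookback job would rebuild.

    The window always covers the run date plus the preceding
    `lookback_days` days, whether or not late events actually arrived.
    """
    if lookback_days < 0:
        raise ValueError("lookback_days must be non-negative")
    return [
        run_date - timedelta(days=offset)
        for offset in range(lookback_days, -1, -1)
    ]


def redundant_partitions(window: list[date], changed: set[date]) -> list[date]:
    """Partitions in the window that received no new data (wasted work)."""
    return [day for day in window if day not in changed]


# Hypothetical run: a 3-day lookback where only two partitions changed.
window = partitions_to_reprocess(date(2023, 11, 14), lookback_days=3)
wasted = redundant_partitions(
    window, changed={date(2023, 11, 14), date(2023, 11, 12)}
)
```

In this run, two of the four partitions are rebuilt even though nothing in them changed, which is the redundant reprocessing cost the excerpt mentions.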

Data Engineering Data Engineer Engineering Metadata

10 Essential Azure Data Engineer Skills to Improve in 2023

Knowledge Hut

NOVEMBER 17, 2023

The position of Azure Data Engineers is becoming increasingly important as businesses attempt to use the power of data for strategic decision-making and innovation. An Azure Data Engineer is like a data expert who uses special tools to organize and clean up information so that a company can use it to make smart choices.

Data Engineering Data Engineer Engineering Data Lake

Unleashing the Power of CDC With Snowflake

Workfall

JUNE 12, 2023

Organisations harness this stream of information, leveraging it to innovate, optimise, and thrive. Contents: What Is CDC and Its Benefits? / Where Is CDC Used and Who Uses It? / Types of CDC / Hands-On: Implement CDC in Snowflake Using Streams / Conclusion

Telecommunication Metadata Healthcare Finance

Zero ETL: What’s Behind the Hype?

Ascend.io

SEPTEMBER 12, 2023

As businesses find themselves increasingly reliant on big data and analytics, the traditional process of data integration, primarily ETL (Extract, Transform, Load), can sometimes act as a bottleneck. Amazon Web Services (AWS) recognized this issue and unveiled the concept of zero ETL at re:Invent 2022. What Is Zero ETL?

Amazon Web Services Data Warehouse MySQL AWS

What is Data Extraction? Examples, Tools & Techniques

Knowledge Hut

JANUARY 30, 2024

We'll demystify its importance, explore real-world examples that showcase its practical uses, dig into the toolbox of tools and techniques available to us, and even venture into the world of advanced practices that elevate data extraction to an art. Use Case Essential for data preprocessing and creating usable datasets.

ETL Tools Database-centric Data Mining Data Cleanse

Streaming Data Integration Without The Code at Equalum

Data Engineering Podcast

NOVEMBER 30, 2020

With the improvements and increased variety of options for streaming data engines, and improved tools for change data capture, it is possible for data teams to make that goal a reality. If you are struggling with streaming data integration and change data capture, then this interview is definitely worth a listen.

Data Integration Coding BI Kafka

Simplifying Data Integration Through Eventual Connectivity

Data Engineering Podcast

JULY 28, 2019

Summary: The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL? Can you talk through an example use case?

Data Integration Metadata Media Architecture

Self Service Data Exploration And Dashboarding With Superset

Data Engineering Podcast

APRIL 26, 2021

One of the most common and widely used methods of access is through a business intelligence dashboard. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self-service access to data and deliver analytics. Give it a listen and then take it for a test drive today.

Business Intelligence Data Warehouse Hadoop Data Pipeline

Leave Your Data Where It Is And Automate Feature Extraction With Molecula

Data Engineering Podcast

MARCH 8, 2021

In this episode H.O. explains how, using Pilosa as the core, he built the Molecula platform to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. What are the problems/use cases that Molecula solves for? When is Molecula the wrong choice?

IT Data Warehouse MongoDB Kafka

Data News — Week 23.07

Christophe Blefari

FEBRUARY 18, 2023

Last year DataOps was used in many different ways to describe many different data-related tasks. To me it stops here; all the marketing derivations of it, saying we do data products using DataOps methodology, are just marketing. This article shows how you can do it with Polars, which leverages Arrow and uses less memory.

Software Engineer Software Engineering Data Validation Data

What is a Data Pipeline?

Grouparoo

OCTOBER 26, 2021

This may include a data warehouse when it’s necessary to pipeline data from your warehouse to various destinations as in the case of a reverse ETL pipeline. Destinations can vary depending on the use case of the pipeline. Unlike traditional ETL systems, data pipelines don’t have to move data in batches.

Data Pipeline ETL Tools ETL System Data Warehouse

Introduction to MongoDB for Data Science

Knowledge Hut

NOVEMBER 3, 2023

MongoDB is used for data science, meaning that we utilize the capabilities of this NoSQL database system as part of our data analysis and data modeling processes, which fall under the realm of data science. It can thus be used effectively for almost all data types found in analytic or data science projects.

MongoDB Data Science NoSQL ETL Tools

Data Pipeline- Definition, Architecture, Examples, and Use Cases

ProjectPro

DECEMBER 7, 2021

This blog will give you in-depth knowledge of what a data pipeline is and also explore other aspects such as data pipeline architecture, data pipeline tools, use cases, and much more. As data expands exponentially, organizations struggle to harness the power of digital information for different business use cases.

Data Pipeline Architecture Kafka AWS

Real World Change Data Capture At Datacoral

Data Engineering Podcast

MARCH 22, 2021

Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming.

Data Warehouse Metadata Data Lake Hadoop

How to Translate SQL Scripts Into Matillion Jobs

phData: Data Engineering

JULY 12, 2023

What is Matillion ETL? Matillion ETL is a platform designed to help you speed up your data pipeline development by connecting it to many different data sources, enabling teams to rapidly integrate and build sophisticated data transformations in a cloud environment with a very intuitive low-code/no-code GUI. With that, let’s dive in!

SQL Database Data Pipeline Coding

How to Translate SQL Scripts Into Matillion Jobs

phData: Data Engineering

APRIL 21, 2023

With that, let’s dive in! What is Matillion ETL? Matillion ETL is a platform designed to help you speed up your data pipeline development by connecting it to many different data sources, enabling teams to rapidly integrate and build sophisticated data transformations in a cloud environment with a very intuitive low-code/no-code GUI.

SQL Database Data Pipeline Coding

What is the ETL Process?

Grouparoo

DECEMBER 14, 2021

The ETL data integration process has been around for decades and is an integral part of data analytics today. In this article, we’ll look at what goes on in the ETL process and some modern variations that are better suited to our modern, data-driven society. What is ETL?
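As a rough illustration of the three ETL stages, the sketch below extracts a few records, cleans them, and loads them into an in-memory SQLite table. The sample rows, the `payments` table, and the function names are assumptions made for this example and are not taken from the Grouparoo article.

```python
import sqlite3


def extract():
    """Extract: return raw records (hard-coded here in place of a real source)."""
    return [
        {"name": " Ada ", "amount": "10.50"},
        {"name": "Grace", "amount": "3"},
    ]


def transform(rows):
    """Transform: trim whitespace and convert amounts from text to floats."""
    return [(row["name"].strip(), float(row["amount"])) for row in rows]


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments (name, amount) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
names = [r[0] for r in conn.execute("SELECT name FROM payments ORDER BY name")]
```

Keeping each stage as its own function makes it easy to swap the source or the target without touching the transformation logic.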

Process Raw Data Data Warehouse Data Pipeline

Table Types Are Evolving And So Is Monte Carlo

Monte Carlo

MARCH 4, 2024

Most major data cloud providers support both use cases. So for example, a data team may be using Snowflake with external Iceberg tables to leverage the convenience of modern metadata management while keeping the data well structured and efficient under the hood. In this case the subscription column name has been changed.

Metadata Data Lake Data Warehouse Data Engineering

Moving Machine Learning Into The Data Pipeline at Cherre

Data Engineering Podcast

APRIL 19, 2021

Summary: Most of the time, when you think about a data pipeline or ETL job, what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming.

Data Pipeline Machine Learning Data Warehouse Datasets

Scaling Analysis of Connected Data And Modeling Complex Relationships With The TigerGraph Graph Database

Data Engineering Podcast

MAY 8, 2022

Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Can you describe what TigerGraph is and the story behind it? What are some of the core use cases that you are focused on supporting? (... polyglot persistence, etc.)

Database Data Lake BI Business Intelligence

Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus

Data Engineering Podcast

AUGUST 6, 2022

Summary: The optimal format for storage and retrieval of data depends on how it is going to be used. For machine learning applications, relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. Sifflet also offers a 2-week free trial.

Machine Learning Database MySQL PostgreSQL

Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel

Data Engineering Podcast

APRIL 10, 2022

Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. What are the stages of the data lifecycle where Gretel is used? What are the most interesting, innovative, or unexpected ways that you have seen Gretel used?

Engineering Data Lake Data Engineering Data Engineer

Building Data Pipelines That Run From Source To Analysis And Activation With Hevo Data

Data Engineering Podcast

SEPTEMBER 11, 2022

Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sifflet also offers a 2-week free trial.

Data Pipeline Building MongoDB Scala

Going From Transactional To Analytical And Self-managed To Cloud On One Database With MariaDB

Data Engineering Podcast

OCTOBER 23, 2022

Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. What are the use cases/capabilities that you are targeting with those products?

Database MySQL Cloud MongoDB

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

Summary: Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. What are some of the main use cases for Spark? Who uses Spark?

Scala MySQL Kafka Hadoop

Evolving An ETL Pipeline For Better Productivity

Data Engineering Podcast

JUNE 3, 2019

Summary: Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.

Media Data Pipeline Machine Learning Data Science

Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs

Data Engineering Podcast

APRIL 24, 2022

Summary: There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool.

Machine Learning Systems Data Lake Metadata

Three Reference Architectures for Real-Time Analytics On Streaming Data

Rockset

APRIL 26, 2023

This is part three in Rockset’s Making Sense of Real-Time Analytics (RTA) on Streaming Data series. In part 1, we covered the technology landscape for real-time analytics on streaming data. In part 2, we covered the differences between real-time analytics databases and stream processing. The database has two primary jobs.

Architecture Transportation Data Lake Insurance

Simplify Metrics on Apache Druid With Rill Data and Cloudera

Cloudera

JULY 21, 2022

Cloudera users can securely connect Rill to a source of event stream data, such as Cloudera DataFlow, model data into Rill’s cloud-based Druid service, and share live operational dashboards within minutes via Rill’s interactive metrics dashboard or any connected BI solution. Native streaming ingestion support from Kafka and Kinesis.

BI Digital Media Data Warehouse Kafka

Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster

Data Engineering Podcast

JULY 24, 2022

Summary: The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. What do you have planned for that event and what does the release mean for users who have been refraining from using the framework until now?

MongoDB Scala MySQL Data Lake

Python for Data Engineering

Ascend.io

SEPTEMBER 14, 2023

Here’s how Python stacks up against SQL, Java, and Scala based on key factors. For Python: Performance: offers good performance, which can be enhanced using libraries like NumPy and Cython. Typing: dynamically typed, but can use type hints. Ease of use: celebrated for its concise and clear syntax.

Data Engineering Data Engineer Python Engineering

Ascending with MotherDuck

Ascend.io

JUNE 22, 2023

Ascend recognized the capabilities of DuckDB early on and used it in its own platform. DuckDB powers Ascend’s advanced, user-friendly data pipeline observability features, a perfect example of an embedded analytics use case! Simplify with Ascend: complex data delivery tasks can slow down your analytics process.

NoSQL Google Cloud MySQL Data Pipeline
