
Fast Copy-On-Write within Apache Parquet for Data Lakehouse ACID Upserts

Uber Engineering

Experience the power of row-level secondary indexing in Apache Parquet, enabling 3-20X faster upserts and unlocking new possibilities for efficient table ACID operations in today’s Lakehouse architecture.


Seamlessly Migrate Your Apache Parquet Data Lake to Delta Lake

databricks

Apache Parquet is one of the most popular open source file formats in the big data world today. Being column-oriented, Apache Parquet allows queries to read only the columns they need.



Comparing Performance of Big Data File Formats: A Practical Guide

Towards Data Science

Parquet vs ORC vs Avro vs Delta Lake. The big data world is full of various storage systems, heavily influenced by different file formats. You’ll explore four widely used file formats: Parquet, ORC, Avro, and Delta Lake.


Data News — Week 24.12

Christophe Blefari

On my side I'll talk about Apache Superset and what you can do to build a complete application with it. Finally, xAI released Grok-1 in the open — the weights are available as a torrent and on HF, and everything is under the Apache License. Now give me the news. Sometimes it looks like you speak to a child.


Data News — Week 23.05

Christophe Blefari

Microsoft Azure announced managed Airflow — Starting this week you'll be able to launch Apache Airflow within Azure Data Factory; the feature is in public preview. Parquet best practices: the art of filtering — How to leverage Parquet filtering to save processing time.


Build an Open Data Lakehouse with Iceberg Tables, Now in Public Preview

Snowflake

Apache Iceberg’s ecosystem of diverse adopters, contributors and commercial support continues to grow, establishing it as the industry-standard table format for an open data lakehouse architecture. Other engines that write to the table, such as Apache Spark or Apache Flink, can continue to do so, and Snowflake can read.


Supporting Diverse ML Systems at Netflix

Netflix Tech

Fast Data: Our main data lake is hosted on S3, organized as Apache Iceberg tables. For ETL and other heavy lifting of data, we mainly rely on Apache Spark. We use Apache Arrow to decode Parquet and to host an in-memory representation of the data.
