Comparing Performance of Big Data File Formats: A Practical Guide

Towards Data Science

Parquet vs ORC vs Avro vs Delta Lake. The big data world is full of storage systems, each heavily influenced by different file formats. It’s crucial to know which file format fits which scenario, and how to read and write data in each one.

5 Layers of Data Lakehouse Architecture Explained

Monte Carlo

Data lakehouse architecture combines the benefits of data warehouses and data lakes, bringing together the structure and performance of a data warehouse with the flexibility of a data lake. Data lakehouse architecture is an increasingly popular choice for many businesses because it supports interoperability between data lake formats.

Hands-On Introduction to Delta Lake with (py)Spark

Towards Data Science

Nowadays, almost by default, organizations have to deal with data in different formats (CSV, PDF, video, Parquet, etc.), hence the success of blob storage like Amazon’s S3. What is Delta Lake? Before going into further detail on Delta Lake, we need to revisit the concept of the data lake, so let’s travel through some history.

Data Warehouse vs Data Lake vs Data Lakehouse: Definitions, Similarities, and Differences

Monte Carlo

Different vendors offering data warehouses, data lakes, and now data lakehouses each bring their own distinct advantages and disadvantages for data teams to consider. Commonly, you’ll find data formats such as JSON, Apache Parquet, and Apache Avro in these environments. But the options for data storage are evolving quickly.

5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP)

Cloudera

Apache Iceberg is a high-performance, open table format, born in the cloud, that scales to petabytes independent of the underlying storage layer and the access engine layer. By being a truly open table format, Apache Iceberg fits well within the vision of the Cloudera Data Platform (CDP). What is Apache Iceberg?

Are Apache Iceberg Tables Right For Your Data Lake? 6 Reasons Why.

Monte Carlo

Over the last few months, Apache Iceberg has come to the forefront as a promising new open-source table format that removes many of the largest barriers to lakehouse adoption, namely the high latency and lack of OLTP (Online Transaction Processing) support in Apache Hive. Is your data lake a good fit for Iceberg?