Data Engineering Digest

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. The article covers data warehouses and data lakes, and the iterations of the lakehouse.

Data Lake, Data Warehouse, BI, SQL

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Data Engineering Podcast

FEBRUARY 18, 2024

Summary: A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. Data lakes are notoriously complex.

Data Lake, High Quality Data, Data Warehouse, Google Cloud

Making Email Better With AI At Shortwave

Data Engineering Podcast

APRIL 21, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Data lakes are notoriously complex. From a product perspective, what are the data challenges that are posed by email?

Data Lake, High Quality Data, Data Pipeline, Machine Learning

Version Your Data Lakehouse Like Your Software With Nessie

Data Engineering Podcast

MARCH 10, 2024

Summary: Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. Data lakes are notoriously complex.

Data Lake, High Quality Data, Data Pipeline, Architecture

Barking Up The Wrong GPTree: Building Better AI With A Cognitive Approach

Data Engineering Podcast

MAY 5, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Data lakes are notoriously complex.

Building, Data Lake, High Quality Data, Machine Learning

Build Your Second Brain One Piece At A Time

Data Engineering Podcast

APRIL 28, 2024

In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain. Data lakes are notoriously complex.

Building, Data Lake, High Quality Data, Machine Learning

Reconciling The Data In Your Databases With Datafold

Data Engineering Podcast

MARCH 17, 2024

Summary: A significant portion of data workflows involves storing and processing information in database engines. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data. Data lakes are notoriously complex.

Database, Data Lake, High Quality Data, Data Workflow

Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

Data Engineering Podcast

MARCH 24, 2024

Summary: A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software-defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project, they are investing in that capability with a suite of new features.

Data Lake, High Quality Data, Hadoop, Data Pipeline

Tackling Real Time Streaming Data With SQL Using RisingWave

Data Engineering Podcast

FEBRUARY 4, 2024

In this episode Yingjun Wu explains how RisingWave is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.

SQL, Data Lake, High Quality Data, Data Pipeline

When And How To Conduct An AI Program

Data Engineering Podcast

MARCH 3, 2024

Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization. Data lakes are notoriously complex. Join us at the top event for the global data community, Data Council Austin.

Programming, Data Lake, High Quality Data, Data Pipeline

Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

Data Engineering Podcast

APRIL 7, 2024

Summary: Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. Dagster offers a new approach to building and running data platforms and data pipelines.

Data Lake, High Quality Data, BI, Data Workflow

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Data Engineering Podcast

FEBRUARY 25, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Data lakes are notoriously complex. Join us at the top event for the global data community, Data Council Austin.

Database, Technology, Data Lake, High Quality Data

Designing A Non-Relational Database Engine

Data Engineering Podcast

APRIL 14, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment.

Non-relational Database, Relational Database, Database, Designing

Data Sharing Across Business And Platform Boundaries

Data Engineering Podcast

FEBRUARY 11, 2024

Summary: Sharing data is a simple concept, but complicated to implement well. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. Dagster offers a new approach to building and running data platforms and data pipelines.

Data Lake, High Quality Data, Government, Data Pipeline

Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel

Data Engineering Podcast

JANUARY 7, 2024

Summary: Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. What are the open questions today in technical scalability of data engines?

Data Process, Process, Data Lake, High Quality Data

Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

Data Engineering Podcast

MARCH 31, 2024

Summary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. Dagster offers a new approach to building and running data platforms and data pipelines.

Project, Data Lake, High Quality Data, Data Workflow

Build A Data Lake For Your Security Logs With Scanner

Data Engineering Podcast

JANUARY 28, 2024

Summary: Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. A query engine is useless without data to analyze.

Data Lake, Building, High Quality Data, AWS

Modern Customer Data Platform Principles

Data Engineering Podcast

JANUARY 21, 2024

A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. Data projects are notoriously complex.

Data Lake, High Quality Data, NoSQL, Data Warehouse

Unlocking Your dbt Projects With Practical Advice For Practitioners

Data Engineering Podcast

NOVEMBER 19, 2023

Summary: The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data projects are notoriously complex.

Project, Data Lake, High Quality Data, SQL

Designing Data Platforms For Fintech Companies

Data Engineering Podcast

DECEMBER 31, 2023

Summary: Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.

Designing, Data Lake, High Quality Data, SQL

Adding An Easy Mode For The Modern Data Stack With 5X

Data Engineering Podcast

DECEMBER 17, 2023

Summary: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.

Data Lake, High Quality Data, SQL, Architecture

Shining Some Light In The Black Box Of PostgreSQL Performance

Data Engineering Podcast

NOVEMBER 5, 2023

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team.

PostgreSQL, Data Lake, High Quality Data, SQL

Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

Data Engineering Podcast

DECEMBER 10, 2023

Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project.

Data Lake, High Quality Data, SQL, Architecture

Data Engineering Weekly #167

Data Engineering Weekly

APRIL 14, 2024

Meta introduces the Open-Vocabulary Embodied Question Answering (OpenEQA) framework, a new benchmark to measure an AI agent’s understanding of its environment by probing it with open-vocabulary questions. Aishwarya Srinivasan: How Microsoft's 1-bit LLM is going to change the LLM landscape?

Data Engineering, Data Engineer, Engineering, Business Intelligence

Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

Data Engineering Podcast

NOVEMBER 12, 2023

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team.

Software Engineer, Software Engineering, Engineering, Data Lake

Designing Data Transfer Systems That Scale

Data Engineering Podcast

DECEMBER 3, 2023

Summary: The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor.

Systems, Designing, Data Lake, SQL

Building Trust in Public Sector AI Starts with Trusting Your Data

Cloudera

DECEMBER 1, 2023

These government-led efforts have had a profound impact on the development and adoption of AI solutions in the public sector, paving the way for a future where data-driven decision-making and automation are the norm. This requires a holistic approach that addresses the key areas of security, governance, and trustworthy data.

Building, Government, Transportation, Data Governance

An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem

Data Engineering Podcast

OCTOBER 16, 2022

Summary: The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse.

Data Lake, Food, MongoDB, Scala

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Snowflake

NOVEMBER 2, 2023

Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.

Data Lake, Data Warehouse, Cloud, Unstructured Data

Making The Open Data Lakehouse Affordable Without The Overhead At Iomete

Data Engineering Podcast

OCTOBER 9, 2022

Summary: The core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies, the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use.

Metadata, AWS, MongoDB, MySQL

Data Engineering Weekly #161

Data Engineering Weekly

MARCH 3, 2024

RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Editor’s Note: Chennai, India Meetup - March-08 Update. We are thankful to Ideas2IT for hosting our first Data Hero’s meetup.

Data Engineering, Data Engineer, Pipeline-centric, Engineering

Data Engineering Weekly #152

Data Engineering Weekly

DECEMBER 10, 2023

RudderStack, one of the leading alternatives to Segment, is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Research Paper: ChatGPT’s First Anniversary - Are Open-Source Large Language Models Catching Up?

Data Engineering, Data Engineer, Engineering, Metadata

The future of data architecture is hybrid: choosing your hybrid-first data strategy starts at Cloudera Now 2022

Cloudera

AUGUST 9, 2022

With all of the buzz around cloud computing, many companies have overlooked the importance of hybrid data. The truth is, the future of data architecture is all about hybrid. As a leader in hybrid data, Cloudera is positioned to help organizations take on the challenge of managing and analyzing data wherever it resides.

Data Architecture, Architecture, Cloud Computing, Cloud

Taking Charge of Tables: Introducing OpenHouse for Big Data Management

LinkedIn Engineering

JULY 19, 2023

Co-Authors: Sumedh Sakdeo, Lei Sun, Sushant Raikar, Stanislav Pak, and Abhishek Nath. Introduction: At LinkedIn, we build and operate an open source data lakehouse deployment to power Analytics and Machine Learning workloads. While functional, our current setup for managing tables is fragmented.

Big Data, Data Management, Management, Metadata

The Future Is Hybrid Data, Embrace It

Cloudera

JUNE 7, 2022

We live in a hybrid data world. In the past decade, the amount of structured data created, captured, copied, and consumed globally has grown from less than 1 ZB in 2011 to nearly 14 ZB in 2020. Impressive, but dwarfed by the amount of unstructured data, cloud data, and machine data – another 50 ZB.

IT, Unstructured Data, Data Architecture, Government

The View From The Lakehouse Of Architectural Patterns For Your Data Platform

Data Engineering Podcast

JULY 3, 2022

Summary: The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. Atlan is the metadata hub for your data ecosystem.

Architecture, Metadata, MongoDB, MySQL

Don’t Get Left Behind in the AI Race: Your Easy Starting Point is Here

Cloudera

MARCH 26, 2024

Cloudera: Your Trusted Partner in AI. With over 25 Exabytes of Data Under Management and hundreds of customers leveraging our platform for Machine Learning, Cloudera has a long and successful history as an industry leader. ” said Sanjeev Mohan, Principal, SanjMo & Former Gartner Research VP, Data & Analytics.

Machine Learning, Banking, Professional Services, Architecture

Data Engineering Weekly #164

Data Engineering Weekly

MARCH 24, 2024

Companies are more open to adopting Gen AI for their internal use cases but have reservations about rolling it out to their clients. Kai Waehner: The Data Streaming Landscape 2024. This is a comprehensive overview of the state of the data streaming landscape in 2024.

Data Engineering, Data Engineer, Engineering, Metadata

Top 10 Data Science Companies in 2024

Knowledge Hut

JANUARY 18, 2024

Data Science is an amalgamation of several disciplines, including computer science, statistics, and machine learning. As the internet becomes our second home, Big Data has exploded. Data Science is the study of this big data to derive meaningful patterns.

Data Science, Amazon Web Services, Finance, Big Data

Educating ChatGPT on Data Lakehouse

Cloudera

MARCH 17, 2023

As the use of ChatGPT becomes more prevalent, I frequently encounter customers and data users citing ChatGPT’s responses in their discussions. I love the enthusiasm surrounding ChatGPT and the eagerness to learn about modern data architectures such as data lakehouses, data meshes, and data fabrics.

Education, Unstructured Data, Data Lake, Data Warehouse

The Future of Data Warehousing

Monte Carlo

JANUARY 16, 2024

As every company becomes a data company, and more users within these companies are discovering new uses for previously unavailable data, existing infrastructure and tools are not just meeting that demand but creating new demands. At the center of it all is the data warehouse, the lynchpin of any modern data stack.

Data Lake, Data Warehouse, Unstructured Data, AWS

Databricks Data + AI Summit 2023 Keynote Recap: LakehouseIQ, Delta Lake 3.0, and More!

Monte Carlo

JUNE 28, 2023

Like the omnipresent San Francisco drizzle, Data + AI keynote attendees steadily trickled down the escalators and toward the Moscone Center Hall B + C. There was quite a bit of speculation that attendance would suffer at both of the major data conferences taking place this week, but that is clearly not the case.

Data Warehouse, Scala, Unstructured Data, Government

Top 10 Data & AI Trends for 2024

Monte Carlo

DECEMBER 20, 2023

The data and AI space moves fast. Wondering what’s next for the future of data engineering and GenAI? Each year, we chat with one of the data industry’s pioneering leaders about their predictions for the modern data stack – and share a few of our own. Ready to see the future? Big data will get small.

Amazon Web Services, Metadata, AWS, Data

Top 10 Data & AI Predictions for 2024

Monte Carlo

DECEMBER 20, 2023

The data and AI space moves fast. Wondering what’s next for the future of data engineering and GenAI? Each year, we chat with one of the data industry’s pioneering leaders about their predictions for the modern data stack – and share a few of our own. Ready to see the future? Big data will get small.

Amazon Web Services, Metadata, AWS, Data
