Data Warehouses Vs Operational Data Stores Vs Data Lakes – How To Store Your Data For Analytics

Data Warehouses Vs Operational Data Stores Vs Data Lakes – How To Store Your Data For Analytics

August 3, 2023 data analytics strategy data engineering consulting 0
snowflake consulting

Photo by Leif Christoph Gottwald on Unsplash

A few months ago, I uploaded a video where I discussed data warehouses, data lakes, and transactional databases. However, the world of data management is evolving rapidly, especially with the resurgence of AI and machine learning. There are numerous other methods that technical teams are utilizing to handle their data effectively. In this presentation, I aim to explore the various ways companies are currently managing their data.

It is crucial to recognize that well-managed data is fundamental for the successful performance of advanced algorithms, models, and LLMs. As I mentioned in my previous video, the emergence of data engineering as a significant indicator over the last ten years highlights its growing importance. Data engineering and data management will continue to play a pivotal role in the foreseeable future as organizations grapple with increasingly complex data sets.

Given that selecting the right data storage paradigm is critical for your company, I will be delving into some common terms you are likely to encounter in this domain. Additionally, I will provide clear definitions and share how these approaches have proven beneficial in my personal experiences.

data storage analytics consulting

ODS – Operational Data Stores

Let’s begin with an often overlooked data storage solution known as an operational data store (ODS).

Unlike traditional data warehouses used for historical data analysis, an ODS provides a current, integrated, and consistent view of operational data from various sources. It acts as an intermediary layer between operational systems (e.g., transactional databases, CRM systems, or ERP systems) and the data warehouse or data mart.

Many companies actually use what could more accurately be called an ODS vs. a data warehouse, even if they call it a data warehouse or some form of ODS with historical tracking. It really can feel like splitting hairs to a degree.

Again, an ODS is exclusively supposed to deal with current operational data and basic status-level reporting.

But I have seen several companies employ some middle ground where it really acts as a catch all. Teams may use it to report data, recalculate Salesforce roll-ups and just about everything else!

Data Lakes

When I broke into the data world, data lakes had become all the rage. Every enterprise I knew had someone working on a data lake project. Everyone thought that a schema-on-read approach would help reduce the speed to value data scientists and businesses would gain.

After all, modeling data and creating standardized data pipelines takes time. Even with the fanciest of tools, having a solid data governance strategy and set of processes to get data from raw to production isn’t generally quick (regardless of what vendors tell you).

OK, enough context.

What is a Data Lake?

A data lake acts as a centralized repository that stores large volumes of structured, semi-structured, and unstructured data.

It is designed to store raw data in its native format without the need for predefined schemas or transformations.

The concept of a data lake emerged as a response to the challenges posed by traditional data warehousing approaches, which require extensive data modeling and schema design before data can be stored and analyzed. Data modeling takes time, and as data science was growing causing a need for cheap compute along with it. Thus, the idea of a cheaper and less planning intensive option was very enticing to many companies.

Overall, these data stores still are decently popular, just generally as a component in a larger data infrastructure vs the main place everyone gets their data from. Some companies rely on them or some form of hybrid between a data lake and a data warehouse (what’s that called again?).

Data Warehouses

Whenever I talk about data warehouses, I generally allude to my past, where I worked on a data warehouse without knowing it was a different design approach for storing data (compared to a transactional database).

The company was using Microsoft SQL Server, built tables with primary keys (they just didn’t call them IDs), and there seemed to be some normalization. But it wasn’t the same type of database I had learned about in my database course, nor did anyone tell me what a data warehouse was. I had to go and eventually figure it out through my own readings.

For those who don’t know, a data warehouse is a structured repository that centralizes an organizations data. It also helps make integrating data from various sources easy as well as tracks historical data.

Or as the Father of Data Warehousing would say…

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process. — Bill Inmon

Unlike a data lake, developing a data warehouse takes a lot of upfront time because you need to plan out the schema. The data should actually follow some form of structure.

This is that term that everyone seems to be talking about recently, data modeling. Data modeling involves gathering requirements, talking with stakeholders, digging into data systems and understanding their quirks and a whole lot more. That is one reason many of these projects can take months if not years.

Despite all of the new variations and methods that we take on to try to report and manage data for analytics. The data warehouse has stood the test of time. At the very least in terms of holding mind share. Even when it comes to building solutions that aren’t what could be deemed a traditional data warehouse, many of the terms and vocabulary that denote a data warehouse get used.

The problem really arose with data warehouse when companies just wanted to speed up the time to value of their data projects.

In turn, this is also why many companies have looked to other methods and approaches to developing an analytical storage system.

SeattleDataGuy’s Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Data Marts

Another data store you’ll often hear about is a data mart.

A simple way of thinking about what data marts are is that they are more focused versions of a data warehouse.

Data warehouses will often have large amounts of heavily integrated data coming from various sources. This wouldn’t be easy to work with from a department-by-department perspective. This might also be why Innmon feels that solutions like Snowflake and likely ELT-approaches tend to lean toward creating data marts as they generally focus on sucking data into a data store but not spending time integrating it.

He also felt this way about Kimball modeling.

In truth, the Kimball model was for data marts, not a data warehouse. A data mart and a data warehouse are fundamentally different things. — Bill Inmon

To give Snowflake a little bit of weight in this article, here is how they defined a data mart:

A data mart is a curated subset of data often generated for analytics and business intelligence users. Data marts are often created as a repository of pertinent information for a subgroup of workers or a particular use case. – Snowflake

At this point, it should be clear that there is a certain level of focus that comes with data marts, whereas data warehouses tend to cover a vast array of subjects.

But what really ends up happening at many companies is there are multiple data warehouses, data marts and other forms of analytical storage systems that get developed.

Data Mesh

More recently, you’ve likely heard of the term data mesh. Unlike many of the other terms, this doesn’t really refer to a specific type of data store but instead, more generally refers to a company’s  larger data architecture (which could be made up of multiple data marts, warehouses, etc.)

Now when this term became popularized, I was working for Facebook. The interesting thing about this was we never called what we were creating “a data mesh.”

But it’s essentially what it was.

We essentially followed the four principles listed by Zhamak Dehghani, which are:
  1. Domain-oriented decentralized data ownership and architecture
  2. Data as a product
  3. Self-serve data infrastructure as a platform
  4. Federated computational governance

We actually had discussions on our data team at Facebook in terms of referring to our data as a product and how we can more easily make the data accessible to first-time users, whether it be how we name tables or the quality we’d expect.

And with all the various products, it’d likely take far too long to take the EDW approach, so each product team developed their own data warehouse/mart inside their own namespace.

Now, the benefit we had compared to other companies is that most of our data was integrated at the source, meaning we didn’t have to spend too much time figuring out which ID could join a data set from one application to another.

Thus, we had a data mesh, but we never really called it that.

What I found made the approach at Facebook successful was having a reliable data catalog and other forms of data lineage because it made it easy to figure out what data was where.

Now, what about alternative forms to data warehouse?

I mean, data warehouse like architecture with a slight set of nuances that separate them?

That’s where our next two examples come in.

Data Vaults

Dan Linstedt, the creator of the Data Vault says that a DV is:

 The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses. – Dan Linstedt

Now I had to recently help a company simplify their data warehouse which had been developed about half-way using the data vault process and we didn’t end up going forward with that approach.

But.

In general a data vault should make changes over time easier. Thus, providing companies and teams that have a more agile approach to their data warehouse development a lot of benefits(of course, most data teams need to constantly change their data warehouse).

Overall, the advice I have been given in terms of data vaults generally is, if you’ve implemented a data vault before, then its a great choice but if you’ve never gone through the process of developing one, it can prove challenging.

Data Lakehouses

It’s difficult to cover various forms of data stores without referencing a Data Lakehouse. Again we can lean into Bill Inmon who has provided his take on a Data Lakehouse. He defines it as an…

Extension or evolution of the data warehouse. The data warehouse applied to structured data. The data lakehouse applies to structured data, textual data and analog/IoT data. — Bill Inmon

The main focus of a Data Lakehouse is to try to mesh the best of both worlds. Balancing the structure and governance of a data warehouse and the open nature of a data lake. There are clearly certain solutions and vendors that benefit from this definition (as others benefit from a more traditional DW view).

In the end, it’s always fun watching all the debates and arguments on exactly which way we should be storing data.

Why Does It Matter?

In many ways, the constant pedantic definition of various data stores is exhausting. We could spend all day arguing why one company’s data warehouse is really just a data lake or a data mart.

We could also spend all day arguing why one company needs a data warehouse while another one is operating just fine using Power BI for its data modeling layer.

But that doesn’t actually drive value for your data team. Instead, you need to first figure out which approach will likely fit your problem best and then use it. Will you run into limitations and issues with all of these?

Yes.

But that’s why it’s critical to define what your company wants to do with its data first.

Do you need to spend a lot of time focused on integrating and modeling data, or do you just need to report on a few key metrics?

Do you have a lot of large and unstructured data, or data that exists in multiple sources and namespaces, or just a few data sources that need to be centralized and lightly integrated?

To many degrees, your choice does and doesn’t matter.

To the engineering and analytics team, it matters because it impacts their ability to deliver value with data.

But to the business, all they will care about is that they can get numbers fast. If you’re new to data, welcome to the world of balancing business needs with engineering needs (and the desire to bolster a resume).

Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.

What Is Apache Airflow

How to build a data pipeline using Delta Lake

A comprehensive introduction to change data capture (CDC)

Intro To Databricks – What Is Databricks

Is Apache Airflow Due for Replacement? The First Impression Of mage-ai