Data Entropy: More Data, More Problems?

How to navigate and embrace complexity in a modern data organisation.

Salma Bakouk
Towards Data Science


Source: https://unsplash.com/@brett_jordan

“It’s like the more money we come across, the more problems we see” Notorious B.I.G.

Webster’s dictionary defines Entropy in thermodynamics as a measure of the unavailable energy in a closed thermodynamic system that is also usually considered to be a measure of the system’s disorder.

In Information Theory, the concept of information entropy was introduced by Claude Shannon in 1948. It represents, for a random variable, the level of “surprise”, “information”, or “uncertainty” associated with the variable’s possible outcomes. Some nice reads for my math nerds out there (here and here).
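
To make this concrete, here is a minimal sketch (my own illustration, not from Shannon’s paper) of how entropy can be computed from an observed set of outcomes. Notice how a fair coin, whose outcomes are maximally uncertain, carries a full bit of entropy, while a biased coin carries less:

```python
import math
from collections import Counter

def shannon_entropy(outcomes):
    """Shannon entropy H(X) = -sum(p * log2(p)), in bits, from observed outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(["H", "T", "H", "T"]))  # 1.0: a fair coin is maximally uncertain
print(shannon_entropy(["H", "H", "H", "T"]))  # ~0.81: a biased coin is less "surprising"
```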

In a broader context, Entropy is the tendency of things to become disordered over time and for general disorder and uncertainty to increase in a system or environment.

If you are a data practitioner in today’s flourishing data ecosystem and are asked by a peer or a business stakeholder to describe your data platform, I imagine you would use a combination of the following words: modern / cloud-based, modular, constantly evolving, flexible, scalable, secure, etc. But between you and me, you know you’d also like to throw messy, unpredictable, chaotic, expensive, and disorganised into the mix.

Do any of the below scenarios seem familiar?

  • Business users are unable to find and access data assets critical to their workflows.
  • Stakeholders are constantly questioning the accuracy of the numbers they see in business dashboards.
  • Data engineers spend countless hours troubleshooting broken pipelines.
  • Every “minor” change upstream results in mayhem.
  • The data team is constantly burning out and has a high employee turnover.
  • Stakeholders fail to see the ROI behind expensive data initiatives.

The list goes on.

Every organisation wants to become data-driven, but the reality is that many organisations are hoarding data, spending millions of dollars per year on technology and human resources, and hoping for the best. Do you see how Entropy is not just in Thermodynamics?

So what does Entropy look like in the context of a data platform?

Infrastructure & Technology

According to research by IDC, companies globally spend over $200 bn per year on data and analytics solutions to drive innovation and business prosperity. However, without clear direction and a sound strategy, most companies end up with bloated technology and data stacks, where technical debt and maintenance costs are constantly on the rise.

Infrastructure and technology entropy can manifest in one or more of the following ways:

  • Too many overlapping tools with a heterogeneous mix of in-house and external solutions. When it comes to data and infrastructure tooling, less is more. If you are just starting, your analytics needs are relatively straightforward, and your data stack should mirror that. That said, you should also think ahead and favour solutions that can scale with your needs and provide the level of flexibility you will require in your hyper-growth phase. To learn more about how to set the proper foundation for a successful data practice depending on your company’s stage of Data Maturity, read this. If you are in the process of upgrading an existing platform, the key is to be strategic, think about business criticality, and prioritise accordingly. As a rule of thumb, before investing in a new solution, ensure that a complete migration from your existing solutions and processes is possible, and plan the deprecation process strategically to minimise technical debt.
  • Siloed Data. TechTarget defines a data silo as a repository of data controlled by one department or business unit and, therefore, not wholly or easily accessible by other departments within the same organisation. Although they might seem innocuous, data silos often lead to the creation of unnecessary information barriers and the dilution of overall data quality and data governance standards.
  • Absence or poor adoption of company-wide guidelines surrounding the creation and deployment of data products. This can be both a cultural and a technological problem. We will tackle the cultural aspect later in this article; on the technology side, the issue often arises when the technical foundation does not allow for flexibility and democratisation of both the data and the infrastructure producing it. Not only does this lead to slower development cycles and poor-quality data solutions, but it can also seriously hinder the data maturity process of an organisation.

People & Culture

“Culture eats strategy for breakfast” Peter Drucker.

The famous quote by the Austrian-American management consultant and author is particularly pertinent to an organisation’s data strategy. Data plays a central role in modern organisations; the centricity here is not just a figure of speech, as data teams often sit between traditional IT and the different business functions. As a result, data practitioners are expected to manage stakeholders of varied backgrounds and communication styles. This may lead to one or a combination of the following:

  • Lack of alignment between IT and the business. Business and IT teams have fundamentally different purposes and responsibilities within an organisation. However, they both work towards the same objectives: improving overall business performance, reducing costs, and achieving sustainable growth.
  • Lack of alignment between IT and Data Management functions. For most organisations, the data practice is the new kid on the block relative to software. And while it makes sense to treat data management as a separate entity, complete segregation from IT & engineering is not a sensible choice. Fully separating the two might hurt data production cycles and lead to a number of inefficiencies, not to mention the detriment it can have on knowledge transfer.
  • Lack of alignment between Data Producers and Data Consumers when it comes to data Service-Level Agreements (SLAs). We call them consumers, but are they treated as such? (A sketch of what an explicit data SLA could look like follows this list.)
  • Unclear ownership of data quality. Is it the responsibility of the data producers? Of the consumers? Of the data product managers? A few roles are emerging to remediate this, such as data quality analyst, data ops engineer, and data governance strategist. Yet the vast majority of organisations still struggle to pinpoint where the responsibility for something as fundamental as the quality of their data assets lies.
  • Data is a second-class citizen. Every company wants to become data-driven, but even in today’s environment, data-driven decision-making remains an elusive idea for many. When it comes to enabling and fostering a robust data-driven culture, buying software is by far the easiest part.
  • Data engineers spend more than 50% of their time dealing with data incidents. There is a lot of value to be gained from having data engineers focus on revenue-generating activities rather than troubleshooting data pipeline issues. For one, it reduces the time they spend on repetitive tasks and frees them up for work that matters more.
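
To make the SLA point above concrete, here is a minimal, hypothetical sketch of what an explicit agreement between a data producer and its consumers could look like. Every field name here is illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass
class DataSLA:
    """An explicit, reviewable agreement between a data producer and its consumers.
    All fields are illustrative; adapt them to your own platform and needs."""
    dataset: str               # the asset the agreement covers
    owner: str                 # the team accountable when the SLA is breached
    max_freshness_hours: int   # data must never be older than this
    max_null_rate: float       # tolerated fraction of nulls in key columns
    consumers: list[str]       # who depends on the dataset and must be notified

orders_sla = DataSLA(
    dataset="analytics.orders_daily",
    owner="data-platform-team",
    max_freshness_hours=24,
    max_null_rate=0.01,
    consumers=["finance-dashboard", "churn-model"],
)
```

Writing such expectations down, wherever they live, turns the producer-consumer relationship from an implicit assumption into something both sides can review and hold each other to.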

Data Quality or lack thereof

  • High occurrence of data quality incidents of varying implications and magnitude. According to a 2021 Gartner report, bad data costs organisations an average of around $13 million annually. But what do data quality issues look like? There are a variety of metrics to measure data quality and assess how data deviates from expectations (a minimal sketch of such checks follows this list). More can be found in this blog.
  • Multiple versions of what is seemingly the same thing. What is the single source of truth? This problem is particularly exacerbated by silos and the lack of alignment between teams. As a result, confusion and frustration spiral among stakeholders trying to make decisions based on the data, leading them to question everything the data team produces and sometimes even its existence.
  • Data producers and data consumers are in constant conflict, often about who should own data reliability and at what stage of the pipeline. While this might be contained at relatively small organisations, it becomes insurmountable at a larger scale.
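
As promised above, here is a minimal, illustrative sketch of the kind of checks such metrics typically translate into; the columns, thresholds, and sample data are hypothetical:

```python
import pandas as pd

def basic_quality_checks(df: pd.DataFrame) -> dict:
    """A few common data quality metrics; the checks shown are examples, not a standard."""
    return {
        # Completeness: fraction of missing values per column
        "null_rate": df.isna().mean().to_dict(),
        # Uniqueness: duplicate rows often signal broken joins or double loads
        "duplicate_rows": int(df.duplicated().sum()),
        # Volume: a sudden drop in row count is a classic silent failure
        "row_count": len(df),
    }

orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, 5.0, 5.0]})
print(basic_quality_checks(orders))
# {'null_rate': {'order_id': 0.0, 'amount': 0.0}, 'duplicate_rows': 1, 'row_count': 3}
```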

So how do you reduce Entropy within your data platform?

While the second law of thermodynamics states that Entropy either increases with time or remains constant and never decreases, luckily for us data practitioners, that rule does not apply. Here are some tips on how to reduce Entropy within your Data Platform:

People & Culture

  • Data-driven culture starts at the (very) top. For a company to create and foster a data-driven culture, top management and C-level need to instil and nurture a fact-based decision-making mentality. When the general expectation from top management is that every business case and every decision needs to be backed by facts, and they lead by example, it becomes second nature for operators to follow suit. A great read on this here.
  • Data leaders are also business leaders. As a data leader, you are, by design, in a hybrid role, as data needs to serve a specific business purpose; no company is in the business of doing analytics for analytics’ sake. Not only have you earned your seat at the table by being an exceptional technology leader, but you can also play an integral role in the overall business performance. To do so, you need to think like a business leader, which means taking a pragmatic, output-oriented approach to data and data projects.
  • Think big but celebrate small wins. There are no shortcuts to data maturity, and while you should have ambitious goals and take on big projects, it is essential to know when to take a step back and celebrate small achievements to keep the team motivated and engaged. Take a fact-driven approach here as well: it creates a propensity for being data-driven and lays the foundation for a strong data culture.

Processes

“If you can’t describe what you are doing as a process, you don’t know what you’re doing” W. Edwards Deming

  • Complexity is inevitable. Learn to embrace it. As your data platform scales, it will become more and more complex. Complexity is not inherently bad; the key is to manage it so that the system stays easy to understand and work with.
  • Let go of what doesn’t serve you. One of the challenges with data platform complexity is that it can lead to a lot of “technical debt”. This is when there are parts of the system that are no longer serving a purpose but are kept around because they are too difficult or time-consuming to get rid of. It is essential to periodically review your data platform and cut unnecessary parts to keep it lean and efficient.
  • Be pragmatic: review processes frequently and make changes where necessary. As your data platform evolves, your processes should too. Review operations regularly and adjust as needed to keep them up to date and relevant.

Technology

  • Software ROI. When it comes to data platform technology, it is crucial to ensure you are getting a good return on your investment. This means choosing software that fits your purpose and will meet your needs now and in the future. When deciding, it is also essential to consider things like scaling costs, maintenance fees, and training.
  • Know when to build and when to buy. There is a lot of great data platform software available on the market, and in most cases, purchasing an off-the-shelf solution will be more cost-effective than building your own. Key factors to consider when weighing the pros and cons of each approach are: cost, time, quality and scalability. While in certain cases it might make sense to build something custom to solve a very specific problem, it often leads to the accumulation of technical debt and does not allow for scalability, not to mention the opportunity cost that comes with building and maintaining the solution.
  • Data Observability. Observability in data is the ability to understand and measure the health status of the different data assets and components of a data platform. Just as Data Entropy represents the disorder and chaos within a data platform, data observability solutions emerged to solve exactly this class of problems.
    As explained in this blog, data observability is a concept that comes from software engineering and, before that, from control theory. The idea of software observability (think of companies like Datadog and New Relic) is to help software engineers understand what is going on inside their applications and monitor their health status. The same idea can now be applied to data: observability is the ability of an organisation to gain actionable insights into the health status of its data. Data observability has four main pillars:
  • Metrics: measure the quality of the data
  • Metadata: access and monitor external characteristics about the data
  • Lineage: map the dependencies between data assets
  • Logs: track how data interacts with the “outside world”
    The best way to conceptualise Data Observability is as an overseeing layer on top of a modern data infrastructure, one that allows data practitioners to see into and comprehend the increasingly complex web of data assets that characterises a modern enterprise data platform. By connecting to each compartment of the data stack, from ingestion, ETL/ELT, modelling, and warehousing all the way to BI/analytics, reverse ETL, and ML, a data observability solution should be able to provide actionable insight into the health status of data assets at each stage of the pipeline.
Image courtesy of Author
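
To tie the four pillars together, below is a hypothetical sketch of the shape a per-asset health report could take. The structure and field names are my own illustration, not any particular vendor’s API:

```python
from dataclasses import dataclass

@dataclass
class AssetHealthReport:
    """Illustrative shape of an observability report for a single data asset,
    mirroring the four pillars above. Not a real vendor API."""
    asset: str
    metrics: dict        # quality of the data itself: row counts, null rates, ...
    metadata: dict       # external characteristics: schema, owner, last update
    lineage: list[str]   # upstream assets this one depends on
    logs: list[str]      # how the data interacts with the "outside world"

report = AssetHealthReport(
    asset="analytics.orders_daily",
    metrics={"row_count": 10_482, "null_rate.amount": 0.002},
    metadata={"owner": "data-platform-team", "last_updated": "03:00 UTC today"},
    lineage=["raw.orders", "raw.currencies"],
    logs=["loaded by scheduled job at 03:00", "queried by finance-dashboard at 08:15"],
)
```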

Conclusion

Data Entropy is expensive. According to research by IBM, poor data quality costs the U.S. economy a staggering $3.1 trillion annually, not to mention the impact it can have on an organisation’s competitive standing and reputation. Is Entropy in data inevitable? Is it a direct result of scale and expansion? Should it be embraced? Yes, yes, and yes. Amid the inevitable rise in complexity, leading organisations are seeking practical solutions to help them navigate data entropy and ensure that their expensive software and data platform investments yield the best possible ROI. Data Observability emerged as an answer to Data Entropy: by providing a 360-degree view of the health status of data assets and how they interact with and within the platform, it lets data practitioners finally invest their time in value-creating tasks that propel the business forward.
