How healthy is your data?
Updated: Oct 28, 2020
The topic for the day is going to be a bit more technical than usual, focusing on the quality of the data.
I'm going to be honest here: there might be things I'm going to miss, I might see quality from a different perspective than other benchmarking frameworks, I might even look a bit too idealistic. But hey, if I can't be like that on my blog, then where?
The traits that I am going to list apply to data that acts as input for analysis. They describe how the data should look to make it easy to extract key aspects or provide a full picture of a specific topic.
Very rarely will the data be in this state, hence my disclaimer that my view might be a little bit idealistic.
Your data must make sense!
Data making sense is not just about relating to something that has a meaning (which, by the way, it should).
The way data presents itself should tell a story: this customer has been with us for five years, lives in Lithuania, and is a frequent buyer of our wines; this company is in the landscaping business, has 25 employees, and is valued at 1 million dollars.
These are trivial examples, but they make the point: when you look at a specific model and its values, you should get enough detail to extract insightful information without needing to involve other models.
Data will very rarely be like this, which is why effort has to go into modeling the raw data. That is a topic in itself, which I will talk about in upcoming posts.
Your data must tell the right story!
Data-driven companies rely on their data to make business decisions, from the smallest ones, like deciding what sodas to buy for the office, to P&L optimization.
KPIs, impact on the business, growth, you name it, can be calculated using SQL and some data, and so the data governs the business activity and can create shifts and all sorts of movements in decisions. It's one way to do it; many companies do, but not all.
Having such importance, data should be leading and not misleading. I'm not questioning how the data is interpreted, but rather how accurate the data you interpret is.
Data is inserted, updated, moved, modeled, and aggregated, and each of these steps introduces a point of failure. On the way from the source to your data warehouse, if handled wrongly, data can be transformed in ways that give you a completely different story.
The customer inserts an address in your application, with a nice postcode of 6 characters. Later in the process, when moving the data to the database used for analyses, the postcode gets truncated to 5 characters, and all of a sudden the customers live at whole new addresses.
Again, a trivial example, and the root cause is easy to find. But imagine truncating the decimals of the business cost figures; not such a small problem anymore.
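To make the truncation problem a bit more tangible, here is a minimal sketch of a check that compares rows in the source with the same rows in the warehouse. Everything in it is made up for illustration: the function name `find_mismatches`, the column names, and the sample values are assumptions, and a real pipeline would read both sides from actual databases.

```python
from decimal import Decimal


def find_mismatches(source_rows, warehouse_rows, key, columns):
    """Compare rows by key and report every column whose value changed.

    Returns a list of (key_value, column, source_value, warehouse_value).
    """
    warehouse_by_key = {row[key]: row for row in warehouse_rows}
    mismatches = []
    for src in source_rows:
        dst = warehouse_by_key.get(src[key])
        if dst is None:
            # The row never arrived in the warehouse at all.
            mismatches.append((src[key], "<row missing>", None, None))
            continue
        for column in columns:
            if src[column] != dst[column]:
                mismatches.append((src[key], column, src[column], dst[column]))
    return mismatches


# Hypothetical sample data: customer 1 lost a postcode character
# and the decimals of the cost figure somewhere along the way.
source = [
    {"customer_id": 1, "postcode": "AB12CD", "cost": Decimal("199.95")},
    {"customer_id": 2, "postcode": "08105", "cost": Decimal("20.00")},
]
warehouse = [
    {"customer_id": 1, "postcode": "AB12C", "cost": Decimal("199")},
    {"customer_id": 2, "postcode": "08105", "cost": Decimal("20.00")},
]

mismatches = find_mismatches(source, warehouse, "customer_id", ["postcode", "cost"])
for mismatch in mismatches:
    print(mismatch)
```

Comparing full values instead of only lengths is a deliberate choice here: it catches the shortened postcode and the lost decimals with the same check, at the price of having to read both sides.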
Your data must be constantly maintained!
Data ranges in importance from critical, core tables to temporary generated tables that are discarded after a 5-minute lifetime.
The importance of data can be determined by many factors, and no matter which one we are referring to, data that is ready to be analyzed needs to be well documented, needs to have someone taking ownership of it, and needs to be constantly taken care of.
If anything changes and the logic around the data is no longer correct, the result is the same as in the example of truncated postcodes or cost decimals. Yes, the logic around the data and the values it can take need to be accurate too.
It's the owner's/maintainer's job to make sure that the data is both well documented and accurate. This is an ever-ongoing process and one of great importance which... frequently is not happening. That does not necessarily mean the business will crumble the next day, but for sure it is not a pleasant ride to work with such data.
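One possible way to keep documentation, ownership, and the accepted values in one place is a small column specification that can also be used as a check. This is only a sketch under my own assumptions: `ColumnSpec`, `invalid_values`, the `order_status` column, and the `orders-team` owner are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Documentation and accepted values for one column (hypothetical structure)."""
    name: str
    description: str
    owner: str
    allowed_values: frozenset


def invalid_values(rows, spec):
    """Return the values of spec.name that the documentation does not allow."""
    return [row[spec.name] for row in rows if row[spec.name] not in spec.allowed_values]


status_spec = ColumnSpec(
    name="order_status",
    description="Lifecycle state of an order",
    owner="orders-team",
    allowed_values=frozenset({"new", "paid", "shipped"}),
)

# The application started writing "refunded", but nobody updated the spec.
rows = [{"order_status": "paid"}, {"order_status": "refunded"}]

unexpected = invalid_values(rows, status_spec)
if unexpected:
    print(f"{status_spec.name}: undocumented values {unexpected}, ask {status_spec.owner}")
```

When the check fires, either the data is wrong or the documentation is outdated; in both cases the owner has something to fix.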
Most of the time, data is messy and requires a lot of effort to understand, clean, and prepare before you even start writing the analysis itself.
It's not impossible to achieve; it is certainly not easy, but it is rewarding.
So... next time you create a model, become the maintainer of some tables, or move data around, remember that someone out there will consume it at some point. That consumer might have the ability to find things out about you, like where you live 🙃.