How do you know if the data you are using is the right data to be using?
I can’t count the number of times I have asked myself that question. In general, for just about every new analysis, project, or piece of research, you have to ask that question at some point.
Even data you have used a hundred times, from a highly trusted source, needs to be scrutinized.
Now if you work with data every day in a familiar format, from the same source, and with no changes to the data gathering and storage process, you don’t have to spend much time validating it. Usually you will spot problems when something just doesn’t look right during the analysis.
On the other hand, things get a whole lot trickier when you are using data from a source you don’t use often, when something has changed in the way the data is populated, or when it’s the first time you are using the data at all.
When this happens, I have a few suggestions on how to validate the data.
First off, pull the data, do your analysis, and draw some conclusions. If it passes the eye test and feels OK to you, then your job is to validate it.
One simple way to do this is to pull the data again in the exact same way and make sure you get the exact same data. Or change one parameter, like the dates used in the query, and see if that significantly alters the way the data looks and feels.
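Here is a rough sketch of what that double-pull check can look like in pandas. Everything here is made up for illustration: `pull_data` is a hypothetical stand-in for your real query, and the rows are fake example data, not a real feed.

```python
import pandas as pd

# Hypothetical helper: in practice this would run your actual query
# against the source system. Here it just returns fixed example rows.
def pull_data(start_date, end_date):
    rows = [
        {"date": "2024-01-02", "orders": 120, "revenue": 4300.0},
        {"date": "2024-01-03", "orders": 98,  "revenue": 3550.5},
    ]
    df = pd.DataFrame(rows)
    return df[(df["date"] >= start_date) & (df["date"] <= end_date)]

# Pull twice with identical parameters: the results should match exactly.
first = pull_data("2024-01-01", "2024-01-31")
second = pull_data("2024-01-01", "2024-01-31")
assert first.equals(second), "Same query, different data -- investigate!"

# Now vary one parameter (the end date) and sanity-check the difference.
shorter = pull_data("2024-01-01", "2024-01-02")
print("rows (full vs shorter):", len(first), len(shorter))
print("revenue (full vs shorter):",
      first["revenue"].sum(), shorter["revenue"].sum())
```

If narrowing the date range barely changes the row count or the totals, or widening it changes nothing at all, that is a hint the query isn’t filtering the way you think it is.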
Another option is to have someone else do the same thing independently and see if they get the same results you do. You can also find someone who knows the data to look over your work and see if it makes sense to them.
Whatever you do, the best way to prevent publishing or using bad data is to involve someone else. Not always possible, I know, but it’s the best way to go.
Another suggestion is to get the data, do some analysis, and then step away for a while. Come back to it with fresh eyes. Don’t let your mind play tricks on you by making you see what you want to see instead of what is really there.
I have seen several articles showing research that most of the time spent doing data analysis is actually spent cleaning data. In a lot of businesses the data lake has become a data swamp, clogged with bad or unusable data. As the percentage of unstructured data increases daily, it’s easy to see how data swamps have become the norm. Even the most robust data collection and mining can run afoul if the data is not trustworthy.
So getting back to the last post… know how the data is populated. Who, when, why, how, how often, with what filters… things like that. I can’t stress this enough. No matter how good you are at analysis, or what tool you are using to do the analysis, if you don’t have an understanding of what happens to the data before it gets to you, then you are probably not drinking from a clean lake.
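Those who/when/how-often questions can be turned into quick profiling checks before you trust a new feed. A minimal sketch, using a made-up example frame in place of whatever actually lands in your lake:

```python
import pandas as pd

# Hypothetical example data standing in for a real feed.
df = pd.DataFrame({
    "user_id": [1, 2, 3, None, 5],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", None, "2024-01-08", "2024-01-09"]),
    "plan": ["free", "pro", "free", "pro", "free"],
})

# How much data is there, and is any of it missing?
print("rows:", len(df))
print("null fraction per column:")
print(df.isna().mean().round(2))

# How fresh is it? A stale max date suggests the feed stopped updating.
print("latest signup:", df["signup_date"].max())

# What values actually show up? Unexpected categories can reveal
# upstream filters or process changes you didn't know about.
print("plan values:", sorted(df["plan"].unique()))
```

None of this replaces talking to whoever owns the pipeline, but a few lines like these will often surface a changed filter or a broken load before it surfaces in your conclusions.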