Information Confidence or Data Credibility?
Data Credibility or Information Confidence, take your pick. It doesn’t matter. What does matter is that Thomas C. Redman, Ph.D. has written a timely article on data credibility, now available in the December 2013 Harvard Business Review (HBR). If you have an online subscription, you can read it here. Also have a look at the recent posts in this blog on Information Confidence and Data Quality Metrics.
Data Quality – The Field Guide
Tom Redman released Data Quality – The Field Guide back in 2001. Published by Digital Press, it is worth finding a copy. One of Tom's essential messages was that a database is like a lake. If the lake becomes polluted, Tom suggested, there are four alternatives (p. 54):
1 – Ignore the pollution and treat those who get sick from the lake water.
2 – Filter the lake water, remove the sources of pollution, and put the water back in the lake.
3 – Filter small amounts of lake water each day, either at the point of entry (so that only clean water enters the lake) or right before the water is used.
4 – Identify the sources of the pollutants and eliminate (or at least mitigate) them.
The parallel is clear: you can clean a polluted database with a series of fix-ups (the sub-optimal approaches), or you can fix the problems at their source (yes, this is the right one!). The question for me is why organizations have not internalized this message and focused on making data useful and appropriate for the data's users. Data has producers and consumers. They need to connect. How should that happen?
Back in 2001, Tom wrote (p. 177):
Senior management must lead the data quality program if it is to be widely implemented
Let’s see what Tom is saying today. Here is a hint… it hasn’t changed much.
Today – Data’s Credibility Problem
Well, the article’s tag line is “Management – not technology – is the solution”. Tom opens with a vignette: a marketing executive notices a flawed market share number and directs its correction, rewards the individual who found the error, and finally institutionalizes error checking, all without ever going back to the source of the flawed data to insist it be systematically corrected. So, 12 years on, this mythical marketing executive is still not cleaning up the source of the pollution.
The Impact of “Error Leak-Through”
Tom mentions the potential for data errors to leak through and wreak havoc on data consumers and other stakeholders, despite best efforts to clean the polluted data lake. Bad data in a medical device can kill. In my post on Information Confidence, I referred to the Information Confidence Integrity Level as a measure of the need for data quality, as determined by the data’s intended use. Data with a high Information Confidence Integrity Level, like data used to control medical devices, demands a comprehensive quality program and significant investment. To find out more, read about bad data as the “social disease of business” – part 1 here, and part 2 here.
Put the data producers and the data consumers in the same room
Tom Redman believes, and I agree, that when it comes to improving data quality for consumers, “a little communication goes a long way”. When data producers understand how their consumers will use the data, and understand the Information Confidence Integrity Level required, the quality of the data is bound to improve. When suppliers fill the data lake without knowing whether the lake’s users will be (figuratively speaking) drinking the water, cooking with it, or dumping sewage into it, the suppliers’ approach to quality will be anyone’s guess.
Make Sure New Data is Clean!
Rather than launch a massive effort to clean up existing bad data, companies should focus on improving the way new data are created.
Eventually, older data will outlive its half-life. With luck, the old, bad stuff will sink to the bottom of the lake and pollute no more.
The Bottom Line
The messages about data quality written in 2001 remain valid and actionable today. Moreover, the dialog between data producers and data consumers should be active and ongoing. Data producers should understand the Information Confidence Integrity Level expected by their data consumers to ensure they can deliver the right data at the right level of quality, when and where it is needed.