Why clean data is so important


Data is often collected as part of a process without being essential to completing that process. For such data items, unless quality is enforced in some way, accuracy is very likely to fall below 100%. A while ago I wrote about predicting sales from quotations, but what I did not explain was that the data came from two systems, one of which did not adequately validate a user input field required to join that system’s data to the other system’s data. Presumably this did not matter to the business process or the analysis required when the system was first put in.

Much useful analysis can be performed on data that is not 100% accurate, for example looking at ratios over time; however, there is always going to be some doubt about the results. My examination of machine learning techniques was possible after I had ‘cleaned’ the data by removing any records that could not be matched between the two systems. My results showed only a small improvement in predicting a sale over tossing a coin, which did not seem particularly useful in the context being examined. However, in some applications a small improvement over a 50/50 guess could be very important (e.g. share trading), and in such cases even slightly inaccurate data could give misleading results.

Because data may later be put to uses that were unforeseen when it was originally captured, I would advise architects to be less tolerant of poor data quality.
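As a rough illustration of the kind of cleaning described above, the sketch below separates records that can be joined between two systems from those that cannot, and reports the match rate. It is not the code used for the original analysis. The field names (`quote_id`, `quote_ref`, `order_id`), the sample records and the normalisation rule (trim whitespace, upper-case) are all assumptions made up for the example.

```python
def split_by_match(quotes, orders):
    """Separate orders whose free-text quote_ref matches a known quote_id
    from those that cannot be joined to any quote."""
    quote_ids = {q["quote_id"] for q in quotes}
    matched, unmatched = [], []
    for order in orders:
        # Hypothetical normalisation: tolerate stray spaces and lower case.
        key = order["quote_ref"].strip().upper()
        (matched if key in quote_ids else unmatched).append(order)
    return matched, unmatched


# Made-up sample data: the second system lets users type anything into quote_ref.
quotes = [{"quote_id": "Q100"}, {"quote_id": "Q101"}, {"quote_id": "Q102"}]
orders = [
    {"order_id": 1, "quote_ref": "Q100"},
    {"order_id": 2, "quote_ref": " q101 "},  # recoverable after normalisation
    {"order_id": 3, "quote_ref": "101"},     # prefix missing, cannot be joined
    {"order_id": 4, "quote_ref": ""},        # left blank by the user
]

matched, unmatched = split_by_match(quotes, orders)
match_rate = len(matched) / len(orders)
print(f"matched {len(matched)} of {len(orders)} orders ({match_rate:.0%})")
```

Reporting the match rate alongside any later result makes it visible how much data was discarded before the analysis began, which matters when the effect being measured is itself small.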
