Just last month, an article was shared showing that over 30% of the data Google used for one of its shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data the model was built on was also full of errors. How could anyone using Google’s model hope to trust its results if it’s full of human-made errors that computers can’t correct? And Google isn’t alone in having significantly misleading data labels. A 2021 MIT study found that nearly 6% of images in the industry-standard ImageNet database are mislabeled and, in addition, “labeling errors were found in the test sets of 10 of the most commonly used computer vision, natural language, and audio datasets.” How can we hope to trust or use these models if the data used to train them is so bad?
The answer is that you cannot trust this data or these models. As the saying goes, garbage in, garbage out, and AI projects suffer from a significant amount of garbage data. If Google, ImageNet, and others are making this mistake, chances are you are making it too. Research from Cognilytica shows that over 80% of AI project time is spent managing data, from collecting and aggregating it to cleaning and labeling it. Even with this much time spent, mistakes are bound to happen, and that’s assuming the data is of good quality to begin with. Bad data equals bad results. This has been the case for all kinds of data-oriented projects for decades, and now it’s a major problem for AI projects as well, which are, at their core, big data projects.
Data quality is more than “bad data”
Data is at the heart of AI. What drives AI and ML projects is not the programming code but the data from which the systems must learn. Too often, organizations move too quickly with their AI projects only to realize later that poor data quality is causing their AI systems to fail. If your data isn’t in good shape, don’t be surprised if your AI projects suffer.
Data quality is not simply a matter of “bad data,” such as incorrect data labels, missing or incorrect data points, noisy data, or low-quality images. Important data quality issues also arise when acquiring or merging datasets, when downloading data, and when enhancing data with third-party datasets. Each of these actions, and more, introduces potential sources of data quality problems.
But how do you assess the quality of your data before you even start your AI project? It is important to evaluate the state of your data beforehand, rather than moving forward with your AI project only to realize too late that you do not have the good-quality data the project requires. Teams need to understand their data sources, such as streaming data, customer data, or third-party data, and then how to successfully merge and combine the data from these different sources. Unfortunately, most data doesn’t come in clean, ready-to-use condition. You must remove extraneous data, incomplete data, duplicate data, and data that is otherwise unusable. You will also need to filter this data to minimize bias.
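The basic cleaning pass described above can be sketched in a few lines of code. This is a minimal illustration, not a recommended pipeline; the record fields (`id`, `age`, `label`) and the validity range are hypothetical stand-ins for whatever your data actually contains.

```python
def clean_records(records):
    """Drop incomplete, duplicate, and implausible records from a dataset."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Drop incomplete records (missing fields)
        if rec.get("id") is None or rec.get("age") is None or rec.get("label") is None:
            continue
        # Drop duplicate records (same id seen before)
        if rec["id"] in seen_ids:
            continue
        # Drop extraneous/implausible values (hypothetical valid range)
        if not 0 <= rec["age"] <= 120:
            continue
        seen_ids.add(rec["id"])
        cleaned.append(rec)
    return cleaned

raw = [
    {"id": 1, "age": 34, "label": "yes"},
    {"id": 1, "age": 34, "label": "yes"},   # duplicate
    {"id": 2, "age": None, "label": "no"},  # incomplete
    {"id": 3, "age": 999, "label": "no"},   # implausible value
    {"id": 4, "age": 51, "label": "no"},
]
print(len(clean_records(raw)))  # 2 usable records remain
```

In practice each of these checks would be tailored to the dataset at hand, but even a crude pass like this surfaces how much of the raw data is unusable.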
But we’re not done yet. You should also consider how the data needs to be transformed to meet your specific requirements. How will you implement data cleansing, data transformation, and data manipulation? Not all data is created equal, and over time you will experience data decay and data drift.
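Data drift can be watched for with even a very simple statistical check. The sketch below, a simplified assumption rather than a full drift-detection method, flags drift when the mean of incoming data shifts away from the training mean by more than a chosen number of training standard deviations; the threshold of 2.0 is arbitrary and illustrative.

```python
import statistics

def drift_score(train_values, live_values):
    """Measure how far the live data's mean has shifted from the
    training mean, in units of the training standard deviation."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10, 11, 9, 10, 12, 10, 11, 9]
live_ok = [10, 11, 10, 9]
live_drifted = [18, 19, 20, 21]

print(drift_score(train, live_ok) < 2.0)       # True: no significant shift
print(drift_score(train, live_drifted) > 2.0)  # True: distribution has drifted
```

Real projects would monitor many features, use proper statistical tests, and alert on trends rather than single snapshots, but the underlying idea is the same: compare what the model sees in production against what it was trained on.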
Have you thought about how you’re going to track and evaluate that data to make sure its quality stays at the level you need? If you need labeled data, how will you get it? There are also data augmentation steps you may need to consider; if you perform additional data augmentation, how will you track it? Quality data involves many steps, and these are all elements you need to think about to make your project successful.
Data labeling in particular is a common area where many teams get stuck. For supervised learning approaches to work, they need to be fed good, clean, well-labeled data so that they can learn by example. If you are trying to recognize images of boats in the ocean, you need to feed the system good, clean, well-labeled images of boats to train your model. That way, when it is given an image it hasn’t seen before, it can tell you with a high degree of certainty whether or not the image contains a boat. But if you train your system only on boats in the ocean on sunny days with no cloud cover, how do you expect the AI system to react when it sees a boat at night or a boat under 50% cloud cover? If your training data doesn’t match real-world data or real-world scenarios, then you’re in trouble.
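The sunny-day blind spot above is easy to detect before training if the dataset carries condition metadata. This is a hypothetical sketch: the `"condition"` tag and the 10% threshold are assumptions for illustration, not part of any standard tooling.

```python
from collections import Counter

def coverage_gaps(examples, min_fraction=0.10):
    """Return conditions that make up less than min_fraction of the
    training set, i.e. scenarios the model will rarely have seen."""
    counts = Counter(ex["condition"] for ex in examples)
    total = sum(counts.values())
    return [cond for cond, n in counts.items() if n / total < min_fraction]

# Hypothetical boat-image metadata: nearly every training shot is sunny.
dataset = (
    [{"condition": "sunny"}] * 95
    + [{"condition": "night"}] * 3
    + [{"condition": "cloudy"}] * 2
)
print(sorted(coverage_gaps(dataset)))  # ['cloudy', 'night'] are underrepresented
```

A check like this won’t fix a coverage gap, but it turns “how will the model react at night?” from a surprise in production into a question answered during data preparation.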
Even when teams spend a lot of time making sure their test data is perfect, often the quality of the training data doesn’t reflect real-world data. In one publicly documented example, AI industry leader Andrew Ng discussed how, in his work with Stanford Health, the quality of the data in his test environment did not match the quality of real-world medical images, rendering his AI models useless outside of the test environment. This caused the entire project to stall and fail, jeopardizing millions of dollars and years of investment.
Planning for project success
All this activity focused on data quality can seem overwhelming, which is why these steps are often skipped. But as noted above, bad data is what kills AI projects, so not paying attention to these steps is a major cause of overall AI project failure. This is why organizations are increasingly adopting best-practice approaches such as CRISP-DM, Agile, and CPMAI to ensure that the critical data quality steps that help prevent AI project failure are not skipped or bypassed.
The theme of teams moving forward without planning for project success is all too common. Indeed, the second and third phases of both the CRISP-DM and CPMAI methodologies are “Data Understanding” and “Data Preparation.” These phases come before any model building and are therefore considered best practice for AI organizations that want to succeed.
Indeed, if the Stanford medical project had adopted CPMAI or a similar approach, the team would have realized long before the million-dollar, multi-year mark that data quality issues would sink the project. While it may be comforting to know that even luminaries like Andrew Ng and companies like Google make significant data quality mistakes, you still don’t want to be an unnecessary member of that club and let data quality issues plague your AI projects.