
A recent study from the Massachusetts Institute of Technology (MIT) concluded that labeling errors in the datasets used for AI benchmark testing can lead scientists to draw incorrect conclusions about how well machine learning performs in the real world.

According to Engadget's article published on Monday, March 29, the conclusion comes after a team of computer scientists found that around 3.4 percent of the labels in the test sets they examined were inaccurate, causing problems for AI systems.
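
The team's own tooling is not reproduced here, but a simplified sketch in the same spirit shows how suspect labels can be surfaced: flag any example whose model confidence in its recorded label falls below that class's average self-confidence. The function name, thresholding rule, and toy numbers below are illustrative assumptions, not the study's actual code.

```python
# Minimal sketch (not the MIT team's pipeline) of flagging likely label errors:
# an example is suspicious if the model's confidence in the label it was given
# falls below the average confidence the model has in that class overall.
import numpy as np

def flag_suspect_labels(pred_probs: np.ndarray, given_labels: np.ndarray) -> np.ndarray:
    """pred_probs: (n_examples, n_classes) out-of-sample predicted probabilities.
    given_labels: (n_examples,) integer labels as recorded in the dataset.
    Returns a boolean mask marking examples whose recorded label looks unreliable."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average confidence assigned to class c
    # on the examples that the dataset labels as c.
    thresholds = np.array([
        pred_probs[given_labels == c, c].mean() for c in range(n_classes)
    ])
    confidence_in_given_label = pred_probs[np.arange(len(given_labels)), given_labels]
    return confidence_in_given_label < thresholds[given_labels]

# Toy data: the third item is recorded as class 0, but the model strongly prefers class 1.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.15, 0.85]])
labels = np.array([0, 0, 0, 1])
print(flag_suspect_labels(probs, labels))  # [False False  True False]
```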

MIT's AI Dataset Study and Startling Findings

In July, VentureBeat reported MIT researchers' finding that the well-known ImageNet dataset exhibits "systematic annotation issues."

The MIT researchers analyzed the test sets of 10 datasets, including ImageNet, and found over 2,900 errors in the ImageNet validation set alone.

When such a dataset is used as a benchmark, those errors mean its labels no longer line up with direct observation, or ground truth, so results measured against it can be misleading.

In the paper titled "From ImageNet to Image Classification: Contextualizing Progress on Benchmarks," the researchers wrote, "A noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for."

They added that, going forward, it is essential to develop annotation pipelines that better capture the ground truth while remaining scalable.

After closely examining ImageNet's "benchmark task misalignment," the MIT team also found that about 20 percent of ImageNet photos contain multiple objects, which lowers model accuracy on those images by roughly 10 percentage points.

Shibani Santurkar, a co-author of the paper, said in an International Conference on Machine Learning (ICML) presentation that capturing the content of an ImageNet image may require more than a single ImageNet label.

She added that because these labels are treated as ground truth, the mismatch could cause a misalignment between the ImageNet benchmark and the real-world object recognition task it stands in for.
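
As a hedged illustration of that point (not the paper's evaluation code), the sketch below scores the same hypothetical predictions two ways: against a single dataset label per image, and against the full set of objects actually present. The helper names and toy images are invented for clarity.

```python
# Why a single label can undersell a model: a prediction that matches an object
# genuinely present in the image is still counted wrong under single-label scoring.
from typing import List, Set

def single_label_accuracy(preds: List[str], dataset_labels: List[str]) -> float:
    # One "correct" answer per image, as in standard ImageNet evaluation.
    return sum(p == l for p, l in zip(preds, dataset_labels)) / len(preds)

def multi_label_accuracy(preds: List[str], valid_labels: List[Set[str]]) -> float:
    # A prediction counts as correct if it names any object present in the image.
    return sum(p in labels for p, labels in zip(preds, valid_labels)) / len(preds)

# Hypothetical image containing both a desk and a laptop, annotated only as "desk".
preds = ["laptop", "dog", "cat"]
dataset_labels = ["desk", "dog", "cat"]                 # one label per image
valid_labels = [{"desk", "laptop"}, {"dog"}, {"cat"}]   # all objects actually present

print(single_label_accuracy(preds, dataset_labels))  # 0.67 -- the laptop call is "wrong"
print(multi_label_accuracy(preds, valid_labels))     # 1.0  -- but it matches a real object
```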

When the research team corrected these errors, the benchmark results from the test sets became unstable.

This is because higher-capacity models tend to fit the labeling errors more closely than the study's smaller models, so correcting the labels penalizes them more and, in some cases, changes how the models rank against one another.
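
The ranking effect can be seen in a toy example with entirely invented numbers: a model that has fit the noisy labels looks best against the original annotations but falls behind a smaller model once the labels are corrected. Nothing below comes from the study's data.

```python
# Toy illustration (invented numbers) of a benchmark ranking flipping after
# test labels are corrected.
import numpy as np

original_labels  = np.array([0, 1, 1, 0, 1, 0, 1, 0])   # as shipped, with mistakes
corrected_labels = np.array([0, 1, 0, 0, 1, 1, 1, 0])   # after human re-review

big_model_preds   = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # echoes the noisy labels
small_model_preds = np.array([0, 1, 0, 0, 1, 1, 1, 1])  # closer to the corrected truth

def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    return float((preds == labels).mean())

print("original :", accuracy(big_model_preds, original_labels),
      accuracy(small_model_preds, original_labels))   # 1.0 vs 0.625 -- big model leads
print("corrected:", accuracy(big_model_preds, corrected_labels),
      accuracy(small_model_preds, corrected_labels))  # 0.75 vs 0.875 -- ranking flips
```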

Google's QuickDraw, a collection of about 50 million drawings submitted by players of the game "Quick, Draw!," exhibited the same problem on a much larger scale.

Researchers estimated that 10.12 percent of QuickDraw's labels were wrong. Other test sets showed mislabeled sentiments, such as an Amazon product review described as unfavorable when it is actually positive.

Dataset Labeling Errors: How Could They Truly Impact the Real World?

A notable and perhaps infamous example of labeling errors' real-world impact comes from Google's attempt to help curb the COVID-19 pandemic.

Last year, AlgorithmWatch reported that the search engine giant's automated image-labeling AI sparked controversy after an experiment suggested its results were racially biased.

In the experiment, Google Cloud Vision automatically labeled an image of a dark-skinned individual holding a handheld thermometer as a "gun." In contrast, a similar image with a light-skinned individual was labeled as an "electronic device."

Written by Lee Mercado
