Google has faced an embarrassing setback after its Gemini AI exhibited bias, prompting Senior Vice President Prabhakar Raghavan to promise improvements, even as experts remain skeptical that AI bias can ever be fully eradicated.

"We cannot be certain that there is no inherent bias within the AI models since it is often latent until some use case exposes it," Joseph Regensburger, VP of Research, Immuta, told Tech Times in an interview.

He explained that, for the most part, commercial LLMs are trained on data from the open internet. For instance, GPT-3's largest single source of training data is Common Crawl, which is essentially raw web page data. It's used because it's a vast and varied dataset, but it isn't vetted for accuracy or for any specific use case.

That said, he doesn't believe vetting alone would help. In his book Demystifying AI for the Enterprise, Bob Rogers, CEO of data science company Oii.ai, wrote about a credit card algorithm that gave women lower ratings simply because women historically had lower incomes.
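Rogers' book doesn't include code, but the mechanism is easy to reproduce on synthetic data: a model that never sees gender can still score women lower if it learns from a history in which women earned less. A minimal sketch, with made-up numbers:

```python
# Synthetic illustration (not a real scoring system): historical income gaps
# can translate into lower credit scores for women even when gender is never
# an input to the model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Historical data: group 1 (women) earns less on average than group 0 (men).
group = rng.integers(0, 2, n)
income_k = rng.normal(60 - 15 * group, 10, n)          # income in $1,000s

# Past repayment outcomes were themselves shaped by income.
repaid = (income_k + rng.normal(0, 10, n) > 45).astype(int)

# The model sees only income, never gender.
model = LogisticRegression().fit(income_k.reshape(-1, 1), repaid)
scores = model.predict_proba(income_k.reshape(-1, 1))[:, 1]

print(f"mean score, men:   {scores[group == 0].mean():.2f}")
print(f"mean score, women: {scores[group == 1].mean():.2f}")
# The gap persists because income acts as a proxy for the excluded attribute.
```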

"Biases can be subtle and deeply ingrained in the language and structures of the data sources, making them challenging to identify and eliminate,"Christopher Bouzy, CEO and founder of Spoutible, who created  Bot Sentinel, a Twitter analytics service that tracks disinformation, told Tech Times in an interview. 

Bouzy says companies and research institutions developing large language models (LLMs) usually go through procedures to carefully assess and organize the data used to train their models. This process combines automated tools and human reviewers and may include removing harmful or biased content.
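Bouzy didn't describe a specific pipeline, but that filtering step often looks roughly like the sketch below, which screens raw documents with a crude automated check and routes borderline cases to human reviewers. The blocklist, threshold, and scoring rule are placeholders; production systems use trained classifiers rather than keyword lists.

```python
# Rough sketch of an automated pre-training content filter with a human
# review queue. Blocklist and threshold are hypothetical placeholders.
BLOCKLIST = {"flagged_term_1", "flagged_term_2"}
REVIEW_THRESHOLD = 1          # hit count that triggers human review

def score_document(text: str) -> int:
    """Count how many flagged terms appear in a document."""
    words = text.lower().split()
    return sum(1 for w in words if w in BLOCKLIST)

def triage(documents: list[str]):
    accepted, review_queue, rejected = [], [], []
    for doc in documents:
        hits = score_document(doc)
        if hits == 0:
            accepted.append(doc)          # clean enough to keep automatically
        elif hits <= REVIEW_THRESHOLD:
            review_queue.append(doc)      # borderline: send to human reviewers
        else:
            rejected.append(doc)          # clearly harmful: drop outright
    return accepted, review_queue, rejected

docs = ["a perfectly ordinary web page", "a page containing flagged_term_1"]
kept, to_review, dropped = triage(docs)
print(len(kept), len(to_review), len(dropped))
```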

Crunching Data

Many experts agree that bias is unlikely ever to be completely eliminated, but they believe organizations can minimize its impact by acknowledging that fact and actively monitoring for it.

One way Bouzy suggested is for companies to actively seek out and use diverse datasets. "This involves not just a variety of sources but also ensuring that minority and marginalized voices are represented in the data," explained Bouzy.

He also said AI models should not be static and need to be continuously monitored for biases as they interact with the real world and updated accordingly. This process of identifying and correcting biases that emerge could involve both automated systems and human oversight. 
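Bouzy didn't spell out what that automated monitoring looks like in practice. One common approach, sketched below with placeholder names and a stand-in query_model function, is a counterfactual check: send the model otherwise identical prompts that differ only in a demographic detail and flag divergent answers for human review.

```python
# Sketch of an automated bias monitor: it sends demographically swapped
# versions of the same prompt to a model and flags divergent answers for
# human review. `query_model` is a placeholder for a real LLM call.
from itertools import combinations

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your model of choice")

TEMPLATE = "Should {name} be approved for this loan? Answer yes or no."
NAMES = ["John", "Jamal", "Maria", "Mei"]   # illustrative name swaps

def check_prompt_bias():
    answers = {name: query_model(TEMPLATE.format(name=name)).strip().lower()
               for name in NAMES}
    flagged = [(a, b) for a, b in combinations(NAMES, 2)
               if answers[a] != answers[b]]
    if flagged:
        print("Divergent answers, route to human review:", flagged)
    return flagged
```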

Ed Baum, COO of hiring specialists TalentGenius, offers another perspective. Based on his experience, he believes that while unbiased training data is good to have, what matters even more is what comes out of the AI.

While creating their AI-powered tools, Baum's team found various anomalies in the way their AI interpreted and presented results. They were able to rein it in by tweaking the prompts driving their AI.

"Today, if you want an LLM like OpenAI to give a balanced opinion on whether a job is a good fit for someone or not, you need to prompt it to be biased towards giving a negative opinion if it fits since OpenAI has been trained to be relentlessly positive," Baum told Tech Times in an interview.

Holding AI Responsible

No one would have known Google was licensing AI training data from Reddit had it not been for Reddit's IPO documents. Could transparency about AI training data by tech giants, akin to ingredient labels on food products, be beneficial? Those in favor argue that since the data is made up of our own contributions, we have a right to know how it is being used. Disclosing the training data would also give everyone, particularly those with the technical chops, an opportunity to scrutinize it, identify potential biases or errors, and perhaps even suggest improvements.

Bouzy said that while it's a compelling idea that could increase transparency and accountability in AI development, the approach faces several challenges. For starters, these massive troves of training data often encompass a significant portion of the public internet, which includes copyrighted materials, private data, and information that cannot be easily disclosed due to legal and privacy concerns. He also fears that revealing the data could do more harm than good, since it would expose these AIs to more targeted manipulation.

Rogers suggested the best way forward is personal accountability. "Whether or not companies make data public, there should always be a human responsible for the outcome," Rogers told Tech Times in an interview.

Using the example of cars, Rogers said that if you crash into someone, the driver is responsible. If the brakes fail, somebody should still be held to account, whether that's the car manufacturer or the last mechanic who worked on them. He argued that just as there's a chain that leads back to a human with cars, so it should be with training AI.

"There should be people ultimately accountable for biases or issues that arise from training these models," said Rogers. "Hold them responsible when biases creep in. Applaud them when biases and other issues are weeded out."

About the author: Mayank Sharma is a technology writer with two decades of experience in breaking down complex technology and getting behind the news to help his readers get to grips with the latest buzzwords and industry milestones. He has had bylines on NewsForge, Linux.com, IBM developerWorks, Linux User & Developer magazine, Linux Voice magazine, Linux Magazine, and HackSpace magazine. In addition to Tech Times, his current roster of publications includes TechRadar Pro and Linux Format magazine. Follow him at https://twitter.com/geekybodhi
