In the quest to create increasingly sophisticated large language models, AI companies are encountering a daunting obstacle: the depletion of accessible internet data. 

The Wall Street Journal reports that these companies have nearly exhausted the available resources of the open internet, signaling an impending scarcity of data crucial for AI model training.

Who would have thought that they would run out of data someday?

Seeking Alternative Data Sources

AI Companies Are Running Out of Internet Data for Model Training
(Photo : Carlos Muza from Unsplash)
Despite the billions of dollars that AI firms pour into model training, the industry cannot ignore the elephant in the room: the open internet's supply of usable data is running dry.

With traditional internet data reserves dwindling, AI firms are exploring alternative avenues for acquiring training data. Some are turning to publicly available video transcripts and the generation of synthetic data by AI algorithms. However, this approach presents its own set of challenges, including a higher risk of AI model hallucinations due to reliance on artificially generated data.


Concerns Surrounding Synthetic Data

According to FirstPost, the reliance on synthetic data has sparked concern among experts. Chief among the worries is a phenomenon dubbed "digital inbreeding," in which models trained on AI-generated data grow unstable over successive generations, leading to degraded performance or outright failure.

Controversial Approaches to Data Training

In response to the data scarcity problem, AI giants are considering unconventional strategies for training their models.

ChatGPT maker OpenAI, for instance, is reportedly weighing the use of transcriptions from publicly available YouTube videos to train its GPT-5 model. Such approaches have already drawn criticism and may invite legal challenges from video content creators.

Addressing Data Scarcity in AI Model Training

(Photo : KIRILL KUDRYAVTSEV/AFP via Getty Images)
A photo taken on February 26, 2024, shows the logo of the ChatGPT application developed by US artificial intelligence research organization OpenAI on a smartphone screen (L) and the letters AI on a laptop screen in Frankfurt am Main, western Germany.

Despite the challenges, companies like OpenAI and Anthropic are actively working on enhancing synthetic data quality to address the data scarcity issue. While specific methodologies are still under wraps, these firms aim to develop synthetic data of superior quality to sustain AI model training.

Hope for Breakthroughs

Although concerns about data scarcity loom large, many experts remain optimistic about the potential for technological breakthroughs to mitigate these challenges. 

While predictions suggest that AI may exhaust its usable training data in the near future, significant advancements in AI research could offer solutions to alleviate this predicament.

Sustainable AI Development Practices

Amidst the race for larger and more advanced AI models, there's a growing realization of the environmental impact associated with their development. 

Some advocate for a shift in focus towards sustainable AI development practices, considering factors such as energy consumption and the environmental impact of rare-earth mineral mining for computing chips.

Back in November 2023, Tech Times reported that AI firms were on the verge of running out of high-quality training data. Months later, the topic has resurfaced, and data depletion now looks like yet another problem the industry must overcome.


Joseph Henry
