The Guardian, a prominent British news outlet, has recently taken decisive action to block OpenAI from accessing its content for AI products like ChatGPT. 

This move responds to mounting concerns that OpenAI and other AI developers are using unlicensed content to train their AI tools.

The Guardian's Stand Against AI Content Scraping

The Guardian has confirmed that it has effectively prevented OpenAI from deploying software designed to harvest its content. 

This decision marks a significant development in the ongoing debate surrounding AI technology, particularly generative AI, which produces text, images, and audio based on human prompts.

A spokesperson for Guardian News & Media emphasized the company's stance, stating, "The scraping of intellectual property from the Guardian's website for commercial purposes is, and has always been, contrary to our terms of service."

A critical aspect of these AI tools is their training process, which involves feeding them vast amounts of data from the open internet, including news articles. This data helps the AI models predict the most likely next word or phrase in response to a user's prompt.

Read Also: AI-Influencer Models Can Attract Consumers as Effectively as Humans, New Study Finds

More Sites Blocking OpenAI

OpenAI, which has not disclosed the data sources behind ChatGPT's training, announced in August that it would allow website operators to block its web crawler, GPTBot, from accessing their content. 

However, this move does not retroactively remove material from existing training datasets. Several publishers and websites have already blocked OpenAI's GPTBot crawler, including industry giants like CNN, Reuters, the Washington Post, Bloomberg, the New York Times, and others.
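In practice, these blocks rely on the standard robots.txt mechanism, which OpenAI's crawler is documented to honor. A minimal sketch of what such a rule might look like on a publisher's site (the CCBot entry for Common Crawl's crawler is included for illustration) is:

```text
# Ask OpenAI's GPTBot not to crawl any page on the site
User-agent: GPTBot
Disallow: /

# Common Crawl's CCBot, whose data is used by some AI services,
# can be blocked the same way
User-agent: CCBot
Disallow: /
```

Note that robots.txt is a voluntary convention: it signals a site's wishes to well-behaved crawlers but does not technically enforce the block.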

It is also important to note that OpenAI did not pay for the data it obtained from the internet. Not only did its crawlers collect content without permission, but the websites that produced that content were never compensated. This is especially notable given that OpenAI is valued at $30 billion.

The Broader Impact on AI and Newsrooms

Axios reported that OpenAI's introduction of the GPTBot crawler triggered a wave of blocks from nearly 20% of the top 1000 websites in the world, including The New York Times, Reuters, and CNN. 

Amazon, Quora, and Indeed are among the significant sites leading the charge in blocking AI bots, while Common Crawl's CCBot, another data-gathering crawler used by some AI services, faces blocks on 6.77% of the top 1,000 sites.

Google and other web giants view the work of their data crawlers as fair use. However, publishers and intellectual property holders have long contested this, resulting in multiple lawsuits. 

Commercialization of AI

The commercialization of AI, exemplified by OpenAI's expected revenue exceeding $1 billion in the coming year, has heightened tensions. 

News companies are navigating the fine line between embracing AI for potential profit and dealing with the ethical questions surrounding AI's role in newsrooms, especially when public trust in news organizations is at a historic low.

Stay posted here at Tech Times.

Related Article: Google's SynthID Tool Can Spot, Watermark AI-Generated Images Online

 

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.