GPTBot blocked from accessing the content of some websites


Media companies prevent GPTBot from accessing their content


The GPTBot web crawler from OpenAI is no longer permitted to access the content of several media organizations, including the New York Times, CNN, Reuters, and the Australian Broadcasting Corporation.


OpenAI, the company behind the AI chatbot ChatGPT, uses the web crawler to gather data from websites in order to improve its AI models.


The Guardian found that several major news websites, including CNN, Reuters, the Chicago Tribune, the Australian Broadcasting Corporation, the Canberra Times, and the Newcastle Herald, had blocked the crawler from accessing their sites. The New York Times has also blocked GPTBot.


Large language models such as ChatGPT need vast amounts of data to train their systems and to answer user queries in ways that mimic human language patterns, yet the companies behind them often do not disclose whether copyrighted material is included in their training datasets.


The GPTBot block appears in publishers' robots.txt files, which tell web crawlers which URLs they are allowed to access.
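

For reference, OpenAI's documentation shows that GPTBot can be refused access to an entire site with just two lines in robots.txt; the exact paths a publisher chooses to disallow will vary:

    User-agent: GPTBot
    Disallow: /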


"Allowing GPTBot access to your site helps increase the accuracy of AI models and improves their overall capabilities and safety," OpenAI wrote in a blog post that included instructions on how to disable its web crawler.


According to the Guardian, the media outlets mentioned above blocked OpenAI's web crawler in August. Several media organizations have also blocked CCBot, the crawler for the Common Crawl open web data repository, which is likewise used by AI projects.


CNN confirmed to the Guardian that it had recently blocked GPTBot across its titles, but did not say what further steps it intended to take in response to the use of its content in AI systems.


According to a Reuters representative, "Reuters regularly reviews its robots.txt file and the site's terms and conditions, and given that intellectual property is the lifeblood of our business, we must protect our copyrights."


A New York Times spokeswoman said the paper had recently updated its terms of service to make the ban on using its content for AI training and development more explicit.


News organizations around the world are examining the use of artificial intelligence in news gathering and working out how to deal with companies that collect their content to train these systems.


In early August, media groups including Agence France-Presse signed an open letter calling for AI regulation. The letter demanded transparency about the composition of all training datasets used to build AI models, as well as prior consent for the use of copyrighted material.


The GPTBot web crawler has also been blocked by several major websites, including Amazon, Quora, Shutterstock, and wikiHow, according to research by OriginalityAI, a company that detects AI-generated content.
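

As a rough illustration of how such robots.txt surveys can be carried out, the sketch below uses Python's standard urllib.robotparser module to check whether a site's robots.txt disallows GPTBot from its root path; the domain name is a placeholder, and this is not necessarily the method OriginalityAI used:

    import urllib.robotparser

    def blocks_gptbot(domain: str) -> bool:
        """Return True if the domain's robots.txt forbids GPTBot from fetching the root path."""
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"https://{domain}/robots.txt")
        parser.read()  # download and parse the robots.txt file
        return not parser.can_fetch("GPTBot", f"https://{domain}/")

    # Placeholder domain for illustration only
    print(blocks_gptbot("example.com"))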
