The Guardian has joined a growing list of websites that have blocked OpenAI’s GPTBot web crawler from scraping their content.
The British daily newspaper announced its decision on its website last Friday, joining CNN, Reuters, the Washington Post, Bloomberg, and the New York Times in blocking GPTBot. While it didn’t give a full explanation of the reasons behind the decision, it did mention some common industry concerns.
It cited the ongoing copyright lawsuits brought by authors like Sarah Silverman and the calls from British book publishers to protect their work from being exploited by AI.
The Guardian acknowledged that generative AI tools like ChatGPT are doing some impressive things, but some of the semantics in the announcement reveal a less enthusiastic view of how AI companies are going about their business.
The announcement noted that ChatGPT was trained on vast amounts of data “culled” from the internet and that it acted to stop the company from using software that “harvests” its data.
It hasn’t come right out and shouted ‘Stop thief!’ but the message is pretty clear.
A spokesperson for the publisher of the Guardian and Observer said, “The scraping of intellectual property from the Guardian’s website for commercial purposes is, and has always been, contrary to our terms of service.”
In a sign that it may be open to allowing data scraping in the future, the spokesperson said, “The Guardian’s commercial licensing team has many mutually beneficial commercial relationships with developers around the world, and looks forward to building further such relationships in the future.”
Interestingly, The Guardian also noted concerns over the potential that generative AI has for producing disinformation. It didn’t explain how this concern related to its decision to block GPTBot, but as a news publisher, this is an obvious area of concern.
Ethical and copyright issues aside, it may also be that the Guardian’s web servers have been experiencing similar load challenges to those that X faced.
Earlier this year, Elon Musk said that a significant portion of the load on X’s servers came from a multitude of AI scraper bots. He hasn’t blocked them outright, and he also intends to use public tweets to train his xAI model.
When an AI bot visits a website and encounters a robots.txt file “blocking” it, it refrains from scraping the site out of courtesy, not because it is technically unable to.
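As an illustration, blocking a crawler this way amounts to a couple of lines in the site’s robots.txt file. The sketch below follows OpenAI’s documented `GPTBot` user-agent token; the exact contents of the Guardian’s own robots.txt are not quoted here.

```text
# robots.txt — served at the site root, e.g. https://example.com/robots.txt
# Disallow OpenAI's GPTBot crawler from every path on the site.
User-agent: GPTBot
Disallow: /
```

Compliance with these directives is voluntary under the Robots Exclusion Protocol, which is exactly the “courtesy” at issue.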
Once the copyright issues are settled in law, I wonder how long courtesy will continue to trump AI’s insatiable appetite for data.