OpenAI disclosed its GPTBot web crawler earlier this month, and since then a growing number of the internet’s biggest sites have moved to block the scraper from accessing their content.
AI content detector Originality.ai has been keeping tabs on the top 1,000 websites to see which of them have blocked web scrapers like GPTBot.
Blocking GPTBot from scraping a website is easily done by adding two lines to the website’s robots.txt file. And more and more sites are beginning to do just that.
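Those two lines tell the crawler, identified by its user agent, that the entire site is off-limits. Per OpenAI’s published guidance, a site’s robots.txt entry looks like this:

```
User-agent: GPTBot
Disallow: /
```

Sites that want to be more selective can instead disallow only specific directories, though the sites tracked in the Originality.ai report are largely blocking GPTBot outright.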
The figures in the Originality.ai report show that a week ago, 91 of those sites were blocking GPTBot. Just over a week later, that figure has jumped to 111, an increase of 22%.
An increase of 20 sites doesn’t sound like much, but when you consider the amount of data these websites hold and continue to produce, it’s significant. The top 5 sites that now block GPTBot are:
The amount of data that has become off-limits for OpenAI to use to train its models from just those five websites is considerable.
If you look at the complete list of 1,000 sites it’s interesting to note which have blocked GPTBot, and which have decided not to, for now.
While Shutterstock has blocked GPTBot, other stock photography sites like iStock haven’t. When it comes to stock photography, you’ve got to wonder whether that particular AI-scraping horse bolted long ago.
It makes more sense that news companies like The New York Times and CNN have blocked the bot. But other top news sites like Forbes and The Guardian have so far chosen not to block the scraper.
OpenAI has said that allowing GPTBot to scrape sites “can help AI models become more accurate and improve their general capabilities and safety.” The company also said that its bot doesn’t peek behind paywalls or look at sites that collect personal information.
It may be that sites like YouTube, X, and BBC take OpenAI at its word and see the potential value in allowing AI bots to use their data in a responsible way. If they decided to use ChatGPT in their business they would want it to work as well as possible.
These companies may also realize the potential traffic that they could miss out on if they block the biggest AI scraper. Imagine what would happen to their traffic if websites decided to block Google’s bot out of principle.
It’s also interesting to note that none of the sites on the list have blocked Anthropic’s bot. Does the industry in general feel that OpenAI will treat its data differently than Anthropic will?
You’d think that if a company made a decision to block AI scrapers it would block all of them, and not just one.
OpenAI is involved in some landmark AI copyright lawsuits that could reshape this list. It will be interesting to see which big sites decide to block the bot, and whether some reverse their decisions.