OpenAI discreetly unveiled GPTBot, a dedicated web crawler for collecting training data.
Edit: It’s currently unclear whether GPTBot is the same (or an updated) bot OpenAI used to scrape data alongside Common Crawl in 2018/2019, or whether it’s an entirely new version. Either way, this is the first time the company has published details on how to prevent it from scraping website data.
OpenAI has published information about GPTBot on its website here, including details about how website administrators can prevent it from crawling and scraping their websites.
To block GPTBot from crawling a website, administrators can adjust the settings in the robots.txt file. This file, a standard tool in website management that dates back some 30 years, indicates which areas of the website are off-limits to crawlers.
To briefly distinguish crawling from scraping: crawlers trawl through website content, while scrapers extract the data. It’s a two-part process, although the two are typically referred to together simply as “scraping.”
OpenAI also revealed the IP address block used by GPTBot, available here, providing another option for inhibiting the bot’s activity.
Some speculate that this gives OpenAI another layer of protection against allegations of unpermitted data usage.
OpenAI and other AI developers are being snowed under by lawsuits alleging they used people’s data without permission.
Now, website administrators must proactively opt out of scraping, placing the onus on them to stop their sites’ data from ending up in training datasets.
It’s worth noting that GPTBot isn’t the only tool of its kind. OpenAI has used other datasets to train its models, including the Common Crawl dataset.
Like GPTBot, Common Crawl’s CCBot crawler can also be controlled by adding specific lines to the robots.txt file, as shown below.
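For instance, a minimal robots.txt entry that blocks CCBot from an entire site (it honors the same convention, under its own user agent token) looks like this:

User-agent: CCBot
Disallow: /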
How to prevent ChatGPT from crawling your site’s data
OpenAI will use GPTBot for targeted data scraping, but it can be stopped from scraping entire websites or specific web pages. Read OpenAI’s full documentation here.
OpenAI published the following information:
GPTBot is identified by its user agent token “GPTBot.” The complete user-agent string associated with it is: “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”.
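That token can also be used to refuse GPTBot’s requests at the web-server level rather than via robots.txt. A minimal sketch, assuming an nginx server (this is not part of OpenAI’s documentation; the block goes inside a server or location context):

# Return 403 to any request whose User-Agent contains "GPTBot"
if ($http_user_agent ~* "GPTBot") {
    return 403;
}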
By editing the robots.txt file, GPTBot can be blocked from accessing an entire website or selected portions.
To inhibit GPTBot from accessing a site, administrators can edit their website’s robots.txt file as follows:
User-agent: GPTBot
Disallow: /
Specific parts of a website can be allowed or disallowed as follows:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
OpenAI has also made public the IP ranges used by GPTBot, available here. Although only one range is currently listed, more may be added in due course.
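Those ranges can also be refused at the server or firewall level. A minimal sketch, again assuming nginx and using a placeholder range rather than OpenAI’s actual one:

# Placeholder CIDR shown; substitute the range(s) from OpenAI's published list
deny 192.0.2.0/24;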