OpenAI inconspicuously unveils its own data scraper, GPTBot

OpenAI discreetly unveiled GPTBot, a dedicated web scraper for collecting training data.

Edit: It’s currently unclear whether GPTBot is the same bot OpenAI used to scrape data alongside Common Crawl in 2018/2019 or a new or evolved version. Either way, this is the first time the company has published instructions on how to prevent it from scraping website data.

OpenAI has published information about GPTBot on its website here, including details about how website administrators can prevent it from crawling and scraping their websites.

To block GPTBot from crawling a website, administrators can adjust the settings in the robots.txt file. This file, a standard tool in website management that dates back some 30 years, indicates which areas of the website are off-limits to crawlers.

To briefly distinguish crawling from scraping: crawlers traverse website content, while scrapers extract the data. It’s a two-part process, although the two are typically lumped together as “scraping.”

OpenAI also revealed the IP address block used by GPTBot, available here, providing another option for inhibiting the bot’s activity.

Some speculate that this gives OpenAI another layer of protection against allegations of unpermitted data usage.

OpenAI and other AI developers are being snowed under by lawsuits relating to how they used people’s data without their permission.

Now, the onus is on website administrators to proactively keep their sites’ data out of training datasets.

It’s worth noting that GPTBot isn’t the only tool of its kind. OpenAI has used other datasets to train its models, including the Common Crawl dataset.

Like GPTBot, CCBot, the crawler behind Common Crawl, can also be controlled by adding specific lines to the robots.txt file.

How to prevent ChatGPT from crawling your site’s data

OpenAI will use GPTBot for targeted data scraping, but it can be stopped from scraping entire websites or specific web pages. Read OpenAI’s full documentation here.

OpenAI published the following information:

GPTBot is identified by its user agent token “GPTBot.” The complete user-agent string associated with it is: “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”.
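As a rough illustration of how the user agent token could be used server-side, the sketch below checks a request’s User-Agent header for the “GPTBot” token. The function name is_gptbot is hypothetical, and a User-Agent header is trivially spoofable, so this is a convenience check for logging or filtering, not proof of a request’s origin.

```python
# The full user-agent string published by OpenAI, as quoted above.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the header contains the GPTBot token (case-insensitive).

    Hypothetical helper: it only inspects the string a client chose to send.
    """
    return "gptbot" in user_agent.lower()


if __name__ == "__main__":
    print(is_gptbot(GPTBOT_UA))
    print(is_gptbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A check like this might sit in request logging to see how often the bot visits before deciding whether to block it.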

Administrators can block GPTBot from an entire website, or from selected portions of it, by editing the robots.txt file.

To inhibit GPTBot from accessing a site, administrators can edit their website’s robots.txt file as follows:

User-agent: GPTBot
Disallow: /

Parts of websites can be allowed/disallowed by the following:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
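One way to sanity-check rules like these before deploying them is to run them through Python’s standard-library robots.txt parser. The sketch below is only an approximation: gptbot_can_fetch is a hypothetical helper, example.com is a placeholder domain, and the standard-library parser may not interpret every rule exactly as GPTBot itself does.

```python
from urllib.robotparser import RobotFileParser

# The partial-access rules from above, written as one contiguous group.
PARTIAL_RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

# The block-everything rules from above.
BLOCK_ALL_RULES = """\
User-agent: GPTBot
Disallow: /
"""


def gptbot_can_fetch(robots_txt: str, url: str, agent: str = "GPTBot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/directory-1/page.html", "/directory-2/page.html"):
        url = "https://example.com" + path  # placeholder domain
        print(path, gptbot_can_fetch(PARTIAL_RULES, url))
    print("/", gptbot_can_fetch(BLOCK_ALL_RULES, "https://example.com/"))
```

Keep the User-agent line and its Allow/Disallow lines together without blank lines between them; some parsers, including this one, treat a blank line as the end of a rule group.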

OpenAI has also published the IP ranges used by GPTBot, available here. Although only one range has been listed, more may be added in due course.
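For administrators who prefer filtering by address, the sketch below shows how a request’s IP could be checked against a list of CIDR ranges. The range used here is a placeholder from the reserved documentation block 192.0.2.0/24, not OpenAI’s actual range, which should be copied from OpenAI’s published list; the function name in_blocked_ranges is likewise hypothetical.

```python
import ipaddress

# PLACEHOLDER ONLY: 192.0.2.0/24 is reserved for documentation (RFC 5737).
# Replace with the range(s) OpenAI publishes for GPTBot.
PLACEHOLDER_RANGES = ["192.0.2.0/24"]


def in_blocked_ranges(ip: str, ranges=PLACEHOLDER_RANGES) -> bool:
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    print(in_blocked_ranges("192.0.2.17"))    # inside the placeholder range
    print(in_blocked_ranges("198.51.100.1"))  # outside it
```

In practice this kind of filtering is usually done at the firewall or web server level; since more ranges may be added, any hard-coded list would need periodic updating.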


Sam Jeans
