Perplexity AI has found itself at the center of a firestorm over its data collection practices.
Perplexity essentially fuses a search engine with generative AI, returning AI-generated answers to the user’s search query.
The processes enabling this likely involve scraping content from numerous websites, including those that explicitly prohibit it.
The scandal erupted on June 11 when Forbes reported that Perplexity had lifted an entire article from its site, complete with custom illustrations, and repurposed it with only minimal attribution.
Not long after, WIRED conducted an investigation that uncovered evidence of Perplexity scraping content from websites that forbid automated data collection.
A website can request that its content not be scraped by placing a file called “robots.txt” on its server.
This file implements the Robots Exclusion Protocol, a convention for communicating with web crawlers and other automated bots: a simple text file that specifies which pages or sections of the site should not be accessed or scraped.
The robots.txt file has been a widely respected convention since the early days of the web. It helps website owners control their content and prevent unauthorized data collection.
Although not legally binding, it has long been considered best practice for web crawlers to follow the instructions outlined in a website’s robots.txt file.
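The convention is simple enough that Python’s standard library ships a parser for it. The sketch below uses a hypothetical robots.txt (the bot name “ExampleBot” and the paths are invented for illustration) to show how a well-behaved crawler checks permission before fetching a URL:

```python
from urllib import robotparser

# Hypothetical robots.txt: blocks a crawler named "ExampleBot"
# from the /private/ section, while allowing all other bots everywhere.
robots_txt = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler consults the parser before each request.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/news/story"))      # True
print(parser.can_fetch("OtherBot", "https://example.com/private/report"))    # True
```

Nothing enforces this check: a scraper that simply skips the `can_fetch` call, or identifies itself under a different user-agent string, sails past the file entirely, which is precisely the behavior at issue in the Perplexity reports.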
Jason Kint, CEO of Digital Content Next, a trade group representing online publishers, minced no words in his assessment of Perplexity’s web scraping processes.
“By default, AI companies should assume they have no right to take and reuse publishers’ content without permission,” he said.
“If Perplexity is skirting terms of service or robots.txt, the red alarms should be going off that something improper is going on.”
Amazon investigates
These revelations have prompted Amazon Web Services (AWS), which hosts a server implicated in Perplexity’s alleged improper scraping, to launch an investigation.
AWS strictly prohibits customers from engaging in abusive or illegal activities that violate its terms of service.
Perplexity CEO Aravind Srinivas initially brushed off the concerns, asserting they reflected “a deep and fundamental misunderstanding” of the company’s operations and the internet at large.
However, in a subsequent interview with Fast Company, he conceded that Perplexity relied on an unnamed third-party vendor for web crawling and indexing, suggesting they were to blame for any robots.txt violations.
Srinivas declined to identify the company, citing a non-disclosure agreement.
For the moment, Perplexity appears determined to weather the storm, with a spokesperson downplaying the AWS probe as “standard procedure” and indicating the company has made no changes to its operations.
However, the startup’s defiant stance may prove untenable as the groundswell of concern over AI’s data practices continues to build.