A group of authors filed a class-action lawsuit against Anthropic in a California court on Monday. The authors claim Anthropic built its business by “stealing hundreds of thousands of copyrighted books.”
The three authors, Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, claim that their books were part of the dataset that Anthropic used to train its family of Claude models. In their suit, they allege that Anthropic was guilty of “downloading and copying hundreds of thousands of copyrighted books taken from pirated and illegal websites.”
The authors questioned Anthropic’s claim to be a public benefit company, saying, “It is no exaggeration to say that Anthropic’s model seeks to profit from strip-mining the human expression and ingenuity behind each one of those works.”
The Pile
The books in question are part of a controversial dataset called Books3, which previously formed part of a larger dataset called The Pile. It’s generally accepted, but not admitted, that just about every one of the big LLM developers trained its models on The Pile.
The Pile consists of around 825GB of academic papers, books, websites, technical documents, and more. One of The Pile’s architects is an independent developer named Shawn Presser. Presser created the Books3 dataset in 2020 and added it to The Pile.
Books3 contains 196,640 books in plain text format by famous authors like Stephen King as well as the authors that brought this lawsuit. It’s believed that Presser used Bibliotik, a notorious torrent tracker used by an invite-only community of book pirates, as the source for Books3.
Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.
Now you do. Now everyone does.
Presenting “books3”, aka “all of bibliotik”
– 196,640 books
– in plain .txt
– reliable, direct download, for years: https://t.co/KKSrhEAnrD
thread 👇 pic.twitter.com/m6bdpHfYJx
— Shawn Presser (@theshawwn) October 25, 2020
When the nonprofit EleutherAI hosted The Pile and made it publicly available online, it explained its reasons for including the pirated books. EleutherAI said, “We included Bibliotik because books are invaluable for long-range context modeling research and coherent storytelling.”
In August 2023, Books3 was removed from the “most official” copy of The Pile, but by that time it had been used by pretty much all the big names in AI model development.
In July 2024, Anthropic publicly acknowledged that it used The Pile to train its Claude models. While Anthropic has yet to respond to the lawsuit, it will likely fall back on the same “fair use” defense that OpenAI and others facing similar lawsuits are using.
The real damage
Besides the copyright issue, the lawsuit reveals the genuine fear that authors have of AI taking over their source of income.
The suit alleges that “Anthropic, in taking authors’ works without compensation, has deprived authors of book sales and licensing revenues.” That may be hard to prove. Claude will describe the book “The Feather Thief” by Kirk Wallace Johnson, but it declines to reproduce even a single page.
I suspect Claude is lying when it responds with “I apologize, but I don’t have access to the actual text of ‘The Feather Thief’ or its first page,” because it goes on to describe what takes place on page 1. If you want to read the book, you’ll need to buy it or go to a library.
Even so, the authors say that “Anthropic’s Claude and other LLMs like it seriously threaten the livelihood” of authors. They say that writing work is “starting to dry up as a result of generative AI systems trained on those writers’ works, without compensation, to begin with.”
As evidence of this, the suit relates how a man named Tim Boucher “wrote” 97 books using Claude and ChatGPT in less than a year, and sold them at prices from $1.99 to $5.99.
The lawsuit calls for a jury trial and unspecified damages. It will be interesting to see whether the jurors value copyright law more than the utility of AI models like Claude.