AI lawsuits are coming thick and fast as US comedian and author Sarah Silverman and authors Christopher Golden and Richard Kadrey file lawsuits against OpenAI and Meta.
The trio alleges copyright infringement, stating that their work was unlawfully used for training ChatGPT and LLaMA, Meta’s open-source large language model (LLM).
ChatGPT relies on the analysis of a colossal amount of data sourced from the internet – it’s this data which teaches it how to handle natural language. Many questions surround the origin of this training data and the methods used to retrieve it, and suspicions are deepening now that creators are discovering their work may be contained within that training data.
In this latest lawsuit, OpenAI and Meta are accused of using the plaintiffs’ copyrighted books as training data without their consent.
The lawsuits suggest that the materials were sourced from “shadow library” websites. Shadow libraries, such as Bibliotik, Library Genesis, and Z-Library, host large quantities of illegally copied material. They are similar to torrents – they’re tough to prevent and control.
The lawsuit against OpenAI alleges that ChatGPT accurately summarized 3 books when prompted: Silverman’s “The Bedwetter,” Golden’s “Ararat,” and Kadrey’s “Sandman Slim.” While the AI could have learned about such books from Wikipedia summaries and similar sources, that wouldn’t explain the level of detail contained in the summaries.
The lawsuit against Meta names several works by Kadrey and Golden, plus “The Bedwetter,” referring to a Meta paper indicating the use of material from shadow libraries, which the lawsuit labels as “blatantly illegal.”
Meta’s paper says, “We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.”
Joseph Saveri and Matthew Butterick, lawyers representing the trio, have reported increasing concerns about ChatGPT’s unsettling ability to mimic copyrighted text.
Research has shown that GPT-4 almost definitely learned from copyrighted works.
However, this could be because those works are popular and widely circulated, or because they appear in school and university course readings.
In any case, that wouldn’t strictly excuse AI companies for using such texts in their training data.
AI-related lawsuits on the rise
AI has become the center of a storm of lawsuits, many of which are considered the first of their kind.
The same attorneys also represent US authors Mona Awad and Paul Tremblay in a separate but near-identical class action lawsuit against OpenAI.
And again, that same legal team, Saveri and Butterick, is representing 3 artists – Sarah Andersen, Kelly McKernan, and Karla Ortiz – in a lawsuit against image generators Stability AI and Midjourney.
That same law firm also brought a case against Microsoft and GitHub, alleging that their AI tool Copilot profited from the work of open-source programmers. It’s a very similar case – the plaintiffs argue that the tool was trained on “open-source” data that was extracted unlawfully.
Here, the defendants claim that Section 1202(b) of America’s Digital Millennium Copyright Act “is about identical ‘copies … of a work’ – not about stray snippets and adaptations.” AI companies may argue similarly against authors, suggesting that summaries of the authors’ work are insufficient to show that the books appear in the training data in full.
Either way, the allegations are piling up, indicating a trend of mounting legal pressures on AI companies.
AI regulations such as the EU AI Act are set to require companies to disclose information about copyrighted material in their training data. Whether that will have the desired effect remains to be seen.