In a statement of written evidence to the UK House of Lords, OpenAI stated that creating AI tools without using copyrighted material is “impossible.”
This comes amid an intensifying debate over how copyright law applies to AI, with writers, artists, and media outlets like the New York Times lodging lawsuits against OpenAI, Microsoft, Stability AI, Anthropic, Google, and Midjourney, to name but a few.
Large language models (LLMs) such as ChatGPT and image generators like Midjourney, which recently hit the headlines for creating a database of 16,000 artists for model training purposes, rely on extensive copyrighted data for their training.
In fact, copyrighted data forms the mainstay of AI training material because it’s abundant, covers a broad spectrum of human creativity, and is easily retrieved from the internet.
AI companies argue that training models on this data constitutes ‘fair use,’ but many others disagree.
In response to the House of Lords Communications and Digital Select Committee, OpenAI recently emphasized its need for copyrighted material to train LLMs like GPT-4.
OpenAI stated, “Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.”
The company further argued that restricting training materials to public domain sources would result in poor AI systems.
“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens,” OpenAI added.
You can read the entire written evidence submission here; it also touches on the future trajectory of AI, catastrophic risks (for which OpenAI points to its Frontier Model Forum and Preparedness team), and regulation.
The public reacts
Reactions to these statements have not exactly been sympathetic.
Dr. Gary Marcus, a prominent voice in the industry, said the admission essentially labels AI models as monetization devices for stolen copyrighted work.
Indeed, it reads almost like a Freudian slip on OpenAI’s part: an admission that its business model is unworkable unless the law is reshaped in its favor.
There’s a palpable sense of injustice with so few in the upper echelons of Silicon Valley benefitting from the work of so many.
OpenAI’s statement also asserts that the company understands ‘the needs’ of today’s ‘citizens,’ exposing a widening disconnect between big tech’s view of generative AI as a humanitarian, even philanthropic, project and the public’s fear that it is stealing their data and displacing their skills.
Dr. Marcus commented, “[AI companies]…should go back to the drawing board—and figure out how to build software that doesn’t have a plagiarism problem—rather than fleecing artists, writers, and other content providers.”
Now we know why Sam Altman went around the world last summer meeting world leaders: his company won’t make it big unless they can convince governments to give them one of the biggest handouts in history. https://t.co/Pcc8FchG1a
— Gary Marcus (@GaryMarcus) January 8, 2024
Lawsuits are racking up
This also comes amid several lawsuits against OpenAI, with notable authors including John Grisham, Jodi Picoult, and George R.R. Martin suing the company in September last year for alleged “systematic theft on a mass scale.”
Two esteemed journalists, Nicholas Gage and Nicholas Basbanes, lodged yet another complaint against OpenAI and Microsoft last week, adding to the growing number of legal challenges that AI companies face from both the writing and visual arts communities.
OpenAI has also responded to the New York Times lawsuit, stating that it believes the case is “without merit,” as outlined in the post below.
We build AI to empower people, including journalists.
Our position on the @nytimes lawsuit:
• Training is fair use, but we provide an opt-out
• “Regurgitation” is a rare bug we’re driving to zero
• The New York Times is not telling the full story https://t.co/S6fSaDsfKb
— OpenAI (@OpenAI) January 8, 2024
These developments raise questions about the legal liabilities AI companies could face this year and beyond. How will they adapt? Will the public’s growing resistance have any impact on the industry’s trajectory?
And how can you ethically train large-scale generative AI models? Are ethics even compatible with the technology’s current incarnation?
AI companies’ defenses are holding up so far, but the gap between AI developers’ conception of ‘fair use’ and how everyone else perceives it is widening.