In the frantic pursuit of AI training data, tech giants OpenAI, Google, and Meta have reportedly bypassed corporate policies, altered their rules, and discussed circumventing copyright law.
A New York Times investigation reveals the lengths these companies have gone to harvest online information to feed their data-hungry AI systems.
In late 2021, facing a shortage of reputable English-language text data, OpenAI researchers developed a speech recognition tool called Whisper and used it to transcribe YouTube videos.
Despite internal discussions about potentially violating YouTube’s rules, which prohibit using its videos for “independent” applications, the NYT found that OpenAI ultimately transcribed over one million hours of YouTube content. Greg Brockman, OpenAI’s president, personally assisted in collecting the videos. The transcribed text was then fed into the training data for GPT-4.
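For context, OpenAI later open-sourced Whisper. Below is a minimal sketch of how the public openai-whisper package transcribes an audio file; the file name is hypothetical, and this illustrates the tool itself, not OpenAI’s internal pipeline:

```python
# Minimal sketch using the open-source openai-whisper package
# (pip install openai-whisper; requires ffmpeg on the system).
import whisper

# Load a pretrained checkpoint; "base" trades accuracy for speed.
model = whisper.load_model("base")

# "lecture.mp3" is a hypothetical local audio file, e.g. the audio
# track extracted from a video.
result = model.transcribe("lecture.mp3")

# The transcript text; in the reported pipeline, text like this was
# collected at scale into a training corpus.
print(result["text"])
```

At the scale the NYT describes, a loop like this would have to run over more than a million hours of audio before the resulting text could be used for training.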
Google also allegedly transcribed YouTube videos to harvest text for its AI models, potentially infringing on video creators’ copyrights.
This comes days after YouTube’s CEO said such activity would violate the company’s terms of service and undermine creators.
In June 2023, Google’s legal department requested changes to the company’s privacy policy to allow publicly available content from Google Docs and other Google apps to be used for a wider range of AI products.
Meta, facing its own data shortage, has considered various options to acquire more training data.
Executives discussed paying for book licensing rights, buying the publishing house Simon & Schuster, and even harvesting copyrighted material from the internet without permission, risking potential lawsuits.
Meta’s lawyers argued that using data to train AI systems should fall under “fair use,” citing a 2015 court decision involving Google’s book scanning project.
Ethical concerns and the future of AI training data
The collective actions of these tech companies highlight the critical importance of online data in the booming AI industry.
These practices have raised concerns about copyright infringement and the fair compensation of creators.
Filmmaker and author Justine Bateman told the Copyright Office that AI models were taking content, including her writing and films, without permission or payment.
“This is the largest theft in the United States, period,” she said in an interview.
In the visual arts, Midjourney and other image models have been shown to generate copyrighted content, such as scenes from Marvel movies.
With some experts predicting that high-quality online data could be exhausted by 2026, companies are exploring alternative methods, such as generating synthetic data using AI models themselves. However, synthetic training data carries its own risks: models trained heavily on machine-generated text can drift away from the distribution of real human data, degrading quality over successive generations.
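As a purely illustrative sketch of the idea (the model name and prompts below are hypothetical, and real pipelines filter and deduplicate outputs heavily to limit quality degradation):

```python
# Illustrative sketch of synthetic data generation: prompt an existing
# model and collect its outputs as new training text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical seed prompts covering topics the new corpus should span.
prompts = [
    "Write a short encyclopedia-style paragraph about photosynthesis.",
    "Explain how a suspension bridge carries load, in plain language.",
]

synthetic_corpus = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    synthetic_corpus.append(response.choices[0].message.content)

# synthetic_corpus would then be cleaned, filtered, and mixed with
# real data before any training run.
```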
OpenAI CEO Sam Altman himself acknowledged the finite nature of online data in a speech at a tech conference in May 2023: “That will run out,” he said.
Sy Damle, a lawyer representing Andreessen Horowitz, a Silicon Valley venture capital firm, also discussed the challenge: “The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data. The data needed is so massive that even collective licensing really can’t work.”
The NYT and OpenAI are locked in a bitter copyright lawsuit, with the Times seeking what its complaint suggests could be billions in damages.
OpenAI hit back, accusing the Times of ‘hacking’ its models to retrieve examples of copyright infringement.
By ‘hacking,’ OpenAI means jailbreaking or red-teaming: targeting the model with specially formulated prompts intended to break its safeguards or manipulate its outputs.
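As an illustration of what such a probe can look like (the prompt, snippet, and model name below are placeholders, not material from the lawsuit), one common memorization test is to feed a model the opening of a known text and ask it to continue, then compare the completion against the original:

```python
# Hypothetical sketch of a memorization probe.
from openai import OpenAI

client = OpenAI()

# Placeholder: the first paragraph of a known, copyrighted article.
article_opening = "..."

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{
        "role": "user",
        "content": f"Continue this article exactly as originally written:\n\n{article_opening}",
    }],
)

completion = response.choices[0].message.content
# A near-verbatim match between `completion` and the real article
# would suggest the text was memorized from training data.
```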
The NYT said it wouldn’t have had to resort to jailbreaking the models if AI companies were transparent about the data they’d used.
Undoubtedly, this in-depth investigation further paints Big Tech’s data heist as ethically and legally indefensible.
With lawsuits mounting, the legal landscape surrounding the use of online data for AI training is extremely precarious.