YouTube CEO Neal Mohan has said that OpenAI's potential use of YouTube videos to train its text-to-video model Sora would violate YouTube's terms of service.
Mohan told Bloomberg that if Sora used content from YouTube, it would be a “clear violation” of the platform's terms of service.
There is no love lost between YouTube and OpenAI, with each sitting on opposite sides of the Big Tech divide.
Sora is OpenAI’s revolutionary new text-to-video model, which is still being tested. It signifies generative AI’s conquest of all media forms, starting with text, then images, and now audio and video.
Generative video and audio come with a new set of risks for AI companies to negotiate, such as their models producing near-exact replicas of copyright material.
We’ve already witnessed this with the text-to-audio model Suno, which can produce audio strikingly similar to famous songs like Queen’s “Bohemian Rhapsody” and ABBA’s “Dancing Queen.”
OpenAI, like most AI companies, has not been notably transparent about its reliance on vast amounts of internet-sourced data, including copyrighted material, to train its models.
OpenAI has even acknowledged the challenge of avoiding copyrighted data, stating in a submission to the British House of Lords that it would be “impossible” to build the technology without it.
It was something of a Freudian slip that exposed an inconvenient truth about AI training data.
However, despite OpenAI stating that copyrighted data is unequivocally vital to generative AI, infringement has not yet been proven in a court of law, reflecting how copyright law in its current form was simply not written for this era.
On the topic of training Sora specifically, OpenAI CTO Mira Murati, in an interview with The Wall Street Journal, seemed unsure what content was used to train Sora, including whether any YouTube content was involved.
Murati said, “I’m actually not sure about that,” when questioned about the content sources for Sora’s training, adding that any data utilized was either “publicly available or licensed.”
It’s hardly a glowing display of transparency from OpenAI as it prepares to release its groundbreaking new model, one it is already pitching to Hollywood for potential applications in film and TV.
Sora has already prompted producer Tyler Perry to pause an $800 million studio expansion, hinting at a potentially massive upheaval ahead for the creative industries.
YouTube’s CEO speaks about Sora
Mohan showed awareness of the ongoing debate over AI training practices, hinting that OpenAI needs to clarify whether it used YouTube data.
He told Bloomberg, “From a creator’s perspective, when a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by. It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform.”
YouTube’s terms of service explicitly “prohibit unauthorized scraping or downloading of YouTube content,” a policy confirmed by a spokesperson for YouTube in light of Mohan’s comments.
Alphabet, YouTube’s parent company, is keenly developing its own AI tools, so we can expect backlash if OpenAI is found to have used YouTube videos, directly or indirectly, to train Sora.
The AI data gold rush has led to strategic partnerships and licensing agreements between tech companies and content providers. Numerous lawsuits over text and image generation are still in progress, but they remain largely inconclusive, for two main reasons.
First, even when AI models expose themselves by reproducing copyrighted work (such as Midjourney spitting out images from Marvel movies or The Simpsons), their black-box nature makes it nigh-impossible to determine where the data was retrieved from and precisely when the infringement occurred.
Second, while AI-generated audio, images, and video might constitute strong evidence of infringement, it’s not as clear-cut as you or me copying an image of Mickey Mouse and selling it for millions without permission.
In response to these legal pressures, AI companies are starting to strike deals for valuable data.
For instance, Reddit’s $60 million per year licensing deal with Google for training AI tools exemplifies the formal arrangements emerging in the industry.
Similarly, media organizations such as The Associated Press and Axel Springer have entered into agreements allowing their content to be used for AI training, with provisions for attribution in AI-generated responses.
This presents its own challenges. Generative AI is already costly to build and run, and AI companies must now pay for data rather than simply extract it from the internet.