In the mad dash to dominate the AI industry, tech giants are pushing ethical boundaries and testing the limits of public trust.
A pattern of recent revelations raises alarm bells about data privacy, fair competition, and the concentration of power and talent.
First off, an investigation by Proof News and WIRED uncovered that Apple, NVIDIA, Anthropic, and Salesforce have been using a dataset containing subtitles from over 170,000 YouTube videos to train their AI models.
This dataset, known as “YouTube Subtitles,” was compiled without the consent of content creators, potentially violating YouTube’s terms of service.
The scale of this data mining operation is staggering. It includes content from educational institutions like Harvard, popular YouTubers such as MrBeast and PewDiePie, and even major news outlets like The Wall Street Journal and the BBC.
Ed Newton-Rex (@ednewtonrex) summarized the findings in a July 16, 2024 post: "Investigation reveals that a dataset used for gen AI training by Apple & others contains copyrighted YouTube transcripts accessed without permission... The Pile dataset contains transcripts of 170k YouTube videos, used by Apple, Anthropic, Nvidia, Salesforce & more."
YouTube has yet to respond, but back in April, CEO Neal Mohan told Bloomberg that OpenAI's potential use of YouTube videos to train its text-to-video model Sora would be a "clear violation" of the platform's terms of service.
OpenAI isn't among the accused on this occasion, and it remains to be seen whether YouTube will take action if the new allegations are substantiated.
This is far from the first time tech companies have come under fire for their data usage practices.
In 2018, Facebook faced intense scrutiny over the Cambridge Analytica scandal, where millions of users’ data was harvested without consent for political advertising.
More pertinent to AI: in 2023, it emerged that a dataset called Books3, containing over 180,000 copyrighted books, had been used to train AI models without authors' permission. The discovery triggered a wave of lawsuits against AI companies, with authors claiming copyright infringement.
That's just one example from an ever-growing stack of lawsuits emanating from every corner of the creative industries. Universal Music Group, Sony Music, and Warner Records are among the most prominent names to join the list, banding together to sue text-to-audio AI companies Udio and Suno.
In their rush to build more advanced AI models, it seems as if tech companies have adopted an “ask for forgiveness, not permission” approach to data acquisition.
The Microsoft-Inflection merger
While the YouTube scandal unfolds, Microsoft’s recent hiring spree from AI startup Inflection has caught the eye of UK regulators.
The Competition and Markets Authority (CMA) has launched a phase one merger investigation, probing whether this mass hiring constitutes a de facto merger that could stifle competition in the AI sector.
The move saw Microsoft scoop up Inflection's co-founder Mustafa Suleyman (a former Google DeepMind executive) along with a significant portion of the startup's staff.
Inflection once marketed itself as a proudly independent AI lab; it then proved that such independence is a dying breed.
The investigation takes on added weight in light of Microsoft's existing partnerships in the AI field. The company has already invested some $13 billion in OpenAI, raising questions about market concentration.
Thickening the plot, Microsoft recently gave up its non-voting observer seat on OpenAI's board. Experts say the decision was likely intended to appease antitrust authorities by scaling back the company's apparent oversight of OpenAI.
Alex Haffner, a competition partner at law firm Fladgate, said of Microsoft’s surprise decision, “It is hard not to conclude that Microsoft’s decision has been heavily influenced by the ongoing competition/antitrust scrutiny of its (and other major tech players) influence over emerging AI players such as OpenAI.”
A trust deficit?
Both the YouTube data mining scandal and Microsoft’s hiring practices contribute to a growing trust deficit between Big Tech and the public.
An immediate impact is that content creators have become more guarded about their work for fear of exploitation.
This could have a knock-on effect on content creation and sharing, ultimately impoverishing the very platforms that tech companies rely on for data.
Similarly, the concentration of AI talent in a handful of major companies risks homogenizing AI development and limiting the diversity of approaches.
For tech companies, rebuilding trust will likely require more than just compliance with future regulations and antitrust investigations.
The question lingers: can we harness the true potential of AI while preserving ethics, fair competition, and public trust?