16,000 artist names controversially leaked as Midjourney “styles”

January 5, 2024
Midjourney AI

Over 16,000 artists’ names have been linked with the non-consensual training of Midjourney’s image generation models.

The Midjourney artist database is attached as Exhibit J to an amended lawsuit filed against Stability AI, DeviantArt, and Midjourney. It also appeared in a recently leaked public Google spreadsheet, part of which has been preserved in the Internet Archive.

Artist Jon Lam shared screenshots on X from a Midjourney Discord chat where developers discuss using artist names and styles from Wikipedia and other sources.

The spreadsheet is believed to have originated with Midjourney’s development team and is consistent with the leaked Discord chats from Midjourney developers, which allude to artists’ work being mapped to ‘styles.’

By encoding artists’ work as ‘styles,’ Midjourney can efficiently generate images in those artists’ styles.

Lam writes, “Midjourney developers caught discussing laundering, and creating a database of Artists (who have been dehumanized to styles.”

Lam also shared videos of lists of artists, including those used for Midjourney styles and another list of ‘proposed artists.’ Numerous X users stated their names were on these lists. 

One screenshot appears to show a statement by Midjourney CEO David Holz celebrating the addition of 16,000 artists to the training program. 

Another shows a Midjourney developer discussing that you have to “launder it” through a “Codex,” though, without context, it’s tough to say whether this is referring to artists’ work.

Others (not Midjourney employees) in that same conversation refer to how processing artwork through an AI model essentially disembodies it from copyright.

One says, “all you have to do is just use those scraped datasets and the conveniently forget what you used to train the model. Boom legal problems solved forever.”

How legal cases are developing

In legal cases submitted against Midjourney, Stability AI, and also OpenAI, Meta, and Google (but for text-based work, rather than images), artists, writers, and others have found it tough to prove their work is really ‘inside’ the model verbatim.

That would be the smoking gun they need to prove copyright violations.  

Copyright, in general, remains poorly defined in the era of AI. AI models are trained on data that has to come from somewhere, and what better source to find that data than the internet?

The developers ‘scrape’ what’s termed as ‘open,’ ‘open-source,’ or ‘public’ data from the internet, but again, these concepts are poorly defined. It might be fair to say that when AI developers smelled the imminent gold rush, they seized as much ‘open’ data from the internet as they could and used it to train their models.

Legal processes are slow; AI is lightspeed in comparison. It was very easy for developers to outflank copyright law and train models long before copyright holders and the law that governs intellectual property could react.

The reaction process is now underway, but both the AI training process and the technical process involved in generating AI outputs (e.g., text or images) from user inputs challenge the nature of intellectual property law.

Specifically, it’s a) hard to prove that AI models were definitely trained on copyrighted material and b) hard to prove that their outputs replicate copyrighted material closely enough to infringe.

There’s also the issue of accountability. AI companies like OpenAI and Midjourney at least partly used data harvested by others rather than harvesting it themselves. So, would the original data scrapers not be the ones liable for infringement?

In the context of this recent situation at Midjourney, the company’s models, like others, generate outputs that blend many of the works contained in their training data, so artists can’t easily prove which of their pieces were used.

For example, when a recent copyright case against Midjourney, Stability AI, and DeviantArt was dismissed (it’s since been resubmitted with new plaintiffs), Federal Judge Orrick identified several defects in the way the claims were framed, particularly in their understanding of how AI image generators function. 

The original lawsuit alleged that Stability AI, in training its Stable Diffusion model, stored compressed copies of the images. 

Stability AI disputed this, stating that the training process involves extracting attributes such as lines, shades, and colors and developing parameters based on these attributes rather than storing copies of the images.
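The distinction Stability AI draws, between fitting parameters and storing copies, can be illustrated with a deliberately tiny example. The sketch below is a plain least-squares line fit, not a diffusion model, and the data, the `fit_line` helper, and all numbers in it are hypothetical. It shows only that a fitted model can be far smaller than the data it was trained on; it says nothing about whether a real image generator memorizes particular images, which is the question the court has to weigh.

```python
# Toy illustration only: a least-squares line fit, not a diffusion model.
import random


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


random.seed(0)

# Hypothetical "training data": 1,000 noisy samples of y = 2x + 1.
training_data = [
    (x, 2 * x + 1 + random.gauss(0, 0.1))
    for x in (random.uniform(0, 10) for _ in range(1000))
]

# The "model" is just two numbers learned from the data, not the data itself.
params = fit_line(training_data)
print(len(training_data), "training points ->", len(params), "parameters")
```

Here the fitted line keeps two parameters and discards the 1,000 individual points. Real image models hold billions of parameters, so the analogy is loose at best.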

Orrick’s ruling highlighted the need for the plaintiffs to amend their claims to more accurately represent the operation of these AI models. 

This includes clarifying whether the claim against Midjourney rests on its use of Stable Diffusion, its own independent use of training images, or both. Midjourney is also accused of using Stability AI’s models, which were allegedly trained on copyrighted works.

Another challenge for the plaintiffs is demonstrating that Midjourney’s outputs are substantially similar to their original artworks. Orrick noted that the plaintiffs themselves admitted that the output images from Stable Diffusion are unlikely to closely match any specific image in the training data. 

As of now, the case is alive, with the court denying AI companies’ most recent attempts to dismiss the artists’ claims. 

LAION dataset usage thrown into the mix

Legal cases submitted against Midjourney and co. also emphasized their potential use of the LAION-5B dataset – a compilation of 5.85 billion internet-sourced images, including copyrighted content. 

Stanford researchers recently reported that LAION contains illicit sexual images, including child sexual abuse material, along with various sexist, racist, and otherwise deplorable content, all of which now also ‘lives’ inside the AI models that society is starting to depend on for creative and professional uses.

The long-term implications of that are hotly debated, but the possibility that these AIs were trained first on stolen work and second on illegal content doesn’t cast AI development in a positive light.

Midjourney developer comments have been widely lambasted on social media and the Y Combinator forum.

It’s very likely that 2024 will cook up more fiery legal debates, and the Wild West chapter of AI development might be coming to a close.


Sam Jeans

Sam is a science and technology writer who has worked in various AI startups. When he’s not writing, he can be found reading medical journals or digging through boxes of vinyl records.
