Over 16,000 artists’ names have been linked with the non-consensual training of Midjourney’s image generation models.
The Midjourney artist database is attached as Exhibit J to an amended lawsuit against Stability AI, DeviantArt, and Midjourney, and also appeared in a recently leaked public Google spreadsheet, part of which can be viewed in the Internet Archive here.
Artist Jon Lam shared screenshots on X from a Midjourney Discord chat where developers discuss using artist names and styles from Wikipedia and other sources.
The spreadsheet is believed to have originally come from Midjourney’s development team and squares with the leaked Discord chats from Midjourney developers, which allude to artists’ work being mapped to ‘styles.’

By encoding artists’ work as ‘styles,’ Midjourney can efficiently recreate images in those styles.
Lam writes, “Midjourney developers caught discussing laundering, and creating a database of Artists (who have been dehumanized to styles).”
Lam also shared videos of lists of artists, including those used for Midjourney styles and another list of ‘proposed artists.’ Numerous X users stated their names were on these lists.
Midjourney developers caught discussing laundering, and creating a database of Artists (who have been dehumanized to styles) to train Midjourney off of. This has been submitted into evidence for the lawsuit. Prompt engineers, your “skills” are not yours https://t.co/wAhsNjt5Kz pic.twitter.com/EBvySMQC0P
— Jon Lam #CreateDontScrape (@JonLamArt) December 31, 2023
One screenshot appears to show a statement by Midjourney CEO David Holz celebrating the addition of 16,000 artists to the training program.
Another shows a Midjourney developer saying you have to “launder it” through a “Codex,” though, without context, it’s tough to say whether this refers to artists’ work.
Others in that same conversation (not Midjourney employees) suggest that processing artwork through an AI model essentially severs it from copyright.
One says, “all you have to do is just use those scraped datasets and then conveniently forget what you used to train the model. Boom legal problems solved forever.”
How legal cases are developing
In legal cases submitted against Midjourney, Stability AI, and also OpenAI, Meta, and Google (but for text-based work, rather than images), artists, writers, and others have found it tough to prove their work is really ‘inside’ the model verbatim.
That would be the smoking gun they need to prove copyright violations.
Copyright, in general, remains poorly defined in the era of AI. AI models are trained on data that has to come from somewhere, and what better source to find that data than the internet?
The developers ‘scrape’ what’s termed ‘open,’ ‘open-source,’ or ‘public’ data from the internet, but again, these concepts are poorly defined. It might be fair to say that when AI developers smelled the imminent gold rush, they seized as much ‘open’ data from the internet as they could and used it to train their models.
Legal processes are slow; AI is lightspeed in comparison. It was very easy for developers to outflank copyright law and train models long before copyright holders and the law that governs intellectual property could react.
The reaction process is now underway, but both the AI training process and the technical process involved in generating AI outputs (e.g., text or images) from user inputs challenge the nature of intellectual property law.
Specifically, it’s a) hard to prove that AI models were definitely trained on copyrighted material and b) hard to prove their outputs sufficiently replicate copyrighted material.
There’s also the issue of accountability. AI companies like OpenAI and Midjourney at least partly used data harvested by others rather than harvesting it themselves. So, wouldn’t the original data scrapers be liable for infringement?
In the context of this recent situation, Midjourney’s models, like others, reproduce a blend of the works contained in their training data, and artists can’t easily prove which of their pieces were used.
For example, when a recent copyright case against Midjourney, Stability AI, and DeviantArt was dismissed (it’s since been resubmitted with new plaintiffs), US federal judge William Orrick identified several defects in the way the claims were framed, particularly in their understanding of how AI image generators function.
The original lawsuit alleged that Stability AI, in training its Stable Diffusion model, stored compressed copies of the images.
Stability AI refuted this, clarifying that the training process involves extracting attributes such as lines, shades, and colors and developing parameters based on these attributes rather than storing copies of the images.
Orrick’s ruling highlighted the need for the plaintiffs to amend their claims to more accurately represent the operation of these AI models.
This includes a need for a clearer explanation of whether the claim against Midjourney was due to its use of Stable Diffusion, its independent use of training images, or both (as Midjourney is also being accused of using Stability AI’s models, which allegedly use copyrighted works).
Another challenge for the plaintiffs is demonstrating that Midjourney’s outputs are substantially similar to their original artworks. Orrick noted that the plaintiffs themselves admitted that the output images from Stable Diffusion are unlikely to closely match any specific image in the training data.
As of now, the case is alive, with the court denying AI companies’ most recent attempts to dismiss the artists’ claims.
Gen Ai techbros would have you believe the lawsuit is dead or thrown out, no, the lawsuit is still alive and well, and more evidence and plaintiffs have been added to the casefile.
Updated Casefile here. https://t.co/uTqs6grWRE
— Jon Lam #CreateDontScrape (@JonLamArt) January 2, 2024
LAION dataset usage thrown into the mix
Legal cases submitted against Midjourney and co. also emphasized their potential use of the LAION-5B dataset – a compilation of 5.85 billion internet-sourced images, including copyrighted content.
Stanford recently blasted LAION for containing illicit sexual images, including child sexual abuse material, alongside various sexist, racist, and otherwise deplorable content – all of which now also ‘lives’ inside the AI models that society is starting to depend on for creative and professional uses.
The long-term implications of that are hotly debated, but the fact that these AIs were possibly trained first on stolen work and second on illegal content doesn’t shed a positive light on AI development in general.
Midjourney developer comments have been widely lambasted on social media and the Y Combinator forum.
It’s very likely that 2024 will cook up more fiery legal debates, and the Wild West chapter of AI development might be coming to a close.