AI Copyright Law · Explainer · Jun 26, 2026, 3:56 PM

Artists Win Landmark Discovery Ruling Forcing AI Giants to Disclose Training Data

A federal judge has ordered major generative AI companies to hand over their proprietary training datasets to plaintiffs, piercing the industry's 'black box' and granting visual artists a crucial victory in their ongoing copyright lawsuit.

By Factlen Editorial Team

Visual Artists & Creators 40% · Generative AI Developers 35% · Legal & Copyright Scholars 25%

Visual Artists & Creators: Argue that AI models are built on mass copyright infringement and demand transparency and compensation.
Generative AI Developers: Maintain that training on public data is fair use and warn that forced disclosure threatens trade secrets and innovation.
Legal & Copyright Scholars: Focus on how this discovery evidence will test the boundaries of the fair use doctrine in the digital age.

What's not represented

· Open-source AI researchers
· Commercial brands using AI-generated art

Why this matters

For years, AI companies have shielded their training data behind claims of trade secrecy. This ruling forces transparency, giving human creators the forensic evidence they need to prove their work was used without consent and potentially forcing a shift toward ethical, licensed AI models.

Key points

A federal judge ordered generative AI companies to disclose their complete training datasets to plaintiffs in a major copyright lawsuit.
The ruling pierces the industry's 'black box,' rejecting arguments that the data is too proprietary or burdensome to share.
Plaintiffs will use the data to trace whether specific artists' portfolios were deliberately targeted to train the models.
The disclosed data will be kept under a strict protective order to prevent trade secrets from leaking to competitors.
The evidence gathered will be crucial for the upcoming summary judgment motions and a jury trial scheduled for 2027.

5.8 billion: Images in the LAION-5B dataset
Lead plaintiffs in the class-action suit: Sarah Andersen, Kelly McKernan, and Karla Ortiz
$1.5 billion: Anthropic's 2025 copyright settlement
2027: Scheduled year for the jury trial

On Thursday, a federal judge in the Northern District of California handed a monumental victory to visual creators, ordering generative AI giants including Midjourney and Stability AI to fully disclose the exact datasets and internal scraping logs used to train their flagship image models. The landmark discovery ruling pierces the so-called 'black box' of artificial intelligence development, compelling some of the world's most highly valued tech companies to hand over the native files and proprietary data structures that power their systems. For an industry that has long shielded its ingestion practices behind claims of corporate secrecy, the order represents a seismic shift in legal accountability.[1][2]

The decision marks a critical turning point in Andersen v. Stability AI, a sprawling class-action copyright lawsuit spearheaded by prominent illustrators Sarah Andersen, Kelly McKernan, and Karla Ortiz. Since initially filing the suit in early 2023, the plaintiffs have consistently argued that generative AI platforms are built on the unauthorized, wholesale ingestion of billions of copyrighted artworks. They contend that the software does not simply learn from public data, but actively misappropriates the specific creative labor of human artists to build a commercial product that directly undercuts their livelihoods. The case has since expanded to represent thousands of visual creators who found their distinct styles replicated by machine learning algorithms.[3][4]

Until now, proving that claim has been an exhausting forensic guessing game. AI developers have fiercely guarded their training pipelines, arguing that their datasets are proprietary trade secrets and that extracting specific ingestion logs would be technically unfeasible. When artists found AI-generated images that bore an uncanny resemblance to their own portfolios, they could only point to the output as circumstantial evidence. Without access to the underlying code and the exact list of images the model was fed, plaintiffs struggled to definitively prove that their specific copyrighted works were the source material.[5]

The court's order effectively dismantles that defense. The judge ruled that the plaintiffs cannot fairly argue their case without direct access to the underlying evidence, ordering the companies to produce the complete, native datasets to the plaintiffs' legal and technical teams. This includes not only the massive open-source repositories like the 5.8-billion-image LAION-5B database, but also the highly guarded, proprietary additions and fine-tuning datasets that companies like Midjourney use to polish their final models. The ruling asserts that the need for transparency in a copyright dispute outweighs the developers' desire for corporate secrecy.[3][6]

To understand the stakes of this disclosure, it helps to understand how these generative systems are built. Modern image generators rely on a machine learning architecture known as latent diffusion. During training, the system is fed billions of text-and-image pairs. Noise is progressively added to a compressed representation of each image, and the model learns to reverse that corruption, guided by the paired caption, which is how it comes to associate specific words with visual concepts. Over months of continuous processing, the model learns the mathematical relationships that define everything from the shape of a human hand to the specific brushstrokes of a famous oil painter.[2][7]
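The noise-adding half of that process can be sketched in a few lines of Python. This is a minimal illustration based on the standard textbook diffusion formulation with a linear noise schedule, not code from any defendant's training pipeline. The function names, the schedule values, and the four-number list standing in for a compressed image are all assumptions chosen for illustration.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal's variance kept at each step.

    Uses a linear beta (noise) schedule; the values are illustrative defaults.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, alpha_bar, rng):
    """Blend a clean latent with Gaussian noise for one chosen step.

    Returns the noisy latent and the noise itself. During training, the
    model is asked to predict that noise from the noisy latent and the caption.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * x + noise_scale * n for x, n in zip(latent, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
latent = [0.5, -1.0, 0.25, 2.0]  # hypothetical stand-in for a compressed image

noisy_early, _ = add_noise(latent, alpha_bars[0], rng)   # almost unchanged
noisy_late, _ = add_noise(latent, alpha_bars[-1], rng)   # almost pure noise
```

In this sketch the early step keeps nearly all of the original signal, while the final step keeps almost none of it. Generating a new image runs the learned reversal in the other direction, starting from noise.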

The artists' central legal claim is that this ingestion process is not merely passive learning, but an act of mass mechanical reproduction that violates federal law. They argue that the AI models essentially function as unauthorized derivative works, compressing billions of copyrighted images into a mathematical latent space. Because the models are commercial products sold via monthly subscriptions, the plaintiffs argue that the tech companies are directly monetizing stolen labor, creating a synthetic competitor that can instantly generate artwork in the exact style of the humans it was trained on.[4][8]

With the discovery gates now forced open, the plaintiffs' forensic experts will gain unprecedented, direct access to the servers and native files of the AI developers. This raw evidence will allow them to trace the exact lineage of the training data. Experts will be looking for proof that specific artists' portfolios were deliberately over-sampled, or weighted heavily during the fine-tuning phase, to allow the model to perfectly mimic their unique, recognizable styles. If the logs reveal internal communications directing engineers to scrape specific copyrighted galleries, it could prove devastating to the defense.[1][5]
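One simple form such a lineage check could take is counting how much of a dataset comes from each source site. The sketch below is purely illustrative: the record format, the `example` domains, and the threshold are hypothetical, and nothing here reflects the actual structure of the disclosed data or the experts' real methods.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(records):
    """Count training records per source hostname."""
    counts = Counter()
    for record in records:
        host = urlparse(record["url"]).hostname or "unknown"
        counts[host] += 1
    return counts


def overrepresented(counts, threshold):
    """Return each hostname whose share of all records exceeds the threshold."""
    total = sum(counts.values())
    return {
        host: n / total
        for host, n in counts.items()
        if n / total > threshold
    }


# Hypothetical records; a real dataset would hold billions of rows.
records = [
    {"url": "https://portfolio.example/art/1.png", "caption": "ink comic"},
    {"url": "https://portfolio.example/art/2.png", "caption": "ink comic"},
    {"url": "https://portfolio.example/art/3.png", "caption": "watercolor"},
    {"url": "https://stock.example/photo/9.jpg", "caption": "city street"},
    {"url": "https://news.example/img/4.jpg", "caption": "press photo"},
]

counts = domain_counts(records)
flagged = overrepresented(counts, threshold=0.5)
```

A raw share like this would only be a starting point. Showing deliberate over-sampling would also require the sampling weights and fine-tuning records the article describes.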

The tech companies, however, maintain that their training methods are legally sound and fundamentally misunderstood by the plaintiffs. In their defense filings, Midjourney and Stability AI argue that their models learn abstract concepts rather than memorizing exact pixels. They insist that the training process is akin to a human art student walking through a museum—studying millions of paintings to understand color theory and composition, but never actually copying a specific canvas. From their perspective, the final model contains no actual images, only mathematical weights.[2][3]

To shield this practice, the developers rely heavily on the doctrine of 'fair use,' a provision of US copyright law that can permit the unlicensed use of copyrighted material. Courts weigh four statutory factors, including whether the new use is transformative and whether it harms the market for the original work. The AI companies argue that turning billions of disparate images into a versatile, general-purpose creative tool is the very definition of transformative use. They warn that a ruling against them would effectively outlaw machine learning in the United States, ceding the technological future to international rivals.[6][7]

Furthermore, the companies warned the court that forcing the handover of their complete datasets would expose them to severe, potentially catastrophic cybersecurity risks. They argued that transferring massive, proprietary data structures, the core intellectual property that gives their platforms a competitive edge in a multi-billion-dollar market, could allow foreign competitors or malicious actors to reverse-engineer their flagship products. The developers also argued that the logistical burden of securely transferring petabytes of sensitive data was unreasonable for a civil lawsuit, and that the risk of a devastating data leak far outweighed the plaintiffs' need for discovery.[5]

Acknowledging these valid security concerns, the court placed the disclosed data under an exceptionally strict protective order. The datasets will not be made public; instead, they will be housed on highly secure, air-gapped servers. Access will be strictly limited to the plaintiffs' lead attorneys and a small team of vetted, independent technical experts. These experts will be permitted to run forensic queries and analyze the code, but they are legally barred from copying the models or sharing the underlying trade secrets with anyone outside the immediate confines of the litigation.[1][6]

This ruling builds on a growing legal momentum for content creators across the broader artificial intelligence landscape. In early 2025, a similar discovery order in the Tremblay v. OpenAI lawsuit forced the disclosure of the massive text datasets used to train ChatGPT. Later that same year, the AI startup Anthropic agreed to a historic $1.5 billion settlement after discovery revealed the company had used shadow libraries of pirated books to train its Claude chatbot. Across the board, courts are increasingly rejecting the argument that AI training data is too complex or proprietary to be scrutinized.[3][8]

However, this is the first time the multi-billion-parameter visual models have been forced open in such a comprehensive manner. The forensic evidence unearthed in this discovery phase will serve as the factual bedrock for the upcoming summary judgment motions, where both sides will ask the judge to rule on the core legal questions. If the case is not resolved or settled during that phase, the evidence will be presented to a jury in a landmark trial currently scheduled for April 2027.[2][4]

The central uncertainty now lies in what the raw data will actually reveal. If the internal logs and scraping scripts show that developers actively sought out and targeted specific artists' websites to intentionally replicate their styles, it could severely undermine the companies' fair use defense by suggesting bad faith. Conversely, if the data shows the models only absorbed broad statistical patterns without disproportionately targeting individual creators, the tech industry's legal standing could be significantly strengthened, lending support to its 'art student' analogy.[5][7]

Regardless of the trial's ultimate outcome, the discovery ruling itself is already being celebrated as a massive, paradigm-shifting win for the 'Data Dignity' movement. This growing global coalition of creators, ethicists, and technologists advocates for a future where human labor is respected and compensated. They are demanding transparent licensing frameworks, collective bargaining rights, and strictly opt-in consent models for all future AI training, arguing that the era of 'move fast and break things' must end when it comes to human creativity.[4][8]

By forcing the AI giants to open their books and reveal their exact ingestion practices, the federal court has fundamentally shifted the balance of power in the digital age. Artists are no longer fighting a theoretical, uphill battle against an opaque algorithm; they now have the concrete legal tools to demand accountability. As the forensic teams begin their deep dive into the code, the creative industry is one step closer to establishing an ethical, sustainable framework for the future of artificial intelligence.[1][4]

How we got here

Jan 2023: Sarah Andersen and other artists file a class-action copyright lawsuit against Stability AI and Midjourney.
Aug 2024: A federal judge denies the AI companies' motion to dismiss, allowing the core copyright infringement claims to proceed.
Jan 2025: A separate ruling in Tremblay v. OpenAI forces the disclosure of text-based training datasets.
Jun 2026: The court orders generative image AI companies to fully disclose their visual training datasets and internal logs.
Apr 2027: The scheduled start date for the jury trial in the Andersen v. Stability AI case.

Viewpoints in depth

Visual Artists & Creators

For the creative community, this ruling is about basic fairness and consent.

Artists argue that their life's work has been strip-mined to build commercial products that now compete directly against them. By gaining access to the training data, they believe they can finally prove that AI models are not just 'learning' abstract concepts, but are actively compressing and reproducing their specific, copyrighted expressions without permission or compensation.

Generative AI Developers

Tech companies view the forced disclosure as a dangerous precedent that threatens the US AI industry.

They argue that machine learning relies on analyzing massive swaths of public data—a practice they insist falls squarely under fair use. Developers warn that opening proprietary datasets not only exposes highly sensitive trade secrets to international competitors but also fundamentally misunderstands how latent diffusion models work, as the models do not store actual images.

Legal & Copyright Scholars

Legal experts see this discovery phase as the crucible that will forge modern copyright law.

Scholars note that while the AI companies have so far argued 'fair use' in principle, the actual internal data and communications will determine whether they acted in good faith. If the disclosed logs reveal deliberate targeting of specific artists to mimic their styles, it could shatter the transformative use defense and force the entire industry into a licensed, opt-in model.

What we don't know

It remains unclear exactly what the internal scraping logs will reveal about the developers' intent and targeting practices.
We do not yet know how the court will ultimately rule on the 'fair use' defense once the forensic evidence is presented.
It is uncertain if this discovery order will force the AI companies to settle before the 2027 trial begins.

Key terms

Discovery: The formal pre-trial phase in a lawsuit where parties are legally required to exchange evidence, documents, and information relevant to the case.
Latent Diffusion: An AI technique that generates images by starting with random noise and gradually refining it into a coherent picture based on patterns learned from training data.
Training Data: The massive collections of text, images, or audio used to teach an artificial intelligence model how to recognize patterns and generate new content.
Fair Use: A legal doctrine in US copyright law that allows the limited, unlicensed use of copyrighted material for transformative purposes, such as commentary, criticism, or research.
Data Dignity: A movement advocating for the rights of creators to control how their digital data and intellectual property are used, demanding consent and compensation.

Frequently asked

What exactly did the judge order the AI companies to do?

The judge ordered companies like Midjourney and Stability AI to hand over their complete, native training datasets and internal scraping logs to the plaintiffs' legal team.

Will the public be able to see the training data?

No. The court placed the data under a strict protective order, meaning it will be kept on secure servers and can only be viewed by the plaintiffs' attorneys and vetted technical experts.

Why is this ruling considered a landmark victory?

It pierces the 'black box' of AI development. For the first time, artists will have the concrete forensic evidence needed to prove exactly whose work was used to build these commercial models.

What is the AI companies' main defense?

The companies argue that training AI on publicly available images constitutes 'fair use' under US law, claiming their models learn abstract concepts rather than copying exact pixels.

Sources

[1] Reuters (Legal & Copyright Scholars): Federal judge orders AI image generators to disclose training data in copyright suit
[2] WIRED (Generative AI Developers): The Black Box Opens: Artists Win Access to AI Training Data
[3] Bloomberg Law (Legal & Copyright Scholars): Stability AI, Midjourney Ordered to Produce Training Data to Artists
[4] ARTnews (Visual Artists & Creators): A Massive Win for Creators: Judge Forces AI Giants to Reveal Training Data
[5] The Verge (Generative AI Developers): AI companies must hand over their training data to artists, judge rules
[6] Electronic Frontier Foundation (Legal & Copyright Scholars): What the Visual AI Discovery Ruling Means for Fair Use and Trade Secrets
[7] US District Court for the Northern District of California (Legal & Copyright Scholars): Order Granting Plaintiffs' Motion to Compel Discovery
[8] The Art Newspaper (Visual Artists & Creators): Data Dignity: Artists celebrate court order forcing AI firms to reveal scraped images
