AI Copyright Battle · Explainer · Jun 20, 2026, 11:44 PM · 8 min read

How a New Searchable Database is Exposing the Hidden Music Used to Train AI

A newly published database reveals over 21 million copyrighted songs used to train generative AI, giving artists their first concrete tool to see if their work was ingested.

By Factlen Editorial Team

Independent Musicians 30% · Major Record Labels 30% · Transparency Advocates 20% · Generative AI Developers 20%

Independent Musicians: Argue that their life's work is being exploited without consent or compensation, demanding accountability.
Major Record Labels: View the mass scraping of copyrighted catalogs as blatant infringement and are leveraging this evidence in lawsuits.
Transparency Advocates: Argue that the public and creators deserve full visibility into the black box of AI training data.
Generative AI Developers: Maintain that training models on publicly accessible internet data qualifies as fair use and is essential for progress.

What's not represented

· Music Streaming Platforms (Spotify, Apple Music)
· Everyday Music Listeners

Why this matters

For the first time, musicians and rights holders have a transparent window into the 'black box' of AI development. This visibility shifts the balance of power in ongoing copyright battles and allows creators to verify if their intellectual property was used without consent.

Key points

The Atlantic has published a fully searchable database revealing 21.2 million songs used to train generative AI music models.
The datasets include tracks from massive pop stars like Taylor Swift as well as tens of thousands of independent artists.
AI developers compiled the audio by using automated tools to scrape tracks from publicly available links on platforms like YouTube and Spotify.
The exposed data provides major record labels with concrete evidence for their ongoing copyright infringement lawsuits against AI companies.

21.2 million: Total tracks uncovered across four datasets
12 million: Songs in the largest single archive
91 years: Time required to listen to the largest dataset
61,000: Sound recordings recently added to a major label lawsuit against Suno

For years, the artificial intelligence industry has operated behind a carefully constructed black box, shielding its most valuable assets from public scrutiny. As generative AI models rapidly learned to write coherent poetry, code complex software, and compose full-length pop songs that rival human production, the exact ingredients fed into these digital brains remained a closely guarded corporate secret. Creators across all disciplines were left to wonder if their life's work—their distinct voices, styles, and catalogs—was being used to train the very machines that are now threatening to replace them in the commercial market. The secrecy allowed AI companies to build massive valuations without having to answer uncomfortable questions about intellectual property and consent. Now, thanks to a landmark journalistic investigation, that black box has been forced open, fundamentally altering the balance of power between human artists and the technology sector.[1][2]

In a watershed moment for digital transparency and creator rights, The Atlantic has published a fully searchable database revealing the exact music used to train some of the world's most powerful artificial intelligence models. Compiled and verified by reporter Alex Reisner, the extensive investigation uncovered four massive datasets that have been quietly circulating within the AI development community for years. Together, these archives contain a staggering 21.2 million individual tracks, representing one of the largest known collections of unauthorized training data ever exposed to the public. By meticulously indexing this data and building a public-facing search tool, the publication has provided the music industry with a vital resource that strips away the plausible deniability previously enjoyed by tech companies.[3][6][7]

The sheer scale of the ingestion revealed by the database is difficult for the human mind to conceptualize. The largest single archive uncovered in the investigation contains an astonishing 12 million songs, while a second major dataset holds an additional 9 million tracks. To put this volume of audio into perspective, if a person were to listen to the largest dataset continuously, without ever pausing for sleep or breaks, it would take them 91 years to reach the end of the playlist. By making these datasets searchable, the project has handed musicians, producers, and record labels their first concrete, undeniable tool to verify whether their intellectual property was swept up in the generative AI gold rush without their knowledge or permission.[2][3][4]
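The 91-year figure can be sanity-checked with simple arithmetic. The sketch below is only a rough check: the 12 million track count comes from the investigation as reported here, but the four-minute average track length is an assumption, and the function and constant names are hypothetical.

```python
# Back-of-the-envelope check of the "91 years" listening-time figure.
# Assumption (not from the investigation): an average track runs about 4 minutes.

MINUTES_PER_YEAR = 60 * 24 * 365.25  # nonstop listening, averaging in leap years


def listening_years(track_count: int, avg_minutes: float) -> float:
    """Return the years of continuous playback needed to hear every track once."""
    return track_count * avg_minutes / MINUTES_PER_YEAR


LARGEST_DATASET_TRACKS = 12_000_000  # size of the largest archive, per the article
ASSUMED_AVG_MINUTES = 4.0            # hypothetical average track length

if __name__ == "__main__":
    years = listening_years(LARGEST_DATASET_TRACKS, ASSUMED_AVG_MINUTES)
    print(f"{years:.1f} years of continuous listening")  # about 91.3
```

Under that assumption the result lands just above 91 years, which matches the reported figure. A shorter or longer assumed track length scales the answer proportionally.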

To truly understand why the publication of this database is so disruptive to the tech ecosystem, one must understand the fundamental mechanics of how generative AI music models are built. These sophisticated systems do not inherently know what a distorted electric guitar sounds like, nor do they possess an innate understanding of the emotional arc of a pop chorus. They begin as blank slates and must be taught through massive, relentless exposure to existing human art. Developers assemble millions of audio files into "training datasets" and feed them to a machine learning algorithm, which analyzes the files to learn the mathematical patterns of rhythm, melody, timbre, and genre conventions.[5][7]

Once the machine learning model has ingested and processed this vast library of human creativity, it can generate entirely new, highly convincing audio tracks based on simple text prompts from users. The controversy, however, lies entirely in how these massive troves of audio are acquired in the first place. According to the investigation, the datasets were largely compiled using sophisticated, automated scraping tools designed to extract audio files directly from consumer-facing platforms like YouTube, Spotify, and SoundCloud. This brute-force extraction method deliberately bypasses the platforms' terms of service, circumvents advertising mechanisms, and completely ignores the royalty payment structures that working artists rely on to survive in the modern streaming economy.[3][5]

The searchable database confirms that the automated scraping process was entirely indiscriminate, prioritizing volume over any consideration of copyright or consent. The datasets include the heavily protected, highly lucrative catalogs of global superstars like Taylor Swift, Bad Bunny, Billie Eilish, and Nirvana. But the automated tools did not stop at the top of the Billboard charts; they also swept up the life's work of tens of thousands of independent artists, experimental composers, and niche bedroom producers who lack the legal firepower and financial resources of the major record labels. For these smaller creators, the discovery that their music was used to train commercial AI products feels like a profound violation of their creative autonomy.[2][3][6]

The reaction from the independent music community has been a potent mixture of long-awaited vindication and deep-seated outrage. For months, many artists suspected that their distinct sonic signatures and unique production techniques were being mimicked by generative AI platforms, but they lacked any hard proof to back up their claims. Following the database's publication, musicians immediately flocked to the tool, taking to social media platforms to share screenshots of their own track titles appearing in the search results. This collective realization transformed abstract, theoretical anxieties about artificial intelligence into a documented, undeniable reality for thousands of working professionals.[4][5]

Prominent independent artists like Backxwash, Titus Andronicus, and the experimental electronic composer Hainbach are among those whose extensive catalogs were found buried within the massive archives. For these creators, the database is not merely a fascinating list of digital files; it is a stark ledger of uncompensated labor. They argue that their years of practice, financial investment, and creative struggle are now being used as free raw material to fuel commercial products that are capable of generating infinite, royalty-free soundalikes. The revelation has galvanized the independent sector, prompting renewed calls for collective action and legislative intervention to protect human artistry from unchecked algorithmic extraction.[1][4]

Beyond the profound emotional and financial impact on individual creators, the exposure of these datasets carries massive, immediate legal implications for the broader technology sector. The music industry is currently locked in an existential, high-stakes legal battle with some of the most prominent AI developers in the world. Major entities like Universal Music Group and Sony Music Entertainment are actively suing prominent AI music platforms, including Suno and Udio, alleging mass-scale, willful copyright infringement. Until now, the labels had to rely on circumstantial evidence and the outputs of the AI models to make their case. The newly published database gives them a track-by-track record of what the circulating datasets contain, although it remains unclear which commercial platforms used these specific datasets.[2][3]

In the courtroom, generative AI companies frequently rely on the "fair use" doctrine to defend their aggressive data collection practices. They argue that training a machine learning model on publicly accessible internet data is a highly transformative act that does not directly harm the original commercial market for the copyrighted works. They often claim to use only content that is freely available online, framing their massive ingestion process as being legally and philosophically akin to a human musician listening to the radio to gain inspiration before writing their own original song. By positioning their technology as a tool that learns rather than a machine that copies, these companies hope to avoid the catastrophic financial penalties associated with mass copyright infringement.[2][3][4]

The newly exposed datasets, however, severely complicate and potentially undermine this popular legal defense. By providing an exact, track-by-track inventory of the copyrighted material required to output commercially viable musical clones, the database gives record labels tangible, irrefutable evidence of the mechanics behind the technology. It demonstrates clearly that the AI models are not merely taking abstract inspiration from the ether, but are fundamentally, structurally reliant on the wholesale ingestion of protected intellectual property. Without the millions of unauthorized human-created songs serving as a foundation, the AI platforms would be entirely incapable of generating their highly praised outputs.[3][5]

The unprecedented transparency brought by this searchable database strips away the veil of secrecy typically maintained by AI companies and sets a powerful new precedent for accountability in the digital landscape. It fundamentally shifts the burden of proof in the ongoing copyright debates, allowing rights holders of all sizes to point to specific, documented instances of their work being used without authorization, credit, or compensation. This newfound visibility empowers creators to demand answers from the tech platforms that have historically operated with impunity, forcing a public reckoning over the true cost of artificial intelligence development. It proves that the "magic" of AI is entirely dependent on the uncredited labor of millions of human beings.[2][3]

As these complex legal battles continue to grind their way through the federal courts, the music industry is aggressively pushing for the establishment of a standardized, mandatory licensing framework. Advocacy groups and industry leaders argue that if AI developers want to build highly profitable commercial products using human creativity as their foundational raw material, they must negotiate licenses and pay for the privilege, just as streaming services, radio stations, and film studios have done for decades. They envision a future where artists can explicitly opt in to training datasets and receive fair, ongoing compensation for their contributions to the machine learning ecosystem.[6]

Ultimately, the publication of this searchable database marks a definitive turning point in the increasingly fraught relationship between human creators and artificial intelligence. It moves the global debate from theoretical, philosophical arguments about the nature of fair use to concrete, actionable discussions about specific, documented data usage. For the first time since the generative AI boom began, the artists, producers, and songwriters who unknowingly built the foundation of the AI music revolution can finally see their names written into the code. Armed with this proof, the music industry is now better equipped than ever to demand transparency, enforce its copyrights, and ensure that the future of digital creation does not come at the expense of human artistry.[3][5][7]

How we got here

2020
OpenAI scrapes roughly 1.2 million songs from the internet to train its early Jukebox model.
2022
Google trains an experimental music generation model on a dataset of 44 million songs.
June 2024
Major record labels launch massive copyright infringement lawsuits against AI music generators like Suno and Udio.
June 2026
The Atlantic publishes a searchable database of 21.2 million tracks used in AI training, giving artists public proof of ingestion.

Viewpoints in depth

Transparency Advocates

Journalists and digital rights advocates view the publication of this database as a critical victory against corporate secrecy.

Advocates argue that as AI systems become increasingly integrated into daily life, the public has a fundamental right to know what data is shaping these models. By forcing the training datasets into the light, they hope to establish a new standard where tech companies can no longer hide massive data extraction behind proprietary algorithms and corporate black boxes.

Independent Musicians

Independent creators view the database as a crucial tool for accountability and proof of uncompensated labor.

For independent artists, the database transforms abstract anxieties into documented reality. Many view the automated scraping of their catalogs as a profound violation of their creative rights, arguing that their life's work is being exploited to build commercial products that could eventually replace them. They are demanding a standardized licensing framework where consent is required and compensation is paid before a single track is ingested.

Major Record Labels

The major labels view the mass scraping as blatant copyright infringement and are weaponizing the data in court.

Entities like Universal Music Group and Sony Music Entertainment are treating the exposed datasets as smoking-gun evidence in their existential legal battles against AI developers. They argue that the wholesale ingestion of protected intellectual property cannot be excused as 'fair use,' as the resulting AI models directly compete with and devalue the original human-created works. For the labels, this data is the key to forcing AI companies to the negotiating table.

Generative AI Developers

AI companies argue that training models on publicly accessible data is legally permissible and essential for innovation.

Developers of generative AI music platforms maintain that their ingestion processes fall under the 'fair use' doctrine. They argue that machine learning is a transformative act—akin to a human musician listening to the radio to learn a genre's conventions—and that they are not reproducing the copyrighted works, but rather analyzing their underlying mathematical patterns. They warn that overly restrictive licensing frameworks could stifle technological progress and consolidate AI development in the hands of a few massive tech monopolies.

What we don't know

It remains unclear exactly which commercial AI music platforms utilized these specific datasets for their final consumer products.
The courts have not yet definitively ruled on whether training AI models on copyrighted music constitutes 'fair use' or mass infringement.
It is unknown how streaming platforms will alter their infrastructure to prevent automated scraping tools from downloading audio in the future.

Key terms

Generative AI Music Model: An artificial intelligence system trained on vast amounts of audio data to generate entirely new, original songs, vocals, and instrumentation from text prompts.
Data Scraping: The automated process of extracting large amounts of information from websites, often bypassing standard user interfaces or terms of service.
Fair Use: A legal doctrine in US copyright law that permits limited use of copyrighted material without acquiring permission, often cited by AI companies as justification for their training methods.
Training Dataset: A massive collection of examples—in this case, millions of audio files—fed into a machine learning algorithm to teach it patterns, structures, and styles.

Frequently asked

Can I check if my music was used to train AI?

Yes, The Atlantic has made the datasets publicly searchable through its AI Watchdog site, allowing anyone to search for specific artist names or track titles.

Did the AI companies pay for this music?

No. The datasets were compiled by scraping audio from publicly available links on platforms like YouTube and Spotify, bypassing standard licensing and monetization structures.

Is it illegal to train AI on copyrighted music?

This is currently the subject of massive ongoing lawsuits. AI companies argue it falls under 'fair use,' while record labels and artists argue it constitutes mass copyright infringement.

Sources

[1] The Verge (Transparency Advocates): The Atlantic created a searchable database of the music used to train AI
[2] Engadget (Generative AI Developers): Investigation By The Atlantic Reveals Many Millions Of Songs Used For AI Music Training
[3] Hypebeast (Major Record Labels): Investigation by The Atlantic uncovers four searchable datasets containing over 21 million tracks used to train generative AI music models
[4] Exclaim! (Independent Musicians): Backxwash, Titus Andronicus Among Musicians to Find Their Songs in AI-Training Datasets Exposed by 'The Atlantic'
[5] Billboard (Independent Musicians): A new data leak is showing artists if their music has been used to train AI models
[6] Music Business Worldwide (Major Record Labels): Four datasets of music are circulating among artificial intelligence developers
[7] The Atlantic (Transparency Advocates): The AI Watchdog Database: Search the Music Used to Train AI
