Inside the 300TB Spotify scrape and why it matters for the AI race

Hacktivists have scraped and archived 300TB of Spotify music and metadata, raising concerns over copyright violations and the use of pirated data to train AI models.

By Surjosnata Chatterjee

Dec 23, 2025 18:11 IST

A hacktivist group has scraped and archived a massive portion of Spotify’s music library, backing up nearly 300 terabytes of audio files and metadata in what it describes as an effort to preserve global music. The scale of the dataset, however, has triggered concern over copyright violations and the possibility of misuse by artificial intelligence companies.

According to a report by The Indian Express, the archive includes around 86 million audio files and over 256 million rows of track metadata, now listed on Anna’s Archive, an open-source search engine that indexes shadow libraries such as Sci-Hub and LibGen.

The metadata already released publicly covers artist names, song titles, producers, genres, release dates, durations, and 186 million ISRC codes, which uniquely identify sound recordings. Anna’s Archive has said the audio files will be released next, in phases, ordered by popularity and distributed as torrent files.

Spotify confirms unlawful scraping, tightens safeguards

Spotify acknowledged the incident, saying it had identified and disabled accounts linked to unlawful scraping activity.

“We’ve implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behaviour,” a Spotify spokesperson told The Indian Express. “We have stood with the artist community against piracy and are working with industry partners to protect creators and their rights.”

In a separate statement cited by Billboard, Spotify said a preliminary investigation found that a third party scraped public metadata and used illicit methods to bypass DRM protections to access some audio files.

Anna’s Archive has long maintained that it does not host copyright-infringing content itself and functions only as a search index. In a blog post dated December 20, it described the project as a “preservation archive” and said that anyone with enough storage space could copy the dataset.

Importance of the archive in the era of AI

The scrape has drawn attention for its potential impact on the development of artificial intelligence. A large corpus of music and its associated metadata could be highly valuable for training machine learning models for music analysis, recommendation, or generation.

On public forums such as Hacker News, users debated whether the data should be shared for research purposes and raised the potential for misuse, whether by large companies such as Google or through illegal resale without royalties being paid to artists and performers.

In a post published on LinkedIn, Yoav Zimmerman, the CEO of AI startup Third Chair, noted that the archive could technically allow individuals to set up their own streaming platforms using Plex and other media managers.

Preservation claims and open research questions

Anna’s Archive has said the dataset only includes music available on Spotify before July 2025, and that its approach prioritised breadth over file quality. The group argued that existing music archives tend to focus on popular artists and high-fidelity formats, making comprehensive preservation difficult due to storage constraints.

The platform says it will release more information, including album art, file checksums, and patch files for rebuilding original audio formats. It also said recently that it may enable downloads of individual files if there is enough interest.

For now, the episode has reignited debates over digital preservation, piracy, and the risks that large-scale data scraping poses in a data-driven economy, even when the scraping is framed as an archival project.
