How to scrape all of Spotify
June 28th, 2022

Back in 2020 I published a paper analyzing the Spotify collaboration graph. You can read it on arXiv for free or behind a fancy paywall at the journal. Since that publication, several people have asked me how to download the full network of Spotify. The short answer: don’t, it’s going to take a bunch of work.

The long answer. Well, here goes.


A broad overview

You’re going to use the Spotify Web API to collect artist metadata along with all their albums and songs, probably via a client library like Spotipy in Python. There is no universal list of artists, so the simplest workable approach is to snowball sample: start from a few artists and expand outward through each artist’s collaborators and related artists. Random or empty searches may also help surface artists you’d otherwise miss.

There are a lot of artists and limited time, so you’re going to want to use bulk API endpoints where possible, and write code that can run multiple collection workers at once.
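
To make the bulk endpoints concrete, here’s a minimal sketch of fetching many artists in batches rather than one call per artist. It assumes you’ve already created an app on the Spotify developer dashboard and exported `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`; the function names are my own, but `sp.artists` is a real Spotipy call that takes up to 50 IDs at a time:

```python
def chunked(ids, size=50):
    """Split a list of IDs into batches (the bulk artists endpoint takes up to 50)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_artists(sp, artist_ids):
    """Fetch metadata for many artists using one API call per batch of 50."""
    artists = []
    for batch in chunked(artist_ids):
        artists.extend(sp.artists(batch)["artists"])
    return artists


if __name__ == "__main__":
    # pip install spotipy; credentials come from the environment variables above.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
    for artist in fetch_artists(sp, ["50JJSqHUf2RQ9xsHs0KMHg"]):
        print(artist["name"], artist["popularity"], artist["genres"])
```

Batching like this matters a lot at scale: 500,000 artists is 10,000 requests instead of 500,000.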

In a bit more detail

  1. You’re going to need a set of API credentials (or several sets, for rate-limit reasons we’ll get to later). These credentials allow you to access Spotify data.

  2. We’re going to write a Python script that will spend a lot of time collecting data. Get this set up using Spotipy and try collecting a single artist’s albums.

    Hint: if you’re looking for an artist’s ID, you’ll find it in the URL of their profile. (e.g., https://open.spotify.com/artist/50JJSqHUf2RQ9xsHs0KMHg)

  3. Once we know how to download everything we want about an artist (artist_albums, each song in each album, artist_related_artists, followers, name, genres, popularity), we’re going to want to explore the network to collect as many artists as we can find.

  4. We’re going to write a very large pseudo-breadth-first search. Essentially we will start with an artist, get everything we want to know about them (and save it to a file); then find all of the artists related to them or that have collaborated with them (appeared on a song or album with the artist in question), and add these new artists to a big-ass queue of artists to query.

  5. This list of artists will balloon very quickly, and our collection bots will be very busy collecting new artists. The more artists we find, the larger this queue will get.

  6. In light of this, two things are important:

    1. The code used to query an artist should be as fast as possible and use as few API calls as possible.
    2. We parallelize these queries across API credentials (so rate limits don’t slow us down). Spotify’s API will tell you when you’re rate limited and how long to wait, and Spotipy can handle the backing-off for you.
  7. There are many smart ways to scale this up, but if I were to try this again, I would create a cloud database (a cheap/free little SQL instance on AWS) to store the metadata (don’t store what you don’t need) and the queue of artists to search for. I would then spin up a bunch of very lightweight cloud machines (think small AWS EC2 instances) and let them run the querying code; most of the time is spent waiting on rate limits, so we don’t need heavy compute. We could have each machine draw randomly from the queue, to make the probability of double-collecting very low. If the whole idea of spinning up AWS servers is scary, this could also be done with multiple Python processes on one computer or across a few machines.
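
Putting steps 2–4 together, here’s a minimal single-process sketch of the pseudo-breadth-first search. The Spotipy calls (`sp.artist`, `sp.artist_albums`, `sp.artist_related_artists`) are real endpoints; the function names, the JSONL output file, and the use of related artists alone (rather than also extracting collaborators from album tracks) are simplifications for illustration:

```python
import json
from collections import deque


def unseen(seen, candidate_ids):
    """Return only the candidate IDs we haven't already queued or collected."""
    return [a for a in candidate_ids if a not in seen]


def crawl(sp, seed_id, out_path, max_artists=100):
    """Pop an artist, save its metadata, enqueue every related artist we haven't seen."""
    seen = {seed_id}
    queue = deque([seed_id])
    collected = 0
    with open(out_path, "w") as out:
        while queue and collected < max_artists:
            artist_id = queue.popleft()
            artist = sp.artist(artist_id)                      # name, genres, followers, popularity
            albums = sp.artist_albums(artist_id, limit=50)["items"]
            related = sp.artist_related_artists(artist_id)["artists"]
            out.write(json.dumps({"artist": artist, "albums": albums}) + "\n")
            collected += 1
            fresh = unseen(seen, [r["id"] for r in related])
            seen.update(fresh)
            queue.extend(fresh)


if __name__ == "__main__":
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
    crawl(sp, "50JJSqHUf2RQ9xsHs0KMHg", "artists.jsonl", max_artists=25)
```

A real run would also page through albums beyond the first 50, fetch each album’s tracks, and persist the `seen` set and queue so the crawl can resume after a crash.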
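
And a sketch of step 6’s parallelization: one Spotipy client per credential pair, each in its own worker thread, all draining a shared queue. The `retries` and `backoff_factor` arguments are real Spotipy options for retrying rate-limited requests; the worker structure, credential list, and `process_artist` callback are placeholders you’d swap for your own crawl logic:

```python
import queue
import threading


def make_clients(credential_pairs):
    """Build one Spotipy client per (client_id, client_secret) pair."""
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    return [
        spotipy.Spotify(
            auth_manager=SpotifyClientCredentials(client_id=cid, client_secret=secret),
            retries=5,            # retry rate-limited requests
            backoff_factor=0.5,   # wait a bit longer after each failure
        )
        for cid, secret in credential_pairs
    ]


def run_workers(clients, artist_ids, process_artist):
    """Each client drains the shared work queue in its own thread."""
    work = queue.Queue()
    for artist_id in artist_ids:
        work.put(artist_id)

    def worker(sp):
        while True:
            try:
                artist_id = work.get_nowait()
            except queue.Empty:
                return
            process_artist(sp, artist_id)  # fetch + save this artist with this client

    threads = [threading.Thread(target=worker, args=(sp,)) for sp in clients]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The same shape scales out to the cloud version in step 7: replace the in-memory `queue.Queue` with a table in the shared database and each thread with a cheap EC2 instance.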

This might sound like a daunting task, and it will require lots of testing and debugging; but done right, it could open up a fantastic dataset for science and fun, which could result in fame and fortune (*maybe).

I wish I’d had this blog post many years ago when I was producing the data, and it would be great to see the dataset reach its dream form.

If you want to try, I’m happy to talk about it. Just email me.
