Adventures in building custom datasets via Web Scraping / Spotify API — Octane Playlist (37) Sirius Radio Edition

This is a quick write-up on creating a custom dataset from Octane's song playlist over a finite period of time. Octane is part of Sirius Radio. We first try to scrape the data, but ultimately resort to finding an API to crawl. Finally, we use the Spotify API to pull down supporting data.


import os
import re
import requests
import numpy as np
import pandas as pd
import time
from datetime import datetime
from bs4 import BeautifulSoup
import spotipy
from spotipy.oauth2 import SpotifyOAuth

First Attempt — Web Scraping

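playlist = []
# The page URL is elided in the original; fill it in before running.
content = requests.get(...).text
bs = BeautifulSoup(content, 'html.parser')
main = bs.find_all('div', {'class': 'column_3'})[0]
rows = main.find_all('div', {'class': 'row_80'})
for row in rows:
    # The artist is the text before the first hyphen, minus any parenthesised note.
    info = row.find('div', {'class': 'row_83'}).getText().split('-')
    artist = re.sub(r'\(.+?\)', '', info[0]).strip()
    info = row.find('div', {'class': 'row_82'}).getText()
    track = info.strip()
    # Collapse whitespace before parsing; the pattern is missing in the
    # original and is assumed to match the one used for the track below.
    start_at = re.sub(
        r'\s+',
        ' ',
        row.find('div', {'class': 'row_84'}).getText()
    ).strip()
    # %I (12-hour clock) is required for %p to take effect; %H ignores AM/PM.
    start_at = datetime.strptime(start_at, '%m/%d/%Y %I:%M:%S %p')
    playlist.append({
        'time': start_at,
        'artist': artist,
        'track': re.sub(r'\s+', ' ', track)
    })
df = pd.DataFrame(playlist).set_index('time')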
RadioWaveMonitor — Scraping Results
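
The field clean-up in the scraper can be tried in isolation. The following is a minimal sketch using only the standard library; the helper names and the sample strings are hypothetical, and the timestamp layout is assumed to be month/day/year with a 12-hour clock and an AM/PM marker, as the format string in the scraper suggests.

```python
import re
from datetime import datetime


def clean_artist(raw):
    """Keep the text before the first hyphen and drop parenthesised notes."""
    artist = raw.split('-')[0]
    return re.sub(r'\(.+?\)', '', artist).strip()


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


def parse_start(raw):
    """Parse a 12-hour timestamp such as '11/28/2020 03:15:00 PM'."""
    # %I reads a 12-hour clock, so %p (AM/PM) shifts the hour as expected.
    # With %H the AM/PM marker is matched but does not change the hour.
    return datetime.strptime(collapse_whitespace(raw), '%m/%d/%Y %I:%M:%S %p')


# Hypothetical sample values, not taken from the scraped page.
print(clean_artist('Example Band (Live) - Octane'))
print(parse_start('11/28/2020\n   03:15:00   PM'))
```

Splitting on the first hyphen is the same shortcut the scraper takes, so an artist name that itself contains a hyphen would be truncated.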

Second Attempt — API

The parameter appears to be an epoch timestamp with three extra digits for milliseconds. Using this API call and a simple crawler, we can walk back in time to gather our data. The nice part about this API is that a ‘spotify_id’ is returned in the response JSON object, which we can use to pull additional data in batches.
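
To make the timestamp arithmetic concrete, here is a small sketch, assuming the parameter really is milliseconds since the Unix epoch. The helper names `epoch_ms_to_utc` and `walk_back` are made up for illustration and are not part of any API.

```python
from datetime import datetime, timedelta, timezone

HOUR_MS = 60 * 60 * 1000  # one hour expressed in milliseconds


def epoch_ms_to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


def walk_back(start_ms, steps, step_ms=HOUR_MS):
    """Yield `steps` timestamps, starting at start_ms and moving back in time."""
    for i in range(steps):
        yield start_ms - i * step_ms


stamps = list(walk_back(1606559939604, 3))
for stamp in stamps:
    # Printing the decoded value is a quick sanity check on the unit.
    print(stamp, epoch_ms_to_utc(stamp).isoformat())
```

Decoding a sample value this way is a cheap check that the unit is milliseconds rather than seconds: the starting value used here lands in late November 2020.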

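current_time = 1606559939604
iteration = 3600000  # one hour in epoch milliseconds
iterations = 24 * 20  # 480 hourly steps, i.e. 20 days
sleep_time_in_seconds = 5

def get_items(epoch):
    # The endpoint URL is elided in the original; only the epoch parameter survives.
    url = f'{epoch}'
    return requests.get(url).json()

playlist = []
for i in range(iterations):
    print(f'{i} @ {current_time}')
    items = get_items(current_time)
    for item in items:
        track = item['track']['name']
        artist = item['track']['artists'][0]
        # The timestamp field and format string are elided in the original.
        start_at = datetime.strptime(...)
        spotify_id = item['spotify']['spotify_id']
        playlist.append({
            'time': start_at,
            'artist': artist,
            'track': track,
            'spotify_id': spotify_id
        })
    current_time -= iteration
    # Pause between requests. The placement is assumed: the original
    # defines the delay, but the call itself is missing.
    time.sleep(sleep_time_in_seconds)

df = pd.DataFrame(playlist).set_index('time')
# Drop rows whose timestamp was already seen in an earlier response.
df = df[~df.index.duplicated(keep='first')]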
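
The last line of the crawler keeps only the first row for each timestamp, presumably because consecutive requests can return some of the same plays. As a rough illustration of that keep-first rule outside pandas, here is a plain-Python sketch; the function name and the sample rows are invented for the example.

```python
def drop_duplicate_times(rows):
    """Keep the first row seen for each 'time' value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row['time'] in seen:
            continue  # a later copy of a play we already have
        seen.add(row['time'])
        unique.append(row)
    return unique


# Two hypothetical responses that share one play at 10:05.
first_response = [{'time': '10:00', 'track': 'A'}, {'time': '10:05', 'track': 'B'}]
second_response = [{'time': '10:05', 'track': 'B'}, {'time': '10:09', 'track': 'C'}]

merged = drop_duplicate_times(first_response + second_response)
print([row['time'] for row in merged])  # ['10:00', '10:05', '10:09']
```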

Using the Spotify API

auth_manager = SpotifyOAuth(...<credentials>...)
spotify_api = spotipy.Spotify(auth_manager=auth_manager)
spofity_id_to_track = {}
unique_spotify_ids = df[~df.spotify_id.isna()].spotify_id.unique()
# np.array_split accepts counts that do not divide evenly into 25 (np.split raises).
# Each chunk still has to fit the 50-ID limit of Spotify's several-tracks endpoint.
for spotify_id_chunks in np.array_split(unique_spotify_ids, 25):
    response = spotify_api.tracks(spotify_id_chunks)
    for track in response['tracks']:
        data = {
            'track': track['name'],
            'popularity': track['popularity'],
            'explicit': track['explicit'],
            'duration_ms': track['duration_ms'],
            'band': track['album']['artists'][0]['name']
        }
        spotify_id = track['id']
        spofity_id_to_track[spotify_id] = data
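
Splitting into a fixed number of chunks ties the batch size to however many unique IDs the crawl happened to collect. Another way to form the batches is to slice by a fixed size instead. The sketch below assumes a limit of 50 IDs per request, which is the documented maximum for Spotify's several-tracks endpoint; `chunked` is a hypothetical helper and the IDs are placeholders.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = [f'id_{n}' for n in range(120)]  # placeholder IDs, not real Spotify IDs
batches = list(chunked(ids))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch could then be passed to the track lookup in place of the fixed 25-way split, and an empty ID list simply produces no requests.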

Populate DataFrame with Spotify Data

def try_get_spotify_data(row, attribute, default_value):
    spotify_id = row['spotify_id']
    # Use the default when the ID is missing (None/NaN) or Spotify returned no data for it.
    if pd.isna(spotify_id) or spotify_id not in spofity_id_to_track:
        return default_value
    return spofity_id_to_track[spotify_id][attribute]

# axis=1 passes each row to the lambda, so it can read that row's spotify_id.
df['popularity'] = df.apply(
    lambda row: try_get_spotify_data(row, 'popularity', -1),
    axis=1)
df['duration_ms'] = df.apply(
    lambda row: try_get_spotify_data(row, 'duration_ms', 0),
    axis=1)
df['explicit'] = df.apply(
    lambda row: try_get_spotify_data(row, 'explicit', False),
    axis=1)
df['artist'] = df.apply(
    lambda row: try_get_spotify_data(row, 'band', row['artist']),
    axis=1)
df['track'] = df.apply(
    lambda row: try_get_spotify_data(row, 'track', row['track']),
    axis=1)
Final Pandas DataFrame
