Adventures in building custom datasets via Web Scraping / Spotify API — Octane Playlist (37) Sirius Radio Edition

Slaps Lab
Nov 28, 2020

This is a quick write-up on creating a custom dataset from the song playlist of Octane, a Sirius Radio station, over a finite period of time. We first try scraping a web page, but ultimately switch to crawling an API. Finally, we use the Spotify API to pull down supporting data.

Setup

import os
import re
import requests
import numpy as np
import pandas as pd
import time
from datetime import datetime
from bs4 import BeautifulSoup
import spotipy
from spotipy.oauth2 import SpotifyOAuth

First Attempt — Web Scraping

  • For the first attempt, I tried scraping a web page that listed what seemed like a day's worth of songs. The results seemed fine, but I wanted more data. On this first pass, I used the Search endpoint of the Spotify API to pull down track information.
playlist = []
content = requests.get(
    'http://www.radiowavemonitor.com/pub_charts/diaries.aspx?IDDS=9508'
).text
bs = BeautifulSoup(content, 'html.parser')

# the first 'column_3' div holds the songs, one 'row_80' div per play
main = bs.find_all('div', {'class': 'column_3'})[0]
rows = main.find_all('div', {'class': 'row_80'})
for row in rows:
    # treat the text before the first '-' as the artist and drop any parenthetical
    info = row.find('div', {'class': 'row_83'}).getText().split('-')
    artist = re.sub(r'\(.+?\)', '', info[0]).strip()

    info = row.find('div', {'class': 'row_82'}).getText()
    track = info.strip()

    # collapse runs of whitespace before parsing the timestamp
    start_at = re.sub(
        r'\s+',
        ' ',
        row.find('div', {'class': 'row_84'}).getText()
    ).strip()
    # %I (12-hour clock) is needed for %p (AM/PM) to affect the parsed hour
    start_at = datetime.strptime(start_at, '%m/%d/%Y %I:%M:%S %p')

    playlist.append({
        'time': start_at,
        'artist': artist,
        'track': re.sub(r'\s+', ' ', track)
    })

df = pd.DataFrame(playlist).set_index('time')
df.head(n=10)
RadioWaveMonitor — Scraping Results
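One detail worth checking when parsing timestamps that carry an AM/PM marker: in Python's strptime, the %p directive only changes the parsed hour when the hour itself is read with %I (12-hour clock). With %H, the marker is accepted but has no effect. The short sketch below shows the difference. The sample timestamp is made up for illustration and is not taken from the scraped page.

```python
from datetime import datetime

# Made-up sample in the same general shape as the scraped timestamps.
raw = '11/28/2020 03:15:00 PM'

# %H reads the hour as a 24-hour value; %p is parsed but does not shift it.
as_24h = datetime.strptime(raw, '%m/%d/%Y %H:%M:%S %p')

# %I reads the hour as a 12-hour value; %p then moves it to the afternoon.
as_12h = datetime.strptime(raw, '%m/%d/%Y %I:%M:%S %p')

print(as_24h.hour)  # 3
print(as_12h.hour)  # 15
```

If the afternoon plays in a scraped dataset all seem to land in the early morning, this is a likely culprit.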

Second Attempt — API

The next site, xmplaylist.com, provided us with a better option. Looking at the network traffic from the site turned up a promising API call.

The 'last' parameter appears to be a Unix epoch timestamp in milliseconds (the usual seconds value plus 3 extra digits). Using this API call and a simple crawler, we can walk back in time to gather our data. The nice part about this API is that a 'spotify_id' is returned in the response JSON object. This can be used to pull additional data in batches.
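To sanity check the millisecond epoch reading, here is a minimal sketch that converts the starting value used in the crawl into a readable UTC time and steps it back by one hour. The helper name epoch_ms_to_utc and the HOUR_MS constant are my own for illustration. They are not part of the API.

```python
from datetime import datetime, timezone

HOUR_MS = 60 * 60 * 1000  # one hour expressed in milliseconds


def epoch_ms_to_utc(epoch_ms):
    """Convert a millisecond epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


current_time = 1606559939604  # starting point used in the crawl
previous = current_time - HOUR_MS  # one step back in time

print(epoch_ms_to_utc(current_time))
print(epoch_ms_to_utc(previous))
```

The first value lands on the morning of Nov 28, 2020 (UTC), which lines up with when the data was pulled.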

current_time = 1606559939604  # epoch in milliseconds
iteration = 3600000  # one hour in milliseconds
iterations = 24 * 20  # 480 hourly steps, i.e. 20 days
sleep_time_in_seconds = 5

def get_items(epoch):
    url = f'https://xmplaylist.com/api/station/octane?last={epoch}'
    return requests.get(url).json()

playlist = []
for i in range(iterations):
    print(f'{i} @ {current_time}')
    items = get_items(current_time)
    for item in items:
        track = item['track']['name']
        artist = item['track']['artists'][0]
        # drop the trailing character (presumably 'Z') to match the format below
        start_at = datetime.strptime(
            item['start_time'][:-1],
            '%Y-%m-%dT%H:%M:%S.%f'
        )
        spotify_id = item['spotify']['spotify_id']

        playlist.append({
            'time': start_at,
            'artist': artist,
            'track': track,
            'spotify_id': spotify_id
        })

    # step back one hour and pause between requests
    current_time -= iteration
    time.sleep(sleep_time_in_seconds)

df = pd.DataFrame(playlist).set_index('time')
# overlapping windows can return the same play twice; keep the first copy
df = df[~df.index.duplicated(keep='first')]
df.head(n=10)
DataFrame
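The crawl logic is easier to test when the network call is swapped out. Below is a rough sketch of the same walk-back idea with the fetch function passed in and duplicates skipped during the crawl rather than afterwards in pandas. The names crawl and fake_fetch are hypothetical, the fake response shape is simplified, and the sleep between requests is left out to keep the example short.

```python
def crawl(fetch, start_ms, step_ms, steps):
    """Walk backwards from start_ms, calling fetch(epoch_ms) once per step.

    fetch is any callable returning a list of dicts with a 'start_time' key.
    Items whose start_time was already seen are skipped.
    """
    seen = set()
    rows = []
    cursor = start_ms
    for _ in range(steps):
        for item in fetch(cursor):
            key = item['start_time']
            if key in seen:
                continue
            seen.add(key)
            rows.append(item)
        cursor -= step_ms
    return rows


# Hypothetical stand-in for the real API: each call returns two plays,
# the older of which also shows up in the next (earlier) window.
def fake_fetch(epoch_ms):
    return [
        {'start_time': epoch_ms, 'track': f'song-{epoch_ms}'},
        {'start_time': epoch_ms - 10, 'track': f'song-{epoch_ms - 10}'},
    ]


rows = crawl(fake_fetch, start_ms=100, step_ms=10, steps=3)
print([r['start_time'] for r in rows])  # [100, 90, 80, 70]
```

Passing in get_items as the fetch function (plus a sleep) would give roughly the same behavior as the loop above, assuming the real responses are keyed on 'start_time' in the same way.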

Using the Spotify API

The following sends the track IDs to the Spotify API to grab some more supporting data, e.g. popularity ranking, duration, and full track name. Please see the documentation for how to set up your own client ID and client secret to enable access.

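auth_manager = SpotifyOAuth(...<credentials>...)
spotify_api = spotipy.Spotify(auth_manager=auth_manager)

spofity_id_to_track = {}
unique_spotify_ids = df[~df.spotify_id.isna()].spotify_id.unique()

# np.array_split tolerates uneven chunks (np.split raises if the ids do not
# divide evenly); size the chunk count so each request carries at most 50 ids
n_chunks = max(1, int(np.ceil(len(unique_spotify_ids) / 50)))
for spotify_id_chunks in np.array_split(unique_spotify_ids, n_chunks):
    response = spotify_api.tracks(list(spotify_id_chunks))
    for track in response['tracks']:
        # ids Spotify cannot resolve come back as null entries
        if track is None:
            continue
        data = {
            'track': track['name'],
            'popularity': track['popularity'],
            'explicit': track['explicit'],
            'duration_ms': track['duration_ms'],
            'band': track['album']['artists'][0]['name']
        }
        spotify_id = track['id']
        spofity_id_to_track[spotify_id] = data

    # pause between batches
    time.sleep(10)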
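A note on batching: Spotify's several-tracks endpoint accepts at most 50 IDs per request, and np.split only works when the array divides evenly into the requested number of pieces. A fixed-size chunker avoids both concerns. The sketch below is one way to write it in plain Python. The chunked helper and the placeholder IDs are made up for illustration.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of ids with at most `size` items each."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


ids = [f'id{i}' for i in range(120)]  # made-up placeholder IDs
batches = list(chunked(ids))
print([len(b) for b in batches])  # [50, 50, 20]
```

With something like this, the request loop could iterate over chunked(list(unique_spotify_ids)) and pass each batch to the tracks call.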

Populate DataFrame with Spotify Data

def try_get_spotify_data(row, attribute, default_value):
    spotify_id = row['spotify_id']
    # fall back to the default when the id is missing or Spotify had no data for it
    if pd.isna(spotify_id) or spotify_id not in spofity_id_to_track:
        return default_value
    return spofity_id_to_track[spotify_id][attribute]

df['popularity'] = df.apply(
    lambda row: try_get_spotify_data(row, 'popularity', -1),
    axis=1
)
df['duration_ms'] = df.apply(
    lambda row: try_get_spotify_data(row, 'duration_ms', 0),
    axis=1
)
df['explicit'] = df.apply(
    lambda row: try_get_spotify_data(row, 'explicit', False),
    axis=1
)
# prefer Spotify's artist and track names, keeping the crawled values as fallback
df['artist'] = df.apply(
    lambda row: try_get_spotify_data(row, 'band', row['artist']),
    axis=1
)
df['track'] = df.apply(
    lambda row: try_get_spotify_data(row, 'track', row['track']),
    axis=1
)
df.head(n=10)
Final Pandas DataFrame
