Adventures in building custom datasets via Web Scraping / Spotify API — Octane Playlist (37) Sirius Radio Edition
This is a quick write-up in which we build a custom dataset from Octane's song playlist over a finite period of time. Octane is part of Sirius Radio. We first try to scrape the data, but ultimately switch to crawling an API we found. Finally, we use the Spotify API to pull down supporting data.
Setup
import os
import re
import requests
import numpy as np
import pandas as pd
import time
from datetime import datetime
from bs4 import BeautifulSoup
import spotipy
from spotipy.oauth2 import SpotifyOAuth
First Attempt — Web Scraping
- For the first attempt, I decided to scrape a web page that listed what seemed like a day's worth of songs. The results seemed fine, but I wanted more data. When I first tried this out, I used the Search endpoint of the Spotify API to pull down track information.
playlist = []
content = requests.get(
    'http://www.radiowavemonitor.com/pub_charts/diaries.aspx?IDDS=9508'
).text
bs = BeautifulSoup(content, 'html.parser')
main = bs.find_all('div', {'class': 'column_3'})[0]
rows = main.find_all('div', {'class': 'row_80'})
for row in rows:
    # Keep the text before the first '-' and drop any parenthesized part
    info = row.find('div', {'class': 'row_83'}).getText().split('-')
    artist = re.sub(r'\(.+?\)', '', info[0]).strip()
    info = row.find('div', {'class': 'row_82'}).getText()
    track = info.strip()
    # Collapse runs of whitespace before parsing the timestamp
    start_at = re.sub(
        r'\s+',
        ' ',
        row.find('div', {'class': 'row_84'}).getText()
    ).strip()
    # Assuming the site reports 12-hour times: %p (AM/PM) only takes
    # effect when the hour is parsed with %I, not %H
    start_at = datetime.strptime(start_at, '%m/%d/%Y %I:%M:%S %p')
    playlist.append({
        'time': start_at,
        'artist': artist,
        'track': re.sub(r'\s+', ' ', track)
    })

df = pd.DataFrame(playlist).set_index('time')
df.head(n=10)
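The timestamp handling in the scraper is the easiest part to get subtly wrong, so here is a minimal standalone sketch of just that step. The sample cell text is hypothetical (I am assuming the page shows 12-hour times with an AM/PM marker), and `parse_start_time` is a helper name made up for this illustration.

```python
import re
from datetime import datetime


def parse_start_time(raw: str) -> datetime:
    """Collapse whitespace, then parse a 12-hour timestamp with AM/PM."""
    cleaned = re.sub(r'\s+', ' ', raw).strip()
    # %I reads a 12-hour clock, which is what lets %p (AM/PM) adjust the hour.
    return datetime.strptime(cleaned, '%m/%d/%Y %I:%M:%S %p')


# Hypothetical cell text, padded with the kind of whitespace HTML often carries.
sample = '\n   11/28/2020   01:05:30 PM  \n'
parsed = parse_start_time(sample)
print(parsed.isoformat())  # 2020-11-28T13:05:30
```

If the hour were parsed with `%H` instead, Python would accept the string but ignore the AM/PM marker, so every afternoon song would be stamped twelve hours early.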
Second Attempt — API
The next site provided us with a better option. Looking at the site's network traffic, we found a promising API call.
The ‘last’ parameter appears to be a Unix epoch timestamp in milliseconds (the final 3 digits are the milliseconds). Using this API call and a simple crawler, we can walk back in time to gather our data. The nice part about this API is that a ‘spotify_id’ is returned in the response JSON object, which we can use to pull additional data in batches.
current_time = 1606559939604
iteration = 3600000  # one hour in milliseconds
iterations = 24 * 20  # 480 hourly steps, i.e. 20 days
sleep_time_in_seconds = 5


def get_items(epoch):
    url = f'https://xmplaylist.com/api/station/octane?last={epoch}'
    return requests.get(url).json()


playlist = []
for i in range(iterations):
    print(f'{i} @ {current_time}')
    items = get_items(current_time)
    for item in items:
        track = item['track']['name']
        artist = item['track']['artists'][0]
        # [:-1] drops the final character (presumably a 'Z' suffix)
        start_at = datetime.strptime(
            item['start_time'][:-1],
            '%Y-%m-%dT%H:%M:%S.%f'
        )
        spotify_id = item['spotify']['spotify_id']
        playlist.append({
            'time': start_at,
            'artist': artist,
            'track': track,
            'spotify_id': spotify_id
        })
    # Step back one hour and pause between requests
    current_time -= iteration
    time.sleep(sleep_time_in_seconds)

df = pd.DataFrame(playlist).set_index('time')
# Keep only the first row for any start time that appears more than once
df = df[~df.index.duplicated(keep='first')]
df.head(n=10)
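To make the "walk back in time" idea concrete, here is a small sketch of how the cursor values relate to real timestamps. It does not call the API; `ms_to_datetime` and `crawl_cursors` are hypothetical helper names, and I am assuming the value is plain Unix epoch milliseconds in UTC.

```python
from datetime import datetime, timedelta, timezone

HOUR_MS = 3_600_000  # one hour expressed in milliseconds
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=epoch_ms)


def crawl_cursors(start_ms: int, steps: int, step_ms: int = HOUR_MS):
    """Yield the cursor value for each request, newest first."""
    for i in range(steps):
        yield start_ms - i * step_ms


# Show the first three cursors next to their human-readable time.
for cursor in crawl_cursors(1606559939604, steps=3):
    print(cursor, ms_to_datetime(cursor).isoformat())
```

Under that assumption, the starting value used in the crawl corresponds to 28 November 2020 at 10:38:59 UTC, and each request looks one hour further back.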
Using the Spotify API
The following sends the track IDs to the Spotify API in order to grab some more supporting data, such as popularity ranking, duration, and full track name. Please see the Spotify documentation for how to set up your own client ID and secret to enable access.
auth_manager = SpotifyOAuth(...<credentials>...)
spotify_api = spotipy.Spotify(auth_manager=auth_manager)

spofity_id_to_track = {}
unique_spotify_ids = df[~df.spotify_id.isna()].spotify_id.unique()
# Request fixed-size batches: np.split raises unless the count divides
# evenly, and the tracks endpoint accepts at most 50 ids per call
batch_size = 50
for start in range(0, len(unique_spotify_ids), batch_size):
    spotify_id_chunks = unique_spotify_ids[start:start + batch_size]
    response = spotify_api.tracks(list(spotify_id_chunks))
    for track in response['tracks']:
        data = {
            'track': track['name'],
            'popularity': track['popularity'],
            'explicit': track['explicit'],
            'duration_ms': track['duration_ms'],
            'band': track['album']['artists'][0]['name']
        }
        spotify_id = track['id']
        spofity_id_to_track[spotify_id] = data
    # Pause between batches to stay well clear of rate limits
    time.sleep(10)
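Batching is the part worth isolating here, since Spotify's "Get Several Tracks" endpoint takes at most 50 IDs per request. Below is a minimal sketch of splitting an arbitrary list of IDs into batches of that size. The `chunked` helper and the placeholder IDs are made up for illustration; nothing here calls the API.

```python
from typing import Iterator, List, Sequence

MAX_IDS_PER_REQUEST = 50  # limit of Spotify's "Get Several Tracks" endpoint


def chunked(ids: Sequence[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` ids."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Placeholder ids, not real Spotify track ids.
ids = [f'id_{n}' for n in range(120)]
print([len(batch) for batch in chunked(ids)])  # [50, 50, 20]
```

Splitting by batch size rather than by a fixed number of chunks means the code keeps working whether the crawl returns 30 unique tracks or 3,000.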
Populate DataFrame with Spotify Data
def try_get_spotify_data(row, attribute, default_value):
    spotify_id = row['spotify_id']
    # Fall back when the id is missing (None/NaN) or Spotify returned nothing for it
    if pd.isna(spotify_id) or spotify_id not in spofity_id_to_track:
        return default_value
    return spofity_id_to_track[spotify_id][attribute]


df['popularity'] = df.apply(
    lambda row: try_get_spotify_data(row, 'popularity', -1),
    axis=1
)
df['duration_ms'] = df.apply(
    lambda row: try_get_spotify_data(row, 'duration_ms', 0),
    axis=1
)
df['explicit'] = df.apply(
    lambda row: try_get_spotify_data(row, 'explicit', False),
    axis=1
)
# For artist and track, prefer Spotify's value and keep the crawled one otherwise
df['artist'] = df.apply(
    lambda row: try_get_spotify_data(row, 'band', row['artist']),
    axis=1
)
df['track'] = df.apply(
    lambda row: try_get_spotify_data(row, 'track', row['track']),
    axis=1
)
df.head(n=10)
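The fallback behaviour is easier to see without pandas in the way, so here is a rough sketch of the same lookup using plain dictionaries. All names and values are hypothetical stand-ins: `tracks_by_id` plays the role of the Spotify lookup table, and `rows` stands in for the crawled playlist.

```python
import math


def is_missing(value) -> bool:
    """True for None or a float NaN, the two ways a missing id can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def lookup(tracks_by_id, spotify_id, attribute, default):
    """Return the Spotify attribute for an id, or the default when unavailable."""
    if is_missing(spotify_id):
        return default
    return tracks_by_id.get(spotify_id, {}).get(attribute, default)


# Hypothetical lookup table and crawled rows.
tracks_by_id = {'abc123': {'popularity': 61, 'band': 'Example Band'}}
rows = [
    {'artist': 'Crawled Artist 1', 'spotify_id': 'abc123'},
    {'artist': 'Crawled Artist 2', 'spotify_id': None},
    {'artist': 'Crawled Artist 3', 'spotify_id': float('nan')},
    {'artist': 'Crawled Artist 4', 'spotify_id': 'not-in-table'},
]
for row in rows:
    print(lookup(tracks_by_id, row['spotify_id'], 'band', row['artist']))
```

Only the first row picks up Spotify's value; the other three keep the crawled artist name, which is the behaviour we want for songs Spotify could not match.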