Adventures in building custom datasets via Web Scraping / Spotify API: Octane Playlist (37) Sirius Radio Edition

Nov 28, 2020

This is a quick write-up on creating a custom dataset from the song playlist of Octane, a Sirius Radio channel, over a finite period of time. We first try to scrape the data from a web page, but ultimately switch to crawling an API. Finally, we use the Spotify API to pull down supporting data.

Setup

import os
import re
import time
from datetime import datetime

import numpy as np
import pandas as pd
import requests
import spotipy
from bs4 import BeautifulSoup
from spotipy.oauth2 import SpotifyOAuth

First Attempt: Web Scraping

For the first attempt, I decided to scrape a web page that listed what seemed like a day's worth of songs. The results seemed fine, but I wanted more data. When I first tried this out, I used the Search endpoint of the Spotify API to pull down track information.

playlist = []
content = requests.get(
    'http://www.radiowavemonitor.com/pub_charts/diaries.aspx?IDDS=9508'
).text

bs = BeautifulSoup(content, 'html.parser')
main = bs.find_all('div', {'class': 'column_3'})[0]
rows = main.find_all('div', {'class': 'row_80'})
for row in rows:
    # Artist cell: keep the text before the first '-', drop parenthesized notes
    info = row.find('div', {'class': 'row_83'}).getText().split('-')
    artist = re.sub(r'\(.+?\)', '', info[0]).strip()
    # Track title cell
    info = row.find('div', {'class': 'row_82'}).getText()
    track = info.strip()
    # Timestamp cell: collapse whitespace before parsing
    start_at = re.sub(
        r'\s+',
        ' ',
        row.find('div', {'class': 'row_84'}).getText()
    ).strip()
    # %I (12-hour clock) is needed for %p (AM/PM) to take effect; with %H it is ignored
    start_at = datetime.strptime(start_at, '%m/%d/%Y %I:%M:%S %p')
    playlist.append({
        'time': start_at,
        'artist': artist,
        'track': re.sub(r'\s+', ' ', track)
    })

df = pd.DataFrame(playlist).set_index('time')
df.head(n=10)
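
One detail in the timestamp parsing is worth a quick check. In Python's strptime, the %p (AM/PM) directive only changes the parsed hour when the hour itself is read with %I (12-hour clock); paired with %H it is silently ignored. The sketch below illustrates the difference on a made-up timestamp string (the sample value is an assumption, not taken from the site), using the same whitespace normalization as the scraper.

```python
import re
from datetime import datetime

# Hypothetical raw cell text; the real page's exact formatting may differ.
raw = "  11/27/2020 \n   1:05:09 PM "

# Collapse runs of whitespace, as the scraper does before parsing.
cleaned = re.sub(r"\s+", " ", raw).strip()

# %H reads the hour on a 24-hour clock, so the trailing "PM" has no effect.
with_24h = datetime.strptime(cleaned, "%m/%d/%Y %H:%M:%S %p")

# %I reads the hour on a 12-hour clock, so "PM" shifts it to the afternoon.
with_12h = datetime.strptime(cleaned, "%m/%d/%Y %I:%M:%S %p")

print(with_24h.hour)  # 1
print(with_12h.hour)  # 13
```

If the scraped times all land in the first half of the day, this mismatch is the first thing I would look at.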

Second Attempt: API

The next site, xmplaylist.com, provided a better option. Inspecting the site's network traffic turned up a promising API call.

The parameter appears to be a Unix epoch timestamp in milliseconds (epoch seconds plus three digits for milliseconds). Using this API call and a simple crawler, we can walk back in time to gather our data. The nice part about this API is that a ‘spotify_id’ is returned in the response JSON object, which lets us pull additional data in batches.
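
To make the parameter concrete, here is a small sketch using only the standard library. It converts an epoch-millisecond value to a readable datetime and generates the cursor values for an hourly walk backwards. The helper names are illustrative, and treating the value as UTC is an assumption on my part.

```python
from datetime import datetime, timezone

HOUR_MS = 60 * 60 * 1000  # one hour expressed in milliseconds


def epoch_ms_to_datetime(epoch_ms):
    """Convert an epoch timestamp in milliseconds to an aware UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    # Integer arithmetic avoids float rounding on the millisecond part.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=millis * 1000
    )


def backwards_cursors(start_ms, steps, step_ms=HOUR_MS):
    """Yield `steps` cursor values, each `step_ms` earlier than the previous one."""
    for i in range(steps):
        yield start_ms - i * step_ms


start = 1606559939604  # the starting value from this write-up
print(epoch_ms_to_datetime(start).isoformat())
for cursor in backwards_cursors(start, steps=3):
    print(cursor, epoch_ms_to_datetime(cursor).strftime("%Y-%m-%d %H:%M"))
```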

current_time = 1606559939604  # starting cursor: epoch time in milliseconds
iteration = 3600000  # one hour in milliseconds
iterations = 24 * 20  # 480 hourly requests, i.e. 20 days
sleep_time_in_seconds = 5


def get_items(epoch):
    url = f'https://xmplaylist.com/api/station/octane?last={epoch}'
    return requests.get(url).json()


playlist = []
for i in range(iterations):
    print(f'{i} @ {current_time}')
    items = get_items(current_time)
    for item in items:
        track = item['track']['name']
        artist = item['track']['artists'][0]
        # Drop the trailing 'Z' before parsing the ISO-style timestamp
        start_at = datetime.strptime(
            item['start_time'][:-1],
            '%Y-%m-%dT%H:%M:%S.%f'
        )
        spotify_id = item['spotify']['spotify_id']

        playlist.append({
            'time': start_at,
            'artist': artist,
            'track': track,
            'spotify_id': spotify_id
        })
    # Step the cursor back one hour and pause between requests
    current_time -= iteration
    time.sleep(sleep_time_in_seconds)

df = pd.DataFrame(playlist).set_index('time')
# Overlapping windows can return the same play twice; keep the first copy
df = df[~df.index.duplicated(keep='first')]
df.head(n=10)
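
Because consecutive requests can return overlapping windows, the crawl needs deduplication, which the DataFrame step handles by dropping repeated index values. The same idea can be sketched without pandas or the network. The crawler below takes any `fetch` callable (a stand-in for the HTTP call, purely an assumption for illustration) and keeps only the first record seen for each start time. The fake data and record shape are simplified and do not mirror the real API response.

```python
def crawl(fetch, start_ms, steps, step_ms=3_600_000):
    """Walk backwards in time, keeping the first record per start time.

    `fetch` is any callable that takes an epoch-millisecond cursor and
    returns a list of dicts with at least a 'start_time' key.
    """
    seen = set()
    playlist = []
    cursor = start_ms
    for _ in range(steps):
        for item in fetch(cursor):
            key = item["start_time"]
            if key in seen:
                continue  # already collected from an overlapping window
            seen.add(key)
            playlist.append(item)
        cursor -= step_ms
    return playlist


# Hypothetical stand-in for the real endpoint, newest play first.
FAKE_PLAYS = [
    {"start_time": 9_000_000, "track": "Song A"},
    {"start_time": 6_000_000, "track": "Song B"},
    {"start_time": 4_000_000, "track": "Song C"},
]


def fake_fetch(cursor_ms):
    """Return up to two plays that started before the cursor."""
    earlier = [p for p in FAKE_PLAYS if p["start_time"] < cursor_ms]
    return earlier[:2]


result = crawl(fake_fetch, start_ms=10_000_000, steps=3)
print([p["track"] for p in result])
```

Passing the fetch function in as an argument also makes it easy to swap in the real request later without touching the loop.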

Using the Spotify API

The following sends the track IDs to the Spotify API to grab more supporting data, such as popularity ranking, duration, and full track name. Please see the Spotify documentation for how to set up your own client ID and client secret to enable access.

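auth_manager = SpotifyOAuth(...<credentials>...)
spotify_api = spotipy.Spotify(auth_manager=auth_manager)

spofity_id_to_track = {}
unique_spotify_ids = df[~df.spotify_id.isna()].spotify_id.unique()

# np.array_split tolerates uneven chunks (np.split raises unless the array divides evenly).
# Size the chunks so each request stays within Spotify's 50-ID limit for tracks.
n_chunks = max(1, int(np.ceil(len(unique_spotify_ids) / 50)))
for spotify_id_chunk in np.array_split(unique_spotify_ids, n_chunks):
    response = spotify_api.tracks(list(spotify_id_chunk))
    for track in response['tracks']:
        if track is None:  # IDs Spotify cannot resolve come back as null
            continue
        data = {
            'track': track['name'],
            'popularity': track['popularity'],
            'explicit': track['explicit'],
            'duration_ms': track['duration_ms'],
            'band': track['album']['artists'][0]['name']
        }
        spotify_id = track['id']
        spofity_id_to_track[spotify_id] = data
    time.sleep(10)  # pause between batches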
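
Spotify's Get Several Tracks endpoint accepts at most 50 IDs per request, so what matters for batching is the size of each batch rather than the number of batches. A minimal sketch of fixed-size batching in plain Python follows; `chunked` is a hypothetical helper name, and the IDs are placeholders.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs standing in for real Spotify track IDs.
ids = [f"id{n}" for n in range(120)]

batches = list(chunked(ids, 50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch could then be handed to spotipy's `tracks` method, with a pause between calls.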

Populate DataFrame with Spotify Data

def try_get_spotify_data(row, attribute, default_value):
    """Look up a Spotify attribute for a row, falling back to a default."""
    spotify_id = row['spotify_id']
    # Missing IDs may be None or NaN; IDs Spotify did not return are absent from the dict
    if pd.isna(spotify_id) or spotify_id not in spofity_id_to_track:
        return default_value
    return spofity_id_to_track[spotify_id][attribute]


df['popularity'] = df.apply(
    lambda row: try_get_spotify_data(row, 'popularity', -1),
    axis=1
)

df['duration_ms'] = df.apply(
    lambda row: try_get_spotify_data(row, 'duration_ms', 0),
    axis=1
)

df['explicit'] = df.apply(
    lambda row: try_get_spotify_data(row, 'explicit', False),
    axis=1
)

# Prefer Spotify's artist and track names, keeping the crawled values as fallbacks
df['artist'] = df.apply(
    lambda row: try_get_spotify_data(row, 'band', row['artist']),
    axis=1
)

df['track'] = df.apply(
    lambda row: try_get_spotify_data(row, 'track', row['track']),
    axis=1
)

df.head(n=10)
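
A lookup like this has two edge cases worth guarding against: a missing ID may appear as None or as NaN, and an ID may be absent from the mapping altogether. The sketch below isolates that logic in plain Python so it can be checked without a DataFrame. The function name and sample data are made up for illustration.

```python
import math


def lookup_attribute(id_to_track, spotify_id, attribute, default):
    """Return a track attribute, or `default` when it cannot be resolved."""
    # Missing IDs can show up as None or as a float NaN.
    if spotify_id is None:
        return default
    if isinstance(spotify_id, float) and math.isnan(spotify_id):
        return default
    track = id_to_track.get(spotify_id)
    if track is None:
        return default  # ID is not present in the mapping
    return track.get(attribute, default)


# Made-up mapping in the same shape as the one built from Spotify.
sample = {"abc": {"popularity": 61, "duration_ms": 215000}}

print(lookup_attribute(sample, "abc", "popularity", -1))         # 61
print(lookup_attribute(sample, None, "popularity", -1))          # -1
print(lookup_attribute(sample, float("nan"), "popularity", -1))  # -1
print(lookup_attribute(sample, "zzz", "duration_ms", 0))         # 0
```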
