Adventures in building custom datasets via Web Scraping / Spotify API: Octane Playlist (37) Sirius Radio Edition

This is a quick write-up on creating a custom dataset from Octane’s song playlist over a finite period of time. Octane is part of Sirius Radio. We first try to scrape the data but ultimately resort to finding an API to crawl through. Finally, we use the Spotify API to pull down supporting data.

Setup

import os
import re
import requests
import spotipy
from spotipy.oauth2 import SpotifyOAuth

First Attempt: Web Scraping

  • For the first attempt, I decided to try to scrape a web page that listed what seemed like a day’s worth of songs. The results seemed fine, but I wanted more data. When I first tried this out, I used the Search endpoint of the Spotify API to pull down track information.
playlist = []
content = requests.get(
    'http://www.radiowavemonitor.com/pub_charts/diaries.aspx?IDDS=9508'
).text
RadioWaveMonitor: Scraping Results
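
The Search-endpoint lookup mentioned above could look roughly like the sketch below. It is an illustration rather than the code used for this post: `build_query` and `find_track` are hypothetical helpers, `client` is assumed to be a `spotipy.Spotify` instance (or anything with a compatible `search` method), and the artist and title are assumed to have been extracted from the scraped page already.

```python
def build_query(artist, title):
    """Build a Spotify search query using the track: and artist: field filters."""
    return f"track:{title} artist:{artist}"


def find_track(client, artist, title):
    """Return the top search hit as a small dict, or None when nothing matches.

    `client` is assumed to behave like spotipy.Spotify, whose search()
    returns a dict shaped like {'tracks': {'items': [...]}}.
    """
    response = client.search(q=build_query(artist, title), type="track", limit=1)
    items = response.get("tracks", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    # Keep only the fields used later in this write-up.
    return {
        "spotify_id": top["id"],
        "name": top["name"],
        "popularity": top.get("popularity"),
        "duration_ms": top.get("duration_ms"),
    }
```

Taking only the first hit is a simplification; a text search can return a live version or a cover, which is one reason an API that hands back a Spotify id directly is more attractive.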

Second Attempt: API

The next site provided us with a better option. Looking at the site’s network traffic revealed a promising API call.

The parameter appears to be a Unix epoch timestamp with three extra digits for milliseconds. Using this API call and a simple crawler, we can walk back in time to gather our data. The nice part about this API is that a ‘spotify_id’ is returned in the response JSON object, which lets us pull additional data from Spotify in batches.

current_time = 1606559939604  # starting point: epoch time in milliseconds
iteration = 3600000  # one hour in milliseconds
DataFrame
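
The walk-back crawl described above can be sketched as follows. This is a minimal version under stated assumptions: `crawl_back` and `fetch_page` are hypothetical names, the real endpoint is not reproduced here, and each returned play is assumed to be a dict carrying a 'timestamp' key alongside 'spotify_id'.

```python
def crawl_back(fetch_page, start_ms, step_ms, steps):
    """Walk backwards in time from start_ms, collecting unique plays.

    fetch_page is a hypothetical callable: it takes an epoch timestamp in
    milliseconds and returns a list of dicts describing the songs played
    around that time.
    """
    seen = set()
    rows = []
    for i in range(steps):
        timestamp_ms = start_ms - i * step_ms
        for play in fetch_page(timestamp_ms):
            # Assumed keys; adjust to whatever the real response contains.
            key = (play.get("timestamp"), play.get("spotify_id"))
            if key in seen:  # neighbouring windows can repeat a play
                continue
            seen.add(key)
            rows.append(play)
    return rows
```

In practice `fetch_page` would wrap a `requests.get(...).json()` call against the endpoint found in the network traffic, and the returned list of dicts can be handed straight to the pandas DataFrame constructor. Pausing briefly between requests is a reasonable courtesy to the site being crawled.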

Using the Spotify API

The following bounces the track IDs off the Spotify API to grab some more supporting data, such as popularity ranking, duration, and full track name. Please see the documentation for how to set up your own client ID and client secret to enable access.

auth_manager = SpotifyOAuth(...<credentials>...)
spotify_api = spotipy.Spotify(auth_manager=auth_manager)
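
Fetching the tracks in batches might look like the sketch below. It is not the exact code behind this post: `chunked` and `build_track_lookup` are hypothetical helpers, `client` is assumed to behave like the `spotify_api` object above, and the batch size of 50 reflects the limit of Spotify's several-tracks endpoint. Missing ids are assumed to be None or empty strings.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_track_lookup(client, spotify_ids, batch_size=50):
    """Map each Spotify track id to its track object, fetching in batches.

    `client` is assumed to behave like spotipy.Spotify, whose tracks()
    method returns {'tracks': [...]} for a list of ids.
    """
    # Drop missing ids and duplicates while keeping first-seen order.
    unique_ids = list(dict.fromkeys(i for i in spotify_ids if i))
    lookup = {}
    for batch in chunked(unique_ids, batch_size):
        response = client.tracks(batch)
        for track in response["tracks"]:
            if track is not None:  # ids Spotify cannot resolve may come back as None
                lookup[track["id"]] = track
    return lookup
```

Deduplicating first matters here because a radio playlist repeats songs often, so the number of API calls depends on the number of distinct tracks rather than the number of plays.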

Populate DataFrame with Spotify Data

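def try_get_spotify_data(row, attribute, default_value):
    # Rows without a Spotify id fall back to the supplied default.
    spotify_id = row['spotify_id']
    if spotify_id is None:
        return default_value
    # spofity_id_to_track is assumed to map each id to its Spotify track object.
    return spofity_id_to_track[spotify_id][attribute]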
Final Pandas DataFrame
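
One possible way to apply a helper like try_get_spotify_data to every row is sketched below on plain dictionaries, so it runs without pandas. The names `get_track_attribute`, `enrich_rows`, and `lookup` are illustrative. Unlike the original helper, this variant takes the lookup as an argument and also falls back to the default when an id is absent from it.

```python
def get_track_attribute(row, lookup, attribute, default_value=None):
    """Return one attribute of the row's track, or default_value if unavailable."""
    spotify_id = row.get("spotify_id")
    track = lookup.get(spotify_id) if spotify_id is not None else None
    if track is None:
        return default_value
    return track.get(attribute, default_value)


def enrich_rows(rows, lookup, attributes):
    """Return copies of the rows with one extra key per requested attribute."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for attribute in attributes:
            new_row[attribute] = get_track_attribute(row, lookup, attribute)
        enriched.append(new_row)
    return enriched
```

With a DataFrame, the same helper could be applied row by row, for example with `df.apply(lambda row: get_track_attribute(row, lookup, 'popularity'), axis=1)`, assigning the result to a new column.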
Photo by OCV PHOTO on Unsplash

