This is a quick write-up where we go through creating a custom dataset based on Octane’s song playlist over a finite period of time. Octane is a channel on SiriusXM radio. We first try to scrape the data but ultimately resort to finding an API to crawl. Finally, we use the Spotify API to pull down supporting data.


import os
import re
import requests
import numpy as np
import pandas as pd
import time
from datetime import datetime
from bs4 import BeautifulSoup
import spotipy
from spotipy.oauth2 import SpotifyOAuth

First Attempt — Web Scraping

  • For the first attempt, I decided to try and scrape…
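The scraping attempt can be sketched with BeautifulSoup. The markup below is hypothetical (the real playlist page's structure isn't shown in this excerpt), so the `#playlist` table and the class names are assumptions:

```python
import pandas as pd
from bs4 import BeautifulSoup

## hypothetical HTML standing in for the playlist page
html = '''
<table id="playlist">
  <tr><td class="artist">Metallica</td><td class="song">Lux AEterna</td></tr>
  <tr><td class="artist">Tool</td><td class="song">Schism</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.select('#playlist tr'):
    rows.append({
        'artist': tr.find('td', class_='artist').get_text(strip=True),
        'song': tr.find('td', class_='song').get_text(strip=True),
    })

## one row per played song, ready to enrich with Spotify data later
df = pd.DataFrame(rows)
print(df)
```

In practice the `html` string would come from `requests.get(url).text`, with the selectors adjusted to whatever the live page actually serves.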

Quick write-up on using CountVectorizer and TruncatedSVD from the sklearn library to compute Document-Term and Term-Topic matrices. After setting up our model, we try it out on simple, never-before-seen documents in order to label them.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
documents = [
'Basketball is my favorite sport.',
'Football is fun to play.',
'IBM and GE are companies.'
]
cv = CountVectorizer()
bow = cv.fit_transform(documents)
n_topics = 2
tsvd = TruncatedSVD(n_topics)
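Continuing the setup above, a minimal sketch of fitting the SVD and labeling a never-before-seen document by its strongest topic (the unseen sentence here is my own example, not from the post):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    'Basketball is my favorite sport.',
    'Football is fun to play.',
    'IBM and GE are companies.',
]

cv = CountVectorizer()
bow = cv.fit_transform(documents)

tsvd = TruncatedSVD(2)
doc_topic = tsvd.fit_transform(bow)   # document-topic matrix, shape (3, 2)

## project an unseen document into topic space and pick the strongest topic
unseen = cv.transform(['Basketball is a fun sport.'])
scores = tsvd.transform(unseen)[0]
print('topic:', scores.argmax())
```

The sign and ordering of SVD components aren't guaranteed, so which index lands on the "sports" topic can vary between runs.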

Helper Methods

  • Using these to simplify viewing a document-topic matrix

def set_topics(df, n_topics):
    topics = list(range(n_topics))
    df.columns = [ f'topic_{t}' for t in topics ]

Quick, simple write-up on using PCA to reduce word-embedding dimensions down to 2D so we can visualize them in a scatter plot.

1. Setup

import matplotlib.pyplot as plt
import gensim.downloader as api
from sklearn.decomposition import PCA
transformer = api.load('glove-twitter-100')

## add your own terms here
terms = [


2. Pull Embeddings

embeddings = [ transformer[term] for term in terms ]

3. Run PCA

pca = PCA(n_components=2)
data = pca.fit_transform(embeddings).transpose()
x, y = data[0], data[1]

4. Visualize

fig, ax = plt.subplots(figsize=(15, 8))
ax.scatter(x, y, c='g')
for i, term in enumerate(terms):
    ax.annotate(term, (x[i], y[i]))

This is a quick, end-to-end write-up where I go through parsing a movie script from the web. We start with HTML and end up with an ordered CSV of lines per actor. Parsing and cleaning data does not have to be something we dread.

** Warning: this post assumes you have some basic knowledge of Python, text preprocessing and feature generation.

Project Imports

import os
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

Setup Pipeline Objects

Lots of libraries already exist to help build efficient pipelines. I, however, went with a…
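The post's own pipeline code is cut off here, so the following is only a hand-rolled sketch of the idea: a pipeline as a plain list of string-to-string functions applied in order. The step names are hypothetical:

```python
import re

## each pipeline step is a plain function from string to string
def strip_html(text):
    return re.sub(r'<[^>]+>', ' ', text)

def lowercase(text):
    return text.lower()

def collapse_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

pipeline = [strip_html, lowercase, collapse_whitespace]

def run_pipeline(text, steps=pipeline):
    ## apply each step in order, feeding the output forward
    for step in steps:
        text = step(text)
    return text

print(run_pipeline('<p>INDIANA:  We   never needed a map.</p>'))
## -> 'indiana: we never needed a map.'
```

Keeping the steps as free functions makes it trivial to reorder, drop, or unit-test individual stages.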

Markov chains are considered “memoryless” because the next state depends only on the current one. Using this concept, we can build a basic text generator, where the next word in our sequence depends only on the word selected before it. The transition between these two terms will be based on probabilities observed in the data.

** This write up assumes you have a decent understanding of the topics covered.

Finding a Dataset

Navigating to ESPN, I grabbed the first article that was shown. I did a previous write up on how to scrape the text from the HTML response. See the link below…
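The bigram transitions described above can be sketched on a toy corpus (standing in here for the scraped article text):

```python
import random
from collections import defaultdict, Counter

## toy corpus standing in for the scraped ESPN article
text = 'the dog ran and the cat ran and the dog slept'
words = text.split()

## count observed word-to-word transitions (first-order Markov chain)
transitions = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    transitions[current][nxt] += 1

def generate(start, n_words, seed=1):
    ## sample each next word in proportion to its observed frequency
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_words - 1):
        options = transitions.get(out[-1])
        if not options:
            break  # dead end: the last word never had a successor
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return ' '.join(out)

print(generate('the', 8))
```

Every consecutive pair in the output is, by construction, a bigram that actually occurred in the source text.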

This write-up simulates a situation in which you already have a developed vocab but are presented with documents containing terms found outside of it. Here we show how word embeddings and cosine similarity can be used to recommend transformations from unseen words to words contained in our vocab. To accomplish this, we will use the gensim, sklearn, and nltk libraries to build a simple term-transformation recommender.

** This write up assumes you have a decent understanding of the topics covered.

1. Setup

import re
import numpy as np
import pandas as pd
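Since loading a pretrained gensim model requires a download, the sketch below uses toy three-dimensional vectors as stand-ins; in practice each vector would come from a lookup like `model[term]`. The vocab terms and the unseen word 'soccer' are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## toy embeddings standing in for a pretrained gensim model
vocab_vectors = {
    'basketball': np.array([0.9, 0.1, 0.0]),
    'football':   np.array([0.8, 0.3, 0.1]),
    'company':    np.array([0.0, 0.1, 0.9]),
}

def recommend(unseen_vector, top_n=1):
    ## rank vocab terms by cosine similarity to the unseen word's vector
    terms = list(vocab_vectors)
    matrix = np.vstack([vocab_vectors[t] for t in terms])
    sims = cosine_similarity(unseen_vector.reshape(1, -1), matrix)[0]
    ranked = sorted(zip(terms, sims), key=lambda p: -p[1])
    return ranked[:top_n]

## 'soccer' is outside the vocab; its (toy) vector sits near 'football'
soccer = np.array([0.85, 0.25, 0.05])
print(recommend(soccer))
```

The recommender simply returns the nearest in-vocab neighbors, which can then be offered as candidate transformations for the unseen term.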

Zachary’s karate club is a widely used dataset [1] which originated from the 1977 paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne Zachary [2].

This dataset will be used to explore four widely used node centrality metrics (Degree, Eigenvector, Closeness and Betweenness) using the Python library NetworkX.

Warning: This social network is an undirected graph. Computing directed-graph centrality metrics will not be covered here.

import networkx as nx
G = nx.karate_club_graph()
## #nodes: 34 and #edges: 78
print('#nodes:', len(G.nodes()), 'and', '#edges:', len(G.edges()))

Degree Centrality

The degree…
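Each of the four metrics discussed is a one-line call in NetworkX; a minimal sketch on the graph loaded above:

```python
import networkx as nx

G = nx.karate_club_graph()

## each call returns a dict of node -> centrality score
degree      = nx.degree_centrality(G)       # fraction of other nodes a node touches
eigenvector = nx.eigenvector_centrality(G)  # weight from being linked to central nodes
closeness   = nx.closeness_centrality(G)    # inverse of average shortest-path distance
betweenness = nx.betweenness_centrality(G)  # fraction of shortest paths passing through

## node 33 (the club officer) has the most connections: 17 of 33 possible
top = max(degree, key=degree.get)
print('highest degree centrality:', top, round(degree[top], 3))
```

Comparing the top-ranked nodes across the four dicts is a quick way to see how the metrics disagree about who is "central".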

This method allows us to focus on the occurrence of a term in a corpus. The ordering of the terms is lost during this transformation.

corpus = [
'You do not want to use ... tasks, just not deep learning.',
'It’s always a ... our data before we get started plotting.',
'The problem is supervised text classification problem.',
'Our goal is ... learning methods are best suited to solve it.'
]

Step 1: Setup a simple method to clean documents and terms.

def parse_document(document):
    def parse_term(term):
        for char_to_replace in [ '.', ',' ]:
            term = term.replace(char_to_replace, '')
        return term
    return …
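A complete, minimal version of this counting approach (the truncated helper above is filled in here as an assumption, not the post's exact code):

```python
from collections import Counter

corpus = [
    'You do not want to use deep learning.',
    'The problem is a supervised text classification problem.',
]

def parse_term(term):
    ## strip simple punctuation, mirroring the helper sketched above
    for char_to_replace in ['.', ',']:
        term = term.replace(char_to_replace, '')
    return term.lower()

def parse_document(document):
    return [parse_term(t) for t in document.split()]

## occurrence counts across the corpus; term ordering is discarded
counts = Counter(t for doc in corpus for t in parse_document(doc))
print(counts.most_common(3))
```

Note how 'problem' counts as 2 even though one occurrence carried a trailing period, since the cleaning step normalizes both to the same term.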

A Document-Term Matrix is used as a starting point for a number of NLP tasks. This short write-up shows how to use the sklearn and NLTK Python libraries to construct frequency and binary versions.

1. Setup Libraries

import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

documents = [
'Mom took us shopping today and got a bunch of stuff. I love shopping with her.',
"Friday wasn't a great day.",
'She gave me a beautiful bunch of violets.',
"Dad attested, they're a bunch of bullies.",
'Mom…

Writing regex expressions to extract data can be tedious, time-consuming and annoying. At some point, you may reach a breaking point and start to wonder if the process could be automated. I recently reached this point and decided to build a simple regex expression generator.

I ended up experimenting with a lot of different solutions but ultimately settled on using Hill Climbing. I decided to use a static ending for the regex expression; the process would then try to build out a unique expression for each value that matched that static ending.

Example:

Min: 1
Mean: 10
Max: 15
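A minimal hill-climbing sketch of the setup described: a literal prefix is mutated one character at a time toward each target value, with the static ending `: \d+` appended. The fitness function and parameters are my own assumptions, not the post's code.

```python
import random
import re
import string

STATIC_END = r': \d+$'
CHARS = string.ascii_letters

def fitness(prefix, target):
    ## 1.0 if the candidate regex matches the target line, plus partial
    ## credit for per-character agreement to guide the climb
    pattern = '^' + re.escape(prefix) + STATIC_END
    label = target.split(':')[0]
    partial = sum(a == b for a, b in zip(prefix, label)) / max(len(label), 1)
    return (1.0 if re.match(pattern, target) else 0.0) + partial

def hill_climb(target, steps=5000, seed=0):
    rng = random.Random(seed)
    label = target.split(':')[0]
    ## start from a random prefix, mutate one char per step, keep
    ## any candidate that scores at least as well (uphill/sideways moves)
    current = ''.join(rng.choice(CHARS) for _ in label)
    best = fitness(current, target)
    for _ in range(steps):
        i = rng.randrange(len(current))
        candidate = current[:i] + rng.choice(CHARS) + current[i + 1:]
        score = fitness(candidate, target)
        if score >= best:
            current, best = candidate, score
    return '^' + re.escape(current) + STATIC_END

for line in ['Min: 1', 'Mean: 10', 'Max: 15']:
    print(line, '->', hill_climb(line))
```

Accepting sideways moves lets the search drift across equally scored prefixes instead of getting stuck, which is usually enough for short labels like these.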

Slaps Lab

Focused on generating original, compelling, short stories through the use of Artificial Intelligence.
