This is a quick write up where we go through creating a custom dataset based on Octane’s song playlist over a finite period of time. Octane is a SiriusXM channel. We first try to scrape the data but ultimately find an API to crawl through instead. Finally, we use the Spotify API to pull down supporting data.

import os
import re
import requests
import numpy as np
import pandas as pd
import time
from datetime import datetime
from bs4 import BeautifulSoup
import spotipy
from spotipy.oauth2 import SpotifyOAuth

- For the first attempt, I decided to try and scrape a web page that listed what seemed like a day’s worth of songs. The results seemed fine, but I wanted more data. When I first tried this out, I used the *Search* endpoint on the Spotify API to pull down track information. …
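That lookup might be sketched like this — a minimal sketch, not the post’s actual code; the `build_query`/`find_track` helpers are my own assumptions, and `sp` would be an authenticated `spotipy.Spotify` client:

```python
def build_query(artist, title):
    # Field-filtered query string for the Spotify Search endpoint.
    return f'artist:{artist} track:{title}'

def find_track(sp, artist, title):
    # `sp` is an authenticated spotipy.Spotify client; return the
    # first matching track, or None when nothing comes back.
    results = sp.search(q=build_query(artist, title), type='track', limit=1)
    items = results['tracks']['items']
    return items[0] if items else None
```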

Quick write up on using *CountVectorizer* and *TruncatedSVD* from the sklearn library to compute Document-Term and Term-Topic matrices. After setting up our model, we try it out on simple, never-before-seen documents in order to label them.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    'Basketball is my favorite sport.',
    'Football is fun to play.',
    'IBM and GE are companies.']

cv = CountVectorizer()
bow = cv.fit_transform(documents)

n_topics = 2
tsvd = TruncatedSVD(n_topics)

- Using this helper to simplify viewing a document-topic matrix:

def set_topics(df, n_topics):
    topics = list(range(n_topics))
    df.columns = [f'topic_{t}' for t in topics]
    for t in range(n_topics):
        df[f'topic_{t}'] = df[f'topic_{t}'].abs().round(2) …
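Labeling a never-before-seen document can then be sketched by projecting it into the fitted topic space. This is a minimal sketch of that idea; the `new_doc` example sentence is my own, not the post’s:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    'Basketball is my favorite sport.',
    'Football is fun to play.',
    'IBM and GE are companies.']

cv = CountVectorizer()
bow = cv.fit_transform(documents)

tsvd = TruncatedSVD(n_components=2)
tsvd.fit(bow)

# Project a never-before-seen document into the fitted topic space.
new_doc = ['Hockey is a fun sport to play.']
topic_scores = np.abs(tsvd.transform(cv.transform(new_doc)))
label = int(topic_scores.argmax(axis=1)[0])  # strongest topic index
```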

Quick, simple write up on using PCA to reduce word-embedding dimensions down to 2D so we can visualize them in a scatter plot.

import matplotlib.pyplot as plt
import gensim.downloader as api
from sklearn.decomposition import PCA

transformer = api.load('glove-twitter-100')

## add your own terms here
terms = [
    'great',
    'good',
    'ok',
    'worst',
    'bad',
    'awful',
    'normal',
    'fine',
    'better',
    'best']

embeddings = [transformer[term] for term in terms]

pca = PCA(n_components=2)
data = pca.fit_transform(embeddings).transpose()
x, y = data[0], data[1]

fig, ax = plt.subplots(figsize=(15, 8))
ax.scatter(x, y, c='g')
for i, term in enumerate(terms):
    ax.annotate(term, (x[i], y[i]))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

This is a quick, end-to-end write up where I go through parsing a movie script from the web. We start with HTML and end up with an ordered CSV of lines per actor. Parsing and cleaning data does not have to be something we dread.

url: http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html

*Warning: this post assumes you have some basic knowledge of Python, text preprocessing, and feature generation.*

import os
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

Lots of libraries already exist to help build efficient pipelines. I, however, went with a simple, custom approach. This gave me complete control and ended up being pretty simple to set up and debug. Basically, each step in the process (Pipe) gets cached after completion. This allowed me to break up a large script into single stages so that I could easily debug the final output and quickly determine the area that needed more focus. …
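A cached stage in that spirit might look like this. This is my own sketch of the idea, not the post’s implementation; the `pipe` helper, the `cache` directory name, and the pickle-based cache format are all assumptions:

```python
import os
import pickle

def pipe(name, func, data, cache_dir='cache'):
    """Run one pipeline stage; reuse the cached result if it exists."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f'{name}.pkl')
    if os.path.exists(path):
        # Stage already completed on a previous run: skip recomputation.
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = func(data)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

Stages chain naturally: `cleaned = pipe('clean', clean_html, raw)`, then `lines = pipe('split', split_lines, cleaned)`, and deleting one cache file reruns only that stage onward.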

Markov chains are considered “memoryless”: the next state depends only on the current state. Using this property, we can build a basic text generator, where the next word in our sequence depends only on the prior word selected. The transition between these two terms is based on the probabilities observed in the data.

*This write up assumes you have a decent understanding of the topics covered.*

Navigating to ESPN, I grabbed the first article that was shown. I did a previous write up on how to scrape the text from the HTML response. See the link below for that information. This is only an example but feel free to pull down more articles. …
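The generator itself can be sketched in a few lines — a minimal bigram chain, my own sketch rather than the post’s code:

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words observed to follow it;
    # duplicates preserve the observed transition probabilities.
    chain = defaultdict(list)
    words = text.split()
    for prior, nxt in zip(words, words[1:]):
        chain[prior].append(nxt)
    return chain

def generate(chain, start, length=10, seed=None):
    # Walk the chain: each next word depends only on the prior word.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = chain.get(out[-1])
        if not options:
            break
        out.append(rng.choice(options))
    return ' '.join(out)
```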

This write up is meant to simulate a situation in which you already have a developed vocab but you are presented with documents that contain terms found outside of it. Here we show how word embeddings and cosine similarity could be used to help recommend possible transformations from our unseen words to words contained in our vocab. To accomplish this, we will be using gensim, sklearn, and nltk libraries to build a simple term transformation recommender.

*This write up assumes you have a decent understanding of the topics covered.*

import re
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem…
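The recommendation step can be sketched with plain vectors standing in for a loaded embedding model. The `recommend` helper and the toy 2-d vectors below are my own assumptions, not the post’s code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(unseen_term, vocab, embeddings, top_n=1):
    # `embeddings` maps term -> vector (e.g. a loaded gensim model,
    # here stood in for by a plain dict of toy vectors).
    target = np.asarray(embeddings[unseen_term]).reshape(1, -1)
    vocab_vectors = np.asarray([embeddings[t] for t in vocab])
    scores = cosine_similarity(target, vocab_vectors)[0]
    ranked = sorted(zip(vocab, scores), key=lambda p: -p[1])
    return ranked[:top_n]

# Toy 2-d vectors; 'greatest' sits outside the vocab.
toy = {'good': [1.0, 0.0], 'bad': [-1.0, 0.0], 'greatest': [0.9, 0.1]}
suggestion = recommend('greatest', ['good', 'bad'], toy)[0][0]
```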

Zachary’s karate club is a widely used dataset [1] which originated from the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne Zachary [2]. The paper was published in 1977.

This dataset will be used to explore four widely used **node** centrality metrics (Degree, Eigenvector, Closeness, and Betweenness) using the Python library NetworkX.

*Warning: This social network is not a directed graph. Computing directed-graph centrality metrics will not be covered here.*

import networkx as nx

G = nx.karate_club_graph()

## #nodes: 34 and #edges: 78
print('#nodes:', len(G.nodes()), 'and', '#edges:', len(G.edges()))

The degree of a node is simply defined as the number of connecting edges that it has. The node ‘33’ has 17 edges connecting it to other nodes in the network, which results in a degree of 17. To determine the degree centrality, the degree of a node is divided by the number of other nodes in the network (n - 1). Continuing with node ‘33’, 17 / (34 - 1) results in 0.5152. Remember from above, the number of nodes in the dataset is 34. …
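The hand computation can be checked against NetworkX directly:

```python
import networkx as nx

G = nx.karate_club_graph()

# Degree centrality by hand for node 33: degree / (n - 1).
n = G.number_of_nodes()          # 34
manual = G.degree[33] / (n - 1)  # 17 / 33

# NetworkX applies the same normalization to every node.
centrality = nx.degree_centrality(G)
print(round(manual, 4), round(centrality[33], 4))  # → 0.5152 0.5152
```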

This method allows us to focus on the occurrence of a term in a corpus. The ordering of the terms is lost during this transformation.

corpus = [
    'You do not want to use ... tasks, just not deep learning.',
    'It’s always a ... our data before we get started plotting.',
    'The problem is supervised text classification problem.',
    'Our goal is ... learning methods are best suited to solve it.'
]

**Step 1:** Setup a simple method to clean documents and terms.

def parse_document(document):
    def parse_term(term):
        for char_to_replace in ['.', ',']:
            term = term.replace(char_to_replace, '')
        return term
    return [
        parse_term(term)
        for term in document.lower().split(' …
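The counting that follows the parsing step can be sketched with a plain `Counter`. The two-sentence toy corpus below is my own, not the post’s:

```python
from collections import Counter

def parse_document(document):
    # Mirrors the cleaning step above: lowercase, strip '.' and ','.
    def parse_term(term):
        for char_to_replace in ['.', ',']:
            term = term.replace(char_to_replace, '')
        return term
    return [parse_term(term) for term in document.lower().split(' ')]

# Count term occurrences across the corpus; word order is discarded.
corpus = ['The problem is a text problem.', 'A supervised problem.']
counts = Counter(term for doc in corpus for term in parse_document(doc))
```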

A Document-Term Matrix is used as a starting point for a number of NLP tasks. This short write up shows how to use the sklearn and NLTK Python libraries to construct frequency and binary versions.

import re
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

documents = [

'Mom took us shopping today and got a bunch of stuff. I love shopping with her.',

"Friday wasn't a great day.",

'She gave me a beautiful bunch of violets.',

"Dad attested, they're a bunch of bullies.",

'Mom hates bullies.',

'A bunch of people confirm it.',

'Taking pity on the sad flowers, she bought a bunch before continuing on her journey home.…'

Writing regex expressions to extract data can be tedious, time-consuming, and annoying. At some point, you may reach a breaking point and start to wonder if the process could be automated. I recently reached this point and decided to build a simple regex expression generator.

I ended up experimenting with a lot of different solutions but ultimately settled on Hill Climbing. I decided to use a static ending for the regex expression. The process would then try to build out a unique expression for each value that matched that static ending.

Example:

Min: 1
Mean: 10
Max: 15

~ static ending = 10, "possible" regex expression -> /ea[a-z]…
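A toy version of that hill-climbing loop might look like this — my own sketch, with a made-up fitness function and candidate set, not the post’s implementation:

```python
import random
import re

def score(pattern, positives, negatives):
    # Fitness: full matches on target values minus false matches.
    try:
        rx = re.compile(pattern)
    except re.error:
        return float('-inf')
    return (sum(bool(rx.fullmatch(s)) for s in positives)
            - sum(bool(rx.fullmatch(s)) for s in negatives))

def hill_climb(positives, negatives, static_ending, steps=200, seed=0):
    # Start from the static ending alone; randomly propose a longer
    # prefix and keep it only when the fitness score improves.
    rng = random.Random(seed)
    pieces = ['[a-z]', '[a-z]+', '[0-9]', '[0-9]+', '.']
    best = static_ending
    best_score = score(best, positives, negatives)
    for _ in range(steps):
        candidate = rng.choice(pieces) + best
        candidate_score = score(candidate, positives, negatives)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

expr = hill_climb(['ea10', 'aa10'], ['15'], '10')
```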
