Using Word Embeddings to Help Bridge Different Sets of Vocabulary

This write-up simulates a situation in which you already have a developed vocabulary but are presented with documents that contain terms falling outside of it. Here we show how word embeddings and cosine similarity could be used to recommend possible transformations from unseen words to words already contained in our vocab. To accomplish this, we will use the gensim, sklearn, and nltk libraries to build a simple term transformation recommender.

** This write-up assumes you have a decent understanding of the topics covered.

1. Setup
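A minimal setup sketch is below. The article only names gensim, sklearn, and nltk; the specific pretrained model ("glove-wiki-gigaword-100") and the nltk downloads are my assumptions, not details from the original.

```python
# Minimal setup sketch. The write-up only names gensim, sklearn, and nltk;
# the pretrained model below ("glove-wiki-gigaword-100") and the specific
# nltk downloads are assumptions, not taken from the original.
import gensim.downloader as api
import nltk

nltk.download("punkt")      # tokenizer models used later during tokenization
nltk.download("stopwords")  # common words to drop from the bag of words

# Any gensim KeyedVectors model works here; GloVe is used for illustration.
embeddings = api.load("glove-wiki-gigaword-100")
```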

2. Dataset

I ended up pulling 14 ESPN articles that came through their RSS feed. The first 11 articles made up my base set, while the other 3 represent possible hold-out documents. The details on how to pull these articles from ESPN can be found in a previous write-up, linked below.

** Links to these 14 articles can be found at the end.
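A hypothetical loader for this split might look like the following; the `articles/` directory and file names are placeholders rather than the actual paths used.

```python
# Hypothetical loader for the 14 articles. The directory and file names are
# placeholders -- the actual articles were pulled from ESPN's RSS feed as
# described in the earlier write-up.
from pathlib import Path

articles = [
    Path(f"articles/espn_{i:02d}.txt").read_text(encoding="utf-8")
    for i in range(1, 15)
]

set1_docs = articles[:11]   # base set: the vocab we already maintain
set2_docs = articles[11:]   # hold-out documents with possibly unseen terms
```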

3. Bag of Words

  • We convert each set to a bag-of-words representation. This lets us quickly pull out each set's vocabulary and convert its terms to word embeddings (a sketch follows below).
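One way to do this, assuming sklearn's CountVectorizer for the bag-of-words step (nltk tokenization would work just as well):

```python
# Bag-of-words sketch using sklearn's CountVectorizer (an assumption; nltk
# tokenization would work just as well). We only need the fitted vocabularies.
from sklearn.feature_extraction.text import CountVectorizer


def get_vocab(docs):
    """Fit a bag-of-words model and return its vocabulary as a list of terms."""
    vectorizer = CountVectorizer(stop_words="english")
    vectorizer.fit(docs)
    return list(vectorizer.get_feature_names_out())


set1_vocab = get_vocab(set1_docs)
set2_vocab = get_vocab(set2_docs)

# Keep only terms the embedding model knows, so every term maps to a vector.
set1_vocab = [w for w in set1_vocab if w in embeddings]
set2_vocab = [w for w in set2_vocab if w in embeddings]
```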

4. Setup Nearest Neighbor for Set1
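Since this section only names the step, here is a sketch of one way to build the index with sklearn's NearestNeighbors, using cosine distance to match the similarity measure mentioned in the introduction. The `set1_vocab` and `embeddings` names carry over from the sketches above.

```python
# Fit a nearest-neighbor index over the set1 (base vocab) embeddings, using
# cosine distance to match the similarity measure mentioned in the intro.
import numpy as np
from sklearn.neighbors import NearestNeighbors

set1_vectors = np.array([embeddings[w] for w in set1_vocab])

nn = NearestNeighbors(n_neighbors=1, metric="cosine")
nn.fit(set1_vectors)
```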

5. Find “Similar” words between set2 and set1

  • We only check the closest distance in order to filter down our recommendations; however, this could easily be applied to all returned distances (see the sketch below).
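A sketch of the query step, continuing from the index built above; the 0.4 cosine-distance cutoff is purely illustrative, not a value from the original.

```python
# Query the index with every set2 term and keep only the single closest set1
# term. The 0.4 cosine-distance cutoff is an illustrative threshold, not a
# value from the original write-up.
set2_vectors = np.array([embeddings[w] for w in set2_vocab])
distances, indices = nn.kneighbors(set2_vectors)

recommendations = {}
for word, dist, idx in zip(set2_vocab, distances[:, 0], indices[:, 0]):
    match = set1_vocab[idx]
    if word != match and dist < 0.4:
        recommendations[word] = (match, round(float(dist), 3))

# Closest matches first, so the most promising transformations are reviewed first.
for unseen, (known, dist) in sorted(recommendations.items(), key=lambda kv: kv[1][1]):
    print(f"{unseen:>15} -> {known:<15} (cosine distance {dist})")
```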

6. Results

We will need to go through these results manually to confirm which transformations make sense. The accepted transformations can then be applied during our tokenization process.
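As a sketch, the reviewed pairs could live in a plain substitution map applied inside the tokenizer. The example pairs below are made-up placeholders, not the article's actual results.

```python
# Once the recommendations have been reviewed by hand, the accepted pairs can
# sit in a simple substitution map applied inside the tokenizer. The pairs
# below are made-up placeholders, not the article's actual results.
from nltk.tokenize import word_tokenize

transformations = {"qb": "quarterback", "td": "touchdown"}


def tokenize(text):
    """Tokenize a document and map unseen terms onto the existing vocab."""
    tokens = word_tokenize(text.lower())
    return [transformations.get(tok, tok) for tok in tokens]
```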
