Using Word Embeddings to help bridge different sets of vocab
This write up simulates a situation in which you already have an established vocab but are presented with documents containing terms that fall outside of it. Here we show how word embeddings and cosine similarity can be used to recommend possible transformations from unseen words to words already in our vocab. To accomplish this, we will use the gensim, sklearn, and nltk libraries to build a simple term transformation recommender.
** This write up assumes you have a decent understanding of the topics covered.
1. Setup
import re

import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer

import gensim.downloader as api

information = api.info()
transformer = api.load('glove-twitter-100')
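Before moving on, it can help to sanity check the loaded model directly. The snippet below is a quick illustration, not part of the recommender itself, and the terms are only examples; it shows that gensim exposes the cosine similarity between any two in-vocab terms.

## Hedged sanity check: example terms only, not part of the pipeline.
## similarity() returns the cosine similarity between two term vectors,
## so related words should score noticeably higher than unrelated ones.
print(transformer.similarity('monday', 'tuesday'))
print(transformer.similarity('monday', 'basketball'))

## most_similar() runs the same cosine search over the full GloVe vocab
print(transformer.most_similar('denver', topn=5))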
2. Dataset
I ended up pulling 14 ESPN articles that came through their RSS feed. The first 11 articles make up my base set, while the other 3 represent possible hold-out documents. The details on how to pull these articles from ESPN can be found in a previous write up, linked below.
** Links to these 14 articles can be found at the end.
set1 = [...] ## 11 articles
set2 = [...] ## 3 articles
3. Bag of Words
- We convert each set to a bag of words representation. This allows us to quickly pull out our vocab and convert it to word embeddings.
class Preprocessor(object):
    def __call__(self, document):
        document = document.lower()
        document = re.sub(r'[\n"]', '', document)  ## drop newlines and double quotes
        document = re.sub(r'[-]', ' ', document)   ## split hyphenated terms apart
        return document


class Tokenizer(object):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, articles):
        return [
            self.lemmatizer.lemmatize(term)
            for term in word_tokenize(articles)
            if term.isalpha()
        ]


def get_bow(documents):
    cv = CountVectorizer(
        preprocessor=Preprocessor(),
        tokenizer=Tokenizer(),
        stop_words=set(stopwords.words('english')),
        min_df=2
    )
    doc_term_matrix = pd.DataFrame(
        data=cv.fit_transform(documents).toarray(),
        columns=cv.get_feature_names()
    )
    doc_term_matrix = doc_term_matrix.transpose()
    return doc_term_matrix.sum(axis=1).to_dict()


bow_set1 = get_bow(set1)
set1_vocab = [
    term
    for term in list(bow_set1.keys())
    if term in transformer.vocab
]
set1_embeddings = [
    transformer[term]
    for term in set1_vocab
]

bow_set2 = get_bow(set2)
set2_vocab = [
    term
    for term in list(bow_set2.keys())
    if term in transformer.vocab
]
set2_embeddings = [
    transformer[term]
    for term in set2_vocab
]
4. Setup Nearest Neighbor for Set1
nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(set1_embeddings)
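Before running the full comparison, it may help to see what a single query against the fitted index returns. The sketch below is only illustrative and assumes 'thursday' is in the GloVe vocab; kneighbors returns distances sorted in ascending order, which is why the next step can read the closest match from column 0.

## Hedged example: 'thursday' is only an illustrative query term.
## kneighbors returns two (n_queries, n_neighbors) arrays; distances are
## sorted ascending, so column 0 always holds the closest set1 term.
query = [transformer['thursday']]
dist, idx = nn.kneighbors(query, 5)
print([set1_vocab[j] for j in idx[0]])  ## the 5 nearest set1 terms
print(dist[0][0])                       ## cosine distance to the closest one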
5. Find “Similar” words between set2 and set1
- We are only checking the closest distance to filter down our recommendations. However, this could easily be applied to all returned distances (a sketch of this follows the loop below).
threshold = .25
n = len(set2_embeddings)
distances, indices = nn.kneighbors(set2_embeddings, 5)

searches = []
for i in range(n):
    distance = distances[i][0]
    if distance == 0:
        ## the term already exists in set1's vocab
        continue
    if distance < threshold:
        closest_terms = map(lambda j: set1_vocab[j], indices[i])
        message = f'{set2_vocab[i]} -> {list(closest_terms)}'
        searches.append(
            (round(distance, 2), message)
        )

sorted(searches, key=lambda v: v[0])
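As noted above, a hedged variation of the same loop could score every returned neighbor instead of only the closest one, still using the same threshold and variables (the results in the next section come from the closest-only version above):

## Hedged variation: consider all 5 returned neighbors, not just the closest.
searches_all = []
for i in range(n):
    for rank in range(len(distances[i])):
        distance = distances[i][rank]
        if distance == 0 or distance >= threshold:
            continue
        message = f'{set2_vocab[i]} -> {set1_vocab[indices[i][rank]]}'
        searches_all.append((round(distance, 2), message))

sorted(searches_all, key=lambda v: v[0])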
6. Results
[
(0.03, "thursday -> ['wednesday', 'friday', 'monday', 'early', 'next']"),
(0.1, "lot -> ['much', 'many', 'mean', 'think', 'really']"),
(0.11, "better -> ['way', 'good', 'think', 'well', 'could']"),
(0.11, "let -> ['take', 'tell', 'go', 'see', 'need']"),
(0.12, "give -> ['take', 'need', 'get', 'tell', 'want']"),
(0.13, "trying -> ['could', 'going', 'think', 'might', 'would']"),
(0.15, "denver -> ['detroit', 'chicago', 'dallas', 'houston', 'memphis']"),
(0.15, "keep -> ['stay', 'need', 'still', 'take', 'make']"),
(0.15, "start -> ['starting', 'going', 'next', 'time', 'still']"),
(0.16, "seems -> ['really', 'yet', 'though', 'might', 'think']"),
(0.17, "call -> ['tell', 'talk', 'know', 'say', 'take']"),
(0.17, "different -> ['many', 'way', 'thing', 'think', 'people']"),
(0.17, "especially -> ['though', 'people', 'many', 'thing', 'least']"),
(0.18, "along -> ['around', 'way', 'together', 'right', 'still']"),
(0.18, "help -> ['need', 'find', 'want', 'could', 'take']"),
(0.18, "orlando -> ['dallas', 'miami', 'houston', 'chicago', 'angeles']"),
(0.18, "several -> ['three', 'two', 'including', 'four', 'many']"),
(0.18, "shoot -> ['shot', 'shooting', 'take', 'hit', 'see']"),
(0.18, "stuff -> ['thing', 'something', 'really', 'though', 'everything']"),
(0.19, "finish -> ['finished', 'half', 'set', 'round', 'work']"),
(0.19, "huge -> ['big', 'biggest', 'another', 'great', 'large']"),
(0.19, "ready -> ['going', 'go', 'coming', 'get', 'next']"),
(0.19, "robert -> ['john', 'james', 'edward', 'paul', 'george']"),
(0.2, "leaving -> ['going', 'left', 'coming', 'already', 'getting']")
]
We still need to manually go through these results to confirm which transformations make sense. Once confirmed, the transformations can be applied during our tokenization process.
transformations = {
'huge': 'big',
'leaving': 'going',
...
}
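One possible way to apply the mapping, sketched below under the assumption that the transformations dict is passed into the tokenizer used by CountVectorizer, is to look up each lemma and fall back to the lemma itself when no transformation exists. The class name TransformingTokenizer is just an illustrative choice; it could stand in for Tokenizer inside get_bow.

## Hedged sketch: fold the transformations into the tokenizer from section 3.
class TransformingTokenizer(object):
    def __init__(self, transformations):
        self.lemmatizer = WordNetLemmatizer()
        self.transformations = transformations

    def _transform(self, term):
        ## map unseen terms onto in-vocab terms, otherwise keep the lemma
        lemma = self.lemmatizer.lemmatize(term)
        return self.transformations.get(lemma, lemma)

    def __call__(self, articles):
        return [
            self._transform(term)
            for term in word_tokenize(articles)
            if term.isalpha()
        ]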
## Articles used from ESPN
urls = [
'https://www.espn.com/nba/story/_/id/30321409/sources-los-angeles-lakers-talks-acquire-dennis-schroder-oklahoma-city-thunder',
'https://www.espn.com/nba/story/_/id/30322395/cleveland-cavaliers-kevin-porter-jr-arrested-weapon-charge',
'https://www.espn.com/nba/story/_/id/30310367/udonis-haslem-returning-18th-season-miami-heat',
'https://www.espn.com/nba/story/_/id/30306095/lamelo-ball-holds-workout-front-warriors-hornets-pistons',
'https://www.espn.com/nba/story/_/id/30310273/report-miami-heat-hiring-caron-butler-assistant-coach',
'https://www.espn.com/nba/story/_/id/30316444/chicago-bulls-coach-billy-donovan-adds-hall-famer-maurice-cheeks-staff',
'https://www.espn.com/nba/story/_/id/30162802/2020-nba-free-agency-trades-latest-buzz-news-reports',
'https://www.espn.com/nba/story/_/id/30299175/how-lsu-guard-skylar-mays-turned-tragedy-fuel-dreams-being-selected-2020-nba-draft',
'https://www.espn.com/nba/story/_/id/30300355/nba-draft-2020-no-1-pick-timberwolves-see-opportunity-pressure',
'https://www.espn.com/nba/story/_/id/30303622/becky-hammon-art-possible',
'https://www.espn.com/nba/story/_/id/30311714/canadian-officials-concerned-raptors-cross-border-travel',
'https://www.espn.com/nba/story/_/id/30221719/giannis-antetokounmpo-future-all-star-trades-nba-draft-everything-else-watch-offseason',
'https://www.espn.com/nba/story/_/id/30174755/nba-schedule-debate-winners-losers-nba-pre-christmas-start',
'https://www.espn.com/nba/story/_/id/30264861/nbpa-reps-approve-dec-22-start-date-72-game-regular-season'
]