Latent semantic analysis — LSA via Sklearn
Quick write-up on using CountVectorizer and TruncatedSVD from the Sklearn library to build a document-term matrix and factor it into document-topic and term-topic matrices. After setting up our model, we try it out on simple, never-before-seen documents in order to label them.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    'Basketball is my favorite sport.',
    'Football is fun to play.',
    'IBM and GE are companies.'
]

# Bag-of-words counts, then reduce the counts to 2 latent topics.
cv = CountVectorizer()
bow = cv.fit_transform(documents)

n_topics = 2
tsvd = TruncatedSVD(n_topics)
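Before factoring anything, it can help to sanity-check what CountVectorizer produced. A minimal sketch (the vocab name is just for illustration): get_feature_names_out returns the learned vocabulary, and toarray densifies the sparse count matrix.
# One row per document, one column per vocabulary term.
vocab = cv.get_feature_names_out()
pd.DataFrame(bow.toarray(), columns=vocab)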
Helper Methods
- using these to simplify viewing the document-topic and term-topic matrices
def set_topics(df, n_topics):
    # Label columns topic_0 .. topic_{n-1}; show absolute, rounded weights.
    topics = list(range(n_topics))
    df.columns = [f'topic_{t}' for t in topics]
    for t in range(n_topics):
        df[f'topic_{t}'] = df[f'topic_{t}'].abs().round(2)
    return df

def svd_to_pandas(svd_results, n_topics):
    # Wrap raw SVD output in a DataFrame with labeled topic columns.
    df = pd.DataFrame(svd_results)
    df = set_topics(df, n_topics)
    return df
Document-Topic
# Fit the SVD and project the training documents into topic space.
matrix = tsvd.fit_transform(bow)
doc_to_topic = svd_to_pandas(matrix, n_topics)
doc_to_topic

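Each row above is a training document and each column a (sign-stripped, rounded) topic weight. For a rough sense of how much of the count matrix two topics capture, TruncatedSVD exposes explained variance after fitting; a quick sketch:
# Fraction of variance each latent topic explains, and the total.
print(tsvd.explained_variance_ratio_)
print(tsvd.explained_variance_ratio_.sum())
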
Term-Topic
# tsvd.components_ is (n_topics, n_terms); transpose to get one row per term.
term_to_topic = pd.DataFrame(
    tsvd.components_,
    columns=cv.get_feature_names_out()
).T

term_to_topic = set_topics(term_to_topic, n_topics)
term_to_topic

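A handy way to read the term-topic matrix is to list the strongest terms per topic. A minimal sketch using the term_to_topic frame from above (top_n is an illustrative parameter, not part of the original write-up):
# Print the highest-weighted terms for each topic.
top_n = 3
for t in range(n_topics):
    print(f'topic_{t}:')
    print(term_to_topic[f'topic_{t}'].nlargest(top_n))
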
Hold-Out Documents
hold_outs = [
    'The basketball game is Friday.',  # expect Topic 0
    'IBM stock stinks lately.'         # expect Topic 1
]

# Reuse the fitted vectorizer and SVD on unseen documents.
hold_out_bow = cv.transform(hold_outs)
svd_to_pandas(tsvd.transform(hold_out_bow), n_topics)

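To actually label the hold-outs, one simple option is to take the dominant topic per row of the transformed scores. A minimal sketch, assuming the highest (absolute) weight wins:
# Pick the column (topic) with the largest weight for each document.
scores = svd_to_pandas(tsvd.transform(hold_out_bow), n_topics)
labels = scores.idxmax(axis=1)
for doc, label in zip(hold_outs, labels):
    print(f'{label}: {doc}')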