Constructing a Document-Term Matrix via Sklearn, NLTK
A Document-Term Matrix is used as a starting point for a number of NLP tasks. This short write-up shows how to use the Sklearn and NLTK Python libraries to construct frequency and binary versions.
1. Setup Libraries
import re

import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer

# NLTK data used below; download once if not already present:
# nltk.download('punkt'); nltk.download('stopwords')

documents = [
    'Mom took us shopping today and got a bunch of stuff. I love shopping with her.',
    "Friday wasn't a great day.",
    'She gave me a beautiful bunch of violets.',
    "Dad attested, they're a bunch of bullies.",
    'Mom hates bullies.',
    'A bunch of people confirm it.',
    'Taking pity on the sad flowers, she bought a bunch before continuing on her journey home.'
]

stop_words = set(stopwords.words('english'))
2. Optional: Setup our Tokenizer / Preprocessor
class Preprocessor(object):
    def __call__(self, document: str):
        document = document.lower()
        # split up contractions
        document = re.sub(r"they'?re", 'they are', document)
        document = re.sub(r"wasn'?t", 'was not', document)
        return document


class Tokenizer(object):
    def __init__(self):
        self.stemmer = PorterStemmer()

    def __call__(self, document: str):
        return [
            self.stemmer.stem(term)
            for term in word_tokenize(document)
            if term.isalpha()
        ]
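As a quick sanity check, the preprocessor can be exercised on its own before wiring it into the vectorizer. The snippet below re-declares the class so it runs standalone with nothing but the standard library:

```python
import re

# Same Preprocessor as above: lowercase, then expand two contractions.
class Preprocessor(object):
    def __call__(self, document: str):
        document = document.lower()
        document = re.sub(r"they'?re", 'they are', document)
        document = re.sub(r"wasn'?t", 'was not', document)
        return document

preprocess = Preprocessor()
print(preprocess("Friday wasn't a great day."))
# friday was not a great day.
```

Note the `'?` in each pattern: it also catches the sloppy forms `theyre` / `wasnt` that show up in informal text.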
3. Document-Term ‘Frequency’ Matrix
- Column values will be the number of times a term was found in that document.
cv = CountVectorizer(
    stop_words=stop_words,
    preprocessor=Preprocessor(),
    tokenizer=Tokenizer()
)

data = cv.fit_transform(documents).toarray()
vocab = cv.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0

doc_term_matrix = pd.DataFrame(
    data=data,
    columns=vocab
).transpose()

doc_term_matrix.tail(n=8)
- The transpose() makes the documents our columns and the terms our rows. This lets us quickly query a term, for example:
doc_term_matrix.loc[['shop']]
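To make the terms-as-rows layout concrete without re-running the vectorizer, here is a toy matrix with hand-made counts (not the real output above); the `.loc` queries work the same way on it:

```python
import pandas as pd

# Toy document-term matrix, terms as rows and documents as columns --
# the same shape doc_term_matrix has after .transpose(). Counts are made up.
doc_term_matrix = pd.DataFrame(
    {'doc0': [2, 1, 0], 'doc1': [0, 1, 0], 'doc2': [0, 1, 1]},
    index=['shop', 'bunch', 'violet']
)

# Passing a list to .loc keeps the result a DataFrame (one row per term)
print(doc_term_matrix.loc[['shop']])

# Summing a term's row gives its total frequency across the corpus
print(doc_term_matrix.loc['shop'].sum())  # 2
```

Note that the index holds stemmed terms, so queries use the stem (`'shop'`, `'violet'`), not the surface form from the documents.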

4. Document-Term ‘Binary’ Matrix
- Column values will be ‘1’ when the term was found in that document, ‘0’ when not.
cv = CountVectorizer(
    stop_words=stop_words,
    preprocessor=Preprocessor(),
    tokenizer=Tokenizer(),
    binary=True
)

data = cv.fit_transform(documents).toarray()
vocab = cv.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0

doc_term_matrix = pd.DataFrame(
    data=data,
    columns=vocab
).transpose()

doc_term_matrix.tail(n=8)
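One common use of the binary version is computing document frequencies: summing each term's row tells you how many documents contain that term at least once. A minimal sketch with a hand-made toy matrix (hypothetical values, not the real output above):

```python
import pandas as pd

# Toy binary matrix, terms as rows (the post-transpose shape):
# 1 means the term appears in that document at least once, 0 means it doesn't.
binary_matrix = pd.DataFrame(
    {'doc0': [1, 1, 0], 'doc1': [0, 1, 0], 'doc2': [0, 1, 1]},
    index=['shop', 'bunch', 'violet']
)

# Row sums of a binary matrix are document frequencies -- a common
# ingredient in tf-idf style weighting schemes.
doc_freq = binary_matrix.sum(axis=1)
print(doc_freq['bunch'])  # 3
```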
