Constructing a Document-Term Matrix via Sklearn, NLTK

A Document-Term Matrix is used as a starting point for a number of NLP tasks. This short write-up shows how to use the Sklearn and NLTK Python libraries to construct frequency and binary versions.

1. Set Up Libraries

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

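2. Optional: Set Up Our Tokenizer / Preprocessor

class Preprocessor(object):
    def __call__(self, document: str):
        # Normalise case so 'Shop' and 'shop' count as the same term
        document = document.lower()
        # CountVectorizer expects the preprocessor to return the cleaned string
        return document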
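
The Tokenizer passed to CountVectorizer later on is not shown in this write-up. As a rough sketch only, a callable tokenizer could look like the one below. The class body and the regular expression are assumptions for illustration, not the original implementation, and an NLTK tokenizer could be swapped in at the same place.

```python
import re


class Tokenizer(object):
    """Hypothetical stand-in for the Tokenizer used later, which is not shown."""

    # Assumption: a token is any run of lower-case letters, digits, or apostrophes
    TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

    def __call__(self, document: str):
        # Expects text that the Preprocessor has already lower-cased
        return self.TOKEN_PATTERN.findall(document)


tokens = Tokenizer()("the shop isn't open today")
print(tokens)  # ['the', 'shop', "isn't", 'open', 'today']
```

CountVectorizer calls the tokenizer once per document and only needs a list of strings back, so any callable with this shape should work.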

3. Document-Term ‘Frequency’ Matrix

  • Column values will be the number of times a term was found in that document.
cv = CountVectorizer(
    stop_words=stop_words,
    preprocessor=Preprocessor(),
    tokenizer=Tokenizer()
)
  • transpose() turns our documents into the columns and the terms into the rows. This lets us quickly query terms, for example: doc_term_matrix.loc[['shop']]
Document-Term Frequency Matrix

4. Document-Term ‘Binary’ Matrix

  • Column values will be ‘1’ when the term was found in that document and ‘0’ when it was not.
cv = CountVectorizer(
    stop_words=stop_words,
    preprocessor=Preprocessor(),
    tokenizer=Tokenizer(),
    binary=True
)
Document-Term Binary Matrix
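
To see what binary=True changes, the sketch below builds both versions from the same input and checks that the binary matrix matches the frequency matrix with every non-zero count capped at 1. The sample documents are invented for illustration, and default vectorizer settings are assumed in place of the custom Preprocessor and Tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (not from the article)
documents = [
    "The shop sells coffee and the shop sells tea",
    "Coffee from the corner shop",
]

# Same input, two vectorizers: raw counts versus presence/absence
freq = CountVectorizer().fit_transform(documents).toarray()
binary = CountVectorizer(binary=True).fit_transform(documents).toarray()

# binary=True records whether a term occurs, not how often it occurs
same_as_thresholded = (binary == (freq > 0).astype(int)).all()
print(same_as_thresholded)
print(binary)
```

The binary form suits models that only care whether a term is present, for example a Bernoulli-style classifier.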
