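A document-term matrix (DTM) is the starting point for many NLP tasks: each row is a document, each column is a term, and each cell records whether or how often that term occurs in that document. This short write-up shows how to use the scikit-learn and NLTK Python libraries to construct frequency and binary versions.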
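To make the difference between the frequency and binary versions concrete, here is a minimal pure-Python sketch. It is only an illustration of the idea, not the scikit-learn approach this write-up uses: the toy corpus, the whitespace tokenization, and the names `docs`, `vocab`, `freq_dtm`, and `binary_dtm` are all assumptions made up for this example.

```python
from collections import Counter

# Toy corpus (hypothetical; separate from the documents used in this write-up).
docs = ["mom hates bullies", "a bunch of bullies", "bunch after bunch"]

# Naive whitespace tokenization, assumed here purely for illustration.
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique terms, one matrix column per term.
vocab = sorted({term for tokens in tokenized for term in tokens})

# Frequency DTM: cell (i, j) is how many times term j occurs in document i.
counts = [Counter(tokens) for tokens in tokenized]
freq_dtm = [[c[term] for term in vocab] for c in counts]

# Binary DTM: cell (i, j) is 1 if term j occurs in document i at all, else 0.
binary_dtm = [[1 if n > 0 else 0 for n in row] for row in freq_dtm]

print(vocab)
for freq_row, bin_row in zip(freq_dtm, binary_dtm):
    print(freq_row, bin_row)
```

The two matrices differ only where a term repeats within a document: in the third toy document, "bunch" has a count of 2 in the frequency version and 1 in the binary version.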
1. Setup Libraries
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'Mom took us shopping today and got a bunch of stuff. I love shopping with her.',
    "Friday wasn't a great day.",
    'She gave me a beautiful bunch of violets.',
    "Dad attested, they're a bunch of bullies.",
    'Mom hates bullies.',
    'A bunch of people confirm it.',
    'Taking pity on the sad flowers, she bought a bunch before continuing on her journey home.'
]

stop_words = set(stopwords.words('english'))