Bag of Words via Python

Slaps Lab
2 min read · Nov 15, 2020

This method allows us to focus on the occurrence of each term in a corpus. The ordering of the terms is lost during the transformation.

corpus = [
'You do not want to use ... tasks, just not deep learning.',
'It’s always a ... our data before we get started plotting.',
'The problem is supervised text classification problem.',
'Our goal is ... learning methods are best suited to solve it.'
]

Step 1: Set up a simple method to clean documents and terms.

def parse_document(document):
    def parse_term(term):
        for char_to_replace in [ '.', ',' ]:
            term = term.replace(char_to_replace, '')
        return term

    return [
        parse_term(term)
        for term in document.lower().split(' ')
    ]
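As a quick sanity check, the cleaner lowercases the document and strips periods and commas from each term. The sentence below is a made-up example, not one of the corpus documents:

```python
def parse_document(document):
    def parse_term(term):
        # strip the punctuation we care about
        for char_to_replace in ['.', ',']:
            term = term.replace(char_to_replace, '')
        return term

    return [
        parse_term(term)
        for term in document.lower().split(' ')
    ]

# hypothetical sample sentence, not from the corpus above
print(parse_document('Deep learning, or not.'))
# → ['deep', 'learning', 'or', 'not']
```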

Step 2: Get term occurrences.

import itertools

all_terms = []
for terms in [ parse_document(document) for document in corpus ]:
    all_terms.extend(terms)

BOW = dict([
    (k, len(list(g)))
    for k, g in itertools.groupby(sorted(all_terms), key=lambda x: x)
])
BOW

## results
{
'a': 1,
'always': 1,
'are': 2,
'before': 1,
'best': 1,
'classification': 1,
'data': 1,
'deep': 1,
...
}
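For reference, the standard library's collections.Counter produces the same counts in one step, without the sort that itertools.groupby requires. This is an equivalent sketch, not the method used above, and it runs on a hypothetical mini-corpus since the article's corpus strings are truncated:

```python
from collections import Counter

def parse_document(document):
    # same cleaning as above: lowercase, strip periods and commas
    return [
        term.replace('.', '').replace(',', '')
        for term in document.lower().split(' ')
    ]

# hypothetical mini-corpus standing in for the truncated one above
corpus = ['deep learning is fun.', 'not deep, not shallow.']

all_terms = []
for document in corpus:
    all_terms.extend(parse_document(document))

BOW = dict(Counter(all_terms))
print(BOW['deep'])  # → 2
```

Counter counts in a single pass over unsorted input, which is why it avoids the `sorted(...)` call the groupby version needs.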

Text Compression

This is a simple text compression method that is often useful when working with other, more complex algorithms. In this method, we define a vocabulary of unique terms, which can easily be built from the previous Bag of Words work.

vocab = list(BOW.keys())
vocab

## results
[
'a',
'always',
'are',
'before',
'best',
'classification',
'data',
'deep',
'do',
'examine',
'fine',
'for',
'get'
...
]

The resulting lookup contains, for each term in each document, the index of that term in the unique vocabulary list. Once completed, the vocab becomes secondary and can technically be ignored until we need a reference back to the actual terms; just make sure you save off the correct ordering of the vocabulary.

document_to_term_lookup = [
    [
        vocab.index(term)
        for term in parse_document(document)
    ]
    for document in corpus
]
document_to_term_lookup

## results
[
[43, 8, 25, 40, 38, 39, ..., 22, 21, 33, 20, 25, 7, 21],
[19, 1, 0, 14, 15, 38, 9, 26, 6, 3, 41, 12, 30, 27],
[35, 28, 17, 32, 34, 5, 28],
[26, 13, 17, 38, 16, 42, 32, 22, 21, 24, 2, 4, 31, 38, 29, 18]
]
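One practical note: list.index is a linear scan, so a term-to-index dict makes the encoding step O(1) per term, and the saved vocab list decodes the indices back to terms. A round-trip sketch under the same assumptions as before (hypothetical mini-corpus, same cleaning):

```python
def parse_document(document):
    # same cleaning as above: lowercase, strip periods and commas
    return [
        term.replace('.', '').replace(',', '')
        for term in document.lower().split(' ')
    ]

# hypothetical mini-corpus standing in for the truncated one above
corpus = ['deep learning is fun.', 'not deep, not shallow.']

# build the vocab in first-seen order, then invert it for O(1) lookups
vocab = []
for document in corpus:
    for term in parse_document(document):
        if term not in vocab:
            vocab.append(term)
term_to_index = {term: i for i, term in enumerate(vocab)}

encoded = [
    [term_to_index[term] for term in parse_document(document)]
    for document in corpus
]

# decode: the vocab list is the reference back to the actual terms
decoded = [[vocab[i] for i in doc] for doc in encoded]
print(decoded == [parse_document(d) for d in corpus])  # → True
```

The decode step is what the article means by needing "a reference back to the actual term": as long as the vocab ordering is preserved, the integer lists are fully reversible.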

The full notebook can be found here.
