Bag of Words via Python

corpus = [
'You do not want to use ... tasks, just not deep learning.',
'It’s always a ... our data before we get started plotting.',
'The problem is supervised text classification problem.',
'Our goal is ... learning methods are best suited to solve it.'
]
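
def parse_document(document):
    # Strip periods and commas from a single term.
    def parse_term(term):
        for char_to_replace in ['.', ',']:
            term = term.replace(char_to_replace, '')
        return term

    # Lowercase the document, split it on single spaces, and clean each term.
    return [
        parse_term(term)
        for term in document.lower().split(' ')
    ]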
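
import itertools

# Flatten the parsed documents into a single list of terms.
all_terms = []
for terms in [parse_document(document) for document in corpus]:
    all_terms.extend(terms)

# groupby only groups adjacent equal items, so the terms are sorted first.
BOW = dict([
    (k, len(list(g)))
    for k, g in itertools.groupby(sorted(all_terms))
])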
BOW

## results
{
    'a': 1,
    'always': 1,
    'are': 2,
    'before': 1,
    'best': 1,
    'classification': 1,
    'data': 1,
    'deep': 1,
    ...
}
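
As a cross-check, the same term counts can be produced with `collections.Counter` from the standard library. This is a minimal sketch rather than the approach used above. The names `tokenize` and `sample_corpus` are hypothetical and introduced only for illustration; `tokenize` mirrors `parse_document` (lowercase, split on single spaces, strip periods and commas), and the second sample sentence is made up.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
sample_corpus = [
    'The problem is supervised text classification problem.',
    'The data is text, not images.',
]

def tokenize(document):
    """Lowercase, split on single spaces, and strip '.' and ',' from each term."""
    return [
        term.replace('.', '').replace(',', '')
        for term in document.lower().split(' ')
    ]

# Counter tallies every term across all documents in one pass,
# so no sorting or grouping is needed.
bag_of_words = Counter(
    term
    for document in sample_corpus
    for term in tokenize(document)
)

print(bag_of_words['problem'])  # 2
print(bag_of_words['text'])     # 2
```

A `Counter` behaves like a dictionary, so it could stand in for `BOW` in later steps. Its keys follow first-seen order rather than sorted order, so sort them if a stable vocabulary order matters.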

Text Compression

This is a simple text compression method that is often useful when preparing text for more complex algorithms. Define a vocabulary of unique terms, then represent each document as a list of indices into that vocabulary. The vocabulary is easy to build from the Bag of Words dictionary above.

vocab = list(BOW.keys())
vocab

## results
[
    'a',
    'always',
    'are',
    'before',
    'best',
    'classification',
    'data',
    'deep',
    'do',
    'examine',
    'fine',
    'for',
    'get',
    ...
]

document_to_term_lookup = [
    [
        vocab.index(term)
        for term in parse_document(document)
    ]
    for document in corpus
]
document_to_term_lookup

## results
[
    [43, 8, 25, 40, 38, 39, ..., 22, 21, 33, 20, 25, 7, 21],
    [19, 1, 0, 14, 15, 38, 9, 26, 6, 3, 41, 12, 30, 27],
    [35, 28, 17, 32, 34, 5, 28],
    [26, 13, 17, 38, 16, 42, 32, 22, 21, 24, 2, 4, 31, 38, 29, 18]
]
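
Note that `vocab.index(term)` scans the list on every call, which gets slow as the vocabulary grows. One possible refinement, sketched below on the assumption that the vocabulary fits in memory, is to precompute a term-to-index dictionary. The sketch also decodes the indices back into terms, showing that the encoding is reversible at the term level (the original punctuation and casing are still lost). The names `term_to_index`, `encode`, `decode`, and the small `sample_vocab` are hypothetical and not part of the code above.

```python
# Hypothetical small vocabulary, sorted like the keys of BOW above.
sample_vocab = ['classification', 'is', 'problem', 'supervised', 'text', 'the']

# Map each term to its position once, so each lookup is a dictionary access
# instead of a scan through the list.
term_to_index = {term: index for index, term in enumerate(sample_vocab)}

def encode(terms):
    """Replace each term with its index in the vocabulary."""
    return [term_to_index[term] for term in terms]

def decode(indices):
    """Map indices back to their terms."""
    return [sample_vocab[index] for index in indices]

terms = ['the', 'problem', 'is', 'supervised', 'text', 'classification', 'problem']
encoded = encode(terms)

print(encoded)                   # [5, 2, 1, 3, 4, 0, 2]
print(decode(encoded) == terms)  # True
```

With the real data, `encode(parse_document(document))` would play the role of the inner list comprehension in `document_to_term_lookup`.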
Photo by Yue Iris on Unsplash

Slaps Lab

Focused on generating original, compelling, short stories through the use of Artificial Intelligence.