Can’t stand, Don’t want CONTRACTIONS! — with Spacy

Slaps Lab
Apr 15, 2020

Text is often littered with words that share the same meaning but have alternate spellings. The problem extends beyond contractions to character names, spelling mistakes, abbreviations, etc. This post is all about creating a ‘preprocessing’ component that can be added to the Spacy pipeline to normalize text and create consistency across multiple documents. In this post, contractions will be the main focus.

This approach can easily be adapted to correct common spelling mistakes or even change character names in a short story. It also is not required to live in a pipeline.
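For example, here is a minimal sketch of that idea; the normalize_names helper and the name variants in it are hypothetical and not part of the original post:

import re

def normalize_names(text: str) -> str:
    # hypothetical example: collapse alternate spellings of a character's
    # name (and a common typo) into one canonical form
    replacements = {
        r"\bJonh\b": "John",    # typo
        r"\bJohnny\b": "John",  # nickname used interchangeably
    }
    for pattern, canonical in replacements.items():
        text = re.sub(pattern, canonical, text)
    return text

normalize_names("Jonh said hi. Johnny waved back.")
## 'John said hi. John waved back.'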

The Procedure

  1. Create a method to expand contractions based on the English contractions list from wikipedia.
  2. Build component and insert into Spacy pipeline.
  3. Compare word frequencies between a control and test group in order to visualize the impact.

Expand Contractions

The basic approach was to group the different contraction forms and work through them with regular expressions, making sure to run the shorter forms last to avoid mistaken conversions (for example, handling “won’t” before any more general “n’t” rule). Once the method was complete, I grabbed the before and after pairs from the Wikipedia list and set up a basic test to at least ensure a base case works. Any future issues that crop up with the regex can simply be added to the test text.

import re

def expand_contractions(text: str) -> str:
    flags = re.IGNORECASE | re.MULTILINE

    text = re.sub(
        r"\b(can)'?t\b",
        'can not', text,
        flags=flags
    )

    return text

expand_contractions("I can't wait to go!")
## 'I can not wait to go!'
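As a rough illustration of the ordering and the test idea (the extra patterns and the assert-based checks below are a sketch, not the exact code from the notebook), the fuller method just adds more substitutions in order and verifies them against known before/after pairs:

import re

def expand_contractions(text: str) -> str:
    flags = re.IGNORECASE | re.MULTILINE

    # specific, longer forms are handled explicitly; any shorter,
    # more general rule (e.g. a bare n't pattern) would run last
    text = re.sub(r"\b(won)'?t\b", 'will not', text, flags=flags)
    text = re.sub(r"\b(can)'?t\b", 'can not', text, flags=flags)
    text = re.sub(r"\bcannot\b", 'can not', text, flags=flags)

    return text

## basic test against known before/after pairs
cases = {
    "I can't wait to go!": "I can not wait to go!",
    "He wont do it.": "He will not do it.",
    "She cannot stay.": "She can not stay.",
}

for before, after in cases.items():
    assert expand_contractions(before) == after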

Set Up Spacy Component

import spacy
from spacy.tokens import Doc
from spacy.language import Language

class ExpandContractionsComponent(object):
    name = "expand_contractions"
    nlp: Language

    def __init__(self, nlp: Language):
        self.nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        text = doc.text
        return self.nlp.make_doc(expand_contractions(text))

Spacy Pipeline

nlp = spacy.load('en_core_web_sm')

nlp.add_pipe(
    ExpandContractionsComponent(nlp),
    before='tagger'
)

nlp.pipeline  ## - to view the addition

## [
##     ('expand_contractions', <__main__.ExpandContractionsComponent>),
##     ('tagger', <spacy.pipeline.pipes.Tagger>),
##     ('parser', <spacy.pipeline.pipes.DependencyParser>),
##     ('ner', <spacy.pipeline.pipes.EntityRecognizer>)
## ]

Results

I used a basic sentence, “I can’t, cant, cannot, won’t, wont do it!”, and computed the word frequency counts after processing the document through each pipeline.
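The counting itself is not shown in the post, but it amounts to something like this sketch:

from collections import Counter

doc = nlp("I can't, cant, cannot, won't, wont do it!")
frequencies = Counter(token.text for token in doc)

dict(frequencies)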

Control Group (w/out expanded contractions component):

{
'I': 1,
'ca': 2,
"n't": 2,
',': 4,
'nt': 2,
'can': 1,
'not': 1,
'wo': 2,
'do': 1,
'it': 1,
'!': 1
}

Test Group (w/ expanded contractions component):

{
'I': 1,
'can': 3,
'not': 5,
',': 4,
'will': 2,
'do': 1,
'it': 1,
'!': 1
}

Jupyter Notebook

References

  1. https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
