Text is often littered with words that share a meaning but have alternate spellings. This problem extends beyond contractions to character names, spelling mistakes, abbreviations, etc. This post is about creating a ‘preprocessing’ component for the spaCy pipeline that normalizes text in order to create consistency across multiple documents. Here, contractions will be the main focus.
This approach can easily be adapted to correct common spelling mistakes or even change character names in a short story. It also does not need to live in a pipeline.
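As a hypothetical sketch (the names and the tiny substitution table below are illustrative, not from the post), the same pattern-substitution idea could drive spelling correction:

```python
import re

# Illustrative substitution table: common misspellings -> corrections.
# A character-name mapping for a short story would look the same.
CORRECTIONS = {
    r"\brecieve\b": "receive",
    r"\bteh\b": "the",
}

def correct_spelling(text: str) -> str:
    # apply each regex substitution in turn, ignoring case
    for pattern, replacement in CORRECTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

correct_spelling("Did you recieve teh letter?")
## 'Did you receive the letter?'
```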
The Procedure
- Create a method to expand contractions based on the English contractions list from Wikipedia.
- Build a component and insert it into the spaCy pipeline.
- Compare word frequencies between a control and a test group in order to visualize the impact.
Expand Contractions
The basic approach was to group the different contraction forms and run regex substitutions over the text. It is important to run the shorter forms last to avoid mistaken conversions. Once the method was complete, I grabbed the before-and-after examples from Wikipedia and set up a basic test to ensure that at least the base case works. Any future issues that crop up with the regex can simply be added to the test text.
import re

def expand_contractions(text: str) -> str:
    flags = re.IGNORECASE | re.MULTILINE

    text = re.sub(
        r"\b(can)'?t\b",
        'can not', text,
        flags=flags
    )

    return text

expand_contractions("I can't wait to go!")
## 'I can not wait to go!'
- The actual method can be found in the full notebook.
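For illustration, here is a minimal sketch of that ordering with only a few patterns (the full method in the notebook covers the complete Wikipedia list): the specific forms run first, and a generic `n't` rule runs last so it cannot clobber them.

```python
import re

def expand_contractions(text: str) -> str:
    flags = re.IGNORECASE | re.MULTILINE
    # specific forms first
    text = re.sub(r"\b(can)'?t\b", "can not", text, flags=flags)
    text = re.sub(r"\b(won)'?t\b", "will not", text, flags=flags)
    # generic "n't" runs last to avoid mistaken conversions
    text = re.sub(r"\b(\w+)n't\b", r"\1 not", text, flags=flags)
    return text

# base-case test built from Wikipedia-style before/after pairs
assert expand_contractions("I can't, won't, didn't do it!") == \
    "I can not, will not, did not do it!"
```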
Set Up the spaCy Component
import spacy
from spacy.tokens import Doc
from spacy.language import Language

class ExpandContractionsComponent(object):
    name = "expand_contractions"
    nlp: Language

    def __init__(self, nlp: Language):
        self.nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        text = doc.text
        return self.nlp.make_doc(expand_contractions(text))
spaCy Pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    ExpandContractionsComponent(nlp),
    before='tagger'
)

nlp.pipeline  ## to view the addition
## [
##     ('expand_contractions', <__main__.ExpandContractionsComponent>),
##     ('tagger', <spacy.pipeline.pipes.Tagger>),
##     ('parser', <spacy.pipeline.pipes.DependencyParser>),
##     ('ner', <spacy.pipeline.pipes.EntityRecognizer>)
## ]
Results
I used the basic sentence “I can’t, cant, cannot, won’t, wont do it!” and computed word-frequency counts after loading the document.
Control Group (w/out expanded contractions component):
{
'I': 1,
'ca': 2,
"n't": 2,
',': 4,
'nt': 2,
'can': 1,
'not': 1,
'wo': 2,
'do': 1,
'it': 1,
'!': 1
}
Test Group (w/ expanded contractions component):
{
'I': 1,
'can': 3,
'not': 5,
',': 4,
'will': 2,
'do': 1,
'it': 1,
'!': 1
}