Text is often littered with words that share a meaning but have alternate spellings. This problem extends beyond contractions to character names, spelling mistakes, abbreviations, etc. This post is about creating a ‘preprocessing’ component for the spaCy pipeline that normalizes text in order to create consistency across multiple documents. Here, contractions will be the main focus.
This approach can easily be adapted to correct common spelling mistakes or even change character names in a short story. It also does not need to live in a pipeline.
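As a hypothetical sketch (the names and the tiny substitution table below are illustrative, not from the post), the same pattern-substitution idea could drive spelling correction:

```python
import re

# Illustrative substitution table: common misspellings -> corrections.
# A character-name mapping for a short story would look the same.
CORRECTIONS = {
    r"\brecieve\b": "receive",
    r"\bteh\b": "the",
}

def correct_spelling(text: str) -> str:
    # apply each regex substitution in turn, ignoring case
    for pattern, replacement in CORRECTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

correct_spelling("Did you recieve teh letter?")
## 'Did you receive the letter?'
```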
The Procedure
- Create a method to expand contractions based on the English contractions list from Wikipedia.
- Build a component and insert it into the spaCy pipeline.
- Compare word frequencies between a control and a test group in order to visualize the impact.
Expand Contractions
The basic approach was to group the different contraction forms and run regex substitutions over the text. It is important to run the shorter forms last to avoid mistaken conversions. Once the method was complete, I grabbed the before-and-after examples from Wikipedia and set up a basic test to ensure that at least the base case works. Any future issues that crop up with the regex can simply be added to the test text.
import re

def expand_contractions(text: str) -> str:
    flags = re.IGNORECASE | re.MULTILINE

    text = re.sub(
        r"\b(can)'?t\b",
        'can not', text,
        flags=flags
    )

    return text

expand_contractions("I can't wait to go!")
## 'I can not wait to go!'
- The actual method can be found in the full notebook.
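For illustration, here is a minimal sketch of that ordering with only a few patterns (the full method in the notebook covers the complete Wikipedia list): the specific forms run first, and a generic `n't` rule runs last so it cannot clobber them.

```python
import re

def expand_contractions(text: str) -> str:
    flags = re.IGNORECASE | re.MULTILINE
    # specific forms first
    text = re.sub(r"\b(can)'?t\b", "can not", text, flags=flags)
    text = re.sub(r"\b(won)'?t\b", "will not", text, flags=flags)
    # generic "n't" runs last to avoid mistaken conversions
    text = re.sub(r"\b(\w+)n't\b", r"\1 not", text, flags=flags)
    return text

# base-case test built from Wikipedia-style before/after pairs
assert expand_contractions("I can't, won't, didn't do it!") == \
    "I can not, will not, did not do it!"
```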
Set Up the spaCy Component
import spacy
from spacy.tokens import Doc
from spacy.language import Language

class ExpandContractionsComponent(object):
    name = "expand_contractions"
    nlp: Language

    def __init__(self, nlp: Language):
        self.nlp = nlp

    def __call__(self, doc: Doc) -> Doc:
        text = doc.text
        return self.nlp.make_doc(expand_contractions(text))
spaCy Pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    ExpandContractionsComponent(nlp),
    before='tagger'
)

nlp.pipeline  ## to view the addition
## [
##     ('expand_contractions', <__main__.ExpandContractionsComponent>),
##     ('tagger', <spacy.pipeline.pipes.Tagger>),
##     ('parser', <spacy.pipeline.pipes.DependencyParser>),
##     ('ner', <spacy.pipeline.pipes.EntityRecognizer>)
## ]
Results
I used the basic sentence “I can’t, cant, cannot, won’t, wont do it!” and computed word-frequency counts after loading the document.
Control Group (w/out expanded contractions component):
{
'I': 1,
'ca': 2,
"n't": 2,
',': 4,
'nt': 2,
'can': 1,
'not': 1,
'wo': 2,
'do': 1,
'it': 1,
'!': 1
}
Test Group (w/ expanded contractions component):
{
'I': 1,
'can': 3,
'not': 5,
',': 4,
'will': 2,
'do': 1,
'it': 1,
'!': 1
}