Part-of-Speech Tagging with PowerShell
When analyzing text, a common goal is to identify the parts of speech within that text – what parts are nouns? Adjectives? Verbs in their gerund form?
To accomplish this goal, the field of natural language processing in computer science has developed systems for part-of-speech tagging, or “POS tagging”. The acronym preceded the version in Urban Dictionary :)
The version I used at university was a Perl-based Brill Tagger, but things have advanced quite a bit – and the Stanford NLP group has done a great job implementing a Java version with C# wrappers, available here:
https://nlp.stanford.edu/software/tagger.shtml
The default English model is 97% correct on known words, and 90% correct on unknown words. “SpeechTagger” provides a PowerShell interface to this tagger.
By default, Split-PartOfSpeech outputs objects that represent words and the part of speech associated with them. The TaggerModel parameter lets you specify an alternate tagger model. The Stanford Part of Speech Tagger supports models for:
- Arabic
- Chinese
- English
- French
- German
- Spanish
The -Raw parameter emits the sentence in the common text-based format for part-of-speech tagging, separating each word from its part of speech with the ‘/’ character. This is sometimes useful for regular expressions, or for adapting code you might have previously written to consume the output of other part-of-speech taggers.
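To give a feel for working with that word/tag text format, here is a minimal sketch. It does not call the module at all: the $raw string is a hand-written sample, not captured tagger output, and the tag names (DT, JJ, NN, and so on) assume the Penn Treebank tag set used by the English models. The variable names are just for illustration.

```powershell
# Hand-written sample in word/TAG format (NOT real output from the tagger).
# Tag names assume the Penn Treebank tag set.
$raw = 'The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN ./.'

# Turn each token back into an object. Split on the LAST '/' so that
# a word which itself contains '/' stays intact.
$tokens = foreach ($token in -split $raw) {
    $i = $token.LastIndexOf('/')
    [PSCustomObject]@{
        Word = $token.Substring(0, $i)
        Tag  = $token.Substring($i + 1)
    }
}

# Pick out the nouns (any tag starting with NN).
$nouns = $tokens | Where-Object { $_.Tag -like 'NN*' } | ForEach-Object Word

# Or use a regular expression directly on the raw text:
# an adjective immediately followed by a noun.
$pairs = [regex]::Matches($raw, '(\S+)/JJ\s+(\S+)/NN\S*') |
    ForEach-Object { "$($_.Groups[1].Value) $($_.Groups[2].Value)" }

$nouns
$pairs
```

The regular expression route is handy for quick pattern hunting across tags, while the object route fits better once you want to filter, group, or sort with the usual pipeline cmdlets.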
To install this project, simply run the following command from PowerShell:
Install-Module -Name SpeechTagger