The Missing Hello World for OpenNLP - OpenSource Connections

October 4, 2018 Eric Pugh
Category: Lucene

I wrote this back in 2012 for version 1.5.2-incubating, and never published it. So I’m updating it for the October 2018 version of OpenNLP, 1.9.0.

Visit http://opennlp.apache.org/ and you will discover that the Hello World for OpenNLP example is missing! Click on the Documentation link and you’re dropped into the deep end of OpenNLP coding details.

SmallBizContracting – An experiment in finding the bull-shitters in the federal space! We used OpenNLP to power our autosuggest capability. No longer online ;-(

A typical search feature is to provide some sort of “autosuggest” function. The initial approach is usually to provide snippets of results based either on a facet query or on the terms that are available. That can help you complete a single word, but it doesn’t help with multiple words. For example, if you start typing “home”, we want to suggest the concept “home health care”. However, Solr doesn’t know that this is a valid concept.

So OpenNLP to the rescue!

We are going to use OpenNLP to parse the sentence:

Avenue HomeCare, a North Carolina state licensed home health care agency of registered nurses, licensed practical nurses, certified home health aides and nursing assistants providing care for people in their own homes.

Hopefully we will pull out licensed practical nurses and home health care as logical groupings of text that should be suggested. The technique for this is called chunking: it tells you which words go together as a single chunk in a sentence.

Step 0: Download OpenNLP!

Grab a copy of OpenNLP and unzip in your working directory.

Step 1: Build a Model

The first step turns out to be the hardest step. The magic of OpenNLP comes from it being trained on carefully curated datasets to understand the structure of your text, so it can do the right things. These datasets are apparently very difficult to find, and to build your own you need between 10,000 and 50,000 rows of example data! The resulting trained model is a binary file that contains all the rules required. You can play with some published models at http://opennlp.sourceforge.net/models-1.5/; however, your mileage may vary as they are built on old datasets. Note: These are the same models as when I wrote this back in 2012!

Apparently figuring out how to build the model training dataset is the hardest part. We want to build our own copy of the en-chunker.bin available at http://opennlp.sourceforge.net/models-1.5/. Googling for “conll2000 chunking” leads us to http://www.clips.ua.ac.be/conll2000/chunking/ where we can download the training dataset:

wget https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz
gunzip train.txt.gz
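For orientation, the CoNLL-2000 file holds one token per line as three space-separated columns (word, part-of-speech tag, chunk tag), with a blank line between sentences. In the chunk column, B- marks the first token of a chunk, I- a continuation, and O a token outside any chunk. Here is a minimal Python sketch that reads that layout. The sample rows are made up for illustration rather than copied from train.txt, and read_conll is a hypothetical helper, not part of OpenNLP.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)

# Illustrative rows in the CoNLL-2000 layout (made up for this sketch).
SAMPLE = """\
licensed JJ B-NP
practical JJ I-NP
nurses NNS I-NP
provide VBP B-VP
care NN B-NP
. . O

home NN B-NP
health NN I-NP
care NN I-NP
"""


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group 'word POS chunk' rows into sentences; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk = line.split()
        current.append((word, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE.splitlines())
print(len(sentences), "sentences")
print(sentences[0][:3])
```

To look at the real data, pass open("train.txt", encoding="utf-8") to read_conll instead of the sample.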

And then we can train our model based on this dataset:

./bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang en -data train.txt -model helloworld-chunker.bin

We now have a model called helloworld-chunker that knows how to group words together in a sentence.

Step 2: Part of Speech Tagging

The chunker works on a sentence that has been tagged with all the parts of speech; that is what it uses to figure out how to group things together. So back to OpenNLP we go to mark up our sample sentence. First we need to grab a pre-trained model:

wget http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin

The text needs to be whitespace tokenized, so convert the sentence:

Avenue HomeCare, a North Carolina state licensed home health care agency of registered nurses, licensed practical nurses, certified home health aides and nursing assistants providing care for people in their own homes.

into one where there are spaces around the punctuation:

Avenue HomeCare , a North Carolina state licensed home health care agency of registered nurses , licensed practical nurses , certified home health aides and nursing assistants providing care for people in their own homes .
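If you would rather not do that by hand, here is a rough Python sketch of the conversion. It is an assumption-laden shortcut, not how OpenNLP tokenizes: it simply puts spaces around a handful of punctuation marks, so it would also split things like decimals or abbreviations. The helper name space_out_punctuation is made up for this example.

```python
import re


def space_out_punctuation(text: str) -> str:
    """Put a single space on each side of , . ; : ! ? so each mark is its own token."""
    spaced = re.sub(r"([,.;:!?])", r" \1 ", text)
    # Collapse the runs of whitespace the substitution may have created.
    return " ".join(spaced.split())


sentence = (
    "Avenue HomeCare, a North Carolina state licensed home health care agency "
    "of registered nurses, licensed practical nurses, certified home health aides "
    "and nursing assistants providing care for people in their own homes."
)
tokenized = space_out_punctuation(sentence)
print(tokenized)
```

For this one sentence the result matches the hand-edited version above; for real text you would want a proper tokenizer.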

Now fire up the POSTagger and paste the whitespace-tokenized sentence at the terminal prompt that is blinking at you:

./bin/opennlp POSTagger ../en-pos-maxent.bin
Loading POS Tagger model ... done (1.431s)
Avenue HomeCare , a North Carolina state licensed home health care agency of registered nurses , licensed practical nurses , certified home health aides and nursing assistants providing care for people in their own homes .
Avenue_NNP HomeCare_NNP ,_, a_DT North_NNP Carolina_NNP state_NN licensed_VBD home_NN health_NN care_NN agency_NN of_IN registered_JJ nurses_NNS ,_, licensed_VBN practical_JJ nurses_NNS ,_, certified_JJ home_NN health_NN aides_NNS and_CC nursing_NN assistants_NNS providing_VBG care_NN for_IN people_NNS in_IN their_PRP$ own_JJ homes_NNS ._.

You can see the sentence of text now has been marked up with a set of codes describing what part of speech each token represents.
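The codes look like Penn Treebank style tags (NN for a noun, NNS for a plural noun, JJ for an adjective, and so on), each glued to its word with an underscore. If you want to poke at the output in a script, a small Python sketch like this splits it back apart. parse_tagged is a hypothetical helper, and the input here is just a fragment of the tagged sentence above.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'word_TAG' tokens on the last underscore into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


tagged = "licensed_VBN practical_JJ nurses_NNS ,_, their_PRP$ own_JJ homes_NNS ._."
pairs = parse_tagged(tagged)
# Keep only the tokens whose tag starts with NN (the noun tags).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs)
print(nouns)
```

Splitting on the last underscore keeps punctuation tokens such as ,_, intact as a comma tagged as a comma.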

Step 3: Finally do the Chunking!

Maybe this really should be called grouping rather than chunking, but this is where we find out how good our trained model is. We take the full marked-up sentence and pass it into the chunker, using the model we trained:

./bin/opennlp ChunkerME helloworld-chunker.bin
Loading Chunker model ... done (0.446s)
Avenue_NNP HomeCare_NNP ,_, a_DT North_NNP Carolina_NNP state_NN licensed_VBD home_NN health_NN care_NN agency_NN of_IN registered_JJ nurses_NNS ,_, licensed_VBN practical_JJ nurses_NNS ,_, certified_JJ home_NN health_NN aides_NNS and_CC nursing_NN assistants_NNS providing_VBG care_NN for_IN people_NNS in_IN their_PRP$ own_JJ homes_NNS ._.
[NP Avenue_NNP HomeCare_NNP ] ,_, [NP a_DT North_NNP Carolina_NNP state_NN ] [VP licensed_VBD ] [NP home_NN health_NN care_NN agency_NN ] [PP of_IN ] [NP registered_JJ nurses_NNS ] ,_, [VP licensed_VBN ] [NP practical_JJ nurses_NNS ] ,_, [NP certified_JJ home_NN health_NN aides_NNS ] and_CC [NP nursing_NN assistants_NNS ] [VP providing_VBG ] [NP care_NN ] [PP for_IN ] [NP people_NNS ] [PP in_IN ] [NP their_PRP$ own_JJ homes_NNS ] ._.

Now, this isn’t the easiest to read, but you can look at the content between the [ and ] brackets to figure out what was grouped. The groups we care about are:

Avenue HomeCare
a North Carolina state
home health care agency
registered nurses
practical nurses
certified home health aides
nursing assistants

With the exception of a North Carolina state, all of these look like really good sets of phrases to offer up as autosuggest suggestions.
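To get from the chunker output to something you could feed an autosuggest box, here is a minimal Python sketch. It makes a few assumptions of its own: only [NP ...] groups are kept, single-word groups are dropped, and the matching is a naive "some word starts with what you typed" check rather than anything Solr does. Both noun_phrases and suggest are hypothetical helpers. Note that this rule also keeps their own homes, which the list above leaves out.

```python
import re
from typing import List


def noun_phrases(chunked: str, min_words: int = 2) -> List[str]:
    """Return the text of each [NP ...] group, dropping the _TAG suffixes."""
    phrases = []
    for body in re.findall(r"\[NP ([^\]]+)\]", chunked):
        words = [token.rsplit("_", 1)[0] for token in body.split()]
        if len(words) >= min_words:
            phrases.append(" ".join(words))
    return phrases


def suggest(prefix: str, phrases: List[str]) -> List[str]:
    """Naive autosuggest: keep phrases in which some word starts with the prefix."""
    prefix = prefix.lower()
    return [
        phrase
        for phrase in phrases
        if any(word.lower().startswith(prefix) for word in phrase.split())
    ]


# The chunker output from above, pasted in as one string.
chunked = (
    "[NP Avenue_NNP HomeCare_NNP ] ,_, [NP a_DT North_NNP Carolina_NNP state_NN ] "
    "[VP licensed_VBD ] [NP home_NN health_NN care_NN agency_NN ] [PP of_IN ] "
    "[NP registered_JJ nurses_NNS ] ,_, [VP licensed_VBN ] [NP practical_JJ nurses_NNS ] ,_, "
    "[NP certified_JJ home_NN health_NN aides_NNS ] and_CC [NP nursing_NN assistants_NNS ] "
    "[VP providing_VBG ] [NP care_NN ] [PP for_IN ] [NP people_NNS ] [PP in_IN ] "
    "[NP their_PRP$ own_JJ homes_NNS ] ._."
)

phrases = noun_phrases(chunked)
print(phrases)
print(suggest("home", phrases))
```

Typing “home” now brings back whole phrases such as home health care agency instead of a single completed word, which is the behavior we were after.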

Have fun with OpenNLP!