Programming Tips & Tricks

Word Stemming in Java with WordNet and JWNL

This is a short article on how to create a Stemmer in Java with WordNet and JWNL.
First a quick explanation of what stemming is:
Stemming means reducing a word to its base form (or stem). For example, the words 'writing', 'wrote' and 'written' all have the stem 'write'. A stemmer takes a word, or a list of words, and produces the stem, or a list of the stems, of the input.
Stemming is useful for any kind of text analysis. When you care about the content of a text, the different tenses of verbs and the different endings for singular and plural make it hard to judge how important a specific word is if you treat every word exactly as it appears.
For example, take the text "I wrote a book about cats, after I had written a short article on cats. Currently I'm writing about dogs. Next year I'll write poems. But I made my favorite writings when I was younger."
Now, to a human, it's obvious that this text is mainly concerned with writing. But when you run a program that does simple text analysis over it, the program will come to the conclusion that it's a text about cats, because the word 'cats' is the only word (aside from words like 'i', 'a', etc. which are usually filtered out) that occurs more than once.
But if you stem the text before analysing it, that is, replace all words with their stems, the program will correctly tell you that it's a text about writing: after stemming, the word 'write' appears five times, because 'writing', 'written', 'writings' and 'wrote' have all been replaced by 'write'.
I myself used stemming (and wrote the following code) when I was writing a web crawler that classified sites by their textual content.
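To make the counting argument concrete, here is a minimal sketch that counts word frequencies in the example text with and without stemming. The class name StemFrequencyDemo and the small hand-written STEMS table are made up for this illustration; the table only stands in for a real stemmer and has nothing to do with WordNet yet.

```java
import java.util.HashMap;
import java.util.Map;

class StemFrequencyDemo {
    // The example text from above.
    static final String TEXT =
        "I wrote a book about cats, after I had written a short article on cats. "
        + "Currently I'm writing about dogs. Next year I'll write poems. "
        + "But I made my favorite writings when I was younger.";

    // Hypothetical hand-written stem table, standing in for a real stemmer.
    static final Map<String, String> STEMS = new HashMap<>();
    static {
        STEMS.put("wrote", "write");
        STEMS.put("written", "write");
        STEMS.put("writing", "write");
        STEMS.put("writings", "write");
        STEMS.put("cats", "cat");
    }

    // Counts how often each word occurs, optionally replacing words by their stems.
    static Map<String, Integer> countWords(String text, boolean stem) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a lowercase letter or an apostrophe.
        for (String token : text.toLowerCase().split("[^a-z']+")) {
            if (token.isEmpty()) {
                continue;
            }
            String word = stem ? STEMS.getOrDefault(token, token) : token;
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> raw = countWords(TEXT, false);
        Map<String, Integer> stemmed = countWords(TEXT, true);
        System.out.println("'cats' without stemming:  " + raw.get("cats"));      // 2
        System.out.println("'write' without stemming: " + raw.get("write"));     // 1
        System.out.println("'write' with stemming:    " + stemmed.get("write")); // 5
    }
}
```

Without stemming, 'cats' wins with two occurrences; with stemming, 'write' is counted five times and clearly dominates.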

Unfortunately, stemming is hard to do algorithmically because of the many rules and special cases in the English language. An easier way is to stem by brute force, that is, to use a dictionary that lists all words together with their stems.
One such dictionary, which is freely available, is WordNet. And some nice people created an open source project that provides a nice Java API to the dictionary, named JWNL.

So, to use WordNet for stemming, follow these steps:
1. Download and install WordNet (when I was writing this code, JWNL only supported WordNet version 2.0. By now it probably supports newer versions, but if you run into problems, try installing the 2.0 version of WordNet)
2. Download JWNL
3. Add jwnl.jar and commons-logging.jar to your project
4. Copy JWNLproperties.xml to your project directory, open it and change the value of 'dictionary_path' to the path where you installed the WordNet dictionary (this should be [WordNet installation directory]\dict\)

Now we can write some code!

First we have to import the JWNL classes and some standard classes:
import net.didion.jwnl.*;
import net.didion.jwnl.data.*;
import net.didion.jwnl.dictionary.*;

import java.io.*;
import java.util.Vector;
import java.util.Map;
import java.util.HashMap;
Now we can create a class for our stemmer. The class should contain these members:
	private int MaxWordLength = 50;
	private Dictionary dic;
	private MorphologicalProcessor morph;
	private boolean IsInitialized = false;  
	// maps each input word to its stem; typed so that get() returns a String
	public HashMap<String, String> AllWords = null;
Unfortunately, retrieving word information from WordNet is not as fast as we'd like it to be, so we'll store every word we stem in a HashMap. Stemming a word a second time will then be no more expensive than a constant-time HashMap lookup.
Now let's add a constructor to our class that creates the connection to WordNet:
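The caching idea can be sketched on its own, without WordNet. In the following sketch the class CachedLookupDemo, the slowLookup function and the slowCalls counter are all made up for illustration; slowLookup merely stands in for the expensive dictionary access.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class CachedLookupDemo {
    private final Map<String, String> cache = new HashMap<>();
    // Stand-in for the expensive dictionary lookup (an assumption of this sketch).
    private final Function<String, String> slowLookup;
    // Counts how often the expensive lookup was really executed.
    int slowCalls = 0;

    CachedLookupDemo(Function<String, String> slowLookup) {
        this.slowLookup = slowLookup;
    }

    String lookup(String word) {
        String cached = cache.get(word);
        if (cached != null) {
            return cached; // cache hit: no expensive lookup needed
        }
        slowCalls++;
        String result = slowLookup.apply(word);
        if (result == null) {
            result = word; // unknown words are cached as themselves
        }
        cache.put(word, result);
        return result;
    }

    public static void main(String[] args) {
        CachedLookupDemo demo =
            new CachedLookupDemo(w -> w.equals("wrote") ? "write" : null);
        System.out.println(demo.lookup("wrote")); // write
        System.out.println(demo.lookup("wrote")); // write, served from the cache
        System.out.println("expensive lookups: " + demo.slowCalls); // 1
    }
}
```

Note that unknown words are cached too, so a word that cannot be resolved does not trigger the expensive lookup again.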
	/**
	 * establishes connection to the WordNet database
	 */
	public Stemmer ()
	{
		AllWords = new HashMap<String, String> ();
		
		try
		{
			JWNL.initialize(new FileInputStream
				("JWNLproperties.xml"));
			dic = Dictionary.getInstance();
			morph = dic.getMorphologicalProcessor();
			// ((AbstractCachingDictionary)dic).
			//	setCacheCapacity (10000);
			IsInitialized = true;
		}
		catch ( FileNotFoundException e )
		{
			System.out.println ( "Error initializing Stemmer: "
				+ "JWNLproperties.xml not found" );
		}
		catch ( JWNLException e )
		{
			System.out.println ( "Error initializing Stemmer: " 
				+ e.toString() );
		} 
		
	}
First we create our hashmap, then we initialize JWNL and catch and report any exceptions this might generate.
Inside the try block, we first have to initialize JWNL with the properties XML file which we copied into our application folder. Then we can get a Dictionary and a MorphologicalProcessor, which we will need for the actual stemming. The commented-out line after this sets the size of JWNL's internal cache to 10000. You can use this and experiment with different values if you don't want to use the extra HashMap (but, at least with the JWNL version I was using at the time, the internal cache didn't seem to do anything, which is why I used the extra HashMap). Finally, we set the IsInitialized flag to true, which will later be used to prevent the stemming functions from trying to access the uninitialized dictionary.

Next, our class should contain some code to close the connection to WordNet:
	public void Unload ()
	{ 
		// nothing to close if initialization failed
		if ( !IsInitialized )
			return;
		dic.close();
		Dictionary.uninstall();
		JWNL.shutdown();
		IsInitialized = false;
	}
And now comes the interesting part: the method to actually stem a word:
	/**
	 * stems a word with wordnet
	 * @param word word to stem
	 * @return the stemmed word or null if it was not found in WordNet
	 */
	public String StemWordWithWordNet ( String word )
	{
		if ( !IsInitialized )
			return word;
		if ( word == null ) return null;
		if ( morph == null ) morph = dic.getMorphologicalProcessor();
		
		IndexWord w;
		try
		{
			w = morph.lookupBaseForm( POS.VERB, word );
			if ( w != null )
				return w.getLemma().toString ();
			w = morph.lookupBaseForm( POS.NOUN, word );
			if ( w != null )
				return w.getLemma().toString();
			w = morph.lookupBaseForm( POS.ADJECTIVE, word );
			if ( w != null )
				return w.getLemma().toString();
			w = morph.lookupBaseForm( POS.ADVERB, word );
			if ( w != null )
				return w.getLemma().toString();
		} 
		catch ( JWNLException e )
		{
			// lookup failed: treat the word as not found
		}
		return null;
	}
First we check if JWNL was correctly initialized. If not, we simply return the input word. Then we define an IndexWord, which is the data structure the library uses for words.
The next part is less straightforward, because we have to tell the API whether we want to look up a verb, noun, adjective or adverb. Since we don't know what kind of word we're currently stemming, we try all four kinds in turn with lookupBaseForm() until we get a non-null return value. If the return value is null in all four cases, our word couldn't be found in WordNet, and we return null. The final step is to get the word stem out of the IndexWord, which can be achieved via IndexWord.getLemma().toString();
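One consequence of trying the four kinds in a fixed order is that the order decides the result for ambiguous words. The following sketch shows this with a loop instead of four repeated blocks. It does not use JWNL at all: the Pos enum, the BaseFormLookup interface and the toyLookup table are made up for illustration and only stand in for the real lookup.

```java
import java.util.Arrays;
import java.util.List;

class PosFallbackDemo {
    // Hypothetical stand-ins for the library's part-of-speech constants and lookup.
    enum Pos { VERB, NOUN, ADJECTIVE, ADVERB }

    interface BaseFormLookup {
        String lookup(Pos pos, String word); // returns null if nothing is found
    }

    static final List<Pos> VERB_FIRST =
        Arrays.asList(Pos.VERB, Pos.NOUN, Pos.ADJECTIVE, Pos.ADVERB);
    static final List<Pos> NOUN_FIRST =
        Arrays.asList(Pos.NOUN, Pos.VERB, Pos.ADJECTIVE, Pos.ADVERB);

    // Tries each part of speech in the given order; the first match wins.
    static String firstBaseForm(BaseFormLookup lookup, List<Pos> order, String word) {
        for (Pos pos : order) {
            String base = lookup.lookup(pos, word);
            if (base != null) {
                return base;
            }
        }
        return null; // not found under any part of speech
    }

    // Tiny hand-written table for the demo: 'saw' is both a verb form and a noun.
    static String toyLookup(Pos pos, String word) {
        if (pos == Pos.VERB && word.equals("saw")) return "see";
        if (pos == Pos.NOUN && word.equals("saw")) return "saw";
        if (pos == Pos.NOUN && word.equals("cats")) return "cat";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstBaseForm(PosFallbackDemo::toyLookup, VERB_FIRST, "saw")); // see
        System.out.println(firstBaseForm(PosFallbackDemo::toyLookup, NOUN_FIRST, "saw")); // saw
        System.out.println(firstBaseForm(PosFallbackDemo::toyLookup, VERB_FIRST, "cats")); // cat
    }
}
```

With verbs tried first, as in the method above, an ambiguous word like 'saw' is treated as a verb form. Whether that is the right choice depends on the kind of text you analyze.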
This was the heart of our stemmer, getting a word stem from any input word.
But for good performance we want to use our hashmap. Client code will actually call the following method to stem a word:
	/**
	 * Stem a single word
	 * tries to look up the word in the AllWords HashMap
	 * If the word is not found it is stemmed with WordNet
	 * and put into AllWords
	 * 
	 * @param word word to be stemmed
	 * @return stemmed word
	 */
	public String Stem( String word )
	{
		// check if we already know the word
		String stemmedword = AllWords.get( word );
		if ( stemmedword != null )
			return stemmedword; // return it if we already know it
		
		// don't check words with digits in them
		if ( containsNumbers (word) == true )
			stemmedword = null;
		else	// unknown word: try to stem it
			stemmedword = StemWordWithWordNet (word);
		
		if ( stemmedword != null )
		{
			// word was recognized and stemmed with wordnet:
			// add it to hashmap and return the stemmed word
			AllWords.put( word, stemmedword );
			return stemmedword;
		}
		// word could not be stemmed by wordnet, 
		// so it is probably no correct english word
		// just add it to the list of known words so 
		// we won't have to look it up again
		AllWords.put( word, word );
		return word;
	}
The method is pretty straightforward: first we check if the word is already present in the HashMap, and return the stem if it is. If it isn't in the map yet, we use our previous method to stem the word. I do an additional check here to see if the word is or contains a number, and skip the word if this is the case. You will not need this check if you're analyzing literature, but I found that when crawling websites, enough text on the pages contained numbers (dates, prices, etc.) that this check improved performance. Then, if the word could be stemmed, it is added to the HashMap and the stem is returned. The last step depends on your application. You could simply return null here to indicate that the word could not be stemmed (it was not found in the WordNet database, meaning it is either no correct English word or one that WordNet does not cover, such as a pronoun or preposition), but for my needs it was best to simply return the input word, and add it to the HashMap as it is.
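The containsNumbers helper called in Stem is not listed in this article. A minimal sketch of what it could look like, assuming it should return true as soon as the word contains any digit (the wrapper class DigitCheckDemo exists only to make the example runnable on its own):

```java
class DigitCheckDemo {
    // Returns true if the word contains at least one digit.
    static boolean containsNumbers(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (Character.isDigit(word.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsNumbers("2008"));  // true
        System.out.println(containsNumbers("mp3"));   // true
        System.out.println(containsNumbers("write")); // false
    }
}
```

In the Stemmer class the method would be a private, non-static member rather than a separate class.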

To conclude, here is a little method that uses the Stem method to stem an entire list of words:
	/**
	 * performs Stem on each element in the given Vector
	 * 
	 */
	public Vector Stem ( Vector words )
	{
		if ( !IsInitialized )
			return words;
		
		for ( int i = 0; i < words.size(); i++ )
		{
			words.set( i, Stem( (String)words.get( i ) ) );
		}
		return words;		
	}