What happened to CharStream/CharReader in Lucene?

May 22, 2013 Doug Turnbull

A pretty subtle change happened in the transition from Lucene/Solr 3 to 4. The abstract method for CharFilterFactory changed from

public CharStream create(CharStream input);

to

public abstract Reader create(Reader input);

What’s up with this change? Why did it happen?

Well, first, let’s take a step back and explain why CharStream existed. CharStream inherited from and wrapped Java’s Reader class. Prior to Lucene 4, a Tokenizer used it to pull characters out of a document one by one. A CharStream could be any number of things that wrap the underlying text, including a CharFilter that adds or removes characters before they are indexed. To support filtering, it added one piece of functionality on top of Reader: the abstract method correctOffset.

What does correctOffset do? Recall that Lucene can store, for each token, the character offsets within the document where that token begins and ends. Lucene also lets you add or remove characters from the document before tokenization in your own custom CharFilter. The catch is that offsets measured after character filtering no longer match the original document. You can see this if we perform HTML filtering on this text:

01234567890123456789012345

Doug is cool

After filtering the HTML chars out we have

012345678901

Doug is cool

If we didn’t correct the offsets, we’d end up saving the following character positions of each token:

0  3 56 8  11
Doug is cool

Doug: begins 0, ends 3   -> Really should be begins 3, ends 6
is:   begins 5, ends 6   -> Really should be begins 8, ends 9
cool: begins 8, ends 11  -> Really should be begins 14, ends 17

Something in the CharFilter needs to remember the original offsets and correct them. This is where correctOffset comes in. It’s the magic black box that takes the offsets after filtering and converts them back to the original offsets in the document.
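To make the black box a little less magic, here is a minimal sketch in plain Java, not Lucene’s actual implementation. It assumes an input of `<p>Doug is <b>cool</b></p>`, which is just one string consistent with the offsets quoted above. The class name `OffsetSketch` and its methods are hypothetical, and the tag stripping is deliberately naive. The idea is to remember, at each point where characters were removed, how many have been removed so far, and to add that back when correcting an offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy filter: strips <...> tags and remembers how far offsets have shifted. */
class OffsetSketch {
    private final StringBuilder filtered = new StringBuilder();
    // Parallel lists: from filteredOffsets[i] onward, removedSoFar[i] characters were dropped.
    private final List<Integer> filteredOffsets = new ArrayList<>();
    private final List<Integer> removedSoFar = new ArrayList<>();

    OffsetSketch(String original) {
        int removed = 0;
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            if (!inTag) {
                filtered.append(c);   // ordinary character: keep it
                continue;
            }
            removed++;                // character inside a tag: drop it
            if (c == '>') {
                inTag = false;
                record(filtered.length(), removed);
            }
        }
    }

    /** Records the running total of removed characters at a filtered-text position. */
    private void record(int filteredOffset, int removed) {
        int last = filteredOffsets.size() - 1;
        if (last >= 0 && filteredOffsets.get(last) == filteredOffset) {
            removedSoFar.set(last, removed);   // back-to-back tags: keep only the latest total
        } else {
            filteredOffsets.add(filteredOffset);
            removedSoFar.add(removed);
        }
    }

    String filteredText() {
        return filtered.toString();
    }

    /** Maps an offset in the filtered text back to an offset in the original text. */
    int correctOffset(int currentOff) {
        int shift = 0;
        for (int i = 0; i < filteredOffsets.size(); i++) {
            if (filteredOffsets.get(i) > currentOff) {
                break;
            }
            shift = removedSoFar.get(i);
        }
        return currentOff + shift;
    }

    public static void main(String[] args) {
        OffsetSketch sketch = new OffsetSketch("<p>Doug is <b>cool</b></p>");
        String text = sketch.filteredText();
        int[][] tokens = {{0, 3}, {5, 6}, {8, 11}};   // inclusive begin/end, as in the post
        for (int[] t : tokens) {
            System.out.println(text.substring(t[0], t[1] + 1)
                + ": filtered " + t[0] + "-" + t[1]
                + " -> original " + sketch.correctOffset(t[0]) + "-" + sketch.correctOffset(t[1]));
        }
    }
}
```

Under that assumed input, the sketch maps the filtered offsets 0-3, 5-6 and 8-11 back to 3-6, 8-9 and 14-17, the "really should be" values listed above.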

Prior to Lucene 4.0, the intermediate CharStream abstract class was passed around. There were really only two implementations of CharStream: CharFilter, which filtered out characters and corrected offsets, and CharReader, which did no filtering and therefore no offset correcting. Nobody else really cared about CharStream. If they needed to filter (and in turn correct offsets), they inherited from CharFilter. CharStream was therefore a noisy intermediary between Reader and CharFilter. Most API users would probably feel more comfortable interfacing with a Reader or inheriting from a CharFilter. At least, if I were the one who made the change, that’s what I would be thinking.

So things were simplified for API users. In Lucene 4 and beyond, CharFilter simply inherits from Reader directly, and a Tokenizer works with either a plain Reader or a CharFilter. To decide whether offsets need correcting during tokenization, Tokenizer has this single line of hackery:

// (taken from Tokenizer)
protected final int correctOffset(int currentOff) {
    return (input instanceof CharFilter)
        ? ((CharFilter) input).correctOffset(currentOff)
        : currentOff;
}

In short, Lucene has hidden a little naughty direct type checking craziness to simplify the API. The Tokenizer now figures out directly whether it’s talking to a plain Reader or a CharFilter and proceeds accordingly, either keeping the current offset or correcting it. There’s no longer a need for CharStream (or CharReader): if the input is a plain Reader, offsets are left alone; if the Reader is really a CharFilter, offsets are corrected.
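The same dispatch can be tried without Lucene on the classpath. The sketch below uses stand-in classes, all of them hypothetical: `SketchCharFilter` plays the role of CharFilter, `FixedShiftFilter` is a pretend filter that shifts every offset by a fixed amount, and `SketchTokenizer` plays the role of Tokenizer. It only illustrates the instanceof check and is not how any real filter computes its corrections.

```java
import java.io.FilterReader;
import java.io.Reader;
import java.io.StringReader;

class DispatchDemo {
    public static void main(String[] args) {
        // Plain Reader: nothing was filtered, so offsets pass through unchanged.
        SketchTokenizer plain = new SketchTokenizer(new StringReader("Doug is cool"));
        System.out.println(plain.correctOffset(5));

        // Reader that is really a (stand-in) CharFilter: offsets get corrected.
        Reader filteredInput = new FixedShiftFilter(new StringReader("Doug is cool"), 3);
        SketchTokenizer filtering = new SketchTokenizer(filteredInput);
        System.out.println(filtering.correctOffset(5));
    }
}

/** Stand-in for CharFilter (hypothetical): a Reader that can also correct offsets. */
abstract class SketchCharFilter extends FilterReader {
    SketchCharFilter(Reader in) {
        super(in);
    }

    abstract int correctOffset(int currentOff);
}

/** Pretend filter (hypothetical): acts as if `shift` leading characters were removed. */
class FixedShiftFilter extends SketchCharFilter {
    private final int shift;

    FixedShiftFilter(Reader in, int shift) {
        super(in);
        this.shift = shift;
    }

    @Override
    int correctOffset(int currentOff) {
        return currentOff + shift;
    }
}

/** Stand-in for Tokenizer (hypothetical): mirrors the instanceof check quoted above. */
class SketchTokenizer {
    private final Reader input;

    SketchTokenizer(Reader input) {
        this.input = input;
    }

    int correctOffset(int currentOff) {
        return (input instanceof SketchCharFilter)
            ? ((SketchCharFilter) input).correctOffset(currentOff)
            : currentOff;
    }
}
```

With the plain StringReader the offset comes back untouched, and with the stand-in filter it comes back shifted, which is the whole trick: the caller hands over a Reader either way and never has to say which kind it is.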

So what does this mean for API users? You can safely change your CharFilterFactories to return Readers. If they returned CharFilters, they can still return CharFilters (as these are Readers). If you don’t care about filtering, you can return a Reader from this method as well. As an API user you can rest assured that Lucene won’t barf tokenizing either a CharFilter or a Reader.