Build Your Own Lucene Codec! - OpenSource Connections

June 5, 2013 Doug Turnbull
Category: Lucene

A Lucene Codec, a bit twiddler’s dream!

I’ve been having a lot of fun hacking on a Lucene Codec lately. My hope is to create a Lucene storage layer based on FoundationDB, a new distributed and transactional key-value store. It’s a fun opportunity to learn about both FoundationDB and low-level Lucene details.

But before we get into all that fun technical stuff, there’s some work we need to do. Our goal is to get a first codec, MyCodec, to work! Here’s the source code:

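import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public class MyCodec extends FilterCodec {
    public MyCodec() {
        // Register under the name "MyCodec" and delegate everything to Lucene42Codec
        super("MyCodec", new Lucene42Codec());
    }
}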

Great! Done. Well, not quite. A real codec looks more like this:

public final class SimpleTextCodec extends Codec {
  // pretend there’s private vars here

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }

  @Override
  public TermVectorsFormat termVectorsFormat() {
    return vectorsFormat;
  }

  @Override
  public FieldInfosFormat fieldInfosFormat() {
    return fieldInfosFormat;
  }

  @Override
  public SegmentInfoFormat segmentInfoFormat() {
    return segmentInfos;
  }

  @Override
  public NormsFormat normsFormat() {
    return normsFormat;
  }

  @Override
  public LiveDocsFormat liveDocsFormat() {
    return liveDocs;
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return dvFormat;
  }
}

This example is a bit more explicit. This is Mike McCandless’s SimpleText codec, a great codec to browse for educational purposes. In this codec, every component is customized.

Typically you only want to implement a subset of the components of the codec. Lucene provides a convenient base class called “FilterCodec”. You can customize whatever pieces you’d like, delegating the rest to another codec. For example, if we wanted a custom storage implementation for term vectors only, we could override just that piece like so:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.TermVectorsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public class MyCodec extends FilterCodec {
    private final TermVectorsFormat myTermVectorsFormat;

    public MyCodec() {
        super("MyCodec", new Lucene42Codec());
        myTermVectorsFormat = new MyTermVectorsFormat();
    }

    // Use custom TermVectorsFormat, default everything else to Lucene42Codec
    @Override
    public TermVectorsFormat termVectorsFormat() {
        return myTermVectorsFormat;
    }
}

Here, we’re delegating to Lucene42Codec for everything except our special TermVectorsFormat.
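To make the delegation idea concrete, here is a minimal plain-Java sketch of the pattern. Every class in it (ToyCodec, ToyDefaultCodec, ToyFilterCodec, ToyMyCodec, ToyCodecDemo) is a hypothetical stand-in invented for illustration, not a Lucene class, and each “format” is reduced to a descriptive string. It only assumes that a filter-style base class forwards each accessor to the codec it wraps.

```java
// Toy stand-ins, NOT Lucene's classes: each "format" is just a descriptive string.
abstract class ToyCodec {
    abstract String postingsFormat();
    abstract String termVectorsFormat();
}

// Plays the role of the codec we delegate to.
final class ToyDefaultCodec extends ToyCodec {
    @Override String postingsFormat() { return "default postings"; }
    @Override String termVectorsFormat() { return "default term vectors"; }
}

// Forwards every accessor to the wrapped codec unless a subclass overrides it.
abstract class ToyFilterCodec extends ToyCodec {
    private final ToyCodec delegate;

    ToyFilterCodec(ToyCodec delegate) { this.delegate = delegate; }

    @Override String postingsFormat() { return delegate.postingsFormat(); }
    @Override String termVectorsFormat() { return delegate.termVectorsFormat(); }
}

// Overrides one piece and inherits the forwarding behavior for the rest.
final class ToyMyCodec extends ToyFilterCodec {
    ToyMyCodec() { super(new ToyDefaultCodec()); }

    @Override String termVectorsFormat() { return "custom term vectors"; }
}

final class ToyCodecDemo {
    public static void main(String[] args) {
        ToyCodec codec = new ToyMyCodec();
        System.out.println(codec.postingsFormat());
        System.out.println(codec.termVectorsFormat());
    }
}
```

Running ToyCodecDemo prints the delegated postings format followed by the overridden term vectors format, which is the same split MyCodec makes between Lucene42Codec and MyTermVectorsFormat.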

Each of these individual formats is a separate piece responsible for serializing to a backing store during indexing and deserializing back into memory at read time. For many of the formats, though, it’s a bit more than that. Take the postings format, the format responsible for storing the inverted index: it’s vital to be able to iterate the inverted index efficiently. The inverted index must be stored in such a way that we can easily iterate all the indexed fields, then all the terms indexed into each field, then in turn the documents that contain that term in that field, along with their term frequencies and positions.
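The iteration order described above (fields, then terms, then documents with frequencies and positions) can be sketched with a toy in-memory structure. This is only an illustration built on nested sorted maps; ToyPostings and its methods are hypothetical names, not part of Lucene’s postings API, and a real postings format works against on-disk data rather than Java collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these are NOT Lucene classes.
final class ToyPostings {
    // field -> term -> docId -> positions, with every level kept in sorted order
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> index = new TreeMap<>();

    // Records that `term` occurs in `field` of document `docId` at `position`.
    void add(String field, String term, int docId, int position) {
        index.computeIfAbsent(field, f -> new TreeMap<>())
             .computeIfAbsent(term, t -> new TreeMap<>())
             .computeIfAbsent(docId, d -> new ArrayList<>())
             .add(position);
    }

    // Walks fields, then terms, then documents, emitting one line per posting.
    List<String> dump() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<Integer, List<Integer>>>> field : index.entrySet()) {
            for (Map.Entry<String, Map<Integer, List<Integer>>> term : field.getValue().entrySet()) {
                for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                    List<Integer> positions = doc.getValue();
                    // Term frequency is simply the number of recorded positions.
                    out.add(field.getKey() + ":" + term.getKey()
                            + " doc=" + doc.getKey()
                            + " freq=" + positions.size()
                            + " positions=" + positions);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ToyPostings postings = new ToyPostings();
        postings.add("title", "lucene", 0, 2);
        postings.add("body", "codec", 1, 0);
        postings.add("body", "codec", 1, 5);
        postings.add("body", "codec", 0, 3);
        for (String line : postings.dump()) {
            System.out.println(line);
        }
    }
}
```

Because every level is a sorted map, the walk visits the “body” field before “title”, and document 0 before document 1 within a term, regardless of the order in which postings were added.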

You’ll find similar constraints as you implement the interfaces of the other pieces of the codec. Each of these pieces is a topic in its own right worth writing about. They all deserve their own blog articles. For now, I encourage you to explore the JavaDocs to see what might be fun to customize on the Lucene backend! Before I leave you to the Javadocs though, it’s important to tackle a few bits of plumbing – building & running Lucene’s tests against your codec.

Plumbing! Unit Tests & More

Let’s take care of a bit of plumbing. How do we set up a project for a codec? How do we run Lucene’s tests against our implementation to confirm Solr/Lucene will function with our changes?

Using Maven to set up the project is fairly straightforward. Luckily, I’ve created a Lucene Codec hello world project on GitHub to get you started. It captures setting up the project with Maven. Feel free to fork it to skip the first two steps below. You’ll still need to read on to learn how to run the Lucene tests against your codec.

First, we’ll start by creating a straightforward Maven project with a pom that depends on lucene-core at the version you’re targeting with your codec.

Second, we’ll need to publish via our META-INF/services directory that we have a class that implements Codec. This advertises our codec to Lucene, which discovers codecs through Java’s service-provider (SPI) mechanism. Under src/main/resources/services create a file called org.apache.lucene.codecs.Codec. The file should contain a single line with the fully qualified name of your Codec class:

com.o19s.MyCodec
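If you want to confirm the service file actually ends up on the classpath, a small standalone check like the following can help. It is a sketch under stated assumptions: ServiceFileCheck is a hypothetical helper, not part of Lucene, and the parsing follows the java.util.ServiceLoader file conventions (one class name per line, with # starting a comment). It only lists what the file declares and does not load the codec.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Lucene.
final class ServiceFileCheck {
    static final String RESOURCE = "META-INF/services/org.apache.lucene.codecs.Codec";

    // Returns the provider class names, skipping blank lines and '#' comments.
    static List<String> parse(Reader reader) {
        return new BufferedReader(reader).lines()
                .map(line -> {
                    int hash = line.indexOf('#');
                    return (hash >= 0 ? line.substring(0, hash) : line).trim();
                })
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Lists every copy of the service file visible on the classpath.
        ClassLoader loader = ServiceFileCheck.class.getClassLoader();
        Enumeration<URL> urls = loader.getResources(RESOURCE);
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            try (Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
                System.out.println(url + " -> " + parse(reader));
            }
        }
    }
}
```

Run it with your packaged codec jar on the classpath. If nothing mentioning com.o19s.MyCodec is printed, the service file most likely was not copied into META-INF/services.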

We’ll need to tell Maven to copy this file into META-INF/services in the build output by declaring it in the pom as a resource:

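<build>
  <resources>
    <resource>
      <directory>src/main/resources/services</directory>
      <targetPath>META-INF/services</targetPath>
    </resource>
  </resources>
</build>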

Third, pull down the full Lucene/Solr source tree. Let’s test our codec from the command line!

Package your codec into a jar:

mvn package

Under the Lucene source tree, run a single Lucene test. Pass it the codec to use with the -Dtests.codec argument. Pass the jar containing the codec you just packaged with the -lib argument. Executing this will prove that Lucene can find and load your codec. If Lucene can’t find or load your codec, you’ll get an error right away.

C:\solr\solr-4.3.0\lucene>ant -Dtestcase=TestSegmentTermDocs -Dtests.codec=MyCodec -lib "C:\path\to\target\codec-1.0-SNAPSHOT.jar" test

Now run all the tests!

C:\solr\solr-4.3.0\lucene>ant -Dtests.codec=MyCodec -lib "C:\path\to\target\codec-1.0-SNAPSHOT.jar" test

Fourth, it’s naturally going to be convenient to debug Lucene unit tests running our codec in Eclipse. Here’s what we need to do:

Load your codec and the Lucene source code into Eclipse.
Create a new JUnit debug configuration.
Select the radio button for “Run all the tests in the selected project, package, or folder”.
Enter the folder /path/to/lucene/core/src/test.
Select the JUnit 4 test runner.
In the “Arguments” tab, for VM arguments specify:

-ea -Dtests.codec=MyCodec

In the “Classpath” tab, make sure both your codec project and the Solr/Lucene projects are selected.

Now you should be able to launch this debug configuration and go to town! Go forth and make some awesome codecs! Let us know about the codec you’re working on; we’d love to hear about it!