Solr UTF-8 Character Handling

Searching for non-ASCII characters can be a challenge. There are a number of reasons for doing so, even in a primarily English corpus:

  1. Accented characters in names and words that have been incorporated. For example, Renèe Pèrez
  2. Greek letters, which have been incorporated into mathematical formulae or scientific phrases: α-linoleic acid
  3. Punctuation characters which may change the meaning of the text: &alpha;

These characters each have their own special considerations. Fortunately, Solr provides support for all these cases with a little bit of configuration.

Accented Characters

For accented characters, there are two possible requirements, depending on the application:

  • Rene and Renè should return different results
  • Rene and Renè should return the same results

Remember, by default Solr does exact matches for each token, so a search for Rene will not match Renè. The usual solution here is called character folding: replacing accented characters at index time with their ASCII equivalents. This is provided via the MappingCharFilterFactory.

In the schema, add the following lines:

<analyzer>
  <charFilter class="solr.MappingCharFilterFactory"  mapping="mapping-FoldToASCII.txt"/>
  <tokenizer ...>
  [...]
</analyzer>

The mapping-FoldToASCII.txt file contains both the characters to be replaced and their replacements. For our example:

# è  [LATIN SMALL LETTER E WITH GRAVE]
"\u00E8" => "e"

This tells Solr at index time to replace every è with e, so our search for Rene will return Renè as well. It is important to remember that this removes the original è character from the indexed terms. Unless the same mapping is applied in both the index and query analyzer chains, queries for Renè won’t match anything at all!
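
As a sketch of applying the mapping on both sides, the field type below declares separate index and query analyzers that share the same char filter. The field type name text_folded and the choice of tokenizer are assumptions made for illustration:

<fieldType name="text_folded" class="solr.TextField">
  <!-- applied when documents are indexed -->
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <!-- applied to the user's query; it must fold the same way -->
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

A single <analyzer> element with no type attribute is used for both indexing and querying, which is usually the simpler way to keep the two chains in step.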

The MappingCharFilterFactory allows us to specify precisely which characters will be preserved or folded. Solr also provides the ASCIIFoldingFilterFactory, which folds characters down to their ASCII equivalent if one exists. Thus è => e will be mapped, although many control and punctuation characters will be folded as well.
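
A minimal sketch of supporting both requirements listed earlier: one field keeps accents so that Rene and Renè stay distinct, and a copy of it is folded so that they match. The field and type names (name_exact, name_folded, text_exact, text_ascii) are hypothetical; the factories themselves ship with Solr:

<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ascii" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- folds è to e, along with other characters that have an ASCII equivalent -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<field name="name_exact" type="text_exact" indexed="true" stored="true"/>
<field name="name_folded" type="text_ascii" indexed="true" stored="false"/>
<!-- index a folded copy of every value written to name_exact -->
<copyField source="name_exact" dest="name_folded"/>

Queries against name_exact then distinguish Rene from Renè, while queries against name_folded treat them as the same term.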

Greek Letters

Greek letters are common in many document sets, particularly those containing scientific terms and formulae. In this case, there may not be an obvious character to use in the folding. Our first thought might be to add a longer character string in the mapping-FoldToASCII.txt file, or one like it:

# α  [GREEK SMALL LETTER ALPHA]
"\u03B1" => "alpha"

However, the mapping char filter is applied as a char filter. This means that it will be applied before the document is tokenized. So when we replace one character (α) with five characters (alpha), there is no way to preserve that mapping of location back to the original document. If we are returning highlighted results to the user based on the field containing the mappings, then the highlighting will cover too many characters!

The PatternReplaceFilterFactory is a close relative of the MappingCharFilterFactory. It allows us to specify a pattern of characters to match and a corresponding substitution. In this case, the replacement is applied after the text has been tokenized, so each token keeps the start and end offsets of the original text, even if we change the size of the token:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="α" replacement="alpha"/>
</analyzer>

If our users will be commonly searching for the actual character, the reverse configuration is also possible:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="alpha" replacement="α"/>
</analyzer>

Remember, the important thing is that the same analyzer is applied at both index and query time.
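
One way to guarantee that is to declare a single analyzer with no type attribute, which Solr uses for both indexing and querying. The sketch below also chains one replacement per letter; the field type name text_greek and the particular letters chosen are assumptions for illustration:

<fieldType name="text_greek" class="solr.TextField">
  <!-- no type attribute: the same chain runs at index time and at query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="α" replacement="alpha" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="β" replacement="beta" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ω" replacement="omega" replace="all"/>
  </analyzer>
</fieldType>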

There are two other interesting issues that come up here. These considerations are not special to Greek letters, but are more common with that subset than other special characters.

Encoding

What if the Greek letters in my document are not stored as UTF-8 characters? Many larger documents have been serialized to XML or HTML before being ingested. In many of these documents, the special characters may be HTML-encoded, rather than the UTF-8 encoding Solr allows searching over. That “α-linoleic acid” may be “&alpha;-linoleic acid”. In this case, users will rarely, if ever, search for the HTML encoding.

If highlighting is not a significant concern, the easy solution in this case is the HTMLStripCharFilterFactory. This removes all HTML markup from the body before tokenization, and replaces HTML-encoded characters with the characters they represent:

<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

Note that because this is a char filter, it will negatively impact highlighting. Additionally, the HTMLStripCharFilterFactory is sensitive to punctuation. HTML-encoded characters usually follow the pattern “&[a-zA-Z]+;”; however, in many cases the trailing semicolon is missing. Solr will assume that stand-alone terms missing the trailing semicolon are HTML-encoded, while embedded terms must provide the trailing semicolon. Thus

alpha&omega

will be indexed as

alpha&omega

while

alpha&omega;

would be indexed as

alphaω

Depending on your familiarity with the Greek alphabet, the above encoding may be a surprise. Isn’t omega Ω?

The answer is simple. As with the a-z alphabet, non-Latin characters also have uppercase and lowercase forms. For the most part, this is handled transparently by Java's toUpperCase() and toLowerCase() methods. Furthermore, HTML encoding provides syntax to support both cases. In the previous example,

&omega;

is the HTML encoding of lowercase omega, or ω. The uppercase Ω is written &Omega;.

Whether this behavior is relevant depends on the text being indexed. For nutritional content, users may be more interested in having the character map seamlessly to ‘alpha’ in their searches. In a scientific document, ω and Ω may occur in different formulae, and occurrences of the lowercase form may not be relevant to occurrences of the uppercase one.
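
One way to express that choice in the schema is whether the analyzer lowercases its tokens. The sketch below assumes two hypothetical field types, text_case_folded and text_case_sensitive; which one fits depends on the corpus:

<!-- nutrition-style text: Ω and ω end up as the same indexed term -->
<fieldType name="text_case_folded" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- scientific text: no lowercasing, so Ω and ω remain distinct terms -->
<fieldType name="text_case_sensitive" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

Note that leaving out the lowercase filter makes Latin letters case-sensitive as well, so a search for Acid would no longer match acid in that field.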

If we can help you solve these kinds of tricky search issues, get in touch!