Recently, we had a project where we helped a client index a corpus of Chinese-language documents in Solr. We asked Dan Funk, a committer to Project Blacklight, to provide a guest blog post on the details of how to approach indexing Chinese, particularly when you are a non-speaker.
Take it away, Dan!
Indexing Chinese in Solr
Prologue (Including thanks, and some vital orientation)
Before I start, I'd like to lay some thanks on a few people who helped me muddle through indexing a language I can't speak, and still come off looking like a pro. Wiley Kestner (@prairie_dogg) sat for hours giving me tips and pointers about Chinese. Christopher Ball helped me quickly put an excellent and professional face on my work by using the Blacklight project. And Eric Pugh (@dep4b) provided some much-needed mentoring, helping me see a way forward in what I initially believed was an intractable problem.
If you don't read Chinese or have not worked with it before, here are a few things you should know:
- Chinese words are frequently made up of more than one character, and words are not separated by spaces. (read this as “Tokenization is a big problem.”)
- Spoken Chinese is completely different from written Chinese, so don't stress about the multitude of dialects when you are indexing.
- There are two common types of written Chinese: Standardized (more widely known as Simplified) and Traditional. Since Traditional can be converted to Standardized fairly easily, the focus of this document is on Standardized. Traditional text has many more characters and thus the potential for deeper, subtler meanings.
- Though traditionally written from top to bottom, right to left, it is far more common to see Chinese written from left to right – particularly on the web.
- Don't depend on your documents being in UTF-8; you are far more likely to encounter GB2312 encoding.
- A great method for testing relevancy in a language you don't know is to use a Judgment List. Please see Eric Pugh's presentation for more information.
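As a rough illustration of the encoding point above, here is a minimal Python sketch that tries UTF-8 first and falls back to a GB-family codec. The function name `decode_chinese` is made up for this example, and the fallback order is an assumption that suits a corpus known to be mostly Chinese, not a general-purpose detector.

```python
def decode_chinese(raw: bytes) -> str:
    """Decode bytes that may be UTF-8 or a GB-family encoding (hypothetical helper)."""
    try:
        # UTF-8 is strict, so GB2312-encoded Chinese text almost never decodes cleanly as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # GB18030 is a superset of GB2312 and GBK, so one fallback covers all three.
        return raw.decode("gb18030")


sample = "中文文档"
print(decode_chinese(sample.encode("gb2312")))  # bytes as a GB2312 source would supply them
print(decode_chinese(sample.encode("utf-8")))   # bytes as a UTF-8 source would supply them
```

Decoding everything to Unicode before it reaches the indexer keeps the analyzers from ever seeing raw GB2312 bytes.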
My Best Advice:
OK, here are the two most important pieces of advice I can give you:
#1: Separate your Chinese text into its own field(s).
That is to say, don't try to index multiple languages in the same field. If your Lucene/Solr field structure is complicated, add a second core with duplicate field names. Why?
A. You set yourself up for handling additional languages fluidly and effectively.
B. You can use the best indexer available for each language (see advice #2)
C. You improve overall performance because the indexes are smaller and tighter.
D. You remove confusing, and likely false, results in a language the end user does not understand.
#2: Use the CJK or Paoding analyzers for your Chinese Text.
There is some great documentation out there for CJK, but if you would like to give Paoding a shot, here are some directions to help get you up and running:
1. Check out the Paoding source code.
dan@maus:~$ cd code
dan@maus:~/code $ svn co http://paoding.googlecode.com/svn/trunk/ paoding-analysis
2. Compile it with Ant.
dan@maus:~/code $ cd paoding-analysis
dan@maus:~/code/paoding-analysis $ ant
Building jar: paoding-analysis.jar
3. Build a modified Solr war file.
The Paoding analyzer, while brilliant at analyzing Chinese text, was not originally built to work well in a web-deployed environment, and depends heavily on file paths to get to its built-in dictionaries. To correct for this, you will need to inject the analyzer and its configuration files into your Solr WAR file. I tested this approach with apache-solr-3.4 by doing the following:
dan@maus:~$ mkdir temp
dan@maus:~$ cd temp
dan@maus:~/temp$ unzip /usr/local/apache-solr-3.4.0/dist/apache-solr-3.4.0.war
dan@maus:~/temp$ cp ~/code/paoding-analysis/paoding-analysis.jar WEB-INF/lib/
dan@maus:~/temp$ cp ~/code/paoding-analysis/classes/*.properties WEB-INF/
dan@maus:~/temp$ zip -r apache-solr-3.4.0-paoding.war *
4. Update your Solr configuration and add support for a paoding string.
5. Copy Paoding's dictionary files into your Solr home directory.
dan@maus:~/solr_home$ cp -r ~/code/paoding-analysis/dic my_solr_home
6. Pass PAODING_DIC_HOME on the command line when you start Solr, so the Paoding analyzer knows where to find the dictionary files:
dan@maus:~/solr_home$ java -DPAODING_DIC_HOME=./dic -jar start.jar
**Choosing the right Analyzer**
Now that I've recommended Paoding and CJK, let me back that up with some details. Below I delve just a little more into the structure of Chinese text, and then run through a comparison of the available tokenizers to help give you an idea of their differences.
**The Structure of Chinese Text**
Most languages use spaces to separate their words. A common misconception is that Chinese words are its characters, but this is the case only a fraction of the time.
Take 的 (de) for example. It is the single most common character in Standard Chinese by far. It has little use on its own, but when placed with other characters it can mean:
高的 high, tall;
是的 that's it, that's right;
是…的 one who…;
目的 goal;
真的 true, real;
In short, you can't search for the characters individually as if they all carry the same weight, or the relevance of the search results will be embarrassingly reduced.
**What Analyzers are available?**
Let me introduce you to the options, then follow up with some comparisons that show how the tokenizing actually differs.
To my knowledge what follows is a complete list of the open source options available for parsing, indexing and searching Chinese characters in Solr/Lucene. While commercial options definitely exist, they were not a part of this comparison.
| Analyzer | Strengths | Weaknesses |
|---|---|---|
| Default Solr setup | No new configuration required, and roughly supports multiple languages. | Tokenized on spaces – but will shift to character tokenization for Chinese text. See previous section for why this is problematic. |
| CJK | Thoughtfully parses Chinese characters – understands that character groups alter meaning. Ships with and is part of Solr’s default configuration. | Does not use a dictionary, depends largely on an n-gram based algorithm that creates all possible groupings of pairs of symbols in the text. |
| Smart Chinese | Uses a dictionary to pull out characters. Ships with Solr as an add-on package. | The dictionary is minimal and handles general cases well, but many nuances of the language are lost. It requires a custom Solr configuration. |
| Paoding | Uses a large set of dictionaries, and provides exceptionally good search results across a multitude of contexts. | Can be very difficult to configure and set up – almost all documentation is written in Chinese. Does not ship with Solr, and must be built from source to work correctly with the latest stable Solr versions. |
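To make the difference between character tokenization and CJK-style pairing concrete, here is a small Python sketch. It is only an approximation of the idea, not the Lucene implementation, and the helper names (`unigrams`, `bigrams`, `shares_token`) are invented for this example.

```python
from typing import Callable, List


def unigrams(text: str) -> List[str]:
    """One token per character, as character-level tokenization would produce."""
    return list(text)


def bigrams(text: str) -> List[str]:
    """Every overlapping pair of adjacent characters (the pairing idea behind CJK)."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def shares_token(query: str, document: str, tokenize: Callable[[str], List[str]]) -> bool:
    """True if the query and the document have at least one token in common."""
    return bool(set(tokenize(query)) & set(tokenize(document)))


print(bigrams("胡志明"))                          # ['胡志', '志明']
print(shares_token("爬蟲", "爬岩鳅", unigrams))   # True: the lone character 爬 matches
print(shares_token("爬蟲", "爬岩鳅", bigrams))    # False: no pair of characters is shared
```

Pairing removes the single-character false match without any dictionary, which is why it is such a cheap improvement over the default.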
**About the Sample Document Set:**
A set of 12 documents was loaded into Lucene. The first 10 are about “types of fish” and are based on a quick Google search of the same. The 11th document is a Wikipedia article on Hồ Chí Minh, and the 12th document is about a person whose name begins Hồ Chí.
Example 1: 爬蟲
爬蟲 means “Reptile”.
爬 : [pá] crawl, climb
蟲 : [chóng] the traditional form of 虫, meaning worm; paired with 昆 (昆虫) to mean insect.
So here is a case where we have a traditional character*, and a paired set of characters that together mean something different from what they mean separately.
In this table, “T1”, “T2”, … represent the terms parsed out by the various analyzers. In the example below, the string “爬蟲” is split into two tokens by the default Solr setup, but remains a single token in CJK.
| Analyzer | T1 | T2 | Results |
|---|---|---|---|
| Default Solr setup | 爬 | 蟲 | 2 hits (doc 8 and doc 3). 爬蟲類 (“reptile”) is a good hit; 爬岩鳅 (“Beaufortia loach”) is a bad hit. Even more problematic, your highlighting will identify these as two separate hits, even when it gets it right. |
| CJK | 爬蟲 | | 1 hit (doc 8). It gets the right document, but only because CJK always groups by 2; we will see it fall short in the next example. |
| Smart Chinese | 爬 | 蟲 | 2 hits (doc 8 and doc 3) |
| Paoding | 爬蟲 | | 1 hit (doc 8) |
* Note: A second run, replacing the traditional symbol 蟲 with the standardized 虫 symbol, does not match any documents in the test set, though it would have been correct to do so. The ICU Project provides an API that would perform this conversion.
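The conversion mentioned in the note can be sketched in Python as a character-for-character substitution applied to both documents and queries. The two-entry table below is a toy stand-in invented for this example; a real conversion table (such as the one ICU provides) covers thousands of characters and handles cases this sketch ignores.

```python
# Hypothetical two-entry table for illustration only; not a usable conversion table.
TRAD_TO_SIMP = str.maketrans({"蟲": "虫", "類": "类"})


def to_simplified(text: str) -> str:
    """Replace each traditional character that has an entry in the toy table."""
    return text.translate(TRAD_TO_SIMP)


indexed = to_simplified("爬蟲類")  # normalize the document text at index time
query = to_simplified("爬虫")      # normalize the query the same way
print(indexed, query, query in indexed)
```

Because both sides are normalized to the same form, a query typed with 虫 can find a document written with 蟲, and vice versa.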
Example 2: 胡志明
Hồ Chí Minh was a profoundly important leader in Vietnam. However, divide these characters up and you might get “A recklessly clear magazine.”
胡志明 means “Hồ Chí Minh”.
胡 : [hú] “recklessly” 胡说 nonsense (F鬍) (=胡子 húzi) beard (F衚) 胡同 hútòng lane
志 : [zhì] (=意志 yìzhì) will, (=标志 biāozhì) mark; 同志 tóngzhì comrade, (F誌) 杂志 zázhì magazine
明 : [míng] bright, clear, distinct, next (day or year)
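| Analyzer | T1 | T2 | T3 | Results |
|---|---|---|---|---|
| Default Solr setup | 胡 | 志 | 明 | 6 hits, including 胡志明 (Hồ Chí Minh), 胡 志 (Ho Chi), and 起明显 (the apparent) |
| CJK | 胡志 | 志明 | | 2 hits: 胡志明 (Hồ Chí Minh) and 胡志 (Ho Chi) |
| Smart Chinese | 胡 | 志 | 明 | 4 hits, including 胡志明 (Hồ Chí Minh) and 胡 志 (Ho Chi) |
| Paoding | 胡志明 | | | 胡志明 (Hồ Chí Minh) |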
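To show why a dictionary keeps 胡志明 together where pairing cannot, here is a Python sketch of forward maximum matching, one common dictionary-based segmentation technique. This is not Paoding's actual algorithm, and the two-word dictionary and the `max_match` name are assumptions made purely for illustration.

```python
from typing import List, Set


def max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; real dictionaries hold many thousands of words.
DICTIONARY = {"胡志明", "爬蟲"}

print(max_match("胡志明", DICTIONARY))  # ['胡志明']
print(max_match("胡志", DICTIONARY))    # ['胡', '志']
```

With the full name indexed as one token, a document that only contains 胡志 no longer shares a token with the query, which is consistent with the single Paoding result in the table above.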
It is possible for you to index Chinese, even if you don't speak it. The largest problem you will face is in correctly parsing the text, but there are several effective tools that help solve the problem. I would strongly discourage you from indexing Chinese content with Solr's default settings. You will not get good results. If you need to quickly add support for Chinese to an existing project, I highly recommend using the CJK analyzer. However, if you have a discerning audience, a specialized area, or the need to enhance the quality of your results over time (by expanding on the included dictionaries), then Paoding is an excellent choice.
Resources and References
http://www.zein.se/patrick/3000char.html - The most common Chinese characters in order of frequency
http://translate.google.com/ - A fantastic way to quickly translate a few characters or a whole page of text.
http://site.icu-project.org/ - Provides an API for converting from Traditional to Standardized Chinese Characters.
@prairie_dogg - http://www.twitter.com/prairie_dogg