Searching names is a pretty common requirement for many applications. Searching by book authors, for example, is a pretty crucial component to a book store. And as it turns out names are actually a surprisingly hard thing to get perfect. Regardless, we can get something pretty good working in Solr, at least for the vast-majority of Anglicized representations.
We can start with the assumption that aside from all the diversity in human names, that a name in our Authors field is likely going to be a small handful of tokens in a single field. We’ll avoid breaking these names up by first, last, and middle names (if these are even appropriate in all cultural contexts). Let’s start by looking at some sample names in our “Authors” field:
- Doug Turnbull
- Turnbull, Douglas
- Turnbull, Douglas G.
- Turnbull, D. G.
- D. Graeme Turnbull
Ok, you can already see the variations in how we represent Anglicized names. This lets us construct our strategy. First, for the record, we’re going to use this pretty basic analysis chain for our authors field. It will do the job of stripping punctuation and lowercasing names:
<fieldType name="AuthorsType" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory" /> </analyzer> </fieldType>
Ok down to business. There are two main issues that if we could tackle, would get at a very large portion of the name search problem
Author names get rearranged, either in the document or query, with some parts omitted: (Douglas Turnbull vs Turnbull, Douglas vs Turnbull Douglas G)
Many names are abbreviated: (Doug Turnbull vs D. Turnbull vs D. G. Turnbull vs Douglas G. Turnbull)
The reordering of author tokens is a fairly straight forward exercise in using lucene proximity search. This feature of Lucene query syntax lets us take a user’s query and a proximity P:
- Douglas Turnbull
And search for phrases that are the phrase the user entered or phrases where the terms occur within P proximity to each other. So, expressing this in Lucene query syntax
- Authors:”Douglas Turnbull”~2
Will happily return results where Douglas and Turnbull occur within 2 token moves of each other (regardless of order). The following matches would be acceptable:
- Douglas Turnbull
- Turnbull Douglas
Accounting for middle initials, we could set our proximity to 3, so a query for:
- Authors:”Douglas Turnbull”~3
Will now match:
- Douglas G. Turnbull
- Turnbull, Douglas G.
Great, we’ve solved this problem! Well actually this works in most, but not all, cases. Can you spot the subtle bug? Here’s a hint, it has to do with using a phrase query. Is there a class of queries that this strategy won’t work for?
Initials & Short Forms
What happens when a user searches for Doug Turnbull and all Solr has indexed are references to Douglas Turnbull? To help with efficient prefix queries, Solr gives us the EdgeNGramFilterFactory. The EdgeNGramFilterFactory takes a token, say Douglas, and generates tokens based on slicing the string from either the front or the back of the string. For example, with minGramSize=1 and side=”front” the token “Douglas” will result in the following tokens:
Input: douglas Tokens: [d] [do] [dou] [doug] [dougl] [dougla] [douglas]
An important note about this filter (and many others in Solr) is that each generated token ends up occupying the same position in the indexed document. In reality, we can envision the document as two dimensional:
Position N: Position N+1 [d] -> [t] [do] [tu] ... ... [douglas] [turnbull]
So a phrase query for “do turnbull” will hit on this document in the same location in this document as the phrase “douglas turnbull”. What a nice characteristic! So we’ll be able to match the abbreviated “D. Turnbull” by simply employing this filter in our analysis chain as follows:
<field name="AuthorsPre" type="AuthorsPrefix" indexed="true" multiValued="true"/>
<copyField source="Authors" dest="AuthorsPre"/>
<fieldType name="AuthorsPrefix" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory" /> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="200" side="front"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory" /> </analyzer> </fieldType>
Let’s walk this analysis chain through an example query on the AuthorsPre field to see how it works. The name “Douglas G. Turnbull” gets indexed as follows
Position N N+1 N+2 Standard Tokenizer: [Douglas] [G] [Turnbull] Lower Case Filter: [douglas] [G] [turnbull] NGram Filter: [d] [g] [t] [do] [tu] …(etc)… …(etc)… [douglas] [turnbull]
Now let’s take a user’s query: “D G Turnbull”. This gets tokenized rather simply using the query analysis chain to [d] [g] [turnbull]. These tokens occur in the index everywhere that the name Douglas G. Turnbull was indexed! (and also where David G. Turnbull was indexed)
Putting it together
Ok time to step up your game. Now the user has entered “Turnbull, D.” into the search box. What now? Well we simply apply the same trick we used before. Instead of searching for:
- AuthorsPre:”Turnbull, D.”
Let’s turn this into a proximity query:
- AuthorsPre:”Turnbull D.”~3
There’s lots of bits of knowledge coalescing at once to see how this works. First, as we discussed, all the generated ngrammed tokens share the same position in the token stream. So [D.] and [Douglas] are in the same position in the indexed document. This means that when position matters (like in a phrase query) “D. Turnbull” and “Douglas Turnbull” will both match a document containing “Douglas Turnbull”. Our proximity search, on the other hand, gives Solr some freedom in rearranging the tokens a bit to satisfy the match, allowing a bit of freedom in how names are arranged – resulting in a match on many rearranged and abbreviated forms of our name.
This is just a start
This is a pretty good start, but search is a heuristic that can always be improved. There’s lots of work left to do to get this perfect! Aside from the numerous conflicting cultural conventions I’m sure I’ve violated, there’s plenty of exercises left to the reader:
- How do you prefer exact matches over prefixed names?
- How could you identify which query tokens are meant for middle names vs first names vs last names?
- The standard tokenizer breaks up hyphenated names, how could you preserve the hyphenated name as one token?
- Many name abbreviations are not simply prefixes of the original name. For example, how would you match “Thomas” when the user enters “Tom”?
So there’s some fun puzzles to up your Solr game! And if you really want to up your Solr game on these and other problems, be sure to check out our Solr Training!
Share your thoughts! Hopefully this article can get you started building a reasonable name search. Have you had issues with name search in the past? How have you approached these problems with Solr? And of course contact us for help with your name search problems!