In my previous post, I talked about implementing Suggest-As-You-Type using Solr. In this post I’ll cover a closely related piece of functionality called Search-As-You-Type.
Several years back, Google introduced an interesting new interface for their search called Search-As-You-Type. Basically, as you type in the search box, the result set is continually updated with better and better search results. By this point, everyone is used to Google’s Search-As-You-Type, but for some reason I have yet to see any of our clients use this interface. So I thought it would be cool to take a stab at this with Solr.
Let’s get started. First things first, download Solr and spin up Solr’s example.
cd solr-4.2.0/example
java -jar start.jar
Next click this link and POOF! you will have the following documents indexed:
- There’s nothing better than a shiny red apple on hot summer day.
- Eat an apple!
- I prefer a Grannie Smith apple over Fuji.
- Apricots is kinda like a peach minus the fuzz.
(Kinda cool how that link works, isn’t it?) Now let’s work on the strategy. Let’s assume that the user is going to search for “apple”. When the user types “a”, what should we do? In a normal index, there’s a buzillion things that start with “a”, so maybe we should just do nothing. Next comes “ap”. Depending upon how large your index is, two letters may narrow things down to a small enough set to start providing feedback to your users. The goal is to provide Solr with appropriate information so that it continuously comes back with the best results possible.
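The “do nothing until there is enough input” rule can be sketched in a few lines of Python. This is only an illustration: the function name `should_search` and the default threshold of two characters are assumptions made for the example, and the right threshold depends on the size of your index.

```python
def should_search(user_input: str, min_chars: int = 2) -> bool:
    """Return True once the input is long enough to be worth sending to Solr.

    min_chars is an assumed threshold, not something Solr prescribes.
    """
    return len(user_input.strip()) >= min_chars


for typed in ["a", "ap", "app"]:
    print(typed, should_search(typed))
```

With the default threshold, “a” is skipped, while “ap” and anything longer triggers a search.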
Attempt number 1: Directly query Solr with current search string.
User has typed “ap”, so you search for q=”ap”:
… and nothing is returned. But you’re nobody’s fool. A quick glance at the schema and you see that the “title” field (which we’ve indexed these sentences into) is tokenized with the StandardTokenizer, so of course “ap” won’t match either of the plausible tokens “apricot” or “apple”. Therefore:
Attempt number 2: N-gram Tokenize the documents.
Just add this fieldType to your schema,
<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>
set the “title” field to be of type=”text_general_edge_ngram”, restart, reindex, and voilà! The queries above will now work just fine.
You can do this… but I don’t like it. Why? Because the average English word is 5 letters long. And if you discount stopwords, probably up to 6 or 7. With minGramSize="2", a six-letter word is indexed as five tokens instead of one. So BAM! Your index is now substantially larger than it was before. Let’s try something else.
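To get a feel for how many extra tokens the edge n-gram approach creates, here is a rough Python approximation of that index-time analysis. It is a sketch, not Solr’s actual implementation: the function names are made up for the example, and the regular expression only mimics the lowercase tokenizer for plain ASCII text.

```python
import re
from typing import List


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 15) -> List[str]:
    """Front edge n-grams of a token, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_tokens(text: str) -> List[str]:
    """Lowercase, split on non-letters (ASCII only), then expand each word."""
    words = re.findall(r"[a-z]+", text.lower())
    grams: List[str] = []
    for word in words:
        grams.extend(edge_ngrams(word))
    return grams


print(index_tokens("Eat an apple!"))
# ['ea', 'eat', 'an', 'ap', 'app', 'appl', 'apple']
```

Three words turn into seven indexed tokens, and the ratio gets worse as the words get longer.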
Attempt number 3: Use wildcards.
Rather than inflating the index with n-grams, let’s just add an asterisk “*” at the end of the user’s search string. This wildcard will match zero or more additional characters at the end of the user’s search term. So when the user types “ap”, the search string becomes “ap*”:
That’s more like it! We can see that we are now catching all 4 documents because of the terms apple and apricot. And as the user keeps typing, the appropriate thing happens. When the user adds an extra “p”, the search becomes “app*”:
And with that the document for apricot appropriately disappears.
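On the client side, the wildcard approach amounts to appending “*” to whatever the user has typed before sending it to Solr. The Python sketch below shows one way to build such a request URL. The endpoint `http://localhost:8983/solr/collection1/select` is an assumption based on the default Solr 4.x example setup, and the helper name is invented for illustration; adjust both to your installation.

```python
from urllib.parse import urlencode

# Assumed endpoint of the stock Solr 4.x example core; change as needed.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def wildcard_query_url(user_input: str) -> str:
    """Build a select URL that searches for the typed prefix plus a trailing '*'."""
    params = {"q": user_input.strip() + "*", "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


print(wildcard_query_url("ap"))
print(wildcard_query_url("app"))
```

Note that `urlencode` percent-encodes the asterisk as `%2A`, which Solr decodes back to `*` on arrival. The sketch does not escape other query-syntax characters the user might type.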
So we have solved the problem, right? … Not so fast. What happens when the user has typed the complete word “apple”? In this case the search becomes “apple*”:
Notice anything fishy? I do… Why on earth is “Eat an apple!” the second result? This sentence is only three terms long, so it should be sorted to the top. Similarly, the first result is the longest sentence in the bunch, so it should be sorted to the bottom! As proof that this is the “proper” order of things, let’s do a plain search for apple:
As you can see, the results are now returned in order of sentence length, as they should be. But why the confusing arrangement? It’s because wildcard queries are not scored by relevance at all! That’s right! Every match gets the same constant score, so the strange ordering from before is simply the order in which the documents were indexed. There’s gotta be something we can do.
Attempt number 4: Use wildcards and the last token in the search.
Here’s an idea: what if we just stick both the wildcard and the user’s current input into the search? So, for instance, if the user types “appl” then the query is q=appl appl*. And if the user types “apple” then the query is q=apple apple*. In the first case the token “appl” matches nothing, but the wildcard “appl*” at least returns reasonable results. Take a look:
In the second case the search term “apple” matches and the results all come back correctly ordered:
The only thing left is the finishing touches. If the user types in multiple terms, for instance “red apple”, then it’s only the final term that should be queried with and without the wildcard.
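Putting the pieces together, the query string can be built by keeping every term as typed and then repeating only the last term with a trailing wildcard. Here is a minimal Python sketch of that rule; the function name is hypothetical, and the code makes no attempt to escape special query-syntax characters in the user’s input.

```python
def search_as_you_type_query(user_input: str) -> str:
    """Keep all typed terms and add a wildcard copy of the final term only."""
    terms = user_input.split()
    if not terms:
        return ""  # nothing typed yet, so nothing to search for
    last = terms[-1]
    return " ".join(terms + [last + "*"])


print(search_as_you_type_query("appl"))       # appl appl*
print(search_as_you_type_query("red apple"))  # red apple apple*
```

The plain copy of the last term lets a completed word contribute a real relevance score, while the wildcard copy keeps results flowing while the word is still being typed.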
Of course, the strategy presented above is nothing without a swanky AJAXy UI. As a predominantly back-end developer, here’s the best I have to offer.
It’s not much to look at, but save it to your desktop and you’ll immediately be able to use it with the demo above. Take a look at the code and you’ll see that we’re implementing the strategy above.
Additionally this little example makes use of a really nice auto-suggest pattern that I described in my previous post.