
Two Search Conferences in Two Weeks Was Too Informative

This year I experienced the conference equivalent of a lunar eclipse: two search conferences in two weeks, each two hours away from my hometown of Charlottesville, Virginia! Enterprise Search Summit (ESS) and LuceneRevolution (LR) share many similarities. Both have changed their names in the last year, Enterprise Search Summit expanding its focus to become the Enterprise Search & Discovery Summit, and LuceneRevolution billing itself as the Solr/Lucene Revolution! Ironically, both still use their original domain names. Both are also overlapping more in their focus on open source search, with Solr and ElasticSearch being frequent topics of conversation at ESS.

However, they both continue to have different audiences. ESS is focused on the business user, and how to provide great knowledge management by crawling the enterprise's internal resources is the most common topic of discussion. LR is really focused on the people building applications that are powered by search, with the expectation that you are going to heavily customize your search technology. It is a very geeky audience ;-).

1) The death of the commercial search engine has been greatly exaggerated.

A couple of years ago, as a Solr guy, I was riding high on the proof that open source was the best solution for search. We had toppled the Autonomy Idol, FAST had switched to the slow lane of being SharePoint search, and Endeca was rapidly becoming “remember Endeca? Good for e-commerce search”. Since then, I've come to miss the integrated enterprise search engine, because while the core technologies of open source search based on Lucene are fantastic, gluing everything around that technology together is a pain. Connectors. Security models. Homemade tuning tools. We're now seeing the rejuvenation of the enterprise search engine product, with LucidWorks, ElasticSearch, SearchBlox, and Attivio all built on Lucene, as well as a large ecosystem of component providers (like us with our Quepid and Splainer relevancy tuning tools!). This may feel like a step backwards, but it actually acknowledges that the hardest part of open source is integrating separate projects into a single platform. One of the comments I heard at ESS was from a business user who said:

We need to be able to modify the documents shown, tweak rankings, and otherwise optimize the data for our users, but without it becoming a full scale IT project.

I couldn't recommend Solr to them without a bundle of custom work upfront. I think we will see more commercial options that are targeted at specific verticals and bundle in more than just a search engine. For example, a search engine for the medical field that understands what MeSH and PubChem are out of the box would be worth buying versus building. The era of one-size-fits-all commercial products is over, but there is plenty of room for new, more narrowly focused offerings.

I first drafted this conclusion last week, and since then, two blog posts touching on this topic have come out that are worth reading, and I think they validate my statement!

2) Aggregations! Aggregations! Aggregations!

While facets may be the original aggregation tool, it seems like today all the excitement is about counting things. Field collapsing (aka grouping) was introduced in mid-2011 as part of Solr 3.3. Since then, it seems like what everyone is trying to do is turn a search engine into a calculating engine. The line between a database, a search engine, and a Map/Reduce engine is getting blurrier and blurrier. Today, if my job were to figure out how to analyze log messages, I would have a raft of options: Splunk, Cassandra, Solr, ElasticSearch, or Hadoop. This is all part of the influence of the NoSQL movement on search, but I look forward to when we go back to arguing about whether stemming or lemmatization is the better solution 😉
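
To make the “calculating engine” idea concrete, here is a minimal sketch of counting log messages per severity level with an Elasticsearch terms aggregation. The index name ("logs"), field name ("level"), and localhost endpoint are all assumptions for illustration:

    # A minimal sketch: ask Elasticsearch for counts, not documents.
    # Assumes an index named "logs" with a "level" field that supports
    # aggregation (e.g., a keyword field).
    import json
    import urllib.request

    query = {
        "size": 0,  # skip the documents themselves; we only want counts
        "aggs": {
            "messages_per_level": {
                "terms": {"field": "level"}
            }
        }
    }

    req = urllib.request.Request(
        "http://localhost:9200/logs/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    for bucket in result["aggregations"]["messages_per_level"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])

Not a single document comes back here, just counts, which is exactly the database-like workload I mean.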

3) Two approaches for dealing with complex rules around text parsing.

I am seeing two approaches for dealing with really complex relevancy rules for parsing free text. One is the approach I'll call “Get inside Lucene's head and understand everything” and the other is “The world is complex, I give up”.

In the “Get inside Lucene's head and understand everything” camp was a presentation by Ramzi Alqrainy on Arabic Content with Solr. In it, he focused on understanding the rules of the Arabic language and tweaking the various knobs and dials in Solr to make it work. Another example is the way that Ernst & Young asks its users to rate results so that it can boost them: again, directly trying to influence how the search engine works. We have a customer doing trademark search that has the same kind of very specific rules. The challenge is that the more very specific rules you have, the more complex and harder to understand the interactions between them become. And you are the rules engine.
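
As a rough illustration of this knob-turning style, here is a hedged sketch of folding user ratings into ranking with Solr's edismax parser. The collection ("docs") and rating field ("user_rating") are made-up names for the example; the multiplicative boost parameter itself is standard edismax:

    # A hedged sketch: multiply each document's relevancy score by a
    # function of a hypothetical "user_rating" field, so highly rated
    # documents float toward the top.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "q": "trademark dispute",
        "defType": "edismax",
        # log(rating + 1) keeps a handful of five-star ratings from
        # completely drowning out textual relevancy
        "boost": "log(sum(user_rating,1))",
        "wt": "json",
    })

    url = "http://localhost:8983/solr/docs/select?" + params
    with urllib.request.urlopen(url) as resp:
        for doc in json.load(resp)["response"]["docs"]:
            print(doc.get("id"))

Each rule like this is easy to add on its own and hard to reason about in combination with all the others, which is exactly the trap.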

As a reaction to that, there is the “The world is complex, I give up” crowd. By that, I mean they attempt to build a model of the search relevancy problem, based on signals, and then use that model to influence the results. The interesting thing is that there are well-known techniques for measuring how well your model works, but you may not know exactly why any individual result was returned! Because the model is built off of inputs, you are two degrees separated from the specific rules. However, this black box approach has been proven to work well. Andrew Fast gave a really wonderful presentation that advocated for this approach. My only beef with his presentation is that he made building the model seem like a simple process! The presentation on Thoth, a Solr monitoring solution, touched on this approach as well. Thoth includes a model for figuring out which incoming queries are fast and which are slow, without trying to parse out the parts of the query. The Thoth ML module isn't public yet, but the idea of using a simple classifier for queries is.
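
Since the Thoth ML module isn't public, the following is only a guess at the general shape of such a classifier, not Thoth's implementation: shallow surface features of the raw query string, with no attempt to parse its parts, fed to an off-the-shelf classifier. All feature choices and training pairs below are invented for illustration:

    # A hedged sketch of a fast/slow query classifier in the spirit of
    # (but not taken from) Thoth. Requires scikit-learn.
    from sklearn.linear_model import LogisticRegression

    def features(query):
        # Deliberately shallow signals: no query parsing, just surface shape.
        return [
            len(query),           # longer queries tend to cost more
            query.count("*"),     # wildcards are expensive
            query.count(" OR "),  # big boolean disjunctions too
            query.count("~"),     # fuzzy/proximity operators
        ]

    # (query, was_slow) pairs, as might be harvested from search logs
    history = [
        ("ipod nano", False),
        ("law* OR legal* OR attorney*", True),
        ("smith~2 AND jones~2", True),
        ("red shoes", False),
    ]

    model = LogisticRegression().fit(
        [features(q) for q, _ in history],
        [slow for _, slow in history],
    )
    print(model.predict([features("appl* OR orang* OR banan*")]))

You can measure exactly how often a model like this is right, even though no single prediction comes with an explanation, which is the trade the “I give up” crowd is making.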


It was two very busy weeks, dare I say too exhausting even? However, it was great to catch up with the community and see where search is going.