Every one of us at OSC looks forward to Lucene Revolution. It’s one of the few conferences we attend where everyone understands search at a deep level. That means we get to hear all the cool things people are doing with and to our favorite search engine. A lot of that has to do with the community voting on the presentations beforehand, because just reading the presentation abstracts made us, months ahead, impatient to fly to Austin and start geeking out!
Will Hayes - CEO, Lucidworks This year Will Hayes kicked things off with an interesting take on the Gartner Hype Cycle. Essentially his message was that Big Data is transitioning from the peak of inflated expectations to the slope of enlightenment, but must first cross the trough of disillusionment. The fastest way to make that crossing is to generate valuable insights from your Big Data effort, and the fastest way to generate those insights is through Search.
Grant Ingersoll - CTO & Founder, Lucidworks Search best practices, relevancy, video games This was cool! As a follow-up to Will’s call for good search practices, Grant highlighted what those practices are. The fun part is he stuck in pictures of classic video games like Pitfall, Oregon Trail, and Ms. Pac-Man. One point that really resonated with us was the quest for Relevancy, which my colleague went into more depth on.
Matt Pfeil - CCO & Founder, DataStax Lastly, Matt from DataStax drove home the point that the amount of data really is exploding, and we need search to handle it. DataStax integrated Solr into Cassandra, which yields a very robust, distributed, searchable data store.
On to the talks!
Where Search Meets Machine Learning Joaquin Delgado - Director of Engineering, Verizon Diana Hu - Data Science Lead, Verizon This was a high-level talk about how Verizon integrates machine learning into their search platform, and why. It didn’t get too deep into the nitty-gritty, but it was a good overview of how one would go about doing so in a production system.
Scorer’s Diversity Phase 2.0 Mikhail Khludnev - Search Engineer, Grid Dynamics, Inc. How is Lucene able to look at billions of documents and make snap judgments about their applicability to a query? That’s the scorer, and Mikhail went into a couple of the data structures underpinning Lucene’s scorers. The main one is skip lists.
Each search term gets analyzed down to a token, and you can picture the documents matching that token as a long series of zeros and ones in a bit set. The order of bits is important because a one in the first bit means the token appears in the first document. When you do a search for “termA OR termB” you’re really just doing a bitwise OR operation, which computers are really fast at. Likewise with AND. The trouble is, these lists are split among several index segments which might even reside in different nodes. That’s where skip lists come in. They enable efficient traversal of the bit set so that you get your search results fast.
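To make that picture concrete, here is a minimal Python sketch of the idea as described above. It is a toy illustration, not Lucene's actual code: the function names, the doc ids, and the fixed `skip` interval are all made up for this example, and real Lucene postings are stored in a far more elaborate, compressed form.

```python
def postings_to_bitset(doc_ids):
    """Pack doc ids into one integer: bit N is 1 if the token appears in document N."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def bitset_to_doc_ids(bits):
    """Unpack a bit set back into a sorted list of doc ids."""
    doc_ids = []
    doc_id = 0
    while bits:
        if bits & 1:
            doc_ids.append(doc_id)
        bits >>= 1
        doc_id += 1
    return doc_ids


def intersect_with_skips(short, long, skip=4):
    """AND two sorted doc-id lists, hopping `skip` entries at a time through `long`.

    The hop plays the role of a skip pointer: whenever the entry `skip`
    positions ahead is still <= the doc id we want, everything in between
    can be passed over without being examined.
    """
    hits = []
    j = 0
    for target in short:
        # Big hops while the landing spot does not overshoot the target.
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # Short linear scan to finish the approach.
        while j < len(long) and long[j] < target:
            j += 1
        if j < len(long) and long[j] == target:
            hits.append(target)
    return hits


# Hypothetical postings: the documents each term appears in.
term_a = postings_to_bitset([0, 2, 5, 7])
term_b = postings_to_bitset([2, 3, 7, 9])

or_hits = bitset_to_doc_ids(term_a | term_b)   # [0, 2, 3, 5, 7, 9]
and_hits = bitset_to_doc_ids(term_a & term_b)  # [2, 7]
skip_hits = intersect_with_skips([2, 7], [2, 3, 7, 9])  # [2, 7]
```

The bitwise version touches every bit, while the skipping version only pays for the entries it lands on, which is the appeal when one list is much longer than the other.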
Implementing Conceptual Search in Solr using LSA and Word2Vec Simon Hughes - Chief Data Scientist, Dice.com Latent Semantic Analysis is a handy way to derive an abstract concept from a bag of words. What you end up with is a distance measure of how relevant a term is to this concept. What Simon & co. did at Dice was to add this abstract concept to the indexed document so that a search like “java” would get a hit on “j2ee”. OSC relevancy guru Doug Turnbull and OSC alumnus John Berryman gave a talk on this a couple of years ago, and it was awesome of Simon to give them a shoutout during the presentation. Back then they tried to use Mahout, which wasn’t quite compatible with Lucene, so they ended up using a Python library. Had Word2Vec been available things would’ve gone much smoother, and it was cool to see this approach actually in use.
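For a rough feel of how index-time concept expansion could work, here is a small Python sketch. It is not Dice's implementation: the three-dimensional vectors are invented toy values standing in for whatever LSA or Word2Vec would learn, and `related_terms`, `index_with_concepts`, and the 0.9 threshold are hypothetical names and settings chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up vectors; a real model would learn these from a corpus.
TOY_VECTORS = {
    "java": [0.9, 0.1, 0.0],
    "j2ee": [0.8, 0.2, 0.1],
    "nurse": [0.0, 0.1, 0.9],
}


def related_terms(term, vectors, threshold=0.9):
    """Terms whose vectors sit close enough to `term` to count as the same concept."""
    if term not in vectors:
        return []
    return sorted(
        other
        for other, vec in vectors.items()
        if other != term and cosine(vectors[term], vec) >= threshold
    )


def index_with_concepts(tokens, vectors, threshold=0.9):
    """Return the document's tokens plus any conceptually related terms."""
    expanded = set(tokens)
    for token in tokens:
        expanded.update(related_terms(token, vectors, threshold))
    return expanded


# A job posting that only mentions "j2ee" also becomes findable by "java".
indexed_terms = index_with_concepts(["j2ee", "developer"], TOY_VECTORS)
matches_java_query = "java" in indexed_terms
```

Expanding at index time keeps the query side untouched: a plain search for "java" matches because the related term was stored alongside the document.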
Learning to Rank in Solr Michael Nilsson - Software Engineer, Bloomberg LP Diego Ceccarelli - Software Engineer, Bloomberg LP This was an incredibly popular presentation — so much so that I wasn’t able to stand in the back without blocking somebody’s view! So I’m going to have to wait and hope that the video of this makes its way online because I excused myself back out into the hallway. As an excellent consolation prize I wandered into a conversation between Yonik Seeley and Doug Turnbull about subfacets.
Stump the Chump Chris Hostetter One of the great traditions of Lucene Revolution is Stump the Chump. That’s where a conference full of Lucene/Solr experts try to stump Chris Hostetter on some bizarre feature or bug or how-to. Cassandra Targett bravely MC’d, with committers Steve Rowe, Mark Miller, and Upayavira serving as judges. In the end I think Chris prevailed, but if not it was close.
Conference Party @ The Iron Cactus After the first day we all reconvened at The Iron Cactus to sample some of the local flavor. Even on a Wednesday night Austin’s 6th Street was hopping. While there I got a chance to catch up with David Smiley and colleagues from Bloomberg, as well as talk with Will Hayes about Lucidworks Fusion. I must be getting old because these days I’m finding it hard to hold a decent conversation while shouting over bar music.
What I missed There was so much going on this year that I really felt like I missed more than I saw. These are some of the presentations I would’ve seen if I could be in multiple places at once:
(in no particular order)
Properly integrate ManifoldCF with Apache Solr Aurélien Mazoyer - Search Expert & Cofounder, France Labs Aurélien and I spoke before the conference about patent indexing and I did get the opportunity to meet him at the conference, but I didn’t get to see his presentation. Which is a shame because OSC recently did a Solr/ManifoldCF project and my very first conference presentation was on using ManifoldCF to implement multi-level security in Solr.
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine Trey Grainger - Director of Engineering, Search & Recommendations, CareerBuilder.com Trey is continually pushing Solr to do newer and cooler things.
Lucene/Solr Spatial in 2015 David Smiley - Solr Search Consultant & Developer, D W Smiley LLC The last time I saw David talk he greatly expanded my concept of spatial search by cluing me into the fact that it works for any numeric range, like date ranges.
Searching for Better Code Grant Ingersoll - CTO & Co-founder, Lucidworks With all of the insights we’re able to tease out of document collections using search, applying those techniques to source code should be a logical progression. IDEs do this to help you quickly navigate to code symbols, and Krugle does this at a much larger scale.
How to Build a Smart Search Engine with Intent Detection Rama Yannam - Senior Software Development Manager, Bank of America Viju Kothuvatiparambil - Lead Engineer - Search, Bank of America Another new-ish technology at the conference was Intent Detection (Learning to Rank is the other). Several of the other presentations I saw mentioned this and it was a hot topic around the lunch table.
Building Smarter Search Apps Using Built-in Knowledge Graphs and Query Introspection Ted Sullivan - Senior Solutions Architect, Lucidworks Ted’s been blogging about this concept, and it’s a very smart approach to increasing search accuracy.
Streaming Aggregation in Solr: New Horizons for Search Erick Erickson - Lucene/Solr Committer Aggregations are indispensable in analytics, and until recently Elasticsearch has had an edge on Solr in this regard.
What’s next Only a fraction of the excellent talks are called out above. Lucidworks does a great job of posting Lucene Revolution slides and videos, and they’ve already started posting the slides from Austin. Hopefully we’ll soon see more blogs, more videos, and more slides on these topics.