I recently completed a fairly straightforward search project in which we were replacing a legacy system with Solr. The goal for the first release was simply to make sure that Solr matched the legacy system's recall, with two secondary goals: updating the UI and maintaining search speed. We were also able to convince the stakeholders that this first iteration should have at least rudimentary click metrics built in, and that they should live right in Solr.
This isn’t the most common practice with Solr. Usually search metrics are harvested from logs and fed into something like the ELK stack (Elasticsearch, Logstash, Kibana) or sent to Google Analytics. There’s certainly nothing wrong with that approach, but for me it adds several more hoops to jump through before I can answer questions about how well Solr is serving its users.
But while storing click traffic in Solr is a rare find from my (consultant’s) perspective, it isn’t something new in the Solr world. In fact, this approach is part of the “batteries included” philosophy behind Lucidworks Fusion, and it has come up several times on the Solr mailing list. It’s certainly useful for measuring things like how deep a user has to go into search results, but it can also be used to drive simple collaborative filtering or popularity boosting.
So with that in mind, let’s look at what we chose to measure in my project.
The main set of fields were:
date: When the event occurred
type: Some sort of tag indicating the type of event, like:
- NewQuery (the user started from scratch with a blank Search box)
- RefinedQuery (they edited the filters or added terms to the query)
- ViewDetails (this is a click on one of the search results)
- ZeroResultsQuery (the user queried but we didn’t find anything)
user: The logged-in user ID
Since we were using Solr for this project we added a few dynamic fields for search events:
filterChanged_b: Whether a filter was changed
pagedChanged_b: Whether the page was changed
numfound_i: How many documents were retrieved
page_i: The current page of search results
querytime_i: How long the query took
query_s: The query string
filter_ss: Any applied filters
(If you’re unfamiliar with the notation,
*_b is a Boolean,
*_i is an integer,
*_s is a string, and
*_ss is a multivalued string. These suffixes come from dynamic-field definitions in Solr’s default schema.)
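Concretely, the suffixes are matched by dynamic-field rules along these lines in Solr’s default managed-schema (exact type names vary by Solr version; this sketch follows the Solr 7+ defaults):

```xml
<dynamicField name="*_b"  type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_i"  type="pint"    indexed="true" stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true" stored="true"/>
<dynamicField name="*_ss" type="strings" indexed="true" stored="true" multiValued="true"/>
```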
We also added fields specific to events triggered by clicking on one of the search results:
docid_s: The ID of a clicked document
offset_i: The offset of the document in the search results
clicked_b: Did the user click the link in search results to get to the page (there are other ways to get there)
title_s: The title of the document (useful for reports)
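Putting the two field groups together, a click event is just a small flat document. Here’s a minimal sketch of building one in Python, using the field names above (the helper function, its signature, and the sample values are mine, not from the project; the commented-out post to Solr’s JSON update handler assumes a collection named events on a local Solr):

```python
import json
from datetime import datetime, timezone

def make_click_event(user, query, doc_id, offset, title, num_found, page=1):
    """Build a ViewDetails event document using the dynamic-field
    suffix conventions described above (_b, _i, _s)."""
    return {
        "date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "type": "ViewDetails",
        "user": user,
        "query_s": query,
        "numfound_i": num_found,
        "page_i": page,
        "docid_s": doc_id,
        "offset_i": offset,
        "clicked_b": True,
        "title_s": title,
    }

event = make_click_event("jdoe", "solar panels", "doc-42", 3,
                         "Installing Solar Panels", 117)

# Indexing it is a single POST to the JSON update handler, e.g.:
#   import urllib.request
#   urllib.request.urlopen(urllib.request.Request(
#       "http://localhost:8983/solr/events/update?commit=true",
#       data=json.dumps([event]).encode(),
#       headers={"Content-Type": "application/json"}))
print(json.dumps(event, indent=2))
```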
After we started getting a nice log of user interactions with our search, we needed to share it with the team through a dashboard. Even though the event “documents” were simple and the answers we wanted were easy Solr queries to construct, only a few of us on the team spoke that language. I was going to provide an initial set of queries, but it was important that the rest of the team be able to write their own. Solr can parse a basic dialect of SQL, so we planned our dashboard around that. Apache Zeppelin fit the bill nicely, and Eric Pugh wrote a nice article about using it with Spark and Solr a couple of years ago.
But for whatever reason I’ve never been able to get Zeppelin to install interpreters through Maven on Windows, which is the installation method in the Solr docs. What I ended up doing was adding a new JDBC interpreter by pointing to a local jar, as documented in the Zeppelin docs. Even then I had to copy all of the SolrJ client libraries into Zeppelin’s local-repo directory. Our client’s server was Linux, so this headache was localized to just my development machine.
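For anyone hitting the same wall, the JDBC interpreter settings that eventually worked looked roughly like this (hostname, ZooKeeper port, and jar path are placeholders for your environment; the driver class ships with SolrJ):

```
default.driver   org.apache.solr.client.solrj.io.sql.DriverImpl
default.url      jdbc:solr://localhost:9983?collection=events
artifact         /path/to/solr-solrj-x.y.z.jar
```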
Once Zeppelin was running the real fun began. If you’re an accomplished SQL writer you’re probably going to get immediately frustrated like I did. The punishment for a syntax error is a Java stack trace that fills your paragraph area, so experimentation is discouraged. One thing that tripped me up is that Solr query syntax creeps into unexpected places. For example, to write a constraint over a field in a WHERE clause you do something like this:
SELECT type FROM events WHERE type = '(NewQuery NewAdvancedQuery RefinedQuery)'
Or to constrain over a range you do this:
SELECT type FROM events WHERE `date` = '[NOW/DAY-30DAY TO *]'
But it’s not all bad. If your data doesn’t require too much manipulation Zeppelin is still a quick way to build dashboards, and our events collection was certainly simple. These are the initial queries we came up with:
Top 20 Zero Results Queries Over 30 Days
SELECT query_s, COUNT(*) AS `count` FROM events WHERE type='ZeroResultsQuery' AND `date` = '[NOW/DAY-30DAY TO *]' GROUP BY query_s ORDER BY `count` DESC LIMIT 20
Top 20 Slowest Queries Over 7 Days
SELECT query_s, querytime_i FROM events WHERE `date` = '[NOW/DAY-7DAY TO *]' ORDER BY querytime_i DESC LIMIT 20
Query Time Over Year
SELECT querytime_i, `date`, query_s FROM events WHERE `date` = '[NOW/DAY-365DAY TO *]' ORDER BY `date` DESC
Search Types Over 30 Days
SELECT type FROM events WHERE type = '(NewAdvancedQuery NewQuery RefinedQuery)' AND `date` = '[NOW/DAY-30DAY TO *]'
Average Click Rank Over 7 Days
SELECT avg(offset_i) as average FROM events WHERE offset_i > 0 AND `date` = '[NOW/DAY-7DAY TO *]'
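Outside Zeppelin, the same statements can be run against Solr’s /sql request handler directly, which is handy for scripting reports. A minimal sketch in Python, assuming a collection named events on a local Solr (the helper function name is mine):

```python
import json
import urllib.parse
import urllib.request

SOLR_SQL_URL = "http://localhost:8983/solr/events/sql"  # placeholder host/collection

def sql_request(stmt, url=SOLR_SQL_URL):
    """Build a POST request for Solr's /sql handler. The handler takes the
    statement in a 'stmt' form parameter and streams back JSON tuples,
    the last of which is an EOF marker."""
    data = urllib.parse.urlencode({"stmt": stmt}).encode()
    return urllib.request.Request(url, data=data)

req = sql_request(
    "SELECT query_s, COUNT(*) AS `count` FROM events "
    "WHERE type='ZeroResultsQuery' AND `date` = '[NOW/DAY-30DAY TO *]' "
    "GROUP BY query_s ORDER BY `count` DESC LIMIT 20"
)

# Uncomment to run against a live Solr; each tuple in result-set/docs is a row:
# with urllib.request.urlopen(req) as resp:
#     rows = json.load(resp)["result-set"]["docs"]
#     print(rows[:-1])  # drop the trailing EOF tuple
```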
As you can see we put in a mix of performance and relevance metrics to watch, and some of them even look good as charts. I can write a set of Solr queries to get similar results. But keeping them organized in notebook-fashion, sharing them with my team, and running them all with one click are why I’ll keep coming back to Zeppelin.
While you were reading this you probably had your own ideas about what to track and how to visualize, or you’re doing something completely different in your own shop. I’d love to hear about it! Reach out to me on Twitter or drop me an email.