Roll your own search analytics

Scott Stults, January 12, 2018

I recently completed a fairly straightforward search project in which we were replacing a legacy system with Solr. The goal for the first release was simply to make sure that we had the same recall in Solr as in the legacy system, with two secondary goals: updating the UI and maintaining search speed. We were also able to convince the stakeholders that this first iteration should have at least rudimentary click metrics built in, and that they should live right in Solr.

This isn’t the most common practice with Solr. Usually search metrics are harvested from logs and put into something like the ELK stack (Elasticsearch Logstash Kibana) or sent to Google Analytics. There’s certainly nothing wrong with that approach, but for me it adds many more hoops I’ve got to jump through in order to answer questions about how well Solr is interacting with its users.

But while storing click traffic in Solr is a rare find from my (consultant’s) perspective, it isn’t something new in the Solr world. In fact, this approach is part of the “batteries included” philosophy behind Lucidworks Fusion, and it’s come up several times on the Solr mailing list. It’s certainly useful for measuring things like how deep a user has to go in search results, but it can also be used to drive simple collaborative filtering or popularity boosting.

So with that in mind, let’s look at what we chose to measure in my project.

The main set of fields were:

  • date: When the event occurred
  • type: Some sort of tag indicating the type of event, like:
    • NewQuery (the user started from scratch with a blank Search box)
    • RefinedQuery (they edited the filters or added terms to the query)
    • ViewDetails (this is a click on one of the search results)
    • ZeroResultsQuery (the user queried but we didn’t find anything)
  • user: The logged in user ID

Since we were using Solr for this project we added a few dynamic fields for search events:

  • filterChanged_b: Was a filter changed
  • pageChanged_b: Was the page changed
  • numfound_i: How many docs were retrieved
  • page_i: The current page of search results
  • querytime_i: How long did the query take
  • query_s: The query string
  • filter_ss: Any applied filters

(If you’re unfamiliar with the notation, *_b is a Boolean, *_i is an integer, *_s is a string, and *_ss is a multivalued string. These are default dynamic field settings in Solr.)

We also added fields specific to events triggered by clicking on one of the search results:

  • docid_s: The ID of a clicked document
  • offset_i: The offset of the document in the search results
  • clicked_b: Did the user click the link in search results to get to the page (there are other ways to get there)
  • title_s: The title of the document (useful for reports)
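
Putting those fields together, a single ViewDetails event might look like the sketch below. The field names come from the lists above; the values, the user ID, and the document details are purely illustrative:

```python
import json
from datetime import datetime, timezone

# A hypothetical ViewDetails event built from the fields described above.
# All of the values here are made up for illustration.
event = {
    "date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "type": "ViewDetails",
    "user": "jdoe",
    "query_s": "rebate program",
    "filter_ss": ["state:CA", "year:2017"],
    "numfound_i": 42,
    "page_i": 1,
    "offset_i": 3,          # fourth result on the page
    "docid_s": "doc-1138",
    "clicked_b": True,
    "title_s": "Rebate Program Overview",
}

# Solr's update handler accepts a JSON array of documents.
print(json.dumps([event]))
```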

These values were sent to a distinct endpoint in our application asynchronously via JavaScript, which added a record in Solr after some value sanitization (an XSS attack in our analytics dashboard would be suboptimal).
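
The sanitization step can be as simple as HTML-escaping the free-text fields before indexing. A minimal sketch; which fields count as free text is an assumption on my part:

```python
import html

# Fields that echo user input back into the dashboard (assumed set).
FREE_TEXT_FIELDS = {"query_s", "title_s"}

def sanitize(event):
    """HTML-escape user-supplied strings so a crafted query can't
    inject markup into the analytics dashboard."""
    return {
        k: html.escape(v) if k in FREE_TEXT_FIELDS and isinstance(v, str) else v
        for k, v in event.items()
    }

clean = sanitize({"query_s": "<script>alert(1)</script>", "numfound_i": 0})
print(clean["query_s"])  # &lt;script&gt;alert(1)&lt;/script&gt;
```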

After we started getting a nice log of user interaction with our search we needed to share it with the team through a dashboard. Even though the “documents” had a simple structure and the answers we wanted were simple Solr queries to construct, only a few of us on the team spoke that language. I was going to provide an initial set of queries, but it was important that the rest of the team be able to write their own. Solr can parse a basic dialect of SQL, so we planned our dashboard around that. Apache Zeppelin fit the bill nicely, and Eric Pugh wrote a nice article about using it with Spark and Solr a couple of years ago.

But for whatever reason I’ve never been able to get Zeppelin to install interpreters through Maven on Windows, which is the installation method in the Solr docs. What I ended up doing was adding a new JDBC interpreter by pointing to a local jar, as documented in the Zeppelin docs. Even then I had to copy all of the SolrJ client libraries into Zeppelin’s local-repo directory. Our client’s server was Linux, so this headache was localized to just my development machine.

Once Zeppelin was running the real fun began. If you’re an accomplished SQL writer you’re probably going to get immediately frustrated, like I did. The punishment for a syntax error is a Java stack trace that fills your paragraph area, so experimentation is discouraged. One thing that tripped me up is that Solr query syntax creeps into unexpected places. For example, to write a constraint over a field in a WHERE clause you do something like this:

FROM events
WHERE type = '(NewQuery NewAdvancedQuery RefinedQuery)'

Or to constrain over a range you do this:

FROM events
WHERE `date` = '[NOW/DAY-30DAY TO *]'
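
Since these predicates are really Solr query strings in disguise, it can help to generate them with small helpers instead of retyping them. A sketch; the helper names are my own:

```python
def type_in(*types):
    # Solr's SQL handler treats the quoted value as a Lucene query,
    # so a space-separated list acts as an OR over the field.
    return "'({})'".format(" ".join(types))

def last_n_days(n):
    # A Solr date-math range: from midnight n days ago to now.
    return "'[NOW/DAY-{}DAY TO *]'".format(n)

where = "WHERE type = {} AND `date` = {}".format(
    type_in("NewQuery", "RefinedQuery"), last_n_days(30))
print(where)
```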

But it’s not all bad. If your data doesn’t require too much manipulation Zeppelin is still a quick way to build dashboards, and our events collection was certainly simple. These are the initial queries we came up with:

Top 20 Zero Results Queries Over 30 Days

SELECT query_s, COUNT(*) AS `count`
FROM events
WHERE type='ZeroResultsQuery'
AND `date` = '[NOW/DAY-30DAY TO *]'
GROUP BY query_s
ORDER BY `count` DESC
LIMIT 20

Top 20 Slowest Queries Over 7 Days

SELECT query_s, querytime_i
FROM events
WHERE `date` = '[NOW/DAY-7DAY TO *]'
ORDER BY querytime_i DESC
LIMIT 20

Query Time Over Year

SELECT querytime_i, `date`, query_s
FROM events
WHERE `date` = '[NOW/DAY-365DAY TO *]'

Search Types Over 30 Days

SELECT type, COUNT(*) AS `count`
FROM events
WHERE type = '(NewAdvancedQuery NewQuery RefinedQuery)'
AND `date` = '[NOW/DAY-30DAY TO *]'
GROUP BY type

Average Click Rank Over 7 Days

SELECT AVG(offset_i) AS average
FROM events
WHERE offset_i > 0
AND `date` = '[NOW/DAY-7DAY TO *]'
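
Outside of Zeppelin, the same statements can be sent straight to Solr’s /sql handler. Here’s a minimal sketch using only the standard library; the host, port, and the `events` collection name are assumptions:

```python
from urllib.parse import urlencode
from urllib.request import Request

def sql_request(stmt, collection="events", host="localhost", port=8983):
    """Build a POST request for Solr's /sql handler.

    The statement goes in the `stmt` form parameter; pass the result
    to urllib.request.urlopen() to actually run it against Solr.
    """
    url = "http://{}:{}/solr/{}/sql".format(host, port, collection)
    body = urlencode({"stmt": stmt}).encode("utf-8")
    return Request(url, data=body, method="POST")

req = sql_request("SELECT AVG(offset_i) AS average FROM events "
                  "WHERE offset_i > 0 AND `date` = '[NOW/DAY-7DAY TO *]'")
print(req.full_url)
```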

As you can see we put in a mix of performance and relevance metrics to watch, and some of them even look good as charts. I could write a set of plain Solr queries to get similar results, but keeping them organized in notebook fashion, sharing them with my team, and running them all with one click are why I’ll keep coming back to Zeppelin.

While you were reading this you probably had your own ideas about what to track and how to visualize, or you’re doing something completely different in your own shop. I’d love to hear about it! Reach out to me on Twitter or drop me an email.
