Roll your own search analytics - OpenSource Connections

January 12, 2018 Scott Stults
Category: Uncategorized

I recently completed a fairly straight-forward search project in which we were replacing a legacysystem with Solr. The goal for the first release was to just make sure that we had the same recallin Solr as in the legacy system, with two secondary goals of updating the UI and maintaining search speed. We were also able to convince the stakeholders that this first iteration should have at leastrudimentary click metrics built in, and they should live right in Solr.

This isn’t the most common practice with Solr. Usually search metrics are harvested from logs and put intosomething like the ELK stack (Elasticsearch Logstash Kibana) or sent to Google Analytics. There’scertainly nothing wrong with that approach, but for me it adds many more hoops I’ve got to jump throughin order to answer questions about how well Solr is interacting with its users.

But while storing click traffic in Solr is a rare find from my (consultant’s) perspective, isn’t something new in the Solr world. In fact, this approach is part of the “batteries included” philosophy behind Lucidworks Fusion and it’s come up several times in the Solr mailing list. Certainly it’s useful formeasuring things like how deep a user has to go in search results, but it can also be used to drivesimple collaborative filtering or popularity boosting.

So with that in mind, let’s look at what we chose to measure in my project.

The main set of fields were:

date: When the event occurred
type: Some sort of tag indicating the type of event, like:
- NewQuery (the user started from scratch with a blank Search box)
- RefinedQuery (they edited the filters or added terms to the query)
- ViewDetails (this is a click on one of the search results)
- ZeroResultsQuery (the user queried but we didn’t find anything)
user: The logged in user ID

Since we were using Solr for this project we added a few dynamic fields for search events:

filterChanged_b: Was a filter changed
pagedChanged_b: Was the page changed
numfound_i: How many docs were retrieved
page_i: The current page of search results
querytime_i: How long did the query take
query_s: The query string
filter_ss: Any applied filters

(If you’re unfamiliar with the notation, *_b is a Boolean, *_i is an integer, *_s is a string, andand *_ss is a multivalued string. These are default settings in Solr.)

We also added fields specific to events triggered by clicking on one of the search results:

docid_s: The ID of a clicked document
offset_i: The offset of the document in the search results
clicked_b: Did the user click the link in search results to get to the page (there are other ways to get there)
title_s: The title of the document (useful for reports)

These values were sent to a distinct endpoint in our application asynchronously via JavaScript,which added a record in Solr after some value sanitization (an XSS attack in our analytics dashboard wouldbe suboptimal).

After we started getting a nice log of user interaction with our search we needed to share that with the team through a dashboard. Eventhough the structure of the “documents” and answers we wanted were simple Solr queries to construct, only a few ofus on the team spoke that language. I was going to provide an initial set of queries, but it was important that therest of the team be able to write their own. Solr has the ability to parse a basic dialect of SQL, so we planned ourdashboard around that. Apache Zeppelin fit the bill nicely and Eric Pugh wrote a nice article about using it with Spark and Solr a couple of years ago.

But for whatever reason I’ve never been able to get Zeppelin to install interpreters through Maven on Windows, whichis the installation method in the Solr docs. What I ended up doing was adding a new JDBC interpreter by pointing toa local jar as documented in Zeppelin. Even then I had to copyover all of the SolrJ client libraries into Zeppelin’s local-repo directory. Our client’s server was Linux, so thisheadache was localized to just my development machine.

Once Zeppelin was running the real fun began. If you’re an accomplished SQL writer you’re probably going to getimmediately frustrated like I did. The punishment for syntax errors is a Java stacktrace that fills your paragraph area, so experimentation is discouraged. One thing that tripped me up was that Solr query syntax creeps into unexpectedplaces. For example, to write a constriant over a field in a WHERE clause you do something like this:

SELECT typeFROM eventsWHERE type = '(NewQuery NewAdvancedQuery RefinedQuery)'

Or to constrain over a range you do this:

SELECT typeFROM eventsWHERE `date` = '[NOW/DAY-30DAY TO *]'

But it’s not all bad. If your data doesn’t require too much manipulation Zeppelin is still a quick way to builddashboards, and our events collection was certainly simple. These are the initial queries we came up with:

Top 20 Zero Results Queries Over 20 Days

SELECT query_s, COUNT(*) AS `count`FROM eventsWHERE type='ZeroResultsQuery'AND `date` = '[NOW/DAY-30DAY TO *]'GROUP BY query_sORDER BY `count` DESCLIMIT 20

Top 20 Slowest Queries Over 7 Days

SELECT query_s, querytime_iFROM eventsWHERE `date` = '[NOW/DAY-7DAY TO *]'ORDER BY querytime_i DESC LIMIT 20

Query Time Over Year

SELECT querytime_i, `date`, query_sFROM eventsWHERE `date` = '[NOW/DAY-365DAY TO *]'ORDER BY `date` DESC

Query Time Over Year

Search Types Over 30 Days

SELECT typeFROM eventsWHERE type = '(NewAdvancedQuery NewQuery RefinedQuery)'AND `date` = '[NOW/DAY-30DAY TO *]'

Search Types Over 30 Days

Average Click Rank Over 7 Days

SELECT avg(offset_i) as averageFROM eventsWHERE offset_i > 0AND `date` = '[NOW/DAY-7DAY TO *]'

As you can see we put in a mix of performance and relevance metrics to watch, and some of them even look good ascharts. I can write a set of Solr queries to get similar results. But keeping them organized innotebook-fashion, sharing them with my team, and running them all with one click are why I’ll keep coming back toZeppelin.

While you were reading this you probably had your own ideas about what to track and how to visualize, or you’redoing something completely different in your own shop. I’d love to hear about it! Reach out to me on Twitteror drop me an email.