Filtering results by query patterns with Regular Expressions and Querqy

Sometimes people use a shorthand syntax that correlates very strongly with a specific set of products. For example, on a home improvement website, queries like 2x4x8 are a very strong signal that someone is looking for lumber/timber of particular dimensions. On our Chorus Electronics site, as featured in the Meet Pete blog series, queries that include a pattern such as 16:9, 16:10 or 4:3 indicate someone interested in screen protectors for laptops. The numbers refer to the display aspect ratio of a laptop screen. In this blog we’ll learn how to use regular expressions (regexes) to capture these query patterns and narrow the results, in this case to only screen protector products made by the brand Kensington. Of course, in a perfect world we would support the decision to add this rule with analytics data from our click logs or by looking at exit rates.

Querqy is a query pre-processor, allowing you to easily capture business rules when processing search queries for Solr and Elasticsearch. We’re going to take advantage of how extendable Querqy is by using a custom rule that supports matching Regular Expression patterns, and then applying a traditional FILTER rule. The querqy.regex.solr.RegexFilterRewriterFactory is currently not supported in SMUI (the easy-to-use front end for Querqy) so we’ll need to work directly with the Querqy APIs in Solr, but don’t worry, it’s pretty simple. This is a relatively new piece of code for Querqy, so it has limitations we’ll document at the end. We will update this blog as it evolves towards a 1.0 version.

If you’d like to try this example code out yourself you can install our pretend electronics store, Chorus Electronics, using this guide – if not you might want to skip to the end of this blog and watch the video where I demonstrate the feature.

Install the Querqy regex filter

We’ll start by confirming that the Java JAR file containing this code is in our Solr setup, in the solr/lib directory. You should have a file named querqy-regex-filter-1.1.0-SNAPSHOT.jar. If you don’t, go to the GitHub project at https://github.com/renekrie/querqy-regex-filter, download the source and build the JAR with mvn package. Restart Chorus and you’ll be ready to continue. If you ran the quickstart.sh script, the library is already in your Solr Docker containers in /opt/querqy/lib/ and is loaded via a lib directive in solrconfig.xml.

Identify the aspect ratio query pattern

The next step is to think about the query pattern you want to identify, and how you will tweak the rule.

In our case, we know that any query that looks like a number, followed by a colon (:), followed by a number is a strong indicator of an aspect ratio. For any query with an aspect ratio pattern, we want to filter to the brand Kensington, as they make screen protectors where this makes sense (of course, in the real world there are many other items with an aspect ratio). Our rule definition will look like this:

{
   "class":"querqy.regex.solr.RegexFilterRewriterFactory",
   "config":{
     "regex":"\\d+:\\d+",
     "filter" : "* filter_brand:Kensington"
   }
}

You can see that the regex pattern is \d+:\d+, with the backslashes escaped because the rule is written in JSON. The filter setting then restricts the results to the brand Kensington. You can try out the regex pattern using a site like https://www.regextester.com and the string 16:9. The original query is passed through unchanged: a query of 16:9 is still searched as 16:9, and if you query for a model number together with an aspect ratio, K58357WW 16:9, both terms are passed through.
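As a quick sanity check of the pattern outside Solr, the following Python sketch shows which queries contain an aspect-ratio-like token and how the backslashes get doubled when the pattern is serialised to JSON. It uses Python's re module rather than the Java regex engine that Querqy runs on, so treat it as an approximation. The helper name looks_like_aspect_ratio and the sample queries are made up for illustration, and the sketch assumes the pattern may match anywhere in the query, as the K58357WW 16:9 example suggests.

```python
import json
import re

# The pattern from the rule definition: digits, a colon, digits.
# re.ASCII restricts \d to 0-9, which mirrors Java's default behaviour.
ASPECT_RATIO = re.compile(r"\d+:\d+", re.ASCII)


def looks_like_aspect_ratio(query: str) -> bool:
    """Return True if the query contains an aspect-ratio-like token anywhere."""
    return ASPECT_RATIO.search(query) is not None


# Hypothetical sample queries for illustration.
for query in ["16:9", "K58357WW 16:9", "2x4x8", "laptop sleeve", "12:30"]:
    print(f"{query!r}: {looks_like_aspect_ratio(query)}")

# Serialising the pattern to JSON doubles each backslash, which is why the
# rule definition contains \\d+:\\d+ rather than \d+:\d+.
print(json.dumps({"regex": ASPECT_RATIO.pattern}))
```

Note that the pattern also matches other digit-colon-digit strings, such as a time like 12:30, which is one more reason to check your query logs before deploying a rule like this.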

Once you have decided on the pattern, register it with Solr:

curl --user solr:SolrRocks -X POST 'http://localhost:8983/solr/ecommerce/querqy/rewriter/regex_screen_protectors?action=save' -d '
{
  "class": "querqy.regex.solr.RegexFilterRewriterFactory",
  "config": {
    "regex":"\\d+:\\d+",
    "filter" : "* filter_brand:Kensington"
  }
}
'

Notice that the end point names the rewriter regex_screen_protectors and is used in conjunction with the action=save parameter. You can confirm the change via:

curl --user solr:SolrRocks -X GET http://localhost:8983/solr/ecommerce/querqy/rewriter/regex_screen_protectors

You may need to reload the collection as well:

curl --user solr:SolrRocks -X POST http://localhost:8983/api/collections/ecommerce -H 'Content-Type: application/json' -d '
  {
    "reload": {}
  }
'

Test the Querqy rewriter

Now we can test this rewriter by issuing a query with the querqy.rewriters=regex_screen_protectors parameter.

curl --user solr:SolrRocks -X GET 'http://localhost:8983/solr/ecommerce/select?q=16:9&defType=querqy&querqy.rewriters=regex_screen_protectors&fl=title,brand,attr_t_aspect_ratio'

We get back 14 different Kensington screen protectors. Now, let’s pass in some extra query information, like the model number K58357WW.

curl --user solr:SolrRocks -X GET 'http://localhost:8983/solr/ecommerce/select?q=16:9%20K58357WW&defType=querqy&querqy.rewriters=regex_screen_protectors&fl=title,brand,attr_t_aspect_ratio'

Boom! The same 14 products match, but the most relevant one is now at the top of the hit list:

{"response":{"numFound":14,"start":0,"maxScore":8.232563,"numFoundExact":true,"docs":[
    {
      "title":"Kensington K58357WW display privacy filters Frameless display privacy filter 61 cm (24\")",
      "brand":"Kensington",
      "attr_t_aspect_ratio":"16:9"}]
}}
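When building these select requests from a script instead of typing them by hand, it helps to let a library handle the URL encoding of the space and the colon in the query. Here is a minimal Python sketch, assuming the same host, collection and rewriter name as in the curl examples. build_select_url is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed to match the local Chorus setup used in the curl examples.
SOLR_SELECT = "http://localhost:8983/solr/ecommerce/select"


def build_select_url(query: str, rewriter: str = "regex_screen_protectors") -> str:
    """Build a Querqy-enabled select URL with properly encoded parameters."""
    params = {
        "q": query,
        "defType": "querqy",
        "querqy.rewriters": rewriter,
        "fl": "title,brand,attr_t_aspect_ratio",
    }
    # urlencode percent-encodes the colon and turns the space into "+".
    return f"{SOLR_SELECT}?{urlencode(params)}"


print(build_select_url("16:9 K58357WW"))
```

The resulting URL can be passed to curl (with the same --user credentials) or to any HTTP client.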

Test the query pattern on Chorus Electronics

We can also test this out in our demo Chorus Electronics web store. In the Querqy request handler defined in solrconfig.xml, we’ve already added regex_screen_protectors to the list of querqy.rewriters: <str name="querqy.rewriters">replace,common_rules,regex_screen_protectors</str>. In the <queryParser name="querqy"> section, we’ve also added the mapping that outputs the log data when the rule runs:

<lst name="mapping">
  <str name="rewriter">regex_screen_protectors</str>
  <str name="sink">responseSink</str>
</lst>

If you have Chorus up and running, go to http://localhost:4000/catalog?q=16:9&search_field=default&view=gallery and you’ll see thousands of irrelevant results: anything with a 16 or a 9 in the text. Now flip to the Querqy-enabled search, run the query again, and look at all those lovely Kensington screen protectors!

Here’s the whole process on video:

Limitations of the Regex Rewriter

  1. Today it only supports appending a FILTER; you can’t do a BOOST or any other manipulation.
  2. It’s not incorporated in SMUI, the search management UI that acts as a front end to Querqy. There is some discussion about adding it as a Common Rule if lots of folks find it useful, which would be a good step towards adding it to SMUI.
  3. You only get one regex per named end point, so you’ll add a new step to the rewriting chain for every regex pattern you add!
  4. The logging output tells you the pattern that was matched, but doesn’t tie it back to the rule that matched.
  5. A common need is to identify a regex pattern and then apply the matched value to a specific field. For example, if you search for what appears to be an aspect ratio, wouldn’t it be nice to filter to just the products that have that value in the attr_t_aspect_ratio field? Something like:
{
  "class": "querqy.regex.solr.RegexFilterRewriterFactory",
  "config": {
    "regex":"\\d+:\\d+",
    "filter" : "* attr_t_aspect_ratio:*"
  }
}

Today, this pattern of passing the value into the field doesn’t work. Instead it just filters to ANY product that has a value specified in the attr_t_aspect_ratio field. This may be good enough for you of course.
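Until the rewriter can pass the matched value into the filter, one possible workaround is to extract the aspect ratio in application code and send it as an ordinary Solr fq parameter alongside the query. The sketch below is an assumption about how that might look and is not part of Querqy: aspect_ratio_filter is a hypothetical helper, and whether a phrase match on attr_t_aspect_ratio behaves as intended depends on how that field is analysed in your schema.

```python
import re
from typing import Optional

# Same pattern as the rewriter rule; re.ASCII keeps \d limited to 0-9.
ASPECT_RATIO = re.compile(r"\d+:\d+", re.ASCII)


def aspect_ratio_filter(query: str) -> Optional[str]:
    """Return a Solr filter query for the first aspect ratio found, or None."""
    match = ASPECT_RATIO.search(query)
    if match is None:
        return None
    # Quote the value so the colon inside it is not parsed as a field separator.
    return f'attr_t_aspect_ratio:"{match.group(0)}"'


print(aspect_ratio_filter("K58357WW 16:9"))
print(aspect_ratio_filter("usb cable"))
```

The returned string would be added to the request as fq only when it is not None, so queries without an aspect ratio are left untouched.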

Conclusions

This technique is a powerful way to filter out irrelevant items, but you should be sure that the query pattern you have identified really does indicate that the user is only interested in a subset of products. If you inadvertently filter out items the user might buy, this could have a negative effect on revenue! A detailed analysis of your search query logs should give you confidence that this pattern does indicate a particular query intent, and of course you should continue to monitor the impact of this rule using analytics.

Querqy is part of Chorus, a joint initiative by Eric Pugh, Johannes Peter, Paul M. Bartusch and René Kriegler.

If you need help using tools like this on your e-commerce site, get in touch.

Image from Filter Vectors by Vecteezy