What’s new in Querqy, the query preprocessor for Solr (and Elasticsearch)?

March 5, 2021 René Kriegler
Category: Solr

We’ve recently released Querqy 5 for Solr – a good time to have a look at this new major version and at other exciting features that were implemented recently.

Querqy is becoming increasingly popular as a query preprocessor for Solr and Elasticsearch. In 2020 it gained a lot of momentum as part of the querqy.org project and as part of Chorus – an open source software stack for implementing ecommerce search (watch our video introducing the project).

In this blog post I want to introduce what I think are the most exciting new features that were implemented in Querqy’s version for Solr over the past half year. Some features have already been made available for Elasticsearch, others will hopefully follow shortly.

The new rewriter configuration API (Querqy 5)

Query rewriters are at the core of Querqy. You can either use one of the powerful rewriters that come included with Querqy or write your own. Probably the most popular rewriter is the Common Rules Rewriter, which lets you use query-time synonyms easily, including multi-term synonyms. Common Rules also gives you the power to boost, penalise and filter results depending on the query, for example, to bring ‘laptops for less than 500 Euros’ to the top of the results list if the query is ‘cheap notebook’:

    cheap notebook =>
      SYNONYM: laptop
      UP(100): * price:[* TO 500]
      DELETE: cheap

The configuration for those rules, and for rewriters in general, needs to live somewhere. It used to be specified in solrconfig.xml and some rewriters needed additional files in Zookeeper, like rules.txt.

With the latest Querqy version, we introduced a rewriter configuration API. Rewriters are now configured using HTTP requests and the best thing is that you don’t have to reload your collection or Solr core to activate changes. A simple ‘save’ HTTP request is all you need:

curl -H "Content-Type: application/json" \
     -X POST 'http://solrhost:8983/solr/collection/querqy/rewriter/commonrules1?action=save' \
--data-raw '{
    "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
    "config": {
        "rules" : "notebook =>\nSYNONYM: laptop"
    }
}'
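
The same request can be put together from a script, which is handy when rules are generated or deployed automatically. The following is a minimal Python sketch, not part of Querqy: `build_save_request` is a hypothetical helper, the host, collection and rewriter ID are the placeholders from the curl example, and the request is only built, not sent.

```python
import json
import urllib.request


def build_save_request(base_url, collection, rewriter_id, rules):
    """Build (but do not send) a 'save' request for a Common Rules rewriter."""
    url = f"{base_url}/solr/{collection}/querqy/rewriter/{rewriter_id}?action=save"
    body = {
        "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
        "config": {"rules": rules},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # the body is JSON
        method="POST",
    )


request = build_save_request(
    "http://solrhost:8983",  # placeholder host, as in the curl example
    "collection",
    "commonrules1",
    "notebook =>\nSYNONYM: laptop",
)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

Keeping the rules as a plain string means they can be read from a file or produced by a rules management tool before being posted.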

As a nice side-effect, you now specify the chain of rewriters per request instead of using a static rewrite chain definition in solrconfig.xml. This is much more flexible and should facilitate experimentation. Overall, the new Solr rewriter configuration looks very similar to the Elasticsearch version. This should speed up porting rewriters from one search engine to the other.
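
To illustrate the per-request chain: to the best of my knowledge, the chain is passed as a comma-separated list of rewriter IDs in the `querqy.rewriters` request parameter, together with `defType=querqy`. Please check the Querqy documentation for the exact parameter names. The Python sketch below only builds the query URL. The helper `build_select_url`, the rewriter ID `replace1` and the field names in `qf` are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, collection, user_query, rewriter_ids):
    """Build a Solr select URL that names the rewriter chain for this request."""
    params = {
        "q": user_query,
        "defType": "querqy",  # assumed: selects the Querqy query parser
        "querqy.rewriters": ",".join(rewriter_ids),  # assumed parameter name
        "qf": "name description",  # hypothetical query fields
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://solrhost:8983",
    "collection",
    "cheap notebook",
    ["replace1", "commonrules1"],  # rewriters are applied in this order
)
```

Because the chain is just a request parameter, an A/B test can send different rewriter lists for different user groups without touching the Solr configuration.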

Implementing a new rewriter management method required us to make changes to the very core of Querqy’s version for Solr and we couldn’t avoid making breaking changes. Given the impact of the change, we decided to increase the major version number: Querqy 5 for Solr.

If you are already a Querqy user, you will probably find the migration guide useful. We hope we have made the transition to the new rewriter API as smooth as possible.

Common Rules Rewriter

Three recent additions to the Common Rules Rewriter will make creating rules even more powerful and convenient.

Boolean input matching

The first new feature lets you formulate more complex criteria for deciding which queries should trigger a rule. The new powers come from the introduction of Boolean expressions for input matching (available in Querqy 5).

Let’s have a look at the following rules in Querqy 4:

smartphone =>
 FILTER: * category:smartphones

smartphone case =>
 FILTER: * category:"smartphone cases"

The intent is to filter the results for smartphone (and not smartphone cases) in the first rule and for smartphone cases (and not smartphones) in the second rule. The problem is that the first rule would also be triggered if the query was ‘smartphone case’ as the ‘smartphone’ input definition matches this query. While Querqy 4 had the means to express exact query matching (using quotes, like in "smartphone" =>), the approach was clearly limited.

Querqy 5 lets you turn on Boolean expressions for input matching, which is a much more powerful solution:

smartphone AND NOT case =>
  FILTER: * category:smartphones

smartphone AND (case OR cover) =>
  FILTER: * category:"smartphone cases"

Templates

The second new feature in the Common Rules Rewriter makes writing rules more convenient and robust. As rules in this rewriter are always unidirectional, specifying a bidirectional synonym used to require writing two rules:

notebook =>
  SYNONYM: laptop

laptop => 
  SYNONYM: notebook

Starting with Querqy 4.11 for Solr, it is possible to define and use templates for recurrent rule formulation patterns:

def bidi_synonym(syn1, syn2):
  $syn1 =>
    SYNONYM: $syn2
  $syn2 => 
    SYNONYM: $syn1

<< bidi_synonym: syn1=laptop | syn2=notebook >>
<< bidi_synonym: syn1=t-shirt | syn2=tee >>

We define a template using the def keyword. We can pass parameters to it and reference them in the output. The case of bidirectional synonyms can be solved by a template that produces rules for both directions.
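
Conceptually, a template call is a text substitution that produces ordinary rules. The Python sketch below mimics this with `string.Template` to show what the `bidi_synonym` call expands to. It only illustrates the idea and is not Querqy's template engine. That Python happens to use the same `$name` placeholder syntax is a coincidence.

```python
from string import Template

# Mirrors the body of the bidi_synonym template defined above.
BIDI_SYNONYM = Template(
    "$syn1 =>\n"
    "  SYNONYM: $syn2\n"
    "$syn2 =>\n"
    "  SYNONYM: $syn1\n"
)


def expand(template, **params):
    """Substitute the parameters into the template and return the rules text."""
    return template.substitute(**params)


rules = expand(BIDI_SYNONYM, syn1="laptop", syn2="notebook")
print(rules)
```

The printed result is the pair of unidirectional rules that previously had to be written by hand.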

This also opens the door to defining more complex function queries once and then just ‘calling’ them for concrete inputs:

# boost a product after checking its availability
def boost_if_available(id, factor):
  UP: if(and(query({!term f=id v=$id}),query({!term f=product_is_available v=true})),$factor,0)

# boost two different products for query 'lipstick' depending on availability
lipstick =>
  << boost_if_available: id=id123 | factor=200>>
  << boost_if_available: id=id456 | factor=10>>

# boost a specific Samsung product if the query contains 'samsung',
# using the same rule template as for query 'lipstick'
samsung =>
  << boost_if_available: id=id890 | factor=200>>

Synonym weights

Synonym rules have been a core feature of Querqy’s Common Rules Rewriter since day one, especially as this rewriter helped overcome issues that the built-in synonym solutions of Solr and Elasticsearch had (and partly still have) with multi-term input and output and with scoring.

The notion of SYNONYM rules in Querqy has always been technical rather than semantic. Specifying a rule X => SYNONYM: Y means ‘expand input X with Y’ – as compared to a strictly semantic approach where we would say Y means (nearly) the same as X.

Given this broader approach, we sometimes want to weight synonyms. For example, if the user searches for ‘laptop’, we would also want to bring up matches for ‘macbook’ (which, strictly speaking, is a hyponym – a narrower concept). We assume that users who search for ‘laptop’ will be a bit less interested in MacBooks than in other laptops. While earlier Querqy versions offered a workaround for synonym weighting using additional UP/DOWN rules, Querqy 4.11 introduced synonym weights:

laptop =>
  SYNONYM: notebook
  SYNONYM(0.1): macbook

Specifying a weight is probably more convenient to maintain than using additional boosting rules. It is also more accurate when it comes to scoring. In the example, macbook gets a lower score than notebook as it has a lower weight, 0.1, compared to the default weight of 1.0.
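
The effect of the weight can be pictured with a deliberately simplified model in Python. It assumes that a synonym's weight simply multiplies the score of a match and that the best-matching expansion wins. Real scores depend on Solr's similarity function and on the query structure that Querqy builds, so take this as an illustration of the ordering only. All names are made up.

```python
def expand_query_term(term, weighted_synonyms):
    """Return (term, weight) pairs: the input term at 1.0 plus its synonyms."""
    return [(term, 1.0)] + list(weighted_synonyms)


def score(doc_terms, expansions, base_score=1.0):
    """Score a document by its best weighted match (toy model, not Lucene)."""
    return max(
        (weight * base_score for term, weight in expansions if term in doc_terms),
        default=0.0,
    )


# laptop => SYNONYM: notebook / SYNONYM(0.1): macbook
expansions = expand_query_term("laptop", [("notebook", 1.0), ("macbook", 0.1)])

notebook_score = score({"notebook", "15", "inch"}, expansions)
macbook_score = score({"apple", "macbook"}, expansions)
```

In this model the notebook document scores ten times higher than the MacBook document, while both are still retrieved.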

Integration tests

Many tests in Querqy for Solr have always been implemented as larger unit tests. This means that the scope of these tests is not just a class or a method but a small feature or behaviour, which would be tested with the help of the Solr testing framework.

In Querqy 5, we’ve added Docker-based integration tests that verify that different versions of Querqy work as expected with different Solr and JDK versions. The integration is validated by applying query rewriting to sample queries and searching a test index. We are very thankful to Torsten Köster, who came up with the concept and the implementation of these tests. If you want to learn more about the approach, you can read his blog post here.

It is a bit risky to show the new, live integration test badge in a blog post – what if it fails? – but at least we would know! Here it is:

[Image: live integration test badge]

Query rewriting outside Solr

Given that the core query rewriting components in Querqy don’t depend on Solr, Elasticsearch or even Lucene, the idea of rewriting queries before sending them to the search engine had been sitting at the back of our minds for years. This would allow us to use Querqy with any search engine API. It would also open up Querqy to large pre-trained models and frequently changing databases in query rewriting, components that would be impractical to make available inside Solr. On the other hand, rewriters that depend on information from the Lucene index, like Querqy’s built-in WordBreakCompound rewriter, might better live within Solr.

In Querqy 4.11, we’ve started an experimental package (querqy-core:querqy.rewrite.experimental) that provides external rewriting in combination with Solr: query rewriting starts outside Solr, then the Querqy query object is serialised and sent to Solr as a JSON object. Inside Solr, Querqy deserialises the query object, optionally continues query rewriting, and finally creates the Lucene query to be executed:

[Figure: External query rewriting with Querqy]

While we’ve only experimented with using Java for external query rewriting so far, using JSON for serialisation opens the door for using other technologies for rewriting. Let us know if you have a crazy idea for query intent prediction or entity detection in queries in mind, and it only works with Python!

Word break/compound rewriter

The word break/compound rewriter has been around for almost a year. Over the summer, we redesigned it and added the ability to handle language-specific compounding morphology. For example, the German word arbeit (work, labour) takes the form arbeit + -s in compounds, like in arbeitserlaubnis = arbeit + erlaubnis (work permit).

While Querqy applies co-occurrence checks for word parts, there still is a small likelihood that breaking up a word produces a nonsensical split. For example, in one project, the German word blumen (flowers) got split into blue men as there were also documents about the Blue Men Show. With the latest Querqy release, it is possible to configure a list of protected words. Words on this list will never be split.
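
The two ideas, linking morphemes and protected words, can be sketched in a few lines of Python. This is a toy decompounder, not Querqy's implementation. It only checks that both parts occur in a small vocabulary, does not model the co-occurrence check, and the vocabulary (including the entry ‘blu’, added to provoke a bad split) is invented for the example.

```python
def split_compound(word, vocabulary, protected, linking=("", "s")):
    """Return (modifier, head) if the word splits into two known words, else None."""
    if word in protected:
        return None  # protected words are never split
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if right not in vocabulary:
            continue
        for link in linking:
            # Strip the linking morpheme (e.g. German -s-) from the left part.
            modifier = left[: len(left) - len(link)] if link else left
            if left.endswith(link) and modifier in vocabulary:
                return modifier, right
    return None


vocabulary = {"arbeit", "erlaubnis", "blu", "men"}  # toy index vocabulary

print(split_compound("arbeitserlaubnis", vocabulary, protected=set()))
print(split_compound("blumen", vocabulary, protected=set()))
print(split_compound("blumen", vocabulary, protected={"blumen"}))
```

Without the protected list, ‘blumen’ is split into two unrelated words. With ‘blumen’ on the list, the function returns no split and the word is left intact.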

What’s next?

As Querqy developers we have a wealth of new feature ideas sitting at the back of our minds. Making it easier to write custom rewriters will probably be a major task over the next few months. We also recognise that Querqy users will take time to adopt the new features and migrate their applications to the new rewriter configuration. Their feedback will be part of what guides Querqy’s path of development and will help us further improve this query preprocessor for Solr and Elasticsearch.

Last but not least

I’d like to use the opportunity and send a big ‘Thank you!’ to the Querqy contributors whose fantastic work has moved Querqy a great step forward: Johannes Peter, Matthias Krüger, Tobias Kässmann, Torsten Köster – let’s keep the momentum going!

Do let us know if you’d like help with a Solr or Elasticsearch project using Querqy.

Image from Hand Vectors by Vecteezy