Maximizing Relevance with Hyperparameter Optimization

February 2, 2023 Dan Worley
Category: Relevancy

Intro

Search relevance is ever evolving. Production systems have gone from “one-pass” ranking to a pipeline of rankers that re-rank using the latest and greatest techniques. Early research shows that hybrid implementations of classical ranking (TF-IDF, BM25) paired with new approaches (KNN, LTR) can deliver significant performance improvements compared to picking one or the other. The key take-away is that classical ranking is still important, and will be for the foreseeable future. So instead of ignoring it for the newest approach, why not optimize it? Better yet, why don’t we use a technique that we might have learned from working with the new hotness – hyperparameter optimization – to improve the relevance of our tried and true classical rankers?

Hyperparameter optimization

If you’ve worked in the Data Science field long you’ve probably ventured into the realm of hyperparameter optimization, also called hyperparameter tuning. This is the space where we have a big bag of parameters [dimensions] and we want to figure out what the best possible values are for each parameter for a given task [objective].

Traditionally this is used for optimizing the performance of a supervised machine learning model, like decision trees. Decision tree fitting libraries, like Python’s sklearn, offer users fine grained control of how the decision trees are “grown”, via a plethora of hyperparameters like maximum-depth and minimum-samples-per-leaf.

It doesn’t take a stretch of the imagination when reading the documentation for a query-DSL, like multi-match or eDismax, to see that they too offer a lot of knobs to fine tune matching behavior and ranking values. Just like supervised machine learning tasks, choosing the best combination of parameters is context dependent, there are no guaranteed “best” settings. So if you want to ensure you get the best performance from your classical rankers you need to be doing hyperparameter optimization there too.

Coded Example

Prerequisites

Before we get too gung-ho let’s talk prerequisites:

Python familiarity
Mild tolerance to Data Science and the associated dark arts
A tmdb index in Elasticsearch from hello-ltr
Ratings! You need to provide ratings but we have some for the example 🙂

So how does one turn relevance tuning into a hyperparameter problem? In a nutshell hyperparameter tuning needs an objective function that returns a metric when given a set of input dimensions.

Example Dimensions

With that in mind, let’s define the dimensions first. Most query DSLs provide all sorts of knobs and options to play with. We could pick any of these to be our dimension space but for this example we’ll work with tuning the field boosts on a `multi_match` query across several indexed fields.

from skopt.space import Integer

space = [
    Integer(1, 20, name='title_boost'),
    Integer(1, 20, name='overview_boost'),
    Integer(1, 20, name='tagline_boost'),
    Integer(1, 20, name='cast_boost')
]

Example Objective

With the dimensions out of the way that leaves the objective. The objective function needs to summarize the model’s performance given the set of input dimensions it receives. This will be used to select the best performing combination of dimensions.

For search relevance this would be a summary metric like recall or nDCG at a cutoff position K. Since most machine learning frameworks require minimizing the objective function we need to multiply our summary metric of choice by negative 1, to fit the framework’s expectations. In this example we’ll use mean recall@10 to summarize the relevance of a given set of input dimensions across our test set of queries.

def objective(**params):
    es_query = {
        '_source': ['_id'],
        'query': {
            'multi_match': {
                'query': 'placeholder',
                'fields': [
                    f'title^{params["title_boost"]}',
                    f'overview^{params["overview_boost"]}',
                    f'tagline^{params["tagline_boost"]}',
                    f'cast^{params["cast_boost"]}'
                ]
            }
        },
        'size': 10
    }
    return run_queries(es_query)

The objective function is populating the boost values into a query template, executing the query to retrieve hits and then summarizing the hits with recall@10 for the rating we provided.

Example Minimization

From here all we need to do is kick off the full minimization process to find the best performing input dimensions possible. Scikit provides four different algorithms for minimization:

dummy_minimize: Randomly samples the dimension space uniformly.
forest_minimize: Utilizes decision trees to sequentially optimize the dimension space
gbrt_minimize: Utilizes gradient boosted regression trees for sequential optimization
Gp_minimize: Bayesian optimization via Gaussian Process

The following snippet shows an example of calling the `forest_minimize` process:

# The seed values for the dimensions
# You can replace the values here if you're working on optimizing an existing configuration
# or if you have a gut feeling about certain parameters.
seed = [1, 1, 1, 1]

result = forest_minimize(
        func=objective,
        x0=seed,
        dimensions=space,
        verbose=True
)

print('Maximum achieved recall: ', result.fun)
print('Optimum hyper parameters: ', result.x)

After running you will be able to see the maximum achieved recall along with the parameters that produced that value. If you were able to get an improvement you can take these values back to your search configuration and test them out.

Minimizer Showdown

If you’re wondering what the best minimize option is for your problem, the answer you’ll typically receive is that it depends on the data. If you have the time and a static index, an exhaustive Grid Search may be in order. If your data is changing frequently and you need to keep tuning in line then utilizing the minimize options discussed above can be a good fit. In hello-ltr we have several sets of ratings, the following is the tuning results against the set of basic movie title judgments.

Minimizer Function	Mean recall@10 (50 iterations)	Mean recall@10 (250 iterations)
Dummy	0.6752	0.6768
Forest	0.6752	0.6768
GBRT	0.6718	0.6759
GP (Bayesian)	0.6759	0.6781

We can see all of the minimization method perform roughly the same, we have a small number of dimensions we are working with so even for the dummy method with 50 random combinations managed to find a pretty good one. If you run the companion code, you can see the “smart” minimizers do consistently output better performing combinations compared to dummy, even if they eventually get to similar performance. There is improvement in all of the minimizer functions given more iterations, but its not large, only a few decimal places. The Gaussian Process minimize performs slightly better than the others and is recommended by other resources like this talk from Shopify.

Keep in mind that with more input dimensions you will likely see more variation between minimization algorithms. But given enough iterations most intelligent minimizers (not dummy_minimize!) will converge on an optimal set of dimensions.

Takeaways

Hyperparameter tuning can be a great asset to add to your relevancy tuning toolbelt. It’s especially useful for validating & optimizing existing systems with complicated queries.
Optimize for a metric that makes sense for your search use-case
Prone to the same issues as any data science problem. Ratings need to be validated! Garbage in, garbage out!
Using scikit-optimize’s minimization algorithms can save a lot of time compared to an exhaustive grid search.
Always validate assumptions from offline relevance tuning with trailing metrics or better yet A/B tests.

Companion Code

GIST with script and input file

If you need help using hyperparameter optimization – or any other technique for improving search relevance, get in touch – or come along and meet us at one of our Haystack events!

Image from Hyperspace Vectors by Vecteezy