
How You Can Use Hello LTR – The Learning to Rank Sandbox

Learning to Rank just seems hard. Applying Machine Learning to relevance in Solr or Elasticsearch seems non-trivial, and it seems to require a lot of crufty code and plumbing. With Hello LTR we have put together a set of code and notebooks that attempt to simplify the process. You should be able to ‘bring your own data’ to the notebooks and experiment with Learning to Rank rather easily yourself! Let’s walk through the process with one of the most exciting data sets we could think of – OpenSource Connections’ own blog search!

What is Hello LTR?

Hello LTR is a set of iPython notebooks that demonstrate Learning to Rank in Solr or Elasticsearch. Hello LTR also includes a client library for performing the typical Learning to Rank workflow, which is nearly identical between the two search engines. Specifically, the workflow is to:

  1. Configure and index a corpus, using a specific index configuration
  2. Load Learning to Rank features into the search engine
  3. Given a judgment list, extract each feature’s value for each query-document pair, and output a training set
  4. Train an offline LTR model using RankyMcRankFace, reporting which features worked well and which didn’t
  5. Upload that model to the search engine
  6. Issue searches using the model

We’re continuing to work to streamline this process in the Hello LTR notebooks, so you can mix in any data and judgments into a sandbox for experimenting with Learning to Rank and relevance for your data set. Certainly, as you go through this, we would welcome your feedback. We know there’s still a lot of work to do, and we welcome community members to help improve and democratize this stuff!

A tour through OSC’s Blog Search via Hello LTR

Let’s poke around this notebook, which sets up a Learning to Rank task around OpenSource Connections’ blog. This notebook captures much of the search functionality in our blog and is a fairly approachable example with a simple corpus and set of judgments. If you want to see other examples, most of the remaining notebooks focus on the TheMovieDB corpus.

Launching Solr or Elasticsearch

Under the docker folder in hello-ltr is a folder for Elasticsearch and one for Solr. A shell script (launch_es.sh or launch_solr.sh) builds the Docker containers and starts the search engine. With Docker installed on your system, simply launch one of these from the command line, and you should see the corresponding system spin up:

$ ./launch_es.sh
Sending build context to Docker daemon  6.656kB
Step 1/4 : FROM docker.elastic.co/elasticsearch/elasticsearch:6.4.1
 ---> 86c6e080644a
Step 2/4 : RUN bin/elasticsearch-plugin install -b http://es-learn-to-rank.labs.o19s.com/ltr-1.1.0-es6.4.1.zip
 ---> Using cache
 ---> 2a2c88b6298c
...

Launching and Setting Up Notebooks and Datasets

With all the hello-ltr dependencies installed, you should be ready to launch Jupyter:

$ jupyter notebook

Jupyter with a selection of notebooks should be available for you to use. We’ll walk through the OSC blog Jupyter notebook, which you can hack up with your dataset as needed. So click on osc-blog.ipynb in the listing.


The first cell simply downloads the datasets we’ll use, including a dump of OSC’s blog (blog.jsonl) and a judgment list (osc_judgments.txt) along with the other datasets.

from ltr import download

download()

The corpus and the judgments are the primary input into the Learning to Rank process. The rest is search engine and Learning to Rank configuration and experimentation.

If you want to use your own corpus and judgments, simply place them in the data/ directory manually. You’ll see how those files are used and how they’re formatted in a bit, starting with the corpus itself, which we load in the next cell:

import json

articles = []
with open('data/blog.jsonl') as f:
    for line in f:
        blog = json.loads(line)
        articles.append(blog)

len(articles)

Here we just load the corpus (a jsonl file – a text file where each line is a JSON object). We create a list of objects, where each object is a blog post, such as:

{'title': "Lets Stop Saying 'Cognitive Search'",
 'url': 'https://opensourceconnections.com/blog/2019/05/28/lets-stop-saying-cognitive-search/',
 'author': 'doug-turnbull',
 'content': ' I consume a lot of search materials: blogs, webinars, papers, and marketing collateral …',
 'excerpt': ' We won’t address machine learning illiteracy in search if we can’t go beyond buzzwords. We need to teach concrete, specific techniques. ',
 'post_date': '2019-05-28T00:00:00-0400',
 'categories': ['blog', 'Relevancy', 'Learning-to-rank'],
 'id': 2883614620}
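If you’re bringing your own documents, you just need to produce a similar jsonl file in data/. Here’s a minimal, hypothetical sketch (the field names besides id are up to you, as long as your index configuration and features refer to the same fields):

import json

# Hypothetical documents -- replace with however you load your own content
my_docs = [
    {'id': 1, 'title': 'My first post', 'content': 'Hello world', 'post_date': '2020-01-01T00:00:00-0000'},
    {'id': 2, 'title': 'My second post', 'content': 'More text here', 'post_date': '2020-02-01T00:00:00-0000'},
]

# One JSON object per line, the same shape as blog.jsonl above
with open('data/blog.jsonl', 'w') as f:
    for doc in my_docs:
        f.write(json.dumps(doc) + '\n')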

Configure Search Engine and Index the Corpus

First we create a client object that fulfills the Learning to Rank interface for a specific search engine; here we will use Elasticsearch:

from ltr.client import ElasticClient

client = ElasticClient()

The notebooks would be nearly identical for Solr or Elasticsearch (you can see various examples in hello-ltr using both search engines). The only differences are where we configure the search engine or where search-engine-specific syntax is needed to create Learning to Rank features.

Next we rebuild the index using the corpus:

from ltr.index import rebuild

rebuild(client, index='blog', doc_type='post', doc_src=articles)

Here rebuild deletes the provided index and then rebuilds it using a local config file.

Rebuild looks for configuration with that index’s name under the docker/elasticsearch or docker/solr directories. For Elasticsearch, this means docker/elasticsearch/<index>_settings.json, which stores the typical JSON body you would use when creating a new Elasticsearch index. For Solr, that means a Solr config set folder named after the index, with the expected folder structure containing your solrconfig.xml, schema, etc. If you browse to docker/elasticsearch you’ll notice blog_settings.json already provided in the Hello LTR codebase.

For your own dataset, you can copy the existing configuration or place a new index configuration with your index’s name in the Solr or Elasticsearch folder. If you want to modify your Elasticsearch or Solr configuration to tweak analysis or query settings, you would need to rebuild the containers and index via the notebook by repeating the command above.
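To give a feel for what such a configuration holds, here’s a rough sketch of the kind of JSON body an Elasticsearch settings file for the blog index might contain, written as a Python dict in the same style as the featureset config below (the actual blog_settings.json in the repo differs; these mappings are only illustrative):

blog_settings = {
    "settings": {
        "number_of_shards": 1
    },
    "mappings": {
        "post": {                        # matches doc_type='post' used above
            "properties": {
                "title":      {"type": "text", "analyzer": "english"},
                "content":    {"type": "text", "analyzer": "english"},
                "excerpt":    {"type": "text", "analyzer": "english"},
                "author":     {"type": "keyword"},
                "categories": {"type": "keyword"},
                "post_date":  {"type": "date"}
            }
        }
    }
}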

Finally, after configuration, the documents are indexed to Solr or Elasticsearch. Hello LTR expects there to be a field called id on each document. It uses id as the document’s primary key (_id in Elasticsearch). As you would expect, each document’s fields need to be appropriately configured in the corresponding Solr or Elasticsearch configuration.
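A quick sanity check before indexing your own data can save a confusing failure later; here’s a small, hypothetical check that every document carries the id field Hello LTR expects:

# Hello LTR uses 'id' as the primary key (_id in Elasticsearch), so make sure every doc has one
missing = [i for i, doc in enumerate(articles) if 'id' not in doc]
assert not missing, "Documents at positions %s are missing an 'id' field" % missing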

Configuring Learning to Rank Features

As you may know, features in learning to rank are templated Solr or Elasticsearch queries. Of course this includes the traditional TF*IDF-based scoring (BM25, etc.) on fields you’ve crafted via analysis. It also includes numerical features like dates, combined in arbitrary formulations with text-based features, and, more recently, vector-based features that capture an embedding-based similarity from an external enrichment system built with word2vec, BERT, or whatever the latest hotness is.

For the blog, there’s a handful of relatively simple features to explore. Here are two. One is a constant score query, which returns a 1 or 0 depending on whether the term matches the title. The second is the BM25 score of the keywords in the content field. Other features search with phrases, create a stepwise function around the age of the post (the intuition being that newer posts are more relevant), etc.

config = {
    "featureset": {
        "features": [
            {
                "name": "title_term_match",
                "params": ["keywords"],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match": {
                                "title": "{{keywords}}"
                            }
                        },
                        "boost": 1.0
                    }
                }
            },
            {
                "name": "content_bm25",
                "params": ["keywords"],
                "template": {
                    "match": {
                        "content": {
                            "query": "{{keywords}}"
                        }
                    }
                }
            },
            ...

These features are loaded, configuring a feature set in the search engine with the name ‘test’:

from ltr import setup

setup(client, config=config, index='blog', featureset='test')

For your problem, if you think a specific set of features would be a good selection, you can express them here and configure them into a featureset with its own name.

Logging Features to Create a Training Set

Logging feature values is one of the most complex engineering aspects of building a Learning to Rank system. In Hello LTR, we have a function which takes a judgment list as input and outputs a training set with every feature’s value for each judgment.

A judgment list grades how good a document is for a query. Some documents will be very relevant for a query. Others we mark as not relevant for a query.

The judgment list is expressed in a ‘stub’ RankSVM file format. This file format, common to learning to rank tasks, tracks the grade in the first column. In our example, we use the convention of 0 meaning completely irrelevant and 4 meaning perfectly relevant for the query. The second column is a unique identifier for the query, prefixed with qid. A comment with the document identifier follows. For example, here’s a snippet from the blog judgment list for one of the queries:

# qid:1: solr
4    qid:1     # 4036602523    solr
3    qid:1     # 2916170833    solr
2    qid:1     # 3986527021    solr
2    qid:1     # 3440818580    solr
0    qid:1     # 2926247873    solr
0    qid:1     # 3065216762    solr
0    qid:1     # 14036114    solr
0    qid:1     # 1765487539    solr

In the file header, query id 1 is associated with the keyword solr (# qid:1: solr). Farther down the list, a series of documents are graded for qid:1 (solr). The document with id 4036602523 is assigned a grade of 4 (perfectly relevant) for Solr on the line 4 qid:1 # 4036602523 solr. Document 14036114 is assigned a grade of 0 (very irrelevant) for Solr on the line 0 qid:1 # 14036114 solr.
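To make the format concrete, here’s a rough sketch of how one of these lines could be parsed (purely illustrative; Hello LTR does its own parsing):

# Parse one 'stub' RankSVM judgment line, e.g. "4    qid:1     # 4036602523    solr"
def parse_judgment_line(line):
    graded, _, comment = line.partition('#')
    grade, qid = graded.split()
    doc_id, keywords = comment.split(None, 1)
    return {
        'grade': int(grade),                  # 0 (irrelevant) .. 4 (perfectly relevant)
        'qid': int(qid.replace('qid:', '')),  # query identifier
        'doc_id': doc_id,                     # the graded document
        'keywords': keywords.strip()          # keywords for the query, from the comment
    }

parse_judgment_line("4    qid:1     # 4036602523    solr")
# -> {'grade': 4, 'qid': 1, 'doc_id': '4036602523', 'keywords': 'solr'}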

The task of logging is to take the keywords for the query id (in this case solr) and compute the value of each feature for that graded document. This way model training can learn a good ranking function that maximizes the likelihood that relevant documents will be returned towards the top for a query. We want to provide features that will help our model decide when a document is relevant or irrelevant.

One feature from above is title_term_match, which is a 1 or 0 based on whether the search term occurs in the title. Perhaps title_term_match is 1 for the line 4 qid:1 # 4036602523 solr and 0 for the irrelevant document 0 qid:1 # 1765487539 solr. We repeat this process for every feature in our feature set, for every line for the ‘solr’ query.

Ideally, what we’d like to build is a training set that looks like:

4    qid:1     title_term_match:1      content_bm25:12.5    ...   # 4036602523
0    qid:1     title_term_match:0    content_bm25:0    # 1765487539

The RankSVM format expects features to be identified by ordinals, starting with feature id 1. So we need to transform these feature names into 1-based positions in the original feature list. Hello LTR does that for you and does the bookkeeping to keep them straight!

4    qid:1     1:1      2:12.5    ...   # 4036602523
0    qid:1     1:0    2:0    # 1765487539

There are a lot of considerations for logging feature scores in a live production system. Hello LTR focuses on the ‘sandbox’ use case: you have a corpus, a judgment list on that corpus, and you’d like to experiment with Learning to Rank features. In this case, we simply batch up every doc id for each query, send them to the search engine, and ask for logged feature values. See the documentation linked above for how this happens.
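For the curious, with the Elasticsearch LTR plugin that logging request is roughly a filter over the graded doc ids plus an sltr query, with an ltr_log extension asking the plugin to report each feature’s value per hit. Here’s a sketch of the general shape (not necessarily the exact request Hello LTR builds):

logging_query = {
    "query": {
        "bool": {
            "filter": [
                # restrict to the graded documents for this query
                {"terms": {"_id": ["4036602523", "2916170833", "14036114"]}},
                # score them against the featureset without affecting ranking
                {"sltr": {
                    "_name": "logged_featureset",
                    "featureset": "test",
                    "params": {"keywords": "solr"}
                }}
            ]
        }
    },
    # ask the LTR plugin to attach every feature's value to each hit
    "ext": {
        "ltr_log": {
            "log_specs": {"name": "log_entry", "named_query": "logged_featureset"}
        }
    }
}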

Anyway, all the bookkeeping and plumbing you just learned about happens in just a few lines in Hello LTR. With the input file osc_judgments.txt, features from our featureSet test will be logged out and written to osc_judgments_train.txt in the right file format:

from ltr.log import judgments_to_training_set

trainingSet = judgments_to_training_set(client,
                                        index='blog',
                                        judgmentInFile='data/osc_judgments.txt',
                                        trainingOutFile='data/osc_judgments_train.txt',
                                        featureSet='test')

Train a Learning to Rank Model

With a training set prepared, you can now go through the process of training a model. Here we’re simply performing the training process and loading a model into the search engine for us to experiment with in the code below:

from ltr.train import train

trainLog = train(client,
                 trainingInFile='data/osc_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='test',
                 index='blog',
                 modelName='test')

print("Train NDCG@10 %s" % trainLog.rounds[-1])

The train function invokes RankyMcRankFace via the command line and provides an opportunity to pass along a variety of parameters to RankyMcRankFace. The method returns a log (trainLog), parsed out of Ranky’s command line output, with a lot of good information on the model training process.

A couple things to note about train:

  • trainingInFile is the training set you prepared in the logging step with the funky format 4 qid:1 1:1 2:12.5 ... # 4036602523 etc
  • A variety of LambdaMART hyperparameters are exposed (the number of trees and the number of leaves per tree). In a future post we can go deeper into how to select these hyperparameters.
  • RankyMcRankFace can optimize for a number of classic relevance metrics such as Precision, ERR, NDCG, and DCG. Here we optimize for NDCG@10 by passing it as the metric2t parameter.
  • The model is stored in the search engine under modelName=test. The model is associated with featureSet test so that, at search time, the search engine knows how the model’s features bind to the underlying search queries.

The model output is also interesting. You can examine the impact of each feature via trainLog.impacts, which stores a dictionary mapping each feature ordinal to how much it reduced the model’s error:

{'2': 64.48048267606816, '3': 33.63633930748523, '7': 31.319488313331828, '8': 2.72292608517665, '1': 0.014245484167042312, '6': 0.007610925647204436, '4': 0.0, '5': 0.0}

Feature impacts are a bit complicated to interpret in these kinds of models, but this still gives you a hint that feature 2 (content_bm25) matters most, followed by feature 3 (title_phrase_bm25) and feature 7 (excerpt_bm25). Pretty interesting!
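Since ordinals aren’t terribly readable, a few lines can map the impacts back to feature names. The feature_names list below is partly hypothetical (only features 1, 2, 3, and 7 are named in this post); it just needs to match the order of features in your featureset config:

# Map ordinal impacts back to feature names (order must match the featureset config)
feature_names = ['title_term_match', 'content_bm25', 'title_phrase_bm25', 'feature_4',
                 'feature_5', 'feature_6', 'excerpt_bm25', 'feature_8']

for ordinal, impact in sorted(trainLog.impacts.items(), key=lambda kv: kv[1], reverse=True):
    print("%-20s %.3f" % (feature_names[int(ordinal) - 1], impact))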

Training a good LambdaMART model is an art. Here we just want to get you started with the “Hello World” sandbox – there’s a lot about doing this well that we’re not discussing here. There’s additional functionality in ltr.train, such as feature_search, which performs a brute-force search for the feature mix that best optimizes NDCG@10 (or whatever metric you choose) based on k-fold cross validation.

Search with your model!

Finally, we can execute a search using the model, using the provided display:

blog_fields = {
    'title': 'title',
    'display_fields': ['url', 'author', 'categories', 'post_date']
}

from ltr import search

search(client, "beer", modelName='test',
       index='blog', fields=blog_fields)

According to the model I trained, the following posts are most relevant to a search for “beer”. As you might expect, a lot of events and something secret Google is doing to foster collaboration… hmm!

Holiday Open House at OSC – Come Share in the Fun
1.3306435
https://opensourceconnections.com/blog/2013/12/11/holiday-open-house-at-osc-come-share-in-the-fun/
john-berryman
['blog', 'solr']
2013-12-11T00:00:00-0500
---------------------------------------
4 Things Google is doing internally to foster collaboration and innovation
-0.032763068
https://opensourceconnections.com/blog/2008/03/13/4-things-google-is-doing-internally-to-foster-collaboration-and-innovation/
arin-sime
['blog', 'Opinion']
2008-03-13T00:00:00-0400
---------------------------------------
Trip Report: Shenandoah Ruby User Group
-0.032763068
https://opensourceconnections.com/blog/2008/04/30/trip-report-shenandoah-ruby-user-group/
eric-pugh
['blog', 'Code', 'Community', 'Speaking']
2008-04-30T00:00:00-0400

Eric Pugh, John Berryman, and Arin Sime seem to be the resident beer experts. Not unexpected.
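If you’re wondering what the search helper is doing under the hood, the Elasticsearch LTR plugin is typically applied as a rescore on top of a cheaper baseline query via the sltr query. Here’s a sketch of that usual pattern (not necessarily the exact query search() issues):

ltr_query = {
    # a cheap baseline query retrieves candidates...
    "query": {
        "match": {"content": "beer"}
    },
    # ...then the uploaded LTR model re-ranks the top window
    "rescore": {
        "window_size": 500,
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {"keywords": "beer"},
                    "model": "test"
                }
            }
        }
    }
}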

Where to go from here

I encourage you to plug your own data into Hello LTR! Try it out and please let me know how it goes. We love learning about unique problems in this space!

Hello LTR is also a training course we offer, using a lot of this code to build and explore Learning to Rank models with search relevance experts. Please get in touch if you would like a 2-day hands-on Learning to Rank course for your search team, and keep an eye out for public trainings.