In previous articles, we introduced Learning to Rank (LTR) for Elasticsearch 1, 2, which uses machine learning to re-score the best documents in a query. On the surface, it offers a tempting off-the-shelf solution to a chronically hard problem. But is your infrastructure ready?
Machine learning is fed on data. The learning algorithm is applied to a set of training data, which is used to construct a model. The model is then used to evaluate new data. Once in production, user interactions can be used to evaluate and iterate on the model. To realize the value of LTR, the infrastructure must be in place to collect and use that data. This article examines the infrastructure needs of a learning-to-rank project.
Step 1: Event Logs
Applied machine learning is fundamentally a data analysis problem. The success of a machine learning project, then, will be heavily dependent on the quality and quantity of the data available.
In the case of learning to rank, there are three basic questions:
- What did users search for?
- What documents were provided for each search?
- Were the documents provided good or bad for those keywords?
The answer to the first question is readily available in most server logs as API traffic: what endpoint was hit, with what parameters? The second requires application-level or orchestration-level logging to record the IDs of the documents provided. How difficult that is depends on the infrastructure available. The third, evaluating the suitability of the provided documents for a given query based on user interactions, is perhaps as much an art as a science.
Consider the case of users at an online furniture store who are looking to purchase a bookshelf:
- Users searched for a bookshelf.
- Users clicked on one set of bookshelves among many returned.
- Users continued interacting with the product page. Perhaps they read a review or checked the dimensions.
- Users put the bookshelf in the shopping cart.
- Users actually purchased the bookshelf.
In this case, there is a clear, easily measured progression of user interest based on interactions. Each of those interactions can be logged. But in order to be used for LTR, those interactions must be tied back to a given initial search.
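One way to make that tie-back possible is to give every search a unique ID and attach it to each later interaction. The sketch below is only an illustration under that assumption: the class names, the `search_id` field, and the action labels are hypothetical, and a real system would write to a log pipeline rather than to memory.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SearchEvent:
    search_id: str       # hypothetical correlation ID shared with later events
    keywords: str        # what the user searched for
    doc_ids: List[str]   # which documents were returned, in order


@dataclass
class InteractionEvent:
    search_id: str       # ties the interaction back to the originating search
    doc_id: str
    action: str          # e.g. "click", "add_to_cart", "purchase"


@dataclass
class EventLog:
    searches: Dict[str, SearchEvent] = field(default_factory=dict)
    interactions: List[InteractionEvent] = field(default_factory=list)

    def log_search(self, keywords: str, doc_ids: List[str]) -> str:
        """Record a search and return the ID the front end should carry forward."""
        search_id = str(uuid.uuid4())
        self.searches[search_id] = SearchEvent(search_id, keywords, list(doc_ids))
        return search_id

    def log_interaction(self, search_id: str, doc_id: str, action: str) -> None:
        """Record a user interaction together with the search that produced it."""
        self.interactions.append(InteractionEvent(search_id, doc_id, action))

    def interactions_for(self, search_id: str) -> List[Tuple[str, str]]:
        """Return (doc_id, action) pairs recorded for one search, in logged order."""
        return [(e.doc_id, e.action) for e in self.interactions
                if e.search_id == search_id]
```

With this in place, the three questions above can each be answered from the log: `searches` holds the keywords and the documents provided, and `interactions_for` supplies the raw signal for judging them.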
In cases where the user’s interactions are not transaction-based, identifying appropriate conversion metrics can be much harder.
Google Analytics is a common platform for collecting data about broader user interactions. In a previous article, Doug looked at Google Analytics in the context of relevancy in general 3. It can be a great platform for identifying common queries, and metrics like page views, time on page, and bounce rate can tell you a lot about the overall user experience for those queries. It may or may not be possible to tie these metrics back to individual search results.
Why does tying user events back to a search query matter?
To understand why tying user events back to an initial search query matters, let’s quickly review what goes into an LTR implementation.
Generating an LTR model requires a judgement list: a mapping file which numerically indicates how relevant a given document is for a given query. Consider our online furniture store, which has several products: small bookshelf, medium bookshelf, large bookshelf, kitchen table, end table, sofa.
With a very small data set, we can manually construct simple binary judgement mappings: a product is appropriate for a query (1) or not (0).
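As a rough illustration, a hand-built binary judgement list for this catalog might look like the following. The queries and the 0/1 grades are our own illustrative assignments, not data from a real store, and the `flatten` helper is hypothetical.

```python
# Hand-assigned binary judgements: query -> {product: 1 (appropriate) or 0 (not)}.
# Grades are illustrative only.
JUDGEMENTS = {
    "bookshelf": {
        "small bookshelf": 1,
        "medium bookshelf": 1,
        "large bookshelf": 1,
        "kitchen table": 0,
        "end table": 0,
        "sofa": 0,
    },
    "table": {
        "small bookshelf": 0,
        "medium bookshelf": 0,
        "large bookshelf": 0,
        "kitchen table": 1,
        "end table": 1,
        "sofa": 0,
    },
}


def flatten(judgements):
    """Turn the nested mapping into (grade, query, product) rows."""
    return [
        (grade, query, product)
        for query, grades in judgements.items()
        for product, grade in grades.items()
    ]
```

Even with two queries and six products this is twelve rows to grade by hand, which hints at why the approach does not scale.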
In most real-world scenarios, however, both the range of user queries and the set of products available will be much too large to construct by hand. These judgements will be constructed by mapping user events back to the initial source queries.
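A minimal sketch of that mapping, assuming each interaction event carries the ID of the search that produced it. The grade scale in `ACTION_GRADES` is a hypothetical choice: it simply treats actions further along the purchase funnel as stronger evidence of relevance, and takes the strongest action seen for each query and document pair.

```python
# Hypothetical grade scale: deeper funnel actions count as stronger relevance.
ACTION_GRADES = {"click": 1, "detail_view": 2, "add_to_cart": 3, "purchase": 4}


def build_judgements(searches, interactions):
    """Derive graded judgements from logged events.

    searches:     dict of search_id -> (keywords, [doc_ids returned])
    interactions: list of (search_id, doc_id, action) tuples
    Returns a dict of (keywords, doc_id) -> grade.
    """
    grades = {}
    # Every document that was shown starts at grade 0 (no evidence of relevance).
    for keywords, doc_ids in searches.values():
        for doc_id in doc_ids:
            grades.setdefault((keywords, doc_id), 0)
    # Raise the grade to the strongest action observed for that query/document.
    for search_id, doc_id, action in interactions:
        if search_id not in searches:
            continue  # cannot be tied back to a search, so it is unusable here
        key = (searches[search_id][0], doc_id)
        grades[key] = max(grades.get(key, 0), ACTION_GRADES.get(action, 0))
    return grades
```

Note that the interaction with an unknown search ID is dropped: an event that cannot be traced to its source query contributes nothing to the judgement list.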
Step 2: Historical Feature Values
Judgement lists only tell part of the story required for training an LTR model. In order for the model to score documents for new queries or documents, it is necessary to abstract both the query and the document to a feature set.
A feature is a representation of some aspect of the content being searched 4. In Elasticsearch, a feature is a numerical attribute, such as a query score, function score, or even a numerical document value like recency or popularity. Let’s consider the online furniture store from the previous example, with two features: “score title” and “popularity”. Let score title be the frequency of the query term in the title, and popularity the percentage of sales attributed to this item in the past month.
Selecting the best features is a hard problem - that may be what has led you to investigate learning to rank in the first place. In order to train your model, it is necessary to have feature values for each document returned from a user search. Logging or gathering feature values for model training creates a bit of a bootstrapping problem for your infrastructure, and how to solve it depends on how you plan to train your model.
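To make the two example features concrete, here is a small sketch that computes them directly. It is a simplification under stated assumptions: in Elasticsearch the title feature would normally be a query score rather than a raw term count, and the function names and sales figures are hypothetical.

```python
def score_title(query: str, title: str) -> int:
    """Count how often the (single-term) query appears in the title."""
    term = query.lower()
    return sum(1 for token in title.lower().split() if token == term)


def popularity(item_sales: int, total_sales: int) -> float:
    """Percentage of all sales in the period attributed to this item."""
    return 100.0 * item_sales / total_sales if total_sales else 0.0


def feature_vector(query, title, item_sales, total_sales):
    """Feature values in a fixed order: [score title, popularity]."""
    return [float(score_title(query, title)),
            popularity(item_sales, total_sales)]
```

For example, `feature_vector("bookshelf", "Large Bookshelf", 30, 200)` gives `[1.0, 15.0]`: one occurrence of the term in the title, and 15% of the month's sales.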
Considerations with Training
Learning algorithms which process large training sets at once are called batch or offline methods. Other systems, which process examples one at a time, are called online methods. 5 As with many real-world implementations, learning to rank can use either or both approaches.
Offline training requires some infrastructure beyond the event logging requirements discussed above. It must be possible to calculate or estimate feature values for the documents included in the training set. What this means depends a bit on what your features are. If your features are Elasticsearch queries, then the historical documents need to be indexed to calculate those query scores. If your training data is a superset of live data, or if your training data set is very large, that may require additional capacity for your Elasticsearch cluster.
This leaves a great deal of flexibility:
- Organizations potentially have a lot more training data to work with, because they can use historical data
- As was already mentioned, selecting the right feature set can be challenging. By not requiring feature values to be logged with the historical data, offline training makes it easier to hypothesize and test the effects of different features.
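For offline training, the judgements and the computed feature values are eventually joined into a training file. A minimal sketch follows, assuming the RankLib-style text layout (`grade qid:N 1:value 2:value # comment`); the helper name and the sample values are hypothetical.

```python
def training_line(grade, qid, features, comment=""):
    """Format one training example as a RankLib-style line.

    grade:    relevance judgement for this query/document pair
    qid:      numeric ID of the query
    features: feature values in a fixed order (feature numbers start at 1)
    comment:  optional free text, e.g. the document ID
    """
    feats = " ".join(f"{i}:{value}" for i, value in enumerate(features, start=1))
    line = f"{grade} qid:{qid} {feats}"
    return f"{line} # {comment}" if comment else line
```

Because the feature values are computed when the file is built rather than read from the logs, trying a new feature only means recomputing that column for the historical documents.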
There are also some risks implicit to using historical data for training:
- The value of a feature at the time a historical query was made might be different from the value of that feature “right now”. For example, consider document recency as a potential document feature. If a document is posted on day 0 and a user queries on day 2, the document was pretty new at the time of the query. If features are collected on day 30, the current freshness value will be 30 days, but the value used in training should be 2 days.
- This same principle applies to text-based similarity scores. In a Lucene-based search engine, basic similarity scores are dependent on document frequency. If the collection is changing frequently, the similarity score for the same document can and will change over time. A rare term today might be common next week, and rare again in a month.
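Both risks can be shown in a few lines. The sketch below is illustrative only: the dates and counts are made up, and the `idf` function uses the inverse document frequency formula from Lucene's BM25 similarity as an assumption about how the engine scores term rarity.

```python
import math
from datetime import date


def age_in_days(posted: date, as_of: date) -> int:
    """Recency feature: document age measured at a chosen point in time."""
    return (as_of - posted).days


def idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style inverse document frequency for a term.

    doc_freq:  number of documents containing the term
    doc_count: total number of documents in the collection
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


posted = date(2024, 1, 1)        # "day 0" (illustrative date)
query_time = date(2024, 1, 3)    # user searched on day 2
collected = date(2024, 1, 31)    # features gathered on day 30

age_for_training = age_in_days(posted, query_time)  # value the user actually saw
age_if_naive = age_in_days(posted, collected)       # stale value if computed "now"

# The same term scores differently as the collection changes around it.
idf_when_rare = idf(doc_freq=10, doc_count=1000)
idf_when_common = idf(doc_freq=500, doc_count=1000)
```

The recency case is fixable if the query timestamp is logged, since the age can be recomputed as of that moment. The similarity case is harder, because reproducing the old score requires the collection statistics as they were at query time.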
What’s Next and Get In Touch!
This article is part of a series on learning to rank. Topics include:
- Basics: more about what learning to rank is exactly
- Applications: Using learning to rank for search, recommendation systems, personalization and beyond
- Models: What are the prevalent models? What considerations play in selecting a model?
- Considerations: What technical and non-technical considerations come into play with Learning to Rank?
If you think you’d like to discuss how your search application can benefit from learning to rank, please get in touch. We’re also always on the hunt for collaborators or for more folks to beat up our work in real production systems. So give it a go and send us feedback!