Title Search: when relevancy is only skin deep

December 8, 2014 Doug Turnbull

How do users judge that articles, books, and blog posts are relevant to what they’re searching for? What about you? If you’re searching for an article on ‘Socrates’, what might be relevant to you?

A lot of search relevancy work with Solr or Elasticsearch focusses on getting really deep into written prose–the type that occurs in the body text of articles and books. To extract whether a document is relevant, we study complex scoring models, latent semantic indexing, natural language processing and other techniques that can extract meaning and metadata from written prose.

Those are powerful techniques. But do you know what often really matters? Often it’s the simple things. Sometimes it’s just something as simple as whether our search term, “Socrates”, is mentioned prominently in the article’s title. Titles speak volumes about curated, published content.

Titles are:

very thoughtfully written by authors,
the most concise description of publication “aboutness”,
written intentionally to snag the target audience, and
designed to appeal to our extremely short attention spans

So having a very good title search helps tremendously with search relevance.

Great! Problem solved. Search over the ‘title’ field and we’re done, right?

Not quite.

In fact, much of my most challenging relevancy work has turned out to be focussed heavily on understanding and working with title text. In this article, I want to share my thoughts on what has worked well when working with title fields across a number of relevancy projects using Solr and Elasticsearch.

Term Frequency Doesn’t Matter, but Phrase Matching Does

Which article on Socrates is more relevant?

Socrates on Socrates

Socrates, a brief biography

They’re both probably just about the same. Both articles are about Socrates. Compare your intuition about these titles with your intuition about these snippets of body text.

Plato was pretty cool. Plato liked to party a lot. Sometimes with Socrates.

Socrates was kind of grumpy. Socrates didn’t like Plato

You can see in these snippets that how often our search term (let’s say “Socrates”) is mentioned matters much more to our intuition when judging relevance. We want a computer to see that the term “Socrates” is mentioned more frequently in the second snippet, and therefore rank that document highly when returning search results. The other document mentions “Socrates” only once, in passing, so our notion is often that this document should not rank highly in the search results.

The measure of how often a term (e.g. “Socrates”) is mentioned in a document is known as “term frequency”, or TF for short. Term frequency is a fundamental part of most relevancy ranking algorithms. Lucene-based search engines like Solr and Elasticsearch use the term frequency in their default scoring formula.
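As a rough illustration of term frequency, here is a minimal Python sketch that counts how often a term occurs in each of the two body snippets above. The `tokenize` and `term_frequency` helpers are hypothetical names made up for this example, and the tokenization is far cruder than a real Lucene analyzer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_frequency(term, text):
    """Return how many times `term` occurs in `text`."""
    return Counter(tokenize(text))[term.lower()]

snippet_1 = "Plato was pretty cool. Plato liked to party a lot. Sometimes with Socrates."
snippet_2 = "Socrates was kind of grumpy. Socrates didn't like Plato"

print(term_frequency("Socrates", snippet_1))  # 1
print(term_frequency("Socrates", snippet_2))  # 2
```

A scoring formula that rewards higher term frequency would therefore favor the second snippet for a “Socrates” search.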

Much of the information-retrieval science that goes into search engines relies on our intuitions about written prose like the snippets above, which is why term frequency is a key component of relevancy scoring. What about title search? Is term frequency as important?

In the case of title search, aboutness hasn’t correlated with term frequency in my relevancy tuning experience. Instead, a single mention of a term in a title seems enough to convince us that the title is likely about what we’re searching for. Titles don’t go on and on verbosely about a subject like written prose does; they are concise and to the point. When a term is mentioned more than once, as in our article “Socrates on Socrates”, it’s a fleeting flourish, not meaningful to judging results. Mentioning Socrates twice doesn’t convince us this article is any more about “Socrates” than an article simply entitled “Socrates”.

Phrase Matching

Phrase matching, taking the user’s multi-word query (“Socrates on Socrates”) and attempting to find that exact phrase in a title, matters tremendously. We want to support users searching for exact titles or parts of titles.

To support this form of querying, the search index needs to have a feature known as term positions enabled. This tells a search engine like Solr or Elasticsearch where individual terms in a field (“Socrates on Socrates”) are with relation to each other. You can imagine them as getting encoded as Socrates{0,2}, on{1} – denoting “Socrates” occurs in positions 0 and 2 and “on” occurs in position 1. (check out how this really looks here).

Once each term is indexed with its associated position, the search engine can match up the positions of the terms in the user’s query with the positions of terms in the field (in our case “title”). A search for phrases or exact titles like “Socrates on Socrates” can use position information to rank phrase matches more highly than simple term matches.
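To make the idea of term positions concrete, here is a small Python sketch of a positional index for a single field and a naive exact-phrase check against it. The function names are hypothetical, whitespace tokenization is an assumption, and real engines store and walk positions far more efficiently.

```python
from collections import defaultdict

def index_positions(text):
    """Map each lowercased term to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, term in enumerate(text.lower().split()):
        positions[term].append(pos)
    return dict(positions)

def phrase_match(phrase, positions):
    """True if the phrase's terms occur at consecutive positions in the field."""
    terms = phrase.lower().split()
    if not terms:
        return False
    # Try every occurrence of the first term as a possible start of the phrase.
    for start in positions.get(terms[0], []):
        if all(start + i in positions.get(t, []) for i, t in enumerate(terms)):
            return True
    return False

title = index_positions("Socrates on Socrates")
print(title)                                      # {'socrates': [0, 2], 'on': [1]}
print(phrase_match("on Socrates", title))         # True
print(phrase_match("Socrates Socrates", title))   # False
```

Without the stored positions, the last two queries would be indistinguishable: both contain only terms that appear somewhere in the title.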

Phrase Matching without Term Frequencies

As we noted in our last section, our ideal title search disables term frequencies in scoring but uses phrase matching. Unfortunately, neither Solr nor Elasticsearch has the ability to enable positions while disabling term frequencies. The options for Elasticsearch are documented here. They allow configuring a field with:

freqs (doc numbers and term frequencies),
positions (doc numbers, term frequencies and positions)

Similarly, in Solr’s schema, you can set:

omitPositions
omitTermFreqAndPositions

What we need is an “omitTermFreqs” option that keeps positions on. This doesn’t appear to be a feature, and indeed, as my colleague Peter Dixon-Moses pointed out, this seems to be tied to how Lucene’s underlying index API is organized.

Solutions?

An often cited solution is to write a custom Lucene Similarity plugin. Lucene’s Similarity class defines rules for how exactly TF or other ranking statistics are calculated from data stored in the index. You’d still enable term frequencies when indexing your fields, but when it came to using them you’d hijack the calculation and return 1.0 instead. You can write some really basic Java code and return whatever you want for these statistics.

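import org.apache.lucene.search.similarities.DefaultSimilarity;

// Scores every matching term as if it occurred exactly once in the field.
public class NoTfSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return 1.0f;
    }
}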

(then similarly enable this similarity for your field in Solr or Elasticsearch)

Not everyone likes Java plugins. I try to avoid them unless I have a clear need I can’t solve with the external API. So another solution I use is simply to define two fields:

notf_title (A title field configured with both term frequencies and positions disabled)
phrase_title (A title field with freqs/positions enabled, for phrase matching)

Solr’s edismax handler lets me specify fields for normal single term query matching/scoring and other fields for just phrase matching. With Solr, I could do:

defType=edismax
q=Socrates on Socrates
qf=notf_title
pf=phrase_title

Elasticsearch’s Query DSL lets me explicitly state both clauses, combined in a bool query:

{
    "query": {
        "bool": {
            "should": [
                { "match_phrase": { "phrase_title": "Socrates on Socrates" } },
                { "match": { "notf_title": "Socrates on Socrates" } }
            ]
        }
    }
}

By working with Solr or Elasticsearch’s query DSLs and applying the appropriate boosts and weights to both queries, we can tune how much influence we want from term frequencies vs phrases.

Know Your Pantheon – IDF, Norms, and Keepword Curation with Titles

With much less text, often focussed on a specific subject, title fields don’t follow the same statistical patterns as free text “body” fields. We saw this above when we discussed term frequency. Let’s explore another consequence of this. Let’s say you searched for “Who was Plato” in our philosophy articles. To your chagrin, you get the following set of results:

Search: Who was Plato?
Results:
– Who was Thales?
– Who was Socrates?
– Who was Aristotle?
– … (50 results later)
– Who is Doug Turnbull?
– Plato: The Biography

What went wrong? Why is the Plato result so low in our search results?

You would think that “Plato” is pretty specific. It turns out for our title field, however, that a search for “Plato” is less meaningful in relevancy ranking than a search for “Who”. This is because of something called inverse document frequency (IDF) and how it tends to work in odd ways with title fields.

What is inverse document frequency? Whereas term frequency tells us about how frequent a term is in a specific document, another statistic, “Inverse Document Frequency” (IDF) tells us about how rare a term is over all of the documents.

IDF matters because it helps us measure how important a search term is in a document relative to other matching documents in the corpus. For example, if Socrates is mentioned in two documents, then we have a notion that each of those articles contains roughly 50% of the available mentions of Socrates. If we introduce more Socrates articles, say two more documents, our original document suddenly represents a lower proportion of all the Socrates mentions: 1/4, or 25%, of them. As a result, our original document isn’t scored as highly.

More importantly for us, when our query contains multiple search terms, as in “Who was Plato”, the relative IDF (again, read rareness) of each term plays a role. In our hypothetical “body” field, “Plato” is very rare in written prose compared to “who”, so matches on “Plato” dominate the ranking. English prose tends to follow some common statistical patterns. However, in our title field, it turns out “Plato” is very common (thousands of articles written about Plato) and “Who” is exceedingly rare (very few articles with “Who” in the title). Again, we’re not seeing a normal sample of English prose in our title field, so the founding intuitions behind these stats aren’t quite maintained. In other words, titles are a statistical outlier, and we need to revisit some assumptions.
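The inversion described above can be sketched numerically. The snippet below uses the IDF formula from Lucene’s classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The corpus size and document frequencies are made-up numbers chosen only to mirror the “thousands of Plato titles, very few Who titles” situation; they are not measurements from a real index.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: the rarer the term, the higher the weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_TITLES = 10_000                             # hypothetical number of indexed titles
title_doc_freq = {"plato": 2_000, "who": 50}    # made-up document frequencies

for term, doc_freq in title_doc_freq.items():
    print(term, round(idf(doc_freq, NUM_TITLES), 2))

# In this toy title field, "who" outweighs "plato".
print(idf(title_doc_freq["who"], NUM_TITLES) > idf(title_doc_freq["plato"], NUM_TITLES))
```

With these numbers, a match on “who” contributes more to the score than a match on “plato”, which is exactly the behavior that pushes “Plato: The Biography” down the list.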

How do we fix this? Build a Pantheon.

Is IDF useless for title fields? No, not quite, but it requires us to carefully curate a list of terms that are likely to get written about in our domain to craft a meaningful IDF. What do I mean by this? Well, in the context of philosophy, proper names such as “Plato” and “Aristotle” are obviously in. It’s also important to capture slightly more obscure philosophers like “Erasmus Darwin”. We’d also include other important topics like “epistemology”: all the important philosophy jargon, topics, and other nouns that articles in our corpus would be written about.

I call this list of concepts our “Pantheon”: a list of subjects in our domain, professionally curated by domain experts. Actually building such a list can be time consuming, but for many fields, much of the work is done for you. For example, in medicine, there’s the MeSH vocabulary that attempts to cover everything that could be written about in medicine. For other domains, it may be possible to create such a list yourself manually or use NLP techniques.

We can use our pantheon along with a KeepWordsFilter (KeepWordFilterFactory in Solr, or the keep token filter in Elasticsearch) to create yet another search field to use in our search. We can create a “keep words” list that contains the terms in our pantheon. Only terms in our list make it into the search index. We can call this field pantheon_title. For example, when the following title is analyzed to go into the index:

Who was Socrates

we will strip out all terms other than the ones in our pantheon:

Socrates

Similarly the title

Socrates and Plato on Metaphysics

can be boiled down to these three members of our pantheon:

Socrates Plato Metaphysics
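The effect of the keep-words step can be sketched in a few lines of Python. This is only an approximation of what the analysis chain does at index time: the `PANTHEON` set is a tiny hypothetical list, `keep_words` is a made-up helper name, and multi-word pantheon entries are not handled.

```python
import re

# A tiny, hypothetical pantheon; a real one would be curated by domain experts.
PANTHEON = {"socrates", "plato", "aristotle", "metaphysics", "epistemology"}

def keep_words(title, keep=PANTHEON):
    """Return only the tokens of `title` that appear in the keep list, in order."""
    tokens = re.findall(r"\w+", title.lower())
    return [token for token in tokens if token in keep]

print(keep_words("Who was Socrates"))                   # ['socrates']
print(keep_words("Socrates and Plato on Metaphysics"))  # ['socrates', 'plato', 'metaphysics']
```

Everything outside the pantheon (“who”, “was”, “and”, “on”) never reaches the pantheon_title field, so it cannot influence that field’s statistics.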

Now IDF can be used on this new field to help us. IDF now tells us how commonly each topic in our list of high-value terms is written about. Socrates is going to be a very common term in our new title field, with many pantheon_title fields mentioning “Socrates”. Erasmus Darwin, however, not so much. The search engine can use IDF to know how rare (and therefore how highly ranked) titles that match Erasmus Darwin should be.

Additionally, we’ve filtered out the junk words. If I search for “Who is Socrates”, we’ll get something closer to the desired behavior, as we’ll be searching a field that has stripped out all the words that have little to do with philosophy. In free text, “junk” words (so-called stop words) tend to be OK to leave in, as they are so common they have very little impact on the scoring. Here, however, we need to remove them precisely because the search engine doesn’t have enough data to know they are junk through IDF.

With this new field, we can rely heavily on it in our title search solution, giving it a strong preference over plain-text title matches. For Solr, this would mean calibrating the weights to give more emphasis to pantheon_title, only relying on notf_title as a last-resort match:

qf=pantheon_title^100 notf_title
pf=phrase_title

With Elasticsearch, we could use the query DSL’s multi_match to boost appropriately:

{
    "query": {
        "bool": {
            "should": [
                { "match_phrase": { "phrase_title": "Socrates on Socrates" } },
                {
                    "multi_match": {
                        "query": "Socrates on Socrates",
                        "fields": ["notf_title", "pantheon_title^100"]
                    }
                }
            ]
        }
    }
}

Normal caveats about selecting correct dismax boosting apply.

Getting No More Specific Than The Query

Oh great, looks like our search engine for philosophy articles has taken on a new set of documents from the International Journal of Hyperdomal Metaphysics. Unfortunately, after indexing these documents, we’ve noticed our search for Socrates has gone askew. Suddenly we’re getting very esoteric titles that happen to mention Socrates, not general titles just about Socrates:

Search: Socrates
Results:

What is the relationship between Socrates and Plato when it comes to the differences in opinion on hyperdomal metaphysics?
…
Socrates Bio

Typically in my experience, when users search for “Socrates” they want the general result – a bio or general article that covers the search term. Only when users get more specific do they want to dive into the more specific subjects. All things being equal, you don’t want to be any more specific than a user’s query.

This behavior is related to what are known in Lucene-based search engines as “norms”. Norms bias search results towards shorter pieces of text. This can be thought of as a bias towards density. Consider these two pieces of prose about Socrates:

Socrates was a philosopher. Everybody loves Socrates.

Socrates partied all the time. He partied with Plato. He also partied with ...... Socrates was the coolest philosopher ever.

These two pieces of prose both contain two mentions of our search term “Socrates”. Search engines tend to bias search results towards documents where a larger percentage of the text contains the search term. In other words, they bias search results towards shorter fields. In our example, the first snippet would be scored more highly than the second.
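A back-of-the-envelope version of this bias is sketched below. It uses the length norm from Lucene’s classic similarity, 1 / sqrt(number of terms in the field), and combines it with a dampened term frequency into a toy score. The `toy_score` function is a simplification invented for this example, not the engine’s full formula, and the length of the longer snippet is a hypothetical figure because that snippet is elided above.

```python
import math

def length_norm(num_terms):
    """Classic Lucene-style length norm: shorter fields get a larger multiplier."""
    return 1.0 / math.sqrt(num_terms)

def toy_score(term_freq, num_terms):
    """A toy score: dampened term frequency scaled by the length norm."""
    return math.sqrt(term_freq) * length_norm(num_terms)

# Both snippets mention "Socrates" twice; only their lengths differ.
short_field = toy_score(term_freq=2, num_terms=7)    # the first snippet has 7 terms
long_field = toy_score(term_freq=2, num_terms=40)    # hypothetical length for the longer snippet

print(short_field > long_field)  # True
```

The same mechanism applies to titles: the fewer terms a field holds, the more each matching term counts.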

With our title field, and more importantly our pantheon_title field, we can use norms to a related effect. If our search term is simply “Socrates”, shorter versions of pantheon_title that just contain “Socrates” come first. Additional important concepts in our pantheon_title field cause the document to be punished somewhat in relevancy ranking. We can see this if we take apart our titles:

title: What is the relationship between Socrates and Plato when it comes to the differences in opinion on hyperdomal metaphysics?
pantheon_title: Socrates, Plato, hyperdomal metaphysics

title: Socrates Bio
pantheon_title: Socrates

With norms on, we’ll bias towards the shorter matching pantheon_title field. We’ll get exactly as specific as the user’s search query, no more, no less. We end up with a saner ranking:

Socrates on Socrates
Socrates Bio
…
What is the relationship between Socrates and Plato when it comes to the differences in opinion on hyperdomal metaphysics?

Only by the user adding “Plato” and/or “hyperdomal metaphysics” to their search query will they pull up the more specific title into our search results.

Luckily for us, Solr and Elasticsearch enable norms by default. However, it’s an often-cited optimization to disable norms. So if you find yourself in a sticky situation where your title results are far more specific than you’d like, be sure norms are enabled on your field. For Solr, this is controlled by the omitNorms attribute, which should be false for the field. Elasticsearch exposes this setting through the mappings API.

These are just the building blocks

I’ve given you the broad, sweeping pieces of a good title search. There are a lot of extra puzzles that need to be worked out.

Working with your pantheon can be tricky – consider a pantheon full of multi-term phrases. Will the search engine treat them the same as single terms? What about nested concepts, like these two:

hyperdomal metaphysics
metaphysics

How do these get treated? Do we need more than just a KeepWordsFilter to capture these? Should we craft field norms to consider these redundant concepts, and not punish for related concepts? What about synonyms between these concepts? How will those factor into norms, IDF, and term frequency?

Here’s another puzzle: if instead of a vocabulary, we have an ontology – a set of connected concepts – could we use the notion that multiple concepts are related? Could we use the fact that we have nothing on “hyperdomal metaphysics” to surface results on the parent concept of “metaphysics”?

And this is only one important piece of a good relevancy solution. What about that prose in the body text? What about layering in other positive signals of relevance, like tracking and using user usage metrics?

In short, relevancy is hard work! But at OpenSource Connections, we love it! Contact us if you want this level of detail applied to your search. And check out Quepid, our relevancy regression tester and its little sister Splainer if you need help with your search tuning!