This is the second post in my blog series about participating in TREC 2020. If you want some background about what exactly TREC is, check out my first post. This second post is a high-level look at the search strategy we used for the News track.
TREC is a conference designed to socialize ideas and strategies in information retrieval. Past TREC submissions are the best place to learn about what works: we stood on the shoulders of these past submissions and built an open source search stack. Our strategy’s foundation was Elasticsearch’s More Like This query; we then layered on new features, like parameter tuning, named entity recognition, and language embeddings, in an effort to improve our system’s retrieval performance.
Open source search stack
At OSC, we see the value in search solutions that are usable by anyone, anywhere. So the first requirement for our TREC submissions was that everything in our strategy be implemented in open source tools. We used a combination of Python and Elasticsearch, but these ideas could easily be implemented in a different search engine, like Solr.
Learning from those who came before us
Because TREC is focused on advancing search, one of the requirements for participation is publishing your methodology. No proprietary trade secrets allowed; everything has to come out. This stipulation makes it easy to learn from previous research and ensures that meaningful findings are available to the whole community. So before we even began indexing documents and working with a search engine, we wanted to learn what past News track participants had tried. There are loads of unique ideas and features to be found in these past publications, and this post won’t cover all of them. Instead I want to highlight the common threads that ran through many different submissions: if they worked for a lot of other folks, chances are good they would work for us. There are four common threads that I’ll highlight here.
Search strategy in past News tracks
BM25 is good
The first major take-away from past papers was that BM25 works really well as the relevance scoring function. Past participants tested other relevance functions, like Query Likelihood and the Relevance Language Model, in place of BM25, but we didn’t find a paper where either outperformed it. There is a reason BM25 is the default scoring method in both Solr and Elasticsearch today. This convinced us not to adjust the scoring function and to spend our development time elsewhere.
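For reference, this is the standard BM25 formula (not something specific to our submission): the score of document D for query Q sums, over the query terms t, a term-frequency component damped by document length:

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(t, D) is the term frequency of t in D, |D| is the document length, avgdl is the average document length in the corpus, and k1 and b are free parameters (Elasticsearch defaults to k1 = 1.2 and b = 0.75).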
Document structure matters
The second take-away was using the natural structure of the news articles. The documents in the training corpus include the paragraph structure. Because news articles follow a formula, there is information captured in the ordering of content: broader, more important details occur in earlier paragraphs. To leverage this structure, a common strategy was to index paragraphs as separate fields, instead of combining them into a single body field. We used this separate field strategy and took it a step further: we also applied our enrichment strategies – named entities and transformers – at the paragraph level.
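As a rough sketch of what per-paragraph indexing looks like, here is an Elasticsearch-style mapping built in Python. The field names (`para_1`, `para_2`, …) and the paragraph count are our own illustration, not the exact mapping from our submission:

```python
# Sketch of an index mapping with one text field per leading paragraph.
# Field names and the number of paragraph fields are illustrative only.

def build_mapping(num_paragraphs=5):
    """Build an Elasticsearch-style mapping dict with a separate text
    field for each of the first N paragraphs, plus title and body."""
    properties = {
        "title": {"type": "text"},
        "body": {"type": "text"},  # catch-all for the full article text
    }
    for i in range(1, num_paragraphs + 1):
        properties[f"para_{i}"] = {"type": "text"}
    return {"mappings": {"properties": properties}}

mapping = build_mapping()
# The dict would then be passed to the official Python client, e.g.:
# Elasticsearch().indices.create(index="wapo", body=mapping)
print(sorted(mapping["mappings"]["properties"]))
```

Keeping paragraphs in separate fields means later queries can boost earlier paragraphs independently, rather than treating the whole article as one undifferentiated bag of words.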
Named entities are important to news
The third major take-away was using named entity recognition (NER); after all, this is a news-related task, and proper nouns carry a lot of weight in articles. Most of the papers we found attempting NER used the spaCy (Python) library’s implementation, so we decided to follow their lead. spaCy is well documented and has a vibrant community of users, so while this step is more about enrichment before indexing, it fits with our goal of using tools that are already in widespread use.
Transformers are hot right now
The fourth major take-away was the use of transformer models for language embeddings. This direction was driven more by the general search community and wasn’t mentioned in the past TREC submissions we read. Unless you’ve been living under a rock for the last two years, you’ve heard about BERT, perhaps the most popular transformer. Transformers have been setting performance records on question answering benchmarks, and the search community is beginning to use them for document retrieval, but there are still open questions about their utility in retrieval systems.
One of the major limitations of these models is their performance on longer passages of text, like news articles. None of the papers we read reported an increase in retrieval performance when BERT was added to their relevance solution. Despite this, we have been very keen on the Sentence-BERT model from UKPLab, because it is specifically aimed at addressing the shortcomings of vanilla BERT on longer pieces of text. So we decided to try it.
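One simple way embeddings can complement BM25 is to blend the lexical score with an embedding similarity. A minimal sketch: the toy vectors stand in for real Sentence-BERT embeddings, and the mixing weight `alpha` is a hypothetical value that would need tuning:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def blended_score(bm25_score, query_vec, doc_vec, alpha=0.7):
    """Blend a lexical BM25 score with an embedding similarity.
    alpha is a hypothetical mixing weight, not a tuned value."""
    # Cosine similarity lies in [-1, 1]; rescale to [0, 1] before mixing.
    sim = (cosine(query_vec, doc_vec) + 1) / 2
    return alpha * bm25_score + (1 - alpha) * sim

# Toy vectors standing in for Sentence-BERT sentence embeddings.
q = np.array([0.2, 0.8, 0.1])
d = np.array([0.25, 0.7, 0.05])
score = blended_score(12.3, q, d)
print(round(score, 3))
```

In practice the BM25 score would also need normalizing (it is unbounded), and a rerank-only setup, where embeddings reorder a BM25 candidate list, is a common alternative to mixing raw scores.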
Our search strategy
To summarize our search strategy for the TREC 2020 News track:
- Use BM25 as our primary scoring function and take advantage of the specialized Elasticsearch query “More Like This”
- Place a higher importance on the content of earlier paragraphs
- Enrich our index with fields of named entities
- Add semantic embeddings to capture context and complement BM25
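The pieces above fit together in a More Like This query over the paragraph fields. A minimal sketch of the query body in Python; the field names, boost values, and cutoffs are illustrative, not our tuned settings:

```python
# Sketch of a More Like This query over boosted paragraph fields.
# Field names, boosts, and frequency cutoffs are illustrative only.

def mlt_query(topic_doc_id, index="wapo"):
    """Build a More Like This query body that finds articles similar
    to the topic article, weighting earlier paragraphs more heavily."""
    return {
        "query": {
            "more_like_this": {
                # Field boosts place higher importance on early content.
                "fields": ["title^3", "para_1^2", "para_2^1.5", "body"],
                # "like" references the topic document already in the index.
                "like": [{"_index": index, "_id": topic_doc_id}],
                "min_term_freq": 1,
                "min_doc_freq": 2,
            }
        }
    }

query = mlt_query("some-article-id")
print(query["query"]["more_like_this"]["fields"])
```

Under the hood, More Like This extracts the most interesting terms from the topic document and runs them as a weighted BM25 query, which is why it pairs naturally with the per-field boosting shown here.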
Next we’ll be writing some shorter technical posts about the queries and performance results. If you’d like help building a cutting-edge information retrieval system for news or any other kind of content, do get in touch!