According to Wikipedia, “Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant.”
So there shouldn’t be any harm in removing them from the data we are indexing in search engines and from the queries that are sent to these to retrieve relevant results, right?
Wrong! Here are 10 reasons why you should not remove stop words.
Nowadays, disk space is seldom a limiting factor, so removing stop words to save a significant amount of space is the wrong approach.
Weird side effects
There are edge cases (e.g. removing stop words in a fuzzy matching query) where you think that stop words are removed but internally they are preserved. This produces unwanted, inaccurate or irrelevant results.
Shakespeare would turn in his grave
Try searching for “to be or not to be” in a search application that removes stop words. Shakespeare wouldn’t approve.
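A minimal sketch of what happens to that query, assuming a naive whitespace tokenizer and a small illustrative stop list (both are assumptions for this example, not the behavior of any particular search engine):

```python
# Illustrative stop list (an assumption, not any engine's actual default).
STOP_WORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}


def analyze(text, remove_stop_words):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


query = "to be or not to be"
kept = analyze(query, remove_stop_words=False)
stripped = analyze(query, remove_stop_words=True)

print(kept)      # all six tokens survive
print(stripped)  # nothing is left to search for
```

With stop word removal switched on, every token of the query is filtered out, so the engine has nothing left to match against.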
Loss of meaning
Blunt instrument for relevance improvements
Removing stop words to improve relevance is the wrong motivation. Oftentimes, relevance tuning happens on a query-by-query basis, making globally removing stop words a rather blunt instrument.
Search systems can include multiple reranking steps after retrieving the matching documents based on scoring mechanisms like BM25. These rerankers can be internal or external to the search system. For instance, external reranking can be done via services like Metarank, or via learning-to-rank plugins for Lucene-based search engines like Elasticsearch. Neural search is increasingly used for reranking tasks. Removing stop words may lead to poor recall, i.e. fewer relevant documents being retrieved in the first step. The rerankers can only rank the documents that were retrieved in that first step, so removing stop words may well mean losing good ranking candidates.
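The effect on a two-stage pipeline can be sketched with a toy example. Everything here is hypothetical: the two documents, the stop list, the overlap-count scoring that stands in for BM25, the top-1 cutoff, and the exact-phrase reranker that stands in for a real reranking model.

```python
STOP_WORDS = {"a", "an", "the", "of"}  # hypothetical stop list

DOCS = {
    1: "vitamin c tablets",
    2: "vitamin a supplements",
}


def tokens(text, remove_stop_words):
    """Lowercase, split on whitespace, optionally drop stop words."""
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return set(words)


def retrieve(query, remove_stop_words, k=1):
    """First stage: top-k documents by number of tokens shared with the query."""
    q = tokens(query, remove_stop_words)
    scored = [(len(q & tokens(text, remove_stop_words)), doc_id)
              for doc_id, text in DOCS.items()]
    matching = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Stable sort: tied documents keep index order, which says nothing about relevance.
    matching.sort(key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in matching[:k]]


def rerank(query, candidates):
    """Second stage: exact-phrase matches first (a stand-in for a real reranker)."""
    return sorted(candidates, key=lambda d: query.lower() not in DOCS[d])


kept = rerank("vitamin a", retrieve("vitamin a", remove_stop_words=False))
removed = rerank("vitamin a", retrieve("vitamin a", remove_stop_words=True))

print(kept)     # candidates when stop words are kept
print(removed)  # candidates when stop words are removed
```

When “a” is kept, the first stage prefers document 2 and hands it to the reranker. When “a” is removed, both documents tie on “vitamin”, the cutoff happens to keep document 1, and the reranker never sees the document the user was looking for.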
Better solutions for dealing with words you want to get rid of
Imagine a query like “cheap notebook”. Chances are, “cheap” is not a feature of the notebooks in your index. Removing “cheap” is one way to deal with such queries. A more sophisticated way of handling this query would be to translate it to match a certain price range, e.g. any price up to $500.
This is generally called Query Rewriting and can be done with Querqy’s common rules rewriter for all Lucene-based search engines. This approach lets you control how you want to handle specific queries by applying domain knowledge and guiding the user in the right direction.
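The idea can be illustrated with a few lines of plain Python. This is only a sketch of the concept and not Querqy's rule syntax or API: the rule table, the `price` field name, and the output structure are all assumptions made for the example.

```python
# Hypothetical rewrite rules: trigger word -> structured filter.
# The field name "price" and the 500 threshold are assumptions for illustration.
REWRITE_RULES = {
    "cheap": {"field": "price", "lte": 500},
}


def rewrite(query):
    """Split a query into remaining keywords and structured filters."""
    keywords, filters = [], []
    for token in query.lower().split():
        rule = REWRITE_RULES.get(token)
        if rule is None:
            keywords.append(token)   # ordinary search term, keep it
        else:
            filters.append(rule)     # trigger word, replace it with a filter
    return {"keywords": " ".join(keywords), "filters": filters}


rewritten = rewrite("cheap notebook")
print(rewritten)
```

Unlike a global stop list, a rule like this only fires for the words you have deliberately chosen, and it turns the user's intent into a constraint instead of discarding it.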
Loss of accuracy
Some stop words carry real meaning in specific industries or domains and can be important for accurate search results, e.g. “like” in social media search applications.
Stop words can help to disambiguate the search query, allowing the user to get more accurate results: Try to find “movies without nicolas cage” (yes, there are a few!) or “notebook without dvd drive” with stop words removed. Results quickly become irrelevant when stop words are removed from these queries.
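A small sketch shows why. It assumes a hypothetical, fairly aggressive stop list that contains both “with” and “without”; whether a given engine's list includes them depends on its configuration.

```python
# Hypothetical, aggressive stop list (an assumption for this example).
STOP_WORDS = {"a", "the", "of", "with", "without"}


def strip_stop_words(query):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


with_drive = strip_stop_words("notebook with dvd drive")
without_drive = strip_stop_words("notebook without dvd drive")

print(with_drive)
print(without_drive)
print(with_drive == without_drive)  # the two opposite intents are now indistinguishable
```

Once the stop words are gone, two queries with opposite intent produce the same tokens, so the engine cannot return different results for them.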
IDF already does the job
Traditional retrieval engines use the inverse document frequency (IDF) in their scoring formulas to reduce the influence of stop words. BM25 is the default in Lucene-based search engines. In its scoring formula, the IDF discounts words that are very frequent within an index. The more documents contain a specific word, the lower its discriminatory power, so even without removing stop words they are already treated as insignificant when it comes to ranking results.
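To get a feel for the size of that discount, here is a minimal sketch of the IDF component in the form commonly used with BM25, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the term. The index size and document frequencies below are made-up numbers for illustration.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF component of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


N = 1_000_000                       # hypothetical index size
idf_common = bm25_idf(N, 950_000)   # a word found in 95% of documents, e.g. "the"
idf_rare = bm25_idf(N, 1_000)       # a word found in 0.1% of documents

print(round(idf_common, 2))  # roughly 0.05
print(round(idf_rare, 2))    # roughly 6.91
```

In this toy setting the rare term weighs more than a hundred times as much as the very frequent one, which is the sense in which IDF already treats stop words as near-insignificant for ranking.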
Conclusion – should you remove stop words?
In conclusion, removing stop words often proves to be an inefficient relevance tuning option, and other techniques like query rewriting can be more appropriate for specific queries. Do these 10 reasons mean that you should never remove stop words? No: while this post focuses on situations where removing stop words hurts, there are times when it helps. It is important to know the impact and consequences of each strategy.
Contact us if you need our help with identifying the right strategy for tuning your search.
Image from Octagon Vectors by Vecteezy