5 Reasons why you should remove stop words (at indexing time)

January 30, 2023 David Fisher
Category: Relevancy

We have seen 10 reasons why you shouldn’t remove stopwords. Here are 5 reasons why you should remove stop words at indexing time (of course, indexing is only half the battle so perhaps that’s why we could only come up with half the reasons!).

Index size

While storage feels almost infinite, and cheap to boot, it does still cost money to store stopwords. Let’s consider a simple index in which each posting consumes 8 bytes for the document ID and 4 bytes for the position. An uncompressed inverted list for a stopword that occurs once in each of 1 billion documents would be 12 billion bytes, or roughly 11 GiB of storage space. There are perhaps a few dozen words in English that would fit that profile. Now, we know that we can compress our inverted lists using a variety of techniques. Even with a 90% reduction, we’re still looking at more than a gigabyte of data per word. Since speed is of the essence, we’ll likely use SSDs as our primary storage, if possible, to maximize performance. Alas, they are a bit pricier than spinning rust.
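To make the arithmetic above easy to rerun with your own numbers, here is a minimal sketch in Python. The function name and the `compression_ratio` parameter are illustrative assumptions, not part of any search engine's API, and the model deliberately ignores real-world details such as multiple positions per document.

```python
# Back-of-the-envelope posting list size, using the figures from the text.
DOC_ID_BYTES = 8     # bytes per document ID (from the example above)
POSITION_BYTES = 4   # bytes per position (from the example above)


def posting_list_bytes(num_postings, compression_ratio=0.0):
    """Estimate the size of one inverted list in bytes.

    compression_ratio is the fraction of space saved (0.9 means 90% smaller).
    This is a hypothetical helper for illustration only.
    """
    raw = num_postings * (DOC_ID_BYTES + POSITION_BYTES)
    return raw * (1.0 - compression_ratio)


raw_bytes = posting_list_bytes(1_000_000_000)
compressed_bytes = posting_list_bytes(1_000_000_000, compression_ratio=0.90)

print(f"uncompressed: {raw_bytes / 2**30:.1f} GiB")
print(f"90% compressed: {compressed_bytes / 2**30:.2f} GiB")
```

Multiply the result by the few dozen words that fit the profile to get a rough total for the stopword portion of the index.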

Processing time

Perhaps we can shoulder the burden of paying for all of that extra storage. There is a second consideration: it takes a goodly bit of time to scan through an inverted list with 1+ billion entries, even with the variety of optimizations that are available to speed up the task. We could use impact-sorted inverted lists or various safe or unsafe query optimizations, but in the end, the cost of adding the score afforded by ‘a’ appearing in a document may outweigh the benefit to the user experience of retrieving that edge-case document that actually needed to have ‘a’ in a phrase.

Stopwords are a blunt instrument for relevance improvements

While presented as a reason not to remove them, blunt instruments have value, especially in the early stages of building a search system. Understanding your collection includes understanding which terms in your documents do not convey any content. Adding a word to a stoplist is a quick and dirty way to test such an assertion.
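As a rough illustration of how cheap such an experiment can be, the sketch below builds a toy positional inverted index in Python with and without a stoplist. The `STOPLIST` contents, the whitespace tokenizer, and the function names are all assumptions made for the example; a production analyzer would be considerably more involved.

```python
from collections import defaultdict

# Hypothetical stoplist for the experiment; tune it to your own collection.
STOPLIST = frozenset({"a", "an", "the", "of", "to"})


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def build_index(docs, stoplist=frozenset()):
    """Build term -> [(doc_id, position), ...], skipping stoplisted terms.

    Positions are counted before filtering, so the surviving terms keep
    their original offsets within the document.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            if term in stoplist:
                continue  # dropped at indexing time: no posting is stored
            index[term].append((doc_id, position))
    return index


def count_postings(index):
    """Total number of postings across all terms."""
    return sum(len(postings) for postings in index.values())


docs = {
    1: "the cost of a posting list",
    2: "a guide to the index",
}

full_index = build_index(docs)
filtered_index = build_index(docs, STOPLIST)

print(count_postings(full_index), "postings without a stoplist")
print(count_postings(filtered_index), "postings with the stoplist")
```

Comparing the two indexes, and the results they return for a sample of real queries, is one quick way to check whether a candidate term carries any content.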

Higher precision

While removing stop words might reduce recall in some circumstances, such as when you simply must know who said, “2b or ! 2b, that is the equation”, and can’t just ask ChatGPT, taking non-content-bearing terms out of the index can improve the precision of numerous queries across your use cases. Terms that match nearly every document add little signal and can let marginal documents creep into the results.

Exclude the “seven dirty words”

There are words in every language that you do not want to be indexed. The seven dirty words you can’t say on television, first listed by comedian George Carlin, are the most prominent examples. This is especially true when relying on user-generated content, e.g. in autosuggest features based on previous user queries.
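For the autosuggest case, one simple approach is to filter logged queries against a blocklist before they ever become suggestion candidates. The sketch below is a minimal Python illustration; the placeholder `BLOCKLIST` entries and the function names are hypothetical, and exact token matching like this will miss misspellings and obfuscated variants.

```python
# Placeholder entries; a real deployment would load a maintained word list.
BLOCKLIST = frozenset({"badword", "worseword"})


def is_suggestable(query, blocklist=BLOCKLIST):
    """Return True if no whitespace-separated term of the query is blocklisted."""
    terms = query.lower().split()
    return not any(term in blocklist for term in terms)


def suggestion_candidates(logged_queries, blocklist=BLOCKLIST):
    """Keep only the logged user queries that are safe to offer as suggestions."""
    return [q for q in logged_queries if is_suggestable(q, blocklist)]


logged = ["cheap flights", "BadWord pictures", "search relevance"]
print(suggestion_candidates(logged))
```

The same set-membership check can be applied in the indexing pipeline itself if the goal is to keep such terms out of the index entirely.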

Conclusion – why you should remove stop words

In conclusion, removing stop words at indexing time can shrink your index, speed up query processing, improve precision, and keep unwanted terms out of your results. Whether or not those benefits materialize depends on the nature of your specific search activities and user information needs. The ultimate answer, as with almost all of search relevance, is “it depends”!