Falsehoods Programmers Believe About Search

As much as anyone, I’m a fan of resurrecting old trends and memes and pretending they’re still cool. In that vein, dear friends, I’ve exhumed the venerable “Falsehoods Programmers Believe” party from 4 years ago to bring you one about, no less, Search.

Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn’t mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy. I will leave you with that brief introduction and, of course, the list:

  • Search engines work like databases
  • Search can be considered an additional feature just like any other
  • Search can be added as a well performing feature to your existing product quickly
  • Search can be added as a well performing feature to your existing product with reasonable effort
  • Choosing the correct search engine is easy and you will always be happy with your decision
  • Once set up, search will work the same way forever
  • Once set up, search will work the same way for a while
  • Once set up, search will work the same way for the next week
  • The default search engine settings will deliver a good search experience
  • Customers know what they are looking for
  • Customers who know what they are looking for will search for it in the way you expect
  • Customers who don’t know what they are looking for will search accordingly
  • A customer using the same query twice expects the same results for both searches
  • Customers only search for a few terms
  • Customers only search for fewer than some set number of terms
  • Customers never copy and paste a whole document into a search bar
  • Customers balance quotes and parentheses
  • Customers who don’t balance quotes or parentheses don’t expect phrasing or grouping
  • You can pass the customer query directly into your search engine
  • You can write a query parser that will always parse the query successfully
  • You will never have to return a query parse error to the customer
  • When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon
  • Customers notice their own misspellings
  • Customers don’t expect your search to correct misspellings
  • It is possible to create a list of all misspellings
  • It is possible to create an algorithm to handle all misspellings
  • A misspelled word is never the same as another correctly spelled word
  • All customers expect spelling correction to work the same
  • All customers want their misspellings corrected
  • A search should always return results, no matter how absurd
  • If you don’t have any results to show, customers won’t mind
  • When the perfect results are shown to the customer, they will notice it
  • You don’t need to monitor search queries, results, and clicks
  • Customers won’t get nervous that you are logging their search activity
  • Search queries are not affected by GDPR
  • Looking at the data, it is always possible to tell whether a customer found what they were looking for
  • Customers will click on what they are looking for when they’ve found it
  • You can build a search that works like Google
  • You can build a search that works like Google sometimes
  • You should use Google as a target for your search
  • Customers don’t mind if your search doesn’t work like Google
  • Customers don’t expect your search to work like Google
  • Customers won’t compare you to Google
  • A bad search, no matter how minor or how rare, will never reflect poorly on your product
  • Since Google doesn’t use facets, customers don’t need them
  • Facet hit counts are always correct
  • Facets have no impact on performance
  • You can just cache queries to get performant facets
  • Personalized search is easy
  • Learning to rank is easy and just requires a plugin
  • You have enough data for learning-to-rank
  • Over time, you can curate enough data for learning-to-rank
  • You don’t need to spend lots of time formatting content for it to work well in your search engine
  • Text extraction engines will always produce text that doesn’t need to be post-processed
  • All your markup will be stripped as you expect it to be
  • Content is well formed
  • Content is mostly well formed
  • Content is predictably well formed
  • Content sourced from a database and templates is always formed the same way
  • Content teams treat search as their top priority
  • Manually changing content to improve search is easy
  • Improving content can be automated with reasonable effort
  • Queries for ‘C programming’ and ‘C++ programming’ will produce different results
  • Queries for ‘401k’ and ‘401(k)’ will produce the same results
  • Tokenization as it works out of the box is right for your content and queries
  • Tokenization can be changed to meet the needs of your entire corpus and all queries
  • Tokenization can be changed to meet the needs of most of your corpus and most queries
  • Tokenization can be conditional
  • You should roll your own tokenizer
  • You will never have a debate about tokenization
  • Using regular expressions for tokenization is a good idea
  • Regular expressions have minimal performance impact
  • You will never have a debate about regular expressions
  • You should remove stop words
  • You should not remove stop words
  • You know what the list of stop words should be
  • Stop words will never change
  • When you find the stop word ‘in’, you know it doesn’t mean Indiana
  • It’s easy to make certain things case sensitive
  • Case sensitivity is a good idea
  • Synonyms are easy
  • Synonyms will improve recall in the way you want
  • Synonyms have the same relevance in all documents
  • Synonyms for abbreviations and acronyms always work as you expect
  • Synonyms can be extracted from your corpus with natural language processing
  • Using Word2Vec for synonyms is a good idea
  • Stemming will solve your recall problems
  • Lemmatization will solve your recall problems
  • Lemmatization dictionaries are static
  • Languages don’t change
  • Natural language processing (NLP) tools work perfectly
  • Incorporating NLP into your analysis pipeline is straightforward
  • Search queries are complete sentences and can be accurately tagged with parts of speech
  • Showing a list of search suggestions is easy
  • Suggestions should just use the out-of-the-box search engine suggestions
  • Suggestions should incorporate customer query logs
  • Customers would never type anything offensive into your search bar
  • Customers would never try to hack you through your search bar
  • Customers don’t need highlighting to find what they’ve searched for
  • Default highlighters are good enough for all your content and queries
  • Making a custom highlighter isn’t too difficult. It’s just matching strings, right?
  • Making a custom highlighter that is better than the default version will take less than a year
  • Turning on caching will solve your performance issues
  • Customers don’t expect near real time updates
  • A 30-second commit time is short enough for everyone
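
Several of the tokenization items above are easy to reproduce. Here is a minimal Python sketch, assuming a deliberately naive analyzer that lowercases text and keeps only runs of letters and digits. The function name `naive_tokenize` is made up for illustration, and the regular expression stands in for a simple default analyzer rather than any particular engine's behavior (the list already warns about regex tokenizers).

```python
import re


def naive_tokenize(text):
    """Hypothetical analyzer: lowercase, then keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


# The '+' characters are dropped, so both queries collapse to the same tokens.
c_tokens = naive_tokenize("C programming")
cpp_tokens = naive_tokenize("C++ programming")

# The parentheses act as separators, so these two queries come out different.
k_plain = naive_tokenize("401k")
k_parens = naive_tokenize("401(k)")

print(c_tokens == cpp_tokens)  # True
print(k_plain, k_parens)       # ['401k'] ['401', 'k']
```

Under this assumed analyzer, ‘C programming’ and ‘C++ programming’ become the same query, while ‘401k’ and ‘401(k)’ do not, which is the opposite of what most customers would want.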

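The items about quotes and parentheses lend themselves to a small sketch too. The helpers below, `is_balanced` and `sanitize`, are hypothetical names, not part of any search library, and stripping the syntax characters is only one possible fallback, assumed here for illustration.

```python
def is_balanced(query):
    """Return True if double quotes pair up and parentheses nest correctly.

    Parentheses inside a quoted phrase are treated as literal text.
    """
    depth = 0
    in_quotes = False
    for ch in query:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False  # a ')' with no matching '('
    return depth == 0 and not in_quotes


def sanitize(query):
    """Assumed fallback: if the query is unbalanced, drop the syntax characters."""
    if is_balanced(query):
        return query
    stripped = query.replace('"', " ").replace("(", " ").replace(")", " ")
    return " ".join(stripped.split())


print(is_balanced('"new york" (pizza OR pasta)'))  # True
print(sanitize('"new york pizza'))                 # new york pizza
```

Note that this fallback quietly throws away the phrase or grouping the customer may still have expected, so it avoids a parse error without really solving the problem.
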
Keen to avoid believing falsehoods about search? Let us help!