I’m as big a fan as anyone of resurrecting trends and memes and pretending it’s cool. In that vein, dear friends, I’ve exhumed the venerable “Falsehoods Programmers Believe” party from 4 years ago to bring you one about, no less, Search.
Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn’t mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy. I will leave you with that brief introduction and, of course, the list:
- Search engines work like databases
 - Search can be considered an additional feature just like any other
 - Search can be added as a well performing feature to your existing product quickly
 - Search can be added as a well performing feature to your existing product with reasonable effort
 - Choosing the correct search engine is easy and you will always be happy with your decision
 - Once set up, search will work the same way forever
 - Once set up, search will work the same way for a while
 - Once set up, search will work the same way for the next week
 - The default search engine settings will deliver a good search experience
 - Customers know what they are looking for
 - Customers who know what they are looking for will search for it in the way you expect
 - Customers who don’t know what they are looking for will search accordingly
 - A customer using the same query twice expects the same results for both searches
 - Customers only search for a few terms
 - Customers only search for fewer than some set number of terms
 - Customers never copy and paste a whole document into a search bar
 - Customers balance quotes and parentheses
 - Customers that don’t balance quotes or parentheses don’t expect phrasing or grouping
 - You can pass the customer query directly into your search engine
 - You can write a query parser that will always parse the query successfully
 - You will never have to return a query parse error to the customer
 - When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon
 - Customers notice their own misspellings
 - Customers don’t expect your search to correct misspellings
 - It is possible to create a list of all misspellings
 - It is possible to create an algorithm to handle all misspellings
 - A misspelled word is never the same as another correctly spelled word
 - All customers expect spelling correction to work the same
 - All customers want their misspellings corrected
 - A search should always return results, no matter how absurd
 - If you don’t have any results to show, customers won’t mind
 - When the perfect results are shown to the customer, they will notice it
 - You don’t need to monitor search queries, results, and clicks
 - Customers won’t get nervous that you are logging their search activity
 - Search queries are not affected by GDPR
 - Looking at the data, it is always possible to tell whether a customer found what they were looking for
 - Customers will click on what they are looking for when they’ve found it
 - You can build a search that works like Google
 - You can build a search that works like Google sometimes
 - You should use Google as a target for your search
 - Customers don’t mind if your search doesn’t work like Google
 - Customers don’t expect your search to work like Google
 - Customers won’t compare you to Google
 - A bad search, no matter how minor or how rare, will never reflect poorly on your product
 - Since Google doesn’t use facets, customers don’t need them
 - Facet hit counts are always correct
 - Facets have no impact on performance
 - You can just cache queries to get performant facets
 - Personalized search is easy
 - Learning to rank is easy and just requires a plugin
 - You have enough data for learning-to-rank
 - Over time, you can curate enough data for learning-to-rank
 - You don’t need to spend lots of time formatting content for it to work well in your search engine
 - Text extraction engines will always produce text that doesn’t need to be post-processed
 - All your markup will be stripped as you expect it to be
 - Content is well formed
 - Content is mostly well formed
 - Content is predictably well formed
 - Content, sourced from a database and templates, is formed the same way
 - Content teams treat search as their top priority
 - Manually changing content to improve search is easy
 - Improving content can be automated with reasonable effort
 - Queries for ‘C programming’ and ‘C++ programming’ will produce different results
 - Queries for ‘401k’ and ‘401(k)’ will produce the same results
 - Tokenization as it works out of the box is right for your content and queries
 - Tokenization can be changed to meet the needs of your entire corpus and all queries
 - Tokenization can be changed to meet the needs of most of your corpus and most queries
 - Tokenization can be conditional
 - You should roll your own tokenizer
 - You will never have a debate about tokenization
 - Using Regular Expressions for tokenization is a good idea
 - Regular Expressions have minimal performance impact
 - You will never have a debate about regular expressions
 - You should remove stop words
 - You should not remove stop words
 - You know what the list of stop words should be
 - Stop words will never change
 - When you find the stop word ‘in’, you know it doesn’t mean Indiana
 - It’s easy to make certain things case sensitive
 - Case sensitivity is a good idea
 - Synonyms are easy
 - Synonyms will improve recall in the way you want
 - Synonyms have the same relevance in all documents
 - Synonyms for Abbreviations and Acronyms always work as you expect
 - Synonyms can be extracted from your corpus with natural language processing
 - Using Word2Vec for synonyms is a good idea
 - Stemming will solve your recall problems
 - Lemmatization will solve your recall problems
 - Lemmatization dictionaries are static
 - Languages don’t change
 - Natural language processing (NLP) tools work perfectly
 - Incorporating NLP into your analysis pipeline is straightforward
 - Search queries are complete sentences and can be accurately tagged with parts of speech
 - Showing a list of search suggestions is easy
 - Suggestions should just use the out-of-the-box search engine suggestions
 - Suggestions should incorporate customer query logs
 - Customers would never type anything offensive into your search bar
 - Customers would never try to hack you through your search bar
 - Customers don’t need highlighting to find what they’ve searched for
 - Default highlighters are good enough for all your content and queries
 - Making a custom highlighter isn’t too difficult. It’s just matching strings, right?
 - Making a custom highlighter that is better than the default version will take less than a year
 - Turning on caching will solve your performance issues
 - Customers don’t expect near-real-time updates
 - A 30-second commit time is short enough for everyone
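
To make a couple of the tokenization items concrete, here is a minimal Python sketch. It assumes a deliberately naive tokenizer that lowercases the text and keeps only runs of ASCII letters and digits. That is an illustration, not a claim about how any particular search engine's default analyzer behaves, and `naive_tokenize` is a made-up name for this example.

```python
import re


def naive_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase, then keep runs of ASCII letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


# The '+' characters are discarded, so both queries yield identical tokens.
print(naive_tokenize("C programming"))    # ['c', 'programming']
print(naive_tokenize("C++ programming"))  # ['c', 'programming']

# The parentheses split the second query into two tokens, so the queries differ.
print(naive_tokenize("401k"))    # ['401k']
print(naive_tokenize("401(k)"))  # ['401', 'k']
```

Under this assumed tokenizer, ‘C programming’ and ‘C++ programming’ collapse into the same query, while ‘401k’ and ‘401(k)’ do not, which is the opposite of what most customers would expect on both counts.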