
A Search Hackday at Haystack US 2023

Last Thursday I was lucky enough to share a small room with some of the cleverest people I know working in search today, at our Search Hackday, part of Haystack US 2023.

We had OSC alumni Bertrand Rigaldies, Max Irwin and Doug Turnbull (who also co-wrote the book Relevant Search); Erik Hatcher (long-term Lucene/Solr committer and once Lucidworks’ Principal Architect); Lucene/Solr committer and independent search expert Gus Heck; Chris Morley from the current OSC team, fresh from his Haystack talk the day before; AWS OpenSearch developer advocate David Tippett; Shopify’s Elasticsearch expert Chris Fournier; Sean Mullane, another Charlottesville-based search expert with a background in healthcare search; and Michael Froh, who also works on AWS OpenSearch (if I’ve forgotten you and you were also there, do let me know – people did come and go during the day).

I remember thinking ‘what an amazing search team this would be if they all worked on the same well-funded project!’ Perhaps I should have gone out and found some friendly VCs? However, on the day we worked on several projects, which I’ve tried to document below from some hasty notes and the Slack channel we set up to share our achievements (the only reward was sandwiches and coffee). Any omissions or mistakes are entirely my fault, of course, and not all the work was recorded.

Mighty Max for E-commerce

Max Irwin built “Mighty Instacommerce”, a fast vector search demo across 250k e-commerce products. The stack is a lightweight Node.js app, the Mighty Inference Server generating vectors with a custom fine-tuned e-commerce model, and the Qdrant vector search engine. He created a test instance, so if you’re interested in trying it do reach out to him via the Mighty website. On the way home he even added a search-as-you-type feature.
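
I didn’t see the code itself, but the query path is easy to sketch. Here’s a minimal Python version of the same pattern (the real demo is a Node.js app in front of Mighty; the model, collection name and payload field below are stand-ins rather than anything from Max’s project):

# Minimal sketch of the query path: embed the query text, then run a vector search in Qdrant.
# The model, collection name and payload field are stand-ins, not taken from the demo.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the fine-tuned e-commerce model
qdrant = QdrantClient(url="http://localhost:6333")

def product_search(query: str, limit: int = 10):
    vector = model.encode(query).tolist()
    hits = qdrant.search(collection_name="products", query_vector=vector, limit=limit)
    return [(hit.score, hit.payload.get("title")) for hit in hits]

print(product_search("red running shoes"))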

Cursing recursion with Lucene

Chris Fournier worked on a Lucene bug that can cause an exception when synonyms are used – apparently this can crash quite a number of Elasticsearch instances at once in spectacular fashion. He came up with a pull request that fixes the problem by limiting recursion depth, which Erik Hatcher and Doug Turnbull reviewed on the day. My old Flax colleague Alan Woodward, another Lucene committer who wasn’t at the event, also chipped in, and the PR looks well on the way to being accepted into the next Lucene release.
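
I don’t have the details of the patch to hand, but the general shape of that kind of guard is easy to illustrate – the Python below shows the generic pattern only, not the actual Lucene change:

# Generic illustration of capping recursion depth so that a pathological (e.g. cyclic)
# synonym mapping fails cleanly instead of overflowing the stack. Not the Lucene patch itself.
MAX_DEPTH = 128  # arbitrary cap chosen for the sketch

def expand(term: str, synonyms: dict[str, list[str]], depth: int = 0) -> set[str]:
    if depth > MAX_DEPTH:
        raise ValueError(f"synonym expansion exceeded depth {MAX_DEPTH} for {term!r}")
    expanded = {term}
    for alt in synonyms.get(term, []):
        expanded |= expand(alt, synonyms, depth + 1)
    return expanded

try:
    expand("cpu", {"cpu": ["processor"], "processor": ["cpu"]})
except ValueError as err:
    print(err)  # the cycle fails with a clear error instead of a stack overflow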

Bashing a big bitset

Michael Froh decided to look at Lucene’s MultiTermQuery (MTQ) family to see what could be done to optimize the case where an MTQ matches a whole bunch of terms, each matching a whole bunch of documents, and adds them all into a big bitset during the rewrite phase. He was especially thinking of the case where the field has doc values, where you might be better off letting other parts of the query drive the scorer and simply checking whether each candidate document’s doc value matches the automaton.

After learning that the Lucene optimization he wanted to apply had already been implemented a few months earlier (by one of his old Amazon Product Search teammates), he decided to see if he could add it to OpenSearch (and prove that it speeds things up). This reminds us what a huge and complex project Apache Lucene is – it’s very easy to be working in one part and not notice others working somewhere else…

So he wrote a little benchmark that does the following:

  1. Index 10 million documents of the form {"id": "12345", "number": 5}, where the id field is distinct for each document.
  2. Benchmark a query of the form id:[* TO "1000000"] number:0 (run 10 times as a warm-up with results discarded, then 1,000 measured runs, reporting the average of the measured runs).
  3. Run that with and without DocValuesRewriteMethod applied to the term range query on the id field.

With the default term-based range query, the average query time was 168ms. With DocValuesRewriteMethod, it dropped to 86ms. Hooray!
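
For anyone who doesn’t read Lucene query syntax, that benchmark query is just two optional (SHOULD) clauses: a term range on the id field and an exact match on number. A rough OpenSearch-level equivalent, purely for illustration (the actual benchmark was written directly against Lucene, and the index name here is made up), would look like this:

# Rough OpenSearch-level equivalent of the benchmark query, for illustration only.
# Assumes a local cluster and an index named "numbers" with a keyword "id" field
# and an integer "number" field - a hypothetical setup, not the hackday code.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "bool": {
        "should": [
            {"range": {"id": {"lte": "1000000"}}},  # id:[* TO "1000000"], a lexicographic range
            {"term": {"number": 0}},                # number:0
        ]
    }
}

response = client.search(index="numbers", body={"query": query, "size": 10})
print(response["hits"]["total"])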

Taming ChatGPT for query expansion

Doug and Bertrand experimented with using Elasticsearch’s out-of-the-box More-Like-This (MLT) feature, setting the “like” MLT parameter to text provided by ChatGPT for the query at hand – effectively doing query expansion with ChatGPT. This could be useful when a user enters a long natural language query but there are no vectors in the search index (which contains large text fields); it’s worth remembering that for very large indexes, adding vector data could be expensive. They also identified a related paper, Query2doc: Query Expansion with Large Language Models.

Proposed solution

  1. Base ranker: as an additional clause in the should query (alongside classic BM25 term and phrase clauses), use Elasticsearch’s out-of-the-box More-Like-This (MLT) feature, setting the like MLT parameter to text provided by ChatGPT for the end-user’s query at hand (a rough sketch of the whole pipeline follows this list).
  2. Note: the various boost and parameter values across the Elasticsearch query were selected using a brute-force exploration with randomized values.
  3. Re-rank (external to the search engine) using cosine similarity between the query and the matches’ main text field.
  4. Note: the vector embeddings were generated from various generic sentence-BERT models.
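
Here’s what that pipeline might look like in Python. The index and field names (docs, raw_text) and get_expansion_text() are placeholders, and the sentence-BERT model is just one generic choice – this is a sketch of the approach rather than the actual hackday code:

# Sketch of the proposed pipeline: a bool query mixing a classic BM25 clause with an
# MLT clause seeded by LLM-generated text, then an external cosine-similarity re-rank.
# Index/field names ("docs", "raw_text") and get_expansion_text() are placeholders.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer, util

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")  # a generic sentence-BERT model

def get_expansion_text(query: str) -> str:
    """Placeholder for the ChatGPT call that returns expansion text for the query."""
    return "hypervisor, also known as a virtual machine monitor (VMM), is a software layer..."

def expanded_search(query: str, size: int = 100):
    bool_query = {
        "bool": {
            "should": [
                {"match": {"raw_text": {"query": query}}},  # classic BM25 clause
                {
                    "more_like_this": {                     # MLT seeded with the LLM text
                        "fields": ["raw_text"],
                        "like": get_expansion_text(query),
                        "min_term_freq": 1,
                        "min_doc_freq": 1,
                        "max_query_terms": 100,
                    }
                },
            ]
        }
    }
    hits = es.search(index="docs", query=bool_query, size=size)["hits"]["hits"]

    # Re-rank outside the engine: cosine similarity between the query and each match's text.
    q_vec = model.encode(query, convert_to_tensor=True)
    d_vecs = model.encode([h["_source"]["raw_text"] for h in hits], convert_to_tensor=True)
    scores = util.cos_sim(q_vec, d_vecs)[0]
    return sorted(zip(scores.tolist(), hits), key=lambda pair: pair[0], reverse=True)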

Example

End-user’s query: what is hypervisor
Generated Elasticsearch query:

{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "remaining_lines": {
              "slop": 36,
              "query": "what is hypervisor",
              "boost": 51.50271184475267
            }
          }
        },
        {
          "match_phrase": {
            "first_line": {
              "slop": 25,
              "query": "what is hypervisor",
              "boost": 79.6874768935059
            }
          }
        },
        {
          "query_string": {
            "query": " \"virtual machine\" manager OR  \"vmware\" software OR  \"red hat\" virtualization OR  hypervisor versus \"container\"  OR  \"hyper-v\" technology OR  \"xen\" hypervisor  OR  \"kvm\" virtualization OR  \"oracle\" virtualbox OR  type 1 \"hypervisor\" OR  \"vmware vs hyper-v\" comparison",
            "boost": 1.0,
            "fields": [
              "raw_text"
            ]
          }
        },
        {
          "query_string": {
            "query": " \"virtual machine\" manager OR  \"vmware\" software OR  \"red hat\" virtualization OR  hypervisor versus \"container\"  OR  \"hyper-v\" technology OR  \"xen\" hypervisor  OR  \"kvm\" virtualization OR  \"oracle\" virtualbox OR  type 1 \"hypervisor\" OR  \"vmware vs hyper-v\" comparison",
            "boost": 1.0,
            "fields": [
              "first_line"
            ]
          }
        },
        {
          "match": {
            "raw_text": {
              "query": "what is hypervisor",
              "boost": 54.340333413958
            }
          }
        },
        {
          "match": {
            "first_line": {
              "query": "what is hypervisor",
              "boost": 23.980411594706087
            }
          }
        },
        {
          "more_like_this": {
            "boost": 91.85675452219638,
            "fields": [
              "first_line"
            ],
            "like": " what is hypervisor? a beginner's guide to vmware's virtualization technology |  understanding the role of hypervisor in virtualization | 1. virtual machine monitor",
            "min_term_freq": 1,
            "min_word_length": 1,
            "min_doc_freq": 1,
            "max_query_terms": 100
          }
        },
        {
          "more_like_this": {
            "boost": 42.19607013804867,
            "fields": [
              "raw_text"
            ],
            "like": "Title: What is Hypervisor? A Beginner's Guide to VMWare's Virtualization Technology\n\nBody:\n\nVirtualization technology has revolutionized the way we use computers and data centers. It has made it possible to run multiple operating systems and applications on a single physical server, leading to increased efficiency, flexibility, and cost savings. At the heart of this technology is the hypervisor, a software layer that enables virtualization. In this article, we'll explain what hypervisor is and how it works.\n\nWhat is Hypervisor?\nHypervisor, also known as virtual machine monitor (VMM), is a software layer that allows multiple virtual machines (VMs) to run on a single physical server. VMs are self-contained and independent instances of an operating system and the applications that run on it. The hypervisor sits between the physical hardware and the VMs, managing the resources of the physical server and allocating them to the VMs.\n\nTypes of Hypervisors\nThere are two types of hypervisors: type 1 and type 2. Type 1 hypervisors run directly on the physical hardware, while type 2 hypervisors run on top of an existing operating system. [Note: this text is much longer but I've edited it here for readability]",
            "min_term_freq": 1,
            "min_word_length": 1,
            "min_doc_freq": 1,
            "max_query_terms": 100
          }
        }
      ]
    }
  }
}

They also tried another approach, asking ChatGPT just for queries rather than for longer expansion text.

One challenge they identified is that ChatGPT doesn’t do well with specific acronyms or with compounding/decompounding, since at the end of the day it works over tokenized text. This was the most interesting project of the day for me, however, as it shows how the power of LLMs can be used to improve search quality without having to rely on anything they generate being entirely accurate or truthful.

MongoDB Atlas Search pretending to be Solr

Erik’s Hackday project implemented a Solr-like interface to MongoDB Atlas Search – which could be a quick way to work with tools like Quepid that are Solr-compatible.

A request like https://redacted.data.mongodb-api.com/app/data-zzzz/endpoint/solr/movies/select?q=search&rows=10&start=0&fl=id,title&debug=true hits an HTTPS endpoint defined at /solr/movies/select, which maps to an Atlas (JavaScript) Function in the cloud. The Function extracts the parameters and maps them into a MongoDB aggregation pipeline using the $search stage, returning a result like the one shown below. Currently the parameters handled are q, fl, debug, start and rows (a rough sketch of this mapping in code follows the generated pipeline below).

{"responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "search",
      "rows": "10",
      "start": "0",
      "fl": "id,title",
      "debug": "true"
    }
  },
  "response": {
    "numFound": 13,
    "start": 0,
    "numFoundExact": true,
    "docs": [
      {
        "title": "The Search",
        "id": "573a1393f29313caabcde024"
      },
      {
        "title": "The Search",
        "id": "573a13d8f29313caabda5622"
      },
      {
        "title": "Search and Destroy",
        "id": "573a139af29313caabcef748"
      },...

The generated aggregation pipeline for that request is:

[
      {
        "$search": {
          "index": "default",
          "queryString": {
            "query": "search",
            "defaultPath": "title"
          },
          "count": {
            "type": "total"
          },
          "highlight": {
            "path": "title"
          },
          "scoreDetails": true
        }
      },
      {
        "$project": {
          "id": "$_id",
          "_id": 0,
          "title": 1
        }
      },
      {
        "$skip": 0
      },
      {
        "$facet": {
          "docs": [
            {
              "$limit": 10
            }
          ],
          "meta": [
            {
              "$replaceWith": "$$SEARCH_META"
            },
            {
              "$limit": 1
            }
          ]
        }
      }
    ]
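
The Atlas Function itself is JavaScript, but the parameter-to-pipeline mapping is simple enough to sketch in Python. The version below is a rough equivalent rather than Erik’s code – it assumes pymongo, the sample_mflix movies collection and an Atlas Search index named default, and handles only q, fl, start and rows:

# Rough sketch of the Solr-parameter-to-aggregation-pipeline mapping described above.
# The real implementation is an Atlas (JavaScript) Function; this version assumes the
# sample_mflix "movies" collection and an Atlas Search index named "default".
# Note that the $search stage only runs on MongoDB Atlas with a search index configured.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")  # placeholder URI
movies = client["sample_mflix"]["movies"]

def solr_like_select(q: str, fl: str = "id,title", start: int = 0, rows: int = 10):
    fields = [f.strip() for f in fl.split(",")]
    project = {"_id": 0}
    if "id" in fields:
        project["id"] = "$_id"  # Solr-style "id" maps onto Mongo's _id
    for field in fields:
        if field != "id":
            project[field] = 1
    pipeline = [
        {"$search": {
            "index": "default",
            "queryString": {"query": q, "defaultPath": "title"},
            "count": {"type": "total"},
        }},
        {"$project": project},
        {"$skip": start},
        {"$facet": {
            "docs": [{"$limit": rows}],
            "meta": [{"$replaceWith": "$$SEARCH_META"}, {"$limit": 1}],
        }},
    ]
    return list(movies.aggregate(pipeline))

print(solr_like_select("search"))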

It’s not all about search

Chris Morley decided to build something entirely different and not about search at all! He built a simple game, with a few simple ‘players’ that go after ‘treasures’, using HTML5 Canvas, Node and TypeScript.

Conclusion

This wasn’t a traditional Hackday – we didn’t work well into the evening or overnight, and most projects were individual rather than shared – but it was great to spend time together. Conversations between bouts of hacking ranged from driving your own race car, to tales of search conferences past, to industry rumours. Thanks to everyone who joined us – I’m looking forward to the next time!

Our next Haystack event is a smaller On Tour event in Spain, but we expect to return with Haystack Europe in September or October (venues and dates to be confirmed – maybe we’ll have another Search Hackday too).

Join our events mailing list if you want early notice of our plans – and happy hacking!