
Berlin Buzzwords 2023: AI injects new energy into the search for solutions

Last week the OSC EU team returned to Berlin for one of our favourite events on the calendar, Berlin Buzzwords. This year’s event was busier, happier and definitely focused on search with AI – so what did we learn about what’s hot, what’s not and what’s coming next?

Like last year I chose to take the train to Berlin from the UK – I’m personally trying not to fly when I can find an alternative. It’s slower overall, of course, but I can work much more comfortably than on a plane, security is less onerous, and the train delivers you directly to the heart of the city. I’d encourage everyone who travels for business to consider the environmental impact – and if you want tips about travel in Europe I’d recommend The Man in Seat 61.

Berlin Buzzwords is a three-day event focused on “storing, processing, streaming and searching large amounts of digital data, with a focus on open source software projects.” – so as you can imagine it’s a good place to meet those at the cutting edge of search. After some online events and a quieter in-person conference last year, it felt like a return to the old days – but with many new things to discuss.

Day 1 – Dinner with search luminaries

Although I didn’t make the Barcamp on Sunday afternoon, that night was the speakers’ dinner, where I got to sit with ex-colleague and Lucene committer Alan Woodward, Nick Burch (long-time Apache member and Barcamp host), Alessandro Benedetti and some other Sease folks, and to catch up with a few others including Anshum Gupta, Uwe Schindler and Jo Kristian Bergum, looking smarter than any of the rest of us in a jacket. Looking around, I felt there was a lot of search talent in the room; this year search was very much in the ascendant at Buzzwords, possibly because of all the buzz around vectors and AI.

Day 2 – How Open is AI, petabyte scale and vectorizing your search engine

Up bright and early the next day (and it was bright – this was a hot and sweaty week in Berlin; note to self: take shorts and sandals next time) for the keynote, a very good talk on the Open in Open AI (not OpenAI!) from Jennifer Ding – worth watching. OSC colleague Atita gave her talk on vectorizing your open source search engine at breakneck speed, cramming lots of great content into the 20-minute timeslot – unlike some presenters who struggled to get into any depth with that time limitation. I also caught (by mistake – I thought I was in another room – but gladly) a talk by Suman Karumuri, who has built a lot of large-scale search for the likes of Airbnb and Slack, on petabyte-scale search using Kafka and Lucene that is apparently 10x cheaper to run than OpenSearch/Elasticsearch. Although this is targeted at log and metrics applications at the moment, it would make a good solution for large-scale e-commerce, though it needs some work in areas such as updates. Jo Kristian Bergum talked about generating synthetic data with LLMs, which reminded me of a recent OSC blog post from Scott Stults.

Lunch was excellent and tasty as ever at Buzzwords, with great vegan and vegetarian options for those who need them, and I caught up with David Tippett and more of the OpenSearch folks and chatted about September’s OpenSearchCon. Also in evidence were a large crew from Adelean, who hosted our Paris Haystack on Tour, Piotr Stachowicz from edrone, who helped us run Haystack on Tour in Poland, and lots of Weaviate and Qdrant folks gently competing over who could give away the most swag. Full marks to whoever thought of putting a big freezer full of ice cream in the corner of the sponsors’ lounge, especially given the weather!

After lunch I watched Matt Williams from Cookpad give an excellent talk on scaling their search team – a great example! Next was a not-very-deep talk from Bloomberg on how they’d un-customized their Solr setup; I was hoping for more detail, especially on their large-scale stored search implementation. Nick Burch was next on how to run an LLM on your laptop, with some excellent and amusing advice on what is now available as open source. I then took a break to prepare for the panel I was running, but caught the end of Lars Albertsson talking about AI and plane crashes – like me he had taken the train to Berlin, all the way from Norway. I guess if you work on aircraft safety you might prefer not to go up in them too often!

On to hosting the Great Search Engine Debate – check out the video here. This was the third time I’ve run a panel like this, trying to answer the question ‘which search engine should I choose for my application?’. It’s a somewhat ridiculous question of course, as the answer depends so much on your situation, but it’s a great chance to find out more about some of the options available. I was lucky to have five fantastic panellists from Vespa, Solr, Elasticsearch, Qdrant and Weaviate, and we kept the debate friendly and informative – I particularly enjoyed the responses to my question “If you can’t choose your own search engine, what would be your next choice?”.

Search with AI at the Great Search Engine Debate. Berlin, Germany, Buzzwords 2023, 19.06.2023. Photo: Jan Michalko, CC license.

We then enjoyed some free drinks sponsored by Hyperspace and spent some time networking and chatting. A little later I joined a party for dinner, including people from Vespa, OpenSearch and Sematext, at the excellent Trio restaurant, where we had some great chats about the world of search and search consulting.

Berlin, Germany, Buzzwords 2023, 19.06.2023. Photo: Jan Michalko, CC license.

Day 3 – Java support for vectors, MLOps and Solr V2 APIs

The next day started a little later (most of the heavy socialising at Buzzwords happens on Monday night, and the organisers know this!). I started with Uwe Schindler’s talk on what’s coming next in Lucene – essential viewing for me, with some great material on Project Panama and native vector support in Lucene. I stayed in the room for Mercari’s talk on MLOps, which implied they’re not using Elasticsearch LTR but a separate ML setup. Across the yard, the next talk was Jason Gerlowski on the Solr V2 APIs – we were happy to help with their development by hosting Meetups for the Solr Contributor Bootcamp.

After a client lunch down the road (involving a dash through a sudden rainstorm), the next talk I attended was Torsten Bøgh Köster and Denis Berger on their Solr performance work… and then I rather ran out of talks I wanted to attend, so spent some time catching up on email and chatting outside.

I also had a chat with Paul from Plain Schwarz, the amazing team who run Buzzwords, about helping us with Haystack EU again this coming autumn – all went well and we should have dates and a ticket shop up within the next few weeks, so keep your eyes peeled!

The conference seemed to have passed at great speed – suddenly it was over, although the OSC team and quite a few others would stay in Berlin for the Mix Camp E-commerce Search (MICES) the next day.

Thoughts and conclusions

Apart from attending the talks above, I caught some general themes at the event which I’ve tried to gather here along with my thoughts:

  • What’s ‘open’ in this new world of AI, vectors and LLMs is becoming harder to understand. Your search engine can be open source, but the model used for re-ranking may not be – and what about the training data it used: how open is that, and did people even consent to it being used? There are both ethical and business-risk issues here.
  • In just a few short years we’ve added lots more technology choices for building search applications – and all of these have different approaches to AI features:
    • Vector-native engines like Weaviate & Qdrant may not have the long history, large community and full feature set of older engines but are innovating at high speed and growing fast (powered by massive injections of VC funding)
    • Lucene-based engines are catching up, especially as Java adds vector-related improvements – this week Lucene gained some significant upgrades – and they have the advantage of large communities, many examples, a full array of standard search features and a large install base
    • Vespa continues to impress with its massive toolset, including native vector support, and the team’s commitment to educating us all on AI topics is admirable
    • Commercial engines (which weren’t really represented at Buzzwords) are all adding AI features, although of course if they’re not open source the efficacy of these is hard to measure (I’ve been unimpressed by some of the demos I’ve seen)
  • Even once you’ve chosen an engine there are significant challenges around choosing machine learning models, training them and getting them into production – these can have a significant impact on team size, costs and stability.
  • AI techniques can be applied at lots of different points in the search stack – from rating documents, to extracting & cleaning data, to summarizing documents. It’s not just about vector-powered ranking.
  • Scaling is still a challenge, particularly for the largest companies who often have to develop custom search architectures. We’re now in the petabyte range!
  • Finally, I must say it’s always a privilege to spend time with the search community and to find out what amazing things people are building.
  • Thanks to all the speakers, my panel participants and the organising team for a great event!

We’ll be back at Berlin Buzzwords next year – and in the meantime do contact us if you’d like to discuss how the topics above affect your business – we’re always glad to help.