A reality check for LLMs: access control and other lessons from enterprise search

December 21, 2023 Charlie Hull
Category: AI

There’s a lot of people writing a lot of things on platforms such as LinkedIn around Large Language Models – LLMs – and how they will shortly be able to cope with pretty much any task we throw at them, once we sort out the fiddly bits like scaling, licensing and cost. Sadly, it’s a little more complicated than that, at least in a commercial context. Here’s a few reasons why LLMs on their own are unlikely to solve some well-known problems such as access control – and some ways we might fix this as part of implementing Generative AI.

Enterprise search is a well-trodden field with many challenges (for a starting point you can read the book I co-authored with Professor Udo Kruschwitz, or the classic Enterprise Search by Martin White). Unlike web search, where website authors actively work to publish, enhance and tag their content so it can be found, enterprise content is sometimes actively hidden by the structure or practices of the organisation. As I wrote over a decade ago it’s “characterised by rarity and unconnectedness rather than popularity and context”. People don’t always work to share nicely. If you analyze the queries used in enterprise search systems, you’ll find a very long tail of queries that have literally never been used before (in contrast to the more popular queries, usually for things like ‘vacation policy’ or ‘timesheet system’ rather than any specific information need). This doesn’t help when you’re trying to figure out what users are likely to look for.

“But look, you found the notice, didn’t you?”
“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.”
Douglas Adams, The Hitchhiker’s Guide to the Galaxy

The miasma of access control

One key characteristic of enterprise search (and also some website search, where access can be limited by subscription level) is that content may be protected by multiple levels of access control. The point of this is not just to stop some users reading particular content (a junior employee shouldn’t be able to find out executive salaries), but also to stop them knowing it even exists. Take the example of a planned office relocation. According to a report written for the Board, the company is moving to smaller premises in a different city, because the founder wants to be closer to their family. Not everyone is going to fit in these offices, so downsizing might be expected. Documents about the purchase of the new premises are therefore sensitive, and so is the mere fact that they exist.

How this access control is implemented usually depends on the underlying content management system: it could be SharePoint, Documentum, Google Docs or even just Unix file permissions. More often it is lots of different systems that have been adopted over the years, updated (we hope consistently) by HR as people join and leave the organisation, are promoted or move to a different business unit. Access control for content providers such as publishers can be fine-grained down to the paragraph level, with users only allowed to read summaries or parts of certain items. You can probably imagine quite how complicated this can get.

In a search engine, we have to use this access control information either at index time (we record who can access content in the index as a special field we can filter on) or query time (we try to access a result on behalf of the user before we show it to them) or more commonly both. One side effect of building an enterprise search engine is that it can reveal the holes in your corporate security model – which is why testing is vital before you launch it.
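To make the two checks concrete, here is a minimal sketch in Python of a search function that applies both. Everything in it is an assumption made for illustration: the Doc class, the allowed_groups field, the live_acl dictionary standing in for the source system, and the sample users and documents. A real search engine would apply the index-time filter inside the query itself rather than in a loop.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text: str
    # Index-time ACL: the groups recorded next to the document when it was indexed.
    allowed_groups: set = field(default_factory=set)


def source_system_allows(user, doc_id, live_acl):
    """Query-time check against a stand-in for the source system (hypothetical)."""
    return user in live_acl.get(doc_id, set())


def search(query, user, user_groups, index, live_acl):
    """Return ids of matching documents that pass both access checks."""
    results = []
    for doc in index:
        if query.lower() not in doc.text.lower():
            continue
        # Index time: filter on the ACL field stored in the index.
        if not (doc.allowed_groups & user_groups):
            continue
        # Query time: confirm access on behalf of the user before showing the hit.
        if not source_system_allows(user, doc.doc_id, live_acl):
            continue
        results.append(doc.doc_id)
    return results


# Illustrative data only.
index = [
    Doc("handbook", "Vacation policy and office hours", {"staff", "board"}),
    Doc("relocation", "Board report: office relocation plan", {"board"}),
]
live_acl = {"handbook": {"alice", "bob"}, "relocation": {"alice"}}

staff_hits = search("office", "bob", {"staff"}, index, live_acl)
board_hits = search("office", "alice", {"board"}, index, live_acl)
# carol is still in the 'board' group in the index, but the source system has
# already revoked her access, so the query-time check removes every hit.
stale_hits = search("office", "carol", {"board"}, index, live_acl)
```

The third query shows why using both checks is common: the index can lag behind changes made in the source system, and the query-time check catches that gap at the cost of an extra lookup per result.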

Data quality and GIGO

Another issue is data quality, or rather the lack of it. Most of us will be familiar with the challenges of document versioning and will have seen titles such as:

FINAL v3.5 engineering specification 4/March/2018 - CH reviewed, AP signed off.doc.

Is this really the final, final version? Why does the date in the title not agree with the file modification date? Why are there 17 other versions of this document, some PDF copies and some just called "engineering specification template 2017 USE THIS ONE.doc"? Who is “AP” and do they still even work here?

Even the best search engine, AI-powered or not, cannot work well if the source data it is provided with is garbage. Data quality is a true Achilles heel of any information system.

Access all areas with an LLM

Let’s imagine we’re trying to build an AI-powered system to access our corporate information, or the content we’re publishing. We’re probably not going to send all this private information to OpenAI, but we might use an LLM provided by a trusted partner like Microsoft, or even train and host an LLM internally. Our model now encompasses the corporate knowledge it is trained on – but it literally has no concept of security or versioning (or the lack of it). We can ask questions of our PDFs, but we may not like the answers or even be allowed to see them. Hallucination is rife and trust in the system plummets. Someone even tries some prompt injection attacks and manages to get those executive salaries in full, or a copy of a whole book they’re not paying the right subscription for.

RAG Rides to the Rescue

One potential solution to the access control problem is Retrieval Augmented Generation, or RAG. Simply put, this uses a search engine to retrieve a set of documents and then asks the LLM to base its answers on this set. The search engine can be traditional and lexical, use a vector database or even a hybrid of the two, but importantly it can use the access control mechanisms discussed above to restrict the set of documents the LLM can draw on. So we can ask a question: "What are the plans to move premises?" and, for a user without access to the relevant documents, the LLM won’t ‘know’ that these have even been discussed. It’s still difficult, and fiddly, and may reveal holes in the underlying security model, but it’s a better approach than exposing an LLM that knows everything but not who is allowed to read it.
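The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Passage class, the crude word-overlap scoring, the prompt wording and the sample corpus are all invented here, and the call to the LLM itself is left out. The point it shows is that access control is applied before ranking, so restricted text never reaches the prompt.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    allowed_groups: frozenset


def words(text):
    """Lowercased word set, used for a deliberately crude lexical match."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, user_groups, corpus, k=3):
    """Return up to k passages the user may read, ranked by word overlap."""
    terms = words(question)
    scored = []
    for passage in corpus:
        # Access control first: passages the user cannot read are never ranked.
        if not (passage.allowed_groups & user_groups):
            continue
        overlap = len(terms & words(passage.text))
        if overlap:
            scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def build_prompt(question, passages):
    """Assemble the text that would be sent to the LLM (the call is omitted)."""
    if passages:
        context = "\n".join(f"[{p.doc_id}] {p.text}" for p in passages)
    else:
        context = "(no documents available)"
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Illustrative corpus only.
corpus = [
    Passage("board-report",
            "The board approved plans to move premises to a smaller office.",
            frozenset({"board"})),
    Passage("facilities-faq",
            "Desk booking opens every Monday for the current premises.",
            frozenset({"staff", "board"})),
]

question = "What are the plans to move premises?"
staff_passages = retrieve(question, {"staff"}, corpus)
board_passages = retrieve(question, {"board"}, corpus)
staff_prompt = build_prompt(question, staff_passages)
```

For the staff user, the prompt contains only the facilities FAQ, so the model has nothing about the relocation to draw on. This only holds if the model has not also been trained or fine-tuned on the restricted documents.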

Data quality is a more difficult challenge, but we can use similar techniques to those used in search projects for many years: assess what data exists using an audit process, have rock-solid authoring standards, make content producers accountable for meeting these standards and also consider not including certain repositories in your index if they are irrelevant or out of date. Perhaps no-one will ever look for that engineering specification from 2018, so simply don’t index it – ‘less is more’ applies here.
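As a rough illustration of the last point, the decision about what to leave out of the index can be written down as an explicit rule and run as part of an audit. The sketch below is hypothetical throughout: the Record fields, the repository names, the cutoff date and the sample paths are assumptions, and a real audit would weigh more than repository and modification date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    path: str
    repository: str
    modified: date


def select_for_indexing(records, excluded_repositories, oldest_allowed):
    """Split records into paths to index and paths to skip."""
    keep, skip = [], []
    for record in records:
        stale = record.modified < oldest_allowed
        excluded = record.repository in excluded_repositories
        if stale or excluded:
            skip.append(record.path)
        else:
            keep.append(record.path)
    return keep, skip


# Illustrative records only.
records = [
    Record("specs/engineering specification 2018.doc", "engineering", date(2018, 3, 4)),
    Record("hr/vacation policy.doc", "hr", date(2023, 6, 1)),
    Record("archive/old intranet/home.html", "legacy-intranet", date(2023, 1, 10)),
]

keep, skip = select_for_indexing(
    records,
    excluded_repositories={"legacy-intranet"},
    oldest_allowed=date(2021, 1, 1),
)
```

Keeping the skipped list as well as the kept list gives content owners something to review, which supports the accountability mentioned above.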

Conclusion

AI-powered techniques are hugely exciting and are certainly revolutionizing the world of search and information access. However, they can’t be used out of the box in all contexts, as this increases business risk. Luckily there are techniques from established best practice that we can learn from.

Get in touch today to explore how LLMs and other AI techniques can supercharge your enterprise or website search
