
More Query Understanding: Brand Detection with LLMs

Introduction

What are your users looking for on your search platform? How often are they searching with a specific pattern, e.g. using queries with a brand or category name? Can we build brand detection systems with LLMs?

Understanding what your users are searching for is key to actively supporting their ‘search journeys’. Building on the previous overview of how language models can help you understand your users through query understanding, this blog post shows how you can leverage large language models (LLMs) to detect brand names in queries. Using an e-commerce dataset, we evaluate how well this works, how it compares to a naive way of identifying brands in queries, and what you can do with this knowledge.

Fine-Tuning a Transformer Model for Brand Detection

Detecting brands is a type of named entity recognition (NER) problem, and in the past approaches like Conditional Random Fields (CRFs) were applied to this challenge. There are successful examples of applying CRFs to product titles to extract entities that are then used for content enrichment. Product titles, with their short length, are similar to queries, so applying these traditional approaches in this context is absolutely possible.

In this post we want to explore the power of LLMs and worry less about what features may be important for our use case and how many features we would need. More specifically, we look at fine-tuning an off-the-shelf model to serve our needs, as pretrained models may

  1. not do exactly what we want them to do, e.g. detect persons, organizations or locations rather than brands.
  2. know nothing about our data, users and domain.

By combining a pretrained transformer model with our data we can close these gaps.

In the context of this blog post we use the pre-trained transformer model distilbert-base-uncased as our base large language model.

The Dataset

To fine-tune a pre-trained LLM we need labeled data. We use the Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. It contains more than 2.6 million judgments (query-product pairs with ratings) and basic product information, including the brand of each product. For the fine-tuning dataset we only keep the query and the brand of the product that was rated for that query; all other product and query metadata, as well as the rating, is disregarded. For simplicity, we assume in the following preprocessing steps that a query contains a brand when the brand string can be found in the query string. This opens the door to inaccuracy: we might end up with two identical queries where one occurrence is labeled as a brand query and the other is not. Fixing this is an improvement we consider necessary when moving from an exploratory phase to an optimization phase.
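As an illustration, the raw query-brand pairs might be assembled like this, assuming the parquet layout of the publicly available esci-data repository (file and column names are assumptions; check the repository for the actual layout):

```python
import pandas as pd

# Judgments (query-product pairs) and basic product metadata.
examples = pd.read_parquet("shopping_queries_dataset_examples.parquet")
products = pd.read_parquet("shopping_queries_dataset_products.parquet")

# Attach the brand of the rated product to each query.
pairs = examples.merge(
    products[["product_id", "product_locale", "product_brand"]],
    on=["product_id", "product_locale"],
)

# Keep only the query and the brand; drop everything else.
data = pairs[["query", "product_brand"]].dropna().drop_duplicates()
```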

Following the BIO-encoding format we can transform the dataset into a train and test dataset where each entry consists of three parts: the query, the tags of the query and the labels for these tags.

Data Preprocessing

Our starting point looks like this, taking the query string and the brand of the associated product:

Query                         | Brand
$25 apple gift card not email | Apple
#do not disturb, jeidah bila  | Bila
revent 80 cfm                 | Panasonic

Afterwards we process the data: we lowercase query and brand as a simple normalization step and then tag the queries with the following scheme:

  • B-brand: the first token of a brand; for a single-token brand this is its only tag.
  • I-brand: any subsequent token of a multi-token brand.
  • O: the token is not part of a brand.

We apply a tokenization step to divide the queries into tokens. Each pretrained language model comes with a tokenizer which we can use for this step.

Input query: 

calvin klein t-shirt

Resulting tokens:

[ '[CLS]', 'calvin', 'klein', 't', '-', 'shirt', '[SEP]' ]

Notice the special tokens [CLS] and [SEP] that mark the beginning and end of a sentence. For our use case they carry no meaning; language models in general rely on them for tasks like sentence classification. Also note that tokenization may split words into sub-word tokens, e.g. 'diapers' is split into 'dia' and '##pers'.
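As a sketch, the tokenization above can be reproduced with the tokenizer that ships with our base model:

```python
from transformers import AutoTokenizer

# Load the tokenizer that belongs to distilbert-base-uncased.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoding = tokenizer("calvin klein t-shirt")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'calvin', 'klein', 't', '-', 'shirt', '[SEP]']

# word_ids() maps every token back to its source word; None marks the
# special tokens. We rely on this mapping for label alignment below.
print(encoding.word_ids())
```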

Knowing that Calvin Klein is a brand lets us specify the following tags for the tokens of the above input query:

[O, B-brand, I-brand, O, O, O, O]

Fine-tuning expects label IDs that are numeric rather than “human-readable”. To prevent the special tokens [CLS] and [SEP] from being treated as regular tokens, we assign them the label ID -100, which causes them to be ignored in the actual training step of the fine-tuning process.

Label IDs:

  • B-brand: 1
  • I-brand: 2
  • O: 0
  • [CLS]/[SEP]: -100

For the tags of the input query above, this results in the following list of label IDs:

[-100, 1, 2, 0, 0, 0, -100]
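A minimal alignment function along these lines might look as follows (the helper name and per-word tag input are our own convention; sub-word pieces simply inherit their word's tag here, while an alternative recipe masks all but the first piece with -100):

```python
label2id = {"O": 0, "B-brand": 1, "I-brand": 2}

def align_labels(query_words, word_tags, tokenizer):
    """Tokenize a pre-split query and produce one label ID per token.

    query_words: e.g. ["calvin", "klein", "t", "-", "shirt"]
    word_tags:   e.g. ["B-brand", "I-brand", "O", "O", "O"]
    """
    encoding = tokenizer(query_words, is_split_into_words=True)
    labels = []
    for word_id in encoding.word_ids():
        if word_id is None:
            labels.append(-100)  # [CLS]/[SEP]: ignored by the loss
        else:
            labels.append(label2id[word_tags[word_id]])
    return encoding, labels
```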

Fine-Tuning

Applying the preprocessing steps to the Shopping Queries Dataset and additionally filtering to only products with locale ‘us’ results in ~1.7 million query-tag-label triplets: ~200k queries with brands and ~1.5 million queries without brands.

Fine-tuning on the whole dataset would most likely result in overfitting. Additionally, we applied a very simple technique to label the data: for example, brands that are not part of this dataset's product metadata are ignored, which may lead the model to learn that they are not brands.

Randomly sampling data for fine-tuning did not work well for us, as there are too few brand queries in the resulting sample. Instead, we randomly sampled 100k queries with brands and 50k queries without brands and merged the two samples into one fine-tuning dataset. We further split this dataset 80/20 into a training and an evaluation set.
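A sketch of this sampling step, assuming the pairs live in a pandas DataFrame with a boolean has_brand column derived from the substring heuristic described above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# data: DataFrame with columns "query", "product_brand", "has_brand"
brand_queries = data[data["has_brand"]].sample(n=100_000, random_state=42)
other_queries = data[~data["has_brand"]].sample(n=50_000, random_state=42)

# Merge the two samples and shuffle.
sample = pd.concat([brand_queries, other_queries]).sample(frac=1, random_state=42)

# 80/20 split into a training and an evaluation set.
train_df, eval_df = train_test_split(sample, test_size=0.2, random_state=42)
```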

For this blog post we chose the distilbert-base-uncased transformer model and used PyTorch for the fine-tuning process, with the hyperparameters listed in the Hugging Face documentation.
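A condensed sketch of that setup (the train_dataset and eval_dataset objects are assumed to hold the tokenized, label-aligned triplets from the preprocessing steps; hyperparameters follow the Hugging Face token classification tutorial):

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

id2label = {0: "O", 1: "B-brand", 2: "I-brand"}
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3, id2label=id2label, label2id=label2id,
)

args = TrainingArguments(
    output_dir="brand-detector",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized + label-aligned
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
    compute_metrics=compute_metrics,  # see the Evaluation section
)
trainer.train()
```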

Evaluation

Fine-tuning with the Hugging Face Trainer class allows you to calculate evaluation metrics along the way by passing a function into the training process. We report the overall metrics provided by the seqeval framework, together with the training and validation loss reported by the fine-tuning process itself:

Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy
1     | 0.055700      | 0.055556        | 0.865970  | 0.951763 | 0.906842 | 0.958838
2     | 0.039000      | 0.047234        | 0.903423  | 0.958055 | 0.929937 | 0.966421
10    | 0.013200      | 0.068385        | 0.927528  | 0.963747 | 0.945291 | 0.972867

The table shows that recall, F1 and accuracy are above 90% from the first epoch on; precision crosses that mark from the second epoch. Note also that the validation loss rises again by epoch 10 while the training loss keeps falling, a hint of overfitting as training continues.
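For reference, the metrics function passed into the training process might look like this, a sketch using seqeval through the Hugging Face evaluate library (the label list mirrors our tag set):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_list = ["O", "B-brand", "I-brand"]

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Drop ignored positions (label ID -100) before scoring.
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions,
                              references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```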

Comparison with “naive” Approach

Arguably, we chose a simple way to generate fine-tuning data for this task by identifying queries that directly contain the brands of products for which we had judgments. Given the amount of training data, one might also argue that the model does nothing more than memorize everything we used for fine-tuning. Memorizing all brands is an approach that could also be implemented naively with string matching operations.

To further evaluate the usefulness of the LLM-based approach, we generated a set of queries distinct from the fine-tuning dataset and compared brand detection with the fine-tuned transformer model against a naive string matching approach.

Evaluation Query Set Description

We created an evaluation set of 102 randomly sampled queries from the original dataset and manually labeled each query with the correct brand if one occurred in it. The sample contains all sorts of queries:

  • Queries with brands (e.g. “reebok nano”)
  • Queries without brands (e.g. “rocketbook”)
  • Queries with typos (e.g. “tddler batman sandslls boy”)
  • Queries with brands that do not occur in the product data (e.g. “honda atc parts”)
  • “Nonsensical” queries in the context of online retail (e.g. “tiny homes for sale”)
  • Specific queries (e.g. “iphone x new unlocked 256gb”)
  • Broad queries (e.g. “hairdryer”)

This means the queries cover a wide range of intents found in an e-commerce context. Within the 102 queries, 26 contain a brand and 76 do not. None of these queries were part of the training dataset, so the fine-tuned language model gains no unfair advantage from having seen a query during fine-tuning.

The “naive” Approach

Here is how the “naive” approach works: we extract all brands that occur in the ESCI dataset. For every query in the evaluation dataset we split the query into words on whitespace, then reconstruct the query token by token and look for matches in the brand set; that way we also account for multi-term brands.

If any of the reconstructed query parts matches an entry in the brand set, we assume the query contains the matched brand.
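A sketch of this matching logic (function and variable names are our own):

```python
def detect_brand_naive(query: str, known_brands: set) -> str | None:
    """Return the first known brand found in the query, growing the
    candidate span token by token to catch multi-term brands."""
    tokens = query.lower().split()
    for start in range(len(tokens)):
        candidate = ""
        for end in range(start, len(tokens)):
            candidate = (candidate + " " + tokens[end]).strip()
            if candidate in known_brands:
                return candidate
    return None

# known_brands would be built from all brands in the ESCI product data, e.g.:
# known_brands = {b.lower() for b in products["product_brand"].dropna()}
```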

Results & Metrics

We run the naive approach on each query in the evaluation dataset, let the fine-tuned transformer model detect brands in the same queries, and then measure the accuracy of each approach.

Approach               | Accuracy
Fine-tuned Transformer | 82.4%
Naive Approach         | 40.2%

The fine-tuned transformer reaches an accuracy of 82.4%, while the naive approach reaches 40.2%, making the fine-tuned transformer clearly the superior approach. The naive approach often fails for brand names that also have a lexical meaning, i.e. homonyms.

Examples are the brands “not”, “without”, “gnome”, “digital”, “neon” and “touch”. All of these are homonymous with words that occur in queries with a different meaning, and these false detections drastically reduce the accuracy of the naive approach.

Looking at the errors made by the LLM, we observe a different picture: the LLM detects valid brands in queries where there are none according to the labels, among them “rocketbook”, “honda” and “armoray”. A possible explanation is that these brands are not, or were not, part of the assortment and the corresponding queries are thus not labeled as containing a brand. Diving into these individual cases is especially interesting for figuring out systematic shortcomings of the different approaches or of the training/evaluation dataset in order to improve and refine them.
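To inspect individual queries like this yourself, the fine-tuned checkpoint can be wrapped in a token classification pipeline (a sketch; the checkpoint path is an assumption):

```python
from transformers import pipeline

# "simple" aggregation merges B-brand/I-brand pieces into whole spans.
brand_detector = pipeline(
    "token-classification",
    model="brand-detector",  # path to the fine-tuned checkpoint
    aggregation_strategy="simple",
)

print(brand_detector("reebok nano"))
# e.g. [{'entity_group': 'brand', 'word': 'reebok', 'score': 0.99, ...}]
```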

Brand Detection in Historical Behavioral Data

So how can we apply our new brand detection system? The following sections outline how to use brand detection on queries offline and online: offline means applying brand detection to past queries; online means applying it to live queries as part of a query pipeline to better guide users to what they are looking for.

Search Result Quality Assessment

Knowing how different classes of queries perform is always a first step in improving them. Detecting brands in past queries lets you compare the search result quality and linked KPIs of queries with brands to those of queries without brands.

Trend Detection

What’s a ‘hot’ brand today and how is this different to the past? Knowing about brand trends obtained from past search queries can be useful for:

  • Inventory Management
  • Marketing and Promotions
  • Product Sourcing and Partnerships
  • User Experience Optimization
  • Market Analysis and Forecasting

Gap Analysis

Are your users searching for products by brands that you do not have in your assortment? You can use this information for:

  • Product Expansion
  • Supplier Relations
  • Market Analysis
  • Customer Engagement
  • Competitive Analysis

Assortment Value Analysis

What are the valuable parts of your assortment? Knowing more about brands in queries can help us improve:

  • Supplier Relations and Negotiations
  • Inventory Management
  • Marketing and Promotions
  • Customer Engagement and Feedback

Applications of Brand Detection on Incoming Queries

Using brand detection mechanisms on past queries is an offline application. You can also use brand detection live on incoming query traffic to improve how you handle each query.

When you are very confident that “X” detected in a user query is a brand, you can use it as a filter to narrow down the result set presented to the user. When you are less confident, you can instead apply boosts that increase precision without limiting recall (which could eliminate relevant documents from the result set).
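As an illustration, assuming an Elasticsearch-style query body, the detection confidence could steer whether the brand becomes a hard filter or a soft boost (the threshold and field names are assumptions):

```python
def build_query(user_query: str, brand: str | None, confidence: float) -> dict:
    """Turn a detected brand into a filter (high confidence) or a
    boost (lower confidence) in an Elasticsearch-style bool query."""
    body = {"bool": {"must": [{"match": {"title": user_query}}]}}
    if brand is None:
        return body
    if confidence >= 0.9:
        # High confidence: restrict the result set to the brand.
        body["bool"]["filter"] = [{"term": {"brand": brand}}]
    else:
        # Lower confidence: prefer the brand without removing other results.
        body["bool"]["should"] = [
            {"term": {"brand": {"value": brand, "boost": 2.0}}}
        ]
    return body
```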

More sophisticated approaches include rules that depend on the detected intents. In e-commerce (B2C and B2B) we often see the requirement to treat some brands differently than others, for example due to contractual agreements with these brands or higher margins for their products. Another option is boosting specific products or product groups associated with a certain brand. For example, the German brand Miele produces vacuum cleaners, washing machines and dishwashers (among other things). Knowing that certain products, e.g. vacuum cleaners, sell better or carry higher margins lets you encode this knowledge as rules for the detected brand Miele in user queries and thus boost results including vacuum cleaners.

Conclusion

Large language models provide a way to do named entity recognition for detecting brands in queries. With a simple approach to gathering labeled fine-tuning data, we have shown that good results can be achieved without the cumbersome feature selection and feature engineering required by traditional approaches such as CRFs. This makes fine-tuning transformer models a low-threshold approach for query understanding tasks that you can use to your benefit either offline or online.

Do you need help understanding your users through their queries? Get in touch today! 
