Solr JSON API Tutorial

May 16, 2019 Elizabeth Haubert

Edismax is the query parser-of-choice for many Solr applications. The default behaviors are correct for a wide range of use cases. The syntax has become familiar. While edismax has its bugs, the ‘happy path’ is pretty stable. For many teams, there is no compelling reason to look beyond the edismax query parser.

But what happens when we want to give the search engine more guidance? In this article, we’ll look at one case for taking more control over the query structure, and at how the Solr JSON API lets us do that.

The Scenario

We’ve blogged before about term-centric and field-centric queries, so here is just a quick review. Edismax constructs a Maximum Disjunction query, which the Lucene documentation defines as:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries. [1]
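The scoring rule in that definition can be written out in a few lines. This is a toy sketch of the arithmetic only, not Solr code: dismax_score is a made-up helper, the per-field scores are invented, and tie stands in for the tie-breaker multiplier.

```python
def dismax_score(subquery_scores, tie=0.0):
    """Score one document the way a maximum disjunction does.

    subquery_scores: scores from each subquery that matched the document.
    tie: tie-breaker multiplier (0.0 = pure max, 1.0 = plain sum).
    """
    if not subquery_scores:
        return 0.0
    best = max(subquery_scores)
    others = sum(subquery_scores) - best
    return best + tie * others


# Invented per-field scores for the term "breakfast" in one document,
# e.g. from title_en, cast_en and genres_en.
scores = [2.0, 0.5, 0.25]
print(dismax_score(scores))           # tie of 0.0: only the best field counts
print(dismax_score(scores, tie=0.1))  # other matching fields add a small increment
```

With a tie of 0 the best-matching field wins outright; raising it toward 1 moves the score toward a plain sum across fields.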

Consider a query:

q=breakfast comedy&mm=100%

So long as our fields are relatively simple and have similar analysis patterns, edismax will usually produce a term-centric interpretation. In this case, the disjunction will look something like:

+(((cast_en:breakfast | title_en:breakfast | genres_en:breakfast) (cast_en:comedi | title_en:comedi | genres_en:comedi))~2)

With this interpretation we successfully find comedies in the TMDB database like “Breakfast at Tiffany’s” or “The Breakfast Club”. But as the fields in the system grow more complex, chances are that some fields will analyze the query differently, and Solr won’t be able to line the terms up neatly. In that case we end up with a field-centric interpretation, where the disjunction is per field, and the parsed query can look something like:

+(((Synonym(title_en:breakfast title_en:first_meal_of_the_dai) title_en:comedi)~2) | genres_precleaned:breakfast comedy)    

There are good reasons why this might be the desired result. For this article, though, let’s assume a frequent use case where users issue unstructured queries and require all terms to be present, but those terms won’t all occur in the same field.
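To see why the two interpretations differ for this use case, here is a toy model that ignores scoring and analysis entirely. The document, the field names and both helper functions are illustrative assumptions; each field is reduced to a set of terms and mm=100% is treated as "all terms required".

```python
# Toy document: each field is reduced to a set of already-analyzed terms.
doc = {
    "title": {"the", "breakfast", "club"},
    "genres": {"comedy", "drama"},
}
terms = ["breakfast", "comedy"]
fields = ["title", "genres"]


def term_centric_match(doc, terms, fields):
    """Every term must appear in at least one field."""
    return all(any(t in doc[f] for f in fields) for t in terms)


def field_centric_match(doc, terms, fields):
    """At least one single field must contain every term."""
    return any(all(t in doc[f] for t in terms) for f in fields)


print(term_centric_match(doc, terms, fields))   # True
print(field_centric_match(doc, terms, fields))  # False
```

The document matches term-centrically because "breakfast" is in the title and "comedy" is in the genres, but no single field holds both terms, so the field-centric version drops it.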

The Bool query parser

Most edismax users are already familiar with the idea of issuing one blanket full-text query, and layering additional boost queries (bq) and filters (fq) against it. But there is a catch – boost queries won’t affect recall, and filter queries don’t affect scoring. What if we wanted to issue multiple baseline queries which affect both recall and scoring?

Enter the BoolQParser, which lets us construct a Lucene Boolean Query. The name can be a little misleading. The outstanding virtue of the BoolQParser is that it gives us four kinds of clause for our queries: must, should, filter, and must_not, so we can construct multiple required clauses. Filter and must_not are variations on the theme of fq; “should” clauses are similar to bq. “Must” clauses add something new. Like our original ‘q’ term in edismax, “must” clauses contribute to both scoring and recall. Unlike ‘q’, each clause is potentially its own query.

So we might manually write that ‘breakfast comedy’ query like:

http://localhost:8983/solr/tmdb/select?q={!bool%20must=title_en:breakfast%20must=genres:comedy}&fl=title%20genres%20score&rows=25&debugQuery=true

Even with this limited interpretation, the syntax gets pretty ugly, pretty quickly.
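One way to keep the escaping under control is to assemble the URL in code rather than by hand. The sketch below uses only the Python standard library. bool_query is a hypothetical helper, and it assumes each clause is a simple field:term pair with no spaces; anything more complex would need local-params quoting that is not handled here.

```python
from urllib.parse import quote, urlencode


def bool_query(must=(), should=()):
    """Assemble a {!bool ...} local-params query from simple field:term clauses."""
    parts = [f"must={clause}" for clause in must]
    parts += [f"should={clause}" for clause in should]
    return "{!bool " + " ".join(parts) + "}"


params = {
    "q": bool_query(must=["title_en:breakfast", "genres:comedy"]),
    "fl": "title genres score",
    "rows": 25,
    "debugQuery": "true",
}

# quote_via=quote encodes spaces as %20 instead of "+".
url = "http://localhost:8983/solr/tmdb/select?" + urlencode(params, quote_via=quote)
print(url)
```

This only hides the ugliness; the query itself is still a single string of local params.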

The JSON Request DSL

In 7.1, Solr introduced the JSON Request API. At the most basic level, this just reformats the familiar Solr parameters into a JSON block and passes that in the request body. A few of the parameters are renamed (for example, q becomes query and fl becomes fields), so check the reference manual if you are translating queries directly. Here is a classic request, followed by its JSON counterpart:

http://localhost:8983/solr/tmdb/select?debugQuery=on&defType=edismax&fl=title&q=breakfast%20comedy&qf=title%20genres&mm=100%25

curl "http://localhost:8983/solr/tmdb/query" -d '{
  "query": {
    "edismax": {
      "query": "breakfast comedy",
      "qf": "title genres",
      "mm": "100%"
    }
  },
  "fields": "title",
  "params": { "debugQuery": "on" }
}'
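When translating existing requests, the renaming can be done mechanically. The following is a rough sketch rather than a complete translator: to_json_request is a hypothetical helper, the mapping covers only the common top-level parameters, and everything else is passed through unchanged in the "params" block.

```python
import json

# Classic parameter name -> JSON Request API key.
RENAMED = {
    "q": "query",
    "fq": "filter",
    "fl": "fields",
    "rows": "limit",
    "start": "offset",
    "sort": "sort",
}


def to_json_request(params):
    """Translate a dict of classic Solr parameters into a JSON request body."""
    body, passthrough = {}, {}
    for name, value in params.items():
        if name in RENAMED:
            body[RENAMED[name]] = value
        else:
            passthrough[name] = value
    if passthrough:
        # Parameters with no dedicated JSON key travel in the "params" block.
        body["params"] = passthrough
    return body


classic = {
    "q": "breakfast comedy",
    "defType": "edismax",
    "qf": "title genres",
    "mm": "100%",
    "fl": "title",
    "debugQuery": "on",
}
print(json.dumps(to_json_request(classic), indent=2))
```

This keeps the query as a plain string with defType in "params", which is the most literal translation; nesting the parser name as a JSON key is the alternative form.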

By itself, this isn’t a drastic change, particularly for teams who push most of the request into the request handler configuration. The benefit shows up with more complex queries: we can now cleanly specify a term-centric query, with a separate clause for each term:

curl "http://localhost:8983/solr/tmdb/query" -d '{
  "query": {
    "bool": {
      "must": [
        { "edismax": { "qf": "title genres", "query": "breakfast" } },
        { "edismax": { "qf": "title genres", "query": "comedy" } }
      ]
    }
  }
}'
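In an application, a request body of this shape would normally be generated from the user's input rather than typed by hand. A minimal sketch, assuming naive whitespace tokenization (so quoted phrases and multi-word synonyms are not handled); term_centric_body is a hypothetical helper name.

```python
import json


def term_centric_body(user_query, qf):
    """Build a bool query with one required edismax clause per term."""
    must = [
        {"edismax": {"qf": qf, "query": term}}
        for term in user_query.split()
    ]
    return {"query": {"bool": {"must": must}}}


body = term_centric_body("breakfast comedy", "title genres")
print(json.dumps(body, indent=2))
```

Because every term gets its own edismax clause, each one can match in whichever field it likes, and all of them are still required.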

And if we’d like to boost documents that also match a phrase (this one or some other), we add a “should” clause:

curl "http://localhost:8983/solr/tmdb/query" -d '{
  "query": {
    "bool": {
      "must": [
        { "edismax": { "qf": "title genres", "query": "breakfast" } },
        { "edismax": { "qf": "title genres", "query": "comedy" } }
      ],
      "should": [
        { "complexphrase": { "query": "breakfast club", "inOrder": "true" } }
      ]
    }
  }
}'

What’s next?

If you would like to discuss how your search application can evolve and grow, please get in touch. We’re also always on the hunt for collaborators or for war stories from real production systems. So give it a go and send us feedback!

References: