Using Solr’s New Atomic Updates

Scott Stults, March 14, 2013

A while ago we created a sample index of US patent grants, roughly 700k documents in all. Alongside that, we pulled down the corresponding multi-page TIFFs of those grants and made PNG thumbnails of each page. So far, so good.

You see, we wanted to give our UI the ability to flip through those thumbnails and we wanted it to be fast. So our original design had a client-side function that pulled down the first thumbnail and then tried to pull down subsequent thumbnails until it ran out of pages or cache. That was great for a while, but it didn’t scale because a good portion of our requests were for non-existent resources.

Things would be much better if the UI got the page count along with the other details of the search hits. So why not update each record in Solr with that?

A quick caveat: When we’re prototyping we store all of the fields in our schema. That gives us flexibility on what we display and doesn’t hurt performance too much. Once the dust around the design settles a bit we then look at schema optimization.

So we’re running Solr 4.x and we’ve got all of our fields stored. Those are the two requirements for doing atomic updates! Enough with the history, on with the code…

Each bundle of TIFFs comes with an index and a count of the number of pages. That looks like this:

0014317 USP2009w20     2
0015257 USP2009w20     1
0028699 USP2009w43     3
0032450 USP2009w20     2
0032451 USP2009w20     2
0056575 USP2009w12     3
0066524 USP2009w12     3
0072873 USP2009w19     4
0087053 USP2009w19     4
0087923 USP2009w19     4

The first column is a seven-digit grant number, the second column is a DAT catalog number (which we ignore), and the third is the number of pages.

Next up is a shell script that reads in those lines and makes one update request to Solr per line:

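#!/bin/bash
# Read "grant catalog pages" lines from stdin; send one atomic update per line.
while read -r line
do
  # Unquoted expansion splits the line on whitespace into an array.
  declare -a arr=( $line )
  # Prefix a zero to build the eight-digit ID; update='set' touches only page_count_int.
  curl --url "http://localhost:8983/solr/us_patent_grant/update/?commit=false" \
    -H "Content-Type: text/xml" -d "<add><doc><field name='id'>0${arr[0]}</field><field name='page_count_int' update='set'>${arr[2]//[[:space:]]/}</field></doc></add>"
done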

The only weird parts of this script are the use of arrays, the funky whitespace trimming I had to do on the last element, and the zero-padding I had to do because our grant IDs are eight digits while the grant numbers in the file are seven. To run it, you just feed the page-count file to the script on standard input:

./update_solr.sh < 2009_cumulative.csv
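
To see what the array splitting, trimming, and zero-padding do to a single line, here is a small sketch that prints the values instead of sending them to Solr. The function name describe_line is made up for illustration and is not part of update_solr.sh.

```shell
#!/bin/bash
# describe_line is a hypothetical helper used only for illustration.
describe_line() {
  local line="$1"
  # Unquoted expansion splits on runs of whitespace, as in the script above.
  local -a arr=( $line )
  # Prefix one zero: the seven-digit grant number becomes an eight-digit ID.
  # A plain string prefix avoids printf %d, which reads a leading zero as octal.
  printf 'id=0%s pages=%s\n' "${arr[0]}" "${arr[2]//[[:space:]]/}"
}

describe_line "0014317 USP2009w20     2"
```

For that sample line the sketch prints id=00014317 pages=2.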

Quick and crappy, but it works. If I have to do this again I’ll figure out how to batch up the updates or maybe even (sigh) break out the Java code and use the SolrJ interface. In the meantime, it works and I’m really glad Solr finally got this feature!
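
For what it's worth, here is one way the batching might look: build a single <add> message with one <doc> per input line, then post the whole thing in one request. Treat it as a sketch under assumptions, not something we ran against the patent index. The function name build_batch is made up, and it assumes the same id and page_count_int fields used above.

```shell
#!/bin/bash
# build_batch is a hypothetical helper: it reads "grant catalog pages" lines
# on stdin and prints one <add> message with an atomic "set" per document.
build_batch() {
  local grant catalog pages
  printf '<add>\n'
  while read -r grant catalog pages
  do
    # Skip blank lines.
    [ -z "$grant" ] && continue
    # Same zero prefix and update='set' as the one-request-per-line script.
    printf "<doc><field name='id'>0%s</field><field name='page_count_int' update='set'>%s</field></doc>\n" "$grant" "$pages"
  done
  printf '</add>\n'
}

# Demonstrate on two sample lines; nothing is sent to Solr here.
printf '0014317 USP2009w20     2\n0015257 USP2009w20     1\n' | build_batch
```

Saving that output to a file and posting it with curl's --data-binary @file option to the same update URL would replace hundreds of thousands of requests with one (or a handful, if the file is split into chunks). Note that with commit=false the new page counts are not searchable until a commit happens, for example by posting <commit/> to the update handler or by sending a request with commit=true.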

Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC