A while ago we created a sample index of US patent grants, roughly 700k documents in all. Alongside that, we pulled down the corresponding multi-page TIFFs of those grants and made PNG thumbnails of each page. So far, so good.
You see, we wanted to give our UI the ability to flip through those thumbnails, and we wanted it to be fast. So our original design had a client-side function that pulled down the first thumbnail and then tried to pull down subsequent thumbnails until it ran out of pages or cache. That was great for a while, but it didn't scale, because a good portion of our requests were for non-existent resources.
Things would be much better if the UI got the page count along with the other details of the search hits. So why not update each record in Solr with that?
A quick caveat: when we're prototyping, we store all of the fields in our schema. That gives us flexibility in what we display and doesn't hurt performance too much. Once the dust around the design settles a bit, we look at schema optimization.
So we're running Solr 4.x and we've got all of our fields stored. Those are the two requirements for doing atomic updates! Enough with the history, on with the code…
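For anyone who hasn't seen one, here is a minimal sketch of what an atomic update request can look like from the shell. It is not the script from this post: the update URL, the `id` unique key, and the `page_count` field name are all assumptions made up for illustration. The idea is that the field value is a `{"set": ...}` modifier, so Solr rewrites only that field and rebuilds the rest of the document from the stored values.

```shell
#!/usr/bin/env bash
# Sketch of a Solr atomic update. The URL, the "id" unique key, and the
# "page_count" field are assumptions for illustration, not the real schema.

SOLR_UPDATE_URL="${SOLR_UPDATE_URL:-http://localhost:8983/solr/collection1/update}"

# Build the JSON body: a one-document array in which the field value is a
# {"set": ...} modifier, so only that field is changed.
atomic_set_payload() {
  local id="$1" field="$2" value="$3"
  printf '[{"id":"%s","%s":{"set":%s}}]' "$id" "$field" "$value"
}

# POST a JSON body to the update handler and commit right away.
post_update() {
  curl -sS "${SOLR_UPDATE_URL}?commit=true" \
    -H 'Content-Type: application/json' \
    --data-binary "$1"
}
```

With those two helpers, something like `post_update "$(atomic_set_payload 01234567 page_count 12)"` would set the page count on a single record.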
Each bundle of TIFFs comes with an index and a count of the number of pages. That looks like this:[gist id=5164194]
The first column is a seven-digit grant number, the second column is a DAT catalog number (which we ignore), and the third is the number of pages.
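To make the column handling concrete, here is a rough sketch of splitting one such line in bash. It assumes the columns are separated by whitespace and that a line may end in a stray carriage return. Both are guesses about the file, and `parse_index_line` is a hypothetical helper name. It also pads the grant number to eight digits to match our IDs.

```shell
#!/usr/bin/env bash
# Sketch: split one index line into its three columns and normalize the ID.
# Whitespace-separated columns and a possible trailing CR are assumptions.

parse_index_line() {
  local line=$1 cr=$'\r'
  line=${line%"$cr"}               # drop a trailing carriage return, if any

  local -a cols
  read -r -a cols <<< "$line"      # split on spaces/tabs into an array
  [ "${#cols[@]}" -eq 3 ] || return 1

  # cols[1] is the DAT catalog number, which is ignored.
  local grant="${cols[0]}" pages="${cols[2]}"

  # Force base 10 so a leading zero is not read as octal, then pad to 8 digits.
  printf '%08d %d\n' "$((10#$grant))" "$((10#$pages))"
}
```

The function prints the padded ID and the page count separated by a space, and returns non-zero for a line that does not have exactly three columns.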
Up next is some shell script that reads in those lines and makes an update request to Solr:[gist id=5164172]
The only weird parts of this script are the use of arrays, the funky whitespace trimming I had to do on the last element, and the zero-padding I had to do because our grant IDs are eight digits. To run it you just:
Quick and crappy, but it works. If I have to do this again I'll figure out how to batch up the updates or maybe even (sigh) break out the Java code and use the SolrJ interface. In the meantime, it works and I'm really glad Solr finally got this feature!
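For what it's worth, batching might not need Java at all. Here is an untested-against-our-index sketch that turns a whole file of index lines into a single JSON array of atomic updates, so one request carries many documents. As before, the `page_count` field name, the whitespace-separated columns, and the `build_batch` name are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: read index lines on stdin and print one JSON array of atomic
# updates. Field names and the column layout are assumptions.

build_batch() {
  local grant dat pages sep=""
  printf '['
  # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
  while read -r grant dat pages || [ -n "$grant" ]; do
    pages="${pages%%[!0-9]*}"      # keep leading digits; drops a stray CR
    # Skip blank or malformed lines instead of failing on them.
    [[ $grant =~ ^[0-9]+$ && $pages =~ ^[0-9]+$ ]] || continue
    # 10# forces base 10 so leading zeros are not treated as octal.
    printf '%s{"id":"%08d","page_count":{"set":%d}}' \
      "$sep" "$((10#$grant))" "$((10#$pages))"
    sep=","
  done
  printf ']\n'
}
```

The output could then be piped into something like `curl -sS 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/json' --data-binary @-` (again, a made-up URL). For a full 700k-line index it would probably be wise to split the input into smaller chunks first rather than send one enormous request.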