A while ago we created a sample index of US patent grants, roughly 700k documents in size. Alongside that, we pulled down the corresponding multi-page TIFFs of those grants and made PNG thumbnails of each page. So far, so good.
We wanted our UI to be able to flip through those thumbnails, and we wanted it to be fast. So our original design had a client-side function that pulled down the first thumbnail and then kept requesting subsequent thumbnails until it ran out of pages or cache. That was great for a while, but it didn’t scale: the client only learned it had run out of pages by requesting one that didn’t exist, so a good portion of our requests were for non-existent resources.
Things would be much better if the UI got the page count along with the other details of the search hits. So why not update each record in Solr with that?
A quick caveat: When we’re prototyping we store all of the fields in our schema. That gives us flexibility on what we display and doesn’t hurt performance too much. Once the dust around the design settles a bit we then look at schema optimization.
So we’re running Solr 4.x and we’ve got all of our fields stored. Those are the two requirements for doing atomic updates! Enough with the history, on with the code…
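For reference, here is a minimal sketch of what a single atomic update request can look like. The core URL, the `id` key, and the `page_count` field name are assumptions for illustration, not necessarily what our schema uses; `build_set_doc` and `post_update` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Assumed for illustration: the core URL, the "id" unique key, and the
# "page_count" field. Substitute your own core and schema names.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/collection1/update}"

# Build a one-document atomic update. The "set" modifier replaces the value
# of one field; Solr rebuilds the rest of the document from its stored fields.
build_set_doc() {
  local id="$1" pages="$2"
  # 10# forces base 10 so a value like "08" is not parsed as octal.
  printf '[{"id":"%s","page_count":{"set":%d}}]' "$id" "$((10#$pages))"
}

# POST a JSON payload to the update handler and commit immediately.
post_update() {
  curl -sS "${SOLR_URL}?commit=true" \
    -H 'Content-Type: application/json' \
    --data-binary "$1"
}
```

Calling `post_update "$(build_set_doc 07123456 12)"` would send one update; nothing touches the network until `post_update` is invoked.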
Each bundle of TIFFs comes with an index file that lists every grant along with its page count. That looks like this: [gist id=5164194]
The first column is a seven-digit grant number, the second column is a DAT catalog number (which we ignore), and the third is the number of pages.
Up next is some shell script that reads in those lines and makes an update request to Solr: [gist id=5164172]
The only weird parts of this script are the use of arrays, the funky whitespace trimming I had to do on the last element, and the zero-padding I had to do because our grant IDs are eight digits. To run it you just:
./update_solr.sh < 2009_cumulative.csv
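The array, trimming, and zero-padding steps can be sketched roughly like this. It is not the script from the gist: it assumes the fields are comma-separated and that the last field may carry trailing spaces or a carriage return, and `parse_line` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumptions: comma-separated fields; the last one may end in spaces or "\r".
# The sample line in the comment below is made up for illustration.

# Turn a line such as "7123456,DAT0042,12 " into "07123456 12".
parse_line() {
  local line="$1" grant pages
  local -a cols
  # Split the line on commas into an array; the second column is ignored.
  IFS=',' read -r -a cols <<< "$line"
  # Strip every whitespace character, including a trailing carriage return.
  grant="${cols[0]//[[:space:]]/}"
  pages="${cols[2]//[[:space:]]/}"
  # Zero-pad the grant number to eight digits; 10# keeps the value in base 10.
  printf '%08d %d\n' "$((10#$grant))" "$((10#$pages))"
}
```

The parameter expansion `${var//[[:space:]]/}` does the trimming without spawning `sed` or `tr` for every line, which adds up over a few hundred thousand records.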
Quick and crappy, but it works. If I have to do this again I’ll figure out how to batch up the updates or maybe even (sigh) break out the Java code and use the SolrJ interface. In the meantime, it does the job, and I’m really glad Solr finally got this feature!
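If I did batch it up, one possible shape is to put many atomic updates into a single JSON array and commit once per batch. This is an untested-against-production sketch: `build_batch` is a hypothetical helper, `id` and `page_count` are assumed field names, and the input is assumed to be one "eight-digit id, space, page count" pair per line.

```shell
#!/usr/bin/env bash
# Assumptions: "id" and "page_count" are illustrative field names, and stdin
# carries lines of the form "<eight-digit id> <page count>".

# Read id/page-count pairs on stdin and emit one JSON array of atomic updates.
build_batch() {
  local id pages sep=''
  printf '['
  while read -r id pages; do
    # Skip blank lines.
    [ -n "$id" ] || continue
    printf '%s{"id":"%s","page_count":{"set":%d}}' "$sep" "$id" "$((10#$pages))"
    # Every document after the first is preceded by a comma.
    sep=','
  done
  printf ']\n'
}
```

The output can be piped straight into `curl` with `--data-binary @-` and a `Content-Type: application/json` header, so each batch costs one HTTP request and one commit instead of one per document.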