Indexing StackOverflow in Solr

John Berryman, February 18, 2013

One thing I really like about Solr is that it's super easy to get started. You just download Solr, fire it up, and after following the 10-minute tutorial you'll have a basic understanding of indexing, updating, searching, faceting, filtering, and generally using Solr. But you'll soon get bored of playing with the 50 or so demo documents. So quit insulting Solr with this puny, measly, wimpy dataset; index something of significance and watch what Solr can do.

One of the most approachable large datasets is the StackExchange dataset, which most notably includes all of StackOverflow but also contains many of the other StackExchange sites (Cooking, English Grammar, Bicycles, Games, etc.). So if StackOverflow is not your cup of tea, there's bound to be a dataset in there that jibes more with your interests.

Once you've pulled down the dataset, you're just moments away from having your own SolrExchange index. Simply unpack the dataset that you're interested in (the files are 7-Zip archives), pull down this git repo, which walks you through indexing the data, and finally, just follow the instructions in the README.md.

If you're interested in how it works, basically we've modded the schema (solr_home/collection1/conf/schema.xml) to incorporate the fields from StackExchange's post files. We use extractDocument.py to convert from StackExchange format to Solr's XML format. (It's a simple file, so go ahead and take a look.) Finally, we index by simply posting the output to Solr's update endpoint.
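To give a feel for the conversion step, here is a minimal sketch in Python. It is not the repo's extractDocument.py; it only assumes that each post in the dump is a `row` element whose attributes (Id, Title, Body, Tags) hold the data, and that Solr wants `<add><doc><field name="...">` XML. The function names and the Solr field names in `FIELD_MAP` are made up for illustration, so match them to whatever your schema.xml actually defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from StackExchange row attributes to Solr field names.
# The right-hand names are assumptions; use the ones in your own schema.xml.
FIELD_MAP = {"Id": "id", "Title": "title", "Body": "body", "Tags": "tags"}


def iter_solr_docs(source, field_map=FIELD_MAP):
    """Stream <row> elements from a posts file and yield Solr <doc> elements."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        doc = ET.Element("doc")
        for source_attr, solr_field in field_map.items():
            value = elem.attrib.get(source_attr)
            if value is not None:  # answers have no Title, for example
                field = ET.SubElement(doc, "field", name=solr_field)
                field.text = value
        yield doc
        elem.clear()  # drop the parsed row's attributes to keep memory down


def to_add_xml(docs):
    """Wrap <doc> elements in the <add> envelope used by Solr XML updates."""
    add = ET.Element("add")
    add.extend(docs)
    return ET.tostring(add, encoding="unicode")
```

Calling `to_add_xml(iter_solr_docs("Posts.xml"))` would produce one update message for the whole file. That is fine for a small site, but for something the size of StackOverflow you would want to wrap the documents a chunk at a time instead.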

So, is this a mature indexing platform for StackExchange? It's getting there! With a recent change we've added the capability to post to Solr in batches. (Earlier, the script was posting the entire file to Solr in one go, and thus running out of memory.) Besides the indexing script, we've also included a rudimentary visualization component (which will likely be expanded greatly in the future). Finally, we realized that it's painful to download the entire StackExchange data dump just so that you can start playing with Solr. Therefore we've included all the posts for the SciFi Stack Exchange so that you can begin experimenting immediately.
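The batching idea is simple enough to sketch. The snippet below is not the code from the repo; it is a rough illustration that assumes you already have each document as a `<doc>...</doc>` string and that Solr is listening at the default local address with a core named collection1. The URL, the batch size of 1000, and all function names are assumptions.

```python
import urllib.request
from itertools import islice

# Assumed location of the update handler; adjust host, port, and core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"
XML_HEADERS = {"Content-Type": "text/xml; charset=utf-8"}


def batched(items, size):
    """Yield lists of at most `size` items without loading everything at once."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_update_request(doc_xml_strings, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST carrying one <add> message."""
    payload = "<add>" + "".join(doc_xml_strings) + "</add>"
    return urllib.request.Request(
        url, data=payload.encode("utf-8"), headers=XML_HEADERS, method="POST"
    )


def post_in_batches(doc_xml_strings, batch_size=1000, url=SOLR_UPDATE_URL):
    """Send documents to Solr one batch at a time, then commit once."""
    for batch in batched(doc_xml_strings, batch_size):
        with urllib.request.urlopen(build_update_request(batch, url)) as response:
            response.read()
    commit = urllib.request.Request(
        url, data=b"<commit/>", headers=XML_HEADERS, method="POST"
    )
    with urllib.request.urlopen(commit) as response:
        response.read()
```

Because `batched` pulls from an iterator, only one batch of documents sits in memory at a time, which is the whole point of the change. Committing once at the end, rather than after every batch, is a design choice made here for speed.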

Want an interesting first project with your new Solr data?



Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC