
Indexing StackOverflow in Solr

One thing I really like about Solr is that it's super easy to get started. You just download Solr, fire it up, and after following the 10-minute tutorial you'll have a basic understanding of indexing, updating, searching, faceting, filtering, and generally using Solr. But you'll soon get bored of playing with the 50 or so demo documents. So quit insulting Solr with this puny, measly, wimpy dataset; index something of significance and watch what Solr can do.

One of the most approachable large datasets is the StackExchange data set, which most notably includes all of StackOverflow but also contains many of the other StackExchange sites (Cooking, English Grammar, Bicycles, Games, etc.). So if StackOverflow is not your cup of tea, there's bound to be a data set in there that jibes more with your interests.

Once you've pulled down the data set, you're just moments away from having your own SolrExchange index. Simply unzip the dataset that you're interested in (the archives are in 7-zip format), pull down this git repo, which walks you through indexing the data, and then follow the instructions in the README.md.

If you're interested in how it works, basically we've modded the schema (solr_home/collection1/conf/schema.xml) to incorporate the fields from StackExchange's post files. We use extractDocument.py to convert from StackExchange format to Solr's XML format. (It's a simple file, go ahead and take a look.) Finally, we index by simply posting the output to Solr's update endpoint.
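To give a feel for the conversion step, here is a minimal sketch of turning StackExchange `<row>` elements into Solr's `<add><doc>` XML. This is not the actual extractDocument.py; the function names and the `FIELD_MAP` field names are assumptions for illustration, and the real schema.xml in the repo may name its fields differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from StackExchange row attributes to Solr field names.
# The field names in the repo's schema.xml may differ.
FIELD_MAP = {"Id": "id", "Title": "title", "Body": "body", "Tags": "tags"}


def row_to_doc(row):
    """Build a Solr <doc> element from one StackExchange <row> element."""
    doc = ET.Element("doc")
    for attr, field_name in FIELD_MAP.items():
        value = row.get(attr)
        if value is not None:  # answers have no Title, for example
            field = ET.SubElement(doc, "field", name=field_name)
            field.text = value
    return doc


def rows_to_add_xml(rows_xml):
    """Convert a string of <row .../> elements into one Solr <add> message."""
    root = ET.fromstring(rows_xml)
    add = ET.Element("add")
    for row in root.iter("row"):
        add.append(row_to_doc(row))
    return ET.tostring(add, encoding="unicode")


# A made-up sample row in the shape of the dump's post files.
sample = (
    '<posts><row Id="1" PostTypeId="1" Title="Who shot first?" '
    'Body="&lt;p&gt;Han or Greedo?&lt;/p&gt;" /></posts>'
)
print(rows_to_add_xml(sample))
```

For a full-size dump you would want to stream the file (for example with `ET.iterparse`) rather than parse it in one go, since the StackOverflow post file is far too large to hold in memory comfortably.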

So, is this a mature indexing platform for StackExchange? It's getting there! With a recent change we've added the capability to post to Solr in batches. (Earlier, the script posted the entire file to Solr in one request and ran out of memory.) Besides the indexing script, we've also included a rudimentary visualization component (which will likely be expanded greatly in the future). Finally, we realized that it's painful to download the entire StackExchange data dump just so that you can start playing with Solr. Therefore we've included all the posts for the SciFi Stack Exchange so that you can begin experimenting immediately.
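The batching idea can be sketched roughly as follows. Again, this is an illustration and not the repo's script: the function names, the batch size, and the URL (Solr's default port with the collection1 core) are assumptions. Each batch becomes its own `<add>` message, so only one batch has to sit in memory at a time, and a single `<commit/>` goes out at the end.

```python
import urllib.request
from itertools import islice

# Assumed URL: Solr's default port and the collection1 core.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_xml(xml_text, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_in_batches(doc_strings, size=1000, send=post_xml):
    """Send <doc>...</doc> strings in batches, then commit once.

    `send` is injectable so the batching logic can be tried without Solr.
    Returns the number of <add> batches sent.
    """
    count = 0
    for chunk in batches(doc_strings, size):
        send("<add>" + "".join(chunk) + "</add>")
        count += 1
    send("<commit/>")
    return count
```

Passing a generator of document strings into `index_in_batches` keeps memory use flat no matter how large the dump is, which is the whole point of the change.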

Want an interesting first project with your new Solr data?

