Interning and Solr

Welcome to the first blog post (of many, hopefully) by OpenSource Connections’ 2012 summer interns. After our first few weeks as interns at OSC, we have begun to get the hang of search in general and Solr in particular. I am a rising fourth year at UVa majoring in Computer Science and Math with a minor in East Asian Studies. I have experience as a teaching and research assistant at the University, but this is my first internship with a software company. My hope is that this summer will give me plenty of experience with cloud computing and big data.

When I tell people that I am working for a company that specializes in search, the usual response is, “Oh, like Google?” Until last week I could only half-heartedly agree, mention that it was called Solr, and add that it was used by Netflix, Zappos, Reddit, and others. Now I can confidently say that Solr is a whole lot more than just a black box that is “kind of like Google.”

So what do you tell someone who wants to understand what it means when a person says he works on search? Like most kinds of software development, search can be broken up into back end and front end, or functionality and feel. Many people think that search is all about the perfect algorithm, factoring in as many variables as possible, like page views and links, to find your perfect result. But creating a good interface that helps someone use your search the way you intend is just as important.

Since Solr is a mature, well-established search engine, its algorithms have already been developed and refined by the open source community. For most projects, the back end of a Solr search page consists of adjusting pre-defined parameters to fit your existing data. There is no struggling to invent a new algorithm; it has been done, and it works well. Instead, there is a lot more thinking about data. How can the data be organized efficiently and logically? Can your existing data be moved into Solr easily?
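To give a flavor of what adjusting those parameters looks like, here is a hypothetical query against a local Solr instance; the field names, boosts, and query are invented for illustration, but each parameter is a standard Solr knob:

```
http://localhost:8983/solr/select?q=due+process
    &defType=edismax     (use the eDisMax query parser)
    &qf=title^2+text     (search the title field, boosted, plus the text field)
    &fq=type:law         (filter to one kind of document)
    &rows=10             (page size)
```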

Though Solr is written in Java, there is no need to write any Java to use it; a web developer who has never touched the language need not worry. All of Solr’s configuration can be managed through a series of XML files. Modify these, start the Solr server, import the data, and search away!
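The two files you will touch most are schema.xml, which declares what fields your documents have and how they are analyzed, and solrconfig.xml, which sets up request handlers and other machinery. Declaring a couple of fields (made up here for illustration) takes only a line or two:

```xml
<!-- schema.xml: two hypothetical fields -->
<field name="title"   type="text_general" indexed="true" stored="true"/>
<field name="section" type="string"       indexed="true" stored="true"/>
```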

Probably the most important piece of the front end is the search engine results page, or SERP. By default, Solr answers a query with an XML document – definitely not the most pleasing thing to look at, but it contains all of your desired information. There are also response writers that return JSON, Python, PHP, and other formats that most web developers should be comfortable with.
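Switching formats is just a matter of the wt parameter. As a minimal sketch (the Solr URL and the title field are hypothetical, and this assumes a modern browser or server-side JavaScript), fetching JSON results looks something like this:

```javascript
// Fetch JSON results from a (hypothetical) local Solr instance.
// Works same-origin or server side; cross-domain calls from the
// browser are exactly why AjaxSolr (below) leans on JSONP.
function search(q) {
  var url = 'http://localhost:8983/solr/select' +
            '?q=' + encodeURIComponent(q) + '&wt=json';
  return fetch(url)
    .then(function (resp) { return resp.json(); })
    .then(function (data) {
      // in Solr's JSON format, documents live under response.docs
      data.response.docs.forEach(function (doc) {
        console.log(doc.title);
      });
    });
}

search('habeas corpus');
```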

For our first project, http://vacode.org/ (and the State Decoded project), we have chosen to use AjaxSolr, which exploits the little JSONP loophole to query Solr straight from the browser. It has a nice base of widgets and abstract types for building a beautiful SERP, along with good examples of the important parts of one: easy pagination, visible tag clouds and filters, and of course the list of results. For getting things done quickly, it is definitely a great tool for creating the front end of your dreams.
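Wiring it up is pleasantly small. In this sketch, the Solr URL is hypothetical and MyResultWidget stands in for a widget you would write yourself by extending AjaxSolr.AbstractWidget:

```javascript
// A minimal AjaxSolr setup; JSONP lets the browser talk to Solr across domains.
var Manager = new AjaxSolr.Manager({
  solrUrl: 'http://localhost:8983/solr/'  // hypothetical Solr instance
});
// MyResultWidget (not shown) renders each document in the response.
Manager.addWidget(new MyResultWidget({ id: 'result', target: '#docs' }));
Manager.init();
Manager.store.addByValue('q', '*:*');     // start with a match-all query
Manager.doRequest();
```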

When we started, the search engine in use was Sphinx, and the search page was very bare bones.

Throughout the process we learned how to stand up our own Solr instance and modify its configuration to import data from a database, and we picked up some interesting web development along the way. We added plenty of cool features: filtering by title and section, a tag cloud, and page suggestions. We also added the ability to search and filter through the different kinds of documents being stored (e.g., court decisions, comments, etc.). Hopefully you will see a version of our search go into production soon, but in the meantime, here is a sneak peek!
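As for the database import piece: one common way to pull rows straight out of a relational database is Solr’s DataImportHandler. A hypothetical data-config.xml (the database, table, and column names here are all invented) looks roughly like this:

```xml
<!-- A made-up DataImportHandler config mapping database columns to Solr fields. -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/lawsdb"
              user="solr" password="secret"/>
  <document>
    <entity name="law" query="SELECT id, title, section, text FROM laws">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
```

Once a type field is indexed alongside everything else, filtering by document kind is a single filter query, e.g. fq=type:court_decision (field and value invented for illustration).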

It seems many web developers are daunted by deploying and maintaining their own search engine, but I would encourage them to give Solr a try and listen in as the OSC intern team learns the ropes.

Until next time,

Joseph