We've been using Apache Camel a fair amount recently as our ingestion pipeline of choice. It presents a nice DSL for wiring together different data sources, performing transformations, and finally sending data to Solr. Using the standard Solr component, you can write code that looks like this:
```java
from("file://foo?fileName=input.csv")
    .unmarshal().csv()
    .split(body())
    .to("bean:convertToSolrDoc")
    .setHeader(SolrConstants.OPERATION, SolrConstants.INSERT)
    .to("solr://localhost:8983/solr/collection1");
```
This code defines a Camel route. The route reads a CSV file and splits it line by line into individual records. Each record is then transformed into a SolrInputDocument, which is sent to Solr for insertion.
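For illustration, the convertToSolrDoc bean referenced above might look something like the sketch below. The field names and column order are assumptions, and we use a plain Map here as a stand-in for SolrJ's SolrInputDocument so the sketch stands alone (a real bean would call addField on a SolrInputDocument instead):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvToSolrDoc {

    // Hypothetical column order for this sketch: id, title, description.
    // After .unmarshal().csv() and .split(body()), each exchange body is
    // one CSV row, represented as a List of column values.
    public Map<String, Object> convertToSolrDoc(List<String> row) {
        // Stand-in for org.apache.solr.common.SolrInputDocument;
        // a real bean would call doc.addField(name, value) instead.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", row.get(0));
        doc.put("title", row.get(1));
        doc.put("description", row.get(2));
        return doc;
    }
}
```

Camel's bean binding will invoke this method with the split CSV row as its argument.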
Tying together data and features from different systems is Camel's strong point. Perhaps you want to build a distributed pipeline with leader election so that only one instance of this route runs at a time. Well, you can! Just pull in some features from the ZooKeeper component.
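As a sketch of how that might look (this assumes the camel-zookeeper component is on the classpath; the ZooKeeper address and election node below are placeholder values), a route policy can restrict the route to a single active instance across the cluster:

```java
// Sketch only: requires the camel-zookeeper component.
// "localhost:2181" and "/camel/election" are placeholders.
ZooKeeperRoutePolicy policy =
    new ZooKeeperRoutePolicy("zookeeper:localhost:2181/camel/election", 1);

from("file://foo?fileName=input.csv")
    .routePolicy(policy)  // only the elected instance runs this route
    .unmarshal().csv()
    .split(body())
    .to("bean:convertToSolrDoc")
    .setHeader(SolrConstants.OPERATION, SolrConstants.INSERT)
    .to("solr://localhost:8983/solr/collection1");
```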
Built-In Solr Component
The Solr component covers most of the basic features you'd expect. It wraps SolrJ and performs fairly standard indexing operations. However, as we've gained more experience with Camel, we've increasingly found ourselves missing much-needed features.
For example, the Solr component:
- Has no built-in support for SolrCloud. Increasingly, our clients expect to run SolrCloud and would like to leverage SolrJ's CloudSolrServer for optimal document routing.
- Does not support communication with Solr over https, only http. Even in trusted back-end environments, many of our clients need to put sensitive data behind https, so this becomes another must.
- While you can write to Solr, there's no way to read through Solr search results. In Camel terminology, there's a producer but no consumer. This is essential for work such as reindexing from Solr back to Solr.
Building a Better Solr Component
So what have we done? Instead of lots of one-offs, we've decided to make dramatic improvements to the Camel Solr component! You can find our improvements here, ready for production use (specifically this pull request)! So what have we checked off the wish list above?
We now support Solr over https (by specifying a solrs:// URI) and SolrCloud (by specifying a solrCloud:// URI and passing zkHost and collectionName parameters). For example, to reimplement the example above against SolrCloud, I would simply do:
```java
from("file://foo?fileName=input.csv")
    .unmarshal().csv()
    .split(body())
    .to("bean:convertToSolrDoc")
    .setHeader(SolrConstants.OPERATION, SolrConstants.INSERT)
    .to("solrCloud://localhost:8983/solr?zkHost=localhost:8123/solr&collectionName=collection1");
```
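To summarize, the endpoint scheme selects the transport (hosts and ports below are placeholder values; the parameter names follow the description above):

```
solr://localhost:8983/solr/collection1        -> plain http
solrs://localhost:8983/solr/collection1       -> https
solrCloud://localhost:8983/solr?zkHost=localhost:8123/solr&collectionName=collection1
                                              -> SolrCloud via ZooKeeper
```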
Getting all this to work wasn't terribly difficult, but we wanted to ensure that all of the Camel tests passed regardless of whether Solr was running over http, https, or in cloud mode. So much of the challenging work was getting the tests to run against Solr in each of these modes. We learned a lot about how to get the embedded Jetty Solr server to work over https.
We're particularly grateful for the MiniSolrCloudCluster support that was recently added in Solr 4.8. This addition allowed us to sidestep a large amount of work we had started in order to create an embedded Jetty Solr server running in SolrCloud mode. Instead, we were able to bring up a MiniSolrCloudCluster for our tests with minimal fuss. In fact, if you need a usage example beyond the existing tests, check out this file in our test code.
Naturally, we'd like these changes to flow back to the Camel project. So we've attached our pull request to this Jira issue. We hope you'll help us out by voting for it. In any case, please try out our component, and feel free to send us bugs/feedback/pull requests! And stay tuned for continued improvements!
And of course, if you need any help connecting your systems to Solr, contact us and leverage our expertise with Camel and other frameworks that help make your data searchable!