
Using Solr’s new Healthcheck command

SolrCloud is great, up until you start seeing lots of ZooKeeper exceptions and you think to yourself, “wtf?”.

That's when you discover that you should have been monitoring your various servers proactively, but now that you have a number of Solr processes spread across many nodes, that is a pain.

Well, as part of the new SolrCLI command-line interface to Solr, we have a new tool, HealthCheck, to play with: https://github.com/apache/lucene-solr/blob/a847deabd09e9110957d22ec3a294c64ffd6e159/solr/core/src/java/org/apache/solr/util/SolrCLI.java#L871

You pass in the collection you are interested in checking, as well as the ZooKeeper URL, and it helpfully interrogates the cluster:

  1. Goes through and reports the health of each replica, including useful things like uptime, memory consumption, and number of documents.
  2. If the shard doesn’t have a leader, it marks the replica status as DOWN; otherwise, if it can’t reach the replica, the status is ERROR, or hopefully, ACTIVE.
  3. Finally, it rolls up, per shard, all of the replica states into a single status that is healthy, degraded, down, or my nemesis, no_leader.
Running it looks like this:

./bin/solr healthcheck -c collection1 -z localhost:2181

Which outputs a whole bunch of JSON.

{  "collection":"collection1",  "status":"healthy",  "numDocs":14,  "numShards":2,  "shards":[    {      "shard":"shard1",      "status":"healthy",      "replicas":[        {          "name":"core_node2",          "url":"http://192.22.22.1:8983/solr/collection1_shard1_replica1/",          "numDocs":5,          "status":"active",          "uptime":"0 days, 1 hours, 9 minutes, 6 seconds",          "memory":"68.2 MB (%7.5) of 310 MB"},        {          "name":"core_node3",          "url":"http://192.22.22.1:8985/solr/collection1_shard1_replica2/",          "numDocs":5,          "status":"active",          "uptime":"0 days, 1 hours, 9 minutes, 7 seconds",          "memory":"64.8 MB (%7.1) of 243 MB",          "leader":true}]},    {      "shard":"shard2",      "status":"healthy",      "replicas":[        {          "name":"core_node1",          "url":"http://192.22.22.1:8983/solr/collection1_shard2_replica1/",          "numDocs":9,          "status":"active",          "uptime":"0 days, 1 hours, 9 minutes, 7 seconds",          "memory":"68.5 MB (%7.5) of 310 MB"},        {          "name":"core_node4",          "url":"http://192.22.22.1:8985/solr/collection1_shard2_replica2/",          "numDocs":9,          "status":"active",          "uptime":"0 days, 1 hours, 9 minutes, 7 seconds",          "memory":"64.9 MB (%7.1) of 243 MB",          "leader":true}]}]}

This is still better than looking at clusterstate.json directly; however, if I combine it with jq, I can get just this:

sansabastian:bin epugh$ ./solr healthcheck -c collection1 -z localhost:2181 | jq .status
INFO  - 2015-01-13 17:26:32.247; org.apache.solr.util.SolrCLI$HealthcheckTool; Running healthcheck for collection
"healthy"
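
The same trick works for pulling out per-shard detail. This filter is just my own sketch against the JSON structure shown above:

./solr healthcheck -c collection1 -z localhost:2181 2>/dev/null | \
  jq '.shards[] | {shard: .shard, status: .status}'

which gives you one compact status object per shard instead of the whole blob.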

Now, one thing I noticed was that with a downloaded copy of Solr 4.10.3, running ./solr healthcheck gave me classpath issues. I ended up putting together this little shell script to unpack solr.war and run the healthcheck Java class directly:

COLLECTION_NAME=collection1
ZKHOST=172.31.64.31:2181

java -cp WEB-INF/lib/*:solr-4.10.3/example/lib/ext/* org.apache.solr.util.SolrCLI healthcheck -v -collection ${COLLECTION_NAME} -zkHost ${ZKHOST}
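
That snippet assumes the war has already been exploded so WEB-INF/lib exists on disk; the unpack step itself is just something like this (the war location here is the stock 4.10.3 download layout, so adjust to taste):

# Run from the same directory as the java command above, so WEB-INF/lib
# lands where the -cp expects it; war path assumes the default 4.10.3 layout.
unzip -o solr-4.10.3/example/webapps/solr.war 'WEB-INF/lib/*'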

This is all great, but it does mean I need to be able to reach ZooKeeper and my cluster from wherever I am running the check. Someday, hopefully, this capability will be baked into the Solr Admin UI.
I know, patches welcome ;-).
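
In the meantime, if you actually want the proactive monitoring I was wishing for at the top, one low-tech option is to wrap the healthcheck in a tiny script that exits non-zero whenever the rolled-up status isn't healthy, and let cron or your monitoring tool of choice yell at you. This is just a sketch, assuming bin/solr and jq are available on the box doing the checking and that it can reach ZooKeeper and all the Solr nodes:

#!/bin/bash
# check_solr_health.sh -- sketch of a cron-able wrapper around bin/solr healthcheck.
COLLECTION=collection1
ZKHOST=localhost:2181

# The INFO log line appears to go to stderr (jq above wasn't confused by it),
# so silencing stderr just keeps the output tidy for parsing.
STATUS=$(./bin/solr healthcheck -c "${COLLECTION}" -z "${ZKHOST}" 2>/dev/null | jq -r .status)

if [ "${STATUS}" != "healthy" ]; then
  echo "Collection ${COLLECTION} status: ${STATUS:-unknown}" >&2
  exit 1
fi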