Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups

June 11, 2019 Charlie Hull
Category: Solr

We’re very happy to present a guest blog by Alex Addison, Senior Developer at LexisNexis UK. Alex’s team will be presenting a talk on some of their work at the next London Lucene/Solr Meetup on July 4th.

Here at LexisNexis we use Solr extensively, and I’m on a team building some new production-ready infrastructure around it. We ran into a problem in how they interact with AWS’s resiliency tool, the Auto-Scaling Group. Here’re the solutions and workarounds we considered, and some guidance and code to help you solve the same problem.

Introduction to Solr Cloud and Zookeeper

For those who know what SolrCloud is and how it uses Zookeeper, feel free to skip this section. For the rest of you, Solr is a search engine, with a cloud mode which helps with scaling, load-balancing, and all of the problems which come with large data-sets and distributed computing. To enable all of this, it uses Zookeeper for cluster coordination and configuration. You just start Solr with a parameter pointing it at each of your Zookeeper servers, and away you go.Of course, this means you need a Zookeeper cluster (known as an ensemble) as well as your Solr Cloud cluster, but that’s not a problem, 3 Zookeeper servers running on small, cheap server instances can handle quite a large SolrCloud cluster, and this setup allows you to separate concerns of search and cluster management.

What’s the problem?

To ensure your Zookeeper ensemble keeps running in AWS, you should keep it in an auto-scaling group with at least 3 instances in separate Availability Zones. EC2 instances can fall over or be terminated, sometimes without warning, and without Zookeeper SolrCloud can’t index data and stay up to date. New EC2 instances created by the Auto-Scaling Group are likely to have different IP addresses and DNS names, so Solr will no longer know where to find all of its’ Zookeeper servers. Some useful discussion can be found in a blog post by Credera here.

Potential solutions

We considered a number of solutions:

Manual
- But manually do what? Also, I’m not too keen on waking up at unsocial hours to do this.
Tell Solr when the Zookeeper ensemble changes
- There doesn’t seem to be a mechanism for this without taking the Solr server down, and looking at the Solr code, I think it would be quite time consuming and difficult to implement into Solr.
Re-start Solr when the Zookeeper ensemble changes
- We could do this, and it would probably be fine, but it could take a long time and is either painstaking manual work, or you have to work out how to automate knowing that a SolrCloud server is up to date and ready; you probably don’t want to just tear all of your SolrCloud servers down at the same time.
Put all of the Zookeeper servers behind a load balancer
- Solr’s designed to know about each of its’ Zookeeper servers, not one at random (which might change) – and I’m told Solr and Zookeeper have slightly weird connections, so it might simply not work. In any case, the point here is resiliency, and we’re adding another thing to go wrong.
Put each Zookeeper server behind its’ own load balancer
- Solr and Zookeeper have slightly weird connections, so it might simply not work. In any case, the point here is resiliency, and we’re adding another thing to go wrong.
Give each Zookeeper server a persistent IP address or DNS record which can survive re-creation of the instance, and which is attached to the new instance as it starts
- One way of attaching an IP address or DNS record to an instance as it starts is using a Lambda. Fortunately, there’s a guide on how to do that.
- Alternatively, we could give the EC2 instances permissions and allow them to pick up a nominated IP address or DNS record.

Our Standards

It’s worth noting at this point that we think the right thing to do here is an automated approach – waking people in the middle of the night if a Zookeeper server fails is not where we want to be. We’re heavily into infrastructure-as-code and for this project we’re using Terraform. We don’t have access to make changes in production unless we’re on the support rota, and even then, access is given sparingly.

A neat solution, two ways

Here we use ENIs (an additional Elastic Network Interface per Zookeeper Server). So far as we are aware, you could equally well use DNS records – but we haven’t tested it (so Java/Solr might cache the DNS resolution, for example). We used ENIs partly because we were using Terraform, partly because we didn’t see any need to clutter up our DNS namespaces, and partly because the instructions and sample code available to us were working with ENIs. Instructions and code are provided here as examples, with no guarantees of quality or fitness for purpose.

Use a lambda to attach a network interface to a zookeeper instance

Depending on exactly how you set up your infrastructure, this might look substantially more complicated, as AWS’ online console does a number of steps under the hood to save you time. For this you will need: a Zookeeper server, set up and running in an ASG (Ideally using Exhibitor or similar). This is simplest if you have one server per ASG, but can be adapted to work for several per ASG. Do this for each ASG.

Create a new ENI
Attach the new ENI
Create a Lifecycle hook for the Lambda to run on instance launch
Create a Lambda with suitable permissions and code to re-attach your network interface when a new instance is created – example code is here
Check it all works!

Use EC2 startup script to attach a network interface to a zookeeper instance

While this is simpler than creating a lambda, it requires that you set it up when you set up your Zookeeper servers, so it can’t be retrofitted into existing ensembles as easily. When you create an ASG, there is an option to provide a user data script. If you’re using an Amazon Linux based image, you should be able to use this one-liner (with your own ENI id in it): aws ec2 attach-network-interface --device-index 1 --network-interface-id eni-0123456789abcdef --instance-id $(ec2-metadata --instance-id | cut -d " " -f 2) --region $(ec2-metadata --availability-zone | sed 's/placement: (.*).$/1/').In order for this to succeed, the EC2 instance is going to need some permissions:

{    "Version": "2012-10-17",    "Statement": [        {            "Action": [                "ec2:DescribeNetworkInterfaces",                "ec2:AttachNetworkInterface",                "ec2:DescribeInstances"            ],            "Effect": "Allow",            "Resource": "*"        }    ]} 

Conclusion

In hindsight it seems obvious that persistent addresses for Zookeeper servers are a useful way to connect Solr to them. Going a step further and automating how they are assigned simplifies ongoing maintenance and removes manual steps which could be error prone. For new Zookeeper ensembles I’d suggest you attach the addresses as part of the instance startup script.

Credits

This write-up covers the work of many people. Credit is due to:

Team Columbo in LexisNexis UK, including several members of OSC.
Several DevOps specialists; especially Andy.

Thanks Alex!