I find myself frequently backing up and restoring Solr collections as part of my experimentation workflow for improving search relevancy, so I wanted to share the steps I follow.
Along the way I learned an, shall we say, interesting difference between a single-shard collection and a multi-shard collection! Typically, when you back up and restore Solr collections, you do it via a shared filesystem mounted on all your Solr nodes, and that is how the Solr Reference Guide is written.
However, if your index is made of a single shard, then you don’t need that shared filesystem mounted into each Solr node. In my case, I only had a single shard, so I didn’t need one.
If you want to follow along, these steps work with the repository https://github.com/epugh/playing-with-solr-streaming-expressions. Follow the first section in the README to set up the sample Solr cluster. The Collections API page of the Solr Reference Guide explains these commands in detail.
The command to back up is pretty simple:
curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=myBackup2020-06-12&collection=books&location=/tmp/fake_shared_fs' -H 'Content-type:application/json'
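If you back up often, a date-stamped backup name saves retyping. Here is a minimal sketch that builds the same BACKUP URL as above. The names SOLR_URL, backup_url, and RUN_BACKUP are my own illustrative choices, not part of Solr, and the script only contacts Solr when RUN_BACKUP=yes is set.

```shell
#!/bin/sh
# Hypothetical helper: SOLR_URL, backup_url and RUN_BACKUP are illustrative names.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the Collections API BACKUP URL.
# $1 = collection, $2 = backup name, $3 = location as seen by the Solr nodes
backup_url() {
  printf '%s/admin/collections?action=BACKUP&name=%s&collection=%s&location=%s\n' \
    "$SOLR_URL" "$2" "$1" "$3"
}

# Only call Solr when explicitly asked, so the file can be sourced safely.
if [ "${RUN_BACKUP:-no}" = "yes" ]; then
  curl -sS "$(backup_url books "myBackup$(date +%Y-%m-%d)" /tmp/fake_shared_fs)"
fi
```

Running it with RUN_BACKUP=yes should issue the same request as the curl command above, with today's date in the backup name.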
Then look in ./tmp/fake_shared_fs and you’ll see the shards and the ZooKeeper setup exported.
Restoring is similarly simple. Here we are restoring our backup into a new collection named restored_books:
curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=myBackup2020-06-12&location=/tmp/fake_shared_fs&collection=restored_books' -H 'Content-type:application/json'
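One quick sanity check after a restore is to compare document counts between the original and the restored collection. The sketch below assumes the standard /select handler is available on both collections and that the JSON response carries a numFound field. The names num_found, doc_count, and RUN_CHECK are hypothetical helpers of mine, and matching counts are only a rough signal, not proof of an identical index.

```shell
#!/bin/sh
# Hypothetical helpers: num_found, doc_count and RUN_CHECK are illustrative names.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Read a Solr JSON select response on stdin and print its numFound value.
num_found() {
  sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Ask a collection for zero rows, so only the total count comes back.
# $1 = collection name
doc_count() {
  curl -sS "$SOLR_URL/$1/select?q=*:*&rows=0" | num_found
}

# Only call Solr when explicitly asked, so the file can be sourced safely.
if [ "${RUN_CHECK:-no}" = "yes" ]; then
  src="$(doc_count books)"
  dst="$(doc_count restored_books)"
  if [ -n "$src" ] && [ "$src" = "$dst" ]; then
    echo "document counts match: $src"
  else
    echo "document counts differ or could not be read: '$src' vs '$dst'" >&2
  fi
fi
```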
What if I don’t have a Shared Filesystem?
As an experiment, I wondered whether I could merge a multi-shard collection into a single-shard collection and then back it up. I first thought about using the MIGRATE command, but then remembered that the REINDEXCOLLECTION command was introduced in 8.x:
curl 'http://localhost:8983/solr/admin/collections?action=REINDEXCOLLECTION&name=books&target=single_shard_books&numShards=1' -H 'Content-type:application/json'
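To check on a reindex, my reading of the Reference Guide is that REINDEXCOLLECTION accepts a cmd parameter (start by default, plus abort and status). A minimal sketch of a status check follows. The names reindex_url and RUN_STATUS are hypothetical, and you should confirm the cmd values against the guide for your Solr version.

```shell
#!/bin/sh
# Hypothetical helper: reindex_url and RUN_STATUS are illustrative names.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print a REINDEXCOLLECTION URL for a given sub-command.
# $1 = source collection, $2 = cmd (assumed values: start, abort, status)
reindex_url() {
  printf '%s/admin/collections?action=REINDEXCOLLECTION&name=%s&cmd=%s\n' \
    "$SOLR_URL" "$1" "$2"
}

# Only call Solr when explicitly asked, so the file can be sourced safely.
if [ "${RUN_STATUS:-no}" = "yes" ]; then
  curl -sS "$(reindex_url books status)"
fi
```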
Okay, now let’s try and see if we can back it up to a different directory than our fake shared filesystem:
curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=single_shard_books_backup&collection=single_shard_books&location=/tmp' -H 'Content-type:application/json'
No joy. It seems like it should work, and it at least would work if you had just a single Solr node in the cluster. I’ll have to investigate this more. I also noticed that the BACKUP and RESTORE commands aren’t exposed to the end user in the Solr Admin Collections UI. I’ll update this blog post if I get a chance to add them to the Solr Admin UI.