And now to the fun subject of virtual machine images!
My computer blue-screened the other day, for the second time in a month, so I got fairly paranoid. I needed to make sure my backup strategy was solid. The most important thing I want to back up is my Ubuntu VirtualBox image. Here at work we use CrashPlan, which seems to do a decent enough job of automatically backing up the directories I tell it to. However, I had simply been pointing CrashPlan at the directory with my virtual machine images and hoping for the best. This is, of course, problematic, as CrashPlan is most likely running its backup while I’m actively working in my virtual machine. How can I have any guarantee that the image is in any kind of consistent state?
Backing up with a versioning system
I tried all kinds of harebrained schemes to back up my VirtualBox image in something resembling an automatic fashion. I really wanted some kind of diff-based approach to backing up the large file. One thought I had was that, before launching the virtual machine from a batch file or script, I’d commit the .vdi to a version control system. Then I’d point CrashPlan at the associated repository, which would be backed up in the background. I don’t recommend this. Git turns out to have a sensible maximum file size limit, which was a pretty big “hey idiot… what are you doing” warning. I found and tried Boar, a versioning system which attempts to work well for binary files. Unfortunately, even its diffing ability was lacking in this situation. After 2 commits of a 20 GB .vdi file, the repository had grown to 40 GB+, so this solution wasn’t very space efficient. Moreover, committing such a large file is tediously slow. Every time I launched my virtual machine I’d have to wait for this boring 5-10 minute process.
Backing up using VirtualBox snapshots
The canonical solution turns out to involve a VirtualBox feature called snapshots, which I had known nothing about until now.
From a user’s perspective, a snapshot is a restore point. If I create a snapshot, I can go back to that point in time. The best part is that I can take a snapshot of a system even while it’s running. In the simplest use case you have a linear progression in time of various snapshots, and you can restore your virtual machine to any snapshot in the history. You can also do crazy things like go back to a snapshot and create a branch from it, taking your virtual machine in multiple experimental directions from a single restore point.
The way snapshots actually work makes them extremely powerful for backups. A snapshot turns out to be the diffing system I was looking for. When you take a snapshot of a virtual machine in the default “normal” mode, the associated parent (either the virtual machine image or another snapshot) is frozen and no longer written to. Instead, all writes go into the file associated with the new snapshot. This file is in essence a kind of commit log against the underlying virtual file system, recording everything that has happened since the snapshot was taken. It’s a diff of the stuff that’s changed since the snapshot took place. Restoring to the snapshot point is as simple as throwing away the snapshot file (the commit log) and unfreezing the snapshot’s ancestor.
Deleting a snapshot does not remove the changes in that commit log. Instead, it folds that snapshot into its ancestor (back into the vdi or another diff file). It’s actually committing the diff.
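To make the freeze / diff / fold-in behaviour described above concrete, here is a toy model in Python. It is only a sketch of the idea, not how VirtualBox is implemented: the `ToyDisk` class and its method names are made up for illustration, the “disk” is a dictionary of numbered blocks, and each differencing file is another dictionary layered on top.

```python
class ToyDisk:
    """Toy model of a base image plus a chain of differencing layers.

    Hypothetical illustration only; the names here are not VirtualBox APIs.
    """

    def __init__(self):
        # layers[0] plays the role of the base .vdi; every later dict is one
        # differencing file. Only the last layer is ever written to.
        self.layers = [{}]

    def write(self, block, data):
        # All writes land in the newest layer; older layers stay frozen.
        self.layers[-1][block] = data

    def read(self, block):
        # Look in the newest layer first, then fall through to the ancestors.
        for layer in reversed(self.layers):
            if block in layer:
                return layer[block]
        return None

    def take_snapshot(self):
        # Freeze the current top layer and send new writes to a fresh diff.
        self.layers.append({})

    def restore_latest(self):
        # Throw away everything written since the latest snapshot.
        if len(self.layers) < 2:
            raise ValueError("no snapshot to restore to")
        self.layers[-1] = {}

    def delete_oldest_snapshot(self):
        # Fold the oldest diff into the base: the data is kept,
        # only the restore point is forgotten.
        if len(self.layers) < 2:
            raise ValueError("no snapshot to delete")
        diff = self.layers.pop(1)
        self.layers[0].update(diff)
```

In this model, restoring empties the newest diff, while deleting a snapshot merges a diff into its parent, which is why a delete never loses data.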
This is what makes snapshots workable as a backup strategy.
You can back up a running VirtualBox virtual machine by maintaining a cascade of snapshots. One snapshot, known as “current”, is the most recent snapshot. Restoring to it restores to the last backup. The diff file associated with it (holding all the stuff that has happened AFTER the snapshot) reflects all the changes not yet backed up and is where VirtualBox is actively keeping the guest OS’s writes. This diff is a kind of “commit log” of all the changes that are going to the virtual disk. The “previous” snapshot is the restore point before current. In this script, previous’s commit log is the old current. The commit log/diff associated with “previous” reflects the changes between the previous and current snapshots. Finally, this script also has a “deleteMe”, which is the old previous. On every run, deleteMe is folded back into the main vdi file by telling VBoxManage to delete the snapshot (deleting a snapshot doesn’t remove the associated data, it just folds in the data and forgets the restore point).
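For illustration, the rotation described above could be sketched in Python roughly as follows. This is not the original script: the function names, the ordering of the steps, and the handling of a leftover “deleteMe” are my own assumptions. The only VirtualBox pieces it relies on are the `VBoxManage snapshot <vm>` subcommands `take`, `delete`, and `edit ... --name`.

```python
import subprocess


def rotation_commands(vm, existing):
    """Build the VBoxManage calls for one backup run (hypothetical helper).

    `vm` is the VM name; `existing` is the set of snapshot names it has now.
    """
    base = ["VBoxManage", "snapshot", vm]
    cmds = []
    if "deleteMe" in existing:
        # Assumption: clean up a leftover from an earlier, interrupted run.
        cmds.append(base + ["delete", "deleteMe"])
    if "previous" in existing:
        # The old "previous" becomes the snapshot to fold in.
        cmds.append(base + ["edit", "previous", "--name", "deleteMe"])
    if "current" in existing:
        # The old "current" becomes "previous".
        cmds.append(base + ["edit", "current", "--name", "previous"])
    # New restore point; from here on, writes go to a fresh diff file.
    cmds.append(base + ["take", "current"])
    if "previous" in existing:
        # Fold the old "previous" into its parent and forget the restore point.
        cmds.append(base + ["delete", "deleteMe"])
    return cmds


def run_rotation(vm, existing):
    # check=True stops the rotation at the first failing VBoxManage call.
    for cmd in rotation_commands(vm, existing):
        subprocess.run(cmd, check=True)
```

Keeping the command list separate from the code that runs it makes it easy to print the plan and look it over before letting it touch a real VM. The `existing` set would come from something like `VBoxManage snapshot <vm> list`, which is left out here.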
This strategy lets us keep 2 restore points (in case a backup accidentally captures an unstable image). It’s a great strategy, but…
Snapshots — Tread Carefully
Sadly, for me personally, live VirtualBox snapshots haven’t been a terribly robust backup strategy. I’ve seen several snapshots fail or, worse, had VirtualBox crash while a snapshot was taking place. Luckily I haven’t lost a lot of data, as I’ve been diligent about pushing my code to GitHub. When a snapshot has failed, I’ve had to edit my virtual machine’s vbox.xml file, a file that clearly states “DO NOT EDIT” at the top. It’s easy to be lulled into thinking that a snapshot should either succeed or fail atomically, like committing to a versioning system. That hasn’t been my experience.
Here’s a gallery of horrors of some of the errors I’ve seen. First there’s the “A differencing image of snapshot could not be found” error, where somehow a snapshot image file gets lost.
I’ve also encountered this error — “Hard Disk XXX cannot be directly attached to the virtual machine because it has 1 differencing child hard disks”. I’ve had errors taking snapshots, including having the snapshot process hang with a live VM. Sadly I can’t say I trust live snapshots right now.
I’ve reverted to a simpler, non-live backup strategy that only takes snapshots immediately before starting up my virtual machine and won’t take a snapshot while VirtualBox.exe is running (it asks you to close VirtualBox before continuing). This is a combination of the script above reworked into Python on Windows and this Python ActiveState recipe. I’ve replaced the VirtualBox icon pinned to the taskbar with a batch file that runs my script, and given it the default VirtualBox icon.
I also only ever have one snapshot in use, “current”. Before launching VirtualBox, current gets compacted into the main vdi image via a snapshot delete. I then take a new “current” snapshot, which VirtualBox uses. That way I have only one differencing image active at a time, and the main vdi gets updated right before the VM launches. Therefore, when CrashPlan backs up that folder, it should be backing up a stable vdi that’s not constantly changing. So far, this seems to be the best solution for me for maintaining a trustworthy backup strategy without weird errors that crash my live VM.
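A minimal sketch of that pre-launch routine might look like the following. It is not my actual script: the helper names are invented for this example, the check assumes the Windows `tasklist` command, and starting the VM with `VBoxManage startvm` is an assumption about how you want to launch it.

```python
import subprocess


def virtualbox_running(tasklist_output):
    """Return True if `tasklist` output mentions VirtualBox.exe."""
    return "virtualbox.exe" in tasklist_output.lower()


def prelaunch_commands(vm, has_current):
    """Build the VBoxManage calls to run before the VM starts (hypothetical)."""
    base = ["VBoxManage", "snapshot", vm]
    cmds = []
    if has_current:
        # Fold the old differencing image into the main vdi.
        cmds.append(base + ["delete", "current"])
    # Take a fresh snapshot so the main vdi stays untouched while the VM runs.
    cmds.append(base + ["take", "current"])
    # Assumption: the VM is started from the command line.
    cmds.append(["VBoxManage", "startvm", vm])
    return cmds


def launch(vm, has_current):
    # Windows-only check; tasklist prints an INFO line when nothing matches.
    listing = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq VirtualBox.exe"],
        capture_output=True, text=True, check=True,
    ).stdout
    if virtualbox_running(listing):
        raise SystemExit("Close VirtualBox before continuing.")
    for cmd in prelaunch_commands(vm, has_current):
        subprocess.run(cmd, check=True)
```

A batch file pinned to the taskbar could then call something like `launch("my-ubuntu-vm", has_current=True)`, where the VM name is a placeholder.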
Anyway, I’d definitely be curious to hear about your experiences backing up VMs!