Duplicity and Amazon Glacier
Posted 20 January 2015
For ages I've had a NAS at home for storing photos, DVDs and years' worth of random files and data. I've got a second NAS, physically away from the main one, that backs up the first. It's worked fine, but it's a pretty basic solution, which I've been looking to improve. I've finally got my act together and got a Raspberry Pi running Duplicity. I've also been experimenting with Amazon Glacier as a way to send the full backups off site.
I heard about Duplicity a while back; it's a relatively simple tool that's got lots of useful features. The main one I was after (besides being able to do full and incremental backups) is the ability to split the backup into little 'blobs' and encrypt each blob (using GPG). Whilst I don't suppose I have anything super secret to back up, I don't like the idea of sending anything off "to the cloud" without encrypting it. Duplicity handles that very well, and as it turns out is insanely easy to get started with too. After getting the primary and backup filesystems mounted onto the Pi, it took only minutes to get going with a (full) backup. The Pi isn't very good at encryption though, so that part of the process does seem to slow things down a bit.
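To give a rough idea of what kicking off a backup looks like, here's a minimal Python sketch that builds a Duplicity command line for a full or incremental run. The paths, the GPG key ID and the helper name `duplicity_command` are placeholders made up for illustration, not my actual setup; `--encrypt-key` and `--volsize` are standard Duplicity options.

```python
from typing import List, Optional


def duplicity_command(mode: str, source: str, target_url: str,
                      gpg_key: str,
                      volsize_mb: Optional[int] = None) -> List[str]:
    """Build the argument list for a duplicity full or incremental backup.

    All values passed in are assumptions supplied by the caller; this
    helper only assembles the command, it does not run it.
    """
    if mode not in ("full", "incremental"):
        raise ValueError("mode must be 'full' or 'incremental'")
    # --encrypt-key selects the GPG key used to encrypt each volume
    cmd = ["duplicity", mode, "--encrypt-key", gpg_key]
    if volsize_mb is not None:
        # --volsize sets the size of each archive volume ('blob') in MB
        cmd += ["--volsize", str(volsize_mb)]
    cmd += [source, target_url]
    return cmd
```

Something like `subprocess.run(duplicity_command("full", "/mnt/primary", "file:///mnt/backup/duplicity", "ABCD1234"), check=True)` would then run it, assuming Duplicity is installed and the key is in the keyring.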
Amazon Glacier is an interesting service too (if you haven't kept up with Amazon's dozens of products, it's cold-storage for files, something like a cloudy tape backup service). Putting things into Glacier is cheap too - much cheaper than any other data storage solution I've found. Getting data out of Glacier will cost you a bit, although even that isn't ruinous (it does take hours to retrieve even the smallest file though). I guess the intention is that you push your data into the Glacier and for the most part just delete it when it gets old, rather than needing to retrieve it all again very often.
When I first heard about Glacier, the tools available were pretty basic. These days there are quite a few different options available. I opted for a command line Java client (glacier-cli). It seems to support the whole API, although it's pretty slow (not that speed is particularly important - the whole service is pretty slow!). A simple Bash script is enough to start copying the Duplicity archive files up to Glacier.
Glacier bundles files together into 'vaults'. You can have up to 1000 vaults, and each can have a huge number of files in it. You can get an inventory of a vault (although it takes about 4 hours), but all Glacier remembers about your files is a long ID number (and maybe some metadata if you supplied some). All future use of those files needs the ID number, so that means you've got to remember which file got which ID number, and which IDs ended up in which vault. Indeed, I created a test vault and threw a couple of files into it, but couldn't delete the vault until I'd deleted the files inside it, for which I needed to run an inventory to get the IDs. None of this happens quickly (in future, I'm going to budget days to complete jobs like this).
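Since Glacier only hands back an ID, the bookkeeping has to live locally. Here's a small sketch of one way to do it in Python: a JSON file mapping each local filename to its vault and archive ID. The class name `ArchiveManifest` and the file layout are my own invention for illustration; nothing here talks to Glacier itself.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


class ArchiveManifest:
    """Local record of which file went to which vault under which ID."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.entries: Dict[str, Dict[str, str]] = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.entries = json.load(fh)

    def record(self, filename: str, vault: str, archive_id: str) -> None:
        """Remember the ID returned for an upload and save it to disk."""
        self.entries[filename] = {"vault": vault, "archive_id": archive_id}
        self._save()

    def lookup(self, filename: str) -> Optional[Dict[str, str]]:
        """Return the vault and archive ID for a file, or None if unknown."""
        return self.entries.get(filename)

    def archive_ids_in(self, vault: str) -> List[str]:
        """List every recorded archive ID in a vault (e.g. before deleting it)."""
        return sorted(e["archive_id"] for e in self.entries.values()
                      if e["vault"] == vault)

    def _save(self) -> None:
        # Write to a temp file first, then rename, so a crash mid-write
        # cannot leave a half-written manifest behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh, indent=2, sort_keys=True)
        os.replace(tmp, self.path)
```

With something like this in place, emptying a test vault would mean calling `archive_ids_in("test")` rather than waiting hours for an inventory, provided every upload was recorded at the time.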
Going forwards, I think a proper 'daemon' that runs the backups, pushes files to Glacier and then deletes the backups would do rather better than a couple of cheesy Bash scripts. There's a good-looking CPAN module now for Glacier interaction, so I may well be writing some Perl to do this in the near future. Keeping track of the files that have been uploaded to Glacier is pretty important, as is deleting the local copies only once they've been sent up successfully (and aren't needed by Duplicity). Whilst I've got some representation of those features in my Bash scripts, I still think something a bit more solid would be better in the long term.
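The core rule for such a daemon (upload first, delete locally only on success, and never delete what Duplicity still needs) can be sketched in a few lines. This is Python rather than the Perl I have in mind, and it's only an illustration: `sync_pass`, the `upload` callable and the `still_needed` set are all hypothetical names, with the real Glacier client and the "what does Duplicity still need" logic left to the caller.

```python
import os
from typing import Callable, Dict, List, Set


def sync_pass(directory: str,
              upload: Callable[[str], str],
              uploaded: Dict[str, str],
              still_needed: Set[str]) -> List[str]:
    """Run one pass over the backup directory.

    upload(path) is assumed to return the archive ID on success and to
    raise on failure. uploaded maps filename -> archive ID and is
    updated in place. Returns the names of local files deleted.
    """
    deleted: List[str] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        if name not in uploaded:
            try:
                uploaded[name] = upload(path)
            except Exception:
                # Upload failed: keep the local copy and retry next pass.
                continue
        # Only delete files that are safely uploaded AND not needed
        # locally (e.g. by the next incremental backup).
        if name not in still_needed:
            os.remove(path)
            deleted.append(name)
    return deleted
```

A real daemon would persist `uploaded` between runs and call this on a timer; the point of the sketch is just the ordering of upload, record and delete.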