Tuesday, August 2, 2011

backup strategy model

a couple years ago i wrote up a formalization of my backup strategy, but wasn't sure where i wanted to publish it. i finally decided to just stick it in my blog:

this document aims to describe a simple-to-implement and comprehensive backup strategy, including some essential capabilities for encryption, deltas, delta retention, large files, remote backups (geo-diversity), and diverse storage device sizes.

the target audience is relatively sophisticated users with a moderate level of familiarity with backup technologies. it is assumed that the user has physical access (sneaker-net-able) to a remote, network-connected site or sites on a fairly regular basis (e.g. commutes to work), is able to attach enough storage (disk) there to contain the data sets plus deltas, and has access to a portable mass storage device (e.g. a usb hard drive) for sneaker-net operations.

it is my belief that even with familiarity with backup technology, it is a challenge to define a good strategy that is safe and readily applicable to a variety of situations.

the general method of this document is to identify the essential properties of data sets, then of backup tactics, and then to define a simple formula for mapping one to the other in a way that optimizes pragmatic costs, resulting in an overall best-practice strategy.

data set properties

there are 4 of them: base size, delta size, update frequency, and sensitivity.

the first 2 are based on the size of the data. these days, in terms of remote backups, data sizes generally fall into two buckets: large and small. data is large if it is too much to upload to a remote site over the cloud in a reasonable amount of time (hours). otherwise, data is small. upload time should take into consideration a throttled connection, since backup is likely sharing a network link with other normal network traffic. this large/small distinction applies in the sense of the total size of the data set, as well as in the sense of the deltas to be backed up (i.e. are changes relatively large or small). a small data set may be something like a source code tree, and a large one may be something like an mp3 collection. a small delta may be something like adding a single mp3 to the collection, and a large delta may be something like mythtv capturing several gigs' worth of video to a video folder.
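to put a rough number on "large" (the link speed here is just an illustrative assumption): if backup uploads are throttled to about 1 MB/s so they don't crowd out normal use of the connection, then a few hours of uploading covers roughly

    1 MB/s x 3600 s/hour x 4 hours ≈ 14 GB

so by that yardstick a source tree or documents folder is comfortably small, while a multi-hundred-gig media collection (or a delta of several gigs of captured video) is clearly large.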

update frequency also falls into a couple of buckets in terms of remote backup: rare and common. rare is relative to delta retention time... generally a retention time of 3-6 months is more than adequate for most backup situations... if a change has gone 6 months without the need for reversion, it is generally safe to prune its delta. so, if a set is updated less than once every 3-6 months, we'd classify it as rarely updated.

sensitivity is certainly the most subjective data set property, and can potentially have a range of classes that is very complex... it can depend on who "owns" the data, how old it is, who it is safe to be exposed to, the nature of the environment in which a remote backup site resides, and perhaps several other factors. for the purposes of this document, we will define it fairly simply: a data set is sensitive if the user deems it unfit for un-encrypted exposure to a remote environment. for example, if the remote environment is an office, it is probably safe to back up your mp3 collection there without encrypting it (you probably want to listen to it there anyways). but perhaps your resume backup should be encrypted. if the remote environment is a public webhosting server, you might need to encrypt your mp3 collection to avoid illegally exposing/publicizing copyrighted material, but it's probably ok to back up your resume there. so given a remote site, a backup set is either sensitive or it is not... again, a 2-bucket property. one more note about sensitivity: a paranoid person may prefer to encrypt all remote backups. while this is certainly in the realm of possibility, it should be understood that encrypting does exact some sacrifices, e.g. the accessibility of an mp3 collection as described above, as well as some other technical, resource, and management burdens, which will be described later in this document.

so, there are 4 essential properties of a data set, each of which is basically binary in nature... so on the surface it seems as though this should result in 16 different data set profiles... but a small base size implies small change sizes, so 4 of these combinations are actually bogus... therefore, there are 12 essential data set personalities. at first this seems overly complex, but in the next sections we will discover that there are only 3 useful backup tactics, and certain properties dominate others relative to those tactics, so the optimal decision about how to back up a given set is actually relatively simple.

backup tactic properties

there are essentially 2 properties of a backup tactic: is it remote or local, and is it encrypted or clear... the definitions of those properties are obvious.

once again, we have a set of binary properties... 2 of them, so it would seem as though this should result in 4 backup tactic profiles, but actually an encrypted backup implies a remote one. there is no point in encrypting a data set and storing it in the source/local environment, since the risk exposure has not really changed. that leaves 3 useful tactics: local/clear, remote/clear, and remote/encrypted.

reasoning/mapping the strategy

the essentials:

in general, if something is worth backing up, it is worth backing up remotely. however, this is not always possible with certain data set properties. specifically, sets with frequently updating large deltas are only suitable for local backups.

sensitive backups should obviously be encrypted, assuming they are not frequent large deltas, in which case they can only be backed up locally as stated above, which furthermore implies that there is no point in encrypting, as stated further above.

the tool, resource, and device constraints:

this document assumes tools are differential, meaning most backups are not full backups but just delta backups, vastly conserving storage and bandwidth. large data sets can and should be backed up remotely (assuming they are not frequent large deltas). of course the full backups will consume storage, but more importantly bandwidth... this is where the portable mass storage device comes in... sneaker-netting is probably the only channel with enough bandwidth. obviously the device should be large enough for a full backup of the data set, so this is one constraint on data set sizes. if you have a data set larger than commonly available drive sizes, you will want to try to divide it into smaller sets.

i use 2 tools for backups: duplicity for encrypted backups and rdiff-backup for clear ones. they are both differential and remote-over-ssh capable. your remote site should be able to initiate or terminate an ssh channel.
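to give a feel for the two tools, a clear rdiff-backup run and an encrypted duplicity run over ssh look roughly like this (the hostname, paths, and gpg key id are placeholders, not from any particular setup):

    # clear, reverse-differential backup of a source tree over ssh
    rdiff-backup /home/me/src user@remotesite::/backups/src

    # encrypted, forward-differential backup over sftp; only ciphertext
    # (encrypted to the given gpg key) ever lands on the remote site
    duplicity --encrypt-key ABCD1234 /home/me/documents sftp://user@remotesite/backups/documents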

duplicity is a traditional differential scheme... differentials generally consume over twice as much storage as the source. someone familiar with differentials should understand that this is because storage for 2 full backups is needed to leapfrog the retention periods, plus storage for the deltas/incrementals between the fulls. this has major effects on its suitability for remote backing up of a data set: if the data set is large, you will be forced to routinely sneaker full backups to the remote site (which may not be a burden you want to bear on a regular basis), or (better) you will need to try to divide the set into a large but stable portion and a smaller dynamic portion. the large portion will only consume storage for a single full backup, and will only require a single sneaker operation, saving both storage and sneaker effort. the smaller dynamic portion will not require routine sneaker-net, and also marginalizes the issue of the backup consuming 2x the storage of the data set, since this portioned data set will be small.
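for the smaller dynamic portion of a split set, the duplicity side might be managed roughly like this (the paths, url, key id, and intervals are placeholders, using the 6-month retention example from above):

    # routine run on the dynamic portion: incrementals between fulls,
    # with a new full forced at the start of each retention period
    duplicity --encrypt-key ABCD1234 --full-if-older-than 6M \
        /home/me/documents sftp://user@remotesite/backups/documents

    # keep the current chain and the previous one (the "2 fulls" that
    # leapfrog the retention window) and delete anything older
    duplicity remove-all-but-n-full 2 --force sftp://user@remotesite/backups/documents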

rdiff-backup is a "reverse-differential" tool, meaning that rather than starting with a full backup of a data set and storing the deltas for each backup cycle needed to bring it up to date, rdiff-backup always keeps the backup set up to date, and instead stores the deltas needed to back-date it. this accomplishes 4 important things:

1) traditional forward-differential solutions require routine full backups... with rdiff-backup, you need to take only one full backup and thereafter all backups are only delta backups. this means the backup only consumes 1x the storage of the data set plus the deltas, and more importantly will only require a single sneaker operation for the initial full backup (sketched below).

2) the most common needs for restore from backup are those "oh shit" moments, like accidentally deleting the wrong thing or performing some irreversible operation. you usually realize it right away. in these cases, you generally want the most recent backup. in rdiff-backup, the most recent backup is the most trivial one to restore, since no deltas need be applied.

3) there is never a need to subdivide a large data set based on stable and dynamic portions the way you may need with duplicity.

4) each cycle will expire a single delta, spreading out the disk io of deletion over time, rather than doing a huge deletion each time a retention period passes.

(one disadvantage of reverse-differential is that it does not work for tape, since random updates of the full are not possible. but with disk sizes today, who wants to mess with tape? unfortunately amazon s3 also fits this pattern... data cannot be updated, only created/deleted... we can still use s3 for small sets though)
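to make that concrete, a minimal rdiff-backup sketch covering the single sneaker-netted full, the routine delta backups, retention pruning, and an "oh shit" restore might look roughly like this (hostnames, mount points, and paths are placeholders; it assumes the portable drive is mounted at /mnt/usb, and relies on rdiff-backup keeping its metadata inside the backup directory itself, so the seeded repository can simply be moved):

    # initial full backup of the large set onto the portable drive
    # (this is the single sneaker operation)
    rdiff-backup /data/mp3 /mnt/usb/backups/mp3

    # at the remote site, with the drive attached: copy the repository
    # onto the remote site's storage
    rsync -a /mnt/usb/backups/mp3/ /backups/mp3/

    # routine runs from home then send only deltas over ssh
    rdiff-backup /data/mp3 user@remotesite::/backups/mp3

    # prune deltas older than the retention window (6 months here)
    rdiff-backup --remove-older-than 6M user@remotesite::/backups/mp3

    # "oh shit" restore: pull back the most recent copy of a deleted file
    rdiff-backup -r now user@remotesite::/backups/mp3/album/track.mp3 /data/mp3/album/track.mp3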

after having considered the above, we can decide which tactic makes the most sense for a given set, and perhaps whether to refactor/divide the sets. here is a table listing the optimal tactic for each possible data set personality:

frequent updates
| large changes
| | large set
| | | sensitive
| | | |
0 0 0 0 remote/clear
0 0 0 1 remote/encrypted
0 0 1 0 remote/clear
0 0 1 1 remote/encrypted (subdivided or routine-sneaker)
0 1 0 0 bogus - small set would have relatively small changes
0 1 0 1 bogus - small set would have relatively small changes
0 1 1 0 remote/clear
0 1 1 1 remote/encrypted (subdivided or routine-sneaker)
1 0 0 0 remote/clear
1 0 0 1 remote/encrypted
1 0 1 0 remote/clear
1 0 1 1 remote/encrypted
1 1 0 0 bogus - small set would have relatively small changes
1 1 0 1 bogus - small set would have relatively small changes
1 1 1 0 local/clear
1 1 1 1 local/clear

on inspection, we can see that there are 2 dominating patterns, and the decision flow is simple enough: if updates are large and frequent, the backup should be local/clear (rdiff-backup). otherwise, the backup should be remote, clear or encrypted based on the sensitivity of the data set (rdiff-backup or duplicity, respectively).
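for what it's worth, that decision flow is small enough to write down as a toy shell function (purely illustrative):

    # hypothetical helper: map the two dominating properties to a tactic
    # usage: choose_tactic <frequent_large_deltas yes|no> <sensitive yes|no>
    choose_tactic() {
        if [ "$1" = "yes" ]; then
            echo "local/clear (rdiff-backup)"
        elif [ "$2" = "yes" ]; then
            echo "remote/encrypted (duplicity)"
        else
            echo "remote/clear (rdiff-backup)"
        fi
    }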

conclusion:

once you understand the essential properties of data sets and backup tactics, modern technology and resources commonly available to a relatively sophisticated user allow for a simple implementation of a pragmatically effective, best-practice backup strategy.
