Tuesday, August 2, 2011

backup strategy model

a couple years ago i wrote up a formalization of my backup strategy, but wasn't sure where i wanted to publish it. i finally decided to just stick it in my blog:

this document aims to describe a simple-to-implement and comprehensive backup strategy, including some essential capabilities for encryption, deltas, delta retention, large files, remote backups (geo-diversity), and diverse storage device sizes.

the target audience is relatively sophisticated users with a moderate level of familiarity with backup technologies. it is assumed that the user has physical access (sneaker-netable) to a remote, network-connected site or sites on a fairly regular basis (e.g. commutes to work), is able to attach enough storage (disk) there to contain the data sets plus deltas, and has access to a portable mass storage device (e.g. usb hard drive) for sneaker-net operations.

it is my belief that even with familiarity with backup technology, it is a challenge to define a good strategy that is safe and readily applicable to a variety of situations.

the general method of this document is to identify the essential properties of data sets, then of backup tactics, and then to define a simple formula for mapping one to the other in a way that optimizes pragmatic costs, resulting in an overall best-practice strategy.

data set properties

there are 4 of them: base size, delta size, update frequency, and sensitivity.

the first 2 are based on the size of the data. these days, in terms of remote backups, data sizes generally fall into two buckets: large and small. data is large if it is too much to upload to a remote site over the cloud in a reasonable amount of time (hours). otherwise, data is small. upload time should take into consideration a throttled connection, since backup is likely sharing a network link with other normal network traffic. this large/small distinction applies to the total size of the data set as well as to the deltas to be backed up (i.e. are changes relatively large or small). a small data set may be something like a source code tree, and a large one may be something like an mp3 collection. a small delta may be something like adding a single mp3 to the collection, and a large delta may be something like mythtv capturing several gigs worth of video to a video folder.
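to put some rough numbers on the large/small cutoff, here's a quick back-of-the-envelope sketch (the 1 mbit/s throttled uplink and the 4 hour threshold are just illustrative assumptions, plug in your own):

  # rough classification of a data set as "large" or "small" for remote backup,
  # based on how long a full upload would take over a throttled link.
  def upload_hours(size_gb, uplink_mbps=1.0):
      """hours to push size_gb over an uplink throttled to uplink_mbps."""
      bits = size_gb * 8 * 1024 ** 3            # gigabytes -> bits
      seconds = bits / (uplink_mbps * 10 ** 6)  # mbit/s -> bits per second
      return seconds / 3600

  def is_large(size_gb, uplink_mbps=1.0, threshold_hours=4):
      return upload_hours(size_gb, uplink_mbps) > threshold_hours

  print(upload_hours(1))    # a ~1 gig source tree: ~2.4 hours -> small
  print(upload_hours(100))  # a ~100 gig mp3 collection: ~240 hours -> large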

update frequency also falls into a couple of buckets in terms of remote backup: rare and common. rare is relative to delta retention time... generally a retention time of 3-6 months is more than adequate for most backup situations... if a change has gone 6 months without the need for reversion, it is generally safe to prune its delta. so, if a set is updated less than once every 3-6 months, we'd classify it as rarely updated.

sensitivity is certainly the most subjective data set property, and can potentially have a range of classes that is very complex... it can depend on who "owns" the data, how old it is, who it is safe to be exposed to, the nature of the environment in which a remote backup site resides, and perhaps several other factors. for the purposes of this document, we will define it fairly simply: a data set is sensitive if the user deems it unfit for un-encrypted exposure to a remote environment. for example, if the remote environment is an office, it is probably safe to back up your mp3 collection there without encrypting it (you probably want to listen to it there anyways). but perhaps your resume backup should be encrypted. if the remote environment is a public webhosting server, you might need to encrypt your mp3 collection to avoid exposing/publicizing copyrighted material illegally, but it's probably ok to back up your resume there. so given a remote site, a backup set is either sensitive or it is not... again, a 2-bucket property. one more note about sensitivity: a paranoid person may prefer to encrypt all remote backups. while this is certainly in the realm of possibility, it should be understood that encrypting does exact some sacrifice, e.g. accessibility of an mp3 collection as described above, as well as some other technical, resource, and management burdens, which will be described later in this document.

so, there are 4 essential properties of a data set, each of which is basically binary in nature... so on the surface it seems as though this should result in 16 different data set profiles... but actually a small base size implies small change sizes, so 4 of these combinations are actually bogus... therefore, there are 12 essential data set personalities. at first this seems overly complex, but in the next sections we will discover that there are only 3 useful backup tactics, and certain properties dominate others relative to those tactics, so the optimal decision about how to back up a given set is actually relatively simple.
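just to sanity check that count, a throwaway enumeration (nothing here is load-bearing, it's the same filter used in the table further below):

  from itertools import product

  # enumerate the 2^4 = 16 combinations of the binary properties, dropping the
  # 4 bogus ones where the base size is small but the deltas are large.
  personalities = [
      (frequent, large_delta, large_set, sensitive)
      for frequent, large_delta, large_set, sensitive in product((0, 1), repeat=4)
      if not (large_delta and not large_set)
  ]
  print(len(personalities))  # 12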

backup tactic properties

there are essentially 2 properties of a backup tactic: is it remote or local, and is it encrypted or clear... the definitions of those properties are obvious.

once again, we have a set of binary properties... 2 of them, so it would seem as though this should result in 4 backup tactic profiles, but actually an encrypted backup implies it is remote. there is no point in encrypting a data set and storing it on the source/local environment, since the risk exposure has not really changed. that leaves the 3 useful tactics: remote/clear, remote/encrypted, and local/clear.

reasoning/mapping the strategy

the essentials:

in general, if something is worth backing up, it is worth backing up remotely. however, this is not always possible with certain data set properties. specifically, sets with frequent large deltas are only suitable for local backups.

sensitive backups should obviously be encrypted, assuming they do not have frequent large deltas, in which case they can only be backed up locally as stated above, which in turn implies there is no point in encrypting, as stated further above.

the tool, resource, and device constraints:

this document assumes tools are differential, meaning most backups are not full backups but just delta backups, greatly conserving storage and bandwidth. large data sets can and should be backed up remotely (assuming they do not have frequent large deltas). of course the full backups will consume storage, but more importantly bandwidth... this is where the portable mass storage device comes in... sneaker-netting is probably the only channel with enough bandwidth. obviously the device should be large enough for a full backup of the data set, so this is one constraint on data set sizes. if you have a data set larger than commonly available drive sizes, you will want to try to divide it into smaller sets.

i use 2 tools for backups, duplicity for encrypted backups and rdiff-backup for clear. they are both differential and remote-over-ssh capable. your remote site should be able to initiate or terminate an ssh channel.
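just to give a flavor of what that looks like, here's a minimal sketch of invoking each one against a remote site over ssh (the hostnames and paths are made up, and you should double check the backend/url syntax against the man pages for your versions):

  import os
  import subprocess

  # clear backup of a non-sensitive set with rdiff-backup (reverse-differential);
  # the user@host::path form runs the remote side over ssh.
  subprocess.check_call([
      "rdiff-backup", "/home/me/src", "me@remotehost::/backups/src",
  ])

  # encrypted backup of a sensitive set with duplicity (gpg-encrypted volumes);
  # duplicity picks up a symmetric passphrase from the PASSPHRASE environment
  # variable (or use --encrypt-key with a gpg key id). don't hard-code a real
  # passphrase like this outside of an example.
  env = dict(os.environ, PASSPHRASE="example-passphrase")
  subprocess.check_call(
      ["duplicity", "/home/me/documents", "sftp://me@remotehost//backups/documents"],
      env=env,
  )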

duplicity is a traditional differential scheme... differentials generally consume over twice as much storage as the source. someone familiar with differentials should understand that this is because storage for 2 full backups is needed to leapfrog the retention periods, plus storage for the deltas/incrementals between the fulls. this has major effects on its suitability for remote backing up of a data set: if the data set is large, you will be forced to routinely sneaker full backups to the remote site (which may not be a burden you want to bear on a regular basis), or (better) you will need to try to divide the set into a large but stable portion and a smaller dynamic portion. the large portion will only consume storage for a single full backup, and will only require a single sneaker operation, saving both storage and sneaker effort. the smaller dynamic portion would not require routine sneaker-net, and also marginalizes the issue of the backup consuming 2x the storage of the data set since this portioned data set will be small.
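concretely, the routine cycle for the small dynamic portion might look something like this (hypothetical names again, and the option/time-format details are worth checking against your duplicity version):

  import os
  import subprocess

  env = dict(os.environ, PASSPHRASE="example-passphrase")
  target = "sftp://me@remotehost//backups/documents"  # the small dynamic portion

  # normally an incremental, but force a fresh full every 3 months so the
  # retention window can leapfrog... this is why the backup ends up holding
  # roughly 2 fulls plus the deltas between them.
  subprocess.check_call(
      ["duplicity", "--full-if-older-than", "3M", "/home/me/documents", target],
      env=env,
  )

  # keep only the 2 most recent fulls (and their incrementals), pruning
  # anything older... with fulls every 3 months this retains roughly 3-6
  # months of deltas.
  subprocess.check_call(
      ["duplicity", "remove-all-but-n-full", "2", "--force", target],
      env=env,
  )

  # the large stable portion (e.g. the mp3 collection) just gets a one-time
  # full backup, sneaker-netted to the remote site on the portable drive.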

rdiff-backup is a "reverse-differential" tool, meaning that rather than starting with a full backup of a data set and storing the deltas for each backup cycle needed to bring it up to date, rdiff-backup always keeps the backup set up to date, and instead stores the deltas needed to back-date it. this accomplishes 4 important things:

1) traditional forward-differential solutions require routine full backups... with rdiff-backup, you need to take only one full backup and thereafter all backups are only delta backups. this means the backup only consumes 1x the storage of the data set plus the deltas, and more importantly will only require a single sneaker operation for the initial full backup.

2) the most common needs for restore from backup are those "oh shit" moments, like accidentally deleting the wrong thing or performing some irreversible operation. you usually realize it right away. in these cases, you generally want the most recent backup. in rdiff-backup, the most recent backup is the most trivial one to restore, since no deltas need be applied.

3) there is never a need to subdivide a large data set based on stable and dynamic portions the way you may need with duplicity.

4) each cycle will expire a delta, spreading out the disk io for deletion rather than doing a huge deletion each time a retention period passes. a small sketch of this backup-plus-prune cycle follows below.

(one disadvantage of reverse-differential is it does not work for tape, since random updates of the full are not possible. but with disk sizes today, who wants to mess with tape? unfortunately amazon s3 also fits this pattern... data cannot be updated, only created/deleted... we can still use s3 for small sets though)
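and for comparison, the rdiff-backup cycle (clear, local or remote) is about as simple as it gets... a sketch with made-up paths:

  import subprocess

  # each run after the initial full is just a delta; then prune whatever has
  # aged out of the retention window. drop the host part for a local backup.
  dest = "me@remotehost::/backups/video"

  subprocess.check_call(["rdiff-backup", "/home/me/video", dest])

  # expire increments older than 6 months (rdiff-backup wants --force if more
  # than one increment has aged out since the last prune).
  subprocess.check_call(["rdiff-backup", "--remove-older-than", "6M", dest])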

after having considered the above, we'll decide which tactic makes the most sense for a given set, and perhaps whether you want to refactor/divide the sets. here is a table listing the optimal tactic for each possible data set personality:

frequent updates
| large changes
| | large set
| | | sensitive
| | | |
0 0 0 0 remote/clear
0 0 0 1 remote/encrypted
0 0 1 0 remote/clear
0 0 1 1 remote/encrypted (subdivided or routine-sneaker)
0 1 0 0 bogus - small set would have relatively small changes
0 1 0 1 bogus - small set would have relatively small changes
0 1 1 0 remote/clear
0 1 1 1 remote/encrypted (subdivided or routine-sneaker)
1 0 0 0 remote/clear
1 0 0 1 remote/encrypted
1 0 1 0 remote/clear
1 0 1 1 remote/encrypted
1 1 0 0 bogus - small set would have relatively small changes
1 1 0 1 bogus - small set would have relatively small changes
1 1 1 0 local/clear
1 1 1 1 local/clear

on inspection, we can see that there are 2 dominating patterns, and the decision flow is simple enough: if updates are large and frequent, the backup should be local/clear (rdiff-backup). otherwise, the backup should be remote, clear or encrypted based on the sensitivity of the data set (rdiff-backup or duplicity, respectively).
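for the scripting-inclined, the whole table collapses into a few lines (the property names are just mine):

  # the decision table, collapsed. inputs are the 4 binary data set properties;
  # the tactic implies the tool: rdiff-backup for clear, duplicity for encrypted.
  def tactic(frequent, large_delta, large_set, sensitive):
      if large_delta and not large_set:
          raise ValueError("bogus: a small set implies small changes")
      if frequent and large_delta:
          return "local/clear"
      return "remote/encrypted" if sensitive else "remote/clear"

  print(tactic(frequent=True, large_delta=True, large_set=True, sensitive=True))    # local/clear
  print(tactic(frequent=True, large_delta=False, large_set=True, sensitive=False))  # remote/clear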

conclusion:

once you understand the essential properties of data sets and backup tactics, modern technology and resources commonly available to a relatively sophisticated user allow for a simple implementation of a pragmatically effective best practice backup strategy.

Saturday, July 30, 2011

my gym laptop - some commentary about building a touchscreen video viewing platform to use at the gym.

so i have the same story i suspect many folks do regarding fitness... i knew it was time to get my rear to work on it, but the gym was boring, running was too hard on my knees, cycling seemed a little dangerous and weather dependent, etc etc excuses excuses. for a long time i dreamed the solution would be having a room in my house to put an elliptical machine in, where i could watch my own programs, both to pass the time and to sort of kill 2 birds with one stone. this was reinforced by the idea that visual stimulus really helps make workouts go by faster. the problem that it probably took me too long to acknowledge is that, living in san francisco (and reluctant to leave), i wasn't likely to have this extra workout room in my house any time soon.

i finally came around to the idea that instead of bringing an elliptical into my house, i should be able to take my videos to the gym. in fact i have a gym a block from my house, one of the nice tradeoffs of having little personal space due to living in a dense area. in 2009 when i did this project, it felt like a particularly opportune project idea, with lots of technologies finally becoming available at what felt like an exciting breakthrough time.

the solution may seem a little more obvious to others than it did to me. some might say iphone, but i don't think the screen size cuts it, either for viewing or for control from a bouncy situation. some might say ipad, but i doubt the storage or locked-down environment would work. in either case, i doubt real quality roaming streaming capability is there. also i wanted to be able to sync all my content in a way that integrated with my sofa watching experience... random access and delete-after-consuming both at the sofa and at the gym, and that would take decent storage.

i decided to build the ui based on a web browser, which i have a lot of experience coding to. but there were still a lot of things i felt i needed, and was fortunate to have at the right time:

  • a capacitive touch interface with a decent screen size. it needed to work with linux also. i decided on the hp tx2 tablet laptop, which i am very happy with, and some smart folks in the ubuntu community had just finished the work of figuring out how to drive it in xwindows. the tablet mode is very nice, i can lean it up in the magazine holder on whichever machine i'm using, and adjust the viewing angle as needed depending on the holder's position. but, the tx2 is not without its issues. only one of the three bezel buttons seems to work under linux for me... better than none tho. the removable dvd drive catch broke, so i don't really have one anymore. why it was removable in the first place is a mystery... i called hp and the dvd drive is the only option available for that bay. it doesn't want to power up without a power plug, but after that i can remove the plug and run on battery normally. the touchscreen is really very good, but does have some phantom jumping and sometimes clicking when the screen displays a lot of moving dark contrast... not usually a problem though.

  • good amount of storage. these days a 640G 2.5 inch drive to go in a laptop can hold a very respectable amount of video for less than $100.

  • a compositing window manager (i felt the right control design was a transparent one, which would accommodate large buttons for bouncy fat-finger manipulations, without encumbering video display real estate). i switched from fvwm (my wm of 11 years) to compiz, and the switch was shockingly painless even with my extensive old school xwindows desktop customizations. it also had the right features i needed to programmatically control the window opacity (which i did with compiz display rules based on window titles, and i was able to manipulate the window titles with vanilla html title fields).

  • file synchronization (i wanted the exact same content available on the elliptical as on my sofa, and deletable in either place). i happened on the relatively new csync, which is a sort of efficient bidirectional rsync. it fit better than unison, a different synchronizer i love but which is more suited to text, often doing full file scans that just did not work on the huge video or even audio files (i've cobbled together a similar portable podcast content platform based on a sansa clip for my work commute, but that's another story).

  • good content collecting tools (podcatcher, youtube downloader, compatible format computer dvr, various sniffy scraper techniques, bittorrent).

  • grab-and-drag firefox plugin - for scrolling ui lists from the touchscreen

  • a number of other technical tools that were either relatively new or that i just hadn't learned how to use yet:

    • xautomation/wmctrl - for launching and positioning windows in the wm... fullscreening, foregrounding, switching desktops, sending keystrokes. xsetwacom for enabling/disabling the touchscreen with the bezel button, to avoid phantom clicks while the laptop was stowed between uses.

    • floating html layouts - i just learned how to do these from my coworker henry. just what i needed for building scrollable areas and large fat-finger touchscreen buttons and generally decent control layouts.

    • jquery - making the control code easier to write

    • xmlhttprequest - for asynchronous controlling of the video player

    • video player with an approachable control api - i've used mplayer for years, and was delighted to find it had fine, full-featured named-pipe controls, and all the playback features i could ask for (disableable onscreen display with elapsed/total time, volume level, skipping an arbitrary period forward or backward, a/v delay for bluetooth syncing (i eventually gave up on bluetooth audio and decided wired earbuds worked better... some footnotes below)). a stripped-down sketch of this control channel follows just after this list.
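to give a flavor of the control side, here's that stripped-down sketch of driving mplayer in slave mode through a named pipe. the fifo path and file names are arbitrary examples, and the real thing is driven from the browser ui through the local web cgi, but the shape is the same:

  import os
  import subprocess

  # start mplayer in slave mode reading commands from a named pipe, then poke
  # it with standard slave-mode commands.
  FIFO = "/tmp/gym_mplayer_fifo"
  if not os.path.exists(FIFO):
      os.mkfifo(FIFO)

  player = subprocess.Popen([
      "mplayer", "-slave", "-quiet", "-fs",
      "-input", "file=" + FIFO,
      "/home/me/video/some_show.avi",
  ])

  def send(cmd):
      # each slave-mode command is a newline-terminated line written to the fifo
      with open(FIFO, "w") as fifo:
          fifo.write(cmd + "\n")

  send("osd 3")            # onscreen display with elapsed/total time
  send("seek 30 0")        # skip 30 seconds forward (relative seek)
  send("volume 5 0")       # nudge the volume up
  send("audio_delay 0.3")  # the a/v delay knob mentioned above for bluetooth syncing
  send("pause")            # toggle pause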

this project, though admittedly somewhat humble (there's plenty of mobile video gadgetry out there), was really exciting from beginning to end. it's not the first time i've built something that i knew was going to make my life better. but it was uniquely remarkable when i think about how a dozen different pieces all came together at the right time in the right way. many of the pieces hadn't existed just a couple months or years before, and just as many pieces may have existed but were completely new to me. on top of all this, any engineer knows that the more moving parts, even if they're mature and familiar, the more complex and likely to fail an idea will be. but in this case, with just a modest amount of engineering and research, i wound up with exactly what i had imagined i wanted.

now, a couple years later, i can say with certainty this project was a total success. i still get excited about not wasting time watching video at home (well, not AS much), and about having tons of interesting and current content available to consume while burning calories at the same time.

here's a youtube video of the rig in action:

http://www.youtube.com/watch?v=K2637RSIvLM

footnotes:

  • i put a good amount of effort into wireless bluetooth headphones, but eventually gave up on them. i went through 3 different pairs that would break or just not cut it in different ways. the plantronics voyager 855 was too quiet. the motorola rokr lost sync too easily and often. the rocketfish knockoff of the motorola rokr performed the best and was cheapest, but the power switch eventually broke. it was also tricky to keep any of the headphones paired. i found i had to send a silent track at all times, else the headphones would go to sleep after a very short time. it was also a pain to manage the sync by delaying the audio on the player... but i would have put up with this if not for all the other issues.

  • i use firefox for the interface currently. i tried to use chrome, but it had funky behavior when i was remote and not connected to the network and just connecting to the local web cgi. also it did not blank the pointer the way i wanted, especially since the tx2 had a phantom mouse movement which would be distracting in front of the video if not blanked well.

  • cost was relatively modest. more than an ipad, but what i have is much more capable for my needs also. $800 for the hp tx2, $100 for 640g hd, and maybe $30 for the earbuds.

tostaa

finally published tostaa. been wanting to do this for a long time, but really wanted to work out some bugs before going public. however, since it works well enough for me, i never seem to find time to fix it up for prime time. so, figured i'd just throw it out there.