World Backup Recovery Testing Day?

Yesterday was apparently the widely celebrated World Backup Day. Just like reality, every party ends sometime (unless you happen to be Andrew W.K.), and now you have woken up with a splitting headache, a vague sadness and an insatiable desire for eggs Benedict. If installing and configuring a new backup system is an event that brings you joy and revelry like a good party, the monotony of testing the recovery of your backups is the hangover that stretches beyond a good greasy breakfast. I propose that today should thus be World Backup Recovery Testing Day.

There is much guidance out there for anyone who does even cursory research on how to design a robust backup system, so I will save you from my “contributions” to that discussion. As much as I would like to relay my personal experience with backups, I do not think it would be wise to air my dirty laundry this publicly. In my general experience, backup systems seem to get done wrong all the time. Why?


Backups? We don’t need those. We have snapshots.

AHAHAHAHAHAHA. Oh. Have fun with that.

I am not sure what it is about backup systems, but they never seem to make it onto leadership’s radar. Maybe it is because they are secondary systems, so they do not seem as necessary to the day-to-day operations of the business as production systems. Maybe it is because they are actually more complicated than they seem. Maybe it is because the risk-to-cost ratio does not look like a good buy from a business perspective, especially if the person making the business decision does not fully understand the risk.

This really all boils down to the same thing: technical staff not communicating the true nature of the problem domain to leadership, and/or leadership not adequately listening to the technical staff. Notice the and/or. Communication: it goes both ways. If you are constantly bemoaning the fact that management never listens to you, perhaps you should change the way you are communicating with your management. I am not a manager, so I have no idea what the corollary to this is (ed. managers, feel free to comment!).

Think about it. If you are not technical, the difference between snapshots and a true backup seems superfluous; after all, snapshots typically live on the same storage as the data they protect, so they vanish right along with it when that storage fails. Why would you pay more money for a duplicate system? If you do not have an accurate grasp of the risk and the potential consequences, why would you authorize additional expenditures?


I am in IT. I work with computers not people.

You do not work with people, you say? Sure you do. Who uses computers? People. Generally people who have some silly business mission related to making money. You had best talk to them and figure out what is important to them, not to you. The two are not always the same. I see this time and time again: technical staff implement a great backup system but fail to back up the stuff that is critical to the actual business.

Again: communication. To a technical person, one database looks more or less identical to another. I need to talk to the people who actually use the application and get some context; otherwise, how would I know which one needs a 15-minute Recovery Time Objective and which one is a legacy application that would be fine with a 72-hour Recovery Time Objective? If it were up to me, I would back up everything, with infinite granularity and infinite retention, but despite the delusion many sysadmins labour under, they are not gods and do not have those powers.

Your backup system will have limitations, and the business context should inform how you accommodate those limitations. If you have enough storage to retain all your backups for six weeks, or half your backups for 4 weeks and half for 4 months, and you just make a choice, maybe you will get lucky and get it right. However, the real world is much more complicated than this scenario, so it is highly likely you will get it wrong and retain the wrong data for too long at the expense of the right data. These kinds of things can be Resume Generating Events.
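To make that trade-off concrete, here is a back-of-the-envelope sketch. Every number in it is invented for illustration, and it ignores incrementals, compression and deduplication, all of which stretch real-world retention much further than this simple arithmetic suggests.

```python
# Hypothetical retention math: all figures invented for illustration.
daily_backup_gb = 1000                  # total new backup data per day

# Option 1: retain everything for six weeks (42 days).
budget_gb = daily_backup_gb * 42        # 42,000 GB of storage

# Option 2: same storage pool, split by priority -- the legacy half
# keeps 4 weeks, and whatever is left over goes to the critical half.
legacy_gb = (daily_backup_gb / 2) * 28                      # 14,000 GB
critical_days = (budget_gb - legacy_gb) / (daily_backup_gb / 2)

print(critical_days)                    # 56.0 -- about 8 weeks
```

The arithmetic is trivial, which is exactly the point: the hard part is not the division, it is knowing which applications belong in which half.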

My favorite version of this is the dreaded Undocumented Legacy Application living on some aging workstation tucked away in a forgotten corner. Maybe it is running the company’s timesheet system (people get pissed if they cannot get paid), maybe it is running the HVAC control software (people get pissed if the building is a nice and frosty 48 degrees Fahrenheit), maybe it is something like SCADA control software (engineers get pissed when water/oil/gas does not flow down the right pipes at the right time; also, people may get hurt). How is technical staff going to have backup and recovery plans for things like this if they do not even know they exist in the first place?

It is hard to know if you have done it wrong

In some ways, the difficulty of getting backup systems right is that you only know whether you got it right once the shit hits the fan. Think about the failure mechanism for production systems: you screwed up your storage design – stuff runs slow. You screwed up your firewall ACLs – network traffic is blocked. You screwed up your webserver – the website does not work any more. If there is a technical failure, you generally know about it rather quickly. Yes, there are whole sets of integration errors that lie in wait in infrastructure and only rear their ugly heads when you hit a corner case, but whatever, you cannot test everything. #YOLO #DEVOPS

There is no imminent failure mechanism constantly pushing your backup system towards a better and more robust design, since you only really test it when you need it. Without this Darwinian IT version of natural selection, you generally end up with a substandard design and/or implementation. Furthermore, for some reason backups are associated with tapes, and junior positions are associated with tape rotation. This cultural prejudice has slowly morphed into junior positions being placed in charge of the backup system; arguably not the right skillset to be wholly responsible for such a critically important piece of infrastructure.

Sooooo . . . we do a lot of things wrong, and it seems the best we can do is a simulated recovery test. That’s why I nominate April 1st as World Backup Recovery Testing Day!
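In that spirit, here is a minimal sketch of what an automated recovery test might look like, using only Python’s standard library. The directory layout, file names and tar-based backup are all invented for the example; the shape is what matters: back up, restore somewhere else, and verify the restored data byte for byte.

```python
# Minimal restore-test sketch: back up a directory to a tar archive,
# restore it to a scratch location, and verify file contents match.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def restore_test(source: Path, workdir: Path) -> bool:
    """Back up `source`, restore it elsewhere, and compare checksums."""
    archive = workdir / "backup.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="data")

    restore_dir = workdir / "restore"
    with tarfile.open(archive) as tar:
        tar.extractall(restore_dir)  # the step most shops never exercise

    # A backup you have never restored is a hope, not a backup:
    # compare every original file against its restored twin.
    for original in source.rglob("*"):
        if original.is_file():
            restored = restore_dir / "data" / original.relative_to(source)
            if not restored.is_file() or sha256(original) != sha256(restored):
                return False
    return True

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "src"
    src.mkdir()
    (src / "timesheets.db").write_bytes(b"critical business data")
    print(restore_test(src, tmp))  # True when backup and restore agree
```

A real version would restore from your actual backup system rather than a tar file it just created, onto hardware that is not your production storage; but even this toy loop, run on a schedule, catches more than most shops ever test.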


Until next time,

Stay Frosty