Monthly Archives: April 2016

Budget Cuts and Consolidation: Taking it to the Danger Zone

For those of you who do not know, Alaska is kind of like the 3rd world of the United States in that we have a semi-exploitative love/hate economic relationship with a single industry . . . petroleum. Why does this matter? It matters because two years ago oil was $120 a barrel and now it is floating between $40 and $50. For those of us in public service, or in private industry support services that contract with government and municipal agencies, it means that our budget just shrank by 60%. The Legislature is currently struggling to balance a budget that runs an annual $3.5 to $4 billion deficit, a pretty difficult task if your only revenue stream is oil.

Regardless of where you work and who you work for in Alaska, this means "the times, they are a changin'". As budgets shrink, so do resources: staff, time, support services, training opportunities, travel, equipment refreshes and so on. Belts tighten but we still have to eat. One way to make the food go further is to consolidate. In IT, especially in these days of Everything-as-a-Service, there is more and more momentum in the business to move toward centralized, standardized and consolidated service delivery (ITIL buzzword detected! +5 points).

In the last few years, I have been involved in a few of these types of projects. I am here to share a couple of observations.


Consolidation, Workload and Ops Capacity

[Graph: Workload vs. Operational Capacity over time, with made-up numbers]

Above you should find a fairly straightforward, management-esque graph with made-up numbers and metrics. Workload is how much stuff you actually have to get done. This is deceptive because Workload can break down into many different types of work: projects, break/fix, work that requires immediate action, and work that can be scheduled. But for the sake of this general 40,000-foot view, it can just be deemed the work that you and your team do.

Operational Capacity is simply your team’s ability to actually do that work. Again, this is deceptive because depending on your team’s skills, personalities, culture, organizational support, and morale, their Operational Capacity can look different even if the total amount of work they do in a given time stays constant. But whatever, management-esque talk can be vague.

Consolidation projects can be all over the map as well: combining disparate systems that have the same business function, eliminating duplicate systems and/or services, centralizing services, or even something as disruptive as combining business units and teams. Consolidation projects generally require standardization as a prerequisite; how else would you consolidate? The technical piece here is generally the smallest: People, Process, Technology, right?

And from that technical standpoint, especially one from a team somewhere along that Workload vs. Operational Capacity timeline, consolidation and standardization look very, very different.

Standardization has no appreciable long-term Workload increase or reduction. There is an increased capture of business value for the existing work performed. If there is wider use of the same Process and Technology, the business value of a given unit of work goes further; for example, if it takes 10 hours to patch 200 workstations, it may only take 10.2 hours to patch 2,000 workstations.

Consolidation brings a long-term Workload increase with a corresponding increase of Operational Capacity due to the addition of new resources or the re-allocation of existing resources (that’s the dotted orange line on the graph). For example, if there is widespread adoption of the same Process and Technology, you can take the 10 hours my team spends on patching workstations and combine it with the 10 hours another team spends on patching workstations. You just bought yourself some Operational Capacity, in terms of having twice as many people deal with the patching, or maybe it turns out that it only takes 10 hours to patch both teams’ workstations and you have freed up 10 hours’ worth of labor that can go to something else. There is still more work than before, but that increased Workload is more than offset by increased Operational Capacity.

Both standardization and consolidation projects increase the short-term Workload while the project is ongoing (see Spring of ’15 in the graph). They are often triggered by external events like mergers, management decisions, or simply proactive planning in a time of shrinking budgets. In this example, the trigger is a reduction in staff. This obviously reduces the team’s Operational Capacity. The ability to remain proactive at both the strategic and tactical levels is reduced. In fact, we are just barely able to get work done. BUT we have (or had) enough surplus capacity to continue to remain proactive even while taking on more projects, hopefully projects that will either reduce our Workload, increase our Operational Capacity, or both, because things are thin right now.

Boom! Things get worse. Workload increases a few months later. Maybe another position was cut, maybe an unanticipated project or requirement from on high came down to your team. Now you are in, wait for it… THE DANGER ZONE! You cannot get all the work done inside the required time frame with what you have. This is a bad, bad, bad place to be for too long. You have to put projects on hold, put maintenance on hold or let the ticket queues grow. Your team works harder, longer and burns out. A steady hand, a calm demeanor and a bit of healthy irreverence are really important here. Your team needs to pick your projects very, very carefully since you are no longer in a position to complete them all. The ones you do complete damn well better either lower Workload significantly, increase your Operational Capacity or hopefully do both. Mistakes here cost a lot more than they did a year ago.

The problem here is that technical staff do not generally prioritize their own projects. Their business leaders do. And in times where budgets are evaporating, priorities seem to settle around a single thing: cost savings. This makes obvious sense, but the danger is that there is no reason the project with the most significant cost savings will also happen to be the project that helps your team decrease their Workload and/or increase their Operational Capacity. I am not saying it won’t happen, just that there is no guarantee that it will. So your team is falling apart, you just completed a project that saves the whole business rap-star dollars’ worth of money, and you have not done anything to move your team out of THE DANGER ZONE.

In summation, projects that increase your Operational Capacity and/or reduce your Workload have significant long-term savings in terms of more efficient allocation of resources, but the projects that will get priority will be those that have immediate short-term savings in terms of dollars and cents.

Then a critical team member finds better work. Then it’s over. No more projects with cost savings, no more projects at all. All that maintenance that was put off, all the business leaders that tolerated the “temporary” increase in response time for ticket resolution, all the “I really should verify our backups via simulated recovery” kind of tasks – all those salmon come home to spawn. Your team is in full-blown reactive mode. They spend all their time putting out fires. You are just surviving.

Moral of the story? If you go to THE DANGER ZONE, don’t stay too long and make sure you have a plan to get your team out.

 

Documentation, or how I wasted an hour

As if confirming my own tendency to “do as I say, not as I do”, I just wasted about an hour this morning trying to figure out why a newly created virtual machine was not correctly registering its hostname with Active Directory via Dynamic DNS. Of course, this was a series of errors greatly exacerbated by the fact that I had only had two of my required four cups of coffee and I stayed up too late watching the ironically named and absolutely hilarious Workaholics.

Let’s review, shall we?

  • Being tired and trying to do something mildly complicated
  • Allowing myself to become distracted by an interrupt task in the middle of this work
  • Not verifying the accuracy of our documentation prior to assigning the IP address in question to the virtual machine
  • Screwing up and assigning the IP address to the wrong virtual machine (both the hostnames and the subnet octets are very similar)
  • Not reading the instrumentation; the output of ipconfig /all plainly said “(Duplicate)”. Duh.

All of these factors made what should have been a 15-minute troubleshooting task stretch out into an hour.

Root cause: The IP address I picked for one of the virtual machines was already in use and the documentation was not updated to reflect this.

Potential solutions: I dunno… how about keeping our documentation updated (easier said than done)? Or better yet, stop using a “documentation system” for IP addresses that relies on discretionary operational practices (i.e., an Excel spreadsheet stored on SharePoint) and use something like IPAM. Maybe, instead of going down the ol’ “runlist” of potential problems, I should have stopped and gathered a bit more information before proceeding with troubleshooting. The issue was right there in the ipconfig output. I was looking *right* at it. I guess that is the difference between looking and seeing.
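
In the spirit of “gather more information before you proceed”, here is a minimal Python sketch of the kind of sanity check I could have scripted instead of trusting the spreadsheet: ping the candidate address and look for an existing reverse DNS record before assigning it. The address, the Windows-style ping flags, and the whole workflow are assumptions for illustration, not how our environment actually does things.

```python
import socket
import subprocess

def ip_already_in_use(ip: str) -> bool:
    """Return True if the candidate IP address looks like it is already taken.

    Two cheap checks: does anything answer a ping, and does reverse DNS
    already have a record for it? Neither is bulletproof (hosts drop ICMP,
    PTR records go stale), but either one firing is reason enough to stop
    and look before assigning the address.
    """
    # Windows-style ping: one echo request, 1000 ms timeout.
    ping = subprocess.run(
        ["ping", "-n", "1", "-w", "1000", ip],
        capture_output=True,
    )
    if ping.returncode == 0:
        return True

    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        print(f"Reverse DNS already points {ip} at {hostname}")
        return True
    except OSError:
        # No PTR record found; nothing obviously claims this address.
        return False

if __name__ == "__main__":
    candidate = "10.0.42.17"  # hypothetical address copied from the spreadsheet
    if ip_already_in_use(candidate):
        print(f"{candidate} appears to be in use. Fix the docs, pick another.")
    else:
        print(f"No obvious answer from {candidate}; probably safe to proceed.")
```

It is no IPAM, but a sixty-second check like this is a lot cheaper than the hour I just spent.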

In short . . . happy Monday you jerks.

 

facepalm

 

World Backup Recovery Testing Day?

Yesterday was apparently the widely celebrated World Backup Day. Just like reality, the party ends sometime (unless you happen to be Andrew W.K.), and now you have woken up with a splitting headache, a vague sadness and an insatiable desire for eggs Benedict. If installing and configuring a new backup system is an event that brings you joy and revelry like a good party, the monotony of testing the recovery of your backups is the hangover that stretches beyond a good greasy breakfast. I propose that today should thus be World Backup Recovery Testing Day.

There is much guidance out there for anyone who does cursory research on how to design a robust backup system, so I think I will save you from my “contributions” to that discussion. As much as I would like to relay my personal experience with backups, I do not think it would be wise to air my dirty laundry this publicly. In my general experience, backup systems seem to get done wrong all the time. Why?

 

Backups? We don’t need those. We have snapshots.

AHAHAHAHAHAHA. Oh. Have fun with that.

I am not sure what it is about backup systems, but they never seem to make leadership’s radar. Maybe it is because they are secondary systems, so they do not seem as necessary to the day-to-day operations of the business as production systems. Maybe it is because they are actually more complicated than they seem. Maybe it is because the risk-to-cost ratio does not seem like a good buy from a business perspective, especially if the person making the business decision does not fully understand the risk.

This really just boils down to the same thing: technical staff not communicating the true nature of the problem domain to leadership and/or leadership not adequately listening to the technical staff. Notice the and/or. Communication: it goes both ways. If you are constantly bemoaning the fact that management never listens to you, perhaps you should change the way you are communicating with your management? I am not a manager so I have no idea what the corollary to this is (ed.: managers, feel free to comment!).

Think about it. If you are not technical, the difference between snapshots and a true backup seems academic. Why would you pay more money for a duplicate system? If you do not have an accurate grasp of the risk and the potential consequences, why would you authorize additional expenditures?

 

I am in IT. I work with computers, not people.

You do not work with people, you say? Sure you do. Who uses computers? People. Generally people that have some silly business mission related to making money. You best talk to them and figure out what is important to them and not you. The two are not always the same. I see this time and time again. Technical staff implements a great backup system but fails to back up the stuff that is critical to the actual business.

Again. Communication. As a technical person, one database looks more or less identical to another. I need to talk to the people that actually use the application and get some context; otherwise, how would I know which one needs a 15-minute Recovery Time Objective and which one is a legacy application that would be fine with a 72-hour Recovery Time Objective? If it were up to me, I would back up everything, with infinite granularity and infinite retention, but despite the delusion that many sysadmins labour under, they are not gods and do not have those powers. Your backup system will have limitations, and the business context should inform your decisions on how you accommodate those limitations. If you have enough storage to retain all your backups for six weeks, or half your backups for 4 weeks and half for 4 months, and you just make a choice, maybe you will get lucky and get it right. However, the real world is much more complicated than this scenario, and it is highly likely you will get it wrong and retain the wrong data for too long at the expense of the right data. These kinds of things can be Resume Generating Events.
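
To make that retention trade-off concrete, here is a back-of-the-napkin Python sketch. Every name, size and retention period in it is invented; the point is only that retention is a budget you spend per application, guided by business context, not a single global knob you can guess at.

```python
# Back-of-the-napkin retention math. All names, sizes and retention
# periods are made up for illustration.

DAILY_FULL_GB = {          # hypothetical daily full-backup sizes
    "payroll_db": 40,      # needs the 15-minute RTO and long retention
    "legacy_app_db": 150,  # a 72-hour RTO and short retention would be fine
}

def storage_needed(retention_days: dict) -> int:
    """Total GB consumed if every application keeps one full backup per day."""
    return sum(DAILY_FULL_GB[app] * days for app, days in retention_days.items())

# Policy A: blanket six weeks for everything.
policy_a = {"payroll_db": 42, "legacy_app_db": 42}

# Policy B: four months for the critical system, four weeks for the legacy one.
policy_b = {"payroll_db": 120, "legacy_app_db": 28}

print("Policy A (blanket 6 weeks):", storage_needed(policy_a), "GB")
print("Policy B (tiered by need):  ", storage_needed(policy_b), "GB")
```

Both policies land in roughly the same ballpark of storage, but only one of them puts the long retention on the data the business actually cares about, and you only find out which one that is by asking.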

My favorite version of this is the dreaded Undocumented Legacy Application that is living on some aging workstation tucked away in a forgotten corner. Maybe it is running the company’s timesheet system (people get pissed if they cannot get paid), maybe it is running the HVAC control software (people get pissed if the building is a nice and frosty 48 degrees Fahrenheit), maybe it is something like SCADA control software (engineers get pissed when water/oil/gas does not flow down the right pipes at the right time, and people may get hurt). How is technical staff going to have backup and recovery plans for things like this if they do not even know they exist in the first place?

It is hard to know if you have done it wrong

In some ways, the difficulty of getting backup systems right is that you only know if you have got it right once the shit hits the fan. Think about the failure mechanism for production systems: You screwed up your storage design – stuff runs slow. You screwed up your firewall ACLs – network traffic is blocked. You screwed up your webserver – the website does not work any more. If there is a technical failure, you generally know about it rather quickly. Yes, there are whole sets of integration errors that lie in wait in infrastructure and only rear their ugly heads when you hit a corner case, but whatever, you cannot test everything. #YOLO #DEVOPS

There is no imminent failure mechanism constantly pushing your backup system towards a better and more robust design, since you only really test it when you need it. Without this Darwinian IT version of natural selection you generally end up with a substandard design and/or implementation. Furthermore, for some reason backups up here are associated with tapes, and junior positions are associated with tape rotation. This cultural prejudice has slowly morphed into junior positions being placed in charge of the backup system, arguably not the right skill set to be wholly responsible for such a critically important piece of infrastructure.

Sooooo . . . we do a lot of things wrong and it seems the best we can do is a simulated recovery test. That’s why I nominate April 1st as World Backup Recovery Testing Day!
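
And if you want something to actually run on World Backup Recovery Testing Day, here is a minimal Python sketch of a simulated recovery check: restore a sample of files to a scratch location (however your backup tool does that) and verify them against checksums recorded at backup time. The paths and the manifest format are hypothetical; the verification logic is the part that matters.

```python
import hashlib
import pathlib

# Minimal "did the restore actually work?" check: restore a sample of files
# to a scratch location with your backup tool of choice, then compare each
# file against a checksum recorded at backup time.

RESTORE_DIR = pathlib.Path(r"D:\restore_test")                 # scratch restore target
MANIFEST = pathlib.Path(r"D:\backup_manifests\nightly.txt")    # lines of "<sha256> <relative path>"

def sha256(path: pathlib.Path) -> str:
    """Hash a file in 1 MB chunks so large restores do not eat all the RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore() -> bool:
    """Return True only if every file listed in the manifest restored intact."""
    ok = True
    for line in MANIFEST.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        restored = RESTORE_DIR / name
        if not restored.exists():
            print(f"MISSING: {name}")
            ok = False
        elif sha256(restored) != expected:
            print(f"CORRUPT: {name}")
            ok = False
    return ok

if __name__ == "__main__":
    print("Recovery test", "PASSED" if verify_restore() else "FAILED")
```

The restore itself is still manual in this sketch; the win is that “the backup job said success” stops being the only evidence you have.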

 

Until next time,

Stay Frosty