Tag Archives: process

Prometheus and Sisyphus: A Modern Myth of Developers and Sysadmins

I am going to be upfront with you. You are about to read a long and meandering post that will seem a little too whiny at times, where I talk some crap about our developers and their burdens (applications). I like our dev teams and I like to think I work really well with their leads, so think of this post as a bit of satirical sibling rivalry; underneath the hyperbole and good-natured teasing there might be a small, “little-t” truth.

That truth is that operations, whether it’s the database administrator, the network team, the sysadmins or the help desk, always, always, always gets the short straw and that is because collectively we own “the forest” that the developers tend to their “trees” in.

I have a lot to say about the oft-repeated sysadmin myth about “how misunderstood sysadmins are” and how they just seem to get stepped on all the time and so on and so on. I am not a big fan of the “special snowflake sysadmin syndrome” and I am especially not a fan of it when it is used as an excuse to be rude or unprofessional, but that being said, I think it is worth stating that even I know I am half full of crap when I say sysadmins always get the short straw.

OK, disclaimers are all done! Let’s tell some stories!

 

DevOps – That means I get Local Admin right?

My organization is quite granular and each of our departments more or less maintains its own development team supporting its own mission-specific applications, along with either a developer who essentially fulfils an operations role or a single operations guy doing support solely for that department. The central ops team maintains things like the LAN, Active Directory, the virtualization platform and so on. If the powers that be wanted a new application for their department, the developers would request the required virtual machines, the ops team would spin up a dozen VMs off of a template, join them to AD, give the developers local admin and off we go.

Much like Bob Belcher, all the ops guys could do was “complain the whole time”.

 

This arrangement has led to some amazing things that break in ways too awesome to truly describe:

  • We have an in-house application that uses SharePoint as a front-end, calls some custom web services tied to a database or two that auto-populates an Excel spreadsheet that is used for timekeeping. Everyone else just fills out the spreadsheet.
  • We have another SharePoint integrated application, used ironically enough for compliance training, that passes your Active Directory credentials in plaintext through two or three servers all hosting different web services.
  • Our deployment process is essentially to copy everything off your workstation onto the IIS servers.
  • Our revision control is: E:\WWW\Site, E:\WWW\Site (Copy), E:\WWW-Site-Dev McDeveloper
  • We have an application that manages account on-boarding, a process that is already automated by our Active Directory team. Naturally they conflict.
  • We had, at one point in time, four or five different backup systems, all of which used BackupExec for some insane reason, and three of which backed up the same data.
  • We managed to break a production IIS server by restoring a copy of the test database.
  • And then there’s Jenga: Enterprise Edition…

 

Jenga: Enterprise Edition – Not so fun when it needs four nines of uptime.

A satirical (but only just) rendering of one of our applications’ design, a pattern I call “The Spider Web”.

What you are looking at is my humorous attempt to scribble out a satirical sketch of one of our line-of-business applications, which managed to actually turn out pretty accurate. The Jenga application is so named because all the pieces are interconnected in ways that turn the prospect of upgrading any of it into the project of upgrading all of it. Ready? ’Ere we go!

It’s built around a core written in a language that we have not had any on-staff expertise in for the better part of ten years. In order to provide the functionality that the business needed as the application aged, the developers wrote new “modules” in other languages that essentially just call APIs or exposed services and then bolted them on. From there, it only gets better:

  • The database is relatively small, around 6 TB, but almost 90% of it is static read-only data that we cannot separate out, which drastically reduces what our DBA and I can do in terms of recovery, backup, replication and performance optimization.
  • There is no truly separate development or testing environment, so we use snapshot copies to expose what appear to be “atomic” copies of the production data (which contains PII!) on two or three other servers so our developers can validate application operations against it. We used to do this with manual fricking database restores, which was god damned expensive in terms of time and storage.
  • There are no less than eight database servers involved, but the application cannot be distributed or set up in some kind of multi-master deployment with convergence, so staff at remote sites suffer abysmal performance if anything resembling contention happens on their shared last-mile connections.
  • The “service accounts” are literally user accounts that the developers use to RDP to the servers, start the application’s GUI, and then enable the application’s various services by interacting with the above-mentioned GUI (any hiccup in the RDP session and *poof* there goes that service). See the sketch after this list for what that should look like.
  • The public-facing web server directly queries the production database.
  • The internally consumed pieces of the application and the externally consumed pieces are co-mingled, meaning an outage anywhere is an outage everywhere. It also means we cannot segment the application into public-facing and internal-facing pieces.
  • The client requires a hard-coded drive map to run, since application upgrades are handled internally with copy jobs that replace all the local .DLLs on a workstation when new ones are detected.
  • And last but not least, it runs on an EOL version of MSSQL.
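
To make the “service account” bullet concrete: a real Windows service survives an RDP disconnect because it is registered with the service control manager and runs under a dedicated account, instead of living inside someone’s interactive session. Below is a minimal sketch of that registration; the service name, binary path and account are hypothetical placeholders, not anything the Jenga application actually supports today.

    # Minimal sketch (run elevated), not the application's actual setup:
    # registering a back-end component as a proper Windows service instead of
    # launching its GUI over RDP. Names and paths are hypothetical placeholders.
    New-Service -Name 'JengaDispatcher' `
                -BinaryPathName 'D:\Jenga\Services\Dispatcher.exe' `
                -DisplayName 'Jenga Dispatch Service' `
                -StartupType Automatic `
                -Credential (Get-Credential -Message 'Dedicated service account')

    Start-Service -Name 'JengaDispatcher'

The catch, of course, is that the component has to be able to run headless in the first place, which everything above suggests it cannot.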

Whew. That was a lot. Sorry about that. Despite the fact that a whole department pretty much lives or dies by this application’s continued functionality, our devs have not made much progress in re-architecting and modernizing it. This really is not their fault, but it does not change the fact that my team has an increasingly hard time keeping this thing running in a satisfactory manner.

 

Operations: The Digital Custodian Team.

In the middle of a brainstorming session where we were trying to figure out how to move Jenga to a new virtualization infrastructure, all on a weekend when I will be traveling, in order to squeeze the outage into the only period within the next two months that was not going to be unduly disruptive, I began to feel like my team was getting screwed. They have more developers supporting this application than we have in our whole operations team and it is on us to figure out how to move Jenga without losing any blocks or having any lengthy service windows? What are those guys actually working on over there? Why am I trying to figure out which missing .DLL from .NET 1.0 needs to be imported onto the new IIS 8.5 web servers so some obscure service that no one really understands runs in a supported environment? Why does operations own the life-cycle management? Why aren’t the developers updating and re-writing code to reflect the underlying environmental and API changes each time a new server OS is released with a new set of libraries? Why are our business expectations for application reliability so wildly out of sync with what the architecture can actually deliver? Just what in the hell is going on here!

Honestly, I don’t know, but it sucks. It sucks for the customers, it sucks for the devs, but mostly I feel like it sucks for my team because we have to support four other line-of-business applications. We own the forest, right? So when a particular tree catches on fire they call us to figure out what to do. No one mentions that we probably should expect trees wrapped in paraffin wax and then doused in diesel fuel to catch on fire. When we point out that tending trees in this manner probably won’t deliver the best results if you want something other than a bonfire, we get met with a vague shrug.

Is this how it works? Your team of rockstar, “creative-type”, code-poets whip up some kind of amazing business application, celebrate and then hand it off to operations where we have to figure out how to keep it alive as the platform and code base age into senility for the next 20 years? I mean who owns the on-call phone for all these applications… hint: it’s not the dev team.

I understand that sometimes messes happen… just why does it feel like we are the only ones cleaning it up?

 

You’re not my Supervisor! Organizational Structure and Silos!

Bureaucratium ad infinitum.

 

At first blush I was going to blame my favorite patsy, Process Improvement and the insipid industry around it, for this current state of affairs, but after some thought I think the real answer here is something much simpler: the dev team and my team don’t work for the same person. Not even close. If we play a little game of “trace the organizational chart” we have five layers of management before we reach a position that has direct reports that eventually lead to both teams. Each one of those layers is a person – with their own concerns, motivations, proclivities and spin they put on any given situation. The developers and operations team (“dudes that work”), more or less, agree that the design of the Jenga application is Not a Good Thing (TM). But as each team gets told to move in a certain direction by each layer of management, our efforts and goals diverge. No amount of fuzzy-wuzzy DevOps or new-fangled Agile Standup Kanban Continuous Integration Gamification Buzzword Compliant bullshit is ever going to change that. Nothing makes “enemies” out of friends faster than two (or three or four) managers maneuvering for leverage and dragging their teams along with them.

I cannot help but wonder what our culture would be like if the lead devs sat right next to me and we established project teams out of our combined pool of developer and operations talent as individual departments put forth work. What would things be like if our developers were not chained to some stupid line-of-business application from the late ’80s, toiling away to polish a turd and implement feature requests like some kind of modern Promethean myth? What would things be like if our operations team was not constantly trying to figure out how to make old crap run while our budgets and staff are whittled away, snatching victory from defeat time and time again only to watch the cycle of mistakes repeat itself again and again like some kind of Sisyphean dystopia with cubicles? What if we could sit down together and, I dunno… fix things?

Sorry there are no great conclusions or flashes of prophetic insight here, I am just as uninformed as the rest of the masses, but I cannot help but think, maybe, maybe we have too many chefs in the kitchen arguing about the menu. But then again, what do I know? I’m just the custodian.

Until next time, stay frosty.

Kafka in IT: How a Simple Change Can Take a Year to Implement

Public sector IT has never had a reputation for being particularly fast-moving or responsive. In fact, it seems to have a reputation for being staffed by apathetic, under-skilled workers toiling away in basements and boiler rooms supporting legacy, “mission-critical”, monolithic applications that sit half-finished and half-deployed by their long-gone and erstwhile overpaid contractors (*cough* Deloitte, CGI *cough*). This topic might seem familiar… see Budget Cuts and Consolidation and Are GOV IT teams apathetic?

Why do things move so slowly, especially in a field that demands the opposite? I don’t have an answer to that larger question, but I do have an object lesson. Well, maybe what I really have is part apology, part explanation and part catharsis. Gather around and hear the tale of how a single change to our organization’s perimeter proxy devices took a year!

 

03/10

We get a ticket stating that one of our teams’ development servers is no longer letting them access it via UNC share or RDP. I assign one of our tier-2 guys to take a look and a few days later it gets escalated to me. The server will not respond to any incoming network traffic, but if I access it via the console and send traffic out, it magically works. This smells suspiciously like a host-based firewall acting up, but our security team swears up and down that our Host Intrusion Protection software is in “detect” mode, and I verified that we have disabled the native Windows firewall. I open up a few support tickets with our vendors and start chasing such wild geese as a layer-2 disjoint in our UCS fabric and “asymmetric routing” issues. No dice. Eventually someone gets the smart idea to move the IP address to another VM to try and narrow the issue down to either the VM or the environment. It’s the VM (of course it is)! These shenanigans take two weeks.
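
For what it is worth, the first-pass checks boil down to a few lines of PowerShell, assuming a Server 2012-or-later box where the NetSecurity and NetTCPIP cmdlets exist (the HIPS agent, of course, offers no such convenient cmdlets). The server name and ports below are placeholders.

    # Confirm the native firewall really is off on every profile.
    Get-NetFirewallProfile | Select-Object Name, Enabled

    # From another host: does the box answer on the ports that "stopped working"?
    Test-NetConnection -ComputerName 'DEVAPP01' -Port 3389   # RDP
    Test-NetConnection -ComputerName 'DEVAPP01' -Port 445    # UNC/SMB

    # From the console of the suspect VM: outbound works, so check what is
    # actually listening while a remote host retries the connection.
    Get-NetTCPConnection -State Listen | Select-Object LocalAddress, LocalPort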

04/01

I finish re-platforming the development server onto a new Server 2012 R2 virtual machine. This in and of itself would be worth a post, since the best way I can summarize our deployment methodology is “guess-and-check”. Anyway, the immediate issue is now resolved. YAY!

05/01

I rebuild the entire development, testing, staging and production stack and migrate everything over except the production server, which is publicly accessible. The dev team wants to do a soft cutover instead of just moving the IP address to the new server. This means we will need to have our networking team make some changes to the perimeter proxy devices.

05/15

I catch up on other work and finish the roughly ten pages of forms, diagrams and a security plan that are required for a perimeter device change request.

06/06

I open a ticket upstream, discuss the change with the network team and make some minor modifications to the ticket.

06/08

I filled out the wrong forms and/or I filled them out incorrectly. Whoops.

06/17

After a few tries I get the right forms and diagrams filled out. The ticket gets assigned to the security team for approval.

06/20

Someone from the security team picks up the ticket and begins to review it.

07/06

Sweet! Two weeks later my change request gets approval from the security team (that’s actually pretty fast). The ticket gets transferred back to the networking team, which begins to work on implementation.

07/18

I create a separate ticket to track the required SSL/TLS certificate I will need for the HTTPS-enabled services on the server. This ticket follows a similar parallel process: documentation is filled out and validated, goes to the security team for approval and then comes back to the networking team for implementation. My original ticket for the perimeter change is still being worked on.

08/01

A firmware upgrade on the perimeter devices breaks high availability. The network team freezes all new work until the issue is corrected (they start their internal change control process for emergency break/fix issues).

08/24

The server’s HTTPS certificate has to be replaced before it expires at the end of the month. Our devs’ business group coughs up the few hundred dollars. We had planned to use the perimeter proxies’ wildcard certificate for no extra cost, but oh well, too late.

09/06

HA restored! Wonderful! New configuration changes are released to the networking team for implementation.

10/01

Nothing happens upstream… I am not sure why. I call about once a week and hear: we are swamped, two weeks until implementation, should be soon.

11/03

The ticket gets transferred to another member of the network team and within a week the configuration change is ready for testing.

11/07

The dev team discovers an issue. Their application is relying on the originating client IP address for logging and what basically amounts to “two-factor authentication” (i.e., a username is tied to an IP address). This breaks fantastically once the service gets moved behind a web proxy. Neat.

11/09

I work with the dev lead and the networking team to come up with a solution. It turns out we can pass the originating IP address through the proxies, but it changes the server-side variable that their code needs to reference.
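
A quick illustration of what changed, offered as a hedged sketch rather than our proxies’ exact behavior: once a reverse proxy sits in front of a web server, the socket peer the application sees is the proxy, and the original client address only survives if the proxy injects it into a header. The X-Forwarded-For header name below is an assumption; it depends entirely on what the perimeter devices are configured to pass.

    # Tiny HttpListener demo of the proxy problem (run from an elevated prompt
    # or add a urlacl reservation for the prefix first).
    $listener = New-Object System.Net.HttpListener
    $listener.Prefixes.Add('http://+:8080/')
    $listener.Start()

    $context = $listener.GetContext()
    $request = $context.Request

    # Behind a proxy this is the proxy's address, not the client's...
    "Socket peer:     $($request.RemoteEndPoint.Address)"
    # ...and the original client IP, if forwarded at all, arrives as a header.
    "X-Forwarded-For: $($request.Headers['X-Forwarded-For'])"

    $context.Response.Close()
    $listener.Stop()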

11/28

Business leaders say that the code change is a no-go. We are about to hit their “code/infrastructure freeze” period that lasts from December to April. Fair enough.

12/01

We hit the code freeze. Things open back up again in mid-April. Looking ahead, I already have infrastructure work scheduled for late April and early May, which brings us right around to June: one year.

EDIT: The change was committed on 05/30 and we passed our rollback period on 06/14. As of 06/19 I just submitted the last ticket to our networking team to remove the legacy configuration.

 

*WHEW* Let’s take a break. Here’s doge to entertain you during the intermission:

 

My team is asking for a change that involves taking approximately six services that are already publicly accessible via a legacy implementation, moving those services to a single IP address and placing an application proxy between the Big Bad Internet and the hosting servers. Nothing too crazy here.

Here’s some parting thoughts to ponder.

  • ITIL. Love it or hate it, ITIL adds a lot of drag. I hope it adds some value.
  • I don’t mean to pick on the other teams, but it clearly seems like they don’t have enough resources (expertise, team members, whatever they need, they don’t have enough of it).
  • I could have done better with all the process stuff on my end. Momentum is important, so I probably should not have let some of that paperwork sit for as long as it did.
  • The specialization of teams cuts both ways. It is easy to slip from being isolated and siloed to basic outright distrust, and when you assume that everyone is out to get you (probably because that’s what experience has taught you) then you C.Y.A. ten times till Sunday to protect yourself and your team. Combine this with ITIL for a particularly potent blend of bureaucratic misery.
  • Centralized teams like networking and security that are not embedded in different business groups end up serving a whole bunch of different masters, all of whom are going in different directions and want different things. In our organization this seems to mean that the loudest, meanest person who is holding their feet to the SLA gets whatever they want at the expense of their quieter customers like myself.
  • Little time lags magnify delay as the project goes on. Two weeks in security approval limbo puts me four weeks behind a few months down the road, which means I then miss my certificate expiry deadline, which then means I need to fill out another ticket, which then puts me further behind, and so on ad infinitum.
  • This kind of shit is why developers are just saying “#YOLO! Screw you Ops! LEEEEROY JENKINS! We are moving to the Cloud!” and ignoring all this on-prem, organizational pain and doing DevDev (it’s like DevOps but it leads to hilarious brokenness in other new and exciting ways).
  • Public Sector IT runs on chaos, disorder and the frustrations of people just trying to Do Things. See anything ever written by Kafka.
  • ITIL. I thought it was worth mentioning twice because that’s how much overhead it adds (by design).

 

Until next time, may your tickets be speedily resolved.

Don’t Build Private Clouds? Then What Do We Build?

Give Subbu Allamaraju’s blog post Don’t Build Private Clouds a read if you have not yet. I think it is rather compelling but also wrong in a sense. In summation: 1) Your workload is not as special as you think it is, 2) your private cloud isn’t really a “cloud” since it lacks the defining scale, resiliency, automation framework, PaaS/SaaS and self-service on-demand functionality that a true cloud offering like AWS, Azure or Google has and 3) your organization is probably doing a poor job of building a private cloud anyway.

Now let’s look at my team – we maintain a small Cisco FlexPod environment – about 14 ESXi hosts, 1.5 TB of RAM and about 250 TB of storage. We support about 600 users and I am primary for the following:

  • Datacenter Virtualization: Cisco UCS, Nexus 5Ks, vSphere, NetApp and CheckPoint firewalls
  • Server Infrastructure: Platform support for 150 VMs, running mostly either IIS or SQL
  • SCCM Administration (although one of our juniors has taken over the day-to-day tasks)
  • Active Directory Maintenance and Configuration Management through GPOs
  • Team lead responsibilities, at the discretion of my manager, for larger projects with multiple groups and stakeholders
  • Escalation point for the team, point-of-contact for developer teams
  • Automation and monitoring of infrastructure and services

My day-to-day consists of work supporting these focus areas – assisting team members with a particularly thorny issue, migrating in-house applications onto new VMs, working with our developer teams to address application issues, existing platform maintenance, holding meetings talking about all this work with my team, attending meetings talking about all this work with my managers, sending emails about all this work to the business stakeholders and a surprising amount of tier-1 support (see here and here).

If we waved our magic wand and moved everything into the cloud tomorrow, particularly into PaaS where the real value to cost sweet spot seems to be, what would I have left to do? What would I have left to build and maintain?

Nothing. I would have nothing left to build.

Almost all of my job is working on back-end infrastructure, doing platform support or acting as a human API/“automation framework”. As Subbu states, I am a part of the cycle of “brittle, time-consuming, human-operator driven, ticket based on-premises infrastructure [that] brews a culture of mistrust, centralization, dependency and control“.

I take a ticket saying, “Hey, we need a new VM,” run some PowerShell scripts to create and provision said new VM in a semi-automated fashion, and then copy the contents of the older VM’s IIS directory over. I then notice that our developers are passing credentials in plaintext back and forth through web forms and .XML files between different web services, which kicks off a whole week’s worth of work to re-do all their sites in HTTPS. I then set up a meeting to talk about these changes with my team (cross training) and, if we are lucky, someone upstream actually gets to my ticket and these changes go live. This takes about three to four weeks, optimistically.
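
For the curious, the “semi-automated” step amounts to something like the PowerCLI sketch below against the vSphere environment described earlier. The server names, template and customization spec are hypothetical placeholders, not our actual scripts.

    # Rough sketch of template-based VM provisioning with VMware PowerCLI.
    # All names here are made-up placeholders.
    Import-Module VMware.PowerCLI
    Connect-VIServer -Server 'vcenter.example.local'

    New-VM -Name 'APPWEB05' `
           -Template (Get-Template -Name 'W2K12R2-IIS') `
           -VMHost (Get-VMHost -Name 'esxi07.example.local') `
           -Datastore (Get-Datastore -Name 'FlexPod-DS01') `
           -OSCustomizationSpec (Get-OSCustomizationSpec -Name 'DomainJoin')

    Start-VM -VM 'APPWEB05'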

In the new world, our intrepid developer tweaks his Visual Studio deployment settings and his application gets pushed to an Azure WebApp, which comes baked in with geographical redundancy, automatic scale-out/scale-up, load-balancing, a dizzying array of backup and recovery options, integration with SaaS authentication providers, PCI/OSI/SOC compliance and the list goes on. This takes all of five minutes.
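
And for contrast, the operations side of that PaaS path is roughly the sketch below using the Az PowerShell module (the developer side really is just a publish from Visual Studio). Resource names, region and tier are made-up placeholders.

    # Rough sketch of standing up an Azure App Service web app with the Az module.
    # Names, location and tier are hypothetical.
    Connect-AzAccount
    New-AzResourceGroup  -Name 'rg-lob-app' -Location 'westus2'
    New-AzAppServicePlan -Name 'plan-lob-app' -ResourceGroupName 'rg-lob-app' -Location 'westus2' -Tier 'Standard'
    New-AzWebApp         -Name 'lob-app-web01' -ResourceGroupName 'rg-lob-app' -Location 'westus2' -AppServicePlan 'plan-lob-app'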

However, here is where I think Subbu gets it wrong: of our 150 VMs, about 50% belong to those “stateful monoliths”. They are primarily line-of-business applications with proprietary code bases that we don’t have access to, or legacy applications built on things like PowerBuilder that no one understands anymore. They are spread out across 10 to 20 VMs to provide segmentation but have huge monolithic database designs. It would cost us millions of dollars to re-factor these applications into designs that could truly take advantage of cloud services in their PaaS form. Our other option would be cloud-based IaaS, which from the developer’s perspective is not that different from what we are currently doing, except that it costs more.

I am not even going to touch on our largest piece of IT spend, which is a line-of-business application that has “large monolithic databases running on handcrafted hardware” in the form of an IBM z/OS mainframe. Now our refactoring cost is in the tens of millions of dollars.

 

If this magical cloud world comes to pass what do I build? What do I do?

  • Like some kind of carrion lord, I rule over my decaying infrastructure and accumulated technical debt until everything legacy has been deprecated and I am no longer needed.
  • I go full retar… err… endpoint management. I don’t see desktops going away anytime soon despite all this talk of tablets, mobile devices and BYOD.
  • On-prem LAN networking will probably stick around but unfortunately this is all contracted out in my organization.
  • I could become a developer.
  • I could become a manager.
  • I could find another field of work.

 

Will this magical cloud world come to pass?

Maybe in the real world, but I have a hard time imagining how it would work for us. We are so far behind in terms of technology and so organizationally dysfunctional that I cannot see how moving 60% of our services from on-prem IaaS to cloud-based IaaS would make sense, even if leadership could lay off all of the infrastructure support people like me.

Our workloads aren’t special. They’re just stupid and it would cost a lot of money to make them less stupid.

 

The real pearl of wisdom…

“The state of [your] infrastructure influences your organizational culture.”

Of all things in that post, I think this is the most perceptive as it is in direct opposition to everything our leadership has been saying about IT consolidation. The message we have continually been hearing for the last year and a half is that IT Operations is a commodity service – the technology doesn’t matter, the institutional knowledge doesn’t matter, the choice of vendor doesn’t matter, the talent doesn’t matter: it is all essentially the same and it is just a numbers game to find the implementation that is the most affordable.

As a nerd-at-heart I have always disagreed with this position because I believe your technology choices determine what is possible (i.e., if you need a plane but you get a boat, that isn’t going to work out for you), but the insight here that I have never really deeply considered is that your choice of technology drastically affects how you do things. It affects your organization’s cultural orientation to IT. If you are a Linux shop, does that technology choice precede your dedication to continuous integration, platform-as-code and remote collaboration? If you are a Windows shop, does that technology choice precede your stuffy corporate culture of ITIL misery and on-premise commuter hell? How much does our technological means of accomplishing our economic goals shape our culture? How much indeed?

 

Until next time, keep your stick on the ice.

A Ticket Too Far… Breaking the Broken

A funny thing happened a while back: one of my managers asked me to stop creating tickets on behalf of customers. This, uh, well, this kind of made me pause for a few reasons. The first and most obvious one is that I cannot remember shit. I always feel terrible when I forget someone’s request, and I feel doubly terrible when I forget it due to an oversight as simple as not logging a ticket. The second is that it is generally considered a Good Thing (TM) to track your customer requests. I won’t even bother supporting that proposition because Tom Limoncelli has pretty much got that covered in Time Management for System Administrators.

The justification for this directive is pretty simple and common-sense and is a great example of how a technical person like me with the best of intentions can actually develop some self-sabotaging behavior.

  • Tickets created for customers by me, with my notes in them, are confusing to Tier-1/Tier-2 support folks. It looks like I created the ticket but forgot to own it and am still working the issue, when in actuality I bumped the request all the way back down to Tier-1 where it should have started. Nothing makes a ticket linger in limbo longer than looking like someone is working it while not being owned by anyone. This tendency for tickets to live in limbo is exacerbated because our ticket system does not support email notification.
  • Customers are confused when a Tier-1/Tier-2 person calls them after picking up a ticket from the queue and asks, “Hey there, I am calling about Request #234901 and your <insert issue here>”.
  • Finally and most importantly, it does nothing to help correct the behavior of customers and teach them the one true way to request assistance from IT by submitting a ticket.

OK. Rebuttal time! (Which sounds kind of weird when you say it out loud). The first two points are largely an artifact of our ticketing system and/or its implementation.

The ticket queue is actually a generic user in the ticket system that tickets can get assigned to by customers. There is no notification when a ticket is created and assigned to this queue, nor any when a ticket is assigned to you. The lack of notification requires a manager or a lead on our team to police the queue, assign tickets to line staff based on who they think is best suited to work a particular issue and then finally notify them via email, phone or in person.
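
To be clear about how small the missing piece is, the notification glue could be a scheduled script along the lines of the sketch below. Everything in it is hypothetical; the queue user, database, table and addresses are stand-ins for whatever our ticket system actually stores. It is meant only to show the shape of the workaround, not a real integration.

    # Hypothetical sketch: poll the queue for unannounced tickets and email the team.
    # Server, database, table, column and address names are all made up.
    Import-Module SqlServer

    $query = 'SELECT TicketId, Requester, Summary FROM dbo.Tickets ' +
             'WHERE AssignedTo = ''QueueUser'' AND Notified = 0;'
    $newTickets = Invoke-Sqlcmd -ServerInstance 'HELPDESKSQL' -Database 'HelpdeskDB' -Query $query

    foreach ($ticket in $newTickets) {
        Send-MailMessage -SmtpServer 'mail.example.local' `
                         -From 'helpdesk@example.local' `
                         -To 'opsteam@example.local' `
                         -Subject "New ticket #$($ticket.TicketId): $($ticket.Summary)" `
                         -Body "Requested by $($ticket.Requester). Please grab it from the queue."
    }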

The arguably confusing series of events where a ticket is created on behalf of a user is, again, mainly a technical fault of the system. The requester is set to the customer, but line staff that pick up the ticket may just read the notes, which have my grubby hands all over them… so whose issue is it? Mine or the customer’s?

That being said – both of these points could largely be alleviated by a smarter ticket system that had proper notification and our Tier-1 guys reading the notes a little more carefully. I can forgive them their trespass since they are extremely interrupt driven and have a tendency to shoot tickets first and ask questions later but still, the appropriate context is there.

The last point, the idea that creating tickets reinforces bad end-user behavior, is by far the most salient one in my opinion. If you let people get away with not submitting tickets you are short-changing yourself and them. I won’t get credit for the work, the work won’t be documented, we won’t have accurate metrics and I am about 1000% more likely to forget the request and never do it.

Problem: We don’t have a policy requiring users to submit a ticket for a request; it’s more like a guideline. And the further up the support tiers you go, the fewer and fewer requests have tickets. This leaves my team in an interesting spot: we either create the ticket for the customer, tell the customer we won’t work the issue if they don’t create a ticket first (kind of a dick move, especially when our policy has no teeth), or not create a ticket at all.

Conclusion: Right idea but we are still focused on the symptom and not the cause. Let’s review.

  • The ticket system has technical deficiencies that lead to less than ideal outcomes. It makes it cumbersome for both technical staff and customers to use and relies on staff doing the very thing ticket systems are supposed to reduce: interrupting people to let them know they have work assigned to them.
  • A policy is not useful if it does not have teeth. I already feel like a jerk telling a customer “Hey, I am working with another team/customer/whatever, but if you submit a ticket someone will take a look” when they are standing in my cubicle with big old doe eyes. I especially feel like a jerk when I do not even have a policy backing me up. Paraphrasing Tom Limoncelli: your customers judge your competency by your availability, while your manager judges it by your completion of projects. These dual requirements are directly opposed and balancing them is incredibly important.
  • By the time I am creating a ticket on behalf of a customer the battle is already lost. I’ve already been interrupted, with all the lost efficiency and the danger of mistakes that comes with it.
  • The customers that do not submit tickets get preferential treatment. They get to jump ahead of all the people that actually did submit tickets, which hardly seems fair. All that is happening here is that we are encouraging the squeaky wheels to squeak louder.
  • The escalation chain gets skipped. A bunch of these kinds of issues should be caught at Tier-1 and Tier-2. By skipping right to Tier-3, we are not applying our skills optimally and are also depriving the Tier-1 and Tier-2 guys of the chance to chew on a meatier problem. A large part of the reason I am creating tickets for customers is to bump the request back down to Tier-1 and Tier-2 where it should have been dealt with to begin with.

Creating tickets on behalf of customers is not the problem. It is a symptom of deeper issues in Process and Technology. These issues will not be resolved by no longer generating tickets for customers. Customers will still skip the escalation chain, we will continue to reinforce bad behavior, fewer issues will get recorded, and our Tier-2 and Tier-3 will still be interrupt-driven regardless of whether there is a ticket or not. All that will change is that we will be more likely to forget requests.

The technical problems can be resolved by implementing a new ticket system or by fixing our existing one. The policy problems can be solved by creating a standardized policy for all our customers and then actually ensuring that it has teeth. The people problems can be fixed by consistent and repeated re-training.

That covers the root cause but what about now? What do we do?

  • We create the ticket for the customer – We cannot really do that. It disobeys a directive from leadership and it has all the problems discussed above.
  • We tell the customer to come back with a ticket – This does not really address the root cause, annoys the customers and we do not have policy backing it up. It is not really an option.
  • Do not use a ticket to track the request – And here we are, by process of elimination. If things are broken, sometimes the best way to fix them is to let them break even further.

Until next time . . .

“When the world gets bad enough, the good go crazy, but the smart… they go bad.” – Evil Abed

World Backup Recovery Testing Day?

Yesterday was apparently the widely celebrated World Backup Day. Just like in reality, the party ends sometime (unless you happen to be Andrew W.K.) and now you have woken up with a splitting headache, a vague sadness and an insatiable desire for eggs benedict. If installing and configuring a new backup system is an event that brings you joy and revelry like a good party, the monotony of testing the recovery of your backups is the hangover that stretches beyond a good greasy breakfast. I propose that today should thus be World Backup Recovery Testing Day.

There is much guidance out there for anyone who does cursory research on how to design a robust backup system, so I think I will save you from my “contributions” to that discussion. As much as I would like to relay my personal experience with backups, I do not think it would be wise to air my dirty laundry this publicly. In my general experience, backup systems seem to get done wrong all the time. Why?

 

Backups? We don’t need those. We have snapshots.

AHAHAHAHAHAHA. Oh. Have fun with that.

I am not sure what it is about backup systems but they never seem to make the radar of leadership. Maybe because they are secondary systems so they do not seem as necessary in the day-to-day operations of the business as production systems. Maybe because they are actually more complicated than they may seem. Maybe because the risk to cost ratio does not seem like a good buy from a business perspective, especially if the person making the business decision does not fully understand the risk.

This really just boils down to the same thing: technical staff not communicating the true nature of the problem domain to leadership and/or leadership not adequately listening to the technical staff. Notice the and/or. Communication: it goes both ways. If you are constantly bemoaning the fact that management never listens to you, perhaps you should change the way you are communicating with your management? I am not a manager so I have no idea what the corollary to this is (ed. managers, feel free to comment!).

Think about it. If you are not technical, the difference between snapshots and a true backup seems superfluous. Never mind that snapshots typically live on the same storage as the data they are supposed to protect, so they die right along with it. Why would you pay more money for a duplicate system? If you do not have an accurate grasp of the risk and the potential consequences, why would you authorize additional expenditures?

 

I am in IT. I work with computers not people.

You do not work with people, you say? Sure you do. Who uses computers? People. Generally people that have some silly business mission related to making money. You had best talk to them and figure out what is important to them, not you. The two are not always the same. I see this time and time again: technical staff implements a great backup system but fails to back up the stuff that is critical to the actual business.

Again: communication. To a technical person, one database looks more or less identical to another one. I need to talk to the people that actually use that application and get some context; otherwise how would I know which one needs a 15-minute Recovery Time Objective and which one is a legacy application that would be fine with a 72-hour Recovery Time Objective? If it were up to me, I would back up everything, with infinite granularity and infinite retention, but despite the delusion that many sysadmins labour under, they are not gods and do not have those powers. Your backup system will have limitations and the business context should inform your decision on how you accommodate those limitations. If you have enough storage to retain all your backups for six weeks, or half your backups for 4 weeks and half for 4 months, and you just make a choice, maybe you will get lucky and get it right. However, the real world is much more complicated than this scenario and it is highly likely you will get it wrong and retain the wrong data for too long at the expense of the right data. These kinds of things can be Resume Generating Events.
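
If it helps, the output of those conversations does not need to be anything fancier than a table mapping each application to the recovery targets the business actually agreed to, which then drives backup frequency and retention. The applications and figures below are made-up placeholders, purely to show the shape of the thing.

    # Hypothetical recovery-target worksheet; every app name and figure is a placeholder.
    $recoveryTargets = @(
        [pscustomobject]@{ App = 'Timekeeping';         RPO = '15 minutes'; RTO = '1 hour';   Retention = '6 weeks'  }
        [pscustomobject]@{ App = 'Compliance training'; RPO = '24 hours';   RTO = '72 hours'; Retention = '4 weeks'  }
        [pscustomobject]@{ App = 'Legacy reporting';    RPO = '1 week';     RTO = '72 hours'; Retention = '4 months' }
    )

    # The aggressive tiers earn frequent backups and long retention; the relaxed
    # tiers get the cheaper schedule, instead of guessing and hoping.
    $recoveryTargets | Format-Table -AutoSize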

My favorite version of this is the dreaded Undocumented Legacy Application that is living on some aging workstation tucked away in a forgotten corner. Maybe it is running the company’s timesheet system (people get pissed if they cannot get paid), maybe it is running the HVAC control software (people get pissed if the building is a nice and frosty 48 degrees Fahrenheit), maybe it is something like SCADA control software (engineers get pissed when water/oil/gas does not flow down the right pipes at the right time, and people may get hurt). How is technical staff going to have backup and recovery plans for things like this if they do not even know they exist in the first place?

It is hard to know if you have done it wrong

In some ways, the difficulty of getting backup systems right is that you only know if you have got it right once the shit hits the fan. Think about the failure mechanism for production systems: You screwed up your storage design – stuff runs slow. You screwed up your firewall ACLs – network traffic is blocked. You screwed up your webserver – the website does not work any more. If there is a technical failure you generally know about it rather quickly. Yes, there are whole sets of integration errors that lie in wait in infrastructure and only rear their ugly head when you hit a corner case but whatever, you cannot test everything. #YOLO #DEVOPS

There is no imminent failure mechanism constantly pushing your backup system towards a better and more robust design, since you only really test it when you need it. Without this Darwinian IT version of natural selection you generally end up with a substandard design and/or implementation. Furthermore, for some reason backups up here are associated with tapes, and junior positions are associated with tape rotation. This cultural prejudice has slowly morphed into junior positions being placed in charge of the backup system; arguably not the right skill set to be wholly responsible for such a critically important piece of infrastructure.

Sooooo . . . we do a lot of things wrong and it seems the best we can do is a simulated recovery test. That’s why I nominate April 1st as World Backup Recovery Testing Day!

 

Until next time,

Stay Frosty

The Art and Burden of Documentation

I have been thinking a lot about documentation lately, mostly about my own shortcomings and trying to understand why the act of documenting seems so difficult and why the quality of the documentation that does get done is often found lacking. Good documentation and good documentation practices are such a fundamental part of the health of an IT shop you would think we as a field would be better at it. My experience is limited and anecdotal (whose is not?) but I have yet to see a shop with solid documentation and solid documentation practices. This extends to myself as well. I can look back at my various positions and roles and there are very few where I actually felt satisfied with the quality of my documentation.

Read on for “aksysadmin’s made-up principles of how to not suck at documentation and do other things good too”.

 

1. Develop a standardized format, platform and process from the bottom up. Your team uses this, not you.

Leadership has a tendency to standardize on a single format, platform or process. This is generally considered a good thing. The problem is, leadership does not write technical documentation. We do. And what format makes sense to them may not make any sense to the technical staff (*cough* ITIL *cough*). What platform seems adequate to them may seem unwieldy to sysadmins (*cough* SharePoint *cough*). Standardization may generally be considered a good thing, but forcing a format, platform or process on a team without their input, or without understanding their problem domain and use case, is generally considered a bad thing. The harder you make it for your team to document, the less likely they will be to perform a task they are already unlikely to perform.

2. Don’t document how, document why

This is partly an internal challenge (IT staff documenting the wrong things) and partly an external challenge (leadership requiring the wrong kinds of things to be documented). I see lots of documentation that is essentially a re-hashed version of a vendor’s manual. Ninety-nine percent of the time your vendor has exhaustive resources on how to do something. It is right there. In the manual. Go read it. Unless it is incredibly unintuitive, and sometimes it is, why would you waste your precious time re-writing an authoritative set of information into a non-authoritative set that requires your team to maintain it? Reading and understanding vendor documentation should be considered a fundamental skill; if your guys cannot read or are unwilling to read vendor manuals, you have other problems that need addressing.

What you should document is why you did things. In six months you will not remember why this particular group was set up or why things are this way instead of that way, and your successor certainly will have no idea. Use your documentation to provide context and meaning.

3. Document where to find things

Documenting why something is the way it is is great, but it is also important to document where things are. I am talking about things like IP addresses, organizational charts, passwords and so on. This is another opportunity to avoid work, err, work more efficiently. Chances are many of these things have authoritative sources maintained by other people or tools. Why write your IP addresses down manually in an Excel spreadsheet when you can use a tool like IPAM to track them? Why track the phone tree for your different workgroups when Active Directory can do that for you? Why spend time doing stuff that is already done? Why indeed?
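
As a concrete example of letting the authoritative source answer, pulling a workgroup phone tree out of Active Directory is a quick query rather than a hand-maintained spreadsheet. The OU path and the exact attributes below are placeholders for whatever your directory actually uses.

    # Sketch: build a phone list from AD instead of documenting it by hand.
    # The OU path is a hypothetical placeholder.
    Import-Module ActiveDirectory

    Get-ADUser -Filter * -SearchBase 'OU=Helpdesk,DC=example,DC=local' `
               -Properties telephoneNumber, title, department |
        Select-Object Name, title, department, telephoneNumber |
        Sort-Object department, Name |
        Format-Table -AutoSize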

Figuring out what stuff to document in the where category can be hard to do. I have found the easiest way to do this is to pretend you are brand new. Better yet, if you have a brand new team member, ask him to track these kinds of information requests as he acquaints himself with your particular little piece of hell. What does he need to know right now to do your job? That is what your replacement will be asking himself after you have ascended.

4. Don’t document break/fix issues

Do not fill your wiki, SharePoint, OneNote or file share full of Word .docx files with break/fix issues. Your infrastructure and process documentation should be broad and “provide context and meaning”, which is pretty much the opposite kind of information from what break/fix issues are about – specific configurations, systems or problems.

You already have a place to “document” break/fix issues – it is called your damn ticket system. Use it. Document your fixes in your tickets. If your Tier-1 guys have a habit of closing all but the simplest of tickets with “done” or “fixed”, slap them (and probably yourself as well) and say that their future self just came back in time to hit them for making their job harder. If you do not have a ticket system, then you have other problems that need to be addressed.

5. Have a panic page

Take the really important stuff from the why documentation and the where documentation and make it into a panic page. A panic page is a short piece of documentation that contains all the information you or anyone else would need in order to deal with a “whoops” situation. Think things like vendor contact phone numbers, contract support entitlements, how to file and escalate a case and maybe where to find your co-worker’s emergency scotch. I borrowed this one from my supervisor and it turned out to be a prescient suggestion on his part.

6. Have a hard copy

This is an extension of the panic page principle. Panic situations have a way of making electronic documentation inaccessible. “Oh, but wait, my documentation is in the cloud, I can get to it anywhere with my mobile device, oh I am so smart,” you say. Yeah, well, you will be screwed when you drop your iPad, or it runs out of batteries, or you happen to live in Alaska, which has comparable infrastructure to, say, Afghanistan. Have a paper copy on hand, preferably two. Yes, it will be harder to maintain but it will be a lot better than having no documentation if your file server explodes or the polar bears take over your data center.

7. Have designated documentation days

As a sysadmin, you generally do not have the luxury of setting your own priorities. If your leadership wants documentation, instead of just saying “Hey, we need to document better” at your weekly staff meeting, they need to make it a priority. Nothing does this better than designating a day for documentation. Read-Only Friday is a good one because you are not making changes on Friday anyway, right… righhhhtttt? Of course, you are still going to get interruptions and tickets, so designate one person as the interruption blocker and another team member as the documenter (borrowed from Tom Limoncelli’s excellent Time Management for System Administrators). Rotate individuals as appropriate. These designated documentation days are your time, to make time, to actually Get Shit Done. All those little notes you meant to flesh out with some more context but never had time to… do it now. Organize your stuff. Clean it up. Review it for accuracy. Do this with a frequency related to how fast things change. Until leadership makes documentation a priority, you will always have another priority that trumps it.

 

These ideas address some of those external and internal challenges that you may have and that I know I have. I am more inclined to document stuff if I am not documenting dumb stuff that is already documented elsewhere. I will have an easier time finding the documentation we do have if it is organized in a way that works for me and my team, both the consumers and producers of it. If I have dedicated time to actually perform the act of documenting, it will probably get done. If not, then I am answerable to someone. Of course, there can be a large gap between knowing what needs improvement and actually fixing it. Until then.

Stay frosty.