Things in our Datacenter that Annoy Me

Or alternatively how I learned to stop worrying about the little things…

In this post, I complain about little details that show my true colors as some kind of pedantic, semi-obsessive, detail-oriented system administrator. I mean, I try to play it cool but inside I am really freaking out, man! Not really, but also kind of yes. More on that later.

 

Our racks are not deep or wide enough

Our racks were not sized correctly initially. They are quite “shallow”. A Dell 730 on ReadyRails is about 28″ deep, which is a pretty standard mounting depth for full-size rackmount equipment. In our racks, we only have about 4-6″ of space remaining between the posts and the door in the back of the rack. This complicates cabling since we do not have a lot of room to work with, but it really gets annoying with PDUs. See below.

The combination of shallow depth and lack of width leads to weird PDU configurations

PDU Setup

The racks are too shallow to mount the PDUs parallel with the posts with the plugs facing out towards the door and too narrow to stack both PDUs on one side. The PDUs end up being mounted sideways where they stick out into the area between the posts, blocking airflow and making cabling a pain in the ass.

Check out u/tasysadmin’s approach, which is much improved over ours. The extra depth and width allow both power circuits (you do have two redundant power circuits, right?) to move over to one side of the rack and slide into the gap between the posts and the casing. This has a whole bunch of benefits: airflow is not restricted, you have more working space for cabling, your power does not have to cross the back of the rack and you can separate your data and your power.

Beautiful Rack Cabling

This also means that some of our racks have the posts moved in beyond the standard rack mounting depth of 28″ in order to better accommodate our PDUs, the result of which is that I only have two out of five racks that can accommodate a Dell PowerEdge.

Data and power not separated

You ideally want to run power on one side of the rack and data on the other. Most people will cite electromagnetic interference as a good reason for doing this but I have yet to see a problem caused by it (knock on wood). That being said, it is still a good idea to put some distance between the two, much like your recently divorced aunt and uncle at family functions. There are plenty of other good reasons for keeping data and power separate, most of which center around cabling hygiene – it helps keep things much cleaner because your data cables tend to run up and down the rack posts headed for things like your top-of-rack switch, whereas your power needs to go other places (i.e., equipment). It is a lot easier to bundle cables if they more or less share the same cable path.

Cannot access cable tray because of PDU cables

Cable Tray

This is just another version of “data and power are not separated”. Our power and data both come in at the top of the rack. This means our 10 4/C AWG feeds for each PDU, which are about 0.5″ in diameter, are draped across our cabling tray, which just sits on top of the racks instead of being suspended by a ladder bar (another great injustice!). I bet these guys generate quite the electromagnetic field. It would be nice if they were more than 4″ away from some of our 10 Gbps interconnects, huh? This arrangement also means the cable tray is a huge pain to use. You have to move all the PDU power cables off of it, then pop the lid off in segments to move your cables around. Or you can just run them all over the top of the rack and hope that the fiber survives, like we do. Again. Not ideal.

Inconsistent fastener use for mounting equipment

This one sounds kind of innocuous but it is one of those little details that makes your life so much easier. Pick a fastener type and size and stay with it. I am partial to M6 because the larger threads are harder to strip out and the head has more surface area for your driver’s bit to contact. It is pretty annoying to change tools each time the fastener type changes instead of just setting the torque level on your driver and going for it. Also – don’t even think of using self-tapping fasteners. They make cage nuts and square holes in rack posts for a reason.

Improper rail mounting and/or retention

Your equipment comes with mounting instructions and you should probably follow them. Engineers calculate how much weight a particular rail can bear and then figure out that you need four fasteners of grade X on each rail to adequately support the equipment. This is all condensed into some terrible IKEA-level instructions that make you shake your head as you wonder why your vendor could not afford a better technical writer given the obscene price of whatever equipment you are racking. Once you decipher these arcane incantations you should probably follow them. Don’t skip installing cage nuts and fasteners – if they say you need four, then you need four. It only takes two more minutes to do the job right.

AND FOR THE LOVE OF $DEITY, INSTALL WHATEVER HARDWARE IS REQUIRED TO RETAIN THE EQUIPMENT IN THE RAILS! Seriously. This is a safety issue. I am not sure why this step gets skipped and people just set things on the rails without using the screws that retain them to the posts, but racks move, earthquakes happen and this shit is heavy. I think most of our disk shelves are about 50 pounds. You do not want that falling out of the rack and onto your intern’s head.

Use ReadyRails (or vendor equivalent)

For about $80 you can have a universal, tool-less rail that installs in about 30 seconds. I would call that a good investment.

Inconsistent inventory tagging locations

I am guessing your shop maintains an inventory system and you probably have little inventory tags you affix to equipment. Do your best to make the place where the inventory tag goes consistent and readable once everything is racked and stacked. The last thing you want to do is pull an entire rack apart because some auditor wants you to find the magical inventory tag stuck on some disk shelf in the middle of a 12-shelf aggregate.

It would also be a good idea to put your inventory tag in your documentation so you do not have to play a yearly game of “find the missing inventory tag”.

Cable labeling is not consistent (just use serialized cables)

I suck at cable labeling and documentation in general (see here), so this is a bit hypocritical. Nevertheless, I find that there are four stages of cable labeling: nothing; consistent labeling of port and device on each end; confusion as labeled cables are reused but the labels are not changed; and finally adoption of serialized cables, where each end has a unique tag that is documented.

This is largely personal preference but the general rules are simple: keep it clean, keep it consistent and keep it current (your documentation, that is). The only thing worse than an unlabeled cable is a mislabeled cable.
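If you want to dip a toe into the serialized-cable stage without buying anything, here is a minimal sketch of the idea: mint a unique serial for each cable and record both ends somewhere central. The CSV file and label format are my own placeholders, not anything we actually run.

```python
#!/usr/bin/env python3
"""Mint serialized cable labels and record where each end lands."""
import csv
from pathlib import Path

INVENTORY = Path("cable_inventory.csv")  # hypothetical documentation target

def next_serial(prefix: str = "CBL") -> str:
    # Start numbering after whatever is already recorded in the CSV.
    existing = 0
    if INVENTORY.exists():
        with INVENTORY.open(newline="") as f:
            existing = sum(1 for _ in csv.DictReader(f))
    return f"{prefix}-{existing + 1:05d}"

def record_cable(a_end: str, b_end: str) -> str:
    serial = next_serial()
    new_file = not INVENTORY.exists()
    with INVENTORY.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["serial", "a_end", "b_end"])
        writer.writerow([serial, a_end, b_end])
    return serial

if __name__ == "__main__":
    # Print the serial, slap it on both ends of the cable, and the record is made.
    print(record_cable("rack3-sw01 port 14", "rack3-esx02 vmnic2"))
```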

Gaps in rack mount devices

Shelf Gap

Why? Just why? I don’t know and will probably never know… but my best guess is that the rail on the top shelf was slightly bent during installation, and then when we needed to add another shelf later the rail interfered with it. Ten minutes spent originally could have saved ten hours down the road. If it turns out I am one post hole short of being able to install another shelf, I get to move all the workloads off this aggregate, pull out all the disk shelves until I reach this one, fix or replace the rail, re-rack and re-cable everything, re-create the aggregate and then move the workloads back.

 

Now that I have complained a bit (I am sure r/sysadmin will say that I have it way too easy), I get to talk about the real lesson here: none of this shit matters.

On one level, it does. All these little oversights accumulate technical debt that eventually comes back and bites you on the ass, and doing it right the first time is the easiest and most efficient way. On the other hand, none of this stuff directly breaks things. The fact that the power and data cabling are too close together for my comfort, or that there is a small gap in one of the disk shelf stacks, does not cause outages. We have plenty of things that do, however, and those demand my attention. So collectively, let’s take a deep breath, let it go, and stop worrying about it. It’ll get fixed someday.

The other lesson here is that nothing is temporary. If you cut a corner, particularly with physical equipment, that corner will remain cut until the equipment is retired. It is just too hard and costly to correct these kinds of oversights once you are in production. If you are putting a new system up, take some time to plan it out – consider the failure domain, how much fault tolerance and redundancy you need, labeling, inventory and all those other little things. You only get to stand this system up once. Go slow and give it some forethought; you may thank yourself one day.

A Ticket Too Far… Part II

Part I:  A Ticket Too Far… Breaking the Broken

OK. So I screwed up. In the post linked above, I make a lot of claims about ticket systems and process. Many of those claims were based on the idea that we did not have an approved and enforced policy in place. Turns out I was wrong, sort of.

I did a little digging and reviewed the policy. Customers are asked to submit a ticket if their issue is not immediately preventing their work; otherwise they can call the help desk number, call an IT staff member directly, or visit them in person. There is a lot I could say about this policy and which provisions I agree with and which I do not, but policy is policy and I misrepresented the strength of it – it is very much vetted and approved.

I see the goals of a policy on customer-facing support as follows:

  • Prevent sysadmins from forgetting customer requests
  • Allow sysadmins to control their interrupt-based workflow, prioritize and not have their workflow control them
  • Track customer requests and incidents so pain-points can be discovered and resolved proactively
  • Build a database of break/fix-based documentation
  • Create an acknowledgement and feedback mechanism for customer issues (i.e., “you have been assigned ticket #232”), backed up by mechanisms that force sysadmin action (i.e., “this ticket has not been touched in three days, either close it or reply”). This feedback loop ensures that issues are acknowledged and resolved one way or another.

The details may be wrong but the bigger point of my last post remains the same: the combination of our policy and our ticket system’s technical limitations does not accomplish those goals or lead to ideal outcomes for either customers or IT staff.

But does it really? Perception and reality are not always the same, so I started tracking how often I was interrupted by a customer or a team member over a period of about four weeks. It is important to mention this was not a particularly rigorous study; I just kept an Excel spreadsheet, and anytime I diverted my attention from my current task for more than a few minutes I made a quick note of it. If anything, I was consistent in my inconsistency. I also kept track of what kind of interrupt it was, what group it came from and whether or not it had a ticket attached to it.

 

a graph showing the number of interrupts per day

A couple of interesting discoveries here:

  • I am not interrupted nearly as much as I think I am. If you throw out the obvious outlier of Day 12, the median is two interrupts per day. Not as bad as I would have thought… but it is not that great either, considering that with meetings and other obligations I probably only have one period per day of uninterrupted time longer than two hours to focus on complex projects. Getting an interrupt during that window is a pretty serious setback.
  • 48% of the interrupts were related to break/fix issues. The other 52% were what I call “communication interrupts”. More on these later.
  • Of the 31 break/fix interrupts I recorded, only one actually had a ticket already associated with it. This is mind-bogglingly terrible as far as accepted time management best practices go.
  • Only 41% of the interrupts required immediate action; the rest could have been queued and prioritized later. This means 59% of these interruptions really did not need to be interruptions at all – they needed to be tickets or agenda items in a meeting. (The rough tally sketch below shows how little math this took.)
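For anyone curious, the “study” needed nothing fancier than a spreadsheet export and a few lines of Python. This is a minimal sketch assuming the log was saved as a CSV with hypothetical columns day, kind, had_ticket and immediate; the column names and file name are mine, not an actual artifact of the experiment.

```python
#!/usr/bin/env python3
"""Tally the interrupt log from the (hypothetical) CSV export."""
import csv
from collections import Counter
from statistics import median

with open("interrupts.csv", newline="") as f:
    rows = list(csv.DictReader(f))

per_day = Counter(r["day"] for r in rows)                # note: days with zero interrupts never appear
breakfix = [r for r in rows if r["kind"] == "breakfix"]  # the rest are "communication" interrupts
ticketed = [r for r in breakfix if r["had_ticket"] == "y"]
immediate = [r for r in rows if r["immediate"] == "y"]

print(f"median interrupts/day: {median(per_day.values())}")
print(f"break/fix share: {len(breakfix) / len(rows):.0%}")
print(f"break/fix with an existing ticket: {len(ticketed)} of {len(breakfix)}")
print(f"required immediate action: {len(immediate) / len(rows):.0%}")
```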

Even with my back-of-the-napkin math these are pretty damning conclusions. Our ticket system capture rate, at least at Tier-3 where I spend most of my time, is laughably non-existent. Going back to my previous post about creating tickets for customers: there would be little point, considering I would be creating 97% of the tickets myself. Interestingly enough, it is not like these requests just vanish into thin air. They get tracked and recorded somehow, albeit most likely by individual staff in a manner that is not portable or visible. The work required to track issues is still being done; the organization is just not getting a lot of value out of it since it is recorded in some crusty senior sysadmin’s logbook.

As an aside, it would be really interesting to perform the same experiment at the Tier-1 and Tier-2 support levels and see what the ratio is. Maybe it is higher, and it is the kind of issues and/or customers that Tier-3 deals with that lead to low ticket system capture. Or worse, maybe it is the same and ticket system capture is just really bad everywhere.

Almost 60% of these interrupts did not have to be interrupts. They did not require immediate action. This is also pretty terrible because interrupts cost a lot. The accepted wisdom is that it takes between 10 and 20 minutes to get back in “the zone” (source) after being interrupted. For me, at a median of two interrupts per day, that is about 40 minutes lost on average per day just dealing with context switching from one task to another. There is also a greater risk of mistakes being made while trying to regain focus on your task. It is harder to calculate this cost but it is assuredly there.

Finally, a bit over half of these interrupts were “communication interrupts”. These are hard to pin down but mostly they were a team member or a customer wanting to communicate something to me, “Hey, I fixed item A on Server 1, but item B and item C are still broken”, or a request for information, “Hey, how do items Z and Y work on Server 2 and Server 3?”. These clearly have an interrupt cost, but on the other hand there is also a cost to not letting them jump immediately to the top of my to-do pile. If someone needs information from me it is likely because they are in a wait-cycle – they need to know something to continue their current task. It feels like it boils down to a “whose time is more valuable?” argument. Is it better for a Tier-2 team member to burn 45 minutes instead of 10 on a task because she had to dig up some information instead of asking a Tier-3 sysadmin? Or is it better for the Tier-3 sysadmin to burn 20 minutes to save his Tier-2 team member 35? I do not really know if there is an answer here but it is an interesting thread to pull on…

The interrupts that served to communicate information to me seem a little more clear cut. None of them required immediate action and most of them were essentially status updates. Implementing a daily standup meeting or something similar would be a perfect format for these kinds of interactions.

 

Well. It was an interesting little experiment. I am not sure if I am smarter or better off for it but curiosity is an itch that sometimes just needs to be scratched.

Until next time, stay frosty.

 

Are GOV IT teams apathetic?

I have been stewing about this post on r/sysadmin, “Is apathy a problem in most government IT teams?”, for a while and felt like it was worth a quick write-up since most of my short IT career has been spent in the public sector.

First off, apathy and team dysfunction are problems everywhere. There is nothing unique about government employees versus private employees in that respect. What I think the poster is really asking is, “Is there something about government IT that produces apathetic teams?” and if you read a little deeper it seems like apathy really means “permanent discouragement”; that is to say, the condition where change, “doing things right or better”, and greater efficiency are, or seem, impossible. When you read something like, “…trying to make things more efficient is met with reactions like ‘oh you naive boy’ and finger pointing,” it is hard to call that plain old vanilla apathy.

Government is not a business (despite what some people think). Programs operate at a loss and are subsidized, in many cases entirely, by taxes because the public and/or their representatives deem those programs worthy. The failure mechanism of market competition doesn’t exist. Incredibly effective programs can be cancelled because they are no longer politically favorable and incredibly ineffective programs can continue or expand because they have political support. Furthermore, in all things public servants need to remain impartial, unbiased and above impropriety. This leads to vast and byzantine processes, the components of which singularly make eminent good sense (for example, the prohibition of no-bid contracts) but collectively all these well-intentioned barnacles slow the ship-of-state dramatically. Success is not rewarded with growth either. Implementing a more efficient process or a more cost-effective infrastructure and saving money generally results in less money. This tendency of budget reduction (“Hey, if you saved it, you did not need it to begin with, right?”) turns highly functioning teams into disasters over time as they lose resources. Paradoxically, the better you are at utilizing your existing resources, the less you get. Finally, your entire leadership changes with every administration change. You may still be shoveling coal down in the engine room, but the new skipper just sent down word to reduce steam and come about hard in order to head in the opposite direction. Generally, private companies that do this kind of thing, with this frequency, do not last long.

How does all this apply to Information Technology? It means that your organization will move very, very slowly and technology moves very, very fast. Not a good combo.

 

Those are the challenges that a team faces but what about the other half of the equation… the people facing them?

Job classes are just one small part of this picture but they are emblematic of some of the challenges that face team leads and managers when dealing with the ‘People’ piece of People, Process and Technology (ITIL buzzword detected! +5 points). The idea of job classes is that across the organization people doing similar work should be paid the same. The problem is that updating a job class is beyond onerous and the time to completion is measured in years. Do you know how quickly Information Technology reinvents itself? Really quick. This means that job classes and their associated salaries tend to drift away from the actual on-the-ground work being done and the appropriate compensation level over time, making recruitment of new staff and retention of your best staff very difficult (The Dead Sea Effect). If you combine this with a lack of training and professional development, staff have a tendency to get pigeon-holed into a particular role without a clear promotion path. Furthermore, many of the job class series are disjointed in such a way that working at the top of one series will not meet the prerequisites for another, making advancement difficult and, at least on paper, sometimes impossible. For example: you could work as a Lead Programmer for three years leading a team of five people and not qualify, at least on paper, for an entry-level IT Manager position.

How does all this apply to Information Technology? People get stuck doing one job, for too long, with no professional training or mentorship. Their skillsets decline towards obsolescence and they become frustrated and discouraged.

 

I have never met anyone in the public sector that just straight up did not give a crap. I have met people that feel stuck, discouraged, marginalized and ignored. And rightly so. Getting stuff done is very hard. It is like everyone has one ingredient necessary to make a cake, and you all more or less agree on the recipe. You are all trained and experienced bakers. You could easily make a cake, but you each have 100 pieces of paperwork you have to fill out and wait on, sometimes for months, before you can do your part of the cake-baking process. You have 10 different bosses, each telling you to make a different dessert when you know that cakes are by far the best dessert for your particular bakery. Then you get yelled at for not making a cake in a timely manner, and then you are all fired and replaced by food service contractors whose parent company charges an exorbitant hourly rate. But hey, the public eventually got their cake, right? Or at least a donut. Not exactly what they ordered but better than nothing… right?

If IT is a thankless job (and I am not sure I agree with that piece of Sysadmin mythology), then Public Sector IT is even more thankless. You will face a Kafkaesque bureaucracy. You will likely be very underpaid and have a difficult time seeking promotion. You will never be able to accept a vendor-provided gift or meal over the price of $25. You will laugh when people ask if you plan on attending VMworld. The public will stereotype you as lazy, ineffective and overpaid. But you will persevere. You have a duty to the body politic to do your best with what you have. You will keep the lights on, you will keep the ship afloat even as more and more water pours in. You have to. Because that’s what Systems Administrators do.

And all you wanted was to simply make a cake.

 

 

One Year of Solitude: My Learning Experience as a Lead

It has been a little over a year since I stepped into a role as a technical lead and I thought this might be a good time to reflect on some of the lessons I have learned as I transition from being focused entirely on technical problems to trying to understand how those technical pieces fit into a larger picture.

 

Tech is easy. People are hard. And I have no idea how to deal with them.

It is hard to overstate this. People are really, really difficult to deal with compared to technology and I have so much to learn about this piece of the sysadmin craft. I do not necessarily mean people are difficult in the sense that they are oppositional or hard to work with (although often they are), just that team dynamics are very complicated and the people composing your team have a huge spread in terms of experience, skills, motivations, personalities and goals. These underlying “attributes” are not static either; they change based on the day, the mood and the project, which makes identifying them, understanding them and planning around them even harder. Awareness of this underlying milieu composing your team members, and thus your team, is paramount to your project’s success.

All I can say is that I have just begun to develop an awareness of these “attributes” and am just getting the basics of recognizing different communication styles (person and instance dependent). I can just begin to tell whose motivations align with mine and whose do not. In hunting we call this the difference between “looking and seeing”. It takes a lot of practice to truly “see”, especially if, like me, you are not that socially adept.

My homework in this category is to build an RPG-like “character sheet” for each team member, myself included,  and think about what their “attributes” are and where those attributes are strengths and where they can be weaknesses.

 

Everyone will hate you. Not really. But kinda yes.

One of the hardest parts of being a team lead is that you are now “in charge” of technical projects with a project team made up of many different members who are not within your direct “chain-of-command” (at least this is how it works in my world). This means you own the responsibility for the project but any authority you have is granted to you by a manager somewhere higher up the byzantine ladder of bureaucracy. Nominally, this authority allows you to assign and direct work directly related to the project, but in practice this authority is entirely discretionary. You can ask team member A to work on item Z but it is really up to her and her direct supervisor whether that is what she is going to do. In the hierarchical, authority-based culture and process-driven business world that most of us work in, this means you need to be exceedingly careful about whose toes you step on. Authority on paper is one thing, authority in practice is entirely another.

 

Mo’ People, Mo’ Problems

My handful of projects have thus far been composed of team members who kind of fall into these rough archetypes.

A portion of the team will be hesitant to take up the project and the work you are asking them to do since you are not strictly speaking their supervisor. They will passively help the project along and frequently you will be required to directly meet with them and/or their supervisor to make sure they are “cleared” for the work you assigned them and to make sure they feel OK about doing it. These guys want to be helpful but they don’t want to work beyond what their supervisor has designated. Get them “cleared” and make sure they feel safe doing the work and you have made a lot of progress.

Another portion of the team will be outright hostile. Either their goals or motivations do not align with the project or, even worse, their supervisor’s goals or motivations do not align with the project, but someone higher up leaned on them and so they are playing along. This is tough. The best you can hope for here is to move these folks from actively resisting to passively resisting. They might be “dead weight” but at least they aren’t actively trying to slow things down any more. I don’t have much of a working strategy here – an appeal to authority is rarely effective. Authority does not want to be bothered by your little squabbles, and arguably it has already failed because the chain-of-command can make someone play along, but it cannot make them play nice. I try to tailor my communication style to whatever I am picking up from these team members (see the poorly named Dealing with People You Can’t Stand), do my best to include them (trying to end-run them makes things ten times worse) and inoculate the team against their toxicity. I am a fan of saying I deal with problems and not complaints, because problems can actually be solved, but a lot of the time these folks just want to complain. Give them a soap box so they can get it out of their system and you can move on and get work done, but don’t let them stand on it for too long.

Another group will be unengaged. These poor souls were probably assigned to the project because their supervisor had to put someone on it. A lot of times the project will be outside their normal technical area of operations, the project will only marginally affect them, or both. They will passively assist where they can. The best strategy I have found here is to be concise, do your best not to waste their time, and use their experience and knowledge of the surrounding business processes and people as much as you can. These guys can generate some great ideas or see problems that you would never otherwise see. You just have to find a way to engage them.

The last group will be actively engaged and strongly motivated to see the project succeed. These folks will be doing the heavy lifting and 90% of the actual technical work required to accomplish the project. You have to be careful not to let these guys lean too hard on the other team members out of frustration, and you have to avoid overly relying on them or burning them out; otherwise you will be really screwed, since they are the only people truly putting in the nuts-and-bolts work required for the project’s success.

A quick aside: if you do not have enough people in this last group, the project is doomed to failure. There is no way a project composed mostly of people actively resisting its stated goals will succeed, at least not under my junior leadership.

Dysfunctional? Yes. But all teams are dysfunctional in certain ways and at certain times. Understanding and adapting to the nature of your team’s dysfunction lets you mitigate it and maybe, just maybe, help move it towards a healthier place.

Until next time, good luck!

A Ticket Too Far… Breaking the Broken

A funny thing happened a while back: one of my managers asked me to stop creating tickets on behalf of customers. This, uh, well, this kind of made me pause for a few reasons. The first and most obvious one is that I cannot remember shit. I always feel terrible when I forget someone’s request and I feel doubly terrible when I forget it due to an oversight as simple as not creating a ticket. The second is that it is generally considered a Good Thing (TM) to track your customer requests. I won’t even bother supporting that proposition because Tom Limoncelli has pretty much got that covered in Time Management for System Administrators.

The justification for this directive is pretty simple and common-sense and is a great example of how a technical person like me with the best of intentions can actually develop some self-sabotaging behavior.

  • Tickets created for customers by myself with my notes in them are confusing to Tier-1/Tier-2 support folks. It looks like I created the ticket but forgot to own it and am still working the issue, when in actuality I bumped the request all the way back down to Tier-1 where it should have started. Nothing makes a ticket linger in limbo longer than looking like someone is working it while it is not owned by anyone. This tendency for tickets to live in limbo is exacerbated because our ticket system does not support email notification.
  • Customers are confused when a Tier-1/Tier-2 person calls them after picking up a ticket from the queue and asks, “Hey there, I am calling about Request #234901 and your <insert issue here>”.
  • Finally and most importantly, it does nothing to help correct the behavior of customers and teach them the one true way to request assistance from IT by submitting a ticket.

OK. Rebuttal time! (Which sounds kind of weird when you say it out loud). The first two points are largely an artifact of our ticketing system and/or its implementation.

The ticket queue is actually a generic user in the ticket system that customers can assign tickets to. There is no notification when a ticket is created and assigned to this queue, nor any when a ticket is assigned to you. The lack of notification requires a manager or a lead on our team to police the queue, assign tickets to line staff based on who they think is best suited to work a particular issue and then finally notify them via email, phone or in person.
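For what it is worth, this is the sort of gap a few lines of glue code could paper over if the system exposed any kind of API. Below is a minimal sketch of the idea; the REST endpoint, field names, hostnames and addresses are all hypothetical, since our actual ticket system offers nothing of the kind (which is rather the point).

```python
#!/usr/bin/env python3
"""Poll the generic queue user and nag the team about unassigned tickets."""
import smtplib
from email.message import EmailMessage

import requests

QUEUE_URL = "https://tickets.example.org/api/tickets?owner=queue&status=open"  # hypothetical API
TEAM_ADDR = "ops-team@example.org"                                             # hypothetical address

def unassigned_tickets() -> list[dict]:
    # Ask the (hypothetical) API for anything still sitting on the queue user.
    resp = requests.get(QUEUE_URL, timeout=10)
    resp.raise_for_status()
    return resp.json()

def notify(tickets: list[dict]) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"{len(tickets)} unassigned ticket(s) in the queue"
    msg["From"] = "ticket-bot@example.org"
    msg["To"] = TEAM_ADDR
    msg.set_content("\n".join(f"#{t['id']}: {t['summary']}" for t in tickets))
    with smtplib.SMTP("mail.example.org") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    # Run from cron every 15 minutes or so; nobody has to babysit the queue.
    pending = unassigned_tickets()
    if pending:
        notify(pending)
```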

The arguably confusing series of events where a ticket is created on behalf of a user is, again, mainly a technical fault of the system. The requester is set to the customer, but line staff who pick up the ticket may just read the notes, which have my grubby hands all over them… so whose issue is it? Mine or the customer’s?

That being said – both of these points could largely be alleviated by a smarter ticket system with proper notification and by our Tier-1 guys reading the notes a little more carefully. I can forgive them their trespass since they are extremely interrupt-driven and have a tendency to shoot tickets first and ask questions later, but still, the appropriate context is there.

The last point, the idea that creating tickets reinforces bad end-user behavior, is by far the most salient one in my opinion. If you let people get away with not submitting tickets you are short-changing yourself and them. I won’t get credit for the work, the work won’t be documented, we won’t have accurate metrics and I am about 1000% more likely to forget the request and never do it.

Problem: We don’t have a policy requiring users to submit a ticket for a request; it’s more like a guideline. And the further up the support tiers you go, the fewer and fewer requests have tickets. This leaves my team in an interesting spot: we either create the ticket for the customer, tell the customer we won’t work the issue if they don’t create a ticket first (kind of a dick move, especially when there are no teeth to our policy) or do not create a ticket at all.

Conclusion: Right idea but we are still focused on the symptom and not the cause. Let’s review.

  • The ticket system has technical deficiencies that lead to less than ideal outcomes. It is cumbersome for both technical staff and customers to use and relies on staff doing the very thing ticket systems are supposed to reduce: interrupting people to let them know they have work assigned to them.
  • A policy is not useful if it does not have teeth. I already feel like a jerk telling a customer “Hey, I am working with another team/customer/whatever but if you submit a ticket someone will take a look” when they are standing in my cubicle with big old doe eyes. I especially feel like a jerk when I do not even have a policy backing me up. Paraphrasing Tom Limoncelli: “Your customers judge your competency by your availability. Your manager judges it by your completion of projects. These dual requirements are directly opposed and balancing them is incredibly important.”
  • By the time I am creating a ticket on behalf of a customer the battle is already lost. I’ve already been interrupted, with all the lost efficiency and danger of mistakes that come with it.
  • The customers that do not submit tickets get preferential treatment. They get to jump ahead of all the people who actually did submit tickets, which hardly seems fair. All that is happening here is that we are encouraging the squeaky wheels to squeak louder.
  • The escalation chain gets skipped. A bunch of these kinds of issues should be caught at Tier-1 and Tier-2. By skipping right to Tier-3, we are not applying our skills optimally and we are also depriving the Tier-1 and Tier-2 guys of the chance to chew on a meatier problem. A large part of the reason I am creating tickets for customers is to bump the request back down to Tier-1 and Tier-2 where it should have been dealt with to begin with.

Creating tickets on behalf of customers is not the problem. It is a symptom of deeper issues in Process and Technology. These issues will not be resolved by no longer generating tickets for customers. Customers will still skip the escalation chain, we will continue to reinforce bad behavior, fewer issues will get recorded, and our Tier-2 and Tier-3 will still be interrupt-driven regardless of whether there is a ticket or not. All that will change is that we will be more likely to forget requests.

The technical problems can be resolved by implementing a new ticket system or by fixing our existing one. The policy problems can be solved by creating a standardized policy for all our customers and then actually ensuring that it has teeth. The people problems can be fixed by consistent and repeated re-training.

That covers the root cause but what about now? What do we do?

  • We create the ticket for the customer – We cannot really do that. It disobeys a directive from leadership and it has all the problems discussed above.
  • We tell the customer to come back with a ticket – This does not really address the root cause, annoys the customers and we do not have a policy backing it up. It is not really an option.
  • Do not use a ticket to track the request – And here we are by process of elimination. If things are broken, sometimes the best way to fix them is to let them break even further.

Until next time . . .

When the world gets bad enough, the good go crazy, but the smart… they go bad. – Evil Abed

Budget Cuts and Consolidation: Taking it to the Danger Zone

For those of you that do not know, Alaska is kind of like the 3rd world of the United States in that we have a semi-exploitative love/hate economic relationship with a single industry . . . petroleum. Why does this matter? It matters because two years ago oil was $120 a barrel and now it is floating between $40 and $50. For those of us in public service, or in private-industry support services that contract with government and municipal agencies, it means that our budget just shrank by 60%. The Legislature is currently struggling with balancing a budget that runs an annual 3.5 to 4 billion dollar deficit, a pretty difficult task if your only revenue stream is oil.

Regardless of where you work and who you work for in Alaska, this means “the times, they are a-changin’”. As budgets shrink, so do resources: staff, time, support services, training opportunities, travel, equipment refreshes and so on. Belts tighten but we still have to eat. One way to make the food go further is to consolidate. In IT, especially in these days of Everything-as-a-Service, there is more and more momentum in the business toward centralized, standardized and consolidated service delivery (ITIL buzzword detected! +5 points).

In the last few years, I have been involved in a few of these types of projects. I am here to share a couple of observations.

 

 

Consolidation, Workload and Ops Capacity

 

Above you should find a fairly straightforward management-esque graph with made-up numbers and metrics. Workload is how much stuff you actually have to get done. This is deceptive because Workload can break down into many different types of work: projects, break/fix, work that requires immediate action, and work that can be scheduled. But for the sake of this general 40,000-foot view, it can just be deemed work that you and your team do.

Operational Capacity is simply you and your team’s ability to actually do that work. Again, this is deceptive because, depending on your team’s skills, personalities, culture, organizational support, and morale, their Operational Capacity can look different even if the total amount of work they do in a given time stays constant. But whatever, management-esque talk can be vague.

Consolidation projects can be all over the map as well: combining disparate systems that have the same business function, eliminating duplicate systems and/or services, centralizing services or even something as disruptive as combining business units and teams. Consolidation projects generally require standardization as a prerequisite; how else would you consolidate? The technical piece here is generally the smallest – People, Process, Technology, right?

And from that technical standpoint, especially one from a team somewhere along that Workload vs. Operational Capacity timeline, consolidation and standardization look very, very different.

Standardization has no appreciable long-term Workload increase or reduction. There is an increased capture of business value for existing work performed. If there is wider use of the same Process and Technology, the business value of a given unit of work goes further; for example, if it takes 10 hours to patch 200 workstations, it may only take 10.2 hours to patch 2,000 workstations.

Consolidation brings a long-term Workload increase with a corresponding increase in Operational Capacity due to the addition of new resources or re-allocation of existing resources (that’s the dotted orange line on the graph). For example, if there is widespread adoption of the same Process and Technology, you can take the 10 hours my team spends on patching workstations and combine it with the 10 hours another team spends on patching workstations. You just bought yourself some Operational Capacity in terms of having twice as many people deal with the patching, or maybe it turns out that it only takes 10 hours to patch both teams’ workstations and you freed up 10 hours’ worth of labor that can go to something else. There is still more work than before but that increased Workload is more than offset by increased Operational Capacity.

Both standardization and consolidation projects increase the short-term Workload while the project is on-going (see Spring of ’15 in the graph). They are often triggered by external events like mergers, management decisions, or simply proactive planning in a time of shrinking budgets. In this example, it is a reduction of staff. This obviously reduces the team’s Operational Capacity. The ability to remain proactive at both the strategic and tactical level is reduced. In fact, we are just barely able to get work done. BUT we have (or had) enough surplus capacity to continue to remain proactive even while taking on more projects, hopefully projects that will either reduce our Workload or increase our Operational Capacity or both because things are thin right now.

Boom! Things get worse. Workload increases a few months later. Maybe another position was cut, maybe an unanticipated project or requirement from on high came down to your team. Now you are in, wait for it… THE DANGER ZONE! You cannot get all the work done inside of the required time frame with what you have. This is a bad, bad, bad place to be for too long. You have to put projects on hold, put maintenance on hold or let the ticket queues grow. Your team works harder, longer and burns out. A steady hand, a calm demeanor and a bit of healthy irreverence are really important here. Your team needs to pick its projects very, very carefully since you are no longer in a position to complete them all. The ones you do complete had damn well better either lower Workload significantly, increase your Operational Capacity or, hopefully, do both. Mistakes here cost a lot more than they did a year ago.

The problem here is that technical staff do not generally prioritize their projects. Their business leaders do. And in times where budgets are evaporating, priorities seem to settle around a single thing: cost savings. This makes obvious sense, but the danger is that there is no reason the project with the most significant cost savings will also happen to be the project that helps your team decrease their Workload and/or increase their Operational Capacity. I am not saying it won’t happen, just that there is no guarantee that it will. So your team is falling apart, you just completed a project that saves the whole business rap-star dollars worth of money and you have not done anything to move your team out of THE DANGER ZONE.

In summation, projects that increase your Operational Capacity and/or reduce your Workload have significant long-term savings in terms of more efficient allocation of resources, but the projects that will get priority will be those that have immediate short-term savings in terms of dollars and cents.

Then a critical team member finds better work. Then it’s over. No more projects with cost savings, no more projects at all. All that maintenance that was put off, all the business leaders that tolerated the “temporary” increase in response time for ticket resolution, all the “I really should verify our backups via simulated recovery” kind of tasks – all those salmon come home to spawn. Your team is in full blown reactive mode. They spend all their time putting out fires. You are just surviving.

Moral of the story? If you go to THE DANGER ZONE, don’t stay too long and make sure you have a plan to get your team out.

 

Documentation, or how I wasted an hour

As if confirming my own tendency to “do as I say, not as I do”, I just wasted about an hour this morning trying to figure out why a newly created virtual machine was not correctly registering its hostname with Active Directory via Dynamic DNS. Of course, this was a series of errors greatly exacerbated by the fact that I had only had two of my required four cups of coffee and I stayed up too late watching the ironically named and absolutely hilarious Workaholics.

Let’s review, shall we?

  • Being tired and trying to do something mildly complicated
  • Allowing myself to become distracted by an interrupt task in the middle of this work
  • Not verifying the accuracy of our documentation prior to assigning the IP address in question to the virtual machine
  • Screwing up and assigning the IP address to the wrong virtual machine (both the hostnames and the subnet octets are very similar)
  • Not reading the instrumentation; the output of ipconfig /all plainly said “(Duplicate)”. Duh.

All of these factors made what should have been a 15-minute troubleshooting task stretch out into an hour.

Root cause: The IP address I picked for one of the virtual machines was already in use and the documentation was not updated to reflect this.

Potential solutions: I dunno… how about keeping our documentation updated (easier said than done)? Or better yet, stop using a “documentation system” for IP addresses that relies on discretionary operational practices (i.e., an Excel spreadsheet stored on SharePoint) and use something like IPAM. Maybe, instead of going down the ol’ “runlist” of potential problems, I should have stopped and gathered a bit more information before I proceeded with troubleshooting. The issue was right there in the ipconfig output. I was looking *right* at it. I guess that is the difference between looking and seeing.
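Until a proper IPAM lands, even a dumb pre-flight check would have caught this. Here is a minimal sketch of the idea: before trusting the spreadsheet, ping the candidate address and check for a PTR record. The address is a placeholder and Linux-style ping flags are shown (Windows ping uses -n and -w), so treat it as a sketch rather than something we actually run.

```python
#!/usr/bin/env python3
"""Sanity-check a "free" IP from the spreadsheet before assigning it to a VM."""
import socket
import subprocess

def ip_looks_free(ip: str) -> bool:
    # One ping with a short timeout; any reply means the address is in use.
    # (Linux flags shown; on Windows use ["ping", "-n", "1", "-w", "1000", ip].)
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return False
    # A PTR record is another hint that something already claims the address.
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        print(f"warning: {ip} did not answer ping but resolves to {host}")
        return False
    except socket.herror:
        return True

if __name__ == "__main__":
    candidate = "192.0.2.45"  # hypothetical address pulled from the spreadsheet
    verdict = "looks free" if ip_looks_free(candidate) else "is already spoken for"
    print(f"{candidate} {verdict}")
```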

In short . . . happy Monday you jerks.

 

facepalm

 

World Backup Recovery Testing Day?

Yesterday was apparently the widely celebrated World Backup Day. Just like in reality, the party ends sometime (unless you happen to be Andrew W.K.), and now you have woken up with a splitting headache, a vague sadness and an insatiable desire for eggs benedict. If installing and configuring a new backup system is an event that brings you joy and revelry like a good party, the monotony of testing the recovery of your backups is the hangover that stretches beyond a good greasy breakfast. I propose that today should thus be World Backup Recovery Testing Day.

There is much guidance out there for anyone who does cursory research on how to design a robust backup system, so I think I will save you from my “contributions” to that discussion. As much as I would like to relay my personal experience with backups, I do not think it would be wise to air my dirty laundry this publicly. In my general experience, backup systems seem to get done wrong all the time. Why?

 

Backups? We don’t need those. We have snapshots.

AHAHAHAHAHAHA. Oh. Have fun with that.

I am not sure what it is about backup systems but they never seem to make the radar of leadership. Maybe because they are secondary systems so they do not seem as necessary in the day-to-day operations of the business as production systems. Maybe because they are actually more complicated than they may seem. Maybe because the risk to cost ratio does not seem like a good buy from a business perspective, especially if the person making the business decision does not fully understand the risk.

This really just boils down to the same thing: technical staff not communicating the true nature of the problem domain to leadership and/or leadership not adequately listening to the technical staff. Notice the and/or. Communication: it goes both ways. If you are constantly bemoaning the fact that management never listens to you, perhaps you should change the way you are communicating with your management? I am not a manager so I have no idea what the corollary to this is (ed. managers, feel free to comment!).

Think about it. If you are not technical, the difference between snapshots and a true backup seems superfluous. Why would you pay more money for a duplicate system? If you do not have an accurate grasp of the risk and the potential consequences, why would you authorize additional expenditures?

 

I am in IT. I work with computers not people.

You do not work with people, you say? Sure you do. Who uses computers? People. Generally people that have some silly business mission related to making money. You had best talk to them and figure out what is important to them and not you. The two are not always the same. I see this time and time again. Technical staff implement a great backup system but fail to back up the stuff that is critical to the actual business.

Again. Communication. To a technical person, one database looks more or less identical to another one. I need to talk to the people that actually use that application and get some context; otherwise how would I know which one needs a 15-minute Recovery Time Objective and which one is a legacy application that would be fine with a 72-hour Recovery Time Objective? If it were up to me, I would back up everything, with infinite granularity and infinite retention, but despite the delusion that many sysadmins labour under, they are not gods and do not have those powers. Your backup system will have limitations and the business context should inform your decision on how you accommodate those limitations. If you have enough storage to retain all your backups for six weeks, or half your backups for four weeks and half for four months, and you just make a choice, maybe you will get lucky and get it right. However, the real world is much more complicated than this scenario and it is highly likely you will get it wrong and retain the wrong data for too long at the expense of the right data. These kinds of things can be Resume Generating Events.

My favorite version of this is the dreaded Undocumented Legacy Application that is living on some aging workstation tucked away in a forgotten corner. Maybe it is running the company’s timesheet system (people get pissed if they cannot get paid), maybe it is running the HVAC control software (people get pissed if the building is a nice and frosty 48 degrees Fahrenheit), maybe it is something like SCADA control software (engineers get pissed when water/oil/gas does not flow down the right pipes at the right time; also, people may get hurt). How is technical staff going to have backup and recovery plans for things like this if they do not even know they exist in the first place?

It is hard to know if you have done it wrong

In some ways, the difficulty of getting backup systems right is that you only know if you have got it right once the shit hits the fan. Think about the failure mechanism for production systems: You screwed up your storage design – stuff runs slow. You screwed up your firewall ACLs – network traffic is blocked. You screwed up your webserver – the website does not work any more. If there is a technical failure you generally know about it rather quickly. Yes, there are whole sets of integration errors that lie in wait in infrastructure and only rear their ugly head when you hit a corner case but whatever, you cannot test everything. #YOLO #DEVOPS

There is no immediate failure mechanism constantly pushing your backup system towards a better and more robust design, since you only really test it when you need it. Without this Darwinian IT version of natural selection you generally end up with a substandard design and/or implementation. Furthermore, for some reason backups up here are associated with tapes, and junior positions are associated with tape rotation. This cultural prejudice has slowly morphed into junior positions being placed in charge of the backup system; arguably not the right skillset to be wholly responsible for such a critically important piece of infrastructure.

Sooooo . . . we do a lot of things wrong and it seems the best we can do is a simulated recovery test. That’s why I nominate April 1st as World Backup Recovery Testing Day!
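If a full-blown recovery exercise is off the table, even a scripted spot check beats nothing. Below is a minimal sketch, assuming you have already restored a small sample of files from last night’s backup into a scratch directory; the paths and file names are placeholders, and files that legitimately changed since the backup ran will show up as mismatches.

```python
#!/usr/bin/env python3
"""Spot-check a restored sample against the live copies by comparing hashes."""
import hashlib
from pathlib import Path

SOURCE_DIR = Path("/srv/data")           # hypothetical production data
RESTORE_DIR = Path("/tmp/restore-test")  # hypothetical restore target
SAMPLE = ["finance/ledger.db", "hr/timesheets.csv"]  # hypothetical sample files

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

failures = 0
for rel in SAMPLE:
    src, restored = SOURCE_DIR / rel, RESTORE_DIR / rel
    if not restored.exists():
        print(f"FAIL  {rel}: missing from the restore")
        failures += 1
    elif sha256(src) != sha256(restored):
        print(f"FAIL  {rel}: contents differ (or the file changed since the backup ran)")
        failures += 1
    else:
        print(f"OK    {rel}")

raise SystemExit(1 if failures else 0)
```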

 

Until next time,

Stay Frosty

The Art and Burden of Documentation

I have been thinking a lot about documentation lately, mostly about my own shortcomings and trying to understand why the act of documenting seems so difficult and why the quality of the documentation that does get done is often found lacking. Good documentation and good documentation practices are such a fundamental part of the health of an IT shop you would think we as a field would be better at it. My experience is limited and anecdotal (whose is not?) but I have yet to see a shop with solid documentation and solid documentation practices. This extends to myself as well. I can look back at my various positions and roles and there are very few where I actually felt satisfied with the quality of my documentation.

Read on for “aksysadmin’s made-up principles of how to not suck at documentation and do other things good too”.

 

1. Develop a standardized format, platform and process from the bottom up. Your team uses this, not you.

Leadership has a tendency to standardize on a single format, platform or process. This is generally considered a good thing. The problem is, leadership does not write technical documentation. We do. And what standard makes sense to them may not make any sense to the technical staff (*cough* ITIL *cough*). What platform seems adequate to them may seem unwieldy to sysadmins (*cough* SharePoint *cough*). Standardization may generally be considered a good thing, but forcing a standard, platform or process on a team without their input or an understanding of their problem domain and use case is generally considered a bad thing. The harder you make it for your team to document, the less likely they will be to perform a task they are already unlikely to perform.

2. Don’t document how, document why

This is partly an internal challenge (IT staff documenting the wrong things) and partly an external challenge (leadership requiring the wrong kinds of things to be documented). I see lots of documentation that is essentially a re-hashed version of a vendor’s manual. Ninety-nine percent of the time your vendor has exhaustive resources on how to do something. It is right there. In the manual. Go read it. Unless it is incredibly unintuitive, and sometimes it is, why would you waste your precious time re-writing an authoritative set of information into a non-authoritative set that requires your team to maintain it? Reading and understanding vendor documentation should be considered a fundamental skill; if your guys cannot or will not read vendor manuals, you have other problems that need addressing.

What you should document is why you did things. In six months you will not remember why this particular group was set up or why things are this way instead of that way, and your successor certainly will have no idea. Use your documentation to provide context and meaning.

3. Document where to find things

Documenting why something is the way it is is great, but it is also important to document where things are. I am talking about things like IP addresses, organizational charts, passwords and so on. This is another opportunity to avoid work, err, work more efficiently. Chances are many of these things have authoritative sources maintained by other people or tools. Why write your IP addresses down manually in an Excel spreadsheet when you can use a tool like IPAM to track them? Why track the phone tree for your different workgroups when Active Directory can do that for you? Why spend time doing stuff that is already done? Why indeed?
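To make the Active Directory example concrete: the phone tree is already sitting in the directory, it just needs to be queried. A minimal sketch using the ldap3 Python library is below; the server name, OU, department and service account are placeholders for whatever your environment actually uses.

```python
#!/usr/bin/env python3
"""Pull a workgroup "phone tree" straight from Active Directory instead of
maintaining it by hand in a document."""
from ldap3 import ALL, Connection, Server

# All of these values are placeholders for illustration.
server = Server("dc01.example.org", get_info=ALL)
conn = Connection(
    server,
    user="EXAMPLE\\svc-readonly",
    password="not-a-real-password",
    auto_bind=True,
)

# Find everyone in a (hypothetical) Operations department and print their numbers.
conn.search(
    search_base="ou=Staff,dc=example,dc=org",
    search_filter="(&(objectClass=user)(department=Operations))",
    attributes=["displayName", "telephoneNumber", "title"],
)

for entry in conn.entries:
    print(f"{entry.displayName}  {entry.title}  {entry.telephoneNumber}")
```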

Figuring out what stuff to document in the where category can be hard to do. I have found the easiest way to do this is to pretend you are brand new. Better yet, if you have a brand new team member, ask him to track these kinds of information requests as he acquaints himself with your particular little piece of hell. What does he need to know right now to do your job? That is what your replacement will be asking himself after you have ascended.

4. Don’t document break/fix issues

Do not fill up your Wiki, SharePoint, OneNote or file share full of Word .docx with break/fix issues. Your infrastructure and process documentation should be broad and “provide context and meaning”, which is pretty much the opposite of the kind of information break/fix issues are about – specific configurations, systems or problems.

You already have a place to “document” break/fix issues – it is called your damn ticket system. Use it. Document your fixes in your tickets. If your Tier-1 guys have a habit of closing all but the simplest of tickets with “done” or “fixed”, slap them (and probably yourself as well) and say that their future self just came back in time to hit them for making their job harder. If you do not have a ticket system then you have other problems that need to be addressed.

5. Have a panic page

Take the really important stuff from the why documentation and the where documentation and make it into a panic page. A panic page is a short piece of documentation that contains all the information you or anyone else would need in order to deal with a “whoops” situation. Think things like vendor contact phone numbers, contract support entitlements, how to file and escalate a case and maybe where to find your co-worker’s emergency scotch. I borrowed this one from my supervisor and it turned out to be a prescient suggestion on his part.

6. Have a hard copy

This is an extension of the panic page principle. Panic situations have a way of making electronic documentation inaccessible. “Oh, but wait, my documentation is in the cloud, I can get to it anywhere with my mobile device, oh I am so smart,” you say. Yeah, well, you will be screwed when you drop your iPad, or it runs out of batteries, or you happen to live in Alaska, which has infrastructure comparable to, say, Afghanistan. Have a paper copy on hand, preferably two. Yes, it will be harder to maintain, but it will be a lot better than having no documentation at all if your file server explodes or the polar bears take over your data center.

7. Have designated documentation days

As a sysadmin, you generally do not have the luxury of setting your own priorities. If your leadership wants documentation, instead of just saying “Hey, we need to document better” at your weekly staff meeting, they need to make it a priority. Nothing does this better than designating a day for documentation. Read-Only Friday is a good one because you are not making changes on Friday anyway, right… righhhhtttt? Of course, you are still going to get interruptions and tickets, so designate one person as the interruption blocker and another team member as the documenter (borrowed from Tom Limoncelli’s excellent Time Management for System Administrators). Rotate individuals as appropriate. These designated documentation days are your time, to make time, to actually Get Shit Done. All those little notes you meant to flesh out with some more context but never had time to… do it now. Organize your stuff. Clean it up. Review it for accuracy. Do this with a frequency related to how fast things change. Until leadership makes documentation a priority, something else will always trump it.

 

These ideas address some of those external and internal challenges that you may have and that I know I have. I am more inclined to document stuff if I am not documenting dumb stuff that is already documented elsewhere. I will have an easier time finding the documentation we do have if it is organized in a way that works for me and my team, who are both the consumers and the producers of it. If I have dedicated time to actually perform the act of documenting, it will probably get done. If not, then I am answerable to someone. Of course, there can be a large gap between knowing what needs improvement and actually fixing it. Until then.

Stay frosty.