FFFUUUU Internet Explorer… a rant about an outage

I am not normally a proponent of hating on Microsoft, mostly because I think much of the hate they get for design decisions is simply because people do not take the time to understand how Microsoft’s new widget of the month works and why it works that way. I also think it is largely pointless. All Hardware Sucks, All Software Sucks once you really start to dig around under the hood. That and Microsoft doesn’t really give a shit about what you want and why you want it. If you are an enterprise customer they have you by the balls and you and Microsoft both know it. You are just going to have to deal with tiles, the Windows Store and all the other consumer-centric bullshit that is coming your way regardless of how “enterprise friendly” your sales rep says Microsoft is.

That being said, I cannot always take my own medicine of enlightened apathy and Stockholm Syndrome, and this is one of those times. A Windows update deployed this week broke Internet Explorer 11 on about 60-75% of our fleet. Unfortunately we have a few line-of-business web applications that rely on it. You can imagine how that went.

Now there are a lot of reasons why this happened, but midway through my support call, as we were piecing together an uninstallation script to remove all the prerequisites of Internet Explorer 11, I had what I call a “boss epiphany”. A “boss epiphany” is when you step out of your technical day-to-day and start asking bigger questions, so named because my boss has a habit of doing this. I generally find it kind of annoying in a good-natured way because I feel like there is a disregard for the technical complexities that I have to deal with in order to make things work, but I can’t begrudge him cutting to the heart of the matter. And six hours into our outage, what was the epiphany? “Why is this so fucking hard? We are using Microsoft’s main line-of-business browser (Internet Explorer) and their main line-of-business tool for managing workstations in an enterprise environment (SCCM).”

The answer is complicated from (my) technical perspective but the “boss epiphany” is a really good point. This shit should be easy. It’s not. Or I suck at it. Or maybe both. AND that brings me to my rant. Why in the name of Odin’s beard is software deployment and management in Windows so stupid? All SCCM is really doing is running an installer. For all its “Enterprisy-ness” it just runs whatever stupid installer you get from Adobe, Microsoft or Oracle. There’s no standardization, no packaging and no guarantee anything will actually be atomic. Even MSI installers can do insane things – like accept arguments in long form (TRANSFORMS=stupidapp.mst) but not short form (/t stupidapp.mst) or, my particular favorite, search the registry for the ProductKey of any older version of the application and then try to uninstall it via the original .MSI. This fails horribly when that .MSI lives in a non-persistent client side cache (C:\Windows\ccmcache). Linux was created by a bunch of dope-smoking neckbeards and European commies and they have had solid standardized package management for like ten years. I remember taking a Debian Stable install up to Testing, then downgrading to Stable and then finally just upgrading the whole thing to Unstable. AND EVERYTHING WORKED (MOSTLY). Let’s see you try that kind of kernel and user land gymnastics with Windows. Maybe I just have not spent enough time supporting Linux to hate it yet but I cannot help but admire the beauty of apt-get update && apt-get upgrade when most of my software deployments mean gluing various .EXEs and registry keys together with batch files or PowerShell. It’s 2016 and this is how we are managing software deployments? I feel like I’m taking crazy pills here.
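For the uninitiated, that gluing looks something like the sketch below. This is a hypothetical wrapper of the sort SCCM ends up executing: the installer path and the /S silent switch are assumptions (every vendor invents its own), and the exit code handling follows the common MSI convention.

# Hypothetical glue script of the kind SCCM actually runs.
# The installer name and the /S silent switch are assumptions.
$Installer = "$PSScriptRoot\StupidApp_Setup.exe"
$LogFile   = "$env:TEMP\StupidApp_Install.log"

$Process = Start-Process -FilePath $Installer -ArgumentList '/S' -Wait -PassThru

# 0 = success, 3010 = success but reboot required (common MSI convention).
if ($Process.ExitCode -in 0, 3010) {
    Add-Content -Path $LogFile -Value "Install succeeded with exit code $($Process.ExitCode)"
} else {
    Add-Content -Path $LogFile -Value "Install failed with exit code $($Process.ExitCode)"
}
exit $Process.ExitCode

Multiply that by every vendor’s installer quirks and you have a “deployment pipeline” held together with duct tape.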


Let’s look at the IEAK as a specific example since I suspect it is half the reason I got us into this mess. The quotes from this r/sccm thread are perfect here:

  • “IEAK can’t handle pre reqs cleanly. Also ‘installs’ IE11 and marks it as successful if it fails due to prereqs”
  • “Dittoing this. IEAK was a nightmare.”
  • “IEAK worked fine for us apart from one issue. When installing it would fail to get a return from the WMI installed check of KB2729094 quick enough so it assumed it wasn’t installed and would not complete the IE11 install.”
  • “It turns out that even though the IEAK gave me a setup file it was still reaching out to the Internet to download the main payload for IE”
  • “I will never use IEAK again for an IE11 deployment, mainly for the reason you stated but also the CEIP issue.”

And that’s the supported, “Enterprise” deployment method. If you start digging around on the Internet, you see there are people out there deploying Internet Explorer 11 with Task Sequences, custom batch files, custom PowerShell scripts and the PowerShell Deployment Toolkit. Again, the technical part of me understands that Internet Explorer is a complicated piece of software and that there are reasons it is deployed this way, but ultimately if it is easier for me to deploy Firefox with SCCM than Internet Explorer… well, that just doesn’t seem right, now does it?
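If you are curious what those custom scripts tend to look like, here is a minimal sketch that checks for the Windows 7 SP1 prerequisite hotfixes before running the offline IE11 redistributable. The KB list matches the documented IE11 prerequisites as I understand them, but verify it against your environment; the installer’s location next to the script is an assumption.

# Pre-flight check before an IE11 install. Fail loudly if prerequisites
# are missing instead of letting setup "succeed" the way the IEAK does.
$Prereqs = 'KB2729094', 'KB2731771', 'KB2533623', 'KB2670838', 'KB2786081'
$Missing = $Prereqs | Where-Object { -not (Get-HotFix -Id $_ -ErrorAction SilentlyContinue) }

if ($Missing) {
    Write-Output "Missing prerequisites: $($Missing -join ', ')"
    exit 1
}

# /quiet /norestart are the documented switches for the IE11 redistributable.
$Setup = Start-Process -FilePath "$PSScriptRoot\IE11-Windows6.1-x64-en-us.exe" `
    -ArgumentList '/quiet /norestart' -Wait -PassThru
exit $Setup.ExitCode

A dozen lines of PowerShell doing what the official “Enterprise” tooling cannot reliably do. That is the rant in miniature.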

Until next time… throw your computer away and go outside. Computers are dumb.

Can’tBan… Adventures with Kanban

Comic about Agile Programming

. . .

We started using Kanban in our shop about six months ago. This in and of itself is interesting considering we are nominally an ITIL shop and the underlying philosophies of ITIL and Kanban seem diametrically opposed. Kanban, at least from my cursory experience, is focused on speeding up the flow of work, identifying bottlenecks and meeting customers’ requirements more responsively. It is all about reducing “cycle time”, that is, the time it takes to move a unit of work through to completion. ITIL is all about slowing the flow of work down and adding rigor and business oversight into IT processes. A side effect of this is that the cycle time increases.

If you are not familiar with Kanban the idea is simple. Projects get decomposed into discrete tasks, tasks get pulled through the system from initiation to finish and each stage of the project is represented by a queue. Queues have work in progress (WIP) limits, which means only so many tasks can be in a single queue at the same time. The backlog is where everything you want to get done sits before you actually start working on it. DO YOU WANT TO KNOW MORE?
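To make the mechanics concrete, here is a toy sketch of the one rule that makes Kanban more than a to-do list: a move into a full queue gets refused. The queue names and the limit are invented for illustration.

# Toy Kanban board: two queues and a WIP limit on Execution.
$Board = @{
    Backlog   = [System.Collections.Generic.List[string]]::new()
    Execution = [System.Collections.Generic.List[string]]::new()
}
$WipLimit = @{ Execution = 4 }  # e.g., 2 tasks x 2 team members

function Move-Task {
    param([string]$Task, [string]$From, [string]$To)
    # Refuse the move rather than silently piling work up.
    if ($WipLimit[$To] -and $Board[$To].Count -ge $WipLimit[$To]) {
        Write-Output "'$To' is at its WIP limit; '$Task' stays in '$From'."
        return
    }
    [void]$Board[$From].Remove($Task)
    $Board[$To].Add($Task)
}

$Board['Backlog'].Add('Patch IE11')
Move-Task -Task 'Patch IE11' -From 'Backlog' -To 'Execution'

Everything else (cards, standups, swim lanes) is elaboration on that one constraint.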

As I am sure the one reader of my blog knows, I simultaneously struggle with time management and am fascinated by it. What do I think about Kanban? I have mixed feelings.

The Good

  • Kanban is very visual. I like visual things – walk me through your application’s architecture over the phone and I have no idea what you have just told me five minutes later. Show me a diagram and I will get it. This appeal, of course, is personal and will vary widely depending on the individual.
  • Work in progress (WIP) limits! These are a fantastic concept. The idea that your team can only process so much work in a given unit of time and that constantly context switching between tasks has an associated cost is obvious to those in the trenches but not so much to those higher powers that exist beyond the Reality Impermeability Layer of upper management. If you literally show them there is not enough room in the execution queue for another task they will start to get it. All of a sudden you and your leadership can start asking the real questions… why is task A being worked on before task Z? Do you need more resources to complete all the given tasks? Maybe task C can wait awhile? Why is task G moving so slowly? Why are we bottlenecked at this phase?
  • Priorities are made explicit. If I ever have doubt about what I am expected to be working on I can just check the execution queue. If my manager wants me to work on another task that is outside the execution queue, then we can have a discussion about whether or not to bump something back or hold the current “oh hey, can you take care of this right now?” task in the backlog. I cannot overstate how awesome this is. It makes the cost of context switching visible, keeps my tactical work aligned with my manager’s strategic goals, and makes us think about what tasks matter most and in what order they should get done. This is so much better than the weekly meeting, where more and more tasks get dumped into some nebulous to-do list that my team struggles through while leadership wonders why the “Pet Project of the Month” isn’t finished yet.

The Interesting

  • The scope of work that you set as a singular “task” is really important. If a single task is too large then it doesn’t accurately map to the work being done on a day-to-day basis and you lose out on Kanban’s ability to bring bottlenecks and patterns to the surface where they can be dealt with. If the tasks are too small then you end up spending too much time in the “meta-analysis” of figuring out what task is where instead of actually accomplishing things.
  • The type of work you decide to count as a Kanban task also has a huge effect on how your Kanban actually “runs”. Do you track break/fix, maintenance tasks, meetings, projects, all of the above? I think this really depends on how your team works and what they work on, so there is no hard and fast answer here.
  • Some team members are more equal than others. We set our WIP limit to Number of Team Members * 2… the idea being that two to three tasks is about all a single person can really focus on and still be effective (i.e., “The Rule Of Threes”). It turns out in practice, though, that 60% of tasks are owned by only 20% of the team. Huh. I guess that would be called a bottleneck?

The Bad

  • Your queues need to actually be meaningful. Just having separate queues named “Initiation”, “Documentation”, “Sign-off” only works if you have discrete actions that are expected for the tasks in those queues. In our shop what I have found is only one queue matters: the execution queue. We have other queues but since they do not have requirements and WIP limits attached to them they are essentially just to-do lists. If a task goes into the Documentation queue, then you better damn well document your system before you move the task along. What we have is essentially a one queue Kanban system with a single WIP limit. If we restructured our Kanban process and truly pulled a task through each queue from beginning to finish I think we would see much more utility.
  • Flow vs. non-flow. An interesting side effect of not having strong queue requirements is that tasks don’t really “flow”. For example: we are singularly focused on the execution queue, and so every time I finish a task it gets moved onto the documentation queue where it piles up with all the other stuff I never documented. Now instead of backing off and making time for our team to document before pulling more work into the system, I re-focus on whatever task just got promoted into the execution queue. Maybe this is why our documentation sucks so much? What this should tell us is 1) we have too much in the documentation queue to be taking on new work, 2) the documentation queue needs a smaller WIP limit, 3) we need to make the hard decision to put off work until documentation is done if we actually want documentation and 4) documentation is work and work takes time. If we never give staff the time to document then we will end up with no documentation. I don’t necessarily think everything needs to be pulled through each queue. Break/fix work is often simple, ephemeral and, if your ticket system doesn’t suck, self-documenting. You could handle these types of tasks with a standalone queue.
  • Queues should have time-limits. You only have one of two states regarding a given unit of work: you are either actively working on it or you are not. Kanban should have the same relationship with tasks in a given queue. If a task has sat in the planning queue for a week without any actual planning occurring then it should be removed. Either the next queue is full (bottleneck), the planning queue is full (bottleneck/WIP limit too high) or your team is not working on your Kanban tasks (other, larger systemic problems). Aggressively “reset” tasks by sending them to the backlog if no work is being performed on them, and enforce your queue requirements; otherwise all you have done is create six different “to-do-whenever-we-have-spare-time-which-is-never” lists that just collect tasks.
  • Our implementation of Kanban does not work as a time management tool because we only track “project” work. Accordingly, very little of my time is actually spent on the Kanban tasks since I am also doing break/fix, escalations, monitoring and preventive maintenance. This really detracts from the overall benefit of managing priorities, making them explicit and limiting context switching, since our Kanban board represents at best 25% of my team’s work.

In conclusion there are some things I really like about Kanban and with some tweaks I think our implementation could have a lot of utility. I am not convinced it will mix well with our weird combination of ITIL processes but no real help desk (see: Who Needs Tickets Anyway? and Those are Rookie Numbers according to r/sysadmin). We are getting value out of Kanban, but it needs some real changes or it will become just one more process of vague effectiveness.

It will be interesting to see where we are in another six months.

Until next time, keep your stick on the ice.

The Big Squeeze, Predictions in Pessimism

Layoff notice or stay the hell away from Alaska when oil is cheap… from r/sysadmin


I thought this would be a technical blog acting as a surrogate for my participation on ServerFault but instead it has morphed into some kind of weird meta-sysadmin blog/soap box/long-form reply to r/sysadmin. I guess I am OK with that…

Alaska is a boom and bust economy, and despite having a lot going for us fiscally, between our tax structure, oil prices and the Legislature’s approach to the ongoing budget deficit, we are doing our best to auger our economy into the ground. Time for a bit of gallows humor to commiserate with u/Clovis69! The best part of predictions is you get to see how hilariously uninformed you were down the road! Plus, if you are going to draw straws you might as well take bets on who gets the shortest one.

Be forewarned, I am not an economist, I am not even really that informed and if you are my employer, future or otherwise, I am largely being facetious.

The Micro View (what will happen to me and my shop)

  • We will take another 15-20% personnel cuts in IT operations (desktop, server and infrastructure support). That will bring us to close to a 45% reduction in staff since 2015.
  • We will take on additional IT workload as our programming teams continue to lose personnel and consequently shed operational tasks they were doing independently.
  • We will be required to adopt a low-touch, automation-centric support model in order to cope with the workload. We will not have the resources to do the kind of interrupt-driven, in-person support we do now. This is a huge change from our current culture.
  • We will lean really hard on folks that know SCCM, PowerShell, Group Policy and other automation frameworks. Tier-2/Tier-3 will come under more pressure as the interrupt rate increases due to the reduction in Tier-1 staff.
  • Team members that do not adopt automation frameworks will find themselves doing whatever non-automatable grunt work there is left. They will also be more likely to lose their jobs.
  • We will lose a critical team member that is performing this increased automation work as they can simply get paid better elsewhere without having a budget deficit hanging over their head.
  • If we do not complete our consolidation work to standardize and bring siloed teams together before we lose what little operational capacity we have left, our shop will slip into full-blown reactive mode. Preventive maintenance will not get done and in two years’ time things will be Bad (TM). I mean like straight-up r/sysadmin horror story Bad (TM).
  • I would be surprised if I am still in the same role in the same team.
  • We will somehow have even more meetings.

The Macro View (what will happen to my organization)

Preliminary plans to consolidate IT operations were introduced back in early 2015. In short, our administrative functions, including IT operations, are largely decentralized and done at the department level. This leads to a lot of redundant work being performed, poor alignment of IT to the business goals of the organization as a whole, the inability to capture or recover value from economies of scale, and widely disparate resources, functionality and service delivery. At a practical level, what this means is there are a whole lot of folks like myself all working to assimilate new workload, standardize it and then automate it as we cope with staff reduction. We are all hurriedly building levers to help us move more and more weight, but no one has stopped to say, “Hey guys, if we all work together to build one lever we can move things that are an order of magnitude heavier.” Consequently, as valiant as our individual efforts are, we are going to fail. If I lose four people out of a team of eight, no level of automation that I can come up with will keep our heads above water.

At this point I am not optimistic about our chances for success. The tempo of a project is often determined by its initial pace. I have never seen an IT project move faster as time goes on in the public sector; generally it moves slower and slower as it grinds through the layers of bureaucracy and swims upstream against the prevailing current of institutional inertia and resistance. It has been over a year without any progress that is visible to rank-and-file staff such as myself, and we only have about one, maybe two, years of money left in the piggy bank before we find that the income side of our balance sheet covers only 35% of our expenses. To make things even more problematic, entities that do not want to give up control have had close to two years to actively position themselves to protect their internal IT.

I want IT consolidation to succeed. It seems like the only possible way to continue to provide a similar level of service in the face of a 30-60% staff reduction. I mean, what the hell else are we going to do? Are we going to keep doing things the same way until we run out of money, turn the lights off and go home? If it takes one person on my team to run SCCM for my 800 endpoints, and three people from your team to run SCCM for your 3000 endpoints, how much do you want to bet the four of them could run SCCM for all 12,000 of our endpoints? I am pretty damn confident they could. And this scenario repeats everywhere. We are all bailing out our boats, and in each boat is one piece of a high-volume bilge pump, but we don’t trust each other, no one is talking and we are all moving in a million different directions instead of stopping, collectively getting over whatever stupid pettiness keeps us from actually doing something smart for once, and putting together our badass high-volume bilge pump. We will either float together or drown separately.

I seem to recall a similar problem from our nation’s history…

Benjamin Franklin's Join or Die Political Cartoon

Things in our Datacenter that Annoy Me

Or alternatively how I learned to stop worrying about the little things…

In this post, I complain about little details that show my true colors as some kind of pedantic, semi-obsessive, detail-oriented system administrator. I mean, I try to play it cool but inside I am really freaking out, man! Not really but also kind of yes. More on that later.


Our racks are not deep or wide enough

Our racks were not sized correctly initially. They are quite “shallow”. A Dell R730 on ReadyRails is about 28″ deep, which is a pretty standard mounting depth for full-size rackmount equipment. In our racks, that leaves only about 4-6″ of space between the rear posts and the back door. This complicates cabling since we do not have a lot of room to work with, but it really gets annoying with PDUs. See below.

The combination of shallow depth and lack of width leads to weird PDU configurations

PDU Setup

The racks are too shallow to mount the PDUs parallel with the posts with the plugs facing out towards the door and too narrow to stack both PDUs on one side. The PDUs end up being mounted sideways where they stick out into the area between the posts, blocking airflow and making cabling a pain in the ass.

Check out u/tasysadmin’s approach, which is much improved over ours. The extra depth and width allow both power circuits (you do have two redundant power circuits, right?) to move over to one side of the rack and slide into the gap between the posts and casing. This has a whole bunch of benefits: airflow is not restricted, you have more working space for cabling, your power does not have to cross the back of the rack and you can separate your data and your power.

Beautiful Rack Cabling

This also means that some of our racks have the posts moved in beyond the standard rack mounting depth of 28″ in order to better accommodate our PDUs, the result of which is that I only have two out of five racks that can accommodate a Dell PowerEdge.

Data and power not separated

You ideally want to run power on one side of the rack and data on the other. Most people will cite electromagnetic interference as a good reason for doing this but I have yet to see a problem caused by it (knock on wood). That being said, it is still a good idea to put some distance between the two, much like your recently divorced aunt and uncle at family functions. There are plenty of other good reasons for keeping data and power separate, most of which center around cabling hygiene – it helps keep things much cleaner because your data cables tend to run up and down the rack posts headed for things like your top-of-rack switch, whereas your power needs to go other places (i.e., equipment). It is a lot easier to bundle cables if they more or less share the same cable path.

Cannot access cable tray because of PDU cables

Cable Tray

This is just another version of “data and power are not separated”. Our power and data both come in at the top of the rack. This means the 4/C 10 AWG feeds for each PDU, which are about .5″ in diameter, are draped across our cabling tray, which just sits on top of the racks instead of being suspended by a ladder bar (another great injustice!). I bet these guys generate quite the electromagnetic field. It would be nice if they were more than 4″ away from some of our 10 Gbps interconnects, huh? This arrangement also means the cable tray is a huge pain to use. You have to move all the PDU power cables off of it, then pop the lid off in segments to move your cables around. Or you can just run them all over the top of the rack and hope that the fiber survives like we do. Again. Not ideal.

Inconsistent fastener use for mounting equipment

This one sounds kind of innocuous but it is one of those small little details that makes your life so much easier. Pick a fastener type and size and stay with it. I am partial to M6 because the larger threads are harder to strip out and the head has more surface area for your driver’s bit to contact. It is pretty annoying to change tools each time the fastener type is different instead of just setting the torque level on your driver and going for it. Also – don’t even think of using self-tapping fasteners. They make cage nuts and square holes in rack posts for a reason.

Improper rail mounting and/or retention

Your equipment comes with mounting instructions and you should probably follow them. Engineers calculate how much weight a particular rail can bear and then figure out that you need four fasteners of grade X on each rail to adequately support the equipment. This is all condensed into some terrible IKEA-level instructions which makes you shake your head as you wonder why your vendor could not afford a better technical writer for the obscene price of whatever equipment you are racking. Once you decipher these arcane incantations you should probably follow them. Don’t skip installing cage nuts and fasteners – if they say you need four, then you need four. It only takes two more minutes to do the job right.

AND FOR THE LOVE OF $DEITY INSTALL WHATEVER HARDWARE IS REQUIRED TO RETAIN THE EQUIPMENT IN THE RAILS! Seriously. This is a safety issue. I am not sure why this step is skipped and people just set things on the rails without using the screws to retain them to the posts, but racks move, earthquakes happen and this shit is heavy. I think most of our disk shelves are about 50 pounds. You do not want that falling out of the rack and onto your intern’s head.

Use ReadyRails (or vendor equivalent)

For about $80 you can have a universal, tool-less rail that installs in about 30 seconds. I would call that a good investment.

Inconsistent inventory tagging locations

I am guessing your shop maintains an inventory system and you probably have little inventory tags you affix to equipment. Do your best to make the place where the inventory tag goes consistent and readable once everything is racked and stacked. The last thing you want to do is pull an entire rack apart because some auditor wants you to find the magical inventory tag stuck on some disk shelf in the middle of a 12-shelf aggregate.

It would also be a good idea to put your inventory tag in your documentation so you do not have to play a yearly game of “find the missing inventory tag”.

Cable labeling is not consistent (just use serialized cables)

I suck at cable labeling and documentation in general (see here) so this is a bit hypocritical. Nevertheless, I find that there are four stages of cable labeling: nothing; consistent labeling of port and device on each end; confusion as labeled cables are reused but the labels are not changed; and finally, adoption of serialized cables where each end has a unique tag that is documented.

This is largely personal preference but the general rules are simple: keep it clean, keep it consistent and keep it current (your documentation, that is). The only thing worse than an unlabeled cable is a mislabeled cable.
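If you want to skip straight to stage four, a throwaway sketch like the one below will generate serialized labels plus the documentation stub to go with them. The ID format and CSV location are invented for illustration.

# Hypothetical serialized-label generator: each cable gets a unique ID
# and a row in the documentation, so reuse never invalidates the label.
$DocPath = "$env:USERPROFILE\cable-inventory.csv"
1..10 | ForEach-Object {
    [pscustomobject]@{
        Serial = 'CBL-{0:D5}' -f $_
        AEnd   = ''   # filled in at installation, e.g. 'rack3-sw1 port 12'
        BEnd   = ''
        Type   = ''   # e.g. 'Cat6' or 'OM4 LC-LC'
    }
} | Export-Csv -Path $DocPath -NoTypeInformation

Print the serials onto wrap-around labels, put one on each end and update the CSV whenever the cable moves.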

Gaps in rack mount devices

Shelf Gap

Why? Just why? I don’t know and will probably never know… but my best guess is the rail on the top shelf was slightly bent during installation and then, when we needed to add another shelf later, the rail interfered with it. 10 minutes originally could have saved 10 hours down the road. If it turns out I am one post hole short of being able to install another shelf, I get to move all the workloads off this aggregate, pull out all the disk shelves until I reach this one, fix or replace the rail, re-rack and re-cable everything, re-create the aggregate and then move the workloads back.


Now that I have complained a bit (I am sure r/sysadmin will say that I have it way too easy) I get to talk about the real lesson here: none of this shit matters.

On one level, it does. All these little oversights accumulate technical debt that eventually comes back and bites you on the ass, and doing it right the first time is the easiest and most efficient way. On the other hand, none of this stuff directly breaks things. The fact that the power and data cabling are too close together for my comfort or that there is a small gap in one of the disk shelf stacks does not cause outages. We have plenty of things that do, however, and those demand my attention. So collectively, let’s take a deep breath, let it go, and stop worrying about it. It’ll get fixed someday.

The other lesson here is that nothing is temporary. If you cut a corner, particularly with physical equipment, that corner will remain cut until the equipment is retired. It is just too hard and costly to correct these kinds of oversights once you are in production. If you are putting a new system up, take some time to plan it out – consider the failure domain, how much fault tolerance and redundancy you need, labeling, inventory and all those other little things. You only get to stand this system up once. Go slow and give it some forethought; you may thank yourself one day.

A Ticket Too Far… Part II

Part I:  A Ticket Too Far… Breaking the Broken

OK. So I screwed up. If you read the above post, I make a lot of claims about ticket systems and process. Many of those claims were based on the idea that we did not have an approved and enforced policy in place. Turns out I was wrong, sort of.

I did a little digging and reviewed the policy. Customers are asked to submit a ticket if their issue is not immediately preventing their work; otherwise they can call the help desk number, call an IT staff member directly or visit them in person. There is a lot I could say about this policy, and which provisions I agree with and which I do not, but policy is policy and I misrepresented the strength of it – it is very much vetted and approved.

I see the goals of a policy on customer facing support as follows:

  • Prevent sysadmins from forgetting customer requests
  • Allow sysadmins to control their interrupt-based workflow, prioritize and not have their workflow control them
  • Track customer requests and incidents so pain-points can be discovered and resolved proactively
  • Build a database of break/fix-based documentation
  • Create an acknowledgement and feedback mechanism for customer issues (i.e., “you have been assigned ticket #232”), backed up by mechanisms that force sysadmin action (i.e., “this ticket has not been touched in three days, either close it or reply”). This feedback loop ensures that issues are acknowledged and resolved one way or another.

The details may be wrong but the bigger point of my last post remains the same: the combination of our policy and our ticket system’s technical limitations does not accomplish those goals or lead to ideal outcomes for either customers or IT staff.

But is that really true? Perception and reality are not always the same, so I started tracking how often I was interrupted by a customer or a team member over a period of about four weeks. It is important to mention this was not a particularly rigorous study: I just kept an Excel spreadsheet, and anytime I diverted my attention from my current task for more than a few minutes I made a quick note of it. If anything, I was consistent in my inconsistency. I also kept track of what kind of interrupt it was, what group it came from and whether or not it had a ticket attached to it.
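If you want to run the same experiment without hand-tallying a spreadsheet, a quick sketch like this would do the math. The CSV columns mirror what I tracked, but the file name and field names are made up.

# Hypothetical interrupt log: one row per interruption, with columns
# Date, Kind (BreakFix/Communication), HadTicket (Y/N), Immediate (Y/N).
$Log = Import-Csv -Path "$env:USERPROFILE\interrupts.csv"

# Interrupts per day, then a crude median across days.
$PerDay = @($Log | Group-Object Date | ForEach-Object { $_.Count } | Sort-Object)
$Median = $PerDay[[math]::Floor($PerDay.Count / 2)]

$BreakFix   = @($Log | Where-Object Kind -eq 'BreakFix')
$WithTicket = @($BreakFix | Where-Object HadTicket -eq 'Y')

Write-Output ("Median interrupts per day: {0}" -f $Median)
Write-Output ("Break/fix share: {0:P0}" -f ($BreakFix.Count / $Log.Count))
Write-Output ("Break/fix interrupts with a ticket: {0} of {1}" -f $WithTicket.Count, $BreakFix.Count)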


a graph showing the number of interrupts per day

A couple of interesting discoveries here:

  • I am not interrupted nearly as much as I think I am. If you throw out the obvious outlier of Day 12, the median is two interrupts per day. Not as bad as I would have thought… but it is not that great either, considering that with meetings and other obligations I probably only have one uninterrupted period per day longer than two hours to focus on complex projects. Getting an interrupt during that window is a pretty serious setback.
  • 48% of the interrupts were related to break/fix issues. The other 52% were what I call “communication interrupts”. More on these later.
  • Of the 31 break/fix interrupts I recorded, only one actually had a ticket already associated with it. This is mind-bogglingly terrible as far as accepted time management best practices go.
  • Only 41% of the interrupts required immediate action; the rest could have been queued and prioritized later. This means 59% of these interruptions really did not need to be interruptions at all, they needed to be tickets or agenda items in a meeting.

Even with my back-of-the-napkin math these are pretty damning conclusions. Our ticket system capture rate, at least at Tier-3 where I spend most of my time, is laughably non-existent. Going back to my previous post about creating tickets for customers, there would be no reason for me to do so considering I would be creating 97% of the tickets myself. Interestingly enough, it is not like these requests just vanish into thin air. They get tracked and recorded somehow, albeit most likely by individual staff in a manner that is not portable or visible. The work required to track issues is still being done; the organization just is not getting a lot of value out of it since it is recorded in some crusty senior sysadmin’s logbook.

As an aside, it would be really interesting to perform the same experiment at the Tier-1 and Tier-2 support levels and see what the ratio is. Maybe capture is higher there, and it is the kind of issues and/or customers that Tier-3 deals with that leads to low ticket system capture. Or worse, maybe it is the same and ticket system capture is just really bad everywhere.

Almost two-thirds of these interrupts did not have to be interrupts. They did not require immediate action. This is also pretty terrible because interrupts cost a lot. The accepted wisdom is that it takes between 10 and 20 minutes to get back in “the zone” (source) after being interrupted. At my median of two interrupts per day, that works out to about 40 minutes lost on average per day just in context switching from one task to another. There is also a greater risk of mistakes being made while trying to regain focus on your task. That cost is harder to calculate but it is assuredly there.

Finally, a bit over half of these interrupts were “communication interrupts”. These are hard to pin down, but mostly they were a team member or a customer wanting to communicate something to me, “Hey, I fixed item A on Server 1, but item B and item C are still broken”, or a request for information, “Hey, how do items Z and Y work on Server 2 and Server 3?”. These clearly have an interrupt cost but, on the other hand, they also have a cost in not immediately jumping to the top of my to-do pile. If someone needs information from me it is likely because they are in a wait-cycle – they need to know something to continue their current task. It feels like it boils down to a “whose time is more valuable?” argument. Is it better for a Tier-2 team member to burn 45 minutes instead of 10 on a task because she had to dig up some information instead of asking a Tier-3 sysadmin? Or is it better for the Tier-3 sysadmin to burn 20 minutes to save his Tier-2 team member 35? I do not really know if there is an answer here but it is an interesting thread to pull on…

The interrupts that served to communicate information to me seem a little more clear-cut. None of them required immediate action and most of them were essentially status updates. Implementing a daily standup meeting or something similar would be a perfect format for these kinds of interactions.


Well. It was an interesting little experiment. I am not sure if I am smarter or better off for it but curiosity is an itch that sometimes just needs to be scratched.

Until next time, stay frosty.


Are GOV IT teams apathetic?

I have been stewing about this post on r/sysadmin, “Is apathy a problem in most government IT teams?”, for a while and felt like it was worth a quick write-up since most of my short IT career has been spent in the public sector.

First off, apathy and team dysfunction are a problem everywhere. There is nothing unique about government employees versus private employees in that respect. What I think the poster is really asking is, “Is there something about government IT that produces apathetic teams?” and if you read a little deeper it seems like apathy really means “permanent discouragement”; that is to say, the condition where change, “doing things right or better”, and greater efficiency are, or seem, impossible. When you read something like, “…trying to make things more efficient is met with reactions like ‘oh you naive boy’ and finger pointing,” it is hard to call that plain old vanilla apathy.

Government is not a business (despite what some people think). Programs operate at a loss, subsidized, in many cases entirely, by taxes, because the public and/or their representatives deem those programs worthy. The failure mechanism of market competition doesn’t exist. Incredibly effective programs can be cancelled because they are no longer politically favorable and incredibly ineffective programs can continue or expand because they have political support. Furthermore, in all things public servants need to remain impartial, unbiased and above impropriety. This leads to vast and byzantine processes, the components of which singularly make eminent good sense (for example, the prohibition of no-bid contracts) but collectively all these well-intentioned barnacles slow the ship-of-state dramatically. Success is not rewarded with growth either. Implementing a more efficient process or a more cost-effective infrastructure and saving money generally results in less money. This tendency of budget reduction (“Hey, if you saved it, you did not need it to begin with, right?”) turns highly functioning teams into disasters over time as they lose resources. Paradoxically, the better you are at utilizing your existing resources, the less you get. Finally, your entire leadership changes with every administration change. You may still be shoveling coal down in the engine room, but the new skipper just sent down word to reduce steam and come about hard in order to head in the opposite direction. Private companies that do this kind of thing, with this frequency, generally do not last long.

How does all this apply to Information Technology? It means that your organization will move very, very slowly and technology moves very, very fast. Not a good combo.


Those are the challenges that a team faces but what about the other half of the equation… the people facing them?

Job classes are just one small part of this picture but they are emblematic of some of the challenges that face team leads and managers when dealing with the ‘People’ piece of People, Process and Technology (ITIL buzzword detected! +5 points). The idea of job classes is that across the organization people doing similar work should be paid the same. The problem lies in the fact that updating a job class is beyond onerous and the time to completion is measured in years. Do you know how quickly Information Technology reinvents itself? Really quick. This means that job classes and their associated salaries tend to drift away from the actual on-the-ground work being done and the appropriate compensation level over time, making recruitment of new staff and retention of your best staff very difficult (The Dead Sea Effect). If you combine this with a lack of training and professional development, staff have a tendency to get pigeon-holed into a particular role without a clear promotion path. Furthermore, many of the job class series are disjointed in such a way that working at the top of one job series will not meet the prerequisites for another job series, making advancement difficult and, at least on paper, sometimes impossible. For example: you could work as a Lead Programmer for three years leading a team of five people and not qualify, at least on paper, for an entry-level IT Manager position.

How does all this apply to Information Technology? People get stuck doing one job, for too long, with no professional training or mentorship. Their skillsets decline towards obsolescence and they become frustrated and discouraged.


I have never met anyone in the public sector that just straight up did not give a crap. I have met people that feel stuck, discouraged, marginalized and ignored. And rightly so. Getting stuff done is very hard. It is like everyone has one ingredient necessary to make a cake, and you all more or less agree on the recipe. You are all trained and experienced bakers. You could easily make a cake, but you each have 100 pieces of paperwork you have to fill out and wait on, sometimes for months, before you can do your part of the cake-baking process. You have 10 different bosses, each telling you to make a different dessert when you know that cakes are by far the best dessert for your particular bakery. Then you get yelled at for not making a cake in a timely manner, and then you are all fired and replaced by food service contractors whose parent company charges an exorbitant hourly rate. But hey, the public eventually got their cake, right? Or at least a donut. Not exactly what they ordered but better than nothing… right?

If IT is a thankless job (and I am not sure I agree with that piece of sysadmin mythology), then Public Sector IT is even more thankless. You will face a Kafkaesque bureaucracy. You will likely be very underpaid and have a difficult time seeking promotion. You will never be able to accept a vendor-provided gift or meal over the price of $25. You will laugh when people ask if you plan on attending VMworld. The public will stereotype you as lazy, ineffective and overpaid. But you will persevere. You have a duty to the body politic to do your best with what you have. You will keep the lights on, you will keep the ship afloat even as more and more water pours in. You have to. Because that’s what Systems Administrators do.

And all you wanted was to simply make a cake.


One Year of Solitude: My Learning Experience as a Lead

It has been a little over a year since I stepped into a role as a technical lead and I thought this might be a good time to reflect on some of the lessons I have learned as I transition from being focused entirely on technical problems to trying to understand how those technical pieces fit into a larger picture.


Tech is easy. People are hard. And I have no idea how to deal with them.

It is hard to overstate this. People are really, really difficult to deal with compared to technology and I have so much to learn about this piece of sysadmin craft. I do not necessarily mean people are difficult in the sense that they are oppositional or hard to work with (although often they are), just that team dynamics are very complicated and the people composing your team have a huge spread in terms of experience, skills, motivations, personalities and goals. These underlying “attributes” are not static either; they change based on the day, the mood and the project, making identifying them, understanding them and planning around them even harder. Awareness of this underlying milieu composing your team members, and thus your team, is paramount to your project’s success.

All I can say is that I have just begun to develop an awareness of these “attributes” and am just getting the basics of recognizing different communication styles (person and instance dependent). I can just begin to tell whose motivations align with mine and whose do not. In hunting we call this the difference between “looking and seeing”. It takes a lot of practice to truly “see”, especially if, like me, you are not that socially adept.

My homework in this category is to build an RPG-like “character sheet” for each team member, myself included, and think about what their “attributes” are and where those attributes are strengths and where they can be weaknesses.


Everyone will hate you. Not really. But kinda yes.

One of the hardest parts of being a team lead is that you are now “in charge” of technical projects with a project team made up of many different members who are not within your direct “chain-of-command” (at least this is how it works in my world). This means you own the responsibility for the project but any authority you have is granted to you by a manager somewhere higher up the byzantine ladder of bureaucracy. Nominally, this authority allows you to assign and direct work directly related to the project but in practice this authority is entirely discretionary. You can ask team member A to work on item Z but it is really up to her and her direct supervisor whether that is what she is going to do. In the hierarchical, authority-based and process-driven business culture that most of us work in, this means you need to be exceedingly careful about whose toes you step on. Authority on paper is one thing, authority in practice is entirely another.


Mo’ People, Mo’ Problems

My handful of projects have thus far been composed of team members that kind of fall into these rough archetypes.

A portion of the team will be hesitant to take up the project and the work you are asking them to do since you are not, strictly speaking, their supervisor. They will passively help the project along, and frequently you will be required to meet directly with them and/or their supervisor to make sure they are “cleared” for the work you assigned them and to make sure they feel OK about doing it. These guys want to be helpful but they don’t want to work beyond what their supervisor has designated. Get them “cleared” and make sure they feel safe doing the work and you have made a lot of progress.

Another portion of the team will be outright hostile. Either their goals or motivations do not align with the project or, even worse, their supervisor’s goals or motivations do not align with the project but someone higher up leaned on them and so they are playing along. This is tough. The best you can hope for here is to move these folks from actively resisting to passively resisting. They might be “dead weight” but at least they aren’t actively trying to slow things down any more. I don’t have much of a working strategy here – an appeal to authority is rarely effective. Authority does not want to be bothered by your little squabbles and arguably it has already failed, because the chain-of-command can make someone play along, but it cannot make them play nice. I try to tailor my communication style to whatever I am picking up from these team members (see the poorly named Dealing with People You Can’t Stand), do my best to include them (trying to end-run them makes things ten times worse) and inoculate the team against their toxicity. I am a fan of saying I deal with problems and not complaints, because problems can actually be solved, but a lot of times these folks just want to complain. Give them a soap box so they can get it out of their system and you can move on and get work done, but don’t let them stand on it for too long.

Another group will be unengaged. These poor souls were probably assigned to the project because their supervisor had to put someone on it. A lot of times the project will be outside their normal technical area of operations, the project will only marginally affect them, or both. They will passively assist where they can. The best strategy I have found here is to be concise, do your best not to waste their time, and use their experience and knowledge of the surrounding business processes and people as much as you can. These guys can generate some great ideas or see problems that you would never otherwise see. You just have to find a way to engage them.

The last group will be actively engaged and strongly motivated to see the project succeed. These folks will be doing the heavy lifting and 90% of the actual technical work required to accomplish the project. You have to be careful not to let these guys lean too hard on the other team members out of frustration, and you have to not overly rely on them or burn them out; otherwise you will be really screwed, since they are the only people truly putting in the nuts-and-bolts work required for the project’s success.

A quick aside, if you do not have enough people in this last group the project is doomed to failure. There is no way a project composed mostly of people actively resisting its stated goals will succeed, at least not under my junior leadership.

Dysfunctional? Yes. But all teams are dysfunctional in certain ways and at certain times. Understanding and adapting to the nature of your team’s dysfunction lets you mitigate it and maybe, just maybe, help move it towards a healthier place.

Until next time, good luck!

A Ticket Too Far… Breaking the Broken

A funny thing happened a while back: one of my managers asked me to stop creating tickets on behalf of customers. This, uh, well, this kind of made me pause for a few reasons. The first and most obvious one is that I cannot remember shit. I always feel terrible when I forget someone’s request, and I feel doubly terrible when I forget it due to skipping something as simple as creating a ticket. The second is that it is generally considered a Good Thing (TM) to track your customer requests. I won’t even bother supporting that proposition because Tom Limoncelli has pretty much got that covered in Time Management for System Administrators.

The justification for this directive is pretty simple and common-sense, and it is a great example of how a technical person like me with the best of intentions can develop some self-sabotaging behavior.

  • Tickets created for customers by myself with my notes in them are confusing to Tier-1/Tier-2 support folks. It looks like I created the ticket but forgot to own it and am still working the issue, when in actuality I bumped the request all the way back down to Tier-1 where it should have started. Nothing makes a ticket linger in limbo longer than looking like someone is working it while not being owned by anyone. This tendency for tickets to live in limbo is exacerbated because our ticket system does not support email notification.
  • Customers are confused when a Tier-1/Tier-2 person calls them after picking up a ticket from the queue and asks, “Hey there, I am calling about Request #234901 and your <insert issue here>”.
  • Finally, and most importantly, it does nothing to help correct the behavior of customers and teach them the one true way to request assistance from IT: by submitting a ticket.

OK. Rebuttal time! (Which sounds kind of weird when you say it out loud). The first two points are largely an artifact of our ticketing system and/or its implementation.

The ticket queue is actually a generic user in the ticket system that tickets can get assigned to by customers. There is no notification when a ticket is created and assigned to this queue, nor any when a ticket is assigned to you. The lack of notification requires a manager or a lead on our team to police the queue, assign tickets to line staff based on who they think is best suited to work a particular issue and then finally notify them via email, phone or in person.

The arguably confusing series of events where a ticket is created on behalf of a user is, again, mainly a technical fault of the system. The requester is set to the customer, but line staff that pick up the ticket may just read the notes, which have my grubby hands all over them… so whose issue is it? Mine or the customer’s?

That being said – both of these points could largely be alleviated by a smarter ticket system with proper notification and by our Tier-1 guys reading the notes a little more carefully. I can forgive them their trespass since they are extremely interrupt-driven and have a tendency to shoot tickets first and ask questions later, but still, the appropriate context is there.

The last point, the idea that creating tickets reinforces bad end-user behavior, is by far the most salient one in my opinion. If you let people get away with not submitting tickets you are short-changing yourself and them. I won’t get credit for the work, the work won’t be documented, we won’t have accurate metrics and I am about 1000% more likely to forget the request and never do it.

Problem: We don’t have a policy requiring users to submit a ticket for a request; it’s more like a guideline. And the further up the support tiers you go, the fewer requests have tickets. This leaves my team in an interesting spot: we either create the ticket for the customer, tell the customer we won’t work the issue if they don’t create a ticket first (kind of a dick move, especially when our policy has no teeth) or not create a ticket at all.

Conclusion: Right idea but we are still focused on the symptom and not the cause. Let’s review.

  • The ticket system has technical deficiencies that lead to less than ideal outcomes. It is cumbersome for both technical staff and customers to use, and it relies on staff doing the very thing ticket systems are supposed to reduce: interrupting people to let them know they have work assigned to them.
  • A policy is not useful if it does not have teeth. I already feel like a jerk telling a customer, “Hey. I am working with another team/customer/whatever but if you submit a ticket someone will take a look,” when they are standing in my cubicle with big old doe eyes. I especially feel like a jerk when I do not even have a policy backing me up. Paraphrasing Tom Limoncelli, “Your customers judge your competency by your availability. Your manager judges it by your completion of projects. These dual requirements are directly opposed and balancing them is incredibly important.”
  • By the time I am creating a ticket on behalf of a customer the battle is already lost. I’ve already been interrupted with all the lost efficiency and the danger of mistakes that comes with it.
  • The customers that do not submit tickets get preferential treatment. They get to jump ahead of all the people that actually did submit tickets which hardly seems fair. All that is happening here is that we are encouraging the squeaky wheels to squeak louder.
  • The escalation chain gets skipped. A bunch of these kinds of issues should be caught at Tier-1 and Tier-2. By skipping right to Tier-3, we are not applying our skills optimally and we are also depriving the Tier-1 and Tier-2 guys of the chance to chew on a meatier problem. A large part of the reason I am creating tickets for customers is to bump the request back down to Tier-1 and Tier-2 where it should have been dealt with to begin with.

Creating tickets on behalf of customers is not the problem. It is a symptom of deeper issues in Process and Technology. These issues will not be resolved by no longer generating tickets for customers. Customers will still skip the escalation chain, we will continue to reinforce bad behavior, fewer issues will get recorded, and our Tier-2 and Tier-3 will still be interrupt-driven regardless of whether there is a ticket or not. All that will change is that we will be more likely to forget requests.

The technical problems can be resolved by implementing a new ticket system or by fixing our existing one. The policy problems can be solved by creating a standardized policy for all our customers and then actually ensuring that it has teeth. The people problems can be fixed by consistent and repeated re-training.

That covers the root cause but what about now? What do we do?

  • We create the ticket for the customer – We cannot really do that. It disobeys a directive from leadership and it has all the problems discussed above.
  • We tell the customer to come back with a ticket – This does not really address the root cause, annoys the customers and we do not have a policy backing it up. It is not really an option.
  • Do not use a ticket to track the request – And here we are by process of elimination. If things are broken, sometimes the best way to fix them is to let them break even further.

Until next time . . .

“When the world gets bad enough, the good go crazy, but the smart… they go bad.” – Evil Abed

Budget Cuts and Consolidation: Taking it to the Danger Zone

For those of you that do not know, Alaska is kind of like the 3rd world of the United States in that we have a semi-exploitative love/hate economic relationship with a single industry . . . petroleum. Why does this matter? It matters because two years ago oil was $120 a barrel and now it is floating between $40 and $50. For those of us in public service, or in private industry support services that contract with government and municipal agencies, it means that our budget just shrank by 60%. The Legislature is currently struggling to balance a budget that runs an annual $3.5 to $4 billion deficit, a pretty difficult task if your only revenue stream is oil.

Regardless of where you work and who you work for in Alaska, this means “the times, they are a-changin’”. As budgets shrink, so do resources: staff, time, support services, training opportunities, travel, equipment refreshes and so on. Belts tighten but we still have to eat. One way to make the food go further is to consolidate. In IT, especially in these days of Everything-as-a-Service, there is more and more momentum in the business to move to centralized, standardized and consolidated service delivery (ITIL buzzword detected! +5 points).

In the last few years, I have been involved in a few of these types of projects. I am here to share a couple of observations.

 

 

[Graph: Consolidation, Workload and Ops Capacity – Workload vs. Operational Capacity over time]

Above you should find a fairly straightforward management-esque graph with made-up numbers and metrics. Workload is how much stuff you actually have to get done. This is deceptive because Workload can break down into many different types of work: projects, break/fix, work that requires immediate action, and work that can be scheduled. But for the sake of this general 40,000 ft view, it can just be deemed work that you and your team do.

Operational Capacity is simply you and your team’s ability to actually do that work. Again, this is deceptive because depending on your team’s skills, personalities, culture, organizational support, and morale, their Operational Capacity can look different even if the total amount of work they do in a given time stays constant. But whatever, management-esque talk can be vague.

Consolidation projects can be all over the map as well: combining disparate systems that have the same business function, eliminating duplicate systems and/or services, centralizing services or even something as disruptive as combining business units and teams. Consolidation projects generally require standardization as a prerequisite; how else would you consolidate? The technical piece here is generally the smallest: People, Process, Technology, right?

And from that technical standpoint, especially one from a team somewhere along that Workload vs. Operational Capacity timeline, consolidation and standardization look very, very different.

Standardization has no appreciable long-term Workload increase or reduction. There is an increased capture of business value for existing work performed. If there is wider use of the same Process and Technology, the business value of a given unit of work goes further; for example, if it takes 10 hours to patch 200 workstations, it may only take 10.2 hours to patch 2,000 workstations.

Consolidation brings a long-term Workload increase with a corresponding increase of Operational Capacity due to the addition of new resources or the re-allocation of existing resources (that’s the dotted orange line on the graph). For example, if there is widespread adoption of the same Process and Technology, you can take the 10 hours my team spends on patching workstations and combine it with the 10 hours another team spends on patching workstations. You just bought yourself some Operational Capacity, either in terms of having twice as many people deal with the patching, or maybe it turns out that it only takes 10 hours to patch both teams’ workstations and you have freed up 10 hours worth of labor that can go to something else. There is still more work than before, but that increased Workload is more than offset by increased Operational Capacity.
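If you want to put rough numbers on that intuition, here is a back-of-the-envelope sketch in PowerShell. The fixed and per-seat figures are invented to line up with my example above; the point is that automated, standardized patching is almost entirely fixed cost:

    # Toy model: deployment time = fixed cost + per-workstation marginal cost.
    # Both figures are made up to match the 10 hours / 200 workstations example.
    $fixedHours   = 9.98       # building, testing and scheduling the deployment
    $perSeatHours = 0.000111   # marginal cost per workstation (roughly 0.4 seconds each)

    function Get-PatchHours([int]$Workstations) {
        $fixedHours + ($perSeatHours * $Workstations)
    }

    Get-PatchHours 200    # ~10.0 hours
    Get-PatchHours 2000   # ~10.2 hours

Once the Process and Technology are standardized, adding workstations barely moves the needle, which is exactly why folding two teams’ patching workloads together is nearly free.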

Both standardization and consolidation projects increase the short-term Workload while the project is ongoing (see Spring of ’15 in the graph). They are often triggered by external events like mergers, management decisions, or simply proactive planning in a time of shrinking budgets. In this example, it is a reduction of staff, which obviously reduces the team’s Operational Capacity. The ability to remain proactive at both the strategic and tactical level is reduced. In fact, we are just barely able to get work done. BUT we have (or had) enough surplus capacity to continue to remain proactive even while taking on more projects, hopefully projects that will either reduce our Workload or increase our Operational Capacity or both, because things are thin right now.

Boom! Things get worse. Workload increases a few months later. Maybe another position was cut, maybe a new project or an unanticipated requirement from on-high came down to your team. Now you are in, wait for it… THE DANGER ZONE! You cannot get all the work done inside the required time frame with what you have. This is a bad, bad, bad place to be for too long. You have to put projects on hold, put maintenance on hold or let the ticket queues grow. Your team works harder, longer and burns out. A steady hand, a calm demeanor and a bit of healthy irreverence are really important here. Your team needs to pick its projects very, very carefully since you are no longer in a position to complete them all. The ones you do complete had damn well better either lower Workload significantly, increase your Operational Capacity or, hopefully, do both. Mistakes here cost a lot more than they did a year ago.

The problem here is that technical staff do not generally prioritize their projects. Their business leaders do. And in times where budgets are evaporating, priorities seem to settle around a single thing: cost savings. This makes obvious sense, but the danger is that there is no guarantee that the project with the most significant cost savings will also happen to be the project that helps your team decrease their Workload and/or increase their Operational Capacity. I am not saying it won’t happen, just that there is no guarantee that it will. So your team is falling apart, you just completed a project that saves the whole business rap star dollars worth of money and you have not done anything to move your team out of THE DANGER ZONE.

In summation, projects that increase your Operational Capacity and/or reduce your Workload have significant long-term savings in terms of more efficient allocation of resources, but the projects that get priority will be those that have immediate short-term savings in terms of dollars and cents.

Then a critical team member finds better work. Then it’s over. No more projects with cost savings, no more projects at all. All that maintenance that was put off, all the business leaders that tolerated the “temporary” increase in response time for ticket resolution, all the “I really should verify our backups via simulated recovery” kind of tasks – all those salmon come home to spawn. Your team is in full blown reactive mode. They spend all their time putting out fires. You are just surviving.

Moral of the story? If you go to THE DANGER ZONE, don’t stay too long and make sure you have a plan to get your team out.

 

Documentation or how I wasted an hour

As if confirming my own tendency to “do as I say, not as I do”, I just wasted about an hour this morning trying to figure out why a newly created virtual machine was not correctly registering its hostname with Active Directory via Dynamic DNS. Of course, this was a series of errors greatly exacerbated by the fact that I had only had two of my required four cups of coffee and I stayed up too late watching the ironically named and absolutely hilarious Workaholics.

Let’s review, shall we?

  • Being tired and trying to do something mildly complicated
  • Allowing myself to become distracted by an interrupt task in the middle of this work
  • Not verifying the accuracy of our documentation prior to assigning the IP address in question to the virtual machine
  • Screwing up and assigning the IP address to the wrong virtual machine (both the hostnames and the subnet octets were very similar)
  • Not reading the instrumentation; the output of ipconfig /all plainly said “(Duplicate)”. Duh.

All of these factors made what should have been a 15-minute troubleshooting task stretch out into an hour.
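In hindsight, the instrumentation was shouting at me the whole time. Even a lazy one-liner like this, just filtering the output I was already staring at, would have surfaced it immediately:

    # Show only the lines where Windows flags an address conflict
    ipconfig /all | Select-String 'Duplicate'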

Root cause: The IP address I picked for one of the virtual machines was already in use and the documentation was not updated to reflect this.

Potential solutions: I dunno… how about keeping our documentation updated (easier said than done)? Or better yet, stop using a “documentation system” for IP addresses that relies on discretionary operational practices (i.e., an Excel spreadsheet stored on SharePoint) and use something like IPAM. Maybe, instead of going down the ol’ “runlist” of potential problems, I should have stopped and gathered a bit more information before proceeding with troubleshooting. The issue was right there in the ipconfig output. I was looking *right* at it. I guess that is the difference between looking and seeing.
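For what it is worth, a 30-second sanity check before handing out the address would have saved me the hour. Something like this rough PowerShell sketch (10.0.0.42 is a stand-in, not the actual address in question):

    # Quick "is anyone already using this IP?" check before you assign it.
    $ip = '10.0.0.42'   # placeholder for the address you are about to hand out

    # A ping answer means the address is definitely taken...
    if (Test-Connection -ComputerName $ip -Count 2 -Quiet) {
        Write-Warning "$ip answers ping - it is already in use."
    }

    # ...but silence does not mean it is free (host firewalls eat ICMP),
    # so also check the ARP/neighbor cache and reverse DNS for stragglers.
    Get-NetNeighbor -IPAddress $ip -ErrorAction SilentlyContinue
    Resolve-DnsName $ip -ErrorAction SilentlyContinue

None of that replaces a real IPAM system, but it beats blindly trusting a spreadsheet on SharePoint.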

In short . . . happy Monday you jerks.

 

[Image: facepalm]