Tag Archives: time management

Morale, Workload and Tickets – A Follow-Up

You guys remember Elaine, right? Six months ago (Morale Is Low, Workload Is Up) I looked into some ticketing system data to try to discover what was going on in our team and how our seemingly ever-increasing workload was being distributed among staff. We unfortunately came to some alarming conclusions and hopefully mitigated them. I checked in with her recently to see how she was doing. Her response?

Things are just great. Just greatttt…

Let’s dig back in and see what we find. Here’s what I’m hoping we see:

  • Elaine’s percentage of our break/fix work drops to below 20%. She was recently promoted to our tier-2 staff and her skill set should be dedicated more towards proactive work and operational maintenance.
  • George and Susan have come into their own and between the two of them are managing at least 60% of the ticket queue. They’re our front line tier-1 staff so I would expect the majority of the break/fix work goes to them and they escalate tickets as necessary to other staff.
  • Our total ticket volume drops a bit. I don’t think we’re going to get back to our “baseline” from 2016 but hopefully August was an outlier.

 

Well, shit. That’s not what I was hoping to see.

That’s not great but not entirely unexpected. We did take over support for another 300 users in August so I would expect that workload would increase, which it has, roughly doubling. However it is troubling because theoretically we are standardizing that department’s technology infrastructure which should lead to a decline in reactive break/fix work. Standardization should generate returns in reduced workload. If we add them into our existing centralized and automated processes the same labor hours we are already spending will just go that much further. If we don’t do that, all we have done is just add more work that, while different tactically, is strategically identical. This is really a race against time – we need to standardize management of this department before our lack of operational capacity catches up with us and causes deeper systemic failures, pushing us so far down the “reactive” side of operations that we can’t climb back up. We’re in the Danger Zone.

 

This is starting to look bad.

Looking over our ticket load per team member things are starting to look bleaker. Susan and George are definitely helping out but the only two months where Elaine’s ticket counts were close to theirs were when she was out of the office for a much needed extended vacation. Elaine is still owning more of the team’s work than she should, especially now that she’s nominally in a tier-2 position. Let’s also remember that in August, when responsibility for those additional 300 users was moved to my team along with Susan and George, we lost two more employees in the transition. That works out to a 20% reduction of manpower and that includes our manager as a technical asset (which is debatable). If you look at just reduction in line staff it is even higher. This is starting to look like a recipe for failure.

 

Yep. Confirmed stage 2 dumpster fire.

Other than the dip in December and January when Elaine was on vacation things look more or less the same. Here’s another view of just the tier-1 (George, Frank and Susan) and tier-2 (Elaine and Kramer) staff:

Maybe upgrade this to a stage 3 dumpster fire?

I think this graph speaks for itself. Elaine and Susan are by far doing the bulk of the reactive break/fix work. This has serious consequences. There is substantial proactive automation work that only Elaine has the skills to do. The more of that work that is delayed to resolve break/fix issues the more reactive we become and the harder it is to do the proactive work that prevents things from breaking in the first place. You can see how quickly this can spiral out of control. We’re past the Danger Zone at this point. To extend the Top Gun metaphor – we are about to stall (“No! Goose! Noo! Oh no!”). The list of options that I have, lowly technical lead that I am, is getting shorter. It’s getting real short.

In summation: Things are getting worse and there’s no reason to expect that to change.

  • Since August 2016 we have lost three positions (30% reduction in workforce). Since I started in November 2014 we have seen a loss of eight positions (50% reduction in workforce).
  • Our break/fix workload has effectively doubled.
  • We have had a change in leadership and a refocusing of priorities on service delivery over proactive operational maintenance which makes sense because the customers are starting to feel the friction. Of course with limited operational capacity putting off PM for too long is starting to get risky.
  • We have an incredibly uneven distribution of our break/fix work.
  • Our standardization efforts for our new department are obviously failing.
  • It seems less likely every day that we are going to be able to climb back up the reactive side of the slope we are on with such limited resources and little operational capacity.

Until next time, Stay frosty.

 

Morale Is Low, Workload Is Up

Earlier this month, I came back from lunch and I could tell something was off. One of my team members, let’s call her Elaine, who is by far the most upbeat, relentlessly optimistic and quickest to laugh off any of our daily trials and tribulations, was silent, hurriedly moving around and uncharacteristically short with customers and coworkers. Maybe she was having a bad day, I wondered, as I made a mental note to keep tabs on her for the week to see if she bounced back to her normal self. When her attitude didn’t change after a few days I was really worried.

Time to earn my team lead stripes so I took her aside and asked her what’s up. I could hear the steam venting as she started with, “I’m just so f*****g busy”. I decided to shut up and listen as she continued. There was a lot to unpack: She was under pressure to redesign our imaging process to incorporate a new department that got rolled under us, she was handling the majority of our largely bungled Office 365 Exchange Online post-migration support and she was still crushing tickets on the help desk with the best of them. The straw that broke the camel’s back – spending a day to clean up her cubicle that was full of surplus equipment because someone commented that our messy work area looked unprofessional… “I don’t have time for unimportant s**t like that right now!” as she continued furiously cleaning.

The first thing I did was ask her what the high priority task of the afternoon was and figure out how to move it somewhere else. Next I recommended that she finish her cleaning, take off early and then take tomorrow off. When someone is that worked up, myself included, generally a great place to start is to get some distance between you and whatever is stressing you out until you decompress a bit.

Next I started looking through our ticket system to see if I could get some supporting information about her workload that I could take to our manager.

Huh. Not a great trend.

That’s an interesting uptick that just so happens to coincide with us taking over the support responsibilities for the previously mentioned department. We did bring their team of four people over but only managed to retain two in the process. Our workload increased substantially too since we not only had to continue to maintain the same service level but we now have the additional challenge of performing discovery, taking over the administration and standardizing their systems (I have talked about balancing consolidation projects and workload before). It was an unfortunate coincidence that we had to schedule our Office 365 migration at the same time due to a scheduling conflict. Bottom line: We increased our workload by a not insignificant amount and lost two people. Not a great start.

I wonder how our new guys (George and Susan) are doing? Let’s take a look at the ticket distribution, shall we?

Huh. Also not a great trend.

Back in December 2016 it looks like Elaine started taking on more and more of the team’s tickets. August of 2017 was clearly a rough month for the team as we started eating through all that additional workload but noticeably that workload was not being distributed evenly.

Here is another view that I think really underlines the point.

Yeah. That sucks for Elaine.

As far back as a year ago Elaine was handling about 25% of our tickets and since then her percentage of the tickets has increased to close to 50%. What makes this worse is not only has the absolute quantity of tickets in August more than doubled compared to the average of the 11 preceding months but the relative percentage of her contribution has doubled as well. This is bad and I should have noticed a long time ago.
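If you want to run the same check against your own ticket export, here is a minimal sketch of the per-person share calculation. The `(month, owner)` tuples and the counts below are made up for illustration; they are not the real data behind the graphs, and your ticket system's export format will almost certainly look different.

```python
from collections import Counter, defaultdict


def ticket_share_by_month(tickets):
    """Return {month: {owner: fraction of that month's tickets}}."""
    counts = defaultdict(Counter)
    for month, owner in tickets:
        counts[month][owner] += 1

    shares = {}
    for month, per_owner in counts.items():
        total = sum(per_owner.values())
        shares[month] = {owner: n / total for owner, n in per_owner.items()}
    return shares


# Hypothetical sample: one quiet month and one busy month.
sample = (
    [("2017-07", "Elaine")] * 5
    + [("2017-07", "Kramer")] * 15
    + [("2017-08", "Elaine")] * 20
    + [("2017-08", "George")] * 12
    + [("2017-08", "Susan")] * 8
)

shares = ticket_share_by_month(sample)
for month in sorted(shares):
    print(month, {owner: f"{share:.0%}" for owner, share in shares[month].items()})
```

Looking at both the raw counts and the shares matters here: a month can double in volume while one person's share doubles too, which is the double hit described above.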

Elaine and I had a little chat about this situation and here’s what I distilled out of it:

  • “If I don’t take the tickets they won’t get done”
  • “I’m the one that learns new stuff as it comes along so then I’m the one that ends up supporting it”
  • “There’s too many user requests for me to get my project work done quickly”

Service Delivery and Business Processes. A foe beyond any technical lead.

This is where my power as a technical lead ends. It takes a manager or possibly even an executive to address these issues but I can do my best to advocate for my team.

The first issue is actually simple. Elaine needs to stop taking it upon herself to own the majority of the tickets. If the tickets aren’t in the queue then no one else will have the opportunity to take them. If the tickets linger, that’s not Elaine’s problem, that’s a service delivery problem for a manager to solve.

The second issue is a little harder since it is fundamentally about the ability of staff to learn as they go, be self-motivated and be OK with just jumping into a technology without any real guidance or training. Round after round of budget cuts has decimated our training budget and increased our tempo to the point where cross training and knowledge sharing is incredibly difficult. I routinely hear, “I don’t know anything about X. I never had any training on X. How am I supposed to fix X!” from team members and as sympathetic as I am about how crappy of a situation that is there is nothing I can do about it. The days of being an “IT guy” that can go down The Big Blue Runbook of Troubleshooting are over. Every day something new that you have never seen before is broken and you just have to figure it out.

Elaine is right though – she is punching way above her weight, the result of which is that she owns more and more of the support burden as technology changes and as our team fails to evenly adopt the change. A manager could request some targeted training or maybe some force augmentation from another agency or contracting services. Neither is a particularly likely outcome given our budget unfortunately.

The last one is a perennial struggle of the sysadmin: Your boss judges your efficacy by your ability to complete projects, your users (and thus your boss’ peers via the chain of command) judge your efficacy by your responsiveness to service requests. These two standards are in direct competition. This is such a common and complicated problem that there is a fantastic book about it: Time Management for System Administrators

The majority of the suggestions to help alleviate this problem require management buy-in and most of them our shop doesn’t have: An easy to use ticket system with notification features, a policy stating that tickets are the method of requesting support in all but the most exigent of circumstances, a true triage system, a rotating interrupt blocker position and so on. The best I can do here is to recommend that Elaine develop some time management skills, work on healthy coping skills (exercise, walking, taking breaks, etc.) and do regular one-on-one sessions with our manager so she has a venue for discussing these frustrations privately so at least if they cannot be solved they can be acknowledged.

I brought a sanitized version of this to our team manager and we made some substantial progress. He reminded me that George and Susan have only been on our team for a month and that it will take some time for them to come up to speed before they can really start eating through the ticket queue. He also told Elaine that, while her tenacity in the ticket queue is admirable, she needs to stop taking so many tickets so the other guys have a chance. If they linger, well, we can cross that bridge when we come to it.

The best we can do is wait and see. It’ll be interesting to see what happens as George and Susan adjust to our team and how well the strategy of leaving tickets unowned to encourage team members to grab them works out.

Until next time, stay frosty.

 

Can’tBan… Adventures with Kanban

Comic about Agile Programming

. . .

We started using Kanban in our shop about six months ago. This in and of itself is interesting considering we are nominally an ITIL shop and the underlying philosophies of ITIL and Kanban seem diametrically opposed. Kanban, at least from my cursory experience, is focused on speeding up the flow of work, identifying bottlenecks and meeting customers’ requirements more responsively. It is all about reducing “cycle time”, that is the time it takes to move a unit of work through to completion. ITIL is all about slowing the flow of work down and adding rigor and business oversight into IT processes. A side effect of this is that the cycle time increases.

If you are not familiar with Kanban the idea is simple. Projects get decomposed into discrete tasks, tasks get pulled through the system from initiation to finish and each stage of the project is represented by a queue. Queues have work in progress (WIP) limits which means only so many tasks can be in a single queue at the same time. The backlog is where everything you want to get done sits before you actually start working on it. DO YOU WANT TO KNOW MORE?

As I am sure the one reader of my blog knows, I simultaneously struggle with time management and I am also fascinated by it. What do I think about Kanban? I have mixed feelings.

The Good

  • Kanban is very visual. I like visual things – walk me through your application’s architecture over the phone and I have no idea what you have just told me five minutes later. Show me a diagram and I will get it. This appeal of course is personal and will vary widely depending on the individual.
  • Work in progress (WIP) limits! These are a fantastic concept. The idea that your team can only process so much work in a given unit of time and that constantly context switching between tasks has an associated cost is obvious to those in the trenches but not so much to those higher powers that exist beyond the Reality Impermeability Layer of upper management. If you literally show them there is not enough room in the execution queue for another task they will start to get it. All of a sudden you and your leadership can start asking the real questions… why is task A being worked on before task Z? Do you need more resources to complete all the given tasks? Maybe task C can wait awhile? Why is task G moving so slowly? Why are we bottlenecked at this phase?
  • Priorities are made explicit. If I ever have doubt about what I am expected to be working on I can just check the execution queue. If my manager wants me to work on another task that is outside the execution queue, then we can have a discussion about whether or not to bump something back or hold the current “oh hey, can you take care of this right now?” task in the backlog. I cannot overstate how awesome this is. It makes the cost of context switching visible, keeps my tactical work aligned with my manager’s strategic goals, and makes us think about what tasks matter most and in what order they should get done. This is so much better than the weekly meeting, where more and more tasks get dumped into some nebulous to-do list that my team struggles through while leadership wonders why the “Pet Project of the Month” isn’t finished yet.

The Interesting

  • The scope of work that you set as a singular “task” is really important. If a single task is too large then it doesn’t accurately map to the work being done on a day-to-day basis and you lose out on Kanban’s ability to bring bottlenecks and patterns to the surface where they can be dealt with. If the tasks are too small then you end up spending too much time in the “meta-analysis” of figuring out what task is where instead of actually accomplishing things.
  • The type of work you decide to count as a Kanban task also has a huge effect on how your Kanban actually “runs”. Do you track break/fix, maintenance tasks, meetings, projects, all of the above? I think this really depends on how your team works and what they work on so there is no hard and fast answer here.
  • Some team members are more equal than others. We set our WIP limit to Number of Team Members * 2… the idea being that two to three tasks is about all a single person can really focus on and still be effective (i.e., “The Rule Of Threes”). Turns out though in practice that 60% of tasks are owned by only 20% of the team. Huh. I guess that would be called a bottleneck?
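To make the WIP limit and the ownership skew concrete, here is a rough sketch. The team size of five and the contents of the execution queue are hypothetical, chosen only so the numbers line up with the 60%/20% observation; the function names are mine and not part of any Kanban tool.

```python
from collections import Counter


def wip_limit(team_size, tasks_per_person=2):
    """WIP limit as described above: number of team members * 2."""
    return team_size * tasks_per_person


def can_pull(owners_in_queue, team_size):
    """True if the execution queue has room for one more task."""
    return len(owners_in_queue) < wip_limit(team_size)


def share_owned_by_top(owners_in_queue, team_size, top_fraction=0.2):
    """Fraction of queued tasks owned by the busiest top_fraction of the team."""
    if not owners_in_queue:
        return 0.0
    top_n = max(1, round(team_size * top_fraction))
    counts = Counter(owners_in_queue)
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / len(owners_in_queue)


# Hypothetical execution queue: one entry per task, naming its owner.
team_size = 5
execution_queue = ["Elaine"] * 6 + ["George", "Susan", "Kramer", "Frank"]

print(wip_limit(team_size))                            # 10
print(can_pull(execution_queue, team_size))            # False, the queue is full
print(share_owned_by_top(execution_queue, team_size))  # 0.6
```

A team-wide limit like this says nothing about how the tasks are spread out, which is exactly how one person ends up holding most of the queue. A per-person limit would be one way to surface that, though we have not tried it.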

The Bad

  • Your queues need to actually be meaningful. Just having separate queues named “Initiation”, “Documentation”, “Sign-off” only works if you have discrete actions that are expected for the tasks in those queues. In our shop what I have found is only one queue matters: the execution queue. We have other queues but since they do not have requirements and WIP limits attached to them they are essentially just to-do lists. If a task goes into the Documentation queue, then you better damn well document your system before you move the task along. What we have is essentially a one queue Kanban system with a single WIP limit. If we restructured our Kanban process and truly pulled a task through each queue from beginning to finish I think we would see much more utility.
  • Flow vs. non-flow. An interesting side effect of not having strong queue requirements is that tasks don’t really “flow”. For example: We are singularly focused on the execution queue and so every time I finish a task it gets moved onto the documentation queue where it piles up with all the other stuff I never documented. Now instead of backing off and making time for our team to document before pulling more work into the system I re-focus on whatever task just got promoted into the execution queue. Maybe this is why our documentation sucks so much? What this should tell us is 1) We have too many items in the documentation queue for new work, 2) the documentation queue needs a smaller WIP limit, 3) we need to make the hard decision to put off work until documentation is done if we actually want documentation and 4) documentation is work and work takes time. If we never give staff the time to document then we will end up with no documentation. I don’t necessarily think everything needs to be pulled through each queue. Break/Fix work is often simple, ephemeral and, if your ticket system doesn’t suck, self-documenting. You could handle these types of tasks with a standalone queue.
  • Queues should have time-limits. You only have one of two states regarding a given unit of work, you are either actively working on it or you are not. Kanban should have the same relationship with tasks in a given queue. If a task has sat in the planning queue for a week without any actual planning occurring then it should be removed. Either the next queue is full (bottleneck), the planning queue is full (bottleneck/WIP limit too high) or your team is not working on your Kanban tasks (other larger systemic problems). Aggressively “reset” tasks by sending them to the backlog if no work is being performed on them and enforce your queue requirements otherwise all you have done is create six different “to-do-whenever-we-have-spare-time-which-is-never-lists” that just collect tasks.
  • Our implementation of Kanban does not work as a time management tool because we only track “project” work. Accordingly very little of my time is actually spent on the Kanban tasks since I am also doing break/fix, escalations, monitoring and preventive maintenance. This really detracts from the overall benefit of managing priorities, making them explicit and limiting context switching since our Kanban board only represents at best 25% of my team’s work.
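The queue time-limit idea is simple enough to sketch. This assumes each task records the last date anyone actually worked on it, which our board does not do today; the `Task` fields, queue names and example tasks are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Task:
    name: str
    queue: str
    last_worked: date


def reset_stale_tasks(tasks, today, max_idle=timedelta(days=7)):
    """Send any task idle for max_idle or longer back to the backlog.

    Returns the names of the tasks that were reset.
    """
    reset = []
    for task in tasks:
        if task.queue != "backlog" and today - task.last_worked >= max_idle:
            task.queue = "backlog"
            reset.append(task.name)
    return reset


# Hypothetical board state.
board = [
    Task("imaging redesign", "planning", date(2017, 3, 1)),
    Task("monitoring cleanup", "execution", date(2017, 3, 14)),
]

print(reset_stale_tasks(board, today=date(2017, 3, 15)))  # ['imaging redesign']
```

The one-week threshold comes straight from the planning queue example above; whether the same limit makes sense for every queue is an open question.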

In conclusion there are some things I really like about Kanban and with some tweaks I think our implementation could have a lot of utility. I am not convinced it will mix well with our weird combination of ITIL processes but no real help desk (see: Who Needs Tickets Anyway? and Those are Rookie Numbers according to r/sysadmin). We are getting value out of Kanban but it needs some real changes before it becomes just one more process of vague effectiveness.

It will be interesting to see where we are in another six months.

Until next time, keep your stick on the ice.

A Ticket Too Far… Part II

Part I:  A Ticket Too Far… Breaking the Broken

OK. So I screwed up. If you read the above post, I make a lot of claims about ticket systems and process. Many of those claims were based on the idea that we did not have an approved and enforced policy in place. Turns out I was wrong, sort of.

I did a little digging and reviewed the policy. Customers are asked to submit a ticket if their issue is not immediately preventing their work, otherwise they can either call the help desk number, an IT staff member directly or visit them in person. There is a lot I could say about this policy and what provisions I agree with and which I do not, but policy is policy and I misrepresented the strength of it – it is very much vetted and approved.

I see the goals of a policy on customer facing support as follows:

  • Prevent sysadmins from forgetting customer requests
  • Allow sysadmins to control their interrupt-based workflow, prioritize and not have their workflow control them
  • Track customer requests and incidents so pain-points can be discovered and resolved proactively
  • Build a database of break/fix-based documentation
  • Create an acknowledgement and feedback mechanism for customer issues (i.e., “you have been assigned ticket #232”), backed up by mechanisms that force sysadmin action (i.e., “this ticket has not been touched in three days, either close it or reply”). This feedback loop ensures that issues are acknowledged and resolved one way or another.
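The "force sysadmin action" half of that last goal could look something like the sketch below. Our ticket system has no notification features, so this is purely illustrative: the ticket dictionaries, field names and dates are invented, and a real version would read from whatever your ticket system exposes.

```python
from datetime import datetime, timedelta


def stale_ticket_reminders(tickets, now, max_idle=timedelta(days=3)):
    """Return a reminder for every open ticket idle for max_idle or longer."""
    reminders = []
    for ticket in tickets:
        idle = now - ticket["last_touched"]
        if ticket["open"] and idle >= max_idle:
            reminders.append(
                f"Ticket #{ticket['id']} has not been touched in {idle.days} days, "
                "either close it or reply."
            )
    return reminders


# Hypothetical tickets.
now = datetime(2017, 3, 10, 9, 0)
tickets = [
    {"id": 232, "open": True, "last_touched": datetime(2017, 3, 6, 9, 0)},
    {"id": 233, "open": True, "last_touched": datetime(2017, 3, 9, 15, 0)},
    {"id": 234, "open": False, "last_touched": datetime(2017, 3, 1, 9, 0)},
]

reminders = stale_ticket_reminders(tickets, now)
for line in reminders:
    print(line)
```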

The details may be wrong but the bigger point of my last post remains the same; the combination of our policy and ticket system’s technical limitations does not accomplish those goals or lead to ideal outcomes for either customers or IT staff.

But does it really? Perception and reality are not always the same so I started tracking how often I was interrupted by a customer or a team member over a period of about four weeks. It is important to mention this was not a particularly rigorous study, I just kept an Excel spreadsheet and anytime I diverted my attention for more than a few minutes from my current task I made a quick note of it. If anything, I was consistent in my inconsistency. I also kept track of what kind of interrupt it was, what group it came from and whether or not it had a ticket attached to it.

 

a graph showing the number of interrupts per day

A couple of interesting discoveries here:

  • I am not interrupted nearly as much as I think I am. If you throw out the obvious outlier of Day 12, the median is two interrupts per day. Not as bad as I would have thought… but it is not that great either considering with meetings and other obligations, I probably only have one period per day of uninterrupted time to focus on complex projects that is longer than two hours. Getting an interrupt during that time period is a pretty serious setback.
  • 48% of the interrupts were related to break/fix issues. The other 52% were what I call “communication interrupts”. More on these later.
  • Of the 31 break/fix interrupts I recorded, only one actually had a ticket already associated with it. This is mind-bogglingly terrible as far as accepted time management best practices go.
  • Only 41% of the interrupts required immediate action versus action that could be queued and prioritized later. This means 59% of these interruptions really did not need to be interruptions at all, they needed to be tickets or agenda items in a meeting.
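For anyone who wants to repeat the experiment, the numbers above boil down to a handful of ratios over the log. Here is a small sketch; the column names and the six sample rows are invented stand-ins for my spreadsheet, not the actual data. One caveat: days with zero interrupts never show up in a log like this, so the median only covers days that had at least one.

```python
from collections import Counter
from statistics import median


def summarize(log):
    """Summarize an interrupt log given as a list of dicts."""
    per_day = Counter(entry["day"] for entry in log)
    break_fix = [e for e in log if e["kind"] == "break/fix"]
    with_ticket = sum(e["has_ticket"] for e in break_fix)
    return {
        "median_per_day": median(per_day.values()),
        "break_fix_share": len(break_fix) / len(log),
        "ticket_capture": with_ticket / len(break_fix) if break_fix else 0.0,
        "deferrable_share": sum(not e["immediate"] for e in log) / len(log),
    }


# Hypothetical log rows.
log = [
    {"day": 1, "kind": "break/fix", "has_ticket": False, "immediate": True},
    {"day": 1, "kind": "communication", "has_ticket": False, "immediate": False},
    {"day": 2, "kind": "break/fix", "has_ticket": True, "immediate": False},
    {"day": 3, "kind": "communication", "has_ticket": False, "immediate": False},
    {"day": 3, "kind": "break/fix", "has_ticket": False, "immediate": True},
    {"day": 3, "kind": "communication", "has_ticket": False, "immediate": False},
]

summary = summarize(log)
print(summary)
```

With my real log, the ticket capture figure is one ticket out of 31 break/fix interrupts, or roughly 3%.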

Even with my back-of-the-napkin math these are pretty damning conclusions. Our ticket system capture rate, at least at Tier-3 where I spend most of my time, is laughably non-existent. Going back to my previous post about creating tickets for customers, there would be no reason for me to do so considering I would be creating 97% of the tickets myself. Interestingly enough, it is not like these requests just vanish into thin air. They get tracked and recorded somehow, albeit most likely by individual staff in a manner that is not portable or visible. The work required to track issues is still being done, just the organization is not getting a lot of value out of it since it is just recorded in some crusty senior sysadmin’s logbook.

As an aside, it would be really interesting to perform the same experiment at the Tier-1 and Tier-2 support levels and see what the ratio is. Maybe it is higher and it is more the kind of issues and/or customers that Tier-3 deals with that lead to low ticket system capture. Or worse maybe it is the same and ticket system capture is just really bad everywhere.

Almost two-thirds of these interrupts did not have to be interrupts. They did not require immediate action. This is also pretty terrible because interrupts cost a lot. The accepted wisdom is that it takes between 10 and 20 minutes to get back in “the zone” (source) after being interrupted. For myself that is about 40 minutes lost on average per day just in dealing with context switching from one task to another. There is also a greater risk of mistakes being made trying to regain focus on your task. It is harder to calculate this cost but it is assuredly there.
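The context-switching arithmetic is easy to redo with your own numbers. A tiny sketch, using the median of two interrupts per day and the 10 to 20 minute recovery range quoted above (the function name is just for illustration):

```python
def daily_refocus_cost(interrupts_per_day, recovery_minutes=(10, 20)):
    """Return the (low, high) minutes per day spent getting back in the zone."""
    low, high = recovery_minutes
    return interrupts_per_day * low, interrupts_per_day * high


low, high = daily_refocus_cost(2)
print(f"{low} to {high} minutes per day")  # 20 to 40 minutes per day
```

So the 40 minute figure sits at the top of the range for a median day, and it only counts refocusing time, not the time spent on the interrupt itself.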

Finally, a bit over half of these interrupts were “communication interrupts”. These are hard to pin down but mostly they were a team member or a customer wanting to communicate something to me, “Hey, I fixed item A on Server 1, but item B and item C are still broken” or a request for information, “Hey, how do items Z and Y work on Server 2 and Server 3?”. These clearly have an interrupt cost but on the other hand, they also have a cost in not immediately jumping to the top of my to-do pile. If someone needs information from me it is likely because they are in a wait-cycle – they need to know something to continue their current task. It feels like it boils down to a “whose time is more valuable?” argument. Is it better for a Tier-2 team member to burn 45 minutes instead of 10 on a task because she had to dig up some information instead of asking a Tier-3 sysadmin? Or is it better for the Tier-3 sysadmin to burn 20 minutes to save his Tier-2 team member 35? I do not really know if there is an answer here but it is an interesting thread to pull on…
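One way to frame the "whose time is more valuable?" question is to put a weight on the helper's minutes and see where the trade flips. This is only a sketch: the 45, 10 and 20 minute figures are from the example above, while the weight is a made-up knob, not something I have measured.

```python
def net_minutes_saved(solo_minutes, assisted_minutes, helper_minutes, helper_weight=1.0):
    """Positive means interrupting the helper is cheaper for the team overall.

    helper_weight is a hypothetical multiplier for how much more the
    helper's time is worth than the asker's.
    """
    return solo_minutes - (assisted_minutes + helper_weight * helper_minutes)


def break_even_weight(solo_minutes, assisted_minutes, helper_minutes):
    """Helper weight at which asking and not asking cost the same."""
    return (solo_minutes - assisted_minutes) / helper_minutes


print(net_minutes_saved(45, 10, 20))                     # 15.0
print(net_minutes_saved(45, 10, 20, helper_weight=2.0))  # -5.0
print(break_even_weight(45, 10, 20))                     # 1.75
```

With equal weights the interrupt saves the team 15 minutes. If you value the Tier-3 minutes at more than 1.75 times the Tier-2 minutes, it stops paying off, and that is before counting what the interrupted project work was worth.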

The interrupts that served to communicate information to me seem a little more clear cut. None of them required immediate action and most of them were essentially status updates. Implementing a daily standup meeting or something similar would be a perfect format for these kinds of interactions.

 

Well. It was an interesting little experiment. I am not sure if I am smarter or better off for it but curiosity is an itch that sometimes just needs to be scratched.

Until next time, stay frosty.

 

A Ticket Too Far… Breaking the Broken

A funny thing happened a while back: one of my managers asked me to stop creating tickets on the behalf of customers. This, uh, well, this kind of made me pause for a few reasons. The first and most obvious one is that I cannot remember shit. I always feel terrible when I forget someone’s request and I feel doubly terrible when I forgot it due to an oversight as simple as not getting a ticket. The second is that it is generally considered a Good Thing (TM) to track your customer requests. I won’t even bother supporting that proposition because Tom Limoncelli has pretty much got that covered in Time Management for System Administrators.

The justification for this directive is pretty simple and common-sense and is a great example of how a technical person like me with the best of intentions can actually develop some self-sabotaging behavior.

  • Tickets created for customers by myself with my notes in them are confusing to Tier-1/Tier-2 support folks. It looks like I created the ticket but forgot to own it and I am still working the issue when in actuality I bumped the request all the way back down to Tier-1 where it should have started. Nothing makes a ticket linger in limbo longer than looking like someone is working it but not being owned by anyone. This tendency for tickets to live in limbo is exacerbated because our ticket system does not support email notification.
  • Customers are confused when a Tier-1/Tier-2 person calls them after picking up a ticket from the queue and asks, “Hey there, I am calling about Request #234901 and your <insert issue here>”.
  • Finally and most importantly, it does nothing to help correct the behavior of customers and teach them the one true way to request assistance from IT by submitting a ticket.

OK. Rebuttal time! (Which sounds kind of weird when you say it out loud). The first two points are largely an artifact of our ticketing system and/or its implementation.

The ticket queue is actually a generic user in the ticket system that tickets can get assigned to by customers. There is no notification when a ticket is created and assigned to this queue, nor any when a ticket is assigned to you. The lack of notification requires a manager or a lead on our team to police the queue, assign tickets to line staff based on who they think is best suited to work a particular issue and then finally notify them via email, phone or in-person.

The arguably confusing series of events where a ticket is created on behalf of a user is again, mainly a technical fault of the system. The requester is set to the customer but line staff that picks up the ticket may just read the notes which has my grubby hands all over them… so whose issue is it? Mine or the customer’s?

That being said – both of these points could largely be alleviated by a smarter ticket system that had proper notification and our Tier-1 guys reading the notes a little more carefully. I can forgive them their trespass since they are extremely interrupt driven and have a tendency to shoot tickets first and ask questions later but still, the appropriate context is there.

The last point, the idea that creating tickets reinforces bad end-user behavior, is by far the most salient one in my opinion. If you let people get away with not submitting tickets you are short-changing yourself and them. I won’t get credit for the work, the work won’t be documented, we won’t have accurate metrics and I am about 1000% more likely to forget the request and never do it.

Problem: We don’t have a policy requiring users to submit a ticket for a request, it’s more like a guideline. And the further up the support tiers you go, the fewer and fewer requests have tickets. This leaves my team in an interesting spot, we either create the ticket for the customer, tell the customer we won’t work the issue if they don’t create a ticket first (kind of a dick move, especially when there are no teeth to our policy) or not create a ticket at all.

Conclusion: Right idea but we are still focused on the symptom and not the cause. Let’s review.

  • The ticket system has technical deficiencies that lead to less than ideal outcomes. It is cumbersome for both technical staff and customers to use and relies on staff doing the very thing ticket systems are supposed to reduce: interrupting people to let them know they have work assigned to them.
  • A policy is not useful if it does not have teeth. I already feel like a jerk telling a customer “Hey. I am working with another team/customer/whatever but if you submit a ticket someone will take a look”, when they are standing in my cubicle with big old doe eyes. I especially feel like a jerk when I do not even have a policy backing me up. Paraphrasing Tom Limoncelli: “Your customers judge your competency by your availability. Your manager judges it by your completion of projects. These two requirements are directly opposed and balancing them is incredibly important.”
  • By the time I am creating a ticket on behalf of a customer the battle is already lost. I’ve already been interrupted with all the lost efficiency and the danger of mistakes that comes with it.
  • The customers that do not submit tickets get preferential treatment. They get to jump ahead of all the people that actually did submit tickets which hardly seems fair. All that is happening here is that we are encouraging the squeaky wheels to squeak louder.
  • The escalation chain gets skipped. A bunch of these kinds of issues should be caught at Tier-1 and Tier-2. By skipping right to Tier-3, we are not applying our skills optimally and are also depriving the Tier-1 and Tier-2 guys of the chance to chew on a meatier problem. A large part of the reason I am creating tickets for customers is to bump the request back down to Tier-1 and Tier-2 where it should have been dealt with to begin with.

Creating tickets on behalf of customers is not the problem. It is a symptom of deeper issues in Process and Technology. These issues will not be resolved by no longer generating tickets for customers. Customers will still skip the escalation chain, we will continue to reinforce bad behavior, fewer issues will get recorded and our Tier-2 and Tier-3 will still be interrupt driven regardless of whether there is a ticket or not. All that will change is that we will be more likely to forget requests.

The technical problems can be resolved by implementing a new ticket system or by fixing our existing one. The policy problems can be solved by creating a standardized policy for all our customers and then actually ensuring that it has teeth. The people problems can be fixed by consistent and repeated re-training.

That covers the root cause but what about now? What do we do?

  • We create the ticket for the customer – We cannot really do that. It disobeys a directive from leadership and it has all the problems discussed above.
  • We tell the customer to come back with a ticket – This does not really address the root cause, annoys the customers and we do not have policy backing it up. It is not really an option.
  • Do not use a ticket to track the request – And here we are by process of elimination. If things are broken sometimes the best way to fix them is to let them break even further.

Until next time . . .

When the world gets bad enough, the good go crazy, but the smart…they go bad. – Evil Abed

The Art and Burden of Documentation

I have been thinking a lot about documentation lately, mostly about my own shortcomings and trying to understand why the act of documenting seems so difficult and why the quality of the documentation that does get done is often found lacking. Good documentation and good documentation practices are such a fundamental part of the health of an IT shop you would think we as a field would be better at it. My experience is limited and anecdotal (whose is not?) but I have yet to see a shop with solid documentation and solid documentation practices. This extends to myself as well. I can look back at my various positions and roles and there are very few where I actually felt satisfied with the quality of my documentation.

Read on for “aksysadmin’s made-up principles of how to not suck at documentation and do other things good too”.

 

1. Develop a standardized format, platform and process from the bottom up. Your team uses this, not you.

Leadership has a tendency to standardize on a single standard, platform or process. This is generally considered a good thing. The problem is, leadership does not write technical documentation. We do. And what standard makes sense to them, may not make any sense to the technical staff (*cough* ITIL *cough*). What platform seems adequate to them, may seem unwieldy to sysadmins (*cough* SharePoint *cough*).  Standardization may generally be considered a good thing but forcing a standard, platform or process on a team without input or understanding their problem domain and use case is generally considered a bad thing. The harder you make it for your team to document, the less likely they will be to perform a task they are already unlikely to perform.

2. Don’t document how, document why

This is part internal challenge (IT staff documenting the wrong things) and part external challenge (leadership requiring the wrong kind of things to be documented). I see lots of documentation that is essentially a re-hashed version of a vendor’s manual. Ninety-nine percent of the time your vendor has exhaustive resources on how to do something. It is right there. In the manual. Go read it. Unless it is incredibly unintuitive, and sometimes it is, why would you waste your precious time re-writing an authoritative set of information into a non-authoritative set that requires your team to maintain it? Reading and understanding vendor documentation should be considered a fundamental skill. If your guys cannot read or are unwilling to read vendor manuals you have other problems that need addressing.

What you should document is why you did things. In six months you will not remember why this particular group was set up or why things are this way instead of that way, and your successor certainly will have no idea. Use your documentation to provide context and meaning.

3. Document where to find things

Documenting why something is the way it is is great but it is also important to document where things are. I am talking about things like IP Addresses, Organizational Charts, Passwords and so on. This is another opportunity to avoid work, err work more efficiently. Chances are many of these things have authoritative sources maintained by other people or tools. Why write your IP Addresses down manually in an Excel Spreadsheet when you can use a tool like IPAM to track them? Why track the phone tree for your different workgroups when Active Directory can do that for you? Why spend time doing stuff that is already done? Why indeed?

Figuring out what stuff to document in the where category can be hard to do. I have found the easiest way to do this is to pretend you are brand new. Better yet, if you have a brand new team member ask him to track these kinds of information requests as he acquaints himself with your particular little piece of hell. What does he need to know right now to do your job? That is what your replacement will be asking himself after you have ascended.

4. Don’t document break/fix issues

Do not fill up your Wiki, SharePoint, OneNote or file share of Word .docx files with break/fix issues. Your infrastructure and process documentation should be broad and “provide context and meaning”, which is pretty much the opposite of the kind of information break/fix issues are about – specific configurations, systems or problems.

You already have a place to “document” break/fix issues – it is called your damn ticket system. Use it. Document your fixes in your tickets. If your Tier-1 guys have a habit of closing all but the simplest of tickets with “done” or “fixed”, slap them (and probably yourself as well) and say that their future self just came back in time to hit them for making their job harder. If you do not have a ticket system then you have other problems that need to be addressed.
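If you would rather find the “done” and “fixed” tickets before your future self does, something like the following could work. This is only a rough sketch: it assumes you can export closed tickets as a list of dictionaries with `id` and `resolution` fields, and those field names, the `LAZY_NOTES` word list and the five-word threshold are all made up for illustration rather than taken from any particular ticket system.

```python
import re

# Hypothetical list of closing words that tell your future self nothing.
LAZY_NOTES = {"done", "fixed", "resolved", "closed", "ok"}


def is_terse(note: str, min_words: int = 5) -> bool:
    """Return True if a closing note is empty, very short, or made up only of brush-off words."""
    words = re.findall(r"[A-Za-z0-9']+", note.lower())
    return len(words) < min_words or all(w in LAZY_NOTES for w in words)


def flag_terse_closures(tickets):
    """Return the IDs of tickets whose resolution note is too thin to be useful later."""
    return [t["id"] for t in tickets if is_terse(t.get("resolution", ""))]


if __name__ == "__main__":
    # Made-up sample data; a real export would come from your ticket system.
    sample = [
        {"id": 101, "resolution": "done"},
        {"id": 102, "resolution": "Reset the user's AD password and cleared cached credentials on the laptop."},
        {"id": 103, "resolution": ""},
    ]
    print(flag_terse_closures(sample))
```

Word count is a crude proxy for a useful note, so treat the flagged list as a starting point for a conversation, not a scorecard.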

5. Have a panic page

Take the really important stuff from the why documentation and the where documentation and make it into a panic page. A panic page is a short piece of documentation that contains all the information you or anyone else would need to have in order to deal with a “whoops” situation. Think things like vendor contact phone numbers, contract support entitlements, how to file and escalate a case and maybe where to find your co-worker’s emergency scotch. I borrowed this one from my supervisor and it turned out to be a prescient suggestion on his part.

6. Have a hard copy

This is an extension of the panic page principle. Panic situations have a way of making electronic documentation inaccessible. “Oh but wait my documentation is in the cloud, I can get to it anywhere with my mobile device, oh I am so smart” you say. Yeah well, you will be screwed when you drop your iPad or it runs out of battery or you happen to live in Alaska, which has comparable infrastructure to, say, Afghanistan. Have a paper copy on hand, preferably two. Yes it will be harder to maintain but it will be a lot better than having no documentation if your file server explodes or the polar bears take over your data center.

7. Have designated documentation days

As a sysadmin, you generally do not have the luxury of setting your own priorities. If your leadership wants documentation, instead of just saying “Hey, we need to document better” at your weekly staff meeting, they need to make it a priority. Nothing does this better than designating a day for documentation. Read-Only Friday is a good one because you are not making changes on Friday anyway, right… righhhhtttt? Of course, you are still going to get interruptions and tickets so designate one person as the interruption blocker and another team member as the documenter (borrowed from Tom Limoncelli’s excellent Time Management for System Administrators). Rotate individuals as appropriate. These designated documentation days are your time, to make time, to actually Get Shit Done. All those little notes you meant to flesh out with some more context but never had time to… do it now. Organize your stuff. Clean it up. Review it for accuracy. Do this with a frequency related to how fast things change. Until leadership makes it a priority, you will always have another one that trumps it.

 

These ideas address some of those external and internal challenges you may have and that I know I have. I am more inclined to document stuff if I am not documenting dumb stuff that is already documented elsewhere. I will have an easier time finding the documentation we do have, if it is organized in a way that works for me and my team, both the consumer and producer of it. If I have dedicated time to actually perform the act of documenting it will probably get done. If not, then I am answerable to someone. Of course there can be a large gap between knowing what needs improvement and actually fixing it. Until then.

Stay frosty.