
Morale, Workload and Tickets – A Follow-Up

You guys remember Elaine, right? Six months ago (Morale Is Low, Workload Is Up) I looked into some ticketing system data to try to figure out what was going on in our team and how our seemingly ever-increasing workload was being distributed among staff. We unfortunately came to some alarming conclusions and hopefully mitigated them. I checked in with her recently to see how she was doing. Her response?

Things are just great. Just greatttt…

Let’s dig back in and see what we find. Here’s what I’m hoping we see:

  • Elaine’s percentage of our break/fix work drops to below 20%. She was recently promoted to our tier-2 staff and her skill set should be dedicated more towards proactive work and operational maintenance.
  • George and Susan have come into their own and between the two of them are managing at least 60% of the ticket queue. They’re our front-line tier-1 staff so I would expect the majority of the break/fix work to go to them, and they escalate tickets as necessary to other staff.
  • Our total ticket volume drops a bit. I don’t think we’re going to get back to our “baseline” from 2016 but hopefully August was an outlier.

 

Well, shit. That’s not what I was hoping to see.

That’s not great, but not entirely unexpected. We did take over support for another 300 users in August, so I would expect workload to increase, which it has, roughly doubling. However, it is troubling because theoretically we are standardizing that department’s technology infrastructure, which should lead to a decline in reactive break/fix work. Standardization should generate returns in reduced workload. If we fold them into our existing centralized and automated processes, the same labor hours we are already spending will just go that much further. If we don’t do that, all we have done is add more work that, while tactically different, is strategically identical. This is really a race against time – we need to standardize management of this department before our lack of operational capacity catches up with us and causes deeper systemic failures, pushing us so far down the “reactive” side of operations that we can’t climb back up. We’re in the Danger Zone.

 

This is starting to look bad.

Looking over our ticket load per team member, things start to look bleaker. Susan and George are definitely helping out, but the only two months where Elaine’s ticket counts are close to theirs were when she was out of the office for a much-needed extended vacation. Elaine is still owning more of the team’s work than she should, especially now that she’s nominally in a tier-2 position. Let’s also remember that in August, when responsibility for those additional 300 users was moved to my team along with Susan and George, we lost two more employees in the transition. That works out to a 20% reduction in manpower, and that counts our manager as a technical asset (which is debatable). If you look at just the reduction in line staff it is even higher. This is starting to look like a recipe for failure.

 

Yep. Confirmed stage 2 dumpster fire.

Other than the dip in December and January when Elaine was on vacation, things look more or less the same. Here’s another view of just the tier-1 (George, Frank and Susan) and tier-2 (Elaine and Kramer) staff:

Maybe upgrade this to a stage 3 dumpster fire?

I think this graph speaks for itself. Elaine and Susan are by far doing the bulk of the reactive break/fix work. This has serious consequences. There is substantial proactive automation work that only Elaine has the skills to do. The more of that work gets delayed to resolve break/fix issues, the more reactive we become and the harder it is to do the proactive work that prevents things from breaking in the first place. You can see how quickly this can spiral out of control. We’re past the Danger Zone at this point. To extend the Top Gun metaphor – we are about to stall (“No! Goose! Noo! Oh no!”). The list of options that I have, lowly technical lead that I am, is getting shorter. It’s getting real short.

In summation: Things are getting worse and there’s no reason to expect that to change.

  • Since August 2016 we have lost three positions (30% reduction in workforce). Since I started in November 2014 we have seen a loss of eight positions (50% reduction in workforce).
  • Our break/fix workload has effectively doubled.
  • We have had a change in leadership and a refocusing of priorities on service delivery over proactive operational maintenance, which makes sense because the customers are starting to feel the friction. Of course, with limited operational capacity, putting off preventive maintenance for too long is starting to get risky.
  • We have an incredibly uneven distribution of our break/fix work.
  • Our standardization efforts for our new department are obviously failing.
  • It seems less likely every day that we are going to be able to climb back up the reactive side of the slope we are on with such limited resources and little operational capacity.

Until next time, stay frosty.

 

Morale Is Low, Workload Is Up

Earlier this month, I came back from lunch and I could tell something was off. One of my team members, let’s call her Elaine, who is by far the most upbeat, relentlessly optimistic and quickest to laugh off any of our daily trials and tribulations, was silent, hurriedly moving around and uncharacteristically short with customers and coworkers. Maybe she was having a bad day, I wondered, as I made a mental note to keep tabs on her for the week to see if she bounced back to her normal self. When her attitude didn’t change after a few days, I got really worried.

Time to earn my team lead stripes, so I took her aside and asked her what was up. I could hear the steam venting as she started with, “I’m just so f*****g busy”. I decided to shut up and listen as she continued. There was a lot to unpack: she was under pressure to redesign our imaging process to incorporate a new department that got rolled under us, she was handling the majority of our largely bungled Office 365 Exchange Online post-migration support and she was still crushing tickets on the help desk with the best of them. The straw that broke the camel’s back – spending a day cleaning up her cubicle, which was full of surplus equipment, because someone commented that our messy work area looked unprofessional… “I don’t have time for unimportant s**t like that right now!” she said as she continued furiously cleaning.

The first thing I did was ask her what the high-priority task of the afternoon was and figure out how to move it somewhere else. Next, I recommended that she finish her cleaning, take off early and then take the next day off. When someone is that worked up, myself included, generally a great place to start is to get some distance between you and whatever is stressing you out until you decompress a bit.

Next I started looking through our ticket system to see if I could get some supporting information about her workload that I could take to our manager.
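For context, the analysis itself was nothing fancy. The sketch below is roughly the kind of tally I pulled, assuming the ticket system can export a CSV; the file name and the column names (“created”, “assigned_to”) are hypothetical, so adjust them to whatever your system actually produces.

```python
# Hypothetical sketch: count tickets per technician per month from a CSV export.
# Assumes columns named "created" (YYYY-MM-DD) and "assigned_to"; adjust to taste.
import csv
from collections import Counter, defaultdict
from datetime import datetime

monthly = defaultdict(Counter)  # e.g. {"2017-08": Counter({"Elaine": 57, ...})}

with open("tickets.csv", newline="") as f:
    for row in csv.DictReader(f):
        month = datetime.strptime(row["created"], "%Y-%m-%d").strftime("%Y-%m")
        monthly[month][row["assigned_to"]] += 1

for month in sorted(monthly):
    total = sum(monthly[month].values())
    shares = ", ".join(
        f"{tech}: {count} ({count / total:.0%})"
        for tech, count in monthly[month].most_common()
    )
    print(f"{month}  total={total}  {shares}")
```

A per-month total plus each person’s percentage share is all it takes to see both the overall trend and how unevenly the work is landing.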

Huh. Not a great trend.

That’s an interesting uptick that just so happens to coincide with us taking over the support responsibilities for the previously mentioned department. We did bring their team of four people over but only managed to retain two in the process. Our workload increased substantially too, since we not only had to continue to maintain the same service level but also had the additional challenge of performing discovery, taking over the administration and standardizing their systems (I have talked about balancing consolidation projects and workload before). It was an unfortunate coincidence that we had to schedule our Office 365 migration at the same time due to a scheduling conflict. Bottom line: we increased our workload by a not insignificant amount and lost two people. Not a great start.

I wonder how our new guys (George and Susan) are doing. Let’s take a look at the ticket distribution, shall we?

Huh. Also not a great trend.

Back in December 2016, it looks like Elaine started taking on more and more of the team’s tickets. August of 2017 was clearly a rough month for the team as we started eating through all that additional workload, but noticeably that workload was not being distributed evenly.

Here is another view that I think really underlines the point.

Yeah. That sucks for Elaine.

As far back as a year ago Elaine was handling about 25% of our tickets, and since then her share has increased to close to 50%. What makes this worse is that not only has the absolute quantity of tickets in August more than doubled compared to the average of the 11 preceding months, but the relative percentage of her contribution has doubled as well. This is bad and I should have noticed a long time ago.

Elaine and I had a little chat about this situation and here’s what I distilled out of it:

  • “If I don’t take the tickets they won’t get done”
  • “I’m the one that learns new stuff as it comes along so then I’m the one that ends up supporting it”
  • “There’s too many user requests for me to get my project work done quickly”

Service Delivery and Business Processes. A foe beyond any technical lead.

This is where my power as a technical lead ends. It takes a manager or possibly even an executive to address these issues but I can do my best to advocate for my team.

The first issue is actually simple. Elaine needs to stop taking it upon herself to own the majority of the tickets. If the tickets aren’t in the queue then no one else will have the opportunity to take them. If the tickets linger, that’s not Elaine’s problem, that’s a service delivery problem for a manager to solve.

The second issue is a little harder since it is fundamentally about the ability of staff to learn as they go, be self-motivated and be OK with just jumping into a technology without any real guidance or training. Round after round of budget cuts has decimated our training budget and increased our tempo to the point where cross-training and knowledge sharing are incredibly difficult. I routinely hear, “I don’t know anything about X. I never had any training on X. How am I supposed to fix X!” from team members, and as sympathetic as I am about how crappy a situation that is, there is nothing I can do about it. The days of being an “IT guy” who can work down The Big Blue Runbook of Troubleshooting are over. Every day something new that you have never seen before is broken and you just have to figure it out.

Elaine is right though – she is punching way above her weight, the result of which is that she owns more and more of the support burden as technology changes and as our team fails to evenly adopt the change. A manager could request some targeted training or maybe some force augmentation from another agency or contracting services. Neither is a particularly likely outcome given our budget, unfortunately.

The last one is a perennial struggle of the sysadmin: your boss judges your efficacy by your ability to complete projects; your users (and thus your boss’s peers via the chain of command) judge your efficacy by your responsiveness to service requests. These two standards are in direct competition. This is such a common and complicated problem that there is a fantastic book about it: Time Management for System Administrators.

The majority of the suggestions to help alleviate this problem require management buy-in, and most of them our shop doesn’t have: an easy-to-use ticket system with notification features, a policy stating that tickets are the method of requesting support in all but the most exigent of circumstances, a true triage system, a rotating interrupt-blocker position and so on. The best I can do here is recommend that Elaine develop some time management skills, work on healthy coping habits (exercise, walking, taking breaks, etc.) and do regular one-on-one sessions with our manager so she has a venue for discussing these frustrations privately, so that even if they cannot be solved they can at least be acknowledged.

I brought a sanitized version of this to our team manager and we made some substantial progress. He reminded me that George and Susan have only been on our team for a month and that it will take some time for them to come up to speed before they can really start eating through the ticket queue. He also told Elaine that while her tenacity in the ticket queue is admirable, she needs to stop taking so many tickets so the other guys have a chance. If they linger, well, we can cross that bridge when we come to it.

The best we can do is wait and see. It’ll be interesting to see what happens as George and Susan adjust to our team and how well the strategy of leaving tickets unowned to encourage team members to grab them works out.

Until next time, stay frosty.

 

Prometheus and Sisyphus: A Modern Myth of Developers and Sysadmins

I am going to be upfront with you. You are about to read a long and meandering post that will seem a little too whiny at times as I talk some crap about our developers and their burdens (applications). I like our dev teams and I like to think I work really well with their leads, so think of this post as a bit of satirical sibling rivalry; underneath the hyperbole and good-natured teasing there might be a small, “little-t” truth.

That truth is that operations, whether it’s the database administrator, the network team, the sysadmins or the help desk, always, always, always gets the short straw and that is because collectively we own “the forest” that the developers tend to their “trees” in.

I have a lot to say about the oft-repeated sysadmin myth about “how misunderstood sysadmins are” and how they just seem to get stepped on all the time and so on and so on. I am not a big fan of the “special snowflake sysadmin syndrome” and I am especially not a fan of it when it is used as an excuse to be rude or unprofessional, but that being said, I think it is worth stating that even I know I am half full of crap when I say sysadmins always get the short straw.

OK, disclaimers are all done! Let’s tell some stories!

 

DevOps – That means I get Local Admin, right?

My organization is quite granular and each of our departments more or less maintains its own development team supporting its own mission-specific applications, along with either a developer who essentially fulfills an operations role or a single operations guy doing support solely for that department. The central ops team maintains things like the LAN, Active Directory, the virtualization platform and so on. If the powers that be wanted a new application for their department, the developers would request the required virtual machines, the ops team would spin up a dozen VMs off of a template, join them to AD, give the developers local admin and off we go.

Much like Bob Belcher, all the ops guys could do is “complain the whole time”.

 

This arrangement led to some amazing things that break in ways too awesome to truly describe:

  • We have an in-house application that uses SharePoint as a front-end, calls some custom web services tied to a database or two that auto-populates an Excel spreadsheet that is used for timekeeping. Everyone else just fills out the spreadsheet.
  • We have another SharePoint integrated application, used ironically enough for compliance training, that passes your Active Directory credentials in plaintext through two or three servers all hosting different web services.
  • Our deployment process is essentially to copy everything off your workstation onto the IIS servers.
  • Our revision control is: E:\WWW\Site, E:\WWW\Site (Copy), E:\WWW-Site-Dev McDeveloper
  • We have an application that manages account on-boarding, a process that is already automated by our Active Directory team. Naturally, they conflict.
  • We had, at one point in time, four or five different backup systems, all of which used BackupExec for some insane reason and three of which backed up the same data.
  • We managed to break a production IIS server by restoring a copy of the test database.
  • And then there’s Jenga: Enterprise Edition…

 

Jenga: Enterprise Edition – Not so fun when it needs four nines of uptime.

A satirical (but only just) rendering of the design pattern of one of our applications, which I call “The Spider Web”

What you are looking at is my humorous attempt to scribble out a satirical sketch of one of our line-of-business applications, which actually managed to turn out pretty accurate. The Jenga application is so named because all the pieces are interconnected in ways that turn the prospect of upgrading any of it into the project of upgrading all of it. Ready? ’Ere we go!

It’s built around a core written in a language that we have not had any on-staff expertise in for the better part of ten years. In order to provide the functionality that the business needed as the application aged, the developers wrote new “modules” in other languages that essentially just call APIs or exposed services and then bolted them on. The database is relatively small, around 6 TB, but almost 90% of it is static read-only data that we cannot separate out, which drastically reduces what our DBA and I can do in terms of recovery, backup, replication and performance optimization. There are no truly separate development or testing environments, so we use snapshot copies to expose what appear to be “atomic” copies of the production data (which contains PII!) on two or three other servers so our developers can validate application operations against it. We used to do this with manual fricking database restores, which was god damned expensive in terms of time and storage. There are no fewer than eight database servers involved, but the application cannot be distributed or set up in some kind of multi-master deployment with convergence, so staff at remote sites suffer abysmal performance if anything resembling contention happens on their shared last-mile connections. The “service accounts” are literally user accounts that the developers use to RDP to the servers, start the application’s GUI, and then enable the application’s various services by interacting with the above-mentioned GUI (any hiccup in the RDP session and *poof* there goes that service). The public-facing web server directly queries the production database. The internally consumed pieces of the application and the externally consumed pieces are co-mingled, meaning an outage anywhere is an outage everywhere. It also means we cannot segment the application into public- and internally-facing pieces. The client requires a hard-coded drive map to run since application upgrades are handled internally with copy jobs which essentially replace all the local .DLLs on a workstation when new ones are detected, and last but not least it runs on an EOL version of MSSQL.

Whew. That was a lot. Sorry about that. Despite the fact that a whole department pretty much lives or dies by this application’s continued functionality, our devs have not made much progress in re-architecting and modernizing it. This really is not their fault, but it does not change the fact that my team has an increasingly hard time keeping this thing running in a satisfactory manner.

 

Operations: The Digital Custodian Team.

In the middle of a brainstorming session where we were trying to figure out how to move Jenga to a new virtualization infrastructure, all on a weekend when I would be traveling in order to squeeze the outage into the only period within the next two months that was not going to be unduly disruptive, I began to feel like my team was getting screwed. They have more developers supporting this application than we have in our whole operations team and it is on us to figure out how to move Jenga without losing any blocks or having any lengthy service windows? What are those guys actually working on over there? Why am I trying to figure out which missing .DLL from .NET 1.0 needs to be imported onto the new IIS 8.5 web servers so some obscure service that no one really understands runs in a supported environment? Why does operations own the life-cycle management? Why aren’t the developers updating and re-writing code to reflect the underlying environmental and API changes each time a new server OS is released with a new set of libraries? Why are our business expectations for application reliability so wildly out of sync with what the architecture can actually deliver? Just what in the hell is going on here!

Honestly, I don’t know, but it sucks. It sucks for the customers, it sucks for the devs, but mostly I feel like it sucks for my team because we have to support four other line-of-business applications. We own the forest, right? So when a particular tree catches on fire they call us to figure out what to do. No one mentions that we probably should expect trees wrapped in paraffin wax and then doused in diesel fuel to catch on fire. When we point out that tending trees in this manner probably won’t deliver the best results if you want something other than a bonfire, we get met with a vague shrug.

Is this how it works? Your team of rockstar, “creative-type” code-poets whips up some kind of amazing business application, celebrates and then hands it off to operations, where we have to figure out how to keep it alive as the platform and code base age into senility for the next 20 years? I mean, who owns the on-call phone for all these applications? Hint: it’s not the dev team.

I understand that sometimes messes happen… just why does it feel like we are the only ones cleaning it up?

 

You’re not my Supervisor! Organizational Structure and Silos!

Bureaucratium ad infinitum.

 

At first blush I was going to blame my favorite patsy, Process Improvement and the insipid industry around it, for this current state of affairs, but after some thought I think the real answer here is something much simpler: the dev team and my team don’t work for the same person. Not even close. If we play a little game of “trace the organizational chart” we have five layers of management before we reach a position that has direct reports that eventually lead to both teams. Each one of those layers is a person – with their own concerns, motivations, proclivities and spin they put on any given situation. The developers and operations team (“dudes that work”), more or less, agree that the design of the Jenga application is Not a Good Thing (TM). But as each team gets told to move in a certain direction by each layer of management, our efforts and goals diverge. No amount of fuzzy-wuzzy DevOps or new-fangled Agile Standup Kanban Continuous Integration Gamification Buzzword Compliant bullshit is ever going to change that. Nothing makes “enemies” out of friends faster than two (or three or four) managers maneuvering for leverage and dragging their teams along with them.

I cannot help but wonder what our culture would be like if the lead devs sat right next to me and we established project teams out of our combined pool of developer and operations talent as individual departments put forth work. What would things be like if our developers were not chained to some stupid line-of-business application from the late ’80s, toiling away to polish a turd and implement feature requests like some kind of modern Promethean myth? What would things be like if our operations team was not constantly trying to figure out how to make old crap run while our budgets and staff are whittled away, snatching victory from defeat time and time again only to watch the cycle of mistakes repeat itself again and again like some kind of Sisyphean dystopia with cubicles? What if we could sit down together and I dunno… fix things?

Sorry there are no great conclusions or flashes of prophetic insight here, I am just as uninformed as the rest of the masses, but I cannot help but think, maybe, maybe we have too many chefs in the kitchen arguing about the menu. But then again, what do I know? I’m just the custodian.

Until next time, stay frosty.

Kafka in IT: How a Simple Change Can Take a Year to Implement

Public sector IT has never had a reputation for being particularly fast-moving or responsive. In fact, it seems to have a reputation for being staffed by apathetic, under-skilled workers toiling away in basements and boiler rooms supporting legacy, “mission-critical”, monolithic applications that sit half-finished and half-deployed by their long-gone and erstwhile overpaid contractors (*cough* Deloitte, CGI *cough*). This topic might seem familiar… Budget Cuts and Consolidation and Are GOV IT teams apathetic?

Why do things move so slowly, especially in a field that demands the opposite? I don’t have an answer to that larger question, but I do have an object lesson. Well, maybe what I really have is part apology, part explanation and part catharsis. Gather around and hear the tale of how a single change to our organization’s perimeter proxy devices took a year!

 

03/10

We get a ticket stating that one of our teams’ development servers is no longer letting them access it via UNC share or RDP. I assign one of our tier-2 guys to take a look and a few days later it gets escalated to me. The server will not respond to any incoming network traffic, but if I access it via console and send traffic out it magically works. This smells suspiciously like a host-based firewall acting up, but our security team swears up and down our Host Intrusion Protection software is in “detect” mode and I verified that we have disabled the native Windows firewall. I open up a few support tickets with our vendors and start chasing such wild geese as a layer-2 disjoint in our UCS fabric and “asymmetric routing” issues. No dice. Eventually someone gets the smart idea to move the IP address to another VM to try to narrow the issue down to either the VM or the environment. It’s the VM (of course it is)! These shenanigans take two weeks.

04/01

I finish re-platforming the development server onto a new Server 2012 R2 virtual machine. This in and of itself would be worth a post since the best way I can summarize our deployment methodology is “guess-and-check”. Anyway, the immediate issue is now resolved. YAY!

05/01

I rebuild the entire development, testing, staging and production stack and migrate everything over except the production server, which is publicly accessible. The dev team wants to do a soft cutover instead of just moving the IP address to the new server. This means we will need to have our networking team make some changes to the perimeter proxy devices.

05/15

I catch up on other work and finish the roughly ten pages of forms, diagrams and a security plan that are required for a perimeter device change request.

06/06

I open a ticket upstream, discuss the change with the network team and make some minor modifications to the ticket.

06/08

I filled out the wrong forms and/or I filled them out incorrectly. Whoops.

06/17

After a few tries I get the right forms and diagrams filled out. The ticket gets assigned to the security team for approval.

06/20

Someone from the security team picks up the ticket and begins to review it.

07/06

Sweet! Two weeks later my change request gets approval from the security team (that’s actually pretty fast). The ticket gets transferred back to the networking team, which begins to work on implementation.

07/18

I create a separate ticket to track the SSL/TLS certificate I will need for the HTTPS-enabled services on the server. This ticket follows a similar parallel process: documentation is filled out and validated, it goes to the security team for approval and then back to the networking team for implementation. My original ticket for the perimeter change is still being worked on.

08/01

A firmware upgrade on the perimeter devices breaks high availability. The network team freezes all new work until the issue is corrected (they start their internal change control process for emergency break/fix issues).

08/24

The server’s HTTPS certificate has to be replaced before it expires at the end of the month. Our devs’ business group coughs up the few hundred dollars. We had planned to use the perimeter proxies’ wildcard certificate for no extra cost, but oh well, too late.

09/06

HA restored! Wonderful! New configuration changes are released to the networking team for implementation.

10/01

Nothing happens upstream… I am not sure why. I call about once a week and hear, “We are swamped, two weeks until implementation. Should be soon.”

11/03

The ticket gets transferred to another member of the network team and within a week the configuration change is ready for testing.

11/07

The dev team discovers an issue. Their application is relying on the originating client IP address for logging and what basically amounts to “two-factor authentication” (i.e., a username is tied to an IP address). This breaks fantastically once the service gets moved behind a web proxy. Neat.

11/09

I work with the dev lead and the networking team to come up with a solution. It turns out we can pass the originating IP address through the proxies, but it changes the server-side variable that their code needs to reference.
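This is the classic reverse-proxy wrinkle: once a proxy terminates the connection, the socket’s remote address is the proxy itself, and the original client IP only survives if the proxy is configured to forward it in a header (commonly X-Forwarded-For, though the exact mechanism depends on the proxy product, so treat the header name here as an assumption). A minimal sketch of the kind of server-side change involved, in Python purely for illustration (the real application is not Python):

```python
# Illustrative only: how the "variable the code needs to reference" changes
# once a service sits behind a reverse proxy. Assumes the proxy forwards the
# original client address in an X-Forwarded-For header.

def client_ip(remote_addr: str, headers: dict) -> str:
    """Return the originating client IP as seen by an app behind a trusted proxy."""
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        # The header may carry a chain: "client, proxy1, proxy2".
        # The left-most entry is the original client. Only trust it when the
        # request really came from your own proxy tier.
        return forwarded.split(",")[0].strip()
    return remote_addr  # direct connection, no proxy in the path


# Before the proxy: the socket's remote address was the client itself.
print(client_ip("203.0.113.7", {}))                                # 203.0.113.7
# After the proxy: the remote address is the proxy; the client rides in the header.
print(client_ip("10.0.0.5", {"X-Forwarded-For": "203.0.113.7"}))   # 203.0.113.7
```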

11/28

Business leaders say that the code change is a no-go. We are about to hit their “code/infrastructure freeze” period that lasts from December to April. Fair enough.

12/01

We hit the code freeze. Things open back up again in mid-April. Looking ahead, I already have infrastructure work scheduled for late April and early May, which brings us right around to June: one year.

EDIT: The change was committed on 05/30 and we passed our rollback period on 06/14. As of 06/19 I just submitted the last ticket to our networking team to remove the legacy configuration.

 

*WHEW* Let’s take a break. Here’s doge to entertain you during the intermission:

 

My team is asking for a change that involves taking approximately six services that are already publicly accessible via a legacy implementation, moving those services to a single IP address and placing an application proxy between the Big Bad Internet and the hosting servers. Nothing too crazy here.

Here’s some parting thoughts to ponder.

  • ITIL. Love it or hate it, ITIL adds a lot of drag. I hope it adds some value.
  • I don’t mean to pick on the other teams but it clearly seems like they don’t have enough resources (expertise, team members, whatever they need they don’t have enough of it).
  • I could have done better with all the process stuff on my end. Momentum is important, so I probably should not have let some of that paperwork sit for as long as it did.
  • The specialization of teams cuts both ways. It is easy to slip from being isolated and siloed to just basic outright distrust, and when you assume that everyone is out to get you (probably because that’s what experience has taught you) then you C.Y.A. ten ways till Sunday to protect yourself and your team. Combine this with ITIL for a particularly potent blend of bureaucratic misery.
  • Centralized teams like networking and security that are not embedded in different business groups end up serving a whole bunch of different masters, all of whom are going in different directions and want different things. In our organization this seems to mean that the loudest, meanest person holding their feet to the SLA gets whatever they want at the expense of their quieter customers like myself.
  • Little time lags magnify delay as the project goes on. Two weeks in security approval limbo puts me four weeks behind a few months down the road, which means I then miss my certificate expiry deadline, which means I need to fill out another ticket, which puts me further behind, and so on ad infinitum.
  • This kind of shit is why developers are just saying “#YOLO! Screw you Ops! LEEEEROY JENKINS! We are moving to the Cloud!” and ignoring all this on-prem, organizational pain and doing DevDev (it’s like DevOps but it leads to hilarious brokenness in other new and exciting ways).
  • Public Sector IT runs on chaos, disorder and the frustrations of people just trying to Do Things. See anything ever written by Kafka.
  • ITIL. I thought it was worth mentioning twice because that’s how much overhead it adds (by design).

 

Until next time, may your tickets be speedily resolved.

The Big Squeeze, Predictions in Pessimism


Layoff notice or stay the hell away from Alaska when oil is cheap… – from r/sysadmin

 

I thought this would be a technical blog acting as a surrogate for my participation on ServerFault, but instead it has morphed into some kind of weird meta-sysadmin blog/soapbox/long-form reply to r/sysadmin. I guess I am OK with that…

Alaska is a boom and bust economy, and despite having a lot going for us fiscally, a combination of our tax structure, oil prices and the Legislature’s approach to the ongoing budget deficit means we are doing our best to auger our economy into the ground. Time for a bit of gallows humor to commiserate with u/Clovis69! The best part of predictions is that you get to see how hilariously uninformed you were down the road! Plus, if you are going to draw straws you might as well take bets on who gets the shortest one.

Be forewarned: I am not an economist, I am not even really that informed, and if you are my employer, future or otherwise, I am largely being facetious.

The Micro View (what will happen to me and my shop)

  • We will take another 15-20% personnel cuts in IT operations (desktop, server and infrastructure support). That will bring us to close to a 45% reduction in staff since 2015.
  • We will take on additional IT workload as our programming teams continue to lose personnel and consequently shed operational tasks they were doing independently.
  • We will be required to adopt a low-touch, automation-centric support model in order to cope with the workload. We will not have the resources to do the kind of interrupt-driven, in-person support we do now. This is a huge change from our current culture.
  • We will lean really hard on folks who know SCCM, PowerShell, Group Policy and other automation frameworks. Tier-2/Tier-3 will come under more pressure as the interrupt rate increases due to the reduction in Tier-1 staff.
  • Team members that do not adopt automation frameworks will find themselves doing whatever non-automatable grunt work there is left. They will also be more likely to lose their jobs.
  • We will lose a critical team member who is performing this increased automation work because they can simply get paid better elsewhere without having a budget deficit hanging over their head.
  • If we do not complete our consolidation work to standardize and bring siloed teams together before we lose what little operational capacity we have left, our shop will slip into full-blown reactive mode. Preventive maintenance will not get done and in two years’ time things will be Bad (TM). I mean like straight-up r/sysadmin horror story Bad (TM).
  • I would be surprised if I am still in the same role in the same team.
  • We will somehow have even more meetings.

The Macro View (what will happen to my organization)

Preliminary plans to consolidate IT operations were introduced back in early 2015. In short, our administrative functions, including IT operations, are largely decentralized and done at the department level. This leads to a lot of redundant work being performed, poor alignment of IT to the business goals of the organization as a whole, the inability to capture or recover value from economies of scale, and widely disparate resources, functionality and service delivery. At a practical level, what this means is there are a whole lot of folks like myself all working to assimilate new workload, standardize it and then automate it as we cope with staff reductions. We are all hurriedly building levers to help us move more and more weight, but no one has stopped to say, “Hey guys, if we all work together to build one lever we can move things that are an order of magnitude heavier.” Consequently, as valiant as our individual efforts are, we are going to fail. If I lose four people out of a team of eight, no level of automation that I can come up with will keep our heads above water.

At this point I am not optimistic about our chances for success. The tempo of a project is often determined by its initial pace. I have never seen an IT project in the public sector move faster as time goes on; generally it moves slower and slower as it grinds through the layers of bureaucracy and swims upstream against the prevailing current of institutional inertia and resistance. It has been over a year without any progress that is visible to the rank-and-file staff such as myself, and we only have about one, maybe two, years of money left in the piggy bank before we find that the income side of our balance sheet is only 35% of our expenses. To make things even more problematic, entities that do not want to give up control have had close to two years to actively position themselves to protect their internal IT.

I want IT consolidation to succeed. It seems like the only possible way to continue to provide a similar level of service in the face of a 30-60% staff reduction. I mean, what the hell else are we going to do? Are we going to keep doing things the same way until we run out of money, turn the lights off and go home? If it takes one person on my team to run SCCM for my 800 endpoints, and three people from your team to run SCCM for your 3000 endpoints, how much do you want to bet the four of them could run SCCM for all 12,000 of our endpoints? I am pretty damn confident they could. And this scenario repeats everywhere. We are bailing out our boats, and in each boat is one piece of a high-volume bilge pump, but we don’t trust each other and no one is talking and we are all moving in a million different directions instead of stopping, collectively getting over whatever stupid pettiness keeps us from actually doing something smart for once, and putting together our badass high-volume bilge pump. We will either float together or drown separately.

I seem to recall a similar problem from our nation’s history…

Benjamin Franklin's Join or Die Political Cartoon

Are GOV IT teams apathetic?

I have been stewing about this post on r/sysadmin, Is apathy a problem in most government IT teams?, for a while and felt like it was worth a quick write-up since most of my short IT career has been spent in the public sector.

First off, apathy and team dysfunction are problems everywhere. There is nothing unique about government employees versus private employees in that respect. What I think the poster is really asking is, “Is there something about government IT that produces apathetic teams?” and if you read a little deeper it seems like apathy really means “permanent discouragement”; that is to say, the condition where change, “doing things right or better” and greater efficiency are, or seemingly are, impossible. When you read something like, “…trying to make things more efficient is met with reactions like ‘oh you naive boy’ and finger pointing,” it is hard to think of it as just plain old vanilla apathy.

Government is not a business (despite what some people think). Programs operate at a loss and are subsidized, in many cases entirely, by taxes because the public and/or their representatives deem those programs worthy. The failure mechanism of market competition doesn’t exist. Incredibly effective programs can be cancelled because they are no longer politically favorable, and incredibly ineffective programs can continue or expand because they have political support. Furthermore, in all things public servants need to remain impartial, unbiased and above impropriety. This leads to vast and byzantine processes, the components of which singularly make eminent good sense (for example, the prohibition of no-bid contracts) but collectively all these well-intentioned barnacles slow the ship of state dramatically. Success is not rewarded with growth either. Implementing a more efficient process or a more cost-effective infrastructure and saving money generally results in less money. This tendency toward budget reduction (“Hey, if you saved it, you did not need it to begin with, right?”) turns high-functioning teams into disasters over time as they lose resources. Paradoxically, the better you are at utilizing your existing resources, the less you get. Finally, your entire leadership changes with every administration change. You may still be shoveling coal down in the engine room, but the new skipper just sent down word to reduce steam and come about hard in order to head in the opposite direction. Generally, private companies that do this kind of thing, with this frequency, do not last long.

How does all this apply to Information Technology? It means that your organization will move very, very slowly while technology moves very, very fast. Not a good combo.

 

Those are the challenges that a team faces but what about the other half of the equation… the people facing them?

Job classes are just one small part of this picture, but they are emblematic of some of the challenges that face team leads and managers when dealing with the ‘People’ piece of People, Process and Technology (ITIL buzzword detected! +5 points). The idea of job classes is that across the organization people doing similar work should be paid the same. The problem is that updating a job class is beyond onerous and the time to completion is measured in years. Do you know how quickly Information Technology reinvents itself? Really quickly. This means that job classes and their associated salaries tend to drift away from the actual on-the-ground work being done and the appropriate compensation level over time, making recruitment of new staff and retention of your best staff very difficult (The Dead Sea Effect). If you combine this with a lack of training and professional development, staff have a tendency to get pigeon-holed into a particular role without a clear promotion path. Furthermore, many of the job class series are disjointed in such a way that working at the top of one series will not meet the prerequisites for another, making advancement difficult and, at least on paper, sometimes impossible. For example: you could work as a Lead Programmer for three years leading a team of five people and not qualify, at least on paper, for an entry-level IT Manager position.

How does all this apply to Information Technology? People get stuck doing one job, for too long, with no professional training or mentorship. Their skillsets decline towards obsolescence and they become frustrated and discouraged.

 

I have never met anyone in the public sector who just straight up did not give a crap. I have met people who feel stuck, discouraged, marginalized and ignored. And rightly so. Getting stuff done is very hard. It is like everyone has one ingredient necessary to make a cake, and you all more or less agree on the recipe. You are all trained and experienced bakers. You could easily make a cake, but you each have 100 pieces of paperwork to fill out and wait on, sometimes for months, before you can do your part of the cake-baking process. You have 10 different bosses, each telling you to make a different dessert when you know that cakes are by far the best dessert for your particular bakery. Then you get yelled at for not making a cake in a timely manner, and then you are all fired and replaced by food service contractors whose parent company charges an exorbitant hourly rate. But hey, the public eventually got their cake, right? Or at least a donut. Not exactly what they ordered, but better than nothing… right?

If IT is a thankless job (and I am not sure I agree with that piece of Sysadmin mythology), then Public Sector IT is even more thankless. You will face a Kafkaesque bureaucracy. You will likely be very underpaid and have a difficult time seeking promotion. You will never be able to accept a vendor-provided gift or meal over the price of $25. You will laugh when people ask if you plan on attending VMworld. The public will stereotype you as lazy, ineffective and overpaid. But you will persevere. You have a duty to the body politic to do your best with what you have. You will keep the lights on, you will keep the ship afloat even as more and more water pours in. You have to. Because that’s what Systems Administrators do.

And all you wanted was to simply make a cake.