The HumbleLab: Storage Spaces with Tiers – Making Pigs Fly!

I have mixed feelings about homelabs. It seems ludicrous to me that, in a field that changes as fast as IT, employers do not invest in training. You would think on-the-clock time dedicated to learning would be an investment that pays itself back in spades. I also think there is something psychologically dangerous about working your 8-10 hour day and then going home and spending your evenings and weekends studying/playing in your homelab. Unplugging and leaving computers behind is pretty important; in fact, I find the more I do IT the less interest I have in technology in general. Something, something, make an interest a career and then learn to hate it. Oh well.

That being said, IT is a fast-changing field and if you are not keeping up one way or another, you are falling behind. A homelab is one way to do this, plus sometimes it is kind of nice to just do stuff without attending governance meetings or submitting to the tyranny of your organization’s change control board.

Being the cheapskate that I am, I didn’t want to go out and spend thousands of my own dollars on hardware like all the cool cats in r/homelab, so I just grabbed some random crap lying around work, partly to see how much use I could squeeze out of it.

Dell OptiPlex 990 (circa 2012)

  • Intel i7-2600, 3.4GHz, 4 cores, 8 threads, 256KB L2, 8MB L3
  • 16GB Non-ECC 1333MHz DDR3
  • Samsung SSD PM830, 128GB, SATA 3.0 Gb/s
  • Samsung SSD 840 EVO, 250GB, SATA 6.0 Gb/s
  • Seagate Barracuda, 1TB, SATA 3.0 Gb/s

The OptiPlex shipped with just the 128GB SSD, which only had enough storage capacity to host the smallest of Windows virtual machines, so I scrounged up the two other disks from other desktops that were slated for recycling. I am particularly proud of the Seagate because, if the datecode on the drive is to be believed, it was originally manufactured sometime in late 2009.

A bit of a pig huh? Let’s see if we can make this little porker fly.

A picture of the inside of HumbleLab

Oh yeah… look at that quality hardware and cable management. Gonna be hosting prod workloads on this baby.

I started out with a pretty simple/lazy install of Windows Server 2012 R2 and the Hyper-V role. At this point in time I only had the original 128GB SSD that the operating system was installed on, with the ancient Seagate being utilized for .VHD/.VHDX storage.

Performance was predictably abysmal, especially once I got a SQL VM set up and “running”:

IOmeter output

At this point, I added in the other 250GB SSD, destroyed the volume I was using for .VHD/.VHDX storage and recreated it using Storage Spaces. I don’t have much to say about Storage Spaces here since I have such a simple/stupid setup. I just created a single Storage Pool using the 250GB SSD and the 1TB SATA drive. Obviously with only two disks I was limited to a Simple storage layout (no disk redundancy/YOLO mode). I did opt to create a larger 8GB Write Cache using PowerShell, but other than that I pretty much just clicked through the wizard in Server Manager.
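For the curious, here is a minimal sketch of roughly what the equivalent looks like in PowerShell. The pool, tier and disk names, along with the tier sizes, are my own placeholders rather than anything the wizard generates, so adjust for your hardware:

    # Pool every disk that is eligible (here: the 250GB SSD and the 1TB HDD)
    $disks = Get-PhysicalDisk -CanPool $true
    New-StoragePool -FriendlyName "VMPool" -StorageSubSystemFriendlyName "Storage Spaces*" -PhysicalDisks $disks

    # Define an SSD tier and an HDD tier within the pool
    $ssdTier = New-StorageTier -StoragePoolFriendlyName "VMPool" -FriendlyName "SSDTier" -MediaType SSD
    $hddTier = New-StorageTier -StoragePoolFriendlyName "VMPool" -FriendlyName "HDDTier" -MediaType HDD

    # Carve out a tiered virtual disk with the larger 8GB write cache
    # (tier sizes are assumptions -- leave some headroom below raw capacity)
    New-VirtualDisk -StoragePoolFriendlyName "VMPool" -FriendlyName "VMStore" `
        -StorageTiers $ssdTier, $hddTier -StorageTierSizes 200GB, 900GB `
        -ResiliencySettingName Simple -WriteCacheSize 8GB

    # Initialize, partition and format the new disk for .VHD/.VHDX storage
    Get-VirtualDisk -FriendlyName "VMStore" | Get-Disk |
        Initialize-Disk -PassThru |
        New-Partition -UseMaximumSize -AssignDriveLetter |
        Format-Volume -FileSystem NTFS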

 

Let’s see how we did:

IOMeter Results with Storage Tiers

A marked improvement! We more than tripled our IOPS, from a snail-like 234 to a tortoise-like 820, and managed to reduce the response time from 14ms to 5ms. The latency reduction is probably the most important part. We generally shoot for under 2ms for our production workloads, but considering the hardware, 5-6ms isn’t bad at all.

 

What if I just run the .VHDX file directly on the shared 128GB SSD that the Hyper-V host is utilizing, without any Storage Tiers involved at all?

Hmm… not surprisingly the results are even better, but what was surprising is by how much. We are looking at sub-2ms latency and about four and a half times more IOPS than what my Storage Spaces virtual disk can deliver.

Of course benchmarks, especially quick and dirty ones like this, are rarely the whole story and likely do not even come close to simulating your true workload, but at least this gives us a basic picture of what my aging hardware can do: SATA = glacial, Storage Tiers with SSD caching = OK, SSD = good. It also illustrates just how damn fast SSDs are. If you have a poorly performing application, moving it over to SSD storage is likely the single easiest thing you can do to improve its performance. Sure, the existing bottleneck in the codebase or database design is still there, but does that matter anymore if everything is moving 4x faster? Like they say, Hardware is Cheap, Developers are Expensive.

I put this together prior to the general release of Server 2016, so it would be interesting to see if running this same setup on 2016’s implementation of Storage Spaces with ReFS instead of NTFS would yield better results. It also would be interesting to refactor the SQL database and, at the very least, place the TempDB, system databases and log files directly onto the host’s 128GB SSD. A project for another time I guess…

Until next time… may your pigs fly!

A flying pig powered by a rocket

Additional reading / extra credit:

Don’t Build Private Clouds? Then What Do We Build?

Give Subbu Allamaraju’s blog post Don’t Build Private Clouds a read if you have not yet. I think it is rather compelling but also wrong in a sense. In summation: 1) your workload is not as special as you think it is, 2) your private cloud isn’t really a “cloud” since it lacks the defining scale, resiliency, automation framework, PaaS/SaaS and self-service on-demand functionality that a true cloud offering like AWS, Azure or Google has, and 3) your organization is probably doing a poor job of building a private cloud anyway.

Now let’s look at my team – we maintain a small Cisco FlexPod environment – about 14 ESXi hosts, 1.5TB of RAM and about 250TB of storage. We support about 600 users and I am primary for the following:

  • Datacenter Virtualization: Cisco UCS, Nexus 5Ks, vSphere, NetApp and CheckPoint firewalls
  • Server Infrastructure: Platform support for 150 VMs, running mostly either IIS or SQL
  • SCCM Administration (although one of our juniors has taken over the day to day tasks)
  • Active Directory Maintenance and Configuration Management through GPOs
  • Team lead responsibilities, at the discretion of my manager, for larger projects with multiple groups and stakeholders
  • Escalation point for the team, point-of-contact for developer teams
  • Automation and monitoring of infrastructure and services

My day-to-day consists of work supporting these focus areas – assisting team members with a particularly thorny issue, migrating in-house applications onto new VMs, working with our developer teams to address application issues, maintaining the existing platform, holding meetings talking about all this work with my team, attending meetings talking about all this work with my managers, sending emails about all this work to the business stakeholders and a surprising amount of tier-1 support (see here and here).

If we waved our magic wand and moved everything into the cloud tomorrow, particularly into PaaS where the real value to cost sweet spot seems to be, what would I have left to do? What would I have left to build and maintain?

Nothing. I would have nothing left to build.

Almost all of my job is working on back-end infrastructure, doing platform support or acting as a human API/”automation framework”. As Subbu states, I am a part of the cycle of “brittle, time-consuming, human-operator driven, ticket based on-premises infrastructure [that] brews a culture of mistrust, centralization, dependency and control”.

I take a ticket saying, “Hey, we need a new VM,” and I run some PowerShell scripts to create and provision said new VM in a semi-automated fashion, then copy the contents of the older VM’s IIS directory over. I then notice that our developers are passing credentials in plaintext back and forth through web forms and .XML files between different web services, which kicks off a whole week’s worth of work to re-do all their sites in HTTPS. I then set up a meeting to talk about these changes with my team (cross training) and, if we are lucky, someone upstream actually gets to my ticket and these changes go live. This takes about three to four weeks, optimistically.
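The scripts themselves are nothing fancy. A stripped-down sketch of that kind of semi-automated build looks something like this (the VM name, sizes, paths and switch name are all placeholders, not our actual environment):

    # Hypothetical semi-automated VM build on Hyper-V; names, paths and sizes are placeholders
    $name = "APPSRV01"
    New-VM -Name $name -Generation 2 -MemoryStartupBytes 4GB `
        -NewVHDPath "D:\VHDX\$name.vhdx" -NewVHDSizeBytes 60GB `
        -SwitchName "vSwitch-Prod"
    Set-VMProcessor -VMName $name -Count 2
    Start-VM -Name $name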

In the new world our intrepid developer tweaks his Visual Studio deployment settings and his application gets pushed to an Azure WebApp, which comes baked in with geographical redundancy, automatic scale-out/scale-up, load-balancing, a dizzying array of backup and recovery options, integration with SaaS authentication providers, PCI/ISO/SOC compliance and the list goes on. This takes all of five minutes.

However, here is where I think Subbu gets it wrong: of our 150 VMs, about 50% belong to those “stateful monoliths”. They are primarily line-of-business applications with proprietary code bases that we don’t have access to, or legacy applications built on things like PowerBuilder that no one understands anymore. They are spread out across 10 to 20 VMs to provide segmentation but have huge monolithic database designs. It would cost us millions of dollars to refactor these applications into a design that could truly take advantage of cloud services in their PaaS form. Our other option would be cloud-based IaaS, which from the developer’s perspective is not that different from what we are currently doing, except that it costs more.

I am not even going to touch on our largest piece of IT spend, a line-of-business application with “large monolithic databases running on handcrafted hardware” in the form of an IBM z/OS mainframe. Now our refactoring cost is in the tens of millions of dollars.

 

If this magical cloud world comes to pass what do I build? What do I do?

  • Like some kind of carrion lord, I rule over my decaying infrastructure and accumulated technical debt until everything legacy has been deprecated and I am no longer needed.
  • I go full retar… err… endpoint management. I don’t see desktops going away anytime soon despite all this talk of tablets, mobile devices and BYOD.
  • On-prem LAN networking will probably stick around but unfortunately this is all contracted out in my organization.
  • I could become a developer.
  • I could become a manager.
  • I could find another field of work.

 

Will this magical cloud world come to pass?

Maybe in the real world, but I have a hard time imagining how it would work for us. We are so far behind in terms of technology and so organizationally dysfunctional that I cannot see how moving 60% of our services from on-prem IaaS to cloud-based IaaS would make sense, even if leadership could lay off all of the infrastructure support people like myself.

Our workloads aren’t special. They’re just stupid and it would cost a lot of money to make them less stupid.

 

The real pearl of wisdom…

“The state of [your] infrastructure influences your organizational culture.” Of all the things in that post, I think this is the most perceptive, as it is in direct opposition to everything our leadership has been saying about IT consolidation. The message we have continually been hearing for the last year and a half is that IT Operations is a commodity service – the technology doesn’t matter, the institutional knowledge doesn’t matter, the choice of vendor doesn’t matter, the talent doesn’t matter: it is all essentially the same and it is just a numbers game to find the implementation that is the most affordable.

As a nerd-at-heart I have always disagreed with this position because I believe your technology choices determine what is possible (i.e., if you need a plane but you get a boat, that isn’t going to work out for you), but the insight here that I have never really deeply considered is that your choice of technology drastically affects how you do things. It affects your organization’s cultural orientation to IT. If you are a Linux shop, does that technology choice precede your dedication to continuous integration, platform-as-code and remote collaboration? If you are a Windows shop, does that technology choice precede your stuffy corporate culture of ITIL misery and on-premises commuter hell? How much does our technological means of accomplishing our economic goals shape our culture? How much indeed?

 

Until next time, keep your stick on the ice.

SCCM SUP Failing to Download Updates – Invalid Certificate Error

I am currently re-building my System Center lab, which includes re-installing and re-configuring a basic SCCM instance. I was in the process of getting my Software Update Point (SUP) set up when a few of my SUP groups failed to download their respective updates and deploy correctly. I dimly remember working through this issue in our production environment a few years ago when we moved from SCCM 2012 to 2012 R2, and I cursed myself for not taking better notes. So here we are, atoning for our past sins!

Here’s the offending error:

SCCM SUP Download Error Dialog - Invalid Cert

 

SCCM is a chatty little guy and manages to generate some 160 or so different log files, spread out across a number of different locations (see the highly recommended MSDN blog article, A List of SCCM Log Files). Identifying and locating the logs relevant to whatever issue you are troubleshooting is about half the battle with SCCM. Unfortunately I couldn’t seem to find patchdownloader.log in its expected location of SMS_CCM\Logs. Turns out if you are running the Configuration Manager Console from an RDP session, patchdownloader.log will get stored in C:\Users\%USERNAME%\AppData\Local\Temp instead of SMS_CCM\Logs. Huh. In my case, I am RDPing to the SCCM server and running the console, but I wonder if I run it from a client workstation whether the resulting log will end up locally on that workstation in my %TEMP% folder or whether it will end up on the SCCM SUP server in SMS_CCM\Logs… an experiment for another day I guess.
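If you want to watch the download attempts live, something like this will follow the log from whichever location it landed in (the SMS_CCM path is an assumption based on the above; adjust for your site server’s install directory):

    # Find patchdownloader.log in whichever location it landed and tail it
    $candidates = "$env:TEMP\patchdownloader.log",
                  "C:\Program Files\SMS_CCM\Logs\patchdownloader.log"
    $log = $candidates | Where-Object { Test-Path $_ } | Select-Object -First 1
    Get-Content -Path $log -Tail 50 -Wait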

 

Here’s the juicy bits:

 

A couple of interesting things to note from here:

  • We get the actual error code (0x80073633), which you can “resolve” back to the human-readable message the Console presents you with by using CMTrace’s Error Lookup functionality. Sometimes this turns out to be useful information.
  • We get the download location for the update
  • We get the distribution package location that the update is being downloaded to

If I manually browse to the wsus.ds.download.windowsupdate.com URL I manage to download the update without issues. No certificate validation issues, which is what one would expect considering that the connection is going over HTTP according to the log. Makes one wonder how the resulting error was related to an “invalid certificate”…

OK. How do I fix it? Well, like most things SCCM, the solution is as stupid as it is brilliant: manually download the update from the Microsoft Update Catalog. Go find the offending update in its respective Software Update Group by referencing the KB number and download it again, but this time set your Download Location to the directory that already contains it.

Whoops. Didn’t work.

Take a look at the first attempt to download the content… SCCM is looking for Windows10.0-KB3172989-x64.cab so it can be downloaded into my %TEMP% directory and then eventually moved to the Deployment Package’s source location at \\SCCM\Source Files\Windows\Updates\2016.

The file I downloaded is not named Windows10.0-KB3172989-x64.cab – it’s actually an .msu file. Use 7-Zip or a similar tool to pull the .cab file out of it, and now SCCM SUP should successfully “download” the update and ship it off to the source location for your Deployment Package.
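If you don’t have 7-Zip handy, the built-in expand.exe should be able to do the same job. A quick sketch, reusing the filename from the log above (the destination folder name is my own):

    # Extract the .cab payload from the manually downloaded .msu
    New-Item -ItemType Directory -Path .\extracted -Force | Out-Null
    expand.exe -F:* .\Windows10.0-KB3172989-x64.msu .\extracted
    # Then copy Windows10.0-KB3172989-x64.cab into the Deployment Package source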

 

FFFUUUU Internet Explorer… a rant about an outage

I am not normally a proponent of hating on Microsoft, mostly because I think much of the hate they get for design decisions comes from people not taking the time to understand how Microsoft’s new widget of the month works and why it works that way. I also think it is largely pointless. All Hardware Sucks, All Software Sucks once you really start to dig around under the hood. That, and Microsoft doesn’t really give a shit about what you want and why you want it. If you are an enterprise customer they have you by the balls and you and Microsoft both know it. You are just going to have to deal with tiles, the Windows Store and all the other consumer-centric bullshit that is coming your way regardless of how “enterprise friendly” your sales rep says Microsoft is.

That being said, I cannot always take my own medicine of enlightened apathy and Stockholm Syndrome, and this is one of those times. We had a Windows Update get deployed this week that broke about 60-75% of our fleet, specifically Internet Explorer 11. Unfortunately we have a few line-of-business web applications that rely on it. You can imagine how that went.

Now there are a lot of reasons why this happened, but midway through my support call, while we were piecing together an uninstallation script to remove all the prerequisites of Internet Explorer 11, I had what I call a “boss epiphany”. A “boss epiphany” is when you step out of your technical day-to-day and start asking bigger questions, so named because my boss has a habit of doing this. I generally find it kind of annoying in a good-natured way, because I feel like there is a disregard for the technical complexities that I have to deal with in order to make things work, but I can’t begrudge that he cuts to the heart of the matter. And six hours into our outage, what was the epiphany? “Why is this so fucking hard? We are using Microsoft’s main line-of-business browser (Internet Explorer) and their main line-of-business tool for managing workstations in an enterprise environment (SCCM).”

The answer is complicated from (my) technical perspective, but the “boss epiphany” is a really good point. This shit should be easy. It’s not. Or I suck at it. Or maybe both. AND that brings me to my rant. Why in the name of Odin’s beard is software deployment and management in Windows so stupid? All SCCM is really doing is running an installer. For all its “Enterprisy-ness” it just runs whatever stupid installer you get from Adobe, Microsoft or Oracle. There’s no standardization, no packaging and no guarantee anything will actually be atomic. Even MSI installers can do insane things – like accept arguments in long form (TRANSFORMS=stupidapp.mst) but not short form (/t stupidapp.mst) or, my particular favorite, search for ProductKey registry keys to uninstall any older version of the application, and then try to uninstall it via the original .MSI. This fails horribly when that .MSI lives in a non-persistent client-side cache (C:\Windows\ccmcache). Linux was created by a bunch of dope-smoking neckbeards and European commies and they have had solid, standardized package management for like ten years. I remember taking a Debian Stable install up to Testing, then downgrading to Stable and then finally just upgrading the whole thing to Unstable. AND EVERYTHING WORKED (MOSTLY). Let’s see you try that kind of kernel and userland gymnastics with Windows. Maybe I just have not spent enough time supporting Linux to hate it yet, but I cannot help but admire the beauty of apt-get update && apt-get upgrade when most of my software deployments mean gluing various .EXEs and registry keys together with batch files or PowerShell. It’s 2016 and this is how we are managing software deployments? I feel like I’m taking crazy pills here.

 

Let’s look at the IEAK as a specific example, since I suspect it is half the reason I got us into this mess. The quotes from this r/sccm thread are perfect here:

  • “IEAK can’t handle pre reqs cleanly. Also ‘installs’ IE11 and marks it as successful if it fails due to prereqs”
  • “Dittoing this. IEAK was a nightmare.”
  • “IEAK worked fine for us apart from one issue. When installing it would fail to get a return from the WMI installed check of KB2729094 quick enough so it assumed it wasn’t installed and would not complete the IE11 install.”
  • “It turns out that even though the IEAK gave me a setup file it was still reaching out to the Internet to download the main payload for IE”
  • “I will never use IEAK again for an IE11 deployment, mainly for the reason you stated but also the CEIP issue.”

And that’s the supported, “Enterprise” deployment method. If you start digging around on the Internet, you see there are people out there deploying Internet Explorer 11 with Task Sequences, custom batch files, custom PowerShell scripts and the PowerShell Deployment Toolkit. Again, the technical part of me understands that Internet Explorer is a complicated piece of software and there are reasons it is deployed this way, but ultimately, if it is easier for me to deploy Firefox with SCCM than Internet Explorer… well, that just doesn’t seem right, now does it?

Until next time… throw your computer away and go outside. Computers are dumb.

Can’tBan… Adventures with Kanban

Comic about Agile Programming

. . .

We started using Kanban in our shop about six months ago. This in and of itself is interesting considering we are nominally an ITIL shop, and the underlying philosophies of ITIL and Kanban seem diametrically opposed. Kanban, at least from my cursory experience, is focused on speeding up the flow of work, identifying bottlenecks and meeting customers’ requirements more responsively. It is all about reducing “cycle time”, that is, the time it takes to move a unit of work through to completion. ITIL is all about slowing the flow of work down and adding rigor and business oversight into IT processes. A side effect of this is that the cycle time increases.

If you are not familiar with Kanban, the idea is simple. Projects get decomposed into discrete tasks, tasks get pulled through the system from initiation to finish, and each stage of the project is represented by a queue. Queues have work in progress (WIP) limits, which means only so many tasks can be in a single queue at the same time. The backlog is where everything you want to get done sits before you actually start working on it. DO YOU WANT TO KNOW MORE?

As I am sure the one reader of my blog knows, I simultaneously struggle with time management and I am also fascinated by it. What do I think about Kanban? I have mixed feelings.

The Good

  • Kanban is very visual. I like visual things – walk me through your application’s architecture over the phone and I have no idea what you have just told me five minutes later. Show me a diagram and I will get it. This appeal of course is personal and will vary widely depending on the individual.
  • Work in progress (WIP) limits! These are a fantastic concept. The idea that your team can only process so much work in a given unit of time, and that constantly context switching between tasks has an associated cost, is obvious to those in the trenches but not so much to those higher powers that exist beyond the Reality Impermeability Layer of upper management. If you literally show them there is not enough room in the execution queue for another task, they will start to get it. All of a sudden you and your leadership can start asking the real questions… why is task A being worked on before task Z? Do you need more resources to complete all the given tasks? Maybe task C can wait awhile? Why is task G moving so slowly? Why are we bottlenecked at this phase?
  • Priorities are made explicit. If I ever have doubt about what I am expected to be working on, I can just check the execution queue. If my manager wants me to work on another task that is outside the execution queue, then we can have a discussion about whether to bump something back or hold the current “oh hey, can you take care of this right now?” task in the backlog. I cannot overstate how awesome this is. It makes the cost of context switching visible, keeps my tactical work aligned with my manager’s strategic goals, and makes us think about which tasks matter most and in what order they should get done. This is so much better than the weekly meeting, where more and more tasks get dumped into some nebulous to-do list that my team struggles through while leadership wonders why the “Pet Project of the Month” isn’t finished yet.

The Interesting

  • The scope of work that you set as a singular “task” is really important. If a single task is too large, then it doesn’t accurately map to the work being done on a day-to-day basis and you lose out on Kanban’s ability to bring bottlenecks and patterns to the surface where they can be dealt with. If the tasks are too small, then you end up spending too much time in the “meta-analysis” of figuring out what task is where instead of actually accomplishing things.
  • The type of work you decide to count as a Kanban task also has a huge effect on how your Kanban actually “runs”. Do you track break/fix, maintenance tasks, meetings, projects, all of the above? I think this really depends on how your team works and what they work on so there is no hard or fast answer here.
  • Some team members are more equal than others. We set our WIP limit to Number of Team Members * 2, the idea being that two to three tasks is about all a single person can really focus on and still be effective (i.e., “The Rule of Threes”). Turns out, though, that in practice 60% of tasks are owned by only 20% of the team. Huh. I guess that would be called a bottleneck?

The Bad

  • Your queues need to actually be meaningful. Just having separate queues named “Initiation”, “Documentation”, “Sign-off” only works if you have discrete actions that are expected for the tasks in those queues. In our shop what I have found is only one queue matters: the execution queue. We have other queues but since they do not have requirements and WIP limits attached to them they are essentially just to-do lists. If a task goes into the Documentation queue, then you better damn well document your system before you move the task along. What we have is essentially a one queue Kanban system with a single WIP limit. If we restructured our Kanban process and truly pulled a task through each queue from beginning to finish I think we would see much more utility.
  • Flow vs. non-flow. An interesting side effect of not having strong queue requirements is that tasks don’t really “flow”. For example: we are singularly focused on the execution queue, and so every time I finish a task it gets moved onto the documentation queue, where it piles up with all the other stuff I never documented. Now, instead of backing off and making time for our team to document before pulling more work into the system, I re-focus on whatever task just got promoted into the execution queue. Maybe this is why our documentation sucks so much? What this should tell us is 1) we have too many items in the documentation queue to take on new work, 2) the documentation queue needs a smaller WIP limit, 3) we need to make the hard decision to put off work until documentation is done if we actually want documentation and 4) documentation is work and work takes time. If we never give staff the time to document then we will end up with no documentation. I don’t necessarily think everything needs to be pulled through each queue. Break/fix work is often simple, ephemeral and, if your ticket system doesn’t suck, self-documenting. You could handle these types of tasks with a standalone queue.
  • Queues should have time-limits. There are only two states regarding a given unit of work: you are either actively working on it or you are not. Kanban should have the same relationship with tasks in a given queue. If a task has sat in the planning queue for a week without any actual planning occurring, then it should be removed. Either the next queue is full (bottleneck), the planning queue is full (bottleneck/WIP limit too high) or your team is not working on your Kanban tasks (other, larger systemic problems). Aggressively “reset” tasks by sending them to the backlog if no work is being performed on them, and enforce your queue requirements; otherwise all you have done is create six different “to-do-whenever-we-have-spare-time-which-is-never” lists that just collect tasks.
  • Our implementation of Kanban does not work as a time management tool because we only track “project” work. Accordingly, very little of my time is actually spent on Kanban tasks, since I am also doing break/fix, escalations, monitoring and preventive maintenance. This really detracts from the overall benefit of managing priorities, making them explicit and limiting context switching, since our Kanban board represents at best 25% of my team’s work.

In conclusion, there are some things I really like about Kanban, and with some tweaks I think our implementation could have a lot of utility. I am not convinced it will mix well with our weird combination of ITIL processes but no real help desk (see: Who Needs Tickets Anyway? and Those are Rookie Numbers according to r/sysadmin). We are getting value out of Kanban, but without some real changes it will become just one more process of vague effectiveness.

It will be interesting to see where we are in another six months.

Until next time, keep your stick on the ice.

The Big Squeeze, Predictions in Pessimism


Layoff notice or stay the hell away from Alaska when oil is cheap… from sysadmin

 

I thought this would be a technical blog acting as a surrogate for my participation on ServerFault, but instead it has morphed into some kind of weird meta-sysadmin blog/soapbox/long-form reply to r/sysadmin. I guess I am OK with that…

Alaska is a boom and bust economy, and despite having a lot going for us fiscally, a combination of our tax structure, oil prices and the Legislature’s approach to the ongoing budget deficit means we are doing our best to auger our economy into the ground. Time for a bit of gallows humor to commiserate with u/Clovis69! The best part of predictions is that you get to see how hilariously uninformed you were down the road. Plus, if you are going to draw straws you might as well take bets on who gets the shortest one.

Be forewarned: I am not an economist, I am not even really that informed, and if you are my employer, future or otherwise, I am largely being facetious.

The Micro View (what will happen to me and my shop)

  • We will take another 15-20% personnel cuts in IT operations (desktop, server and infrastructure support). That will bring us to close to a 45% reduction in staff since 2015.
  • We will take on additional IT workload as our programming teams continue to lose personnel and consequently shed operational tasks they were doing independently.
  • We will be required to adopt a low-touch, automation-centric support model in order to cope with the workload. We will not have the resources to do the kind of interrupt-driven, in-person support we do now. This is a huge change from our current culture.
  • We will lean really hard on folks that know SCCM, PowerShell, Group Policy and other automation frameworks. Tier-2/Tier-3 will come under more pressure as the interrupt rate increases due to the reduction in Tier-1 staff.
  • Team members that do not adopt automation frameworks will find themselves doing whatever non-automatable grunt work there is left. They will also be more likely to lose their jobs.
  • We will lose a critical team member that is performing this increased automation work as they can simply get paid better elsewhere without having a budget deficit hanging over their head.
  • If we do not complete our consolidation work to standardize and bring siloed teams together before we lose what little operational capacity we have left, our shop will slip into full-blown reactive mode. Preventive maintenance will not get done and in two years’ time things will be Bad (TM). I mean like straight-up r/sysadmin horror story Bad (TM).
  • I would be surprised if I am still in the same role in the same team.
  • We will somehow have even more meetings.

The Macro View (what will happen to my organization)

Preliminary plans to consolidate IT operations were introduced back in early 2015. In short, our administrative functions, including IT operations, are largely decentralized and done at the department level. This leads to a lot of redundant work being performed, poor alignment of IT to the business goals of the organization as a whole, the inability to capture or recover value from economies of scale, and widely disparate resources, functionality and service delivery. At a practical level, what this means is there are a whole lot of folks like myself all working to assimilate new workload, standardize it and then automate it as we cope with staff reduction. We are all hurriedly building levers to help us move more and more weight, but no one has stopped to say, “Hey guys, if we all work together to build one lever we can move things that are an order of magnitude heavier.” Consequently, as valiant as our individual efforts are, we are going to fail. If I lose four people out of a team of eight, no level of automation that I can come up with will keep our heads above water.

At this point I am not optimistic about our chances for success. The tempo of a project is often determined by its initial pace. I have never seen an IT project move faster as time goes on in the public sector; generally it moves slower and slower as it grinds through the layers of bureaucracy and swims upstream against the prevailing current of institutional inertia and resistance. It has been over a year without any progress that is visible to the rank-and-file staff such as myself, and we only have about one, maybe two years of money left in the piggy bank before we find that the income side of our balance sheet is only 35% of our expenses. To make things even more problematic, entities that do not want to give up control have had close to two years to actively position themselves to protect their internal IT.

I want IT consolidation to succeed. It seems like the only possible way to continue to provide a similar level of service in the face of a 30-60% staff reduction. I mean, what the hell else are we going to do? Are we going to keep doing things the same way until we run out of money, turn the lights off and go home? If it takes one person on my team to run SCCM for my 800 endpoints, and three people from your team to run SCCM for your 3000 endpoints, how much do you want to bet the four of them could run SCCM for all 12,000 of our endpoints? I am pretty damn confident they could. And this scenario repeats everywhere. We are bailing out our boats, and in each boat is one piece of a high volume bilge pump but we don’t trust each other and no one is talking and we are all moving in a million different directions instead of stopping, collectively getting over whatever stupid pettiness that keeps us from actually doing something smart for once and putting together our badass high volume bilge pump. We will either float together or drown separately.

I seem to recall a similar problem from our nation’s history…

Benjamin Franklin's Join or Die Political Cartoon

Things in our Datacenter that Annoy Me

Or alternatively how I learned to stop worrying about the little things…

In this post, I complain about little details that show my true colors as some kind of pedantic, semi-obsessive, detail-oriented system administrator. I mean, I try to play it cool but inside I am really freaking out, man! Not really, but also kind of yes. More on that later.

 

Our racks are not deep or wide enough

Our racks were not sized correctly initially. They are quite “shallow”. A Dell R730 on ReadyRails is about 28″ deep, which is a pretty standard mounting depth for full-size rackmount equipment. In our racks, we only have about 4-6″ of space remaining between the posts and the door in the back of the rack. This complicates cabling since we do not have a lot of room to work with, but it really gets annoying with PDUs. See below.

The combination of shallow depth and lack of width leads to weird PDU configurations

PDU Setup

The racks are too shallow to mount the PDUs parallel with the posts with the plugs facing out towards the door and too narrow to stack both PDUs on one side. The PDUs end up being mounted sideways where they stick out into the area between the posts, blocking airflow and making cabling a pain in the ass.

Check out u/tasysadmin’s approach which is much improved over ours. The extra depth and width allows both power circuits (you do have two redundant power circuits, right?) to move over to one side of the rack and slide into the gap between the posts and casing. This has a whole bunch of benefits: airflow is not restricted, you have more working space for cabling, your power does not have to cross the back of the rack and you can separate your data and your power.

Beautiful Rack Cabling

This also means that some of our racks have the posts moved in beyond the standard rack mounting depth of 28″ in order to better accommodate our PDUs, the result of which is that only two out of five racks can accommodate a Dell PowerEdge.

Data and power not separated

You ideally want to run power on one side of the rack and data on the other. Most people will cite electromagnetic interference as a good reason for doing this but I have yet to see a problem caused by it (knock on wood). That being said, it is still a good idea to put some distance between the two, much like your recently divorced aunt and uncle at family functions. There are plenty of other good reasons for keeping data and power separate, most of which center around cabling hygiene – it helps keep things much cleaner because your data cables tend to run up and down the rack posts headed for things like your top-of-rack switch, whereas your power needs to go other places (i.e., equipment). It is a lot easier to bundle cables if they more or less share the same cable path.

Cannot access cable tray because of PDU cables

Cable Tray

This is just another version of “data and power are not separated”. Our power and data both come in at the top of the rack. This means the 4-conductor 10 AWG feeds for each PDU, which are about 0.5″ in diameter, are draped across our cable tray, which just sits on top of the racks instead of being suspended by a ladder bar (another great injustice!). I bet these guys generate quite the electromagnetic field. It would be nice if they were more than 4″ away from some of our 10 Gbps interconnects, huh? This arrangement also means the cable tray is a huge pain to use. You have to move all the PDU power cables off of it, then pop the lid off in segments to move your cables around. Or you can just run them all over the top of the rack and hope the fiber survives, like we do. Again. Not ideal.

Inconsistent fastener use for mounting equipment

This one sounds kind of innocuous, but it is one of those small details that makes your life so much easier. Pick a fastener type and size and stay with it. I am partial to M6 because the larger threads are harder to strip out and the head has more surface area for your driver’s bit to make contact with. It is pretty annoying to change tools every time the fastener type changes instead of just setting the torque level on your driver and going for it. Also – don’t even think of using self-tapping fasteners. They make cage nuts and square holes in rack posts for a reason.

Improper rail mounting and/or retention

Your equipment comes with mounting instructions and you should probably follow them. Engineers calculate how much weight a particular rail can bear and then figure out that you need four fasteners of grade X on each rail to adequately support the equipment. This is all condensed into some terrible IKEA-level instructions which make you shake your head as you wonder why your vendor could not afford a better technical writer given the obscene price of whatever equipment you are racking. Once you decipher these arcane incantations, follow them. Don’t skip installing cage nuts and fasteners – if they say you need four, then you need four. It only takes two more minutes to do the job right.

AND FOR THE LOVE OF $DEITY, INSTALL WHATEVER HARDWARE IS REQUIRED TO RETAIN THE EQUIPMENT IN THE RAILS! Seriously. This is a safety issue. I am not sure why this step gets skipped and people just set things on the rails without using the screws to retain them to the posts, but racks move, earthquakes happen and this shit is heavy. I think most of our disk shelves are about 50 pounds. You do not want that falling out of the rack and onto your intern’s head.

Use ReadyRails (or vendor equivalent)

For about $80 you can have a universal, tool-less rail that installs in about 30 seconds. I would call that a good investment.

Inconsistent inventory tagging locations

I am guessing your shop maintains an inventory system and you probably have little inventory tags you affix to equipment. Do your best to make the place where the inventory tag goes consistent and readable once everything is racked and stacked. The last thing you want to do is pull an entire rack apart because some auditor wants you to find the magical inventory tag stuck on some disk shelf in the middle of a 12-shelf aggregate.

It would also be a good idea to put your inventory tag in your documentation so you do not have to play a yearly game of “find the missing inventory tag”.

Cable labeling is not consistent (just use serialized cables)

I suck at cable labeling and documentation in general (see here), so this is a bit hypocritical. Nevertheless, I find that there are four stages of cable labeling: nothing; consistent labeling of port and device on each end; confusion as labeled cables are reused but the labels are not changed; and finally, adoption of serialized cables where each end has a unique tag that is documented.

This is largely personal preference, but the general rules are simple: keep it clean, keep it consistent and keep it current (your documentation, that is). The only thing worse than an unlabeled cable is a mislabeled cable.

Gaps in rack mount devices

Shelf Gap

Why? Just why? I don’t know and will probably never know… but my best guess is the rail on the top shelf was slightly bent during installation, and when we needed to add another shelf later the rail interfered with it. Ten minutes originally could have saved ten hours down the road. If it turns out I am one post hole short of being able to install another shelf, I get to move all the workloads off this aggregate, pull out all the disk shelves until I reach this one, fix or replace the rail, re-rack and re-cable everything, re-create the aggregate and then move the workloads back.

 

Now that I have complained a bit (I am sure r/sysadmin will say that I have it way too easy), I get to talk about the real lesson here: none of this shit matters.

On one level it does. All these little oversights accumulate technical debt that eventually comes back and bites you on the ass, and doing it right the first time is the easiest and most efficient way. On the other hand, none of this stuff directly breaks things. The fact that the power and data cabling are too close together for my comfort, or that there is a small gap in one of the disk shelf stacks, does not cause outages. We have plenty of things that do, however, and those demand my attention. So collectively, let’s take a deep breath, let it go, and stop worrying about it. It’ll get fixed someday.

The other lesson here is that nothing is temporary. If you cut a corner, particularly with physical equipment, that corner will remain cut until the equipment is retired. It is just too hard and costly to correct these kinds of oversights once you are in production. If you are putting a new system up, take some time to plan it out – consider the failure domain, how much fault tolerance and redundancy you need, labeling, inventory and all those other little things. You only get to stand this system up once. Go slow and give it some forethought, you may thank yourself one day.

A Ticket Too Far… Part II

Part I:  A Ticket Too Far… Breaking the Broken

OK. So I screwed up. If you read the above post, you will see I make a lot of claims about ticket systems and process. Many of those claims were based on the idea that we did not have an approved and enforced policy in place. Turns out I was wrong, sort of.

I did a little digging and reviewed the policy. Customers are asked to submit a ticket if their issue is not immediately preventing their work; otherwise they can call the help desk number, call an IT staff member directly or visit them in person. There is a lot I could say about this policy, and which provisions I agree with and which I do not, but policy is policy and I misrepresented the strength of it – it is very much vetted and approved.

I see the goals of a policy on customer facing support as follows:

  • Prevent sysadmins from forgetting customer requests
  • Allow sysadmins to control their interrupt-based workflow, prioritize and not have their workflow control them
  • Track customer requests and incidents so pain-points can be discovered and resolved proactively
  • Build a database of break/fix-based documentation
  • Create an acknowledgement and feedback mechanism for customer issues (i.e., “you have been assigned ticket #232”), backed up by mechanisms that force sysadmin action (i.e., “this ticket has not been touched in three days, either close it or reply”). This feedback loop ensures that issues are acknowledged and resolved one way or another.

The details may be wrong, but the bigger point of my last post remains the same: the combination of our policy and our ticket system’s technical limitations does not accomplish those goals or lead to ideal outcomes for either customers or IT staff.

But is that really true? Perception and reality are not always the same, so I started tracking how often I was interrupted by a customer or a team member over a period of about four weeks. It is important to mention this was not a particularly rigorous study; I just kept an Excel spreadsheet, and anytime I diverted my attention from my current task for more than a few minutes I made a quick note of it. If anything, I was consistent in my inconsistency. I also kept track of what kind of interrupt it was, what group it came from and whether or not it had a ticket attached to it.
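If you want to run the same experiment without the spreadsheet, a minimal sketch of a PowerShell interrupt logger might look like this (the function name, CSV path and columns are all my own invention, not what I actually used):

    # Append one row per interrupt to a CSV for later analysis
    function Add-Interrupt {
        param(
            [string]$Kind,       # e.g. "break/fix" or "communication"
            [string]$Group,      # e.g. "customer", "team member"
            [switch]$HadTicket,
            [switch]$Immediate   # did it require immediate action?
        )
        [pscustomobject]@{
            Timestamp = Get-Date
            Kind      = $Kind
            Group     = $Group
            HadTicket = [bool]$HadTicket
            Immediate = [bool]$Immediate
        } | Export-Csv -Path "$HOME\interrupts.csv" -NoTypeInformation -Append
    }

    # Usage: Add-Interrupt -Kind "break/fix" -Group "customer" -Immediate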

 

a graph showing the number of interrupts per day

A couple of interesting discoveries here:

  • I am not interrupted nearly as much as I think I am. If you throw out the obvious outlier of Day 12, the median is two interrupts per day. Not as bad as I would have thought… but it is not that great either, considering that with meetings and other obligations I probably only have one period per day of uninterrupted time longer than two hours to focus on complex projects. Getting an interrupt during that period is a pretty serious setback.
  • 48% of the interrupts were related to break/fix issues. The other 52% were what I call “communication interrupts”. More on these later.
  • Of the 31 break/fix interrupts I recorded, only one actually had a ticket already associated with it. This is mind-bogglingly terrible as far as accepted time management best practices go.
  • Only 41% of the interrupts required immediate action; the rest could have been queued and prioritized later. This means 59% of these interruptions really did not need to be interruptions at all; they needed to be tickets or agenda items in a meeting.

Even with my back-of-the-napkin math these are pretty damning conclusions. Our ticket system capture rate, at least at Tier-3 where I spend most of my time, is laughably non-existent. Going back to my previous post about creating tickets for customers, there would be no reason for me to do so considering I would be creating 97% of the tickets myself. Interestingly enough, it is not like these requests just vanish into thin air. They get tracked and recorded somehow, albeit most likely by individual staff in a manner that is not portable or visible. The work required to track issues is still being done, but the organization is not getting a lot of value out of it since it is just recorded in some crusty senior sysadmin’s logbook.

As an aside, it would be really interesting to perform the same experiment at the Tier-1 and Tier-2 support levels and see what the ratio is. Maybe capture is higher there, and it is the kind of issues and/or customers that Tier-3 deals with that leads to low ticket system capture. Or, worse, maybe it is the same and ticket system capture is just really bad everywhere.

Almost two-thirds of these interrupts did not have to be interrupts. They did not require immediate action. This is also pretty terrible, because interrupts cost a lot. The accepted wisdom is that it takes between 10 and 20 minutes to get back in “the zone” (source) after being interrupted. At my median of two interrupts per day, that is about 40 minutes lost on average per day just in context switching from one task to another. There is also a greater risk of mistakes being made while trying to regain focus on your task. That cost is harder to calculate, but it is assuredly there.

Finally, a bit over half of these interrupts were “communication interrupts”. These are hard to pin down, but mostly they were a team member or a customer wanting to communicate something to me (“Hey, I fixed item A on Server 1, but item B and item C are still broken”) or a request for information (“Hey, how do items Z and Y work on Server 2 and Server 3?”). These clearly have an interrupt cost but, on the other hand, there is also a cost to not letting them jump to the top of my to-do pile. If someone needs information from me, it is likely because they are in a wait-cycle – they need to know something to continue their current task. It feels like it boils down to a “whose time is more valuable?” argument. Is it better for a Tier-2 team member to burn 45 minutes instead of 10 on a task because she had to dig up some information instead of asking a Tier-3 sysadmin? Or is it better for the Tier-3 sysadmin to burn 20 minutes to save his Tier-2 team member 35? I do not really know if there is an answer here, but it is an interesting thread to pull on…

The interrupts that served to communicate information to me seem a little more clear cut. None of them required immediate action and most of them were essentially status updates. Implementing a daily standup meeting or something similar would be a perfect format for these kinds of interactions.

 

Well. It was an interesting little experiment. I am not sure if I am smarter or better off for it but curiosity is an itch that sometimes just needs to be scratched.

Until next time, stay frosty.

 

Are GOV IT teams apathetic?

I have been stewing about this post on r/sysadmin, Is apathy a problem in most government IT teams?, for a while and felt it was worth a quick write-up since most of my short IT career has been spent in the public sector.

First off, apathy and team dysfunction are a problem everywhere. There is nothing unique about government employees versus private employees in that respect. What I think the poster is really asking is, “Is there something about government IT that produces apathetic teams?” And if you read a little deeper, it seems like apathy really means “permanent discouragement”; that is to say, the condition where change, “doing things right or better”, and greater efficiency are, or seem to be, impossible. When you read something like, “…trying to make things more efficient is met with reactions like ‘oh you naive boy’ and finger pointing,” it is hard to think of it as just plain old vanilla apathy.

Government is not a business (despite what some people think). Programs operate at a loss and are subsidized, in many cases entirely, by taxes because the public and/or their representatives deem those programs worthy. The failure mechanism of market competition doesn’t exist. Incredibly effective programs can be cancelled because they are no longer politically favorable, and incredibly ineffective programs can continue or expand because they have political support. Furthermore, in all things public servants need to remain impartial, unbiased and above impropriety. This leads to vast and byzantine processes, the components of which singularly make eminent good sense (for example, the prohibition of no-bid contracts), but collectively all these well-intentioned barnacles slow the ship-of-state dramatically. Success is not rewarded with growth either. Implementing a more efficient process or a more cost-effective infrastructure and saving money generally results in less money. This tendency of budget reduction (“Hey, if you saved it, you did not need it to begin with, right?”) turns highly functioning teams into disasters over time as they lose resources. Paradoxically, the better you are at utilizing your existing resources, the less you get. Finally, your entire leadership changes with every administration change. You may still be shoveling coal down in the engine room, but the new skipper just sent down word to reduce steam and come about hard in order to head in the opposite direction. Private companies that do this kind of thing, with this frequency, generally do not last long.

How does all this apply to Information Technology? It means that your organization will move very, very slow and technology moves very, very fast. Not a good combo.

 

Those are the challenges that a team faces but what about the other half of the equation… the people facing them?

Job classes are just one small part of this picture, but they are emblematic of the challenges that face team leads and managers when dealing with the ‘People’ piece of People, Process and Technology (ITIL buzzword detected! +5 points). The idea of job classes is that, across the organization, people doing similar work should be paid the same. The problem lies in the fact that updating a job class is beyond onerous and the time to completion is measured in years. Do you know how quickly Information Technology reinvents itself? Really quick. This means that job classes and their associated salaries tend to drift away from the actual on-the-ground work being done and the appropriate compensation level over time, making recruitment of new staff and retention of your best staff very difficult (The Dead Sea Effect). If you combine this with a lack of training and professional development, staff have a tendency to get pigeon-holed into a particular role without a clear promotion path. Furthermore, many of the job class series are disjointed in such a way that working at the top of one job series will not meet the prerequisites for another job series, making advancement difficult, and at least on paper sometimes impossible. For example: you could work as a Lead Programmer for three years leading a team of five people and not qualify, at least on paper, for an entry-level IT Manager position.

How does all this apply to Information Technology? People get stuck doing one job, for too long, with no professional training or mentorship. Their skillsets decline towards obsolescence and they become frustrated and discouraged.

 

I have never met anyone in the public sector who just straight up did not give a crap. I have met people who feel stuck, discouraged, marginalized and ignored. And rightly so. Getting stuff done is very hard. It is like everyone has one ingredient necessary to make a cake, and you all more or less agree on the recipe. You are all trained and experienced bakers. You could easily make a cake, but you each have 100 pieces of paperwork to fill out and wait on, sometimes for months, before you can do your part of the cake-baking process. You have 10 different bosses, each telling you to make a different dessert, when you know that cakes are by far the best dessert for your particular bakery. Then you get yelled at for not making a cake in a timely manner, and then you are all fired and replaced by food service contractors whose parent company charges an exorbitant hourly rate. But hey, the public eventually got their cake, right? Or at least a donut. Not exactly what they ordered, but better than nothing… right?

If IT is a thankless job (and I am not sure I agree with that piece of sysadmin mythology), then public sector IT is even more thankless. You will face a Kafkaesque bureaucracy. You will likely be very underpaid and have a difficult time seeking promotion. You will never be able to accept a vendor-provided gift or meal over the price of $25. You will laugh when people ask if you plan on attending VMworld. The public will stereotype you as lazy, ineffective and overpaid. But you will persevere. You have a duty to the body politic to do your best with what you have. You will keep the lights on; you will keep the ship afloat even as more and more water pours in. You have to. Because that’s what Systems Administrators do.

And all you wanted was to simply make a cake.

 

 

One Year of Solitude: My Learning Experience as a Lead

It has been a little over a year since I stepped into a role as a technical lead, and I thought this might be a good time to reflect on some of the lessons I have learned as I transition from being focused entirely on technical problems to trying to understand how those technical pieces fit into a larger picture.

 

Tech is easy. People are hard. And I have no idea how to deal with them.

It is hard to overstate this. People are really, really difficult to deal with compared to technology, and I have so much to learn about this piece of the sysadmin craft. I do not necessarily mean people are difficult in the sense that they are oppositional or hard to work with (although often they are), just that team dynamics are very complicated and the people composing your team have a huge spread in terms of experience, skills, motivations, personalities and goals. These underlying “attributes” are not static either; they change based on the day, the mood and the project, making identifying them, understanding them and planning around them even harder. Awareness of this underlying milieu composing your team members, and thus your team, is paramount to your project’s success.

All I can say is that I have just begun to develop an awareness of these “attributes” and am just getting the basics of recognizing different communication styles (person and instance dependent). I can just begin to tell whose motivations align with mine and whose do not. In hunting we call this the difference between “looking and seeing”. It takes a lot of practice to truly “see”, especially if, like me, you are not that socially adept.

My homework in this category is to build an RPG-like “character sheet” for each team member, myself included, and think about what their “attributes” are, where those attributes are strengths and where they can be weaknesses.

 

Everyone will hate you. Not really. But kinda yes.

One of the hardest parts of being a team lead is that you are now “in charge” of technical projects with a project team made up of members who are not within your direct “chain-of-command” (at least this is how it works in my world). This means you own the responsibility for the project, but any authority you have is granted to you by a manager somewhere higher up the byzantine ladder of bureaucracy. Nominally, this authority allows you to assign and direct work directly related to the project, but in practice this authority is entirely discretionary. You can ask team member A to work on item Z, but it is really up to her and her direct supervisor whether that is what she is going to do. In the hierarchical, authority-based, process-driven business world that most of us work in, this means you need to be exceedingly careful about whose toes you step on. Authority on paper is one thing; authority in practice is entirely another.

 

Mo’ People, Mo’ Problems

My handful of projects have thus far been composed of team members that fall into these rough archetypes.

A portion of the team will be hesitant to take up the project and the work you are asking them to do, since you are not, strictly speaking, their supervisor. They will passively help the project along, and frequently you will be required to meet directly with them and/or their supervisor to make sure they are “cleared” for the work you assigned them and to make sure they feel OK about doing it. These folks want to be helpful, but they don’t want to work beyond what their supervisor has designated. Get them “cleared” and make sure they feel safe doing the work and you have made a lot of progress.

Another portion of the team will be outright hostile. Either their goals or motivations do not align with the project or, even worse, their supervisor’s goals or motivations do not align with the project but someone higher up leaned on them and so they are playing along. This is tough. The best you can hope for here is to move these folks from actively resisting to passively resisting. They might be “dead weight”, but at least they aren’t actively trying to slow things down any more. I don’t have much of a working strategy here – an appeal to authority is rarely effective. Authority does not want to be bothered by your little squabbles, and arguably it has already failed, because chain-of-command can make someone play along, but it cannot make them play nice. I try to tailor my communication style to whatever I am picking up from these team members (see the poorly named Dealing with People You Can’t Stand), do my best to include them (trying to end-run them makes things ten times worse) and inoculate the team against their toxicity. I am a fan of saying I deal with problems and not complaints, because problems can actually be solved, but a lot of times these folks just want to complain. Give them a soap box so they can get it out of their system and you can move on and get work done, but don’t let them stand on it for too long.

Another group will be unengaged. These poor souls were probably assigned to the project because their supervisor had to put someone on it. A lot of times the project will be outside their normal technical area of operations, or the project will only marginally affect them, or both. They will passively assist where they can. The best strategy I have found here is to be concise, do your best not to waste their time, and use their experience and knowledge of the surrounding business processes and people as much as you can. These folks can generate some great ideas or see problems that you would never otherwise see. You just have to find a way to engage them.

The last group will be actively engaged and strongly motivated to see the project succeed. These folks will be doing the heavy lifting and 90% of the actual technical work required to accomplish the project. You have to be careful not to let them lean too hard on the other team members out of frustration, and you have to avoid over-relying on them or burning them out; otherwise you will be really screwed, since they are the only people truly putting in the nuts-and-bolts work required for the project’s success.

A quick aside: if you do not have enough people in this last group, the project is doomed to failure. There is no way a project composed mostly of people actively resisting its stated goals will succeed, at least not under my junior leadership.

Dysfunctional? Yes. But all teams are dysfunctional in certain ways and at certain times. Understanding and adapting to the nature of your team’s dysfunction lets you mitigate it and maybe, just maybe, help move it towards a healthier place.

Until next time, good luck!