Prometheus and Sisyphus: A Modern Myth of Developers and Sysadmins

I am going to be upfront with you: you are about to read a long, meandering post that will seem a little too whiny at times as I talk some crap about our developers and their burdens (applications). I like our dev teams and I like to think I work really well with their leads, so think of this post as a bit of satirical sibling rivalry. Underneath the hyperbole and good-natured teasing there might be a small, “little-t” truth.

That truth is that operations, whether it’s the database administrator, the network team, the sysadmins or the help desk, always, always, always gets the short straw and that is because collectively we own “the forest” that the developers tend to their “trees” in.

I have a lot to say about the oft-repeated myth of “how misunderstood sysadmins are” and how they just seem to get stepped on all the time, and so on and so on. I am not a big fan of “special snowflake sysadmin syndrome,” especially when it is used as an excuse to be rude or unprofessional, and at the risk of contradicting my earlier statement, even I know I am half full of crap when I say sysadmins always get the short straw.

OK, disclaimers are all done! Let’s tell some stories!

 

DevOps – That means I get Local Admin right?

My organization is quite granular: each of our departments more or less maintains its own development team supporting its own mission-specific applications, along with either a developer who essentially fills an operations role or a single operations guy doing support. The “central” team maintains things like the LAN, Active Directory, the virtualization platform and so on. If the powers on high wanted a new application for their department, the developers would request the required virtual machines, the Operations Team would spin up a dozen VMs off of a template, join them to AD, give the developers local admin and they’d be on their merry way.

Much like Bob Belcher, all the Ops guys could do is “complain the whole time”.

 

This arrangement led to some amazing things that break in ways that are too awesome to truly describe:

  • We have a department with a staff of 120 and 180 Active Directory Security Groups. At last count, some 45 are completely empty. Auditing NTFS permissions is… uh, difficult?
  • We have an in-house application that uses SharePoint as a front-end, calls some in-house web services tied to a database or two, and auto-populates an Excel spreadsheet that is used for timekeeping. Everyone else just fills out the spreadsheet.
  • We have another SharePoint-integrated application, used ironically enough for compliance training, that passes your Active Directory credentials in plaintext through two or three servers, all hosting different web services.
  • Our deployment process is to use Windows File Explorer to copy everything off your workstation onto the IIS servers.
  • Our revision control is: E:\WWW\Site, E:\WWW\Site (Copy), E:\WWW-Site-Dev McDeveloper
  • We have an application that manages account on-boarding, a process that is already automated by our Active Directory team.
  • At one point in time we had four or five different backup systems, all of which used BackupExec for some insane reason, and three of which backed up the same data.
  • And then there’s Jenga: Enterprise Edition…

 

Jenga: Enterprise Edition – Not so fun when it needs four nines of uptime.

A satirical (but only just) rendering of the design pattern of one of our applications, which I call “The Spider Web”

What you are looking at is my humorous attempt to scribble out a satirical sketch of one of our line-of-business applications, which actually managed to turn out pretty accurate. The Jenga application is so named because all the pieces are interconnected in ways that turn the prospect of upgrading any of it into the project of upgrading all of it. Ready? Ere’ we go!

It’s built around a core written in a language that we haven’t had any on-staff expertise in for the better part of ten years. In order to provide the functionality the business needed as the core aged, the developers wrote new “modules” in more current and maintainable languages that essentially just call APIs or exposed services, and bolted them on. The database is relatively small, around 6 TBs, but almost 90% of it is static read-only data that we cannot separate out, which drastically reduces the cool things our DBA and I can do in terms of recovery, backup, replication and performance optimization. There are no truly separate development or testing environments, so we use snapshot copies to expose what appear to be “atomic” copies of the production data (which contains PII!) on two or three other servers so our developers can validate application operations against it. We used to do this with manual fricking database restores, which was god damned expensive in terms of time and storage. And it keeps going:

  • There are no less than eight database servers involved, but the application cannot be distributed or set up in some kind of multi-master deployment with convergence, so staff at remote sites suffer abysmal performance if anything resembling contention happens on their shared last-mile connections.
  • The “service accounts” are literally user accounts that the developers use to RDP to the servers, start the application’s GUI, and then enable the application’s various services by interacting with the above-mentioned GUI (any hiccup in the RDP session and *poof* there goes that service).
  • The public-facing web server directly queries the production database (our DBA’s favorite piece).
  • The internally consumed pieces of the application and the externally consumed pieces are co-mingled, meaning an outage anywhere is an outage everywhere.
  • The client requires a hard-coded drive map to run, since application upgrades are handled internally with copy jobs that replace all the local .DLLs when new ones are detected.
  • Oh, and it runs on out-of-support versions of SQL.

Whew. That was a lot. Sorry about that. Despite the fact that a whole department pretty much lives or dies by this application’s continued functionality, our devs haven’t made much progress in re-architecting and modernizing it. Now, this really isn’t their fault, but it doesn’t change the fact that my team has an increasingly hard time keeping this thing running in a satisfactory manner.

 

Operations: The Digital Custodian Team.

Somewhere in our brainstorming session about how to move Jenga to a new virtualization infrastructure (all on a weekend when I’ll be traveling, in order to squeeze the outage into the only period within the next two months that wasn’t going to be unduly disruptive), I began to feel like my team was getting screwed. They have more developers supporting this application than we have in our whole operations team, and it’s on us to figure out how to move Jenga without losing any blocks or having any lengthy service windows? What are those guys actually working on over there? Why are we trying to figure out which missing .DLL from .NET 1.0 needs to be imported onto the new IIS 8.5 web server so some obscure service that no one really understands runs in a supported environment? Why does operations own the life-cycle management? Shouldn’t the developers be updating and re-writing code to reflect the underlying environmental and API changes each time a new server OS is released with a new set of libraries? Why are our business expectations for application reliability so widely out of sync with what the architecture can actually deliver? What’s going on here?

Honestly? I don’t know, but it sucks. It sucks for the customers, it sucks for the devs, but mostly it sucks for my team, because we have to support four other line-of-business applications. We own the forest, right? So when a particular tree catches on fire they call us to figure out what to do. No one mentions that we probably shouldn’t expect trees wrapped in paraffin wax and then doused in diesel fuel not to catch on fire. When we point out that tending trees in this manner probably won’t deliver the best results if you want something other than a bonfire, we get met with a vague shrug.

Is this how it works? Your team of rockstar, “creative-type” code-poets whips up some kind of amazing business application, celebrates, and then hands it off to operations, where we have to figure out how to keep it alive for the next 20 years as the platform and code base age into senility? I mean, who owns the on-call phone for all these applications? Hint: it’s not the dev team.

I understand that sometimes messes happen… just why does it feel like we are the only ones cleaning it up?

 

You’re not my Supervisor! Organizational Structure and Silos!

Management! We got it!

 

At first blush I was going to blame my favorite patsy, Process Improvement and the insipid industry around it, for this current state of affairs, but after some thought I think the real answer here is something much simpler: the dev team and my team don’t work for the same person. Not even close. If we play a little game of “trace the organizational chart,” we have five layers of management before we reach a position that has direct reports that eventually lead to both teams. Each one of those layers is a person, with their own concerns, motivations, proclivities and the spin they put on any given situation. The developers and the operations team (“dudes that work”) more or less agree that the design of the Jenga application is Not a Good Thing (TM). But as each team gets directed to move in a certain direction by each layer of management, our efforts and goals diverge. No amount of fuzzy-wuzzy DevOps or new-fangled Agile Standup Kanban Continuous Integration Gamification Buzzword Compliant bullshit is ever going to change that. Nothing makes “enemies” out of friends faster than two (or three or four) managers maneuvering for leverage and dragging their teams along with them.

I cannot help but wonder what our culture would be like if the lead devs sat right next to me and we established project teams out of our combined pool of developer and operations talent as individual departments put forth work. What would things be like if our developers weren’t chained to some stupid line-of-business application from the late ’80s, toiling away to polish a turd and implement feature requests like some kind of modern Promethean myth? What would things be like if our operations team wasn’t constantly trying to figure out how to make old crap run while our budgets and staff are whittled away, snatching victory from defeat time and time again only to watch the cycle of mistakes restart itself again and again like some kind of Sisyphean dystopia with cubicles? What if we could sit down together and, I dunno… fix things?

Sorry, there are no great conclusions or flashes of prophetic insight here; I am just as uninformed as the rest of the unwashed masses. But I cannot help but think that maybe, just maybe, we have too many chefs in the kitchen arguing about the menu. Then again, what do I know? I’m just the custodian.

Until next time, stay frosty.

The HumbleLab: Windows Server 2016, ReFS and “no sufficient eligible resources” Storage Tier Errors

Well, that didn’t last too long, did it? Three months after getting my Windows Server 2012 R2-based HumbleLab set up, I tore it down to start fresh.

As a refresher, The HumbleLab lives on some pretty humble hardware:

Dell OptiPlex 990 (circa 2012)

  • Intel i7-2600, 3.4GHz 4 Cores, 8 Threads, 256KB L2, 8MB L3
  • 16GBs, Non-ECC, 1333MHz DDR3
  • Samsung SSD PM830, 128GBs SATA 3.0 Gb/s
  • Samsung SSD 840 EVO 250GBs SATA 6.0 Gb/s
  • Seagate Barracuda 1TB SATA 3.0 Gb/s

However, I did manage to scrounge up a Hitachi/HGST Ultrastar 7K3000 3TB SATA drive in our parts bin, manufactured in April 2011, to swap places with the eight-year-old Seagate drive. Not only is the Hitachi drive three years newer, it also has three times as much capacity, bringing a whopping 3TBs of raw storage to the little HumbleLab! Double win!

My OptiPlex lacks any kind of real storage management, and my Storage Pool was configured with a Simple storage layout, which just stripes the data across all the drives in the pool. It should also go without saying that I am not using any of Storage Spaces’ Failover Clustering or Scale-Out functionality. I couldn’t think of a simpler way to swap my SATA drives than to export my Virtual Machines, destroy the Storage Pool, swap the drives and recreate it. The only problem is that I didn’t really have any readily available temporary storage to dump my VMs on, and my lab was kind of broken, so I just nuked everything and started over with a fresh install of Server 2016, which I wanted to upgrade to anyway. Oh well, sometimes the smartest way forward is kind of stupid.
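
For what it’s worth, the export route I decided against would have looked something like this; the export path is just a placeholder:

  # Export every VM to temporary storage before destroying the Storage Pool.
  Get-VM | Export-VM -Path 'D:\VMExport'
  # After rebuilding the pool and virtual disk, each VM comes back with Import-VM
  # pointed at its exported configuration file (an .xml on 2012 R2, a .vmcx on 2016).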

Not much to say about the install process but I did run across the same “storage pool does not have sufficient eligible resources” issue creating my Storage Pool.

Neat! There’s still a rounding error in the GUI. Never change, Microsoft. Never change.

According to the Internet’s most accurate source of technical information, Microsoft’s TechNet Forums, there is a rounding error in how disks are presented in the wizard. I guess what happens is that when you want to use all 2.8TBs of your disk, the numbers don’t match up exactly with the actual capacity, and consequently the wizard fails as it tries to create a Storage Tier bigger than the underlying disk. I guess. I mean, it seems plausible at least. Supposedly, specifying the size in GBs or even MBs will work, but naturally it didn’t work for me, and I ended up creating my new Virtual Disk using PowerShell. I slowly backed off the size of my Storage Tiers from the total capacity of the underlying disks until it worked, leaving about 3GBs worth of slack space. It is a little disappointing that the wizard doesn’t automagically do this for you, and doubly disappointing that this issue is still present in Server 2016.

Here’s my PowerShell snippet:
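
It was something along these lines; the pool and tier friendly names are placeholders, and the tier sizes are just the raw capacities backed off until the error went away:

  # Pool and tier friendly names below are placeholders; adjust to taste.
  $pool = Get-StoragePool -FriendlyName 'HumbleLab-Pool'
  $ssd  = New-StorageTier -StoragePoolFriendlyName $pool.FriendlyName -FriendlyName 'SSD-Tier' -MediaType SSD
  $hdd  = New-StorageTier -StoragePoolFriendlyName $pool.FriendlyName -FriendlyName 'HDD-Tier' -MediaType HDD

  # Back the tier sizes off from the raw disk capacity (a few GBs of slack in total)
  # until New-VirtualDisk stops complaining about "sufficient eligible resources".
  New-VirtualDisk -StoragePoolFriendlyName $pool.FriendlyName -FriendlyName 'VM-Storage' `
      -StorageTiers $ssd, $hdd -StorageTierSizes 220GB, 2790GB `
      -ResiliencySettingName Simple -ProvisioningType Fixed -WriteCacheSize 8GB

  # Bring the new disk online and format it with ReFS.
  Get-VirtualDisk -FriendlyName 'VM-Storage' | Get-Disk |
      Initialize-Disk -PartitionStyle GPT -PassThru |
      New-Partition -UseMaximumSize -AssignDriveLetter |
      Format-Volume -FileSystem ReFS -NewFileSystemLabel 'VMs'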

 

Now for the big reveal. How’d we do?

Not bad at all for running on junk! We were able to squeeze a bit more go juice out of the HumbleLab with Server 2016 and ReFS! We bumped the IOPS up to 2240 from 880 and reduced latency to sub-2ms numbers from 4ms, which is amazing considering what we are running this on.

I think this performance increase is largely due to the combination of how Storage Tiers and ReFS are implemented in Server 2016, and not due to ReFS’s block cloning technology, which is focused on optimizing certain types of storage operations associated with virtualization workloads. As I understand it, Storage Tiers previously were “passive” in the sense that a scheduled task would move hot data onto SSD tiers and cooling/cold data back onto HDD tiers, whereas in Server 2016 Storage Tiers and ReFS can do real-time storage optimization. Holy shmow! Windows Server is starting to look like a real operating system these days! There are plenty of gotchas of course, and it is not really clear to me whether Microsoft is talking about Storage Spaces / Storage Tiers or Storage Spaces Direct, but either way I am happy with the performance increase!
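
For reference, the “passive” behavior on 2012 R2 is driven by a scheduled task you can inspect and kick off yourself; a quick sketch (the task path is from memory, so treat it as an assumption):

  # List the tier management tasks, then run the optimization pass on demand
  # instead of waiting for the nightly schedule.
  Get-ScheduledTask -TaskPath '\Microsoft\Windows\Storage Tiers Management\' |
      Select-Object TaskName, State
  Start-ScheduledTask -TaskPath '\Microsoft\Windows\Storage Tiers Management\' `
      -TaskName 'Storage Tiers Optimization'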

Until next time!

 

Kafka in IT: How a Simple Change Can Take a Year to Implement

Public sector IT has never had a reputation of being particularly fast-moving or responsive. In fact, it seems to have a reputation for being staffed by apathetic under-skilled workers toiling away in basements and boiler rooms supporting legacy, “mission-critical”, monolithic applications that sit half-finished and half-deployed by their long-gone and erstwhile overpaid contractors (*cough* Deloitte, CGI *cough*). This topic might seem familiar… Budget Cuts and Consolidation and Are GOV IT teams apathetic?

Why do things move so slowly, especially in a field that demands the opposite? I don’t have an answer to that larger question, but I do have an object lesson; well, maybe what I really have is part apology, part explanation and part catharsis. Gather around and hear the tale of how a single change to our organization’s perimeter proxy devices took a year!

 

03/10

We get a ticket stating that one of our teams’ development servers is no longer letting them access it via UNC share or RDP. I assign one of our tier-2 guys to take a look and a few days later it gets escalated to me. The server will not respond to any incoming network traffic, but if I access it via console and send traffic out it magically works. This smells suspiciously like a host-based firewall acting up but our security team swears up and down our Host Intrusion Protection software is in “detect” mode and I verified that we have disabled the native Windows firewall. I open up a few support tickets with our vendors and start chasing such wild geese as a layer-2 disjoint in our UCS fabric and “asymmetric routing” issues. No dice. Eventually someone gets the smart idea to move the IP address to another VM to try and narrow the issue down to either the VM or the environment. It’s the VM (of course it is)! These shenanigans take two weeks.

04/01

I finish re-platforming the development server onto a new Server 2012 R2 virtual machine. This in and of itself would be worth a post, since the best way I can summarize our deployment methodology is “guess-and-check”. Anyway, the immediate issue is now resolved. YAY!

05/01

I rebuild the entire development, testing, staging and production stack and migrate everything over except the production server, which is publicly accessible. The dev team wants to do a soft cutover instead of just moving the IP address to the new server. This means we will need to have our networking team make some changes to the perimeter proxy devices.

05/15

I catch up on other work and finish the roughly ten pages of forms, diagrams and a security plan that are required for a perimeter device change request.

06/06

I open a ticket upstream, discuss the change with the network team and make some minor modifications to the ticket.

06/08

I filled out the wrong forms and/or I filled them out incorrectly. Whoops.

06/17

After a few tries I get the right forms and diagrams filled out. The ticket gets assigned to the security team for approval.

06/20

Someone from the security team picks up the ticket and begins to review it.

07/06

Sweet! Two weeks later my change request gets approval from the security team (that’s actually pretty fast). The ticket gets transferred back to the networking team, which begins to work on implementation.

07/18

I create a separate ticket to track the SSL/TLS certificate I will need for the HTTPS-enabled services on the server. This ticket follows a similar parallel process: documentation is filled out and validated, goes to the security team for approval and then back to the networking team for implementation. My original ticket for the perimeter change is still being worked on.

08/01

A firmware upgrade on the perimeter devices breaks high availability. The network team freezes all new work until the issue is corrected (they start their internal change control process for emergency break/fix issues).

08/24

The server’s HTTPS certificate has to be replaced before it expires at the end of the month. Our devs’ business group coughs up the few hundred dollars. We had planned to use the perimeter proxies’ wildcard certificate at no extra cost, but oh well, too late.

09/06

HA restored! Wonderful! New configuration changes are released to the networking team for implementation.

10/01

Nothing happens upstream… I am not sure why. I call about once a week and hear: we are swamped, two weeks until implementation, should be soon.

11/03

The ticket gets transferred to another member of the network team and within a week the configuration change is ready for testing.

11/07

The dev team discovers an issue. Their application is relying on the originating client IP address for logging and what basically amounts to “two-factor authentication” (i.e., a username is tied to an IP address). This breaks fantastically once the service gets moved behind a web proxy. Neat.

11/09

I work with the dev lead and the networking team to come up with a solution. Turns out we can pass the originating IP address through the proxies, but it changes the server-side variable that their code needs to reference.

11/28

Business leaders say that the code change is a no-go. We are about to hit their “code/infrastructure freeze” period that lasts from December to April. Fair enough.

12/01

We hit the code freeze. Things open back up again in mid-April. Looking ahead, I already have infrastructure work scheduled late April and early May which brings us right around to June: one year.

*WHEW* Let’s take a break. Here’s doge to entertain you during the intermission:

 

My team is asking for a change that involves taking approximately six services that are already publicly accessible via a legacy implementation, moving those services to a single IP address and placing an application proxy between the Big Bad Internet and the hosting servers. Nothing too crazy here.

Here are some parting thoughts to ponder.

  • ITIL. Love it or hate it, ITIL adds a lot of drag. I hope it adds some value.
  • I don’t mean to pick on the other teams but it clearly seems like they don’t have enough resources (expertise, team members, whatever they need they don’t have enough of it).
  • I could have done better with all the process stuff on my end. Momentum is important, so I probably should not have let some of that paperwork sit for as long as it did.
  • The specialization of teams cuts both ways. It is easy to slip from being isolated and silo-ed to just basic outright distrust, and when you assume that everyone is out to get you (probably because that’s what experience has taught you) then you C.Y.A. ten times till Sunday to protect yourself and your team. Combine this with ITIL for a particularly potent blend of bureaucratic misery.
  • Centralized teams like networking and security that are not embedded in different business groups end up serving a whole bunch of different masters, all of whom are going in different directions and want different things. In our organization this seems to mean that the loudest, meanest person holding their feet to the SLA gets whatever they want, at the expense of their quieter customers like myself.
  • Little time lags magnify delay as the project goes on. Two weeks in security approval limbo puts me four weeks behind a few months down the road, which means I miss my certificate expiry deadline, which means I need to fill out another ticket, which puts me further behind, and so on ad infinitum.
  • This kind of shit is why developers are just saying “#YOLO! Screw you Ops! LEEEEROY JENKINS! We are moving to the Cloud!” and ignoring all this on-prem, organizational pain and doing DevDev (it’s like DevOps but it leads to hilarious brokenness in other new and exciting ways).
  • Public Sector IT runs on chaos, disorder and the frustrations of people just trying to Do Things. See anything ever written by Kafka.
  • ITIL. I thought it was worth mentioning twice because that’s how much overhead it adds (by design).

 

Until next time, may your tickets be speedily resolved.

The HumbleLab: Storage Spaces with Tiers – Making Pigs Fly!

I have mixed feelings about homelabs. It seems ludicrous to me that, in a field that changes as fast as IT, employers do not invest in training. You would think on-the-clock time dedicated to learning would be an investment that would pay itself back in spades. I also think there is something psychologically dangerous in working your 8-10 hour day and then going home and spending your evenings and weekends studying/playing in your homelab. Unplugging and leaving computers behind is pretty important; in fact, I find the more I do IT the less interest I have in technology in general. Something, something, make an interest a career and then learn to hate it. Oh well.

That being said, IT is a fast changing field and if you are not keeping up one way or another, you are falling behind. A homelab is one way to do this, plus sometimes it is kind of nice to just do stuff without attending governance meetings or submitting to the tyranny of your organization’s change control board.

Being the cheapskate that I am, I didn’t want to go out and spend thousands of my own dollars on hardware like all the cool cats in r/homelab, so I just grabbed some random crap lying around work, partly just to see how much use I could squeeze out of it.

Dell OptiPlex 990 (circa 2012)

  • Intel i7-2600, 3.4GHz 4 Cores, 8 Threads, 256KB L2, 8MB L3
  • 16GBs, Non-ECC, 1333MHz DDR3
  • Samsung SSD PM830, 128GBs SATA 3.0 Gb/s
  • Samsung SSD 840 EVO 250GBs SATA 6.0 Gb/s
  • Seagate Barracuda 1TB SATA 3.0 Gb/s

The OptiPlex shipped with just the 128GB SSD, which only had enough storage capacity to host the smallest of Windows virtual machines, so I scrounged up the two other disks from desktops that were slated for recycling. I am particularly proud of the Seagate because, if the date code on the drive is to be believed, it was originally manufactured sometime in late 2009.

A bit of a pig huh? Let’s see if we can make this little porker fly.

A picture of the inside of HumbleLab

Oh yeah… look at that quality hardware and cable management. Gonna be hosting prod workloads on this baby.

I started out with a pretty simple/lazy install of Windows Server 2012 R2 and the Hyper-V role. At this point in time I only had the original 128GB SSD that the operating system was installed on and the ancient Seagate being used for .VHD/.VHDX storage.

Performance was predictably abysmal, especially once I got a SQL VM set up and “running”:

IOmeter output

At this point, I added in the other 256GB SSD, destroyed the volume I was using for .VHD/.VHDX storage and recreated it using Storage Spaces. I don’t have much to say about Storage Spaces here since I have such a simple/stupid setup. I just created a single Storage Pool using the 256GB SSD and the 1TB SATA drive. Obviously, with only two disks, I was limited to a Simple storage layout (no disk redundancy / YOLO mode). I did opt to create a larger 8GB write cache using PowerShell, but other than that I pretty much just clicked through the wizard in Server Manager:
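
The write-cache piece looked roughly like this; the pool and tier names are placeholders, and the sizes are illustrative rather than my exact numbers:

  # Tier the pool's SSD and HDD and carve out a tiered virtual disk with a bigger
  # write-back cache than the wizard hands you by default.
  $ssd = New-StorageTier -StoragePoolFriendlyName 'HumbleLab-Pool' -FriendlyName 'SSD-Tier' -MediaType SSD
  $hdd = New-StorageTier -StoragePoolFriendlyName 'HumbleLab-Pool' -FriendlyName 'HDD-Tier' -MediaType HDD
  New-VirtualDisk -StoragePoolFriendlyName 'HumbleLab-Pool' -FriendlyName 'VM-Storage' `
      -StorageTiers $ssd, $hdd -StorageTierSizes 200GB, 900GB `
      -ResiliencySettingName Simple -ProvisioningType Fixed -WriteCacheSize 8GB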

 

Let’s see how we did:

IOMeter Results with Storage Tiers

A marked improvement! We tripled our IOPS from a snail-like 234 to a tortoise-like 820 and managed to reduce the response time from 14ms to 5ms. The latency reduction is probably the most important. We generally shoot for under 2ms for our production workloads but considering the hardware 5-6ms isn’t bad at all.

 

What if I just run the .VHDX file directly on the shared 128GB SSD that the Hyper-V host is using, without any Storage Tiers involved at all?

Hmm… not surprisingly, the results are even better, but what was surprising is by how much. We are looking at sub-2ms latency and about four and a half times more IOPS than my Storage Spaces Virtual Disk can deliver.

Of course benchmarks, especially quick and dirty ones like this, are very rarely the whole story and likely do not even come close to simulating your true workload, but at least they give us a basic picture of what my aging hardware can do: SATA = glacial, Storage Tiers with SSD caching = OK, SSD = good. It also illustrates just how damn fast SSDs are. If you have a poorly performing application, moving it over to SSD storage is likely going to be the single easiest thing you can do to improve its performance. Sure, the existing bottleneck in the codebase or database design is still there, but does that matter anymore if everything is moving 4x faster? Like they say, Hardware is Cheap, Developers are Expensive.

I put this together prior to the general release of Server 2016, so it would be interesting to see if running this same setup on 2016’s implementation of Storage Spaces with ReFS instead of NTFS would yield better results. It also would be interesting to refactor the SQL database and, at the very least, place the TempDB, SysDBs and log files directly onto the host’s 128GB SSD. A project for another time, I guess…

Until next time… may your pigs fly!

A flying pig powered by a rocket


Don’t Build Private Clouds? Then What Do We Build?

Give Subbu Allamaraju’s blog post Don’t Build Private Clouds a read if you have not yet. I think it is rather compelling but also wrong in a sense. In summation: 1) Your workload is not as special as you think it is, 2) your private cloud isn’t really a “cloud” since it lacks the defining scale, resiliency, automation framework, PaaS/SaaS and self-service on-demand functionality that a true cloud offering like AWS, Azure or Google has and 3) your organization is probably doing a poor job of building a private cloud anyway.

Now let’s look at my team: we maintain a small Cisco FlexPod environment, about 14 ESXi hosts, 1.5TBs of RAM and about 250TBs of storage. We support about 600 users, and I am primary for the following:

  • Datacenter Virtualization: Cisco UCS, Nexus 5Ks, vSphere, NetApp and CheckPoint firewalls
  • Server Infrastructure: Platform support for 150 VMs, running mostly either IIS or SQL
  • SCCM Administration (although one of our juniors has taken over the day to day tasks)
  • Active Directory Maintenance and Configuration Management through GPOs
  • Team lead responsibilities under the discretion of my manager for larger projects with multiple groups and stakeholders
  • Escalation point for the team, point-of-contact for developer teams
  • Automation and monitoring of infrastructure and services

My day-to-day consists of work supporting these focus areas: assisting team members with a particularly thorny issue, migrating in-house applications onto new VMs, working with our developer teams to address application issues, maintaining the existing platform, holding meetings to talk about all this work with my team, attending meetings to talk about all this work with my managers, sending emails about all this work to the business stakeholders and a surprising amount of tier-1 support (see here and here).

If we waved our magic wand and moved everything into the cloud tomorrow, particularly into PaaS where the real value to cost sweet spot seems to be, what would I have left to do? What would I have left to build and maintain?

Nothing. I would have nothing left to build.

Almost all of my job is working on back-end infrastructure, doing platform support or acting as a human API/“automation framework”. As Subbu states, I am a part of the cycle of “brittle, time-consuming, human-operator driven, ticket based on-premises infrastructure [that] brews a culture of mistrust, centralization, dependency and control“.

I take a ticket saying, “Hey, we need a new VM,” and I run some PowerShell scripts to create and provision said new VM in a semi-automated fashion, then copy the contents of the older VM’s IIS directory over. I then notice that our developers are passing credentials in plaintext back and forth through web forms and .XML files between different web services, which kicks off a whole week’s worth of work to re-do all their sites in HTTPS. I then set up a meeting to talk about these changes with my team (cross training) and, if we are lucky, someone upstream actually gets to my ticket and these changes go live. This takes about three to four weeks, optimistically.
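
Those “PowerShell scripts” are nothing fancy; stripped down, the provisioning piece looks something like this (the vCenter, template, cluster and datastore names are placeholders, not our real ones):

  # Clone a new VM from a template and power it on. Every name here is illustrative.
  Import-Module VMware.VimAutomation.Core
  Connect-VIServer -Server 'vcenter.example.local'

  New-VM -Name 'NEW-APP-01' `
         -Template (Get-Template 'Server2012R2-Template') `
         -ResourcePool (Get-Cluster 'Prod-Cluster') `
         -Datastore (Get-Datastore 'NetApp-VOL1') `
         -OSCustomizationSpec (Get-OSCustomizationSpec 'Domain-Join')

  Start-VM -VM (Get-VM -Name 'NEW-APP-01')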

In the new world our intrepid developer tweaks his Visual Studio deployment settings and his application gets pushed to an Azure WebApp which comes baked in with geographical redundancy, automatic scale-out/scale-up, load-balancing, a dizzying array of backup and recovery options, integration with SaaS authentication providers, PCI/OSI/SOC compliance and the list goes on. This takes all of five minutes.

However, here is where I think Subbu gets it wrong: of our 150 VMs, about 50% belong to those “stateful monoliths”. They are primarily line-of-business applications with proprietary code bases that we don’t have access to, or legacy applications built on things like PowerBuilder that no one understands anymore. They are spread out across 10 to 20 VMs to provide segmentation but have huge monolithic database designs. It would cost us millions of dollars to re-factor these applications into a design that could truly take advantage of cloud services in their PaaS form. Our other option would be cloud-based IaaS, which from the developer’s perspective is not that different from what we are currently doing, except that it costs more.

I am not even going to touch on our largest piece of IT spend, a line-of-business application with “large monolithic databases running on handcrafted hardware” in the form of an IBM z/OS mainframe. Now our refactoring cost is in the tens of millions of dollars.

 

If this magical cloud world comes to pass what do I build? What do I do?

  • Like some kind of carrion lord, I rule over my decaying infrastructure and accumulated technical debt until everything legacy has been deprecated and I am no longer needed.
  • I go full retar… err… endpoint management. I don’t see desktops going away anytime soon despite all this talk of tablets, mobile devices and BYOD.
  • On-prem LAN networking will probably stick around but unfortunately this is all contracted out in my organization.
  • I could become a developer.
  • I could become a manager.
  • I could find another field of work.

 

Will this magical cloud world come to pass?

Maybe in the real world, but I have a hard time imagining how it would work for us. We are so far behind in terms of technology and so organizationally dysfunctional that I cannot see how moving 60% of our services from on-prem IaaS to cloud-based IaaS would make sense, even if leadership could lay off all of the infrastructure support people like myself.

Our workloads aren’t special. They’re just stupid and it would cost a lot of money to make them less stupid.

 

The real pearl of wisdom…

“The state of [your] infrastructure influences your organizational culture.” Of all the things in that post, I think this is the most perceptive, as it is in direct opposition to everything our leadership has been saying about IT consolidation. The message we have continually been hearing for the last year and a half is that IT Operations is a commodity service: the technology doesn’t matter, the institutional knowledge doesn’t matter, the choice of vendor doesn’t matter, the talent doesn’t matter. It is all essentially the same, and it is just a numbers game to find the implementation that is the most affordable.

As a nerd at heart I have always disagreed with this position, because I believe your technology choices determine what is possible (i.e., if you need a plane but you get a boat, that isn’t going to work out for you). But the insight here that I have never really deeply considered is that your choice of technology drastically affects how you do things. It affects your organization’s cultural orientation to IT. If you are a Linux shop, does that technology choice precede your dedication to continuous integration, platform-as-code and remote collaboration? If you are a Windows shop, does that technology choice precede your stuffy corporate culture of ITIL misery and on-premise commuter hell? How much does our technological means of accomplishing our economic goals shape our culture? How much indeed?

 

Until next time, keep your stick on the ice.

SCCM SUP Failing to Download Updates – Invalid Certificate Error

I am currently re-building my System Center lab, which includes re-installing and re-configuring a basic SCCM instance. I was in the process of getting my Software Update Point (SUP) set up when a few of my SUP groups failed to download their respective updates and deploy correctly. I dimly remember working through this issue in our production environment a few years ago when we moved from SCCM 2012 to 2012 R2, and I cursed myself for not taking better notes, so here we are atoning for our past sins!

Here’s the offending error:

SCCM SUP Download Error Dialog - Invalid Cert

 

SCCM is a chatty little guy and manages to generate some 160 or so different log files, spread out across a number of different locations (see the highly recommended MSDN blog article, A List of SCCM Log Files). Identifying and locating the logs relevant to whatever issue you are troubleshooting is about half the battle with SCCM. Unfortunately, I couldn’t seem to find patchdownloader.log in its expected location of SMS_CCM\Logs. Turns out that if you are running the Configuration Manager Console from an RDP session, patchdownloader.log will get stored in C:\Users\%USERNAME%\AppData\Local\Temp instead of SMS\Logs. Huh. In my case, I am RDPing to the SCCM server and running the console, but I wonder, if I run it from a client workstation, whether the resulting log will end up locally on that workstation in my %TEMP% folder or on the SCCM SUP server in SMS_CCM\Logs… an experiment for another day, I guess.
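
If you want to save yourself the hunt, a quick way to find the newest copy wherever your session dropped it (nothing SCCM-specific here, just a sketch):

  # Check the local %TEMP% for the most recently written PatchDownloader log.
  Get-ChildItem -Path $env:TEMP -Filter 'PatchDownloader*.log' |
      Sort-Object LastWriteTime -Descending |
      Select-Object -First 1 FullName, LastWriteTime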

 

Here’s the juicy bits:

 

A couple of interesting things to note from here:

  • We get the actual error code (0x80073633), which you can “resolve” back to the human-readable message the Console presents you with using CMTrace’s Error Lookup functionality. Sometimes this turns out to be useful information.
  • We get the download location for the update
  • We get the distribution package location that the update is being downloaded to

If I manually browse to the wsus.ds.download.windowsupdate.com URL, I manage to download the update without issue. No certificate validation problems, which one would expect considering that the connection is going over HTTP according to the log. Makes one wonder how the resulting error was related to an “invalid certificate”…

OK. How do I fix it? Well, like most things SCCM, the solution is as stupid as it is brilliant. Manually download the update from the Microsoft Update Catalog. Then go find the offending update in its respective Software Update Group by referencing the KB number and download it again, but this time set your Download Location to the directory that already contains it.

Whoops. Didn’t work.

Take a look at the first attempt to download the content… SCCM is looking for Windows10.0-KB3172989-x64.cab so it can be downloaded into my %TEMP% directory and then eventually moved off to the Deployment Package’s source location at \\SCCM\Source Files\Windows\Updates\2016.

The file I downloaded is not named Windows10.0-KB3172989-x64.cab; it’s actually an .msu file. Use 7-Zip or a similar tool to pull the .cab file out of it, and now SCCM SUP should successfully “download” the update and ship it off to the source location for your Deployment Package.
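
If you would rather skip 7-Zip, the built-in expand.exe can pull the .cab out of the .msu; a sketch using the file names from above and a made-up working folder (which needs to exist first):

  # Extract just the KB .cab from the manually downloaded .msu, then point the
  # download wizard at the folder holding the extracted .cab.
  expand.exe -F:Windows10.0-KB3172989-x64.cab .\Windows10.0-KB3172989-x64.msu C:\Temp\KB3172989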

 

FFFUUUU Internet Explorer… a rant about an outage

I am not normally a proponent of hating on Microsoft, mostly because I think much of the hate they get for design decisions is simply because people do not take the time to understand how Microsoft’s new widget of the month works and why it works that way. I also think it is largely pointless. All Hardware Sucks, All Software Sucks once you really start to dig around under the hood. That and Microsoft doesn’t really give a shit about what you want and why you want it. If you are an enterprise customer they have you by the balls and you and Microsoft both know it. You are just going to have to deal with tiles, the Windows Store and all the other consumer centric bullshit that is coming your way regardless of how “enterprise friendly” your sales rep says Microsoft is.

That being said, I cannot always take my own medicine of enlightened apathy and Stockholm Syndrome, and this is one of those times. We had a Windows Update deployed this week that broke Internet Explorer 11 on about 60-75% of our fleet. Unfortunately, we have a few line-of-business web applications that rely on it. You can imagine how that went.

Now, there are a lot of reasons why this happened, but midway through my support call, while we were piecing together an uninstallation script to remove all the prerequisites of Internet Explorer 11, I had what I call a “boss epiphany”. A “boss epiphany” is when you step out of your technical day-to-day and start asking bigger questions, so named because my boss has a habit of doing this. I generally find it kind of annoying in a good-natured way, because I feel like there is a disregard for the technical complexities I have to deal with in order to make things work, but I can’t begrudge that he cuts to the heart of the matter. And six hours into our outage, what was the epiphany? “Why is this so fucking hard? We are using Microsoft’s main line-of-business browser (Internet Explorer) and their main line-of-business tool for managing workstations in an enterprise environment (SCCM).”

The answer is complicated from (my) technical perspective, but the “boss epiphany” is a really good point. This shit should be easy. It’s not. Or I suck at it. Or maybe both. AND that brings me to my rant. Why in the name of Odin’s beard is software deployment and management in Windows so stupid? All SCCM is really doing is running an installer. For all its “Enterprisy-ness” it just runs whatever stupid installer you get from Adobe, Microsoft or Oracle. There’s no standardization, no packaging and no guarantee anything will actually be atomic. Even MSI installers can do insane things, like accept arguments in long form (TRANSFORMS=stupidapp.mst) but not short form (/t stupidapp.mst), or, my particular favorite, search for ProductKey registry keys for any older version of the application and then try to uninstall it via the original .MSI. This fails horribly when that .MSI lives in a non-persistent client-side cache (C:\Windows\ccmcache). Linux was created by a bunch of dope-smoking neckbeards and European commies, and they have had solid, standardized package management for like ten years. I remember taking a Debian Stable install up to Testing, then downgrading to Stable, and then finally just upgrading the whole thing to Unstable. AND EVERYTHING WORKED (MOSTLY). Let’s see you try that kind of kernel and userland gymnastics with Windows. Maybe I just have not spent enough time supporting Linux to hate it yet, but I cannot help but admire the beauty of apt-get update && apt-get upgrade when most of my software deployments mean gluing various .EXEs and registry keys together with batch files or PowerShell. It’s 2016 and this is how we are managing software deployments? I feel like I’m taking crazy pills here.
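
For the uninitiated, “gluing .EXEs and registry keys together” looks something like this; the installer name, switches and registry path are placeholders rather than a real package of ours:

  # Run the vendor's silent installer and bail out if it fails.
  $installer = Join-Path $PSScriptRoot 'StupidApp-Setup.exe'
  $proc = Start-Process -FilePath $installer -ArgumentList '/S', '/norestart' -Wait -PassThru
  if ($proc.ExitCode -ne 0) { exit $proc.ExitCode }

  # Stamp a registry key so SCCM's detection method has something deterministic to find.
  New-Item -Path 'HKLM:\SOFTWARE\MyOrg\StupidApp' -Force | Out-Null
  Set-ItemProperty -Path 'HKLM:\SOFTWARE\MyOrg\StupidApp' -Name 'Version' -Value '1.2.3'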

 

Let’s look at the IEAK as a specific example, since I suspect it’s half the reason I got us into this mess. The quotes from this r/sccm thread are perfect here:

  • “IEAK can’t handle pre reqs cleanly. Also ‘installs’ IE11 and marks it as successful if it fails due to prereqs”
  • “Dittoing this. IEAK was a nightmare.”
  • “IEAK worked fine for us apart from one issue. When installing it would fail to get a return from the WMI installed check of KB2729094 quick enough so it assumed it wasn’t installed and would not complete the IE11 install.”
  • “It turns out that even though the IEAK gave me a setup file it was still reaching out to the Internet to download the main payload for IE”
  • “I will never use IEAK again for an IE11 deployment, mainly for the reason you stated but also the CEIP issue.”

And that’s the supported, “Enterprise” deployment method. If you start digging around on the Internet, you see there are people out there deploying Internet Explorer 11 with Task Sequences, custom batch files, custom PowerShell scripts and the PowerShell Deployment Toolkit. Again, the technical part of me understands that Internet Explorer is a complicated piece of software and that there are reasons it is deployed this way, but ultimately, if it is easier for me to deploy Firefox with SCCM than Internet Explorer… well, that just doesn’t seem right, now does it?

Until next time… throw your computer away and go outside. Computers are dumb.

Can’tBan… Adventures with Kanban

Comic about Agile Programming

. . .

We started using Kanban in our shop about six months ago. This in and of itself is interesting, considering we are nominally an ITIL shop and the underlying philosophies of ITIL and Kanban seem diametrically opposed. Kanban, at least from my cursory experience, is focused on speeding up the flow of work, identifying bottlenecks and meeting customers’ requirements more responsively. It is all about reducing “cycle time”, that is, the time it takes to move a unit of work through to completion. ITIL is all about slowing the flow of work down and adding rigor and business oversight into IT processes. A side effect of this is that the cycle time increases.

If you are not familiar with Kanban, the idea is simple. Projects get decomposed into discrete tasks, tasks get pulled through the system from initiation to finish, and each stage of the project is represented by a queue. Queues have work in progress (WIP) limits, which means only so many tasks can be in a single queue at the same time. The backlog is where everything you want to get done sits before you actually start working on it. DO YOU WANT TO KNOW MORE?

As I am sure the one reader of my blog knows, I simultaneously struggle with time management and am fascinated by it. What do I think about Kanban? I have mixed feelings.

The Good

  • Kanban is very visual. I like visual things – walk me through your application’s architecture over the phone and I have no idea what you have just told me five minutes later. Show me a diagram and I will get it. This appeal of course is personal and will vary widely depending on the individual.
  • Work in progress (WIP) limits! These are a fantastic concept. The idea that your team can only process so much work in a given unit of time, and that constantly context switching between tasks has an associated cost, is obvious to those in the trenches but not so much to those higher powers that exist beyond the Reality Impermeability Layer of upper management. If you literally show them there is not enough room in the execution queue for another task, they will start to get it. All of a sudden you and your leadership can start asking the real questions… why is task A being worked on before task Z? Do you need more resources to complete all the given tasks? Maybe task C can wait awhile? Why is task G moving so slowly? Why are we bottlenecked at this phase?
  • Priorities are made explicit. If I ever have doubt about what I am expected to be working on, I can just check the execution queue. If my manager wants me to work on another task that is outside the execution queue, then we can have a discussion about whether to bump something back or hold the current “oh hey, can you take care of this right now?” task in the backlog. I cannot overstate how awesome this is. It makes the cost of context switching visible, keeps my tactical work aligned with my manager’s strategic goals, and makes us think about what tasks matter most and in what order they should get done. This is so much better than the weekly meeting, where more and more tasks get dumped into some nebulous to-do list that my team struggles through while leadership wonders why the “Pet Project of the Month” isn’t finished yet.

The Interesting

  • The scope of work that you set as a singular “task” is really important. If a single task is too large, then it doesn’t accurately map to the work being done on a day-to-day basis and you lose out on Kanban’s ability to bring bottlenecks and patterns to the surface where they can be dealt with. If the tasks are too small, then you end up spending too much time in the “meta-analysis” of figuring out which task is where instead of actually accomplishing things.
  • The type of work you decide to count as a Kanban task also has a huge effect on how your Kanban actually “runs”. Do you track break/fix, maintenance tasks, meetings, projects, all of the above? I think this really depends on how your team works and what they work on so there is no hard or fast answer here.
  • Some team members are more equal than others. We set our WIP limit to Number of Team Members * 2, the idea being that two to three tasks is about all a single person can really focus on and still be effective (i.e., “The Rule of Threes”). Turns out, though, that in practice 60% of tasks are owned by only 20% of the team. Huh. I guess that would be called a bottleneck?

The Bad

  • Your queues need to actually be meaningful. Just having separate queues named “Initiation”, “Documentation” and “Sign-off” only works if you have discrete actions that are expected for the tasks in those queues. In our shop, what I have found is that only one queue matters: the execution queue. We have other queues, but since they do not have requirements and WIP limits attached to them, they are essentially just to-do lists. If a task goes into the Documentation queue, then you better damn well document your system before you move the task along. What we have is essentially a one-queue Kanban system with a single WIP limit. If we restructured our Kanban process and truly pulled a task through each queue from beginning to finish, I think we would see much more utility.
  • Flow vs. non-flow. An interesting side effect of not having strong queue requirements is that tasks don’t really “flow”. For example: we are singularly focused on the execution queue, so every time I finish a task it gets moved into the documentation queue, where it piles up with all the other stuff I never documented. Then, instead of backing off and making time for our team to document before pulling more work into the system, I re-focus on whatever task just got promoted into the execution queue. Maybe this is why our documentation sucks so much? What this should tell us is: 1) we have too many items in the documentation queue to keep pulling in new work, 2) the documentation queue needs a smaller WIP limit, 3) we need to make the hard decision to put off work until documentation is done if we actually want documentation and 4) documentation is work, and work takes time. If we never give staff the time to document, then we will end up with no documentation. I don’t necessarily think everything needs to be pulled through each queue. Break/fix work is often simple, ephemeral and, if your ticket system doesn’t suck, self-documenting. You could handle these types of tasks with a standalone queue.
  • Queues should have time limits. There are only two states for a given unit of work: you are either actively working on it or you are not. Kanban should have the same relationship with tasks in a given queue. If a task has sat in the planning queue for a week without any actual planning occurring, then it should be removed. Either the next queue is full (bottleneck), the planning queue is full (bottleneck/WIP limit too high) or your team is not working on your Kanban tasks (other, larger systemic problems). Aggressively “reset” tasks by sending them to the backlog if no work is being performed on them, and enforce your queue requirements; otherwise all you have done is create six different “to-do-whenever-we-have-spare-time-which-is-never” lists that just collect tasks.
  • Our implementation of Kanban does not work as a time management tool because we only track “project” work. Accordingly, very little of my time is actually spent on Kanban tasks, since I am also doing break/fix, escalations, monitoring and preventive maintenance. This really detracts from the overall benefit of managing priorities, making them explicit and limiting context switching, since our Kanban board represents at best 25% of my team’s work.

In conclusion, there are some things I really like about Kanban, and with some tweaks I think our implementation could have a lot of utility. I am not convinced it will mix well with our weird combination of ITIL processes but no real help desk (see: Who Needs Tickets Anyway? and Those are Rookie Numbers according to r/sysadmin). We are getting value out of Kanban, but it needs some real changes or it will end up as just one more process of vague effectiveness.

It will be interesting to see where we are in another six months.

Until next time, keep your stick on the ice.

The Big Squeeze, Predictions in Pessimism


Layoff notice or stay the hell away from Alaska when oil is cheap… from sysadmin

 

I thought this would be a technical blog acting as a surrogate for my participation on ServerFault, but instead it has morphed into some kind of weird meta-sysadmin blog/soap box/long-form reply on r/sysadmin. I guess I am OK with that…

Alaska is a boom-and-bust economy, and despite having a lot going for us fiscally, between our tax structure, oil prices and the Legislature’s approach to the ongoing budget deficit we are doing our best to auger our economy into the ground. Time for a bit of gallows humor to commiserate with u/Clovis69! The best part of predictions is that you get to see how hilariously uninformed you were down the road! Plus, if you are going to draw straws, you might as well take bets on who gets the shortest one.

Be forewarned: I am not an economist, I am not even really that informed, and if you are my employer, future or otherwise, I am largely being facetious.

The Micro View (what will happen to me and my shop)

  • We will take another 15-20% personnel cuts in IT operations (desktop, server and infrastructure support). That will bring us to close to a 45% reduction in staff since 2015.
  • We will take on additional IT workload as our programming teams continue to lose personnel and consequently shed operational tasks they were doing independently.
  • We will be required to adopt a low-touch, automation-centric support model in order to cope with the workload. We will not have the resources to do the kind of interrupt-driven, in-person support we do now. This is a huge change from our current culture.
  • We will lean really hard on folks that know SCCM, PowerShell, Group Policy and other automation frameworks. Tier-2/Tier-3 will come under more pressure as the interrupt rate increases due to the reduction in Tier-1 staff.
  • Team members that do not adopt automation frameworks will find themselves doing whatever non-automatable grunt work there is left. They will also be more likely to lose their jobs.
  • We will lose a critical team member who is performing this increased automation work, as they can simply get paid better elsewhere without a budget deficit hanging over their head.
  • If we do not complete our consolidation work to standardize and bring silo-ed teams together before we lose what little operational capacity we have left, our shop will slip into full-blown reactive mode. Preventive maintenance will not get done, and in two years’ time things will be Bad (TM). I mean like straight-up r/sysadmin horror story Bad (TM).
  • I would be surprised if I am still in the same role in the same team.
  • We will somehow have even more meetings.

The Macro View (what will happen to my organization)

Preliminary plans to consolidate IT operations were introduced back in early 2015. In short, our administrative functions, including IT operations, are largely decentralized and done at the department level. This leads to a lot of redundant work being performed, poor alignment of IT to the business goals of the organization as a whole, the inability to capture or recover value from economies of scale, and widely disparate resources, functionality and service delivery. At a practical level, what this means is there are a whole lot of folks like myself all working to assimilate new workload, standardize it and then automate it as we cope with staff reduction. We are all hurriedly building levers to help us move more and more weight, but no one has stopped to say, “Hey guys, if we all work together to build one lever we can move things that are an order of magnitude heavier.” Consequently, as valiant as our individual efforts are, we are going to fail. If I lose four people out of a team of eight, no level of automation that I can come up with will keep our heads above water.

At this point I am not optimistic about our chances for success. The tempo of a project is often determined by its initial pace. I have never seen an IT project move faster as time goes on in the public sector; generally it moves slower and slower as it grinds through the layers of bureaucracy and swims upstream against the prevailing current of institutional inertia and resistance. It has been over a year without any progress visible to the rank-and-file staff such as myself, and we only have about one, maybe two years of money left in the piggy bank before we find that the income side of our balance sheet is only 35% of our expenses. To make things even more problematic, entities that do not want to give up control have had close to two years to actively position themselves to protect their internal IT.

I want IT consolidation to succeed. It seems like the only possible way to continue providing a similar level of service in the face of a 30-60% staff reduction. I mean, what the hell else are we going to do? Are we going to keep doing things the same way until we run out of money, turn the lights off and go home? If it takes one person on my team to run SCCM for my 800 endpoints, and three people from your team to run SCCM for your 3,000 endpoints, how much do you want to bet that the four of them together could run SCCM for all 12,000 of our endpoints? I am pretty damn confident they could. And this scenario repeats everywhere. We are all bailing out our own boats, and in each boat is one piece of a high-volume bilge pump, but we don’t trust each other, no one is talking, and we are all moving in a million different directions instead of stopping, getting over whatever stupid pettiness keeps us from doing something smart for once, and collectively putting together our badass high-volume bilge pump. We will either float together or drown separately.

I seem to recall a similar problem from our nation’s history…

Benjamin Franklin's Join or Die Political Cartoon

Things in our Datacenter that Annoy Me

Or alternatively how I learned to stop worrying about the little things…

In this post, I complain about little details that show my true colors as some kind of pedantic, semi-obsessive, detail-oriented system administrator. I mean, I try to play it cool but inside I am really freaking out, man! Not really, but also kind of yes. More on that later.

 

Our racks are not deep or wide enough

Our racks were not sized correctly initially. They are quite “shallow”. A Dell R730 on ReadyRails is about 28″ deep, which is a pretty standard mounting depth for full-size rackmount equipment. In our racks, that leaves only about 4-6″ of space between the posts and the door at the back of the rack. This complicates cabling since we do not have a lot of room to work with, but it really gets annoying with PDUs. See below.

The combination of shallow depth and lack of width leads to weird PDU configurations

PDU Setup

The racks are too shallow to mount the PDUs parallel with the posts, plugs facing out towards the door, and too narrow to stack both PDUs on one side. The PDUs end up mounted sideways, sticking out into the area between the posts, blocking airflow and making cabling a pain in the ass.

Check out u/tasysadmin’s approach, which is a big improvement over ours. The extra depth and width allow both power circuits (you do have two redundant power circuits, right?) to move over to one side of the rack and slide into the gap between the posts and the casing. This has a whole bunch of benefits: airflow is not restricted, you have more working space for cabling, your power does not have to cross the back of the rack and you can separate your data and your power.

Beautiful Rack Cabling

This also means that some of our racks have had their posts moved in beyond the standard 28″ mounting depth to better accommodate our PDUs, with the result that only two out of five racks can actually take a Dell PowerEdge.

Data and power not separated

You ideally want to run power on one side of the rack and data on the other. Most people will cite electromagnetic interference as a good reason for doing this but I have yet to see a problem caused by it (knock on wood). That being said, it is still a good idea to put some distance between the two, much like your recently divorced aunt and uncle at family functions. There are plenty of other good reasons for keeping data and power separate, most of which center around cabling hygiene – it helps keep things much cleaner because your data cables tend to run up and down the rack posts headed for things like your top-of-rack switch, whereas your power needs to go other places (i.e., equipment). It is a lot easier to bundle cables if they more or less share the same cable path.

Cannot access cable tray because of PDU cables

Cable Tray

This is just another version of “data and power are not separated”. Our power and data both come in at the top of the rack. That means the 4/C 10 AWG feeds for each PDU, which are about 0.5″ in diameter, are draped across our cable tray, which just sits on top of the racks instead of being suspended from a ladder bar (another great injustice!). I bet those conductors generate quite the electromagnetic field; it would be nice if they were more than 4″ away from some of our 10 Gbps interconnects, huh? This arrangement also makes the cable tray a huge pain to use. You have to move all the PDU power cables off of it, then pop the lid off in segments to move your cables around. Or you can just run your cables over the top of the rack and hope the fiber survives, like we do. Again, not ideal.

Inconsistent fastener use for mounting equipment

This one sounds kind of innocuous but it is one of those small details that make your life so much easier. Pick a fastener type and size and stick with it. I am partial to M6 because the larger threads are harder to strip out and the head has more surface area for your driver’s bit to contact. It is pretty annoying to change tools every time the fastener type changes instead of just setting the torque on your driver and going for it. Also – don’t even think of using self-tapping fasteners. They make cage nuts and square holes in rack posts for a reason.

Improper rail mounting and/or retention

Your equipment comes with mounting instructions and you should probably follow them. Engineers calculate how much weight a particular rail can bear and then work out that you need four fasteners of grade X on each rail to adequately support the equipment. This all gets condensed into some terrible IKEA-level instructions that make you shake your head and wonder why your vendor could not afford a better technical writer given the obscene price of whatever you are racking. Once you decipher these arcane incantations, follow them. Don’t skip installing cage nuts and fasteners – if they say you need four, then you need four. It only takes two more minutes to do the job right.

AND FOR THE LOVE OF $DEITY, INSTALL WHATEVER HARDWARE IS REQUIRED TO RETAIN THE EQUIPMENT IN THE RAILS! Seriously. This is a safety issue. I am not sure why this step gets skipped and people just set things on the rails without using the screws to retain them to the posts, but racks move, earthquakes happen and this shit is heavy. I think most of our disk shelves are about 50 pounds. You do not want that falling out of the rack and onto your intern’s head.

Use ReadyRails (or vendor equivalent)

For about $80 you can have a universal, tool-less rail that installs in about 30 seconds. I would call that a good investment.

Inconsistent inventory tagging locations

I am guessing your shop maintains an inventory system and you probably have little inventory tags you affix to equipment. Do your best to make the spot where the inventory tag goes consistent and readable once everything is racked and stacked. The last thing you want to do is pull an entire rack apart because some auditor wants you to find the magical inventory tag stuck on a disk shelf in the middle of a 12-shelf aggregate.

It would also be a good idea to put your inventory tag in your documentation so you do not have to play a yearly game of “find the missing inventory tag”.
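Something as simple as a CSV keyed by asset tag goes a long way. Below is a minimal sketch of the lookup side; the function name, file path and column names (AssetTag, Hostname, Rack, RackUnit) are hypothetical, not our actual inventory schema.

```powershell
# Minimal sketch: answer "where does asset tag 123456 live?" from a CSV kept
# alongside the rack documentation. Path and column names are hypothetical.
function Find-AssetTag {
    param([Parameter(Mandatory)][string]$AssetTag)

    $inventory = Import-Csv -Path '.\datacenter-inventory.csv'
    $match = $inventory | Where-Object { $_.AssetTag -eq $AssetTag }

    if ($match) {
        $match | Format-Table AssetTag, Hostname, Rack, RackUnit -AutoSize
    } else {
        Write-Warning "Asset tag $AssetTag is not documented - go find it once and write it down."
    }
}

# Example: Find-AssetTag -AssetTag '123456'
```

The auditor gets an answer in seconds instead of a rack teardown.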

Cable labeling is not consistent (just use serialized cables)

I suck at cable labeling and documentation in general (see here), so this is a bit hypocritical. Nevertheless, I find there are four stages of cable labeling: nothing; consistent labeling of port and device on each end; confusion as labeled cables are reused but the labels are not updated; and finally, adoption of serialized cables where each end has a unique tag that is documented.

This is largely personal preference but the general rules are simple: keep it clean, keep it consistent and keep it current (your documentation, that is). The only thing worse than an unlabeled cable is a mislabeled cable.
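If you make it to the serialized-cable stage, the documentation can be one row per serial. Here is a minimal sketch of what the register and a lookup might look like; the file name, column names (Serial, AEndDevice, AEndPort, BEndDevice, BEndPort), serial format and hostname are made up for illustration.

```powershell
# Minimal sketch: trace a cable by its serialized tag from a register CSV
# instead of following it by hand. File and column names are hypothetical.
$cables = Import-Csv -Path '.\cable-register.csv'

# Where does cable C-00421 land on each end?
$cables | Where-Object { $_.Serial -eq 'C-00421' } |
    Format-List Serial, AEndDevice, AEndPort, BEndDevice, BEndPort

# Which serialized cables terminate on the top-of-rack switch? (hypothetical hostname)
$cables | Where-Object { $_.AEndDevice -eq 'tor-sw01' -or $_.BEndDevice -eq 'tor-sw01' } |
    Sort-Object Serial | Format-Table Serial, AEndPort, BEndDevice, BEndPort -AutoSize
```

Keeping that file current is the “keep it current” part – the script is trivial; the discipline is not.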

Gaps in rack mount devices

Shelf Gap

Why? Just why? I don’t know and will probably never know… but my best guess is that the rail on the top shelf was slightly bent during installation, and when we needed to add another shelf later the bent rail got in the way. Ten minutes spent up front could have saved ten hours down the road. If it turns out I am one post hole short of being able to install another shelf, I get to move all the workloads off this aggregate, pull out all the disk shelves until I reach this one, fix or replace the rail, re-rack and re-cable everything, re-create the aggregate and then move the workloads back.

 

Now that I have complained a bit (I am sure r/sysadmin will say I have it way too easy), I get to talk about the real lesson here: none of this shit matters.

On one level, these things do matter. All these little oversights accumulate technical debt that eventually comes back and bites you on the ass, and doing it right the first time is the easiest and most efficient way. On the other hand, none of this stuff directly breaks things. The fact that the power and data cabling are too close together for my comfort, or that there is a small gap in one of the disk shelf stacks, does not cause outages. We have plenty of things that do, however, and those demand my attention. So collectively, let’s take a deep breath, let it go, and stop worrying about it. It’ll get fixed someday.

The other lesson here is that nothing is temporary. If you cut a corner, particularly with physical equipment, that corner will remain cut until the equipment is retired. It is just too hard and costly to correct these kinds of oversights once you are in production. If you are putting a new system up, take some time to plan it out – consider the failure domain, how much fault tolerance and redundancy you need, labeling, inventory and all those other little things. You only get to stand this system up once. Go slow and give it some forethought; you may thank yourself one day.