Salary, Expectations and Automation

It has been an interesting few months. We have had a few unexpected projects pop up and I have ended up owning most of them. This left me feeling pretty beaten down and a little demoralized. I don’t like missing deadlines and I don’t like constantly switching from one task to the next without ever making headway. It’s not my preferred way to work.

One thing that I am continually trying to remind myself is that I should use the team. I don’t have to own everything, nor should I, so I started creating tickets on behalf of my users (we don’t have a policy requiring tickets) and just dumping them into our generic queue so someone else could pick them up.

Guess what happened? They sat there. Now there are a few reasons why things played out this way (see this post) but you can imagine this was not the result I was hoping for. I was hoping my tier-2 folks would have jumped in and grabbed some of these requests:

  • Review the GPOs applied to a particular group of servers and modify them to accommodate a new service account
  • Review some NTFS permissions and restructure them to be more granular
  • Create a new IIS site along with the corresponding certificate and coordinate with our AD team to get the appropriate DNS records put in place
  • Help one of our dev teams re-platform / upgrade a COTS application
  • Re-configure IIS on a particular site to support HTTPS.

Part of the reason we have so much work right now is that we are assuming responsibility for a department that previously had its own internal IT staff (Yay! Budget cuts!). Not everyone was happy about giving up “their IT guys”, and so during our team meetings we started reviewing work in the queue that was not getting moved along.

A bunch of these unloved tickets were “mine”, that is to say, they were originally requests that came directly to me, which I then created tickets for, hoping to bump them back into the queue. This should sound familiar. The consensus, though, was that this was “my work” and that I was not being diligent enough in keeping track of the ticket queue.

Please bear in mind for the next two paragraphs that we have a small, 12-person team. It is not difficult for us to get hold of another team member.

I’ll unpack the latter idea first. In a sense, I agree. I could do a better job of watching the queue, but that’s simply because I was not watching it. My perception, as someone who is nominally at the top of our support tier, was that our help desk watches the queue, catches interrupts from customers and then escalates stuff if they need assistance. I was thinking my tickets should come from my team and not directly from the queue.

The former idea I’m a little less sympathetic to. It’s not “my work”, it’s the team’s work, right? And here is where those sour grapes start to ferment… that list of tickets up there does not seem like “tier-3 work” to me. It seems like junior sysadmin work. If that is not the case then I have to ask: what are those guys doing instead? If that’s not work that tier-1/tier-2 handle, then what is?

In the end, of course, I took the tickets and did the work, which of course put me even further behind on some of my projects.

I have puzzled over our ticket system, support process and team dynamics quite a bit (see here, here and here) and there is a lot of different threads one could pull on, but a new explanation came to mind after this exercise: Maybe our tier-2 guys are not doing this work because they can’t? Maybe they just don’t have the skills to do those kinds of things and maybe it’s not realistic to expect people to have that level of skill, independence and work ethic for what we pay them? I hate this idea. I hate it because if that’s truly the case there is very little I can do to fix it. I don’t control our training budget or assign team priorities or have any ability to negotiate graduated raises matched with a corresponding training plan. I don’t do employee evaluations and I cannot put someone on an improvement plan and I certainly cannot let an employee go. But I really don’t like this idea because it feels like I’m crapping on my team. I don’t like it because it makes me feel guilty.

But are our salaries and expectations unrealistic?

  • Help Desk Staff (Tier-1) – $44k – $50k per year
  • Junior Sysadmins (Tier-2) – $58k – $68k per year
  • Sysadmins (Tier-3) – $68k – $78k per year

It’s a pretty normal “white collar” setup: salaried, no overtime eligibility, with health insurance and a 401k with a decent employer match. We can’t really do flexible work schedules or work-from-home, but we do have a pretty generous paid leave policy. However – this is Alaska, where everything is as expensive as the scenery is beautiful. A one-bedroom rental will run you at least $1200 a month plus utilities, which can easily be a few hundred dollars in the winter depending on your heating source. Gasoline is on average a dollar more per gallon than whatever it is currently in the Lower 48. Childcare is about $1100 a month per kiddo for kids under three. Your standard “American dream” three-bedroom, two-bath house will cost you around $380,000. All things being equal, it is about 25% more expensive to live here than in your average American city, so when you think about these wages, knock a quarter off to adjust for cost of living.

Those wages don’t look so hot anymore huh? Maybe there is a reason (other than our State’s current recession) that most IT positions in my organization take at least six months to fill. The talent pool is shallow and not that many people are interested in wading in.

We all have our strengths and weaknesses. I suspect our team is much like others, with a spectrum of talent, but I think the cracks are beginning to show… as positions are cut, work is not being evenly distributed and fewer and fewer team members are taking more and more of the work. I suspect that’s because these team members have the skills to eat that workload with automation. They can use PowerShell to do account provisioning instead of clicking through Active Directory Users and Computers. They can use SCCM to install Visio instead of RDPing and pressing Next-Next-Finish on each computer. A high-performing team member would realize that the only way they could do that much work was to learn some automation skills. A low-performing team member would do what instead? I’m not sure. But maybe, just maybe, as we put increasing pressure on our tier-1 and tier-2 staff to “up their skills” and to “eat the work”, we are not being realistic.
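For context, “account provisioning with PowerShell” means something like this; a minimal sketch assuming the RSAT ActiveDirectory module, with made-up OU and group names:

Import-Module ActiveDirectory

# Splat the new hire's details (all values here are placeholders).
$newHire = @{
    Name              = 'Jane Doe'
    GivenName         = 'Jane'
    Surname           = 'Doe'
    SamAccountName    = 'jdoe'
    UserPrincipalName = 'jdoe@corp.example.com'
    Path              = 'OU=Staff,DC=corp,DC=example,DC=com'
    AccountPassword   = (Read-Host -AsSecureString 'Initial password')
    Enabled           = $true
}

# One cmdlet instead of a dozen clicks in Active Directory Users and Computers.
New-ADUser @newHire
Add-ADGroupMember -Identity 'All-Staff' -Members 'jdoe'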

Would you expect someone making $44k – $51k a year in cost-of-living adjusted wages to be an SCCM wizard? Or to pick up PowerShell?

Are we asking too much of our staff? What would you expect someone paid these wages to be able to do? Like all my posts, I have no answers, only questions, but hopefully I’m asking the right ones.

Until next time, stay frosty!

Veeam, vSphere Tags and PowerCLI – A Practical Use-case

vSphere tags have been around for a while, but up until quite recently I had yet to find a particularly compelling reason to use them. We are a small enough shop that it just didn’t make sense to use them for categorization purposes with only 200 virtual machines.

This year we hopped on the Veeam bandwagon, and while I was designing our backup jobs I found the perfect use-case for vSphere tags. I wanted a mechanism that gave us a degree of flexibility, so I could protect different “classes” of virtual machines to the degree appropriate to their criticality, but that was also automatic, since manual processes have a way of not getting done. I also did not want an errant admin to place a virtual machine in the wrong folder or cluster and then have the virtual machine be silently excluded from any backup jobs. The solution was to use vSphere tags and PowerShell with VMware’s PowerCLI module.

If we look at the different methods you can use for targeting source objects in Veeam Backup Jobs, tags seem to offer the most flexibility. Targeting based on clusters works fine as long as you want to include everything on a particular cluster. Targeting based on hosts works as long as you don’t expect DRS to move a virtual machine to another host that is not included in your Backup Job. Datastores initially seemed to make the most sense since our different virtual machine “classes” roughly correspond to what datastore they are using (and hence what storage they are on), but not every VM falls cleanly into each datastore “class”. Folders would functionally work the same way here as tags, but tags are just a bit cleaner. If I create a folder structure for a new line-of-business application, I don’t have to go back into each of my Veeam jobs and update their targeting to add the new folders; I just tag the folders appropriately and away I go.

Tags also work great for exclusions. We run a separate vSphere cluster for our SQL workloads, primarily for licensing reasons (SQL Datacenter licensing for the win!). I just set up a tag for the whole cluster, use it to target VMs for my application-aware Veeam Backup Jobs and also use it to exclude those virtual machines from the standard Backup Jobs (why back it up twice? Especially when one backup is not transaction-consistent?).

How about checking to see if there are any virtual machines that missed a tag assignment and therefore are not being backed up?

PowerShell to the rescue once again!

 

Another simple little snippet that will send you an email if you managed to inadvertently place a virtual machine into a location where it is not getting a tag assignment and therefore not getting automatically included in a backup job.
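It boils down to something like this; a rough sketch of the idea, with placeholder tag category, vCenter and mail server names:

Import-Module VMware.PowerCLI
Connect-VIServer -Server 'vcenter.corp.example.com'

# Find every VM that has no tag from our (hypothetical) "BackupPolicy" category.
$taggedIds = (Get-VM -Tag (Get-Tag -Category 'BackupPolicy')).Id
$untagged  = Get-VM | Where-Object { $taggedIds -notcontains $_.Id }

# If anything slipped through the cracks, email the team a list.
if ($untagged) {
    $body = $untagged | Select-Object Name, Folder | Out-String
    Send-MailMessage -To 'sysadmins@corp.example.com' -From 'veeam-check@corp.example.com' `
        -Subject 'VMs missing a backup tag' -Body $body -SmtpServer 'smtp.corp.example.com'
}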
 

Until next time!

Quick and dirty PowerShell snippet to get Dell Service Tag

The new fiscal year is right around the corner for us. This time of year brings all kinds of fun for us Alaskans: spring king salmon runs, our yearly dose of three days’ worth of good weather, and licensing true-ups and hardware purchases. Now there are about a million different ways to skin this particular cat, but here’s a quick and dirty method with PowerShell.
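The core of it is that the service tag is just the BIOS serial number, so CIM/WMI will hand it right over. A stripped-down sketch with made-up computer names:

# Machines pulled from wherever you keep your inventory (hard-coded here for brevity).
$computers = 'server01', 'server02', 'server03'

foreach ($computer in $computers) {
    # Skip anything that isn't reachable over WinRM.
    if (Test-NetConnection -ComputerName $computer -Port 5985 -InformationLevel Quiet) {
        Get-CimInstance -ClassName Win32_BIOS -ComputerName $computer |
            Select-Object PSComputerName, SerialNumber
    }
}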

 

If you don’t have access to the ridiculously useful Test-NetConnection cmdlet, then you should probably upgrade to PowerShell v5, since unlike most Microsoft products, PowerShell seems to actually improve with each version. Barring that, you can just open a TCP socket by instantiating the appropriate .NET object.
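Something along these lines, give or take the error handling:

# Poor man's port check for older PowerShell: open a TCP socket with .NET directly.
$client = New-Object System.Net.Sockets.TcpClient
try {
    $client.Connect('server01', 5985)   # hypothetical host and port
    $client.Connected
}
catch {
    $false
}
finally {
    $client.Close()
}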

 

The slickest way I have ever seen this done, though, was with SCCM and the Dell Command Integration Suite for System Center, which can generate a warranty status report for everything in your SCCM site by connecting to the database, grabbing all the service tags and then sending them up to Dell’s warranty status API to get all kinds of juicy information like model, service level, ship date and warranty status. Unfortunately, since this was tremendously useful, the team overseeing the warranty status web service decommissioned it abruptly back in 2016. Thanks for nothing, ya jerks!

 

Use PowerCLI to get a list of old snapshots

VMware’s snapshot implementation is pretty handy. A quick click and you have a rollback point for whatever awesomeness your Evel Knievel rockstar developer is about to roll out. There are, as expected, some downsides.

Snapshots are essentially implemented as differencing disks: at the point in time you take a snapshot, a child .vmdk file is created that tracks any disk changes from that point forward. When you delete the snapshot, the changed blocks are merged back into the parent .vmdk. If you decide to roll back all the changes, you just revert to the original parent disk, which remained unchanged.

There are some performance implications:

  • Disk I/O is impacted for the VM as ESXi has to work a bit harder to write new changes to the child .vmdk file while referencing the parent .vmdk file for unchanged blocks.
  • The child .vmdk files grow over time as more and more changes accumulate. The bigger they get, the more disk I/O is impacted. As you can imagine, multiple snapshots can certainly impact disk performance.
  • When the snapshot is deleted, the process of merging the changes in the child .vmdk back into the parent .vmdk can be very slow depending on the size of the child .vmdk. Ask my DBA how long it took to delete a snapshot after he restored a 4TB database…

They also don’t make a great backup mechanism in and of themselves:

  • The parent and child .vmdk are located on the same storage and if that storage disappears so does the production data (the parent .vmdk) and the “backup” (the child .vmdk) with it.
  • They are not transaction-consistent. The hypervisor temporarily freezes I/O in order to take the snapshot, but it does not do any kind of application-aware quiescing (although VMware Tools can quiesce the guest file system and memory).

 

For all these reasons I like to keep track of what snapshots are out there and how long they have been hanging around. Luckily PowerCLI makes this pretty easy:

 
This little snippet will connect to your vCenter servers, go through all the VMs, locate any snapshots older than seven days and then send you a nicely formatted email:
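Roughly, with placeholder vCenter and mail server names:

Import-Module VMware.PowerCLI
Connect-VIServer -Server 'vcenter01.corp.example.com', 'vcenter02.corp.example.com'

# Gather every snapshot older than seven days along with its size.
$oldSnapshots = Get-VM | Get-Snapshot |
    Where-Object { $_.Created -lt (Get-Date).AddDays(-7) } |
    Select-Object VM, Name, Created, @{ Name = 'SizeGB'; Expression = { [math]::Round($_.SizeGB, 1) } }

# Only bother the inbox if there is something to report.
if ($oldSnapshots) {
    Send-MailMessage -To 'sysadmins@corp.example.com' -From 'vcenter-reports@corp.example.com' `
        -Subject 'Snapshots older than 7 days' -SmtpServer 'smtp.corp.example.com' `
        -Body ($oldSnapshots | Format-Table -AutoSize | Out-String)
}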

 

If you haven’t invested a little time in learning PowerShell/PowerCLI, I highly recommend you start. It is such an amazing toolset and force multiplier that I cannot imagine working without these days.

Until next time, stay frosty.

Extra credit reading:

 

Prometheus and Sisyphus: A Modern Myth of Developers and Sysadmins

I am going to be upfront with you. You are about to read a long and meandering post that will seem almost a little too whiny at times, where I talk some crap about our developers and their burdens (applications). I like our dev teams and I like to think I work really well with their leads, so think of this post as a bit of satirical sibling rivalry; underneath the hyperbole and good-natured teasing there might be a small, “little-t” truth.

That truth is that operations, whether it’s the database administrator, the network team, the sysadmins or the help desk, always, always, always gets the short straw and that is because collectively we own “the forest” that the developers tend to their “trees” in.

I have a lot to say about the oft-repeated sysadmin myth about “how misunderstood sysadmins are” and how they just seem to get stepped on all the time and so on and so on. I am not a big fan of the “special snowflake sysadmin syndrome” and I am especially not a fan of it when it is used as an excuse to be rude or unprofessional. That being said, I think it is worth stating that even I know I am half full of crap when I say sysadmins always get the short straw.

OK, disclaimers are all done! Let’s tell some stories!

 

DevOps – That means I get Local Admin right?

My organization is quite granular and each of our departments more or less maintains its own development team supporting its own mission-specific applications, along with either a developer that essentially fulfills an operations role or a single operations guy doing support solely for that department. The central ops team maintains things like the LAN, Active Directory, the virtualization platform and so on. If the powers that be wanted a new application for their department, the developers would request the required virtual machines, the ops team would spin up a dozen VMs off of a template, join them to AD, give the developers local admin and off we go.

Much like Bob Belcher, all the ops guys could do is “complain the whole time”.

 

This arrangement led to some amazing things that break in ways that are too awesome to truly describe:

  • We have an in-house application that uses SharePoint as a front-end, calls some custom web services tied to a database or two that auto-populates an Excel spreadsheet that is used for timekeeping. Everyone else just fills out the spreadsheet.
  • We have another SharePoint integrated application, used ironically enough for compliance training, that passes your Active Directory credentials in plaintext through two or three servers all hosting different web services.
  • Our deployment process is essentially to copy everything off your workstation onto the IIS servers.
  • Our revision control is: E:\WWW\Site, E:\WWW\Site (Copy), E:\WWW-Site-Dev McDeveloper
  • We have an application that manages account onboarding, a process that is already automated by our Active Directory team. Naturally, they conflict.
  • We had at one point in time, four or five different backup systems all of which used BackupExec for some insane reason, three of which backed up the same data.
  • We managed to break a production IIS server by restoring a copy of the test database.
  • And then there’s Jenga: Enterprise Edition…

 

Jenga: Enterprise Edition – Not so fun when it needs four nines of uptime.

A satirical (but only just) rendering of one of our applications’ design, a pattern I call “The Spider Web”

What you are looking at is my humorous attempt to scribble out a satirical sketch of one of our line-of-business applications, which managed to actually turn out pretty accurate. The Jenga application is so named because all the pieces are interconnected in ways that turn the prospect of upgrading any of it into the project of upgrading all of it. Ready? Ere’ we go!

It’s built around a core written in a language that we have not had any on-staff expertise in for the better part of ten years. In order to provide the functionality that the business needed as the application aged, the developers wrote new “modules” in other languages that essentially just call APIs or exposed services and then bolted them on. The database is relatively small, around 6 TBs, but almost 90% of it is static read-only data that we cannot separate out, which drastically reduces the things our DBA and I can do in terms of recovery, backup, replication and performance optimization. There are no truly separate development or testing environments, so we use snapshot copies to expose what appear to be “atomic” copies of the production data (which contains PII!) on two or three other servers so our developers can validate application operations against it. We used to do this with manual fricking database restores, which was god damned expensive in terms of time and storage. There are no less than eight database servers involved, but the application cannot be distributed or set up in some kind of multi-master deployment with convergence, so staff at remote sites suffer abysmal performance if anything resembling contention happens on their shared last-mile connections. The “service accounts” are literally user accounts that the developers use to RDP to the servers, start the application’s GUI, and then enable the application’s various services by interacting with the above-mentioned GUI (any hiccup in the RDP session and *poof* there goes that service). The public-facing web server directly queries the production database. The internally consumed pieces of the application and the externally consumed pieces are co-mingled, meaning an outage anywhere is an outage everywhere. It also means we cannot segment the application into public and internally facing pieces. The client requires a hard-coded drive map to run, since application upgrades are handled internally with copy jobs which essentially replace all the local .DLLs on a workstation when new ones are detected. And last but not least, it runs on an EOL version of MSSQL.

Whew. That was a lot. Sorry about that. Despite the fact that a whole department pretty much lives or dies by this application’s continued functionality, our devs have not made much progress in re-architecting and modernizing it. This really is not their fault, but it does not change the fact that my team has an increasingly hard time keeping this thing running in a satisfactory manner.

 

Operations: The Digital Custodian Team.

In the middle of a brainstorming session where we were trying to figure out how to move Jenga to a new virtualization infrastructure – all on a weekend when I will be traveling, in order to squeeze the outage into the only period within the next two months that was not going to be unduly disruptive – I began to feel like my team was getting screwed. They have more developers supporting this application than we have in our whole operations team, and it is on us to figure out how to move Jenga without losing any blocks or having any lengthy service windows? What are those guys actually working on over there? Why am I trying to figure out which missing .DLL from .NET 1.0 needs to be imported onto the new IIS 8.5 web servers so some obscure service that no one really understands runs in a supported environment? Why does operations own the life-cycle management? Why aren’t the developers updating and re-writing code to reflect the underlying environmental and API changes each time a new server OS is released with a new set of libraries? Why are our business expectations for application reliability so widely out of sync with what the architecture can actually deliver? Just what in the hell is going on here!

Honestly. I don’t know but it sucks. It sucks for the customers, it sucks for the devs but mostly I feel like it sucks for my team because we have to support four other line-of-business applications. We own the forest right? So when a particular tree catches on fire they call us to figure out what to do. No one mentions that we probably should expect trees wrapped in paraffin wax and then doused in diesel fuel to catch on fire. When we point out that tending trees in this manner probably won’t deliver the best results if you want something other than a bonfire we get met with a vague shrug.

Is this how it works? Your team of rockstar, “creative-type”, code-poets whip up some kind of amazing business application, celebrate and then hand it off to operations where we have to figure out how to keep it alive as the platform and code base age into senility for the next 20 years? I mean who owns the on-call phone for all these applications… hint: it’s not the dev team.

I understand that sometimes messes happen… just why does it feel like we are the only ones cleaning it up?

 

You’re not my Supervisor! Organizational Structure and Silos!

Bureaucratium ad infinitum.

 

At first blush I was going to blame my favorite patsy, Process Improvement and the insipid industry around it, for this current state of affairs, but after some thought I think the real answer here is something much simpler: the dev team and my team don’t work for the same person. Not even close. If we play a little game of “trace the organizational chart”, we have five layers of management before we reach a position that has direct reports that eventually lead to both teams. Each one of those layers is a person – with their own concerns, motivations, proclivities and spin they put on any given situation. The developers and operations team (“dudes that work”), more or less, agree that the design of the Jenga application is Not a Good Thing (TM). But as each team gets told to move in a certain direction by each layer of management, our efforts and goals diverge. No amount of fuzzy-wuzzy DevOps or new-fangled Agile Standup Kanban Continuous Integration Gamefication Buzzword Compliant bullshit is ever going to change that. Nothing makes “enemies” out of friends faster than two (or three or four) managers maneuvering for leverage and dragging their teams along with them. I cannot help but wonder what our culture would be like if the lead devs sat right next to me and we established project teams out of our combined pool of developer and operations talent as individual departments put forth work. What would things be like if our developers were not chained to some stupid line-of-business application from the late ’80s, toiling away to polish a turd and implement feature requests like some kind of modern Promethean myth? What would things be like if our operations team was not constantly trying to figure out how to make old crap run while our budgets and staff are whittled away, snatching victory from defeat time and time again only to watch the cycle of mistakes repeat itself again and again like some kind of Sisyphean dystopia with cubicles? What if we could sit down together and, I dunno… fix things?

Sorry there are no great conclusions or flashes of prophetic insight here, I am just as uninformed as the rest of the masses, but I cannot help but think, maybe, maybe we have too many chefs in the kitchen arguing about the menu. But then again, what do I know? I’m just the custodian.

Until next time, stay frosty.

The HumbleLab: Windows Server 2016, ReFS and “no sufficient eligible resources” Storage Tier Errors

Well, that didn’t last too long, did it? Three months after getting my Windows Server 2012 R2-based HumbleLab set up, I tore it down to start fresh.

As a refresher, The HumbleLab lives on some pretty humble hardware:

Dell OptiPlex 990 (circa 2012)

  • Intel i7-2600, 3.4GHz 4 Cores, 8 Threads, 256KB L2, 8MB L3
  • 16GBs, Non-ECC, 1333MHz DDR3
  • Samsung SSD PM830, 128GBs SATA 3.0 Gb/s
  • Samsung SSD 840 EVO 250GBs SATA 6.0 Gb/s
  • Seagate Barracuda 1TB SATA 3.0 Gb/s

However, I did manage to scrounge up a Hitachi/HGST Ultrastar 7K3000 3TB SATA drive in our parts bin, manufactured in April 2011, to swap places with the eight-year-old Seagate drive. Not only is the Hitachi drive newer, but it also has three times as much capacity, bringing a whopping 3TBs of raw storage to the little HumbleLab! Double win!

My OptiPlex lacks any kind of real storage management, and my Storage Pool was configured with a Simple storage layout, which just stripes the data across all the drives in the Storage Pool. It should also go without saying that I am not using any of Storage Spaces’ Failover Clustering or Scale-Out functionality. I couldn’t think of a simple way to swap my SATA drives other than to export my virtual machines, destroy the Storage Pool, swap the drives and recreate it. The only problem was that I didn’t really have any readily available temporary storage to dump my VMs on, and my lab was kind of broken anyway, so I just nuked everything and started over with a fresh install of Server 2016, which I had been wanting to upgrade to. Oh well, sometimes the smartest way forward is kind of stupid.

Not much to say about the install process but I did run across the same “storage pool does not have sufficient eligible resources” issue creating my Storage Pool.

Neat! There’s still a rounding error in the GUI. Never change, Microsoft. Never change.

According to the Internet’s most accurate source of technical information, Microsoft’s TechNet Forums, there is a rounding error in how disks are presented in the wizard. I guess what happens is when you want to use all 2.8TBs of your disk, the numbers don’t match up exactly with the actual capacity and consequently the wizard fails as it tries to create a Storage Tier bigger than the underlying disk. I guess. I mean it seems plausible at least. If you specify the size in GBs or even MBs supposedly that will work but naturally it didn’t work for me and I ended up trying to create my new Virtual Disk using PowerShell. I slowly backed off the size of my Storage Tiers from the total capacity of the underlying disks until it worked with 3GBs worth of slack space. A little disappointing that the wizard doesn’t automagically do this for you and doubly disappointing that this issue is still present in Server 2016.

Here’s my PowerShell snippet:
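It boils down to something like this – the friendly names and exact tier sizes below are illustrative, the important part is backing the tiers off a few GB from the raw capacity:

# Create the SSD and HDD tiers in the existing pool (names are just labels).
$pool    = Get-StoragePool -FriendlyName 'HumbleLab'
$ssdTier = New-StorageTier -StoragePoolFriendlyName $pool.FriendlyName -FriendlyName 'SSDTier' -MediaType SSD
$hddTier = New-StorageTier -StoragePoolFriendlyName $pool.FriendlyName -FriendlyName 'HDDTier' -MediaType HDD

# Back the tier sizes off from the raw capacity so the rounding issue doesn't bite.
New-VirtualDisk -StoragePoolFriendlyName $pool.FriendlyName -FriendlyName 'VMStorage' `
    -StorageTiers $ssdTier, $hddTier -StorageTierSizes 220GB, 2790GB `
    -ResiliencySettingName Simple -ProvisioningType Fixed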

 

Now for the big reveal. How’d we do?

Not bad at all for running on junk! We were able to squeeze a bit more go-juice out of the HumbleLab with Server 2016 and ReFS! We bumped the IOPS up to 2240 from 880 and reduced latency to sub-2ms numbers from 4ms, which is amazing considering what we are running this on.

I think that this performance increase is largely due to the combination of how Storage Tiers and ReFS are implemented in Server 2016 and not due to ReFS’s block cloning technology which is focused on optimizing certain types of storage operations associated with virtualization workloads. As I understand it, Storage Tiers previously were “passive” in the sense that a scheduled task would move hot data onto SSD tiers and cooling/cold data back onto HDD tiers whereas in Server 2016 Storage Tiers and ReFS can do realtime storage optimization. Holy shmow! Windows Server is starting to look like a real operating system these days! There are plenty of gotchas of course and it is not really clear to me whether they are talking about Storage Spaces / Storage Tiers or Storage Spaces Direct but either way I am happy with the performance increase!

Until next time!

 

Kafka in IT: How a Simple Change Can Take a Year to Implement

Public sector IT has never had a reputation of being particularly fast-moving or responsive. In fact, it seems to have a reputation for being staffed by apathetic under-skilled workers toiling away in basements and boiler rooms supporting legacy, “mission-critical”, monolithic applications that sit half-finished and half-deployed by their long-gone and erstwhile overpaid contractors (*cough* Deloitte, CGI *cough*). This topic might seem familiar… Budget Cuts and Consolidation and Are GOV IT teams apathetic?

Why do things move so slowly, especially in a field that demands the opposite? I don’t have an answer to that larger question, but I do have an object lesson; well, maybe what I really have is part apology, part explanation and part catharsis. Gather around and hear the tale of how a single change to our organization’s perimeter proxy devices took a year!

 

03/10

We get a ticket stating that one of our teams’ development servers is no longer letting them access it via UNC share or RDP. I assign one of our tier-2 guys to take a look and a few days later it gets escalated to me. The server will not respond to any incoming network traffic, but if I access it via console and send traffic out it magically works. This smells suspiciously like a host-based firewall acting up but our security team swears up and down our Host Intrusion Protection software is in “detect” mode and I verified that we have disabled the native Windows firewall. I open up a few support tickets with our vendors and start chasing such wild geese as a layer-2 disjoint in our UCS fabric and “asymmetric routing” issues. No dice. Eventually someone gets the smart idea to move the IP address to another VM to try and narrow the issue down to either the VM or the environment. It’s the VM (of course it is)! These shenanigans take two weeks.

04/01

I finish re-platforming the development server onto a new Server 2012 R2 virtual machine. This in-of-itself would be worth a post since the best way I can summarize our deployment methodology is “guess-and-check”. Anyway, the immediate issue is now resolved. YAY!

05/01

I rebuild the entire development, testing, staging and production stack and migrate everything over except the production server, which is publicly accessible. The dev team wants to do a soft cutover instead of just moving the IP address to the new server. This means we will need to have our networking team make some changes to the perimeter proxy devices.

05/15

I catch up on other work and finish the roughly ten pages of forms, diagrams and a security plan that are required for a perimeter device change request.

06/06

I open a ticket upstream, discuss the change with the network team and make some minor modifications to the ticket.

06/08

I filled out the wrong forms and/or I filled them out incorrectly. Whoops.

06/17

After a few tries I get the right forms and diagrams filled out. The ticket gets assigned to the security team for approval.

06/20

Someone from the security team picks up the ticket and begins to review it.

07/06

Sweet! Two weeks later my change request gets approval from the security team (that’s actually pretty fast). The ticket gets transferred back to the networking team, which begins to work on implementation.

07/18

I create a separate ticket to track the required SSL/TLS certificate I will need for the HTTPS-enabled services on the server. This ticket follows a similar parallel process, documentation is filled out and validated, goes to the security team for approval and then back to the networking team for implementation. My original ticket for the perimeter change is still being worked on.

08/01

A firmware upgrade on the perimeter devices breaks high availability. The network team freezes all new work until the issue is corrected (they start their internal change control process for emergency break/fix issues).

08/24

The server’s HTTPS certificate has to be replaced before it expires at the end of the month. Our devs’ business group coughs up the few hundred dollars. We had planned to use the perimeter proxies’ wildcard certificate for no extra cost, but oh well, too late.

09/06

HA restored! Wonderful! New configuration changes are released to the networking team for implementation.

10/01

Nothing happens upstream… I am not sure why. I call about once a week and hear: we are swamped, two weeks until implementation, should be soon.

11/03

The ticket gets transferred to another member of the network team and within a week the configuration change is ready for testing.

11/07

The dev team discovers an issue. Their application is relying on the originating client IP address for logging and what basically amounts to “two-factor authentication” (i.e., a username is tied to an IP address). This breaks fantastically once the service gets moved behind a web proxy. Neat.

11/09

I work with the dev lead and the networking team to come up with a solution. Turns out we can pass the originating IP address through the proxies but it changes the variable server-side that their code needs to reference.

11/28

Business leaders say that the code change is a no-go. We are about to hit their “code/infrastructure freeze” period that lasts from December to April. Fair enough.

12/01

We hit the code freeze. Things open back up again in mid-April. Looking ahead, I already have infrastructure work scheduled for late April and early May, which brings us right around to June: one year.

EDIT: The change was committed on 05/30 and we passed our rollback period on 06/14. As of 06/19 I just submitted the last ticket to our networking team to remove the legacy configuration.

 

*WHEW* Let’s take a break. Here’s doge to entertain you during the intermission:

 

My team is asking for a change that involves taking approximately six services that are already publicly accessible via a legacy implementation, moving those services to a single IP address and placing an application proxy between the Big Bad Internet and the hosting servers. Nothing too crazy here.

Here’s some parting thoughts to ponder.

  • ITIL. Love it or hate it, ITIL adds a lot of drag. I hope it adds some value.
  • I don’t mean to pick on the other teams but it clearly seems like they don’t have enough resources (expertise, team members, whatever they need they don’t have enough of it).
  • I could have done better with all the process stuff on my end. Momentum is important, so I probably should not have let some of that paperwork sit for as long as it did.
  • The specialization of teams cuts both ways. It is easy to slip from being isolated and siloed into basic outright distrust, and when you assume that everyone is out to get you (probably because that’s what experience has taught you), you C.Y.A. ten times till Sunday to protect yourself and your team. Combine this with ITIL for a particularly potent blend of bureaucratic misery.
  • Centralized teams like networking and security that are not embedded in different business groups end up serving a whole bunch of different masters. All of whom are going in different directions and want different things. In our organization this seems to mean that the loudest, meanest person who is holding their feet to the SLA gets whatever they want at the expense of their quieter customers like myself.
  • Little time lags magnify delay as the project goes on. Two weeks in security approval limbo puts me four weeks behind a few months down the road which means I then miss my certificate expiry deadline which then means I need to fill out another ticket which then puts me further behind and so on ad infinitum.
  • This kind of shit is why developers are just saying “#YOLO! Screw you Ops! LEEEEROY JENKINS! We are moving to the Cloud!” and ignoring all this on-prem, organizational pain and doing DevDev (it’s like DevOps but it leads to hilarious brokenness in other new and exciting ways).
  • Public Sector IT runs on chaos, disorder and the frustrations of people just trying to Do Things. See anything ever written by Kafka.
  • ITIL. I thought it was worth mentioning twice because that’s how much overhead it adds (by design).

 

Until next time, may your tickets be speedily resolved.

The HumbleLab: Storage Spaces with Tiers – Making Pigs Fly!

I have mixed feelings about homelabs. It seems ludicrous to me that in a field that changes as fast as IT, employers do not invest in training. You would think on-the-clock time dedicated to learning would be an investment that would pay itself back in spades. I also think there is something psychologically dangerous in working your 8-10 hour day and then going home and spending your evenings and weekends studying/playing in your homelab. Unplugging and leaving computers behind is pretty important; in fact, I find the more I do IT, the less interest I have in technology in general. Something, something, make an interest a career and then learn to hate it. Oh well.

That being said, IT is a fast-changing field and if you are not keeping up one way or another, you are falling behind. A homelab is one way to do this, plus sometimes it is kind of nice to just do stuff without attending governance meetings or submitting to the tyranny of your organization’s change control board.

Being the cheapskate that I am, I didn’t want to go out and spend thousands of my own dollars on hardware like all the cool cats in r/homelab, so I just grabbed some random crap lying around work, partly just to see how much use I could squeeze out of it.

Dell OptiPlex 990 (circa 2012)

  • Intel i7-2600, 3.4GHz 4 Cores, 8 Threads, 256KB L2, 8MB L3
  • 16GBs, Non-ECC, 1333MHz DDR3
  • Samsung SSD PM830, 128GBs SATA 3.0 Gb/s
  • Samsung SSD 840 EVO 250GBs SATA 6.0 Gb/s
  • Seagate Barracuda 1TB SATA 3.0 Gb/s

The OptiPlex shipped with just the 128GB SSD which only had enough storage capacity to host the smallest of Windows virtual machines so I scrounged up the two other disks from other desktops that were slated for recycling. I am particularly proud of the Seagate because if the datecode on the drive is to be believed it was originally manufactured sometime in late 2009.

A bit of a pig huh? Let’s see if we can make this little porker fly.

A picture of the inside of HumbleLab

Oh yeah… look at that quality hardware and cable management. Gonna be hosting prod workloads on this baby.

I started out with a pretty simple/lazy install of Windows Server 2012 R2 and the Hyper-V role. At this point in time I only had the original 128GB SSD that the operating system was installed on and the ancient Seagate being utilized for .VHD/.VHDX storage.

Performance was predictably abysmal, especially once I got a SQL VM setup and “running”:

IOmeter output

At this point, I added in the other 256GB SSD, destroyed the volume I was using for .VHD/.VHDX storage and recreated it using Storage Spaces. I don’t have much to say about Storage Spaces here since I have such a simple/stupid setup. I just created a single Storage Pool using the 256GB SSD and 1TB SATA drive. Obviously with only two disks I was limited to a Simple Storage Layout (no disk redundancy/YOLO mode). I did opt to create a larger 8GB Write Cache using PowerShell but other than that I pretty much just clicked through the wizard in Server Manager:
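The PowerShell part amounted to creating the Virtual Disk by hand so I could pass a bigger -WriteCacheSize; something like this, assuming the pool and tiers already exist, with placeholder names and illustrative sizes:

# Build the tiered virtual disk with an 8GB write-back cache instead of the default.
$ssd = Get-StorageTier -FriendlyName 'SSDTier'
$hdd = Get-StorageTier -FriendlyName 'HDDTier'

New-VirtualDisk -StoragePoolFriendlyName 'HumbleLab' -FriendlyName 'VMStorage' `
    -StorageTiers $ssd, $hdd -StorageTierSizes 200GB, 900GB `
    -WriteCacheSize 8GB -ResiliencySettingName Simple -ProvisioningType Fixed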

 

Let’s see how we did:

IOMeter Results with Storage Tiers

A marked improvement! We more than tripled our IOPS, from a snail-like 234 to a tortoise-like 820, and managed to reduce the response time from 14ms to 5ms. The latency reduction is probably the most important part. We generally shoot for under 2ms for our production workloads, but considering the hardware, 5-6ms isn’t bad at all.

 

What if I just run the .VHDX file directly on the shared 128GB SSD that the Hyper-V host is utilizing, without any Storage Tiers involved at all?

Hmm… not surprisingly, the results are even better, but what was surprising is by how much. We are looking at sub-2ms latency and about four and a half times more IOPS than what my Storage Spaces Virtual Disk can deliver.

Of course benchmarks, especially quick and dirty ones like this, are very rarely the whole story and likely do not even come close to simulating your true workload, but at least they give us a basic picture of what my aging hardware can do: SATA = glacial, Storage Tiers with SSD caching = OK, SSD = good. It also illustrates just how damn fast SSDs are. If you have a poorly performing application, moving it over to SSD storage is likely going to be the single easiest thing you can do to improve its performance. Sure, the existing bottleneck in the codebase or database design is still there, but does that matter anymore if everything is moving 4x faster? Like they say, Hardware is Cheap, Developers are Expensive.

I put this together prior to the general release of Server 2016, so it would be interesting to see if running this same setup on 2016’s implementation of Storage Spaces with ReFS instead of NTFS would yield better results. It would also be interesting to refactor the SQL database and at the very least place the TempDB, SysDBs and log files directly onto the host’s 128GB SSD. A project for another time, I guess…

Until next time… may your pigs fly!

A flying pig powered by a rocket

Additional reading / extra credit:

Don’t Build Private Clouds? Then What Do We Build?

Give Subbu Allamaraju’s blog post Don’t Build Private Clouds a read if you have not yet. I think it is rather compelling but also wrong in a sense. In summation: 1) Your workload is not as special as you think it is, 2) your private cloud isn’t really a “cloud” since it lacks the defining scale, resiliency, automation framework, PaaS/SaaS and self-service on-demand functionality that a true cloud offering like AWS, Azure or Google has and 3) your organization is probably doing a poor job of building a private cloud anyway.

Now let’s look at my team – we maintain a small Cisco FlexPod environment: about 14 ESXi hosts, 1.5TBs of RAM and about 250TBs of storage. We support about 600 users and I am primary for the following:

  • Datacenter Virtualization: Cisco UCS, Nexus 5Ks, vSphere, NetApp and CheckPoint firewalls
  • Server Infrastructure: Platform support for 150 VMs, running mostly either IIS or SQL
  • SCCM Administration (although one of our juniors has taken over the day to day tasks)
  • Active Directory Maintenance and Configuration Management through GPOs
  • Team lead responsibilities, at the discretion of my manager, for larger projects with multiple groups and stakeholders
  • Escalation point for the team, point-of-contact for developer teams
  • Automation and monitoring of infrastructure and services

My day-to-day consists of work supporting these focus areas – assisting team members with a particularly thorny issue, migrating in-house applications onto new VMs, working with our developer teams to address application issues, existing platform maintenance, holding meetings talking about all this work with my team, attending meetings talking about all this work with my managers, sending emails about all this work to the business stakeholders and a surprising amount of tier-1 support (see here and here).

If we waved our magic wand and moved everything into the cloud tomorrow, particularly into PaaS where the real value to cost sweet spot seems to be, what would I have left to do? What would I have left to build and maintain?

Nothing. I would have nothing left to build.

Almost all of my job is working on back-end infrastructure, doing platform support or acting as a human API/”automation framework”. As Subbu states, I am a part of the cycle of “brittle, time-consuming, human-operator driven, ticket based on-premises infrastructure [that] brews a culture of mistrust, centralization, dependency and control“.

I take a ticket saying, “Hey, we need a new VM,” run some PowerShell scripts to create and provision said new VM in a semi-automated fashion, and then copy the contents of the older VM’s IIS directory over. I then notice that our developers are passing credentials in plaintext back and forth through web forms and .XML files between different web services, which kicks off a whole week’s worth of work to re-do all their sites in HTTPS. I then set up a meeting to talk about these changes with my team (cross-training) and, if we are lucky, someone upstream actually gets to my ticket and these changes go live. This takes about three to four weeks, optimistically.
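The “semi-automated fashion” is nothing fancier than PowerCLI along these lines (every name here is made up):

Connect-VIServer -Server 'vcenter.corp.example.com'

# Clone a new VM from a template, customize it and power it on.
New-VM -Name 'APPWEB03' -Template (Get-Template 'Server2012R2-IIS') `
    -VMHost (Get-VMHost 'esxi05.corp.example.com') -Datastore (Get-Datastore 'PROD-VM-01') `
    -OSCustomizationSpec (Get-OSCustomizationSpec 'DomainJoin') |
    Start-VM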

In the new world our intrepid developer tweaks his Visual Studio deployment settings and his application gets pushed to an Azure WebApp which comes baked in with geographical redundancy, automatic scale-out/scale-up, load-balancing, a dizzying array of backup and recovery options, integration with SaaS authentication providers, PCI/OSI/SOC compliance and the list goes on. This takes all of five minutes.

However, here is where I think Subbu gets it wrong: of our 150 VMs, about 50% belong to those “stateful monoliths”. They are primarily line-of-business applications with proprietary code bases that we don’t have access to, or legacy applications built on things like PowerBuilder that no one understands anymore. They are spread out across 10 to 20 VMs to provide segmentation but have huge monolithic database designs. It would cost us millions of dollars to refactor these applications into a design that could truly take advantage of cloud services in their PaaS form. Our other option would be cloud-based IaaS, which from the developer’s perspective is not that different from what we are currently doing, except that it costs more.

I am not even going to touch on our largest piece of IT spend, which is a line-of-business application with “large monolithic databases running on handcrafted hardware” in the form of an IBM z/OS mainframe. Now our refactoring cost is in the tens of millions of dollars.

 

If this magical cloud world comes to pass what do I build? What do I do?

  • Like some kind of carrion lord, I rule over my decaying infrastructure and accumulated technical debt until everything legacy has been deprecated and I am no longer needed.
  • I go full retar… err… endpoint management. I don’t see desktops going away anytime soon despite all this talk of tablets, mobile devices and BYOD.
  • On-prem LAN networking will probably stick around but unfortunately this is all contracted out in my organization.
  • I could become a developer.
  • I could become a manager.
  • I could find another field of work.

 

Will this magical cloud world come to pass?

Maybe in the real world, but I have a hard time imagining how it would work for us. We are so far behind in terms of technology and so organizationally dysfunctional that I cannot see how moving 60% of our services from on-prem IaaS to cloud-based IaaS would make sense, even if leadership could lay off all of the infrastructure support people like myself.

Our workloads aren’t special. They’re just stupid and it would cost a lot of money to make them less stupid.

 

The real pearl of wisdom…

“The state of [your] infrastructure influences your organizational culture.” Of all the things in that post, I think this is the most perceptive, as it is in direct opposition to everything our leadership has been saying about IT consolidation. The message we have continually been hearing for the last year and a half is that IT Operations is a commodity service – the technology doesn’t matter, the institutional knowledge doesn’t matter, the choice of vendor doesn’t matter, the talent doesn’t matter: it is all essentially the same and it is just a numbers game to find the implementation that is the most affordable.

As a nerd at heart, I have always disagreed with this position because I believe your technology choices determine what is possible (i.e., if you need a plane but you get a boat, that isn’t going to work out for you), but the insight here that I have never really deeply considered is that your choice of technology drastically affects how you do things. It affects your organization’s cultural orientation to IT. If you are a Linux shop, does that technology choice precede your dedication to continuous integration, platform-as-code and remote collaboration? If you are a Windows shop, does that technology choice precede your stuffy corporate culture of ITIL misery and on-premise commuter hell? How much does our technological means of accomplishing our economic goals shape our culture? How much indeed?

 

Until next time, keep your stick on the ice.

SCCM SUP Failing to Download Updates – Invalid Certificate Error

I am currently re-building my System Center lab, which includes re-installing and re-configuring a basic SCCM instance. I was in the process of getting my Software Update Point (SUP) set up when a few of my SUP groups failed to download their respective updates and deploy correctly. I dimly remember working through this issue in our production environment a few years ago when we moved from SCCM 2012 to 2012 R2, and I cursed myself for not taking better notes, so here we are, atoning for our past sins!

Here’s the offending error:

SCCM SUP Download Error Dialog - Invalid Cert

 

SCCM is a chatty little guy and manages to generate some 160 or so different log files, spread out across a number of different locations (see the highly recommended MSDN blog article, A List of SCCM Log Files). Identifying and locating the logs relevant to whatever issue you are troubleshooting is about half the battle with SCCM. Unfortunately, I couldn’t seem to find patchdownloader.log in its expected location of SMS_CCM\Logs. Turns out that if you are running the Configuration Manager Console from an RDP session, patchdownloader.log will get stored in C:\Users\%USERNAME%\AppData\Local\Temp instead of SMS_CCM\Logs. Huh. In my case, I am RDPing to the SCCM server and running the console, but I wonder, if I run it from a client workstation, whether the resulting log will end up locally on that workstation in my %TEMP% folder or whether it will end up on the SCCM SUP server in SMS_CCM\Logs… an experiment for another day, I guess.

 

Here are the juicy bits:

 

A couple of interesting things to note from here:

  • We get the actual error code (0x80073633), which you can “resolve” back to the human-readable error the console presents you with by using CMTrace’s Error Lookup functionality. Sometimes this turns out to be useful information.
  • We get the download location for the update
  • We get the distribution package location that the update is being downloaded to

If I manually browse to the wsus.ds.download.windowsupdate.com URL, I manage to download the update without issue. There are no certificate validation problems, which is what one would expect considering that the connection is going over HTTP according to the log. Makes one wonder how the resulting error was related to an “invalid certificate”…

OK. How do I fix it? Well, like most things SCCM, the solution is as stupid as it is brilliant: manually download the update from the Microsoft Update Catalog. Go find the offending update in its respective Software Update Group by referencing the KB number and download it again, but this time set your Download Location to the directory that already contains it.

Whoops. Didn’t work.

Take a look at the first attempt to download the content… SCCM is looking for Windows10.0-KB3172989-x64.cab so it can be downloaded into my %TEMP% directory and then eventually moved over to the Deployment Package’s source location at \\SCCM\Source Files\Windows\Updates\2016.

The file I downloaded is not named Windows10.0-KB3172989-x64.cab – it’s actually an .msu file. Use 7-Zip or a similar tool to pull the .cab file out of it, and now SCCM should successfully “download” the update and ship it off to the source location for your Deployment Package.
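If you would rather stay in a console, expand.exe will do the same job as 7-Zip (paths here are hypothetical):

# Pull everything out of the downloaded .msu; the .cab SCCM wants is inside.
mkdir C:\Temp\Extracted
expand.exe -F:* "C:\Temp\windows10.0-kb3172989-x64.msu" C:\Temp\Extracted
dir C:\Temp\Extracted\*.cab    # Windows10.0-KB3172989-x64.cab should be in the list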