
The HumbleLab: Desk of Death – Build a PXE Boot DBAN Station using pfSense and CentOS

I am become Death, destroyer of data.

Data sanitization is a required albeit tedious part of being a Systems Administrator. Last week I noticed that we had quite a few old machines piling up around our office and, finding myself with a few spare hours, I decided to see if I could speed up the process of scrubbing their hard drives using the ubiquitous Darik’s Boot and Nuke (DBAN). An important note: the mathematics and mechanics of disk-scrubbing software are actually fairly complex, especially when Solid State Drives are involved. My recommendation to you is to check with your security team to make sure your method of disk scrubbing is sufficient. Degaussing and physical destruction using an NSA-approved methodology may be the only approved method for certain types of organizations. If you don’t have a security team or a regulatory compliance scheme that specifies what your media disposal standards are, the NIST standards are a great place to start.

Disclaimers out of the way, now onto the fun stuff. Here’s what the old system looked like: a help desk tech would boot a CD, load DBAN into memory, select the drive(s) to be wiped and then press Enter. This took a few minutes in addition to the time it took to set up the old workstation with power, a keyboard and a monitor. Here’s what I wanted: plug in a monitor, keyboard and a network cable, turn the computer on, press F12, pick Network Boot and move on.

To do this I set up an isolated VM in my lab (The HumbleLab) to run a TFTP service in conjunction with the DHCP services already offered by pfSense. An old workstation would also have worked just fine if you didn’t have an existing lab setup.

1. Install CentOS and configure networking

You will need to get a minimal install of CentOS set up. There are plenty of guides out there but this one is pretty nice. There’s always the official Red Hat documentation too. Configure a static IP address, fire off a quick su -c 'yum update' to update your system and that should be enough to get you going. I had two physical NICs in my Hyper-V host so I dedicated one to my “DBAN” network, created an External Virtual Network with it and added a vNIC connected to that virtual switch to my new CentOS DBAN server.
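If you go the nmcli route, here is a minimal sketch (the interface name and addresses are placeholders, assuming CentOS 7):

    # Static IP on the DBAN-facing interface, then patch the system
    nmcli connection modify eth0 ipv4.method manual \
        ipv4.addresses 192.168.50.10/24 ipv4.gateway 192.168.50.1
    nmcli connection up eth0
    su -c 'yum update'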

2. Install and configure the TFTP service

Now we need to get our SYSLINUX bootloaders installed and a TFTP service set up. Let’s install both and copy the SYSLINUX bootloaders into the tftpboot path.
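On CentOS 7 that looks roughly like this (package names, paths and the DBAN version here are assumptions, so adjust to whatever you download):

    # Install the TFTP server and the SYSLINUX bootloaders
    yum install -y tftp-server syslinux

    # Copy the bootloaders into the tftpboot path
    cp /usr/share/syslinux/pxelinux.0 /var/lib/tftpboot/
    cp /usr/share/syslinux/menu.c32 /var/lib/tftpboot/

    # Pull the DBAN kernel out of the ISO (version is an assumption)
    mount -o loop dban-2.3.0_i586.iso /mnt
    cp /mnt/DBAN.BZI /var/lib/tftpboot/dban.bzi
    umount /mnt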

Create a pxelinux.cfg directory along with a default configuration file.
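Assuming the stock tftpboot path:

    mkdir /var/lib/tftpboot/pxelinux.cfg
    touch /var/lib/tftpboot/pxelinux.cfg/default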

Fire up your favorite text editor and edit the newly created configuration file:
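Here is a minimal sketch of a pxelinux.cfg/default that boots straight into DBAN’s autonuke mode. The nuke= flags mirror the ones in DBAN’s own isolinux.cfg, but double-check them against the version you downloaded since this wipes disks without asking:

    default dban
    prompt 0
    timeout 100

    label dban
      kernel dban.bzi
      append nuke="dwipe --autonuke" silent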

Make sure that clients can pull the resulting files from the tftpboot directory:
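Something like this covers the usual suspects (permissions, SELinux contexts and the firewall) on CentOS 7:

    # World-readable files and sane SELinux labels
    chmod -R 755 /var/lib/tftpboot
    restorecon -R /var/lib/tftpboot

    # Open the firewall for TFTP
    firewall-cmd --permanent --add-service=tftp
    firewall-cmd --reload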

Last but not least, go ahead and start the TFTP service. You could enable the service so it starts automatically, but I personally like to start and stop it at will as a kind of weak safety check, since this setup boots DBAN automatically without user intervention.
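Assuming the stock systemd units that ship with the tftp-server package:

    systemctl start tftp
    # systemctl enable tftp   # deliberately skipped, see above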

 

3. Create a new interface and DHCP scope in pfSense

Unfortunately I lost my screenshots here so we’ll have to just go by feel. You will need to perform the following steps:

  • Add a new vNIC to your pfSense VM that is connected to your DBAN External Virtual Network.
  • Use the pfSense web interface to assign the new vNIC as an interface.
  • Create a corresponding DHCP scope for your DBAN subnet. This is a good place to stop and test your current configuration. Plug a laptop in and see if you get a DHCP lease and test connectivity to both the pfSense interface and the CentOS DBAN server. If you’re not getting a lease and/or cannot contact both of those interfaces you will need to correct whatever is wrong with your network configuration before you proceed.
  • Modify your DHCP scope to “enable network booting”. Specify the IP address of the DBAN server and set the filename to ‘pxelinux.0’.

4. DESTROY!

Plug a victim computer into your DBAN network and choose to boot from the network. You should be presented with that glorious blue screen of impending doom.

 

Deploying VLC Media Player with SCCM

VLC Media Player is an F/OSS media player that supports a dizzying array of media formats. It’s a great example of one of those handy but infrequently used applications that are not included in our base image but generate help desk tickets when a user needs to view a live feed or listen to a meeting recording. Instead of just doing the Next-Next-Finish dance, let’s package and deploy it with SCCM. The 30 minutes it takes to package, test and deploy VLC will pay us back many times over when our help desk no longer has to manually install the software. This reduces the time it takes to resolve these tickets and ensures that the application gets installed in a standardized way.

Start by grabbing the appropriate installer from VideoLAN’s website and copying it to whatever location you use to store your source installers for SCCM. Then fire up the Administrative Console and create a New Application (Software Library – Applications – Create Application). We don’t get an .MSI installer so unfortunately we are actually going to have to do a bit of work: pick Manually specify the application information.

Next up, fill out all the relevant general information. There’s a tendency to skimp here but you might as well take the 10 seconds to provide some context and comments. You might save your team members or yourself some time in the future.

I generally make an effort to provide an icon for the Application Catalog and/or Software Center as well. Users may not know what “VLC Media Player” is but they may recognize the orange traffic cone. Again, it doesn’t take much up-front work to prevent a few tickets.

Now you need to add a Deployment Type to your Application. Think of the Application as the metadata wrapped around your Deployment Types which are the actual installers. This lets you pull the logic for handling different types of clients, prerequisites and requirements away from other places like separate Collections for Windows 7 32-bit and 64-bit clients and just have one Application with two Deployment Types (a 32-bit installer and a 64-bit installer) that gets deployed to a more generic Collection. As previously mentioned, we don’t have an .MSI installer so we will have to manually specify the deployment installation/uninstallation strings along with the detection logic.

  • Installation: vlc-2.2.8-win32.exe /L=1033 /S --no-qt-privacy-ask --no-qt-updates-notif
  • Uninstallation: %ProgramFiles(x86)%\VideoLAN\VLC\uninstall.exe /S

If you review the VLC documentation you can see that the /L switch specifies the language, the /S switch specifies a silent install and the --no-qt-privacy-ask --no-qt-updates-notif flags set the first-run settings so users don’t receive the prompts.

Without an MSI’s handy ProductCode for setting up our Detection Logic we will have to rely on something a little more basic: checking to see if vlc.exe is present to tell the client whether or not the Application is actually installed. I also like to add a Version check as well so that older installs of VLC are not detected and are subsequently eligible for being upgraded.

  • Setting Type: File System
  • Type: File
  • Path: %ProgramFiles(x86)%\VideoLAN\VLC
  • File or folder name: vlc.exe
  • Property: Version
  • Operator: Equals
  • Value: 2.2.8

Last but not least you need to set the User Experience settings. These are all pretty self-explanatory. I do like to actually set the maximum run time and estimated installation time to something relevant for the application, that way if the installer hangs it doesn’t just sit there for two hours before the agent kills it.

 

From there you should be able to test and deploy your new application! VLC Media Player is a great example of the kind of “optional” application that you could just deploy as Available to your entire workstation fleet and close tickets requesting a media player with instructions on how to use the Software Center.


Until next time, stay frosty!

SCCM, Asset Intelligence and Adobe SWID Tags

Licensing. It is confusing, constantly changing and expensive. It is that last part that our managers really care about come true-up time, and so a request in the format of, “Can you give me a report of all the installs of X and how many licenses of A and B we are using?” comes across your desk. Like many of the requests that land on a Systems Administrator’s desk, these can be deceptively tricky. This post will focus on Adobe’s products.

How many installs of Adobe Acrobat XI do we have?

There are a bunch of canned reports that help you right off the bat under Monitoring – Reporting – Reports – Software – Companies and Products. If you don’t have a Reporting Services Point installed yet then get on it! The following reports are a decent start:

  • Count all inventoried products and versions
  • Count inventoried products and versions for a specific product
  • Count of instances of specific software registered with Add or Remove Programs

You may find that these reports are less accurate than you’d hope. I think of them as the “raw” data and while they are useful they don’t gracefully handle things like the difference between “Adobe Systems” and “Adobe Systems Inc.”, counting those as two separate publishers. Asset Intelligence adds a bit of, well, intelligence and allows you to get reports that are more reflective of the real-world state of your endpoints.

Once you get your Asset Intelligence Synchronization Point installed (if you don’t have one already) you need to enable some Hardware Inventory Classes. Each of these incurs a minor performance penalty during the Software Inventory client task so you probably only want to enable the classes you think you will need. I find the SMS_InstalledSoftware and SMS_SoftwareTag classes to be the most useful by far so maybe start there.

You can populate these WMI classes by running the Machine Policy Retrieval & Evaluation Cycle client task followed by the Software Inventory cycle. You should now be able to get some juicy info:
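If you want to poke at the class directly on a client, here is a quick sketch (root\cimv2\sms is where the SCCM client publishes these inventory classes):

    # List what the SCCM client has inventoried on the local machine
    Get-WmiObject -Namespace 'root\cimv2\sms' -Class SMS_InstalledSoftware |
        Select-Object ProductName, ProductVersion, Publisher |
        Sort-Object ProductName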

 

Lots of good stuff in there, huh? Incidentally, if you need a WMI class that tracks software installs to write PowerShell scripts against, SMS_InstalledSoftware is far superior to the Win32_Product class because any query against Win32_Product will cause installed MSIs to be re-configured (KB974524). This is particularly troublesome if there is an SCCM Configuration Item that is repeatedly doing this (here).

There are some great reports that you get from SMS_InstalledSoftware:

  • Software 01A – Summary of Installed Software in a Specific Collection
  • Software 02D – Computers with a specific software installed
  • Software 02E – Installed software on a specific computer
  • Software 06B – Software by product name

All those reports give you a decent count of how many installs you have of a particular piece of software. That takes care of the first part of the request. How about the second?

 

What kind of installs of Adobe Acrobat XI do we have?

Between 2008 and 2010 Adobe started implementing the ISO/IEC 19770-2 SWID tag standard in their products for licensing purposes. Adobe has actually done a decent job at documenting their SWID tag implementation as well as providing information on how to decode the LeID. The SWID tag is an XML file that contains all the relevant licensing information for a particular endpoint, including such goodies as the license type, product type (Standard, Pro, etc.) and the version. This information gets pulled out of the SWID tag and populates the SMS_SoftwareTag class on your clients.

 

That’s a pretty good start but if we create a custom report using the following SQL query we can get something that looks Manager Approved (TM)!
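Something along these lines – the v_GS_SOFTWARE_TAG view and its column names are assumptions here, so verify them against your own site database before trusting the numbers:

    -- Count Adobe installs by SWID tag (LeID); view/column names assumed
    SELECT
        tag.SoftwareID0                AS [LeID],
        COUNT(DISTINCT sys.ResourceID) AS [Installs]
    FROM v_R_System sys
    JOIN v_GS_SOFTWARE_TAG tag ON tag.ResourceID = sys.ResourceID
    WHERE tag.SoftwareID0 LIKE '%Acrobat%'
    GROUP BY tag.SoftwareID0
    ORDER BY [Installs] DESC;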

 

Until next time, stay frosty.

Morale Is Low, Workload Is Up

Earlier this month, I came back from lunch and I could tell something was off. One of my team members, let’s call her Elaine, who is by far the most upbeat, relentlessly optimistic and quickest to laugh off any of our daily trials and tribulations, was silent, hurriedly moving around and uncharacteristically short with customers and coworkers. Maybe she was having a bad day, I wondered, as I made a mental note to keep tabs on her for the week to see if she bounced back to her normal self. When her attitude didn’t change after a few days, I got really worried.

Time to earn my team lead stripes, so I took her aside and asked her what was up. I could hear the steam venting as she started with, “I’m just so f*****g busy”. I decided to shut up and listen as she continued. There was a lot to unpack: she was under pressure to redesign our imaging process to incorporate a new department that got rolled under us, she was handling the majority of our largely bungled Office 365 Exchange Online post-migration support and she was still crushing tickets on the help desk with the best of them. The straw that broke the camel’s back – spending a day cleaning up her cubicle, which was full of surplus equipment, because someone commented that our messy work area looked unprofessional… “I don’t have time for unimportant s**t like that right now!” she said as she continued furiously cleaning.

The first thing I did was ask her what the high-priority task of the afternoon was and figure out how to move it somewhere else. Next I recommended that she finish her cleaning, take off early and then take tomorrow off. When someone is that worked up, myself included, generally a great place to start is to get some distance between you and whatever is stressing you out until you decompress a bit.

Next I started looking through our ticket system to see if I could get some supporting information about her workload that I could take to our manager.

Huh. Not a great trend.

That’s an interesting uptick that just so happens to coincide with us taking over the support responsibilities for the previously mentioned department. We did bring their team of four people over but only managed to retain two in the process. Our workload increased substantially too since we not only had to continue to maintain the same service level but we now have the additional challenge of performing discovery, taking over the administration and standardizing their systems (I have talked about balancing consolidation projects and workload before). It was an unfortunate coincidence that we had to schedule our Office 365 migration at the same time due to a scheduling conflict. Bottom line: we increased our workload by a not insignificant amount and lost two people. Not a great start.

I wonder how our new guys (George and Susan) are doing? Let’s take a look at the ticket distribution, shall we?

Huh. Also not a great trend.

Back in December 2016 it looks like Elaine started taking on more and more of the team’s tickets. August of 2017 was clearly a rough month for the team as we started eating through all that additional workload but noticeably that workload was not being distributed evenly.

Here is another view that I think really underlines the point.

Yeah. That sucks for Elaine.

As far back as a year ago Elaine was handling about 25% of our tickets and since then her share has increased to close to 50%. What makes this worse is that not only has the absolute quantity of tickets in August more than doubled compared to the average of the 11 preceding months, but the relative percentage of her contribution has doubled as well. This is bad and I should have noticed a long time ago.

Elaine and I had a little chat about this situation and here’s what I distilled out of it:

  • “If I don’t take the tickets they won’t get done”
  • “I’m the one that learns new stuff as it comes along so then I’m the one that ends up supporting it”
  • “There’s too many user requests for me to get my project work done quickly”

Service Delivery and Business Processes. A foe beyond any technical lead.

This is where my power as a technical lead ends. It takes a manager or possibly even an executive to address these issues but I can do my best to advocate for my team.

The first issue is actually simple. Elaine needs to stop taking it upon herself to own the majority of the tickets. If the tickets aren’t in the queue then no one else will have the opportunity to take them. If the tickets linger, that’s not Elaine’s problem, that’s a service delivery problem for a manager to solve.

The second issue is a little harder since it is fundamentally about the ability of staff to learn as they go, be self-motivated and be OK with just jumping into a technology without any real guidance or training. Round after round of budget cuts has decimated our training budget and increased our tempo to the point where cross-training and knowledge sharing are incredibly difficult. I routinely hear, “I don’t know anything about X. I never had any training on X. How am I supposed to fix X!” from team members and as sympathetic as I am about how crappy of a situation that is, there is nothing I can do about it. The days of being an “IT guy” that can go down The Big Blue Runbook of Troubleshooting are over. Every day something new that you have never seen before is broken and you just have to figure it out.

Elaine is right though – she is punching way above her weight, the result of which is that she owns more and more of the support burden as technology changes and as our team fails to evenly adopt the change. A manager could request some targeted training or maybe some force augmentation from another agency or contracting services. Neither is a particularly likely outcome given our budget, unfortunately.

The last one is a perennial struggle of the sysadmin: your boss judges your efficacy by your ability to complete projects; your users (and thus your boss’ peers via the chain of command) judge your efficacy by your responsiveness to service requests. These two standards are in direct competition. This is such a common and complicated problem that there is a fantastic book about it: Time Management for System Administrators.

The majority of the suggestions to help alleviate this problem require management buy-in and most of them our shop doesn’t have: an easy-to-use ticket system with notification features, a policy stating that tickets are the method of requesting support in all but the most exigent of circumstances, a true triage system, a rotating interrupt-blocker position and so on. The best I can do here is to recommend that Elaine develop some time management skills, work on healthy coping skills (exercise, walking, taking breaks, etc.) and do regular one-on-one sessions with our manager so she has a venue for discussing these frustrations privately; at least if they cannot be solved they can be acknowledged.

I brought a sanitized version of this to our team manager and we made some substantial progress. He reminded me that George and Susan have only been on our team for a month and that it will take some time for them to come up to speed before they can really start eating through the ticket queue. He also told Elaine that while her tenacity in the ticket queue is admirable, she needs to stop taking so many tickets so the other guys have a chance. If they linger, well, we can cross that bridge when we come to it.

The best we can do is wait and see. It’ll be interesting to see what happens as George and Susan adjust to our team and how well the strategy of leaving tickets unowned to encourage team members to grab them works out.

Until next time, stay frosty.

 

Five things to not screw up with SCCM

With great power comes great responsibility

Uncle Ben seemed like a pretty wise dude when he dropped this particular knowledge bomb on Peter Parker. As sysadmins we should already be aware of the tremendous amount of power that has been placed into our hands. Tools like SCCM further underline this point and while I think SCCM is an amazing product with the ability to be a fantastic force multiplier, you can also reduce your business’ infrastructure to ashes within hours if you use it wrong. I can think of two such events where an SCCM Administrator has mistakenly done some tremendous damage: in 2014 a Windows 7 deployment re-imaged most of the computers, including servers, at Emory University, and in another unfortunate event a contractor managed to accomplish the same thing at the Commonwealth Bank of Australia back in the early 2000s.

There are a few things you can do to enjoy the incredible automation, configuration and standardization benefits of SCCM while reducing your likelihood of an R.G.E.

Dynamic Collection Queries

SCCM is all about performing an action on large groups of computers. Therefore it is absolutely imperative that your Collections ACTUALLY CONTAIN THE THINGS YOU THINK THEY DO. Your Collections need to start large and gradually get smaller using a sort of matryoshka doll scheme based on dynamic queries and limiting Collections. You should double/triple/quadruple check your dynamic queries to make sure they are doing what you think they are doing when you create them. It is wise to review these queries on a regular basis to make sure an underlying change in something like Active Directory OU structure or naming convention hasn’t caused your query to match 2000 objects instead of your intended 200. Finally, I highly recommend spot-checking Collection members of your targeted Collection before deploying anything particularly hairy and/or when deploying to a large Collection because no matter how diligent we are, we all make mistakes.
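For illustration, a dynamic membership rule in WQL might look like this (the OU path is a placeholder). A renamed OU or a loosened LIKE pattern is all it takes for a query like this to quietly balloon:

    select SMS_R_System.ResourceId, SMS_R_System.Name
    from SMS_R_System
    where SMS_R_System.OperatingSystemNameandVersion like "%Workstation%"
    and SMS_R_System.SystemOUName = "CORP.EXAMPLE.COM/WORKSTATIONS/HELPDESK"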

Maintenance Windows

“The bond traders are down! The bond traders are down! Hue and cry! Panic! The CIO is on his way to your boss’s office!” Not what you want to hear at 7:00 AM as you are just starting on your first cup of coffee, huh? You can prevent this by making sure your Maintenance Windows are set up correctly. SCCM will do what you tell it to do, and if you tell it to allow the agent to reboot at 11:00AM instead of 11:00PM, that’s what’s going to happen.

I like setting up an entirely separate Collection hierarchy that is used solely for setting Maintenance Windows and include my other Collections as members. This prevents issues where the same Collection is used for both targeting and scheduling. It also reduces Maintenance Window sprawl where machines are members of multiple Collections all with different Maintenance Windows. It’s important to consider that Maintenance Windows are “union-ed” so if you have a client in Collection A with a Maintenance Window of 20:00 – 22:00 and in Collection B with a Maintenance Window of 12:00 – 21:00 that client can reboot anywhere between 12:00 – 22:00. There’s nothing more annoying than a workstation that was left in a forgotten testing Collection with a Maintenance Window spanning the whole business day – especially after the technician was done testing and that workstation was delivered to some Department Director.

I am also a huge fan of the idea of a “Default Maintenance Window” where you have a Maintenance Window that is in the past and non-recurring that all SCCM clients are a member of. This means that no matter what happens with a computer’s Collection membership it isn’t just going to randomly reboot if it has updates queued up and its current Maintenance Window policy is inadvertently removed.

Last but not least, and this goes for really anything that is scheduled in SCCM, pay attention to date and time. Watch for AM versus PM, 24-hour time vs. 12-hour time, new-day rollover (i.e., 08/20 11:59PM to 08/21 12:00AM) and UTC versus local time.

Required Task Sequences

Of all the things in SCCM this is probably one of the most dangerous. Task Sequences generally involve re-partitioning, re-formatting and re-imaging a computer, which has the nice little side effect of removing everything previously on it. You’ll notice that both of the incidents I mentioned at the start of this post were caused by Task Sequences that inadvertently ran on a much larger group of computers than was intended. As a general guideline, I counsel staff to avoid deploying Task Sequences as Required outside of the Unknown Computers Collection. The potential to nuke your line-of-business application servers and replace them with Windows 10 is reduced if you have done your fundamentals right in setting up your Collections but I still recommend deploying to small Collections, making your Deployment Available instead of Required (especially if you are testing), restricting who can deploy Task Sequences and password protecting the Task Sequence. I would much rather reboot servers to clear the WinPE environment than recover them from backups.

Automatic Deployment Rules

Anything in SCCM that does stuff automatically deserves some scrutiny. Automatic Deployment Rules are another version of Dynamic Collection Queries: you want to use them and they make your life easier, but you need to be sure that they do what you think they do, especially before they blast out this month’s patches to the All Clients collection instead of the Patch Testing collection. Deployment templates can make it harder to screw up your SUP deployments. Once again, pay attention to the advertisement and deadline times, watching for mistakes with UTC vs. local time or +1 day rollover, the Maintenance Window behavior and which collection you are deploying to. And please, please, please test your SUP groups before deploying them widely. You too can learn from our mistakes.

Source Files Management and Organization

A messy boat is a dangerous boat. There is a tendency for the source files directory that you use to store all your installers for Application and Package builds to just descend into chaos over time. This makes it increasingly difficult to figure out which installers are still being used and what stuff was part of some long-forgotten test. What’s important here is that you have a standard for file organization and you enforce it with an iron fist.

I like to break things out like this:

A picture depicting the Source Files folder structure

Organizing your source files… It’s a Good Thing.

It’s a pretty straightforward scheme but you get the idea: Applications – Vendor – Software Title – Version and Bitness – Installer. You may need to add more granularity to your Software Updates Deployment Package folders depending on your available bandwidth and how many updates you are deploying in a single SUP group. We have had good results with grouping them by year, but then again we are not an agency with offices all over rural Alaska.

 

Mitigation Techniques

There are a few techniques you can use to prevent yourself from doing something terrible.

Role-based Access Control

You can think of Security Scopes as the largest possible number of clients a single admin can break. If you have a big enough team, the clever use of RBAC will allow you to limit how much damage individual team members can do. For example: you could divide your 12-person SCCM team into three sub-teams and use RBAC to limit each sub-team to only being able to manage 1/3 of your clients. You could take this idea a step further and give your tier-1 help desk the ability to do basic “non-dangerous” actions while still allowing them to use SCCM to perform their job. This is pretty context-specific but there is a lot you can do with RBAC to limit the potential scope of an administrator’s actions.

Application Requirements (Global Conditions)

You can use Application Requirements as a basic mechanism to prevent bad things from happening if they are deployed to the wrong Collection inadvertently.

Look at all these nice, clean servers… it would be a shame if someone accidentally deployed the Java JRE to all of them, wouldn’t it? Well, if you put in a Requirement that checks the value of ProductType in the Win32_OperatingSystem WMI class to ensure the client has a workstation operating system then the Application will fail its Requirements check and won’t be installed on those servers.
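For reference, the underlying WQL check is tiny (ProductType 1 is a workstation, 2 a domain controller, 3 a server):

    SELECT ProductType FROM Win32_OperatingSystem WHERE ProductType = 1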

 

There’s so much in WMI that you could build WQL queries that prevent “dangerous” applications from meeting their Requirements on clients outside the intended deployment.

 

PowerShell Killswitch

SCCM is a pull-based architecture. An implication of this is once the clients have a bad policy they are going to act on it. The first thing you should do if you discover a policy is stomping on your clients is to try and limit the damage by preventing unaffected clients from pulling it. A simple PowerShell script that stops the IIS App Pools backing your Management Points and Distribution Points will act as a crude but effective kill switch. By having this script prepped and ready to go you can immediately stop the spread of something bad and then focus your efforts on correcting the mistake.
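A rough sketch of such a killswitch – the server names are placeholders and it assumes the WebAdministration module is available on the site systems:

    # Emergency brake: stop every IIS app pool on the MPs/DPs so clients
    # stop pulling the bad policy. Blunt by design.
    $siteSystems = 'MP01.corp.example.com', 'DP01.corp.example.com'

    Invoke-Command -ComputerName $siteSystems -ScriptBlock {
        Import-Module WebAdministration
        Get-ChildItem IIS:\AppPools |
            Where-Object { $_.State -eq 'Started' } |
            ForEach-Object { Stop-WebAppPool -Name $_.Name }
    }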

Sane Client Settings

There is a tendency to crank up some of the client-side polling frequencies in smaller SCCM implementations in order to make things “go faster”; however, another way to look at the polling interval is that it is the period of time it takes for all of your clients to have received a bad policy and possibly acted on it. If your client policy polling interval is 15 minutes, that means in 15 minutes you will have re-imaged all your clients if you really screwed up and deployed a Required Task Sequence to All Systems. The longer the polling interval, the more time you have to identify a bad policy, stop it and begin rebuilding before it has nuked your whole fleet.

Team Processes

A few simple soft processes can go a long way. If you are deploying out an Application or Updates to your whole fleet, send out a notification to your business leaders. People are generally more forgiving of mistakes when they are notified of significant changes first. Perform a gradual roll-out over a week or two instead of blasting out your Office 365 installation application to all 500 workstations at once. Setting sane scheduling and installation deadlines in your Deployments helps here too.

If you are doing something that could be potentially dangerous, grab a coworker and do pilot/co-pilot for the deployment. You (the pilot) perform the work but you walk your coworker (the co-pilot) through each step and have them verify it. Putting a second pair of eyes on a deployment avoids things like inadvertently clicking the “Allow clients to restart outside of Maintenance Windows” checkbox. Next time you need to do this deployment switch roles – Bam! Instant cross training!

Don’t be in a hurry. Nine times out of ten the dangerous thing is simple to deploy, but you cannot afford to get those simple settings wrong. Take your time to do things right and push back when you are given unrealistic schedules or asked to deploy things outside of your roll-out process. In the mountains we like to say, slow is fast and fast is dead. In SCCM I like to say, slow is fast, and fast is fired.

Read-Only Friday is the holiest of days on the Sysadmin calendar. Keep it with reverence and respect.

Consider enabling the High Risk Deployment Setting. If you do this, make sure you tune the settings so your admins don’t get alert fatigue and learn to just click next, next, finish – or eventually they will do exactly that and go “oops”.

 

I hope this is helpful. If you have other ideas on how not to blow up everything with SCCM feel free to comment. I’m always up for learning something new!

Until next time, stay frosty.


Salary, Expectations and Automation

It has been an interesting few months. We have had a few unexpected projects pop up and I have ended up owning most of them. This left me feeling pretty beaten down and a little bit demoralized. I don’t like missing deadlines and I don’t like constantly switching from one task to the next without ever making headway. It’s not my preferred way to work.

One thing that I am continually trying to remind myself is that I should use the team. I don’t have to own everything, nor should I, so I started creating tickets on behalf of my users (we don’t have a policy requiring tickets) and just dumping them into our generic queue so someone else could pick them up.

Guess what happened? They sat there. Now there are a few reasons why things played out this way (see this post) but you can imagine this was not the result I was hoping for. I was hoping my tier-2 folks would have jumped in and grabbed some of these requests:

  • Review the GPOs applied to a particular group of servers and modify them to accommodate a new service account
  • Review some NTFS permissions and restructure them to be more granular
  • Create a new IIS site along with the corresponding certificate and coordinate with our AD team to get the appropriate DNS records put in place
  • Help one of our dev teams re-platform / upgrade a COTS application
  • Re-configure IIS on a particular site to support HTTPS.

Part of the reason we have so much work right now is that we are assuming the responsibility for a department that previously had their own internal IT staff (Yay! Budget cuts!). Not everyone was happy with giving up “their IT guys” and so during our team meetings we started reviewing work in the queue that was not getting moved along.

A bunch of these unloved tickets were “mine”, that is to say, they were originally requests that came directly to me, which I then created tickets for, hoping to bump them back into the queue. This should sound familiar. The consensus though was that it was “my work” and that I was not being diligent enough in keeping track of the ticket queue.

Please bear in mind for the next two paragraphs, that we have a small 12 person team. It is not difficult for us to get a hold of another team member.

I’ll unpack the latter idea first. In a sense, I agree. I could do a better job of watching the queue, but that’s simply because I was not watching it. My perception was that, as someone nominally at the top of our support tier, our help desk watches the queue, catches interrupts from customers and then escalates stuff if they need assistance. I was thinking my tickets should come from my team and not directly from the queue.

The former idea I’m a little less sympathetic to. It’s not “my work”, it’s the team’s work, right? And here is where those sour grapes start to ferment… that list of tickets up there does not seem like “tier-3 work” to me. It seems like junior sysadmin work. If that is not the case then I have to ask the question: what are those guys doing instead? If that’s not “work” that tier-1/tier-2 handle then what is?

In the end, of course, I took the tickets and did the work, which of course put me even further behind on some of my projects.

I have puzzled over our ticket system, support process and team dynamics quite a bit (see here, here and here) and there are a lot of different threads one could pull on, but a new explanation came to mind after this exercise: maybe our tier-2 guys are not doing this work because they can’t? Maybe they just don’t have the skills to do those kinds of things and maybe it’s not realistic to expect people to have that level of skill, independence and work ethic for what we pay them? I hate this idea. I hate it because if that’s truly the case there is very little I can do to fix it. I don’t control our training budget or assign team priorities or have any ability to negotiate graduated raises matched with a corresponding training plan. I don’t do employee evaluations, I cannot put someone on an improvement plan and I certainly cannot let an employee go. But I really don’t like this idea because it feels like I’m crapping on my team. I don’t like it because it makes me feel guilty.

But are our salaries and expectations unrealistic?

  • Help Desk Staff (Tier-1) – $44k – $50k per year
  • Junior Sysadmins (Tier-2) – $58k – $68k per year
  • Sysadmins (Tier-3) – $68k – $78k per year

It’s a pretty normal “white collar” setup: salaried, no overtime eligibility, with health insurance and a 401k with a decent employer match. We can’t really do flexible work schedules or work-from-home but we do have a pretty generous paid leave policy. However – this is Alaska, where everything is as expensive as the scenery is beautiful. A one bedroom rental will run you at least $1200 a month plus utilities which can easily be a few hundred dollars in the winter depending on your heating source. Gasoline is on average a dollar more per gallon than whatever it is currently in the Lower 48. Childcare is about $1100 a month per kiddo for kids under three. Your standard “American dream” three bedroom, two bath house will cost you around $380,000. All things being equal, it is about 25% more expensive to live here than your average American city so when you think about these wages knock a quarter of them off to adjust for cost of living.

Those wages don’t look so hot anymore huh? Maybe there is a reason (other than our State’s current recession) that most IT positions in my organization take at least six months to fill. The talent pool is shallow and not that many people are interested in wading in.

We all have our strengths and weaknesses. I suspect our team is much like others with a spectrum of talent but I think the cracks are beginning to show… as positions are cut, work is not being evenly distributed and fewer and fewer team members are taking more and more of the work. I suspect that’s because these team members have the skills to eat that workload with automation. They can use PowerShell to do account provisioning instead of clicking through Active Directory Users and Computers. They can use SCCM to install Visio instead of RDPing and pressing Next-Next-Finish on each computer. A high-performing team member would realize that the only way they could do that much work is to learn some automation skills. A low-performing team member would do what instead? I’m not sure. But maybe, just maybe, as we put increasing pressure on our tier-1 and tier-2 staff to “up their skills” and to “eat the work”, we are not being realistic.

Would you expect someone making $44k – $50k a year in cost-of-living-adjusted wages to be an SCCM wizard? Or pick up PowerShell?

Are we asking too much of our staff? What would you expect someone paid these wages to be able to do? Like all my posts – I have no answers, only questions, but hopefully I’m asking the right ones.

Until next time, stay frosty!

Veeam, vSphere Tags and PowerCLI – A Practical Use-case

vSphere tags have been around for a while but up until quite recently I had yet to find a particularly compelling reason to use them. We are a small enough shop that it just didn’t make sense to use them for categorization purposes with only 200 virtual machines.

This year we hopped on the Veeam bandwagon and while I was designing our backup jobs I found the perfect use-case for vSphere tags. I wanted a mechanism that gave us a degree of flexibility, so I could protect different “classes” of virtual machines to the degree appropriate to their criticality, but that was also automatic, since manual processes have a way of not getting done. I also did not want an errant admin to place a virtual machine in the wrong folder or cluster and then have the virtual machine be silently excluded from any backup jobs. The solution was to use vSphere tags and PowerShell with VMware’s PowerCLI module.

If we look at the different methods you can use for targeting source objects in Veeam Backup Jobs, tags seem to offer the most flexibility. Targeting based on clusters works fine as long as you want to include everything on a particular cluster. Targeting based on hosts works as long as you don’t expect DRS to move a virtual machine to another host that is not included in your Backup Job. Datastores initially seemed to make the most sense since our different virtual machine “classes” roughly correspond to what datastore they are using (and hence what storage they are on) but not every VM falls cleanly into each datastore “class”. Folders would functionally work the same way here as tags, but tags are just a bit cleaner: if I create a folder structure for a new line-of-business application, I don’t have to go back into each of my Veeam jobs and update their targeting to add the new folders, I just tag the folders appropriately and away I go.

Tags also work great for exclusions. We run a separate vSphere cluster for our SQL workloads, primarily for licensing reasons (SQL Datacenter licensing for the win!). I just set up a tag for the whole cluster, use it for targeting VMs for my application-aware Veeam Backup Jobs and also use it to exclude those virtual machines from the standard Backup Jobs (why back it up twice? Especially when one backup is not transaction-consistent?).

How about checking to see if there are any virtual machines that missed a tag assignment and therefore are not being backed up?

PowerShell to the rescue once again!
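Here is a sketch of that audit – the vCenter name, tag category and mail settings are placeholders:

    # Audit for VMs that missed a tag assignment and are therefore
    # invisible to our tag-targeted Veeam jobs. Names are placeholders.
    Import-Module VMware.PowerCLI
    Connect-VIServer -Server 'vcenter.corp.example.com'

    # Any VM without a tag in the 'BackupClass' category is a problem
    $untagged = Get-VM | Where-Object {
        -not (Get-TagAssignment -Entity $_ -Category 'BackupClass')
    }

    if ($untagged) {
        Send-MailMessage -SmtpServer 'smtp.example.com' `
            -From 'veeam-audit@example.com' -To 'sysadmins@example.com' `
            -Subject 'VMs missing a backup tag!' `
            -Body ($untagged | Select-Object Name, Folder | Out-String)
    }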

 

Another simple little snippet that will send you an email if you managed to inadvertently place a virtual machine into a location where it is not getting a tag assignment and therefore not getting automatically included in a backup job.
 

Until next time!

Quick and dirty PowerShell snippet to get Dell Service Tag

The new fiscal year is right around the corner for us. This time of year brings all kinds of fun for us Alaskans: spring king salmon runs, our yearly dose of three days’ worth of good weather, and licensing true-up and hardware purchases. Now there’s about a million different ways to skin this particular cat but here’s a quick and dirty method with PowerShell.
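Something like this – the computer names are placeholders, and on Dell hardware the service tag is just the BIOS serial number:

    # Quick and dirty: pull the Dell Service Tag over WMI
    $computers = 'WS001', 'WS002', 'WS003'   # placeholders

    foreach ($computer in $computers) {
        # Skip machines that are offline
        if (Test-NetConnection -ComputerName $computer -InformationLevel Quiet) {
            Get-WmiObject -Class Win32_BIOS -ComputerName $computer |
                Select-Object PSComputerName, SerialNumber
        }
    }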

 

If you don’t have access to the ridiculously useful Test-NetConnection cmdlet then you should probably upgrade to PowerShell v5 since, unlike most Microsoft products, PowerShell seems to actually improve with each version. Barring that, you can just open a TCP socket by instantiating the appropriate .NET object.
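A sketch of that fallback (host and port are placeholders):

    # Poor man's Test-NetConnection: open a TCP socket with .NET
    $socket = New-Object System.Net.Sockets.TcpClient
    try {
        $socket.Connect('WS001', 135)   # placeholder host/port
        $online = $socket.Connected
    } catch {
        $online = $false
    } finally {
        $socket.Close()
    }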

 

The slickest way I have ever seen this done, though, was with SCCM and the Dell Command Integration Suite for System Center, which can generate a warranty status report for everything in your SCCM Site by connecting to the database, grabbing all the service tags and then sending them up to Dell’s warranty status API to get all kinds of juicy information like model, service level, ship date and warranty status. Unfortunately, since this was tremendously useful, the team overseeing the warranty status web service decommissioned it abruptly back in 2016. Thanks for nothing, ya jerks!

 

Use PowerCLI to get a list of old snapshots

VMware’s snapshot implementation is pretty handy. A quick click and you have a rollback point for whatever awesomeness your Evel Knievel rockstar developer is about to roll out. There are, as expected, some downsides.

Snapshots are essentially implemented as differencing disks: at the point in time you take a snapshot, a child .vmdk file is created that tracks any disk changes from that point forward. When you delete the snapshot the changed blocks are merged back into the parent .vmdk. If you decide to roll back the changes you just revert to the original parent disk, which remained unchanged.

There are some performance implications:

  • Disk I/O is impacted for the VM as ESXi has to work a bit harder to write new changes to the child .vmdk file while referencing the parent .vmdk file for unchanged blocks.
  • The child .vmdk files grow over time as more and more changes accumulate, and the bigger they get the more disk I/O is impacted. As you can imagine, multiple snapshots can certainly impact disk performance.
  • When the snapshot is deleted, the process of merging the changes in the child .vmdk back into the parent .vmdk can be very slow depending on the size of the child .vmdk. Ask my DBA how long it took to delete a snapshot after he restored a 4TB database…

They also don’t make a great backup mechanism in and of themselves:

  • The parent and child .vmdk are located on the same storage and if that storage disappears so does the production data (the parent .vmdk) and the “backup” (the child .vmdk) with it.
  • They are not transaction-consistent. The hypervisor temporarily freezes I/O in order to take the snapshot but does not do any application-aware quiescing by default (although VMware Tools can quiesce the guest file system and memory).

 

For all these reasons I like to keep track of what snapshots are out there and how long they have been hanging around. Luckily PowerCLI makes this pretty easy:
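A sketch of the idea (vCenter and SMTP details are placeholders):

    # Report on snapshots older than seven days. Names are placeholders.
    Import-Module VMware.PowerCLI
    Connect-VIServer -Server 'vcenter01.corp.example.com', 'vcenter02.corp.example.com'

    $oldSnapshots = Get-VM | Get-Snapshot |
        Where-Object { $_.Created -lt (Get-Date).AddDays(-7) } |
        Select-Object VM, Name, Created, SizeGB

    if ($oldSnapshots) {
        Send-MailMessage -SmtpServer 'smtp.example.com' `
            -From 'vcenter@example.com' -To 'sysadmins@example.com' `
            -Subject 'Snapshots older than seven days' -BodyAsHtml `
            -Body ($oldSnapshots | ConvertTo-Html | Out-String)
    }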

 
This little snippet will connect to your vCenter servers, go through all the VMs, locate any snapshots older than seven days and then send you a nicely formatted email:

 

If you haven’t invested a little time in learning PowerShell/PowerCLI, I highly recommend you start. It is such an amazing toolset and force multiplier that I cannot imagine working without it these days.

Until next time, stay frosty.


Prometheus and Sisyphus: A Modern Myth of Developers and Sysadmins

I am going to be upfront with you. You are about to read a long and meandering post that will seem almost a little too whiny at times, where I talk some crap about our developers and their burdens (applications). I like our dev teams and I like to think I work really well with their leads, so think of this post as a bit of satirical sibling rivalry; underneath the hyperbole and good-natured teasing there might be a small, “little-t” truth.

That truth is that operations, whether it’s the database administrator, the network team, the sysadmins or the help desk, always, always, always gets the short straw and that is because collectively we own “the forest” that the developers tend to their “trees” in.

I have a lot to say about the oft-repeated sysadmin myth about “how misunderstood sysadmins are” and how they just seem to get stepped on all the time and so on and so on. I am not a big fan of the “special snowflake sysadmin syndrome” and I am especially not a fan of it when it is used as an excuse to be rude or unprofessional. That being said, I think it is worth stating that even I know I am half full of crap when I say sysadmins always get the short straw.

OK, disclaimers are all done! Let’s tell some stories!

 

DevOps – That means I get Local Admin, right?

My organization is quite granular and each of our departments more or less maintains its own development team supporting its own mission-specific applications, along with either a developer that essentially fulfils an operations role or a single operations guy doing support solely for that department. The central ops team maintains things like the LAN, Active Directory, the virtualization platform and so on. If the powers that be wanted a new application for their department, the developers would request the required virtual machines, the ops team would spin up a dozen VMs off of a template, join them to AD, give the developers local admin and off we go.

Much like Bob Belcher, all the ops guys could do is “complain the whole time”.

 

This arrangement led to some amazing things that break in ways that are too awesome to truly describe:

  • We have an in-house application that uses SharePoint as a front-end, calls some custom web services tied to a database or two that auto-populates an Excel spreadsheet that is used for timekeeping. Everyone else just fills out the spreadsheet.
  • We have another SharePoint integrated application, used ironically enough for compliance training, that passes your Active Directory credentials in plaintext through two or three servers all hosting different web services.
  • Our deployment process is essentially to copy everything off your workstation onto the IIS servers.
  • Our revision control is: E:\WWW\Site, E:\WWW\Site (Copy), E:\WWW-Site-Dev McDeveloper
  • We have an application that manages account on-boarding, a process that is already automated by our Active Directory team. Naturally they conflict.
  • We had at one point in time, four or five different backup systems all of which used BackupExec for some insane reason, three of which backed up the same data.
  • We managed to break a production IIS server by restoring a copy of the test database.
  • And then there’s Jenga: Enterprise Edition…

 

Jenga: Enterprise Edition – Not so fun when it needs four nines of uptime.

A satirical (but only just) rendering of one of our applications’ design patterns, which I call “The Spider Web”

What you are looking at is my humorous attempt to scribble out a satirical sketch of one of our line-of-business applications, which managed to actually turn out pretty accurate. The Jenga application is so named because all the pieces are interconnected in ways that turn the prospect of upgrading any of it into the project of upgrading all of it. Ready? Ere’ we go!

It’s built around a core written in a language that we have not had any on-staff expertise in for the better part of ten years. In order to provide the functionality the business needed as the application aged, the developers wrote new “modules” in other languages that essentially just call APIs or exposed services and then bolted them on. The database is relatively small, around 6 TB, but almost 90% of it is static read-only data that we cannot separate out, which drastically reduces what our DBA and I can do in terms of recovery, backup, replication and performance optimization. There are no truly separate development or testing environments, so we use snapshot copies to expose what appear to be “atomic” copies of the production data (which contains PII!) on two or three other servers so our developers can validate application operations against it. We used to do this with manual fricking database restores, which was god damned expensive in terms of time and storage. There are no less than eight database servers involved, but the application cannot be distributed or set up in some kind of multi-master deployment with convergence, so staff at remote sites suffer abysmal performance if anything resembling contention happens on their shared last-mile connections. The “service accounts” are literally user accounts that the developers use to RDP to the servers, start the application’s GUI, and then enable the application’s various services by interacting with the above-mentioned GUI (any hiccup in the RDP session and *poof* there goes that service). The public-facing web server directly queries the production database. The internally consumed pieces of the application and the externally consumed pieces are co-mingled, meaning an outage anywhere is an outage everywhere. It also means we cannot segment the application into public and internally facing pieces. The client requires a hard-coded drive map to run since application upgrades are handled internally with copy jobs that essentially replace all the local .DLLs on a workstation when new ones are detected. And last but not least, it runs on an EOL version of MSSQL.

Whew. That was a lot. Sorry about that. Despite the fact that a whole department pretty much lives or dies by this application’s continued functionality, our devs have not made much progress in re-architecting and modernizing it. This really is not their fault but it does not change the fact that my team has an increasingly hard time keeping this thing running in a satisfactory manner.

 

Operations: The Digital Custodian Team.

In the middle of a brainstorming session where we were trying to figure out how to move Jenga to a new virtualization infrastructure – all on a weekend when I will be traveling, in order to squeeze the outage into the only period within the next two months that was not going to be unduly disruptive – I began to feel like my team was getting screwed. They have more developers supporting this application than we have in our whole operations team and it is on us to figure out how to move Jenga without losing any blocks or having any lengthy service windows? What are those guys actually working on over there? Why am I trying to figure out which missing .DLL from .NET 1.0 needs to be imported onto the new IIS 8.5 web servers so some obscure service that no one really understands runs in a supported environment? Why does operations own the life-cycle management? Why aren’t the developers updating and re-writing code to reflect the underlying environmental and API changes each time a new server OS is released with a new set of libraries? Why are our business expectations for application reliability so widely out of sync with what the architecture can actually deliver? Just what in the hell is going on here?!

Honestly? I don’t know, but it sucks. It sucks for the customers, it sucks for the devs, but mostly I feel like it sucks for my team because we have to support four other line-of-business applications. We own the forest, right? So when a particular tree catches on fire they call us to figure out what to do. No one mentions that we probably should expect trees wrapped in paraffin wax and then doused in diesel fuel to catch on fire. When we point out that tending trees in this manner probably won’t deliver the best results if you want something other than a bonfire, we get met with a vague shrug.

Is this how it works? Your team of rockstar, “creative-type”, code-poets whip up some kind of amazing business application, celebrate and then hand it off to operations where we have to figure out how to keep it alive as the platform and code base age into senility for the next 20 years? I mean who owns the on-call phone for all these applications… hint: it’s not the dev team.

I understand that sometimes messes happen… just why does it feel like we are the only ones cleaning it up?

 

You’re not my Supervisor! Organizational Structure and Silos!

Bureaucratium ad infinitum.

 

At first blush I was going to blame my favorite patsy, Process Improvement and the insipid industry around it, for this current state of affairs, but after some thought I think the real answer here is something much simpler: the dev team and my team don’t work for the same person. Not even close. If we play a little game of “trace the organizational chart” we have five layers of management before we reach a position that has direct reports that eventually lead to both teams. Each one of those layers is a person – with their own concerns, motivations, proclivities and spin they put on any given situation. The developers and operations team (“dudes that work”), more or less, agree that the design of the Jenga application is Not a Good Thing (TM). But as each team gets told to move in a certain direction by each layer of management our efforts and goals diverge. No amount of fuzzy-wuzzy DevOps or new-fangled Agile Standup Kanban Continuous Integration Gamification Buzzword Compliant bullshit is ever going to change that. Nothing makes “enemies” out of friends faster than two (or three or four) managers maneuvering for leverage and dragging their teams along with them. I cannot help but wonder what our culture would be like if the lead devs sat right next to me and we established project teams out of our combined pool of developer and operations talent as individual departments put forth work. What would things be like if our developers were not chained to some stupid line-of-business application from the late ’80s, toiling away to polish a turd and implement feature requests like some kind of modern Promethean myth? What would things be like if our operations team was not constantly trying to figure out how to make old crap run while our budgets and staff are whittled away, snatching victory from defeat time and time again only to watch the cycle of mistakes repeat itself again and again like some kind of Sisyphean dystopia with cubicles? What if we could sit down together and, I dunno… fix things?

Sorry there are no great conclusions or flashes of prophetic insight here, I am just as uninformed as the rest of the masses, but I cannot help but think, maybe, maybe we have too many chefs in the kitchen arguing about the menu. But then again, what do I know? I’m just the custodian.

Until next time, stay frosty.