Monthly Archives: September 2017

Morale Is Low, Workload Is Up

Earlier this month, I came back from lunch and I could tell something was off. One of my team members, let’s call her Elaine, who is by far the most upbeat, relentlessly optimistic and quickest to laugh off any of our daily trials and tribulations, was silent, hurriedly moving around and uncharacteristically short with customers and coworkers. Maybe she was having a bad day, I wondered, as I made a mental note to keep tabs on her for the week to see if she bounced back to her normal self. When her attitude didn’t change after a few days, I was really worried.

Time to earn my team lead stripes, so I took her aside and asked her what was up. I could hear the steam venting as she started with, “I’m just so f*****g busy”. I decided to shut up and listen as she continued. There was a lot to unpack: She was under pressure to redesign our imaging process to incorporate a new department that got rolled under us, she was handling the majority of our largely bungled Office 365 Exchange Online post-migration support and she was still crushing tickets on the help desk with the best of them. The straw that broke the camel’s back – spending a day cleaning up her cubicle, which was full of surplus equipment, because someone commented that our messy work area looked unprofessional… “I don’t have time for unimportant s**t like that right now!” she said as she continued furiously cleaning.

The first thing I did was ask her what the high priority task of the afternoon was and figure out how to move it somewhere else. Next I recommended that she finish her cleaning, take off early and then take tomorrow off. When someone is that worked up, myself included, a great place to start is generally to get some distance between you and whatever is stressing you out until you decompress a bit.

Next I started looking through our ticket system to see if I could get some supporting information about her workload that I could take to our manager.

Huh. Not a great trend.

That’s an interesting uptick that just so happens to coincide with us taking over the support responsibilities for the previously mentioned department. We did bring their team of four people over but only managed to retain two in the process. Our workload increased substantially too, since we not only had to continue to maintain the same service level but we now have the additional challenge of performing discovery, taking over the administration and standardizing their systems (I have talked about balancing consolidation projects and workload before). It was an unfortunate coincidence that we had to schedule our Office 365 migration at the same time due to a scheduling conflict. Bottom line: We increased our workload by a not insignificant amount and lost two people. Not a great start.

I wonder how our new guys (George and Susan) are doing? Let’s take a look at the ticket distribution, shall we?

Huh. Also not a great trend.

Back in December 2016 it looks like Elaine started taking on more and more of the team’s tickets. August of 2017 was clearly a rough month for the team as we started eating through all that additional workload but noticeably that workload was not being distributed evenly.

Here is another view that I think really underlines the point.

Yeah. That sucks for Elaine.

As far back as a year ago, Elaine was handling about 25% of our tickets and since then her share has increased to close to 50%. What makes this worse is that not only has the absolute quantity of tickets in August more than doubled compared to the average of the 11 preceding months, but her relative share has doubled as well. This is bad and I should have noticed a long time ago.

Elaine and I had a little chat about this situation and here’s what I distilled out of it:

  • “If I don’t take the tickets they won’t get done”
  • “I’m the one that learns new stuff as it comes along so then I’m the one that ends up supporting it”
  • “There’s too many user requests for me to get my project work done quickly”

Service Delivery and Business Processes. A foe beyond any technical lead.

This is where my power as a technical lead ends. It takes a manager or possibly even an executive to address these issues but I can do my best to advocate for my team.

The first issue is actually simple. Elaine needs to stop taking it upon herself to own the majority of the tickets. If the tickets aren’t in the queue then no one else will have the opportunity to take them. If the tickets linger, that’s not Elaine’s problem, that’s a service delivery problem for a manager to solve.

The second issue is a little harder since it is fundamentally about the ability of staff to learn as they go, be self-motivated and be OK with just jumping into a technology without any real guidance or training. Round after round of budget cuts has decimated our training budget and increased our tempo to the point where cross training and knowledge sharing are incredibly difficult. I routinely hear, “I don’t know anything about X. I never had any training on X. How am I supposed to fix X!” from team members, and as sympathetic as I am about how crappy a situation that is, there is nothing I can do about it. The days of being an “IT guy” that can go down The Big Blue Runbook of Troubleshooting are over. Every day something new that you have never seen before is broken and you just have to figure it out.

Elaine is right though – she is punching way above her weight, the result of which is that she owns more and more of the support burden as technology changes and as our team fails to evenly adopt the change. A manager could request some targeted training or maybe some force augmentation from another agency or contracting services. Unfortunately, neither is a particularly likely outcome given our budget.

The last one is a perennial struggle of the sysadmin: Your boss judges your efficacy by your ability to complete projects, while your users (and thus your boss’ peers via the chain of command) judge your efficacy by your responsiveness to service requests. These two standards are in direct competition. This is such a common and complicated problem that there is a fantastic book about it: Time Management for System Administrators.

The majority of the suggestions to help alleviate this problem require management buy-in and most of them our shop doesn’t have: An easy-to-use ticket system with notification features, a policy stating that tickets are the method of requesting support in all but the most exigent of circumstances, a true triage system, a rotating interrupt blocker position and so on. The best I can do here is to recommend that Elaine develop some time management skills, work on healthy coping skills (exercise, walking, taking breaks, etc.) and do regular one-on-one sessions with our manager so she has a venue for discussing these frustrations privately, so that even if they cannot be solved they can at least be acknowledged.

I brought a sanitized version of this to our team manager and we made some substantial progress. He reminded me that George and Susan have only been on our team for a month and that it will take some time for them to come up to speed before they can really start eating through the ticket queue. He also told Elaine that while her tenacity in the ticket queue is admirable, she needs to stop taking so many tickets so the other guys have a chance. If they linger, well, we can cross that bridge when we come to it.

The best we can do is wait and see. It’ll be interesting to see what happens as George and Susan adjust to our team and how well the strategy of leaving tickets unowned to encourage team members to grab them works out.

Until next time, stay frosty.

 

Five things to not screw up with SCCM

With great power comes great responsibility

Uncle Ben seemed like a pretty wise dude when he dropped this particular knowledge bomb on Peter Parker. As sysadmins we should already be aware of the tremendous amount of power that has been placed into our hands. Using tools like SCCM further serves to underline this point, and while I think SCCM is an amazing product and has the ability to be a fantastic force multiplier, you can also reduce your business’ infrastructure to ashes within hours if you use it wrong. I can think of two such events where an SCCM Administrator has mistakenly done some tremendous damage: In 2014 a Windows 7 deployment re-imaged most of the computers, including servers, at Emory University, and in another unfortunate event a contractor managed to accomplish the same thing at the Commonwealth Bank of Australia back in the early 2000s.

There are a few things you can do to enjoy the incredible automation, configuration and standardization benefits of SCCM while reducing your likelihood of an R.G.E.

Dynamic Collection Queries

SCCM is all about performing an action on large groups of computers. Therefore it is absolutely imperative that your Collections ACTUALLY CONTAIN THE THINGS YOU THINK THEY DO. Your Collections need to start large and gradually get smaller using a sort of matryoshka doll scheme based on dynamic queries and limiting Collections. You should double/triple/quadruple check your dynamic queries to make sure they are doing what you think they are doing when you create them. It is wise to review these queries on a regular basis to make sure an underlying change in something like Active Directory OU structure or naming convention hasn’t caused your query to match 2000 objects instead of your intended 200. Finally, I highly recommend spot-checking the members of your targeted Collection before deploying anything particularly hairy and/or when deploying to a large Collection because no matter how diligent we are, we all make mistakes.
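One way to make that spot-check a habit is to compare each Collection's current member count against the size you expect and flag anything that has drifted. Here is a minimal sketch in Python, assuming you have already exported the member counts by whatever means you prefer. The Collection names, the counts and the 25% tolerance are all made up for illustration, and the script does not talk to SCCM itself.

```python
def find_suspicious_collections(expected, actual, tolerance=0.25):
    """Return names of Collections whose member count drifted beyond tolerance.

    expected and actual map Collection name -> member count. Both are
    hypothetical inputs; nothing here queries SCCM.
    """
    suspicious = []
    for name, expected_count in expected.items():
        actual_count = actual.get(name, 0)  # a missing Collection counts as empty
        allowed_drift = expected_count * tolerance
        if abs(actual_count - expected_count) > allowed_drift:
            suspicious.append(name)
    return suspicious


# Illustrative numbers only: one query has quietly grown from 200 to 2000 members.
expected = {"Finance Workstations": 200, "Patch Testing": 25}
actual = {"Finance Workstations": 2000, "Patch Testing": 27}
print(find_suspicious_collections(expected, actual))  # ['Finance Workstations']
```

A check like this will not tell you why a query changed, only that it is worth looking before you deploy.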

Maintenance Windows

“The bond traders are down! The bond traders are down! Cry and hue! Panic! The CIO is on his way to your boss’s office!” Not what you want to hear at 7:00 AM as you are just starting on your first cup of coffee, huh? You can prevent this by making sure your Maintenance Windows are set up correctly. SCCM will do what you tell it to do and if you tell it to allow the agent to reboot at 11:00 AM instead of 11:00 PM, that’s what’s going to happen.

I like setting up an entirely separate Collection hierarchy that is used solely for setting Maintenance Windows and include my other Collections as members. This prevents issues where the same Collection is used for both targeting and scheduling. It also reduces Maintenance Window sprawl where machines are members of multiple Collections all with different Maintenance Windows. It’s important to consider that Maintenance Windows are “union-ed” so if you have a client in Collection A with a Maintenance Window of 20:00 – 22:00 and in Collection B with a Maintenance Window of 12:00 – 21:00 that client can reboot anywhere between 12:00 – 22:00. There’s nothing more annoying than a workstation that was left in a forgotten testing Collection with a Maintenance Window spanning the whole business day – especially after the technician was done testing and that workstation was delivered to some Department Director.
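To see the union behavior in numbers, here is a small Python sketch that merges the two windows from the example. It is only a model of the behavior described here, not how the SCCM client computes it, and it assumes windows are expressed as whole hours within a single day (no wrap past midnight).

```python
def union_windows(windows):
    """Merge overlapping (start_hour, end_hour) windows into their union."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous window, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


collection_a = (20, 22)  # 20:00 - 22:00
collection_b = (12, 21)  # 12:00 - 21:00
effective = union_windows([collection_a, collection_b])
print(effective)  # [(12, 22)]
```

The client in both Collections ends up with a single effective window of 12:00 to 22:00, which is exactly the surprise you want to catch before a Department Director does.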

I am also a huge fan of the idea of a “Default Maintenance Window”: a Maintenance Window that is in the past and non-recurring, applied to a Collection that all SCCM clients are members of. This means that no matter what happens with a computer’s Collection membership it isn’t just going to randomly reboot if it has updates queued up and its current Maintenance Window policy is inadvertently removed.

Last but not least, and this goes for really anything that is scheduled in SCCM, pay attention to date and time. Watch for AM versus PM, 24-hour time versus 12-hour time, new day rollover (i.e., 08/20 11:59 PM to 08/21 12:00 AM) and UTC versus local time.
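These mistakes are easy to demonstrate with a few lines of Python. This is a sketch only: the fixed UTC-8 offset is an assumption for illustration (use your site's real time zone), and the dates are arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset of UTC-8, purely for illustration.
LOCAL = timezone(timedelta(hours=-8))

# UTC versus local: 11:00 PM local on 08/20 is already 08/21 in UTC.
intended = datetime(2017, 8, 20, 23, 0, tzinfo=LOCAL)
as_utc = intended.astimezone(timezone.utc)
print(as_utc.isoformat())  # 2017-08-21T07:00:00+00:00

# AM versus PM: one wrong letter moves the reboot by twelve hours.
mistyped = datetime.strptime("08/20/2017 11:00 AM", "%m/%d/%Y %I:%M %p")
correct = datetime.strptime("08/20/2017 11:00 PM", "%m/%d/%Y %I:%M %p")
print(correct - mistyped)  # 12:00:00

# New day rollover: one minute after 11:59 PM is 12:00 AM on the next date.
rollover = datetime(2017, 8, 20, 23, 59) + timedelta(minutes=1)
print(rollover.strftime("%m/%d %I:%M %p"))  # 08/21 12:00 AM
```

If a schedule is entered as UTC when you meant local time, the eight-hour shift in this example is the difference between a late-night reboot and one in the middle of the afternoon.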

Required Task Sequences

Of all the things in SCCM this is probably one of the most dangerous. Task Sequences generally involve re-partitioning, re-formatting and re-imaging a computer, which has the nice little side effect of removing everything previously on it. You’ll notice that both of those incidents I mentioned at the start of this post were caused by Task Sequences that inadvertently ran on a much larger group of computers than was intended. As a general guideline, I counsel staff to avoid deploying Task Sequences as Required outside of the Unknown Computers Collection. The potential to nuke your line of business application servers and replace them with Windows 10 is reduced if you have done your fundamentals right in setting up your Collections, but I still recommend deploying to small Collections, making your Deployment Available instead of Required (especially if you are testing), restricting who can deploy Task Sequences and password protecting the Task Sequence. I would much rather reboot servers to clear the WinPE environment than recover them from backups.

Automatic Deployment Rules

Anything in SCCM that does stuff automatically deserves some scrutiny. Automatic Deployment Rules are another version of Dynamic Collection Queries. You want to use them and they make your life easier but you need to be sure that they do what you think that they do, especially before they blast out this month’s patches to the All Clients collection instead of the Patch Testing collection. Deployment templates can make it harder to screw up your SUP deployments and once again pay attention to the advertisement and deadline time watching for mistakes with UTC vs. local time or +1 day rollover, the Maintenance Window behavior and which collection you are deploying to. And please, please, please test your SUP groups first before deploying them widely. You too can learn from our mistakes.

Source Files Management and Organization

A messy boat is a dangerous boat. There is a tendency for the source files directory that you are using to store all your installers for Application and Package builds to just descend into chaos over time. This makes it increasingly difficult to figure out which installers are still being used and which were part of some long forgotten test. What’s important here is that you have a standard for file organization and you enforce it with an iron fist.

I like to break things out like this:

A picture depicting the Source Files folder structure

Organizing your source files… It’s a Good Thing.

It’s a pretty straightforward scheme but you get the idea: Applications – Vendor – Software Title – Version and Bitness – Installer. You may need to add more granularity to your Software Updates Deployment Package folders depending on your available bandwidth and how many updates you are deploying in a single SUP group. We have had good results with grouping them by year but then again we are not an agency with offices all over rural Alaska.
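If you want the standard enforced by something other than willpower, you can generate the paths instead of typing them. A minimal Python sketch of the Applications – Vendor – Software Title – Version and Bitness – Installer layout follows. The root folder, the example product and the underscore joining version and bitness are my own placeholders, not a prescribed convention.

```python
from pathlib import PureWindowsPath


def source_path(root, vendor, title, version, bitness, installer):
    """Build root\\Applications\\Vendor\\Title\\Version_Bitness\\Installer."""
    return PureWindowsPath(root, "Applications", vendor, title,
                           f"{version}_{bitness}", installer)


# Hypothetical root and product, for illustration only.
path = source_path(r"D:\Sources", "Mozilla", "Firefox", "55.0.3", "x64", "setup.exe")
print(path)  # D:\Sources\Applications\Mozilla\Firefox\55.0.3_x64\setup.exe
```

Anything that does not fit the function's arguments is a good hint that it does not belong in the source directory in the first place.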

 

Mitigation Techniques

There are a few techniques you can use to prevent yourself from doing something terrible.

Role-based Access Control

You can think of Security Scopes as the largest possible number of clients a single admin can break. If you have a big enough team, the clever use of RBAC will allow you to limit how much damage individual team members can do. For example: You could divide your 12-person SCCM team into three sub-teams and use RBAC to limit each sub-team to only being able to manage 1/3 of your clients. You could take this idea a step further and give your tier-1 help desk the ability to do basic “non-dangerous” actions but still allow them the ability to use SCCM to perform their job. This is pretty context specific but there is a lot you can do with RBAC to limit the potential scope of an Administrator’s actions.

Application Requirements (Global Conditions)

You can use Application Requirements as a basic mechanism to prevent bad things from happening if an Application is inadvertently deployed to the wrong Collection.

Look at all these nice, clean servers… it would be a shame if someone accidentally deployed the Java JRE to all of them, wouldn’t it? Well, if you put in a Requirement that checks the value of ProductType in the Win32_OperatingSystem WMI class to ensure the client has a workstation operating system then the Application will fail its Requirements check and won’t be installed on those servers.

 

There’s so much in WMI that you could build WQL queries that keep “dangerous” Applications from meeting their Requirements on clients outside the intended deployment.
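The ProductType check boils down to a very small piece of logic. Here it is sketched in Python, with the kind of WQL a Requirement could be built around shown as a string. The function name is hypothetical and nothing here queries WMI; the three ProductType values are the ones documented for the Win32_OperatingSystem class.

```python
# Win32_OperatingSystem.ProductType values as documented for the WMI class.
PRODUCT_TYPE_WORKSTATION = 1
PRODUCT_TYPE_DOMAIN_CONTROLLER = 2
PRODUCT_TYPE_SERVER = 3

# Illustrative WQL only; the Requirement itself is configured in the SCCM console.
WORKSTATION_ONLY_WQL = (
    "SELECT ProductType FROM Win32_OperatingSystem WHERE ProductType = 1"
)


def meets_workstation_requirement(product_type):
    """Return True only for workstation operating systems."""
    return product_type == PRODUCT_TYPE_WORKSTATION


for product_type in (1, 2, 3):
    print(product_type, meets_workstation_requirement(product_type))
```

Domain controllers and member servers both fail the check, so a workstation-only Application that lands on a server Collection by mistake never installs.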

 

PowerShell Killswitch

SCCM is a pull-based architecture. An implication of this is once the clients have a bad policy they are going to act on it. The first thing you should do if you discover a policy is stomping on your clients is to try and limit the damage by preventing unaffected clients from pulling it. A simple PowerShell script that stops the IIS App Pools backing your Management Points and Distribution Points will act as a crude but effective kill switch. By having this script prepped and ready to go you can immediately stop the spread of something bad and then focus your efforts on correcting the mistake.
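I had PowerShell in mind for this, but the idea is simple enough to sketch in Python as a dry run that only builds the IIS appcmd commands and executes nothing. The App Pool names below are placeholders: check IIS Manager on your own Management Points and Distribution Points for the real ones. The appcmd path assumes a default Windows install.

```python
# Placeholder pool names; confirm the real ones in IIS Manager on your site servers.
APP_POOLS = ["SMS Management Point Pool", "SMS Distribution Points Pool"]

# Assumed default location of appcmd.exe.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"


def kill_switch_commands(app_pools, appcmd=APPCMD):
    """Return one 'appcmd stop apppool' argument list per pool (nothing is run)."""
    return [[appcmd, "stop", "apppool", f"/apppool.name:{pool}"]
            for pool in app_pools]


for command in kill_switch_commands(APP_POOLS):
    print(command)
```

To turn the dry run into a real switch you would hand each argument list to something like subprocess.run on the server itself, from an elevated session. Test that on a lab site first, because a kill switch you have never pulled is just a theory.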

Sane Client Settings

There is a tendency to crank up some of the client-side polling frequencies in smaller SCCM implementations in order to make things “go faster”. However, another way to look at the polling interval is that it is the period of time it takes for all of your clients to have received a bad policy and possibly acted on it. If your client policy polling interval is 15 minutes, that means in 15 minutes you will have re-imaged all your clients if you really screwed up and deployed a Required Task Sequence to All Systems. The longer the polling interval, the more time you have to identify a bad policy, stop it and begin rebuilding before it has nuked your whole fleet.
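A bit of back-of-the-envelope arithmetic makes the trade-off concrete. This Python sketch assumes client check-ins are spread evenly across the polling interval, which is a simplification, and the fleet size and timings are made-up numbers.

```python
def clients_exposed(total_clients, polling_interval_min, minutes_until_stopped):
    """Estimate how many clients pulled a bad policy before it was stopped.

    Assumes check-ins are spread evenly across the polling interval.
    """
    fraction = min(1.0, minutes_until_stopped / polling_interval_min)
    return round(total_clients * fraction)


# Same mistake, caught after 10 minutes, with two different polling intervals.
print(clients_exposed(500, 15, 10))  # 333
print(clients_exposed(500, 60, 10))  # 83
```

With a 15 minute interval, catching the mistake in 10 minutes still costs you roughly two thirds of the fleet. With a 60 minute interval the same reaction time costs about a sixth.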

Team Processes

A few simple soft processes can go a long way. If you are deploying out an Application or Updates to your whole fleet, send out a notification to your business leaders. People are generally more forgiving of mistakes when they are notified of significant changes first. Perform a gradual roll-out over a week or two instead of blasting out your Office 365 installation application to all 500 workstations at once. Setting sane scheduling and installation deadlines in your Deployments helps here too.
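A gradual roll-out is mostly a matter of deciding the wave sizes up front. Here is a small Python sketch that splits a fleet into consecutive waves. The workstation names and wave sizes are invented, and in practice each wave would map to its own Collection and Deployment deadline.

```python
def rollout_waves(clients, wave_sizes):
    """Split clients into consecutive waves; the last wave takes the remainder."""
    waves = []
    start = 0
    for size in wave_sizes:
        waves.append(clients[start:start + size])
        start += size
    if start < len(clients):
        waves.append(clients[start:])
    return waves


# Hypothetical fleet of 500 workstations: a pilot, two larger waves, then the rest.
workstations = [f"WS-{n:03d}" for n in range(1, 501)]
waves = rollout_waves(workstations, [10, 40, 150])
print([len(wave) for wave in waves])  # [10, 40, 150, 300]
```

Starting with ten machines means the worst case for the first mistake is ten unhappy people instead of five hundred.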

If you are doing something that could be potentially dangerous, grab a coworker and do pilot/co-pilot for the deployment. You (the pilot) perform the work but you walk your coworker (the co-pilot) through each step and have them verify it. Putting a second pair of eyes on a deployment avoids things like inadvertently clicking the “Allow clients to restart outside of Maintenance Windows” checkbox. Next time you need to do this deployment switch roles – Bam! Instant cross training!

Don’t be in a hurry. Nine times out of ten, the dangerous thing is simple to deploy but the simple settings cannot be wrong. Take your time to do things right and push back when you are given unrealistic schedules or asked to deploy things outside of your roll-out process. In the mountains we like to say, slow is fast and fast is dead. In SCCM I like to say, slow is fast, and fast is fired.

Read-Only Friday is the holiest of days on the Sysadmin calendar. Keep it with reverence and respect.

Consider enabling the High Risk Deployment Setting. If you do this make sure you tune the settings so your admins don’t get alert fatigue and just learn to click next, next, finish or eventually they will click next, next, finish and go “oops”.

 

I hope this is helpful. If you have other ideas on how not to blow up everything with SCCM feel free to comment. I’m always up for learning something new!

Until next time, stay frosty.