Tag Archives: documentation

Things in our Datacenter that Annoy Me

Or alternatively how I learned to stop worrying about the little things…

In this post, I complain about little details that show my true colors as some kind of pedantic, semi-obsessive, detail-oriented system administrator. I mean, I try to play it cool but inside I am really freaking out, man! Not really but also kind of yes. More on that later.

 

Our racks are not deep or wide enough

Our racks were not sized correctly initially. They are quite “shallow”. A Dell 730 on ReadyRails is about 28″ deep. This is a pretty standard rack mounting depth for full-size rackmount equipment. In our racks, we only have about 4-6″ of space remaining between the posts and the door in the back of the rack. This complicates cabling since we do not have a lot of room to work with but it really gets annoying with PDUs. See below.

The combination of shallow depth and lack of width leads to weird PDU configurations

PDU Setup

The racks are too shallow to mount the PDUs parallel with the posts with the plugs facing out towards the door and too narrow to stack both PDUs on one side. The PDUs end up being mounted sideways where they stick out into the area between the posts, blocking airflow and making cabling a pain in the ass.

Check out u/tasysadmin’s approach which is much improved over ours. The extra depth and width allows both power circuits (you do have two redundant power circuits, right?) to move over to one side of the rack and slide into the gap between the posts and casing. This has a whole bunch of benefits: airflow is not restricted, you have more working space for cabling, your power does not have to cross the back of the rack and you can separate your data and your power.

Beautiful Rack Cabling

This also means that some of our racks have the posts moved in beyond the standard rack mounting depth of 28″ in order to better accommodate our PDUs, the result of which is that I only have two out of five racks that can accommodate a Dell PowerEdge.

Data and power not separated

You ideally want to run power on one side of the rack and data on the other. Most people will cite electromagnetic interference as a good reason for doing this but I have yet to see a problem caused by it (knock on wood). That being said, it is still a good idea to put some distance between the two, much like your recently divorced aunt and uncle at family functions. There are plenty of other good reasons for keeping data and power separate, most of which center around cabling hygiene – it helps keep things much cleaner because your data cables tend to run up and down the rack posts headed for things like your top-of-rack switch, whereas your power needs to go other places (i.e., equipment). It is a lot easier to bundle cables if they more or less share the same cable path.

Cannot access cable tray because of PDU cables

Cable Tray

This is just another version of “data and power are not separated”. Our power and data both come in at the top of the rack. This means our 10 4/C AWG feeds for each PDU, which are about .5″ in diameter, are draped across our cabling tray, which just sits on top of the racks instead of being suspended by a ladder bar (another great injustice!). I bet these guys generate quite the electromagnetic field. It would be nice if they were more than 4″ away from some of our 10 Gbps interconnects, huh? This arrangement also means the cable tray is a huge pain to use. You have to move all the PDU power cables off of it, then pop the lid off in segments to move your cables around. Or you can just run them all over the top of the rack and hope that the fiber survives, like we do. Again. Not ideal.

Inconsistent fastener use for mounting equipment

This one sounds kind of innocuous but it is one of those small little details that makes your life so much easier. Pick a fastener type and size and stay with it. I am partial to M6 because the larger threads are harder to strip out and the head has more surface area for your driver’s bit to contact with. It is pretty annoying to change tools each time the fastener type is different instead of just setting your torque level on your driver and going for it. Also – don’t even think of using self-tapping fasteners. They make cage nuts and square holes in rack posts for a reason.

Improper rail mounting and/or retention

Your equipment comes with mounting instructions and you should probably follow them. Engineers calculate how much weight a particular rail can bear and then figure out that you need four fasteners of grade X on each rail to adequately support the equipment. This is all condensed into some terrible IKEA-level instructions which makes you shake your head as you wonder why your vendor could not afford a better technical writer for the obscene price of whatever equipment you are racking. Once you decipher these arcane incantations you should probably follow them. Don’t skip installing cage nuts and fasteners – if they say you need four, then you need four. It only takes two more minutes to do the job right.

AND FOR THE LOVE OF $DEITY INSTALL WHATEVER HARDWARE IS REQUIRED TO RETAIN THE EQUIPMENT IN THE RAILS! Seriously. This is a safety issue. I am not sure why this step is skipped and people just set things on the rails without using the screws to retain them to the posts, but racks move, earthquakes happen and this shit is heavy. I think most of our disk shelves are about 50 pounds. You do not want that falling out of the rack and onto your intern’s head.

Use ReadyRails (or vendor equivalent)

For about $80 you can have a universal, tool-less rail that installs in about 30 seconds. I would call that a good investment.

Inconsistent inventory tagging locations

I am guessing your shop maintains an inventory system and you probably have little inventory tags you affix to equipment. Do your best to make the place where the inventory tag goes consistent and readable once everything is racked and stacked. The last thing you want to do is pull an entire rack apart because some auditor wants you to find the magical inventory tag stuck on some disk shelf in the middle of a 12-shelf aggregate.

It would also be a good idea to put your inventory tag in your documentation so you do not have to play a yearly game of “find the missing inventory tag”.

Cable labeling is not consistent (just use serialized cables)

I suck at cable labeling and documentation in general (see here), so this is a bit hypocritical. Nevertheless, I find that there are four stages in cable labeling: nothing; consistent labeling of port and device on each end; confusion as labeled cables are reused but the labels are not changed; and finally adoption of serialized cables, where each end has a unique tag that is documented.

This is largely personal preference but the general rules are simple: keep it clean, keep it consistent and keep it current (your documentation that is). The only thing worse than an unlabeled cable is a mislabeled cable.
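For what it is worth, here is a rough sketch of what “serialized cables that are documented” could look like as a data structure. This is not what we run; the class names, device names and serial format are all made up for illustration. The idea is just that the tag on the cable never changes, and the record of what it connects is the only thing you update when the cable is reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    device: str  # hypothetical device name, e.g. "tor-switch-01"
    port: str    # hypothetical port name, e.g. "Eth1/12"


class CableRegistry:
    """Maps a cable's serial number to the two endpoints it connects."""

    def __init__(self):
        self._cables = {}  # serial -> (Endpoint, Endpoint)

    def connect(self, serial, end_a, end_b):
        # Refuse a port that another cable already claims. That usually
        # means the record is stale, which is the thing to fix first.
        for other_serial, ends in self._cables.items():
            if other_serial != serial and (end_a in ends or end_b in ends):
                raise ValueError(
                    f"port already documented on cable {other_serial}"
                )
        self._cables[serial] = (end_a, end_b)

    def disconnect(self, serial):
        # Reusing a cable means dropping its old record; the tag stays put.
        self._cables.pop(serial, None)

    def lookup(self, serial):
        # Returns the pair of endpoints, or None for an unknown serial.
        return self._cables.get(serial)
```

The check in connect() is the part that matters: it turns “mislabeled cable” from something you discover at 2 AM into an error you get when you update the record.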

Gaps in rack mount devices

Shelf Gap

Why? Just why? I don’t know and will probably never know… but my best guess is the rail on the top shelf was slightly bent during installation and then when we needed to add another shelf later the rail interfered with it. 10 minutes originally could have saved 10 hours down the road. If it turns out I am one post hole short of being able to install another shelf I get to move all the workloads off this aggregate, pull out all the disk shelves until I reach this one, fix or replace the rail, re-rack and re-cable everything, re-create the aggregate and then move the workloads back.

 

Now that I have complained a bit (I am sure r/sysadmin will say that I have it way too easy) I get to talk about the real lesson here: none of this shit matters.

On one level it does. All these little oversights accumulate technical debt that eventually comes back and bites you on the ass, and doing it right the first time is the easiest and most efficient way. On the other hand, none of this stuff directly breaks things. The fact that the power and data cabling are too close together for my comfort or that there is a small gap in one of the disk shelf stacks does not cause outages. We have plenty of things that do, however, and those demand my attention. So collectively, let’s take a deep breath, let it go, and stop worrying about it. It’ll get fixed someday.

The other lesson here is that nothing is temporary. If you cut a corner, particularly with physical equipment, that corner will remain cut until the equipment is retired. It is just too hard and costly to correct these kinds of oversights once you are in production. If you are putting a new system up, take some time to plan it out – consider the failure domain, how much fault tolerance and redundancy you need, labeling, inventory and all those other little things. You only get to stand this system up once. Go slow and give it some forethought; you may thank yourself one day.

Documentation or how I wasted an hour

As if confirming my own tendency to “do as I say, not what I do”, I just wasted about an hour this morning trying to figure out why a newly created virtual machine was not correctly registering its hostname with Active Directory via Dynamic DNS. Of course, this was a series of errors greatly exacerbated by the fact that I had only had two out of my required four cups of coffee and I stayed up too late watching the ironically named and absolutely hilarious Workaholics.

Let’s review, shall we?

  • Being tired and trying to do something mildly complicated
  • Allowing myself to become distracted by an interrupt task in the middle of this work
  • Not verifying the accuracy of our documentation prior to assigning the IP address in question to the virtual machine
  • Screwing up and assigning the IP address to the wrong virtual machine (both hostname and subnet octets are very similar)
  • Not reading the instrumentation; the output of ipconfig /all plainly said “(Duplicate)” Duh.

All of these factors made what should have been a 15-minute troubleshooting task stretch out into an hour.
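Since “read the instrumentation” is apparently beyond me before coffee number three, here is a minimal sketch of letting a script do the reading. It assumes you have already captured the text of ipconfig /all somehow; the function name is mine and the sample text is illustrative, not pasted from the machine in question.

```python
def find_duplicate_addresses(ipconfig_output):
    """Return the lines of `ipconfig /all` output flagged as duplicates."""
    return [
        line.strip()
        for line in ipconfig_output.splitlines()
        if "(Duplicate)" in line
    ]


# Illustrative text only, not captured from a real machine.
sample = """
   IPv4 Address. . . . . . . . . . . : 10.0.0.25(Duplicate)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
"""

for line in find_duplicate_addresses(sample):
    print("Address conflict:", line)
```

It is nothing more than a string match, which is sort of the point: the answer was sitting in plain text the whole time.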

Root cause: The IP address I picked for one of the virtual machines was already in use and the documentation was not updated to reflect this.

Potential solutions: I dunno… how about keeping our documentation updated (easier said than done)? Or better yet, stop using a “documentation system” for IP addresses that relies on discretionary operational practices (i.e., an Excel Spreadsheet stored on SharePoint) and use something like IPAM. Maybe, instead of going down the ol’ “runlist” of potential problems, I should have stopped and gathered a bit more information before I proceeded with troubleshooting? The issue was right there in the ipconfig output. I was looking *right* at it. I guess that is the difference between looking and seeing.
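To make the “discretionary operational practices” complaint concrete, here is a toy sketch of the one property I actually want: claiming an address and recording the claim happen in the same step, so you cannot do one without the other. This is not how any real IPAM product works internally, and the class name, subnet and hostnames are invented for the example.

```python
import ipaddress


class AddressBook:
    """Toy allocation record: one subnet, one owner per address."""

    def __init__(self, subnet):
        self.network = ipaddress.ip_network(subnet)
        self.owners = {}  # ip address object -> hostname

    def claim(self, address, hostname):
        addr = ipaddress.ip_address(address)
        if addr not in self.network:
            raise ValueError(f"{addr} is not in {self.network}")
        if addr in self.owners:
            # This is the check the spreadsheet could not make for me.
            raise ValueError(f"{addr} already belongs to {self.owners[addr]}")
        self.owners[addr] = hostname

    def next_free(self):
        # First usable host address that nobody has claimed yet.
        for host in self.network.hosts():
            if host not in self.owners:
                return str(host)
        raise LookupError(f"no free addresses left in {self.network}")
```

Of course this is only as good as the data in it. If someone assigns an address by hand and never calls claim(), you are right back to my Monday morning, which is the argument for a tool that does not depend on people remembering to write things down.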

In short . . . happy Monday you jerks.

 

facepalm

 

The Art and Burden of Documentation

I have been thinking a lot about documentation lately, mostly about my own shortcomings and trying to understand why the act of documenting seems so difficult and why the quality of the documentation that does get done is often found lacking. Good documentation and good documentation practices are such a fundamental part of the health of an IT shop you would think we as a field would be better at it. My experience is limited and anecdotal (whose is not?) but I have yet to see a shop with solid documentation and solid documentation practices. This extends to myself as well. I can look back at my various positions and roles and there are very few where I actually felt satisfied with the quality of my documentation.

Read on for “aksysadmin’s made-up principles of how to not suck at documentation and do other things good too”.

 

1. Develop a standardized format, platform and process from the bottom up. Your team uses this, not you.

Leadership has a tendency to standardize on a single standard, platform or process. This is generally considered a good thing. The problem is, leadership does not write technical documentation. We do. And what standard makes sense to them may not make any sense to the technical staff (*cough* ITIL *cough*). What platform seems adequate to them may seem unwieldy to sysadmins (*cough* SharePoint *cough*). Standardization may generally be considered a good thing but forcing a standard, platform or process on a team without input or understanding their problem domain and use case is generally considered a bad thing. The harder you make it for your team to document, the less likely they will be to perform a task they are already unlikely to perform.

2. Don’t document how, document why

This is part internal challenge (IT staff documenting the wrong things) and part external challenge (leadership requiring the wrong kind of things to be documented). I see lots of documentation that is essentially a re-hashed version of a vendor’s manual. Ninety-nine percent of the time your vendor has exhaustive resources on how to do something. It is right there. In the manual. Go read it. Unless it is incredibly unintuitive, and sometimes it is, why would you waste your precious time re-writing an authoritative set of information into a non-authoritative set that requires your team to maintain it? Reading and understanding vendor documentation should be considered a fundamental skill. If your guys cannot read or are unwilling to read vendor manuals you have other problems that need addressing.

What you should document is why you did things. In six months you will not remember why this particular group was set up or why things are this way instead of that way, and your successor certainly will have no idea. Use your documentation to provide context and meaning.

3. Document where to find things

Documenting why something is the way it is is great but it is also important to document where things are. I am talking about things like IP Addresses, Organizational Charts, Passwords and so on. This is another opportunity to avoid work, err work more efficiently. Chances are many of these things have authoritative sources maintained by other people or tools. Why write your IP Addresses down manually in an Excel Spreadsheet when you can use a tool like IPAM to track them? Why track the phone tree for your different workgroups when Active Directory can do that for you? Why spend time doing stuff that is already done? Why indeed?

Figuring out what stuff to document in the where category can be hard to do. I have found the easiest way to do this is to pretend you are brand new. Better yet, if you have a brand new team member, ask him to track these kinds of information requests as he acquaints himself with your particular little piece of hell. What does he need to know right now to do your job? That is what your replacement will be asking himself after you have ascended.

4. Don’t document break/fix issues

Do not fill up your Wiki, SharePoint, OneNote or file share full of Word .docx with break/fix issues. Your infrastructure and process documentation should be broad and “provide context and meaning”, which is pretty much the opposite of what break/fix issues are about – specific configurations, systems or problems.

You already have a place to “document” break/fix issues – it is called your damn ticket system. Use it. Document your fixes in your tickets. If your Tier-1 guys have a habit of closing all but the simplest of tickets with “done” or “fixed”, slap them (and probably yourself as well) and say that their future self just came back in time to hit them for making their job harder. If you do not have a ticket system then you have other problems that need to be addressed.

5. Have a panic page

Take the really important stuff from the why documentation and the where documentation and make it into a panic page. A panic page is a short piece of documentation that contains all the information you or anyone else would need to have in order to deal with a “whoops” situation. Think things like vendor contact phone numbers, contract support entitlements, how to file and escalate a case and maybe where to find your co-worker’s emergency scotch. I borrowed this one from my supervisor and it turned out to be a prescient suggestion on his part.

6. Have a hard copy

This is an extension of the panic page principle. Panic situations have a way of making electronic documentation inaccessible. “Oh but wait my documentation is in the cloud, I can get to it anywhere with my mobile device, oh I am so smart” you say. Yeah well, you will be screwed when you drop your iPad or it runs out of battery or you happen to live in Alaska, which has comparable infrastructure to, say, Afghanistan. Have a paper copy on hand, preferably two. Yes it will be harder to maintain but it will be a lot better than having no documentation if your file server explodes or the polar bears take over your data center.

7. Have designated documentation days

As a sysadmin, you generally do not have the luxury of setting your own priorities. If your leadership wants documentation, instead of just saying “Hey, we need to document better” at your weekly staff meeting, they need to make it a priority. Nothing does this better than designating a day for documentation. Read-Only Friday is a good one because you are not making changes on Friday anyway, right… righhhhtttt? Of course, you are still going to get interruptions and tickets so designate one person as the interruption blocker and another team member as the documenter (borrowed from Tom Limoncelli’s excellent Time Management for System Administrators). Rotate individuals as appropriate. These designated documentation days are your time, to make time, to actually Get Shit Done. All those little notes you meant to flesh out with some more context but never had time to… do it now. Organize your stuff. Clean it up. Review it for accuracy. Do this with a frequency related to how fast things change. Until leadership makes it a priority, you will always have another one that trumps it.
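If “rotate individuals as appropriate” sounds like the kind of thing that quietly stops happening by week three, you can at least generate the rota up front. A minimal sketch, assuming a simple round-robin where this week’s blocker becomes next week’s documenter; the function name and the team list are placeholders, and your rotation may well want different rules.

```python
def documentation_rota(team, weeks):
    """Return (week number, documenter, interruption blocker) per week."""
    if len(team) < 2:
        raise ValueError("need at least two people to split the roles")
    rota = []
    for week in range(weeks):
        # Round-robin: the blocker is always the next person in the list,
        # so nobody holds both roles in the same week.
        documenter = team[week % len(team)]
        blocker = team[(week + 1) % len(team)]
        rota.append((week + 1, documenter, blocker))
    return rota


# Placeholder names, not my actual co-workers.
for week, documenter, blocker in documentation_rota(["ana", "ben", "cy"], 4):
    print(f"Week {week}: {documenter} documents, {blocker} blocks interrupts")
```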

 

These ideas address some of those external and internal challenges you may have and that I know I have. I am more inclined to document stuff if I am not documenting dumb stuff that is already documented elsewhere. I will have an easier time finding the documentation we do have, if it is organized in a way that works for me and my team, both the consumer and producer of it. If I have dedicated time to actually perform the act of documenting it will probably get done. If not, then I am answerable to someone. Of course there can be a large gap between knowing what needs improvement and actually fixing it. Until then.

Stay frosty.