When Designing Products for System Administrators

System administrators are usually a very grumpy lot. In my 22 years as a system administrator of one sort or another, it has been a rare job where I have not been split between 4 tasks, 3 managers, and multiple goals. Most of the people I have known rarely became system administrators by choice; they were thrust into it by some outside influence ("Oh, the printer works when you jiggle it... you're now the sysadmin."). Time for training or learning new things is usually nonexistent. Managers (even those who were once techies) seem to think that a system administrator just needs to see a new box and will know everything about it, like some evil wizard of old. It is a myth that many of us System Administrators or Bastard Operators try to keep up in some form, because either a) we don't feel confident enough but are good at faking it, or b) we are already overworked and need as much alone time as possible to keep what hardware we have running.

These are things that I have found most developers of system administration tools and configurations don't get. Looking at some of the insane configuration tools, from AIX SMIT to HPUX SAM, we end up with configuration tools that require about 6 months of deep training before you can get a system to boot and stay up reasonably well. Yes, it does inspire a certain kind of brand loyalty... a passive-aggressive sort, which is also something BOFHs and SAs are known for. For my part, I try to evaluate any tool or product by what I call the 2am pager incident:

Take a product and get it running. Then spend 24-48 hours awake working on some other hot project that has to be done by X. After you finally get to sleep, set up the pager to wake you at 2am. If you can rebuild, configure, and get the product working by 6am, it can be used in production. If you can't, it is too complex or unreliable to be useful in any server environment. I came up with this test after having this happen multiple times at both my first consulting job and my first job at a startup. I would never deal with Digital OSF or HPUX again because of this complexity, but found Solaris 2.4 and Red Hat Linux 3 to be perfectly workable (with AIX it was just fun to run smitty and watch it fall over... at 6am it's always funny).

The second test is to have a new administrator get it working. Always give them a 24-48 hour deadline and then 'break' it just after it has been put up. If the new administrator can fix the system without having to call technical support, it's a good sign. On the other hand, if the sysadmin quits or sets the box on fire... it's probably not something you want in production.

While I have been using these tests for 10+ years now with good results, I have found that most senior sysadmins have similar rules (though the "can you configure it after a quart of Jägermeister" test is just too extreme for me). In the end they come down to the following:

  1. Assume that your customer is under-trained and tired, and does not want much getting in the way of finding and fixing a problem. Clippy should only be there if you can set him on fire at 6am.
  2. Put configuration files in obvious places. Putting some configuration files in /etc, others in /var/lib/moo, and some elsewhere is definitely a killer.
  3. Document what flags do. No one wants to find out right before the presentation that '--clean' cleans up bad data but '--clean --clean' reformats the whole database.
  4. Make sure that commands are simple, easy to remember and
  5. "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?" - Brian W. Kernighan and P.J. Plauger, The Elements of Programming Style, Second Edition
  6. "Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction." -- Albert Einstein 
  7.  "Everything should be made as simple as possible, but not simpler." -- [possibly] Albert Einstein
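
Rules 2 and 3 are the easiest ones to show in code. What follows is only a minimal Python sketch, not any real tool: the command name `mootool`, the path `/etc/mootool.conf`, and the `--reformat-database` flag are all made up for illustration. The idea is that the one config location shows up in `--help`, every flag says what it does, and the destructive action gets its own loudly named flag instead of hiding behind a repeated `--clean`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the parser for a hypothetical 'mootool' admin command."""
    parser = argparse.ArgumentParser(
        prog="mootool",  # hypothetical tool name, for illustration only
        description="Maintain the (hypothetical) moo database.",
    )
    parser.add_argument(
        "--config",
        default="/etc/mootool.conf",  # assumed path: one obvious place
        help="path to the configuration file (default: %(default)s)",
    )
    parser.add_argument(
        "--clean",
        action="store_true",  # repeating the flag changes nothing
        help="remove records that fail validation; good data is untouched",
    )
    parser.add_argument(
        "--reformat-database",
        action="store_true",
        help="DESTRUCTIVE: erase and recreate the whole database",
    )
    return parser


def plan(argv: list[str]) -> list[str]:
    """Return the actions a given command line would perform."""
    args = build_parser().parse_args(argv)
    actions = []
    if args.clean:
        actions.append("clean")
    if args.reformat_database:
        actions.append("reformat")
    return actions


if __name__ == "__main__":
    import sys

    print(plan(sys.argv[1:]))
```

With this layout, a tired admin who types `--clean` twice still only gets a clean, and `mootool --help` at 2am lists where the configuration lives and which flag is the dangerous one.
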

So when creating things for an enterprise, cloud, cluster, or small business system... please try to remember that the person who is either going to praise you OR send fecal matter in the mail to you is going to be tired, irritable, and overworked when the problem occurs. [Not that I think that fecal matter should be sent to developers... I have just seen it happen before.]
