18 cardinal rules of systems administration

Rules to live by

It's not just knowing how to set up and maintain your servers and understanding how system commands work that makes you a good system administrator -- or even knowing how to fix things when something breaks down, how to monitor performance, how to manage backups, or how to craft superbly clever scripts. It's knowing these things and holding yourself to a set of cardinal rules that help to keep your systems running smoothly and your users happy.

Many of these rules you've probably heard numerous times. Some you’ve probably learned the hard way (when you got seriously burned). These are practices that have proven themselves valuable through decades of systems administration and helped a lot of us keep our cool when the going got hot.

Never do anything you can't back out of

Except for the simplest of changes, you should always have a back-out plan. Are you prepared to undo the changes you are about to make? There are many ways to leave technical “bread crumbs” along the path your changes take so that you can get back to the point at which you started. Make backup copies of files you're about to edit; you might not remember the previous settings in a complex configuration file. Make note of any problems that you run into. Apply your changes to a test system before you touch your production systems – and make sure those changes are successful before you move on.

Plan your changes well ahead of time, using peer review whenever possible. A second set of eyes might spot issues that you hadn’t considered.
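The habit of copying a file before editing it is easy to turn into a small helper. The following is a minimal sketch in Python, not a prescribed tool; the function names `backup_before_edit` and `restore_backup` and the `.bak` naming scheme are assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_before_edit(path, stamp=None):
    """Copy `path` to a timestamped sibling file and return the backup's path."""
    source = Path(path)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    if backup.exists():
        # Never clobber an earlier bread crumb.
        raise FileExistsError(f"refusing to overwrite {backup}")
    shutil.copy2(source, backup)  # copy2 also preserves timestamps and mode bits
    return backup


def restore_backup(backup, target):
    """Back out a change by copying the saved file over the edited one."""
    shutil.copy2(backup, target)
```

Calling `backup_before_edit("/etc/myapp.conf")` (a hypothetical path) before opening an editor leaves a dated copy next to the original, and `restore_backup` is the back-out step if the change goes wrong.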

Avoid making changes on Fridays

Never change anything right before you're going to be absent for some number of days. Always let changes settle in and prove themselves before you leave your systems running without the benefit of you hovering over them.

Identify root causes

Dig down to the root causes of any problems you run into. When in doubt, use the “five whys” rule. My server crashed. Why? Because it ran out of memory. Why? Because one process went nuts. Why? Because it got itself into a loop. Why? Because there was an error in its configuration file. Why? Because I edited that file right before I left the office Friday night and forgot to run tests to ensure that everything was running OK.

Practice your disaster recovery plans

Practice your disaster recovery plans so you'll feel comfortable when you have to put them to use. If you don't practice them, two things happen. First, you won't be confident that they'll work and, second, you might feel unsure of the steps you need to take. Say, for example, you have to migrate a database to a server at a remote site. Do you know the commands to run? Is the database dump ready now or do you have to create it? Do you have any idea how long it will take to move the file? Will you be ready to bring up the database on the remote site? Do you have a set of tests to verify that it's running properly?
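The question of how long the file will take to move can be roughed out before the drill. This is a back-of-the-envelope sketch only; the function name `estimate_transfer_seconds` and the `efficiency` factor (the fraction of nominal bandwidth you expect to actually get) are assumptions, and a timed practice run is the real answer.

```python
def estimate_transfer_seconds(size_bytes, link_bits_per_second, efficiency=0.8):
    """Estimate how long a file of size_bytes takes to cross a network link.

    efficiency is an assumed fraction of the nominal link speed that is
    usable in practice (protocol overhead, competing traffic, and so on).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_second = link_bits_per_second * efficiency
    return size_bytes * 8 / usable_bits_per_second


if __name__ == "__main__":
    # A 50 GB dump over a 1 Gbit/s link at 80% efficiency.
    seconds = estimate_transfer_seconds(50 * 10**9, 10**9)
    print(f"about {seconds / 60:.1f} minutes")
```

Comparing the estimate with the time measured during a practice run shows whether the assumed efficiency is realistic for your link.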

Never rely on a script that you haven't thoroughly tested

It's so easy to make mistakes. Test your scripts even if you've been scripting for decades and especially if someone else might someday run them. Test your scripts with and without arguments. Test your scripts by making the kinds of mistakes that someone else might make. Test your scripts.
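As one illustration of testing a script with and without arguments, and with the kinds of mistakes someone else might make, here is a small sketch in Python. The script name `rotate-logs`, its `--keep` option, and the helper `exits_with_error` are all hypothetical; the point is the set of checks at the bottom, not the script itself.

```python
import argparse


def parse_args(argv):
    """Parse the command line of a hypothetical log-rotation script."""
    parser = argparse.ArgumentParser(prog="rotate-logs")
    parser.add_argument("directory", help="directory holding the logs")
    parser.add_argument("--keep", type=int, default=7, help="days of logs to keep")
    args = parser.parse_args(argv)
    if args.keep < 1:
        parser.error("--keep must be at least 1")
    return args


def exits_with_error(argv):
    """Return True if the parser rejects argv (argparse exits with status 2)."""
    try:
        parse_args(argv)
    except SystemExit as exc:
        return exc.code == 2
    return False


if __name__ == "__main__":
    # With arguments.
    assert parse_args(["/var/log/app"]).keep == 7
    assert parse_args(["/var/log/app", "--keep", "30"]).keep == 30
    # Without arguments, and with mistakes a colleague might make.
    assert exits_with_error([])
    assert exits_with_error(["/var/log/app", "--keep", "lots"])
    assert exits_with_error(["/var/log/app", "--keep", "0"])
    print("all argument checks passed")
```

Keeping the parsing in a function that takes `argv` is a design choice that makes the script testable without actually running it against real log directories.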

Automate anything you have to do more than three times and anything that is complicated

Capture your most clever commands in aliases, functions, and scripts – and give them meaningful names. Commit the complicated processes that you perform to scripts so that you don't have to figure out the steps required and the complex commands more than once. You'll save yourself a lot of time and effort over the long haul and have a much easier time if and when you need someone else to do the work for you.

Document your work

Document the processes that you run routinely. What are the things that you do that wouldn’t necessarily be obvious to someone else? Maybe you run a script that looks for warnings in your log files about servers running short of disk space and the humidity in the data center getting too high.

Add comments to your scripts. You might believe that the commands that you use are obvious, but they might not be so obvious if you stop using them and then have to come back to them a year or two from now. Don't sacrifice readability for conciseness; someone else might have to read your code. Always have enough of what you do written down that someone can take over your work when you decide that it's time for greener pastures or get a chance at a promotion.

Pay attention to your mistakes

Understanding the flaws in your thinking may be the only way to fully get past them. Pay attention to the kind of things you do wrong and notice when you make the same type of mistake more than once.

Maybe you forget to change default passwords or set up services that will break when a password expires. Maybe you don’t take the time to verify that your backups are usable. Maybe you forget to lock accounts when someone leaves the company. Whatever the issues, make a point of noting your oversights and finding reliable ways to remind yourself of those things you might too easily overlook.

Be more than a little paranoid

Always ask yourself questions like "Could somebody misuse this?", "Could someone break this?", and "How is this service vulnerable?" Use limited permissions on all of the scripts that no one but the admins needs to see. Thinking defensively could save you a lot of pain. And, when it comes to administering servers, paranoia is a virtue!
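One way to check that admin-only scripts really are admin-only is to look for files that grant any access to group or others. The sketch below assumes a POSIX system and a single directory of admin scripts; the function name `overly_open` is hypothetical.

```python
import stat
from pathlib import Path


def overly_open(directory):
    """Return (path, mode) pairs for files that group or others can access."""
    loose = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            mode = stat.S_IMODE(path.stat().st_mode)
            # Any read, write, or execute bit for group or others is flagged.
            if mode & (stat.S_IRWXG | stat.S_IRWXO):
                loose.append((str(path), oct(mode)))
    return loose
```

A file with mode `0o700` passes, while one with `0o755` is reported. Whether group access is acceptable depends on how your admin group is set up, so treat the mask as a starting point.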

Be proactive

Not every problem will come screaming to your door. Spend your spare moments looking for problems and verifying that things are working as they should. Think about the kinds of things that can go wrong and check to see whether they are.

Automate as much of your checking for problems as you can, but find a way to ensure that problems come to your attention and that you’ll notice if the alerts you should be seeing stop arriving. Having warnings about services being down on one of your most critical servers landing in your spam folder is not going to win you any awards.
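Noticing that expected alerts have gone quiet is sometimes called a heartbeat or "dead man's switch" check. A minimal sketch, assuming each monitoring job records the Unix time of its last report in a dictionary (the names `stale_checks` and `last_seen` are illustrative, not part of any particular monitoring tool):

```python
import time


def stale_checks(last_seen, max_age_seconds, now=None):
    """Return the names of checks that have not reported recently enough.

    last_seen maps a check name to the Unix timestamp of its last report.
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, reported_at in last_seen.items()
        if now - reported_at > max_age_seconds
    )
```

Running this from a separate host, or from a different scheduler than the checks themselves, keeps one failure from silencing both the alerts and the watcher.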

Pay a LOT of attention to security

Your security efforts should be commensurate with the data you are protecting. Know what you're protecting. Know who "owns" the data you’re protecting. Follow best practices such as least privilege, regular patching, monitoring critical services, and vulnerability testing. Run only the services you need. Be on the alert for any signs of break-in or system compromise. Be prepared with an escalation path so that you know when and to whom you need to report signs of system compromise.

Don't ignore your log files

Routine monitoring of your log files can alert you to problems long before they threaten the well-being of your servers and the services they support. Check for errors and warnings. Invest in a tool that monitors log messages or build your own scripts. No one has enough time to go through all of the messages that will be added to your log files.
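If you build your own script, the core of it can be quite small. The sketch below counts lines that mention an error, warning, or critical level and collects them for review; the pattern and the function name `summarize_log` are assumptions, and real log formats vary enough that the pattern will need tuning.

```python
import re
from collections import Counter

# Severity keywords to look for, matched as whole words, case-insensitively.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?|CRIT(?:ICAL)?)\b", re.IGNORECASE)
ALIASES = {"WARN": "WARNING", "CRIT": "CRITICAL"}


def summarize_log(lines):
    """Return (counts per severity, list of flagged lines) for an iterable of lines."""
    counts = Counter()
    flagged = []
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            level = match.group(1).upper()
            counts[ALIASES.get(level, level)] += 1
            flagged.append(line.rstrip("\n"))
    return counts, flagged
```

Because the function accepts any iterable of lines, it can be fed an open file handle for a real log or a short list of strings in a test.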

Back up everything

Follow a good backup policy even if your servers are replicated. A replicated error is still an error. Test your backups. Make sure they're good *before* you need to rely on them.
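One way to test a backup is to restore it to a scratch location and compare it with the source, file by file. The following is a sketch under that assumption; `verify_restore` and `sha256_of` are hypothetical helper names, and comparing against live data only makes sense if the source has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(rel))
    return problems
```

An empty list means every source file was found with identical contents in the restored tree; anything else names the files to investigate.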

Employ redundancy wherever you can afford it. To the extent possible, tolerate no single points of failure -- even yourself.

Consider everyone's time as valuable as your own

Systems administrators tend to be a tad on the arrogant side. We are the wizards in our own magical domains. Even so, get to meetings on time and get back to people quickly when they ask for help, even if just to say that you're working on their problems. Treat your customers with respect even when they can't find their way to the command line. They may be magicians in their own domains of expertise and, even when they’re not, they’re the reason we’re so important.

Keep your users informed

Make sure your users always know what to expect, especially when major upgrades are planned. It helps them to have confidence in you and trust the services they depend on. Communicate, be visible, use a ticketing system, and pay attention to how long it takes to resolve problems.

Go out of your way to be likable

Systems administrators do not have to be unapproachable and arrogant. In fact, some of the sharpest and most skilled sysadmins that I have worked with over the years never displayed a shred of superiority. They didn't have to.

Never stop picking up new skills

If you're not moving ahead, you're falling behind. Always be looking for new things to learn. You'll be ready to take on new responsibilities and maybe even survive a layoff. If you're not sure what skills you should pick up, check out some job descriptions at places you think you'd enjoy working. How do you measure up? What skills are in high demand? Can you put aside a little time each day to learn something new?

Seek a balanced life

Balance your work life with things that you enjoy, and maybe even find some rewarding activities that have nothing to do with being an amazingly talented and insightful systems administrator. Don't tie your ego to a single hitching post. Even if you love your work, don't let it be the only thing that makes you feel accomplished or important. Even you can be thrown under the bus. Don’t let the tread marks scar you for life. You are not your job. Don’t be defined by office politics. See the games for what they are and seek to be someone you would admire.