Open source provides you an edge in two separate ways. First of all, open source software allows you to dig deeper and locate the root of a problem. An often cited advantage of open source software is the ability of users to correct bugs. Well, guess what, this doesn't happen all that often. On the other hand, what happens more frequently is that when you encounter a problem you can examine the source code to uncover its root and devise a workaround. One example I give involves the ls command taking an inordinate amount of time to complete in directories with a very large number of files. By looking at the source code of ls you will see that some of the ls options will force it to perform a stat system call on every file it lists. You can eliminate this overhead by judiciously choosing the options that you really need, for example by eliminating the coloring of the files according to their type.
A second advantage we gain from open source software and systems is the ability to learn from them by reading their code. In Code Reading and Code Quality I use more than one thousand examples from open source systems. Apart from illustrating specific concepts in a realistic setting, my hope is that readers will pick the habit of examining the source code they have at their disposal for learning better coding techniques.
LP: Can you give me a concrete example?
Consider clever data structures. Have a look at how the GNU C library implements the various character classification functions, like isalpha and isdigit. You will find that through an nifty indexing scheme and bit-ANDing operations a single integer array of 384 elements is used to store data for 12 classification functions. This implementation efficiently derives a function's result with a single lookup operation, and allows indexing through signed and unsigned characters, as well as the EOF constant. Or, have a look at how the Apache web server maps the schemes used for data access (like http, ftp, or https) into their corresponding TCP ports. The schemes are ingeniously ordered into a table by the frequency of their expected occurrence (http comes first and the obscure prospero scheme comes last). This will speed up the lookup for typical web server loads. These programming tricks are clever and can give you a significant performance edge. Nevertheless, you will not learn them in a typical Algorithms and Data Structures course; you have to dig into the source code to discover these gems.
LP: Software development organizations put a lot of emphasis on the process they use for creating software. Could similar ideas be applied to system administration?
I find that the development process has been overrated in the past. The agility movement, which puts emphasis on the working product instead of the process, is an expected backlash. In a creative activity like programming you need to address the product first and foremost, and this is why I examined concrete code examples, rather than abstract processes that supposedly led to them. Although a slapdash process will often result in disorderly code, a methodical process doesn't guarantee neat code. You need brilliance, creativity, and good taste to obtain programs that really shine.
Yet in the system administration field I feel there's often too little emphasis on the process. Practices that are taken for granted in modern software development, like configuration management, issue tracking, nightly builds, code reviews, refactoring, automated testing, and coding standards, have yet to make a significant impact in the field of system administration.
LP: Many of the practices you mentioned appear to be focused on code. How should for instance a system administrator apply the concept of nightly builds?
One significant property of well-run configuration management system is the storage of all assets (source code, documentation, bitmaps, tools) in the system's repository, and therefore the ability to perform a complete build by checking out the software base on a clean system. This task can be part of a software's nightly build procedure, and by setting things up in this way we ensure that we don't have any hidden dependencies living outside our configuration management system.
Moving this process to the field of system administration, I would expect that a test system is rebuilt nightly unattended from scratch using the operating system distribution files, appropriate scripts, and add-on packages. All needed elements would be stored on local file servers under a configuration management system like CVS. Such a practice obviates the all too common danger of having a running system depending on a tool that was once fetched over the net from a site that has ceased to exist.
LP: A lot of administration work focuses on maintaining existing systems. How can we improve on that situation?
I often discuss the fleeting notion of maintainability in terms of slightly more concrete attributes: analyzability, changeability, stability, and testability. If a system satisfies these attributes, then it will be easy to maintain. As an example, the startup sequence of a typical Unix system scores high in both analyzability and changeability. A single directory (such as rc.d or init.d) contains scripts that are executed for each subsystem. A system administrator can both read the scripts to understand what is going on, and modify them to change their behavior. However, this system suffers in terms of stability and testability. Until recently dependencies between subsystems were difficult to express, and this resulted in brittle configurations. Also there's still no standardized way to determine whether a particular subsystem has been correctly initialized, and whether it is running correctly.
For existing systems, any improvement in the directions I've outlined will result in a more maintainable whole. For example, some modern Unix systems allow the declarative specification of subsystem dependencies in terms of requires and provides relations. This feature improves analyzability, changeability, and stability.
Testability seems to be a tough nut to crack, especially when you're dealing with events that are difficult to reproduce. For instance, the correct setup of a UPS is very tricky. There are many low probability events that can wreck havoc. What happens when power is restored during the low battery shutdown sequence? What happens if there's a second power failure when the system boots with batteries that are still discharged? There are ways to deal with these events, but testing them isn't easy. I guess that duplicating the success of unit testing in the filed of system administration will prove an elusive goal.