Although we often hear about program bugs and techniques to get rid of them, we seldom see a similar focus in the field of system administration. LinuxPlanet asked Diomidis Spinellis, the author of the book Code Quality: The Open Source Perspective, for tips on what system administrators can learn from programmers.
LinuxPlanet: How would you judge the quality of a system’s setup?
Surprisingly, I would use the same attributes as those I employ for describing the quality of a software artifact: functionality, reliability, usability, efficiency, maintainability, and portability. I would ask questions like the following. What services does the system offer to its users? Has the system been set up in a way that it can run uninterrupted for months? How easy is it to manage its services, or to restore a file from a backup? Is there any waste in the utilization of the CPUs, the main memory, or the disk storage? How difficult would it be to upgrade the operating system or the installed applications? How difficult would it be to move the services to a different platform?
LP: So what can system administrators learn from programmers?
System administration is sadly a profession that doesn’t get the type of attention given to software development. This is unfortunate, because increasingly the reliability of an IT system depends as much on the software comprising the system as on the support infrastructure hosting it. Nowadays, especially when dealing with open source software, you don’t simply install an application; you often install with it a database server, an application server, a desktop environment, and libraries providing functionality like XML parsing and graphics drawing. Furthermore, for the application to work on the Internet you need network connectivity, routing, a working DNS server, and a mail transfer agent. And for reliable operation underneath you often deploy redundant servers, RAID disks, and ECC memory.
A well-set-up system, say a Linux installation, has many of the same quality attributes as a well-written program. Both system administrators and programmers can use similar techniques for implementing quality systems.
LP: Give me an example. I have a system with serious time performance problems. Where do I start?
I suggest that your first step should be to characterize the system's load. Using the top command available on Linux systems you can see how your system spends its time. Near the screen's top you will see a line like the following.
CPU states: 80.7% user, 17.1% system, 0.0% nice, 2.2% idle
You can typically distinguish three separate cases.
- Most of the time is spent in the user state. This means that your system’s applications are primarily executing code directly in their own context. You will need to tune your applications, using profiling tools to locate the code hotspots and algorithmic improvements to optimize their operation.
- Most of the time is spent in the system state. This is a situation where most of the time the kernel is executing code on the application’s behalf. In this case you will first monitor the applications using tools like strace to see how they interact with the operating system, and then use techniques like application-level caching or I/O multiplexing to minimize the operating system interactions.
- Most time is spent in the idle state. Here you’re dealing with I/O bound problems. You’ll need to look at the way your application interacts with peripherals, like the storage system and the network. You can again use strace to see which I/O system calls take a long time to complete, and the vmstat, iostat, and netstat commands to see whether there’s more performance you can squeeze out of your peripherals. If indeed your peripherals are operating well below their rated throughput you can use larger buffers to optimize the interactions, otherwise you must minimize the interactions, again using algorithmic improvements or caching techniques.
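The three cases above can be told apart mechanically. Below is a minimal Python sketch; the parsing pattern and the `classify_cpu_line` name are mine, not part of top, and the sketch simply picks the dominant of the three states discussed above out of a top-style CPU line:

```python
import re

def classify_cpu_line(line):
    # Extract "<value>% <state>" pairs, e.g. ("80.7", "user").
    pairs = re.findall(r"([\d.]+)%\s+(\w+)", line)
    usage = {state: float(value) for value, state in pairs}
    # Report the dominant of the three states that drive tuning decisions.
    return max(("user", "system", "idle"), key=lambda s: usage.get(s, 0.0))

line = "CPU states: 80.7% user, 17.1% system, 0.0% nice, 2.2% idle"
print(classify_cpu_line(line))  # prints: user
```

With the sample line above the verdict is "user", so by the first rule you would reach for a profiler rather than for strace or iostat.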
LP: It looks like you can often trade space for time. Is this so?
Yes, I wrote that space and time form a yin/yang relationship, and there are many cases where you can escape a tight corner by trading one of the two to get more of the other. Consider a sluggish database system. You can often improve its performance by adding appropriate indexes, and even precomputing the results of some queries and storing them in new tables. Both the indexes and the new tables take additional space, but this space buys you increased performance. Now consider the opposite case, where your data overflows your available storage. (Star Trek's "Space: the final frontier" opening line was true in more than one sense.) If you have sufficient CPU time at your disposal you can devote that spare time to compressing and decompressing your data. Your MP3 player and digital camera have succeeded as products by adopting exactly this design tradeoff.
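Both directions of the tradeoff fit in a few lines of Python; the zlib compression and the dictionary index below are illustrative stand-ins for a real storage or database system, and the data is invented:

```python
import zlib

# Time -> space: spend CPU cycles compressing repetitive data to save storage.
records = b"sensor-reading,2006-01-01,42.0\n" * 1000
packed = zlib.compress(records, 9)              # level 9: most CPU, least space
assert zlib.decompress(packed) == records       # pay CPU again on every access
print(f"{len(records)} bytes -> {len(packed)} bytes")

# Space -> time: a precomputed index (extra memory) turns a linear
# scan into a constant-time lookup, just like a database index.
rows = [("alice", 1), ("bob", 2), ("carol", 3)]
index = {name: value for name, value in rows}   # extra space...
assert index["carol"] == 3                      # ...buys an O(1) lookup
```

A database index is the second pattern at scale: extra pages on disk in exchange for avoiding a full table scan on every query.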
LinuxPlanet: What is the role of open source software like Linux in performance engineering?
Open source gives you an edge in two separate ways. First of all, open source software allows you to dig deeper and locate the root of a problem. An often-cited advantage of open source software is the ability of users to correct bugs. Well, guess what, this doesn't happen all that often. On the other hand, what happens more frequently is that when you encounter a problem you can examine the source code to uncover its root and devise a workaround. One example I give involves the ls command taking an inordinate amount of time to complete in directories with a very large number of files. By looking at the source code of ls you will see that some of the ls options will force it to perform a stat system call on every file it lists. You can eliminate this overhead by judiciously choosing the options that you really need, for example by eliminating the coloring of the files according to their type.
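The per-file stat overhead is easy to reproduce outside ls. The following Python sketch (the file count and names are arbitrary, chosen only to make the effect visible) contrasts a plain directory listing with the extra stat call per file that options like coloring or long-format output force:

```python
import os
import shutil
import tempfile
import time

# Build a directory with many files -- a hypothetical workload.
tmp = tempfile.mkdtemp()
for i in range(2000):
    open(os.path.join(tmp, f"file{i:04d}"), "w").close()

# Plain listing: a single directory traversal, no per-file system call.
t0 = time.perf_counter()
names = os.listdir(tmp)
t1 = time.perf_counter()

# What coloring or long-format options make ls do: one stat() per file.
stats = [os.stat(os.path.join(tmp, n)) for n in names]
t2 = time.perf_counter()

print(f"listing alone: {t1 - t0:.4f}s; listing plus stat: {t2 - t1:.4f}s")
shutil.rmtree(tmp)  # clean up the scratch directory
```

On directories with hundreds of thousands of entries the second phase dominates, which is exactly why dropping the unneeded options speeds ls up.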
A second advantage we gain from open source software and systems is the ability to learn from them by reading their code. In Code Reading and Code Quality I use more than one thousand examples from open source systems. Apart from illustrating specific concepts in a realistic setting, my hope is that readers will pick up the habit of examining the source code they have at their disposal for learning better coding techniques.
LP: Can you give me a concrete example?
Consider clever data structures. Have a look at how the GNU C library implements the various character classification functions, like isalpha and isdigit. You will find that through a nifty indexing scheme and bit-ANDing operations a single integer array of 384 elements is used to store data for 12 classification functions. This implementation efficiently derives a function's result with a single lookup operation, and allows indexing through signed and unsigned characters, as well as the EOF constant. Or, have a look at how the Apache web server maps the schemes used for data access (like http, ftp, or https) into their corresponding TCP ports. The schemes are ingeniously ordered into a table by the frequency of their expected occurrence (http comes first and the obscure prospero scheme comes last). This will speed up the lookup for typical web server loads. These programming tricks are clever and can give you a significant performance edge. Nevertheless, you will not learn them in a typical Algorithms and Data Structures course; you have to dig into the source code to discover these gems.
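The glibc technique can be sketched in miniature. The Python version below is my simplification, not glibc's code: the flag names are invented and the table is reduced to 128 entries (the real one has 384 so that EOF and negative signed-char values also index safely). It shows how one lookup plus a bit-AND answers several classification queries:

```python
# Bit flags for a few character classes; a single table lookup can then
# answer every classification query with one AND operation.
ALPHA, DIGIT, SPACE, UPPER = 1, 2, 4, 8

# A 128-entry table indexed by ASCII code.
table = [0] * 128
for code in range(128):
    flags = 0
    if 65 <= code <= 90:                 # 'A'..'Z'
        flags |= ALPHA | UPPER
    if 97 <= code <= 122:                # 'a'..'z'
        flags |= ALPHA
    if 48 <= code <= 57:                 # '0'..'9'
        flags |= DIGIT
    if code in (9, 10, 11, 12, 13, 32):  # tab, nl, vt, ff, cr, space
        flags |= SPACE
    table[code] = flags

def isalpha(ch): return table[ord(ch)] & ALPHA != 0
def isdigit(ch): return table[ord(ch)] & DIGIT != 0
def isupper(ch): return table[ord(ch)] & UPPER != 0

print(isalpha("x"), isdigit("7"), isupper("a"))  # prints: True True False
```

Adding another classification function costs only one more bit in the existing table, not another table, which is the real elegance of the scheme.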
LP: Software development organizations put a lot of emphasis on the process they use for creating software. Could similar ideas be applied to system administration?
I find that the development process has been overrated in the past. The agile movement, which puts emphasis on the working product instead of the process, is an expected backlash. In a creative activity like programming you need to address the product first and foremost, and this is why I examined concrete code examples, rather than abstract processes that supposedly led to them. Although a slapdash process will often result in disorderly code, a methodical process doesn't guarantee neat code. You need brilliance, creativity, and good taste to obtain programs that really shine.
Yet in the system administration field I feel there’s often too little emphasis on the process. Practices that are taken for granted in modern software development, like configuration management, issue tracking, nightly builds, code reviews, refactoring, automated testing, and coding standards, have yet to make a significant impact in the field of system administration.
LP: Many of the practices you mentioned appear to be focused on code. How should for instance a system administrator apply the concept of nightly builds?
One significant property of a well-run configuration management system is the storage of all assets (source code, documentation, bitmaps, tools) in the system's repository, and therefore the ability to perform a complete build by checking out the software base on a clean system. This task can be part of a software's nightly build procedure, and by setting things up in this way we ensure that we don't have any hidden dependencies living outside our configuration management system.
Moving this process to the field of system administration, I would expect that a test system is rebuilt nightly unattended from scratch using the operating system distribution files, appropriate scripts, and add-on packages. All needed elements would be stored on local file servers under a configuration management system like CVS. Such a practice obviates the all too common danger of having a running system depending on a tool that was once fetched over the net from a site that has ceased to exist.
LP: A lot of administration work focuses on maintaining existing systems. How can we improve on that situation?
I often discuss the fleeting notion of maintainability in terms of slightly more concrete attributes: analyzability, changeability, stability, and testability. If a system satisfies these attributes, then it will be easy to maintain. As an example, the startup sequence of a typical Unix system scores high in both analyzability and changeability. A single directory (such as rc.d or init.d) contains scripts that are executed for each subsystem. A system administrator can both read the scripts to understand what is going on, and modify them to change their behavior. However, this system suffers in terms of stability and testability. Until recently dependencies between subsystems were difficult to express, and this resulted in brittle configurations. Also there’s still no standardized way to determine whether a particular subsystem has been correctly initialized, and whether it is running correctly.
For existing systems, any improvement in the directions I’ve outlined will result in a more maintainable whole. For example, some modern Unix systems allow the declarative specification of subsystem dependencies in terms of requires and provides relations. This feature improves analyzability, changeability, and stability.
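Such requires/provides relations reduce startup ordering to a topological sort. Here is a minimal Python sketch; the subsystem names are hypothetical and the dependency map stands in for whatever declarative format a particular init system actually uses:

```python
from graphlib import TopologicalSorter

# Declarative "requires" relations for hypothetical subsystems:
# each key may start only after everything in its set is up.
requires = {
    "network": set(),
    "syslogd": {"network"},
    "nfs":     {"network", "syslogd"},
    "sshd":    {"network", "syslogd"},
}

# static_order() yields a valid startup sequence, and raises
# CycleError if the declared dependencies are contradictory.
order = list(TopologicalSorter(requires).static_order())
print(order)
```

The cycle detection is where the stability gain comes from: a contradictory configuration is rejected up front instead of producing a brittle, order-dependent boot.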
Testability seems to be a tough nut to crack, especially when you're dealing with events that are difficult to reproduce. For instance, the correct setup of a UPS is very tricky. There are many low-probability events that can wreak havoc. What happens when power is restored during the low battery shutdown sequence? What happens if there's a second power failure when the system boots with batteries that are still discharged? There are ways to deal with these events, but testing them isn't easy. I guess that duplicating the success of unit testing in the field of system administration will prove an elusive goal.
LinuxPlanet: So how can Linux distributions be improved?
Linux has been a success on many fronts. It has provided an affordable operating system alternative to Windows, it has popularized the concept of open source software, and it has been used by many as a platform to experiment with new ideas, spurring innovation. Yet, Linux distributions, especially commercial ones from which users have higher expectations of quality, have a lot of room for improvement. I would like to see distributions where everything provided binds together into a coherent whole: documentation, command-line interfaces, GUI tools, file formats.
One of the strengths of Unix has been the pervasive application of a few simple ideas and conventions throughout the system: all commands can process data read from their standard input and will print results on their standard output; all commands, interfaces, devices, and file formats are documented in separate sections of the online documentation; all configuration data is stored in textual files; code duplication is avoided through reusable libraries; all resources are available via the file system interface. Linux-based systems have contributed to these ideas—think of the rich data that is nowadays available through the /proc filesystem. Yet many of the distributed packages don’t play by these rules. I would prefer to see distributions that focus on consistency over choice. By eliminating renegade packages from their distribution (do users really need twelve different file archivers?) and contributing patches that improve the consistency of the selected packages, distributions would exert evolutionary pressure toward the provision of higher quality building blocks.
LP: What can the Linux community learn from the travails of the Windows Vista release?
Although we don’t know for sure what plagued Windows Vista (I think there’s sufficient material available to write a book with insights comparable to Brooks’s The Mythical Man-Month), enough has been published by Microsoft insiders on blogs to give us a rough idea of the problems. I believe that Linux is not immune to them. More than twenty years ago Manny Lehman argued convincingly that the software’s structure decays through the evolutionary changes we apply to it, unless we invest effort to spruce its structure up. The agile software community has aptly termed the shortfall between added features and missing refactoring technical debt. In open source development the addition of features is more glamorous than the refactoring of code. Refactoring can create short-term stability problems and sour developer relations in return for long-term benefits that are not easily perceivable.
My research group is heading SQO-OSS: a multi-national EU research project to create a software quality observatory for open source software. I believe that making the quality of open source software a measurable and observable attribute will be a small first step toward higher quality systems.
One of the attributes I think the Linux community should focus on is complexity. It appears that this delayed Vista, and it has the potential to bury Linux in the long term. There are many techniques to harness complexity. We need to increase modularity, and limit undesirable coupling. As a first step we need tools that can measure these attributes (you’ve probably heard that you can’t improve what you can’t measure) and pinpoint problem areas; as a second step we should set up incentive mechanisms for contributing code that will address the problems we’ll uncover.
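As a sketch of what such measurement tools might compute, the snippet below derives two classic coupling indicators, fan-out (how many modules a module depends on) and fan-in (how many depend on it), from a dependency map. The module names and edges are invented for illustration; a real tool would extract them from the code:

```python
# Hypothetical module dependency map: module -> modules it calls into.
deps = {
    "scheduler": {"mm", "fs"},
    "mm":        {"fs"},
    "fs":        {"block", "mm"},
    "block":     set(),
    "net":       {"mm", "fs", "block", "scheduler"},
}

# Fan-out (efferent coupling): count of outgoing dependencies.
fan_out = {m: len(d) for m, d in deps.items()}
# Fan-in (afferent coupling): count of modules that depend on m.
fan_in = {m: sum(m in d for d in deps.values()) for m in deps}

# Modules with the highest fan-out are the first refactoring candidates.
for m in sorted(deps, key=lambda m: -fan_out[m]):
    print(f"{m}: fan-out={fan_out[m]}, fan-in={fan_in[m]}")
```

High fan-out flags a module entangled with the rest of the system; high fan-in flags one whose interface changes will ripple widely. Either score pinpoints where the complexity budget is being spent.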
LP: Any final thoughts?
The open source movement has been a catalyst for many aspects of software development. It has allowed us to learn from the code others have written, it gave us the opportunity to examine the quality of the code behind an application’s façade, it has provided us with many high quality development tools and reusable components, and it has shifted the attention of our community from a focus on the process that was on the verge of becoming fruitless back to the rich intricacies of the product. I was lucky to be able to explore these possibilities in Code Reading and Code Quality.
Currently, system administrators throughout the world toil solving similar problems again and again. Yet software uniquely allows us to store what Phillip Armour has perceptively termed “executable knowledge.” In the future I hope to see complex system administration tasks developed, packaged, and deployed using open source software collaboration methodologies.
About the Author
Diomidis Spinellis is an Associate Professor in the Department of Management Science and Technology at the Athens University of Economics and Business, Greece. His research interests include software engineering tools, programming languages, and computer security. He holds an MEng in Software Engineering and a PhD in Computer Science both from Imperial College London. He has published more than 100 technical papers in the areas of software engineering, information security, and ubiquitous computing. He has also written the two Open Source Perspective books: Code Reading (Software Development Productivity Award 2004), and Code Quality. He is a member of the IEEE Software editorial board, authoring the regular “Tools of the Trade” column.
Dr. Spinellis is a FreeBSD committer and the author of a number of open-source software packages, libraries, and tools. He is a member of the ACM, the IEEE, and the Usenix Association; a four-time winner of the International Obfuscated C Code Contest and a member of the crew listed in the Usenix Association 1993 Lifetime Achievement Award.
About the Book
Diomidis Spinellis. Code Quality: The Open Source Perspective. Addison-Wesley, 2006.
Using hundreds of examples from open source software projects, like the Apache web and application servers, the X Window System, and the HSQLDB Java database, this book illustrates code quality concepts that every developer should appreciate and apply. From this book you will learn how to judge software code quality attributes: reliability, security, time and space performance, portability, accuracy, and maintainability. Having mastered this art, you’ll then be able to apply your new-found sense to the code you write on your own and to the code written by others, aiming to assess its quality aspects and improve what you find lacking. You can also use your acquired knowledge of code quality when you discuss implementation alternatives with your colleagues, hopefully nudging your project toward the most appropriate direction.
This article was first published on LinuxPlanet.com.