Imagine a trailer hooked on to the back of an Indy car - the effect on performance would be similar to what happens when you hook up hard drives to a server. That's because a hard drive is a slow, delicate, mechanical component attached to a computer system that is otherwise made up of solid-state electronics built for speed.
To get data on or off a disk, a drive head has to swing into the correct position and wait for the right part of the disk to come around. This can take hundredths of a second - many times longer than it takes to access data stored in RAM, for example. As a result, a disk I/O subsystem can be a huge data bottleneck, and significant improvements in the overall performance of the server can be achieved if the effects of this bottleneck can be minimized.
To understand how this might be done, let's take a look at how data is moved to and from a disk.
Put simply, the I/O subsystem accepts a stream of requests to read or write data to and from a disk which it holds in a queue. To help speed things up, it usually merges read or write requests together if they are close to each other in the queue, and if they involve the same area of the disk.
Read requests are generally given higher priority than write requests because a given process will have to wait for the read data it has requested before it can continue, while it won't have to wait for the result of a write.
The subsystem will also usually detect when data is being read sequentially from the disk and use a technique called "read-ahead" to read and cache the disk block following the one it has been asked to read. This can reduce seek time during sequential reads, although it does nothing to speed up reads to other random parts of the disk, and it is switched off when the subsystem detects random (i.e., non-sequential) reads.
After this simple merging has been carried out, the I/O subsystem could process requests in the queue in the order they are arrive at the head of the queue, but usually it doesn't. Instead it tries to speed things up by using a scheduler algorithm to decide the order in which requests should be processed. At the most basic level a scheduler will reorder requests to minimize the traveling the disk heads have to carry out, as this mechanical process is relatively time consuming. So reads in one area of the disk will be grouped, then reads in the next area, and so on.
To illustrate what scheduler algorithms do, it's helpful to think of elevators as an analogy. When an elevator arrives at the ground floor of a busy office block, many people get in and press the buttons for the floors to which they want to go, in a random order. Clearly it would make no sense to travel to the fifth floor, then the 40th, then the second, and then the 39th, just because that was the order in which the buttons were pressed. A far more efficient strategy is to stop at the floors in ascending order: the second, fifth, 39th and finally the 40th. By traveling in a single direction instead of backing up on itself, the elevator can drop everyone off at their floors in the minimum time.
But it turns out that this type of scheduling is not always optimal for an I/O subsystem (or for an elevator); it actually has a number of flaws. Going back to the elevator analogy, people working on the very highest floors would have to waste a lot of time, stopping at tens or potentially hundreds of intermediate floors, every time they traveled from the lobby to their office. For that reason many large buildings have some elevators that don't serve the lower floors at all, instead they go straight to the upper levels before stopping. The upper-floor workers benefit, but at the cost of the lower-floor workers, who now have fewer elevators that go to their floors.
So what is the best scheduler algorithm? In fact there is no such thing as a "best" one. Instead, there are a number of different schedulers, each one suited to different types of servers. The four principal schedulers common to most Linux systems are:
- completely fair queuing
This is the simplest scheduler, and it does nothing beyond merging requests and placing them in a queue to be processed in the order they arrive at the head of the queue.
The noop scheduler is useful in a system which uses a modern RAID controller which can schedule I/O better than the operating system. It is also useful on systems with solid state drives (including RAM disks,) which have a fixed seek time for all data, regardless of where it is stored on the device.
The deadline scheduler is designed to overcome the problem of a read request for a remote part of a disk getting serviced too slowly because it is constantly shunted to the back of the queue by requests arriving for closer parts of the disk. This is analogous to trying to get to the top floor of a building in an elevator that keeps stopping to let people on and off at intermediate floors. To overcome this, each request is given a waiting time or "deadline." If the request is not serviced before the deadline expires, it is then put to the front of the queue, merged with any requests for nearby disk locations and serviced immediately.
The deadline scheduler is useful for real-time applications because in most cases it keeps latency low since all requests are serviced within a relatively short time frame dictated by the deadline.
The anticipatory scheduler works by anticipating that the next request will be for the disk block that follows the one that has just been serviced. After carrying out a read or write, the scheduler enforces a slight delay to give an application time to submit a request for the next disk block. If such a request arrives (as anticipated), the disk head will be in the right place to service the request, instead of having moved away to service another request.
Using the elevator analogy, imagine a situation where an office block is host to a number of companies, each of which occupies, say, two floors. If an office worker gets on at a given floor, it's a good bet she will want to get off at the next floor, which also belongs to her company. So it is sensible for the elevator to wait until the office worker has pressed the button for the floor she wants before it shoots past the next floor.
These slight delays add up to a small amount of latency added to the system, but in some circumstances this can easily be outweighed by the efficiency gained from having the head anticipate where it next has to go.
The anticipatory scheduler is suited to tasks that don't get regularly interrupted by external requests. It tends to be useful for web servers and workstations that often access data in a linear fashion, but not for database servers that access data in a more random fashion.
cfq stands for "completely fair queuing." In this system the scheduler maintains a number of internal request queues, and each process running on the system is assigned to one of these queues. The schedule then takes the requests from the front of each queue and places them into a dispatch queue, where they are ordered to minimize seek time and serviced. The scheduler then goes back to the internal request queues and brings more requests from the front of these queues to refill the dispatch queue.
This is analogous to a building with a single elevator, with separate queues for each company that has offices in the building. Each time the lift arrives in the lobby it takes staff from the front of the queue for each company, so no individual company can hog the elevator excessively. Cfq works well for medium and large multi-processor systems, and is the default scheduler for Linux distros such as Red Hat Enterprise Linux.
In the first part of this article we looked at the role schedulers play in I/O optimization. But how do you actually select and tune a scheduler to increase I/O performance in practice?
The obvious first question that must be answered is this: Which scheduler should you use? It's a deceptively difficult question, and the answer will depend on many inter-related factors, including the applications running, the size of the files you are read and writing, the frequency of your file read and writes, and the pattern of these reads and writes. The only thing that can be said with much certainty is that unless you are using a solid state drive or RAM disk - which can access all files equally fast - the noop scheduler is the worst choice. The other, active schedulers should all perform better than noop with conventional spinning disks.
There's some evidence that when many different types of applications are requesting many different types of disk reads and writes the deadline scheduler is the best all-round performer, but in the end the best course of action is probably to test all three active schedulers and choose the one that gives the best results.
So once you've chosen a scheduler to test, how do you get your system to use it? There two primary ways to do this: at boot through a configuration file or on the fly from the command line. The examples we use here work for Red Hat Enterprise Linux but should be similar for any distribution you happen to be using.
To set a scheduler at boot, edit
to the end of the line that specifies the kernel. Scheduler names include "noop", "cfq", "deadline" and "as" (for anticipatory).
Alternatively, to set a given scheduler for disk hda on the fly, simply bring up a terminal and type:
To verify what scheduler hda is currently using, type
You'll see something like:
noop anticipatory 'deadline' cfq
which would indicate that the deadline scheduler is currently in use.
Once you've chosen and set a scheduler, you can tune it to work optimally with your system by altering various parameters. These parameters differ for each scheduler. The exception to this is the noop scheduler, which actually has no tunable parameters.
The parameters are set in files located at:
For example, the parameter read_expire on a device hda is stored in:
The parameter can be set to 80 (milliseconds) using the command:
echo 80 > /sys/block/hda/queue/iosched/read_expire
This has five tunable parameters:
- read_expire - the most important parameter for the deadline scheduler, it dictates the maximum time a read request must wait before being serviced.
- write_expire - the maximum time before a write request must be serviced.
- fifo_batch - the number of requests in each batch sent for immediate servicing once they have expired.
- writes_starved - this alters the amount of priority that read requests get over write requests. Since unserviced reads affect performance more than unserviced writes, it's usual that reads get some measure of priority.
- front_merges - this is disabled by default. Setting its value to 1 results in requests contiguous to existing requests in the scheduler being merged to the front of the queue instead of the back.
In addition to read_expire and write_expire, the anticipatory scheduler also includes:
- read_batch_expire - the time spent servicing read requests before servicing pending write requests.
- write_batch_expire - the reverse of the above
- antic_expire - the time in milliseconds that the scheduler pauses while waiting for a follow-on request from an application before servicing the next request in the queue
- quantum - the number of internal queues that the requests are taken from in one cycle and moved to the dispatch queue for processing. The cfq scheduler may have 64 internal queues, but only move requests to the dispatch queue by visiting the first eight internal queues, followed by the second eight in the next cycle, and so on.
- queued - the maximum number of requests allowed in a given internal queue.
Although there are no hard and fast rules, a sensible strategy is probably to benchmark I/O using different schedulers with their default parameter settings, and then to choose the most appropriate scheduler and attempt to fine-tune the parameter settings by referring to application-specific recommendations.
Don't forget that scheduling isn't the be all and end all of I/O optimization. Other factors that can affect performance include prefetching, disk capacity and spin rate (because these can affect seek time) and even the file system you choose for a given disk. One thing is certain: Whatever performance you are getting now, you can almost certainly improve it if you have the time, the will and the knowledge.
Paul Rubens is an IT consultant and journalist based in Marlow on Thames, England. He has been programming, tinkering and generally sitting in front of computer screens since his first encounter with a DEC PDP-11 in 1979.Article courtesy of ServerWatch