Imagine a trailer hooked on to the back of an Indy car – the effect on performance
would be similar to what happens when you hook up hard drives to a server. That’s because
a hard drive is a slow, delicate, mechanical component attached to a computer system that
is otherwise made up of solid-state electronics built for speed.
To get data on or off a disk, a drive head has to swing into the correct position and
wait for the right part of the disk to come around. This can take hundredths of a second
– many times longer than it takes to access data stored in RAM, for example. As a result,
a disk I/O subsystem can be
a huge data bottleneck, and significant improvements in the overall performance of the
server can be achieved if the effects of this bottleneck can be minimized.
To understand how this might be done, let’s take a look at how data is moved to and
from a disk.
Put simply, the I/O subsystem accepts a stream of requests to read or write data to
and from a disk which it holds in a queue. To help speed things up, it usually merges
read or write requests together if they are close to each other in the queue, and if they
involve the same area of the disk.
Read requests are generally given higher priority than write requests because a given
process will have to wait for the read data it has requested before it can continue,
while it won’t have to wait for the result of a write.
The subsystem will also usually detect when data is being read sequentially from the
disk and use a technique called “read-ahead” to read and cache the disk block following
the one it has been asked to read. This can reduce seek time during sequential reads,
although it does nothing to speed up reads to other random parts of the disk, and it is
switched off when the subsystem detects random (i.e., non-sequential) reads.
After this simple merging has been carried out, the I/O subsystem could process
requests in the queue in the order they are arrive at the head of the queue, but usually
it doesn’t. Instead it tries to speed things up by using a scheduler algorithm to decide
the order in which requests should be processed. At the most basic level a scheduler will
reorder requests to minimize the traveling the disk heads have to carry out, as this
mechanical process is relatively time consuming. So reads in one area of the disk will be
grouped, then reads in the next area, and so on.
To illustrate what scheduler algorithms do, it’s helpful to think of elevators as an
analogy. When an elevator arrives at the ground floor of a busy office block, many people
get in and press the buttons for the floors to which they want to go, in a random order.
Clearly it would make no sense to travel to the fifth floor, then the 40th, then the
second, and then the 39th, just because that was the order in which the buttons were
pressed. A far more efficient strategy is to stop at the floors in ascending order: the
second, fifth, 39th and finally the 40th. By traveling in a single direction instead of
backing up on itself, the elevator can drop everyone off at their floors in the minimum
But it turns out that this type of scheduling is not always optimal for an I/O
subsystem (or for an elevator); it actually has a number of flaws. Going back to the
elevator analogy, people working on the very highest floors would have to waste a lot of
time, stopping at tens or potentially hundreds of intermediate floors, every time they
traveled from the lobby to their office. For that reason many large buildings have some
elevators that don’t serve the lower floors at all, instead they go straight to the upper
levels before stopping. The upper-floor workers benefit, but at the cost of the
lower-floor workers, who now have fewer elevators that go to their floors.
So what is the best scheduler algorithm? In fact there is no such thing as a “best”
one. Instead, there are a number of different schedulers, each one suited to different
types of servers. The four principal schedulers common to most Linux systems are:
- completely fair queuing
This is the simplest scheduler, and it does nothing beyond merging requests and
placing them in a queue to be processed in the order they arrive at the head of the
The noop scheduler is useful in a system which uses a modern RAID controller which can schedule
I/O better than the operating system. It is also useful on systems with solid state drives
(including RAM disks,) which have a fixed seek time for all data, regardless of where it
is stored on the device.
The deadline scheduler is designed to overcome the problem of a read request for a
remote part of a disk getting serviced too slowly because it is constantly shunted to the
back of the queue by requests arriving for closer parts of the disk. This is analogous to
trying to get to the top floor of a building in an elevator that keeps stopping to let
people on and off at intermediate floors. To overcome this, each request is given a
waiting time or “deadline.” If the request is not serviced before the deadline expires,
it is then put to the front of the queue, merged with any requests for nearby disk
locations and serviced immediately.
The deadline scheduler is useful for real-time applications because in most cases it
keeps latency low since all requests are serviced within a relatively short time frame
dictated by the deadline.
The anticipatory scheduler works by anticipating that the next request will be for the
disk block that follows the one that has just been serviced. After carrying out a read or
write, the scheduler enforces a slight delay to give an application time to submit a
request for the next disk block. If such a request arrives (as anticipated), the disk
head will be in the right place to service the request, instead of having moved away to
service another request.
Using the elevator analogy, imagine a situation where an office block is host to a
number of companies, each of which occupies, say, two floors. If an office worker gets on
at a given floor, it’s a good bet she will want to get off at the next floor, which also
belongs to her company. So it is sensible for the elevator to wait until the office
worker has pressed the button for the floor she wants before it shoots past the next
These slight delays add up to a small amount of latency added to the system, but in
some circumstances this can easily be outweighed by the efficiency gained from having the
head anticipate where it next has to go.
The anticipatory scheduler is suited to tasks that don’t get regularly interrupted by
external requests. It tends to be useful for web servers and workstations that often
access data in a linear fashion, but not for database servers that access data in a more
cfq stands for “completely fair queuing.” In this system the scheduler maintains a
number of internal request queues, and each process running on the system is assigned to
one of these queues. The schedule then takes the requests from the front of each queue
and places them into a dispatch queue, where they are ordered to minimize seek time and
serviced. The scheduler then goes back to the internal request queues and brings more
requests from the front of these queues to refill the dispatch queue.
This is analogous to a building with a single elevator, with separate queues for each
company that has offices in the building. Each time the lift arrives in the lobby it
takes staff from the front of the queue for each company, so no individual company can
hog the elevator excessively. Cfq works well for medium and large multi-processor
systems, and is the default scheduler for Linux distros such as Red Hat Enterprise
In the first part of this article we looked at the role schedulers play in I/O optimization. But how do you
actually select and tune a scheduler to increase I/O performance in practice?
The obvious first question that must be answered is this: Which scheduler should you
use? It’s a deceptively difficult question, and the answer will depend on many
inter-related factors, including the applications running, the size of the files you are
read and writing, the frequency of your file read and writes, and the pattern of these
reads and writes. The only thing that can be said with much certainty is that unless you
are using a solid
state drive or RAM disk – which can access all files equally fast – the noop
scheduler is the worst choice. The other, active schedulers should all perform better
than noop with conventional spinning disks.
There’s some evidence that when many different types of applications are requesting
many different types of disk reads and writes the deadline scheduler is the best
all-round performer, but in the end the best course of action is probably to test all
three active schedulers and choose the one that gives the best results.
So once you’ve chosen a scheduler to test, how do you get your system to use it? There
two primary ways to do this: at boot through a configuration file or on the fly from the
command line. The examples we use here work for Red Hat Enterprise Linux but should be
similar for any distribution you happen to be using.
To set a scheduler at boot, edit
to the end of the line that specifies the kernel. Scheduler names include “noop”,
“cfq”, “deadline” and “as” (for anticipatory).
Alternatively, to set a given scheduler for disk hda on the fly, simply bring up a
terminal and type:
To verify what scheduler hda is currently using, type
You’ll see something like:
noop anticipatory 'deadline' cfq
which would indicate that the deadline scheduler is currently in use.
Once you’ve chosen and set a scheduler, you can tune it to work optimally with your
system by altering various parameters. These parameters differ for each scheduler. The
exception to this is the noop scheduler, which actually has no tunable parameters.
The parameters are set in files located at:
For example, the parameter read_expire on a device hda is stored in:
The parameter can be set to 80 (milliseconds) using the command:
echo 80 > /sys/block/hda/queue/iosched/read_expire
This has five tunable parameters:
- read_expire – the most important parameter for the deadline scheduler, it dictates
the maximum time a read request must wait before being serviced.
- write_expire – the maximum time before a write request must be serviced.
- fifo_batch – the number of requests in each batch sent for immediate servicing once
they have expired.
- writes_starved – this alters the amount of priority that read requests get over
write requests. Since unserviced reads affect performance more than unserviced writes,
it’s usual that reads get some measure of priority.
- front_merges – this is disabled by default. Setting its value to 1 results in
requests contiguous to existing requests in the scheduler being merged to the front of
the queue instead of the back.
In addition to read_expire and write_expire, the anticipatory scheduler also
- read_batch_expire – the time spent servicing read requests before servicing pending
- write_batch_expire – the reverse of the above
- antic_expire – the time in milliseconds that the scheduler pauses while waiting for
a follow-on request from an application before servicing the next request in the
- quantum – the number of internal queues that the requests are taken from in one
cycle and moved to the dispatch queue for processing. The cfq scheduler may have 64
internal queues, but only move requests to the dispatch queue by visiting the first
eight internal queues, followed by the second eight in the next cycle, and so on.
- queued – the maximum number of requests allowed in a given internal queue.
Although there are no hard and fast rules, a sensible strategy is probably to
benchmark I/O using different schedulers with their default parameter settings, and then
to choose the most appropriate scheduler and attempt to fine-tune the parameter settings
by referring to application-specific recommendations.
Don’t forget that scheduling isn’t the be all and end all of I/O optimization. Other
factors that can affect performance include prefetching, disk capacity and spin rate
(because these can affect seek time) and even the file system you choose for a given
disk. One thing is certain: Whatever performance you are getting now, you can almost
certainly improve it if you have the time, the will and the knowledge.
Paul Rubens is an IT consultant and journalist based in Marlow on Thames, England.
He has been programming, tinkering and generally sitting in front of computer screens
since his first encounter with a DEC PDP-11 in 1979.
Article courtesy of ServerWatch