Condor is a batch job queueing system that runs across multiple machines.
It takes jobs from users, queues
them up, decides where and when to run them, and then returns the results to
the user once done. It enables you to turn any group of machines into a
cluster-like system — setting up a distributed-processing network with
whatever resources you have available. You can use it on pretty much any
setup, including dedicated clusters, but arguably its best use is as software
enabling you to treat your desktops as a part-time cluster. You can set rules
so that jobs are only run on idle desktops — making the most of unused
CPU cycles and power resources, especially if your site has an always-on
policy.
The basic workflow is that the user submits a job (a resource request) from
a Condor client. The job can specify its resource requirements and
preferences, as well as what should be run and where the output should be
sent. The central Condor server then examines its database to find a client
that matches the job requests. When an appropriate client comes up, the job
is sent out and run, and the output is sent back to the user. Condor also has a
checkpointing system that can pause or cancel jobs on the fly (for example,
if a desktop comes back into use halfway through a job) and resume
them later where possible.
The first part of this series deals with installing the Condor server and
client; the second part will show how to go about submitting jobs and
specifying resources.
Thinking about your setup
Condor can be downloaded from the
Condor website. Before installation, you need to consider where
the files will live.
You need to create a condor user on all machines running Condor: this user
will own the files created by the Condor daemons (although the daemons
themselves run as root). Ideally, the home directory for this user would be
centralised, to simplify admin — for ease of explanation, I’ll assume this
setup here. (You can check the documentation for how to handle it if you want
to have separate home directories on each client.) Don’t edit the config
file until you’ve unpacked and installed the software (see below) — just
decide what you intend to do.
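As a rough sketch of the account setup described above, the snippet below prints the useradd command you would run as root on each machine, rather than running it for you. The /dir/condor path is the example home directory used in this article, and the options shown are the usual Linux useradd ones; other systems may need different flags.

```shell
#!/bin/sh
# Assumed shared home directory for the condor user (the article's example path).
CONDOR_HOME=${CONDOR_HOME:-/dir/condor}

# Print the command that would create the account. The home directory is
# shared, so useradd is told not to create it locally.
condor_useradd_cmd() {
    printf 'useradd --home-dir %s --no-create-home condor\n' "$1"
}

if id condor >/dev/null 2>&1; then
    echo "condor user already exists on this machine"
else
    echo "run as root: $(condor_useradd_cmd "$CONDOR_HOME")"
fi
```

If the home directory is shared over NFS, you will probably also want the account to have the same numeric UID on every machine (useradd's --uid option), so that file ownership matches everywhere.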
Each machine that Condor runs on (either server or client) needs to have
its own spool, log, and execute directories. If
you’re using a centralised home directory, you can set the home directory and
local directories up in the
configuration file (condor_config) as
TILDE = /dir/condor
LOCAL_DIR = $(TILDE)/hosts/$(HOSTNAME)
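The per-machine layout this implies can be sketched as below. This is only an illustration: CONDOR_BASE and make_local_dir are names made up for the example, and the base directory defaults to a scratch directory so that nothing under /dir/condor is touched unless you ask for it.

```shell
#!/bin/sh
# CONDOR_BASE plays the role of TILDE (/dir/condor in this article). It
# defaults to a scratch directory so the sketch is safe to try out.
CONDOR_BASE=${CONDOR_BASE:-$(mktemp -d)}

# Create spool, log and execute under <base>/hosts/<host>.
# $1 = base directory, $2 = short hostname
make_local_dir() {
    for sub in spool log execute; do
        mkdir -p "$1/hosts/$2/$sub" || return 1
    done
}

# Short hostname (no domain part), assumed to match what $(HOSTNAME) expands to.
host=$(uname -n | cut -d. -f1)
make_local_dir "$CONDOR_BASE" "$host"
ls "$CONDOR_BASE/hosts/$host"
```

For a real install you would run it with CONDOR_BASE=/dir/condor and then make sure the new directories are owned by the condor user (for example with chown -R as root).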
It’s also a good idea to keep the global condor_config
file in a shared directory, so that you only ever have to
edit it in one place. Each machine also has a local configuration file,
which can override global config options; you can keep this in the
LOCAL_DIR directory. You set its location in the global configuration file:
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
Now you’ve thought about all that, untar the Condor download into an
appropriate directory. Then run condor_install from this directory.
This script needs a few options:
condor_install --prefix=/dir/condor --local-dir=/dir/condor/hosts/myserver --type=manager
This will do a “manager” type install (other types are submit and execute – a
manager-only machine won’t be able to submit jobs or have jobs run on it),
install Condor to the /dir/condor directory (the home directory you
decided on in the previous section), and set
/dir/condor/hosts/myserver as the local directory for this
machine.
Global configuration
The global condor_config file is divided into four parts. The first
part is the settings that you must change. Some of these are the
variables you decided on earlier (e.g. LOCAL_DIR). You also need to
specify an admin email address, your local domain (e.g. example.com),
and a name for your system. The config file is well-documented.
Part 2 of the file is usually safe to leave as-is, but you do need to set
the HOSTALLOW_READ and HOSTALLOW_WRITE variables. You can
just set them to * (i.e. any machine at all can read from or write to your
pool), but this is a bit of a risk. More likely you want to set these to
*.example.com or whatever your domain is.
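To make the difference concrete, here is a small sketch that writes the restricted form of those two settings to a scratch file and then checks a config file for the wide-open form. The find_open_hostallow helper is made up for this example and is not part of Condor; it is just a grep over the file.

```shell
#!/bin/sh
# Write a sample fragment to a scratch file (not your real condor_config).
sample=$(mktemp)
cat > "$sample" <<'EOF'
HOSTALLOW_READ  = *.example.com
HOSTALLOW_WRITE = *.example.com
EOF

# Print any HOSTALLOW_READ/WRITE line whose value is a bare "*".
# Exit status 0 means at least one wide-open setting was found.
find_open_hostallow() {
    grep -E '^[[:space:]]*HOSTALLOW_(READ|WRITE)[[:space:]]*=[[:space:]]*\*[[:space:]]*$' "$1"
}

if find_open_hostallow "$sample"; then
    echo "warning: pool is open to any host"
else
    echo "HOSTALLOW settings are restricted"
fi
```

Pointing find_open_hostallow at your actual condor_config is a quick way to confirm that you didn’t leave either setting at * by accident.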
Once you’re happy with the settings, make sure that you’ve set the
environment variable CONDOR_CONFIG to the location of this file.
Next, execute
/dir/condor/sbin/condor_master to start the Condor daemons: condor_master is
the daemon that starts and monitors the other daemons. It also checks for
updated binaries and restarts the daemons if necessary.
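Put together, the last two steps might look something like the sketch below. The location of the config file (/dir/condor/etc/condor_config) is an assumption based on the install prefix used in this article, so adjust it to wherever your global config actually lives. The checks are there only so that the snippet fails gently on a machine where Condor isn’t installed yet.

```shell
#!/bin/sh
# Assumed location of the global config file; adjust to suit your install.
CONDOR_CONFIG=${CONDOR_CONFIG:-/dir/condor/etc/condor_config}
export CONDOR_CONFIG

# Directory holding condor_master (the article's example install path).
CONDOR_SBIN=${CONDOR_SBIN:-/dir/condor/sbin}

if [ ! -r "$CONDOR_CONFIG" ]; then
    echo "cannot read $CONDOR_CONFIG" >&2
elif [ -x "$CONDOR_SBIN/condor_master" ]; then
    # condor_master then starts the other daemons itself.
    "$CONDOR_SBIN/condor_master"
else
    echo "condor_master not found in $CONDOR_SBIN" >&2
fi
```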
To install a client, if you’ve unpacked the Condor tarball somewhere central,
you can log on to the client, cd to that directory, and run this command:
condor_install --type=execute,submit --local-dir=/dir/condor/hosts/myclient --central-manager=myserver.example.com --verbose
Note that you need to specify the client name (for --local-dir) and
the manager name. (If you didn’t unpack the files centrally, you’ll have to
copy the release tarball somewhere appropriate and add a
--install=/path/to/condorrelease.tar option.)
As with the server, set the CONDOR_CONFIG variable, and execute
/dir/condor/sbin/condor_master to start the Condor daemon! Check
it’s running by grepping the process list for condor_* processes.
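One way to do that check is sketched below. The condor_procs helper is just a name for this example; it filters a process listing for anything that looks like a Condor daemon. On an execute/submit client you would normally expect to see condor_master, condor_startd and condor_schedd; on the manager, condor_master, condor_collector and condor_negotiator.

```shell
#!/bin/sh
# Read a process listing on stdin and keep only lines naming a condor_* program.
condor_procs() {
    grep -E '(^|[ /])condor_[a-z_]+'
}

# List PID and command name for every process, then filter.
ps -e -o pid= -o comm= | condor_procs || echo "no condor_* processes found"
```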
Once you’ve got everything up and running, you may want to set up a start
script so that Condor starts automatically on boot in the future.
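A very minimal start script could look like the sketch below. Treat it as a starting point only: the paths are assumptions based on the /dir/condor layout used in this article, the condor_init name is made up, and how you hook the script into the boot sequence depends on your distribution. I believe the Condor release also includes an example boot script under etc/examples, which is worth checking before writing your own.

```shell
#!/bin/sh
# Hypothetical boot script sketch (e.g. installed as /etc/init.d/condor).
# Both paths are assumptions based on the article's /dir/condor layout.
CONDOR_CONFIG=${CONDOR_CONFIG:-/dir/condor/etc/condor_config}
CONDOR_SBIN=${CONDOR_SBIN:-/dir/condor/sbin}
export CONDOR_CONFIG

condor_init() {
    case "$1" in
        start)
            "$CONDOR_SBIN/condor_master"
            ;;
        stop)
            # Ask condor_master (and the daemons it manages) to shut down.
            "$CONDOR_SBIN/condor_off" -master
            ;;
        *)
            echo "Usage: condor {start|stop}" >&2
            return 2
            ;;
    esac
}

# A real boot script would finish with:  condor_init "$1"
```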
You probably also want to have a look at the rules that govern when a job
can be run on a client. These are in Part 3 of the global config file. The file
sets the “UWisc – CS Dept” rules (look for the definition of
UWCS_WANT_SUSPEND in the config file). The rules as defined here are
probably a good start — you can adjust them later if you start having
problems, or you can override them for a particular machine if need be.
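As an example of a per-machine override, the sketch below appends one setting to a local config file so that this machine waits for 30 idle minutes before it will start a job. It assumes your global condor_config defines StartIdleTime and MINUTE the way the stock file does (check Part 3 of your own copy), and it writes to a scratch file unless you set LOCAL_CONFIG yourself.

```shell
#!/bin/sh
# Target file: a scratch file by default. For real use, set LOCAL_CONFIG to
# this machine's condor_config.local (for example under its LOCAL_DIR).
LOCAL_CONFIG=${LOCAL_CONFIG:-$(mktemp)}

# The quoted 'EOF' stops the shell from expanding $(MINUTE), which is a
# Condor macro and must reach the file unchanged.
cat >> "$LOCAL_CONFIG" <<'EOF'
# Wait for 30 minutes without keyboard activity before starting a job here.
StartIdleTime = 30 * $(MINUTE)
EOF

cat "$LOCAL_CONFIG"
```

After changing a config file on a running machine, condor_reconfig should tell the daemons to reread it.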
Next steps
OK, now you have your server and your first client set up. In the next
part of this piece we’ll look at how to submit a job.
This article was first published on LinuxPlanet.com.