Running Jobs

This page draws together a few introductory notes on how to run longish jobs in TCM. TCM purchases reasonably highly specified desktops with the expectation that they will be used not simply as terminals, but also that people will log in remotely to run jobs on them. It is not unreasonable for a single individual to be running jobs on half a dozen computers at once.

Where?

Firstly it should be noted that the computer in front of you is not always the best place for a job. There is only one of it, and it is not the biggest or fastest computer in the Group. However, if the task is more graphics-intensive than compute-intensive, it might well be the best place. Graphics do not transfer very fast across networks.

The local command rbusy exists to scan through computers to help find idle ones. It has various options, revealed with an argument such as --help.
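
For example (rbusy is a local TCM command, so its options and output will vary):

pc99:~$ rbusy           # scan the Group's computers for idle ones
pc99:~$ rbusy --help    # list the available options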

Data

Data can be placed in one of three places:

 * one's home directory, which should be used for anything smallish which needs backing up;
 * /scratch, the local disk, which is the better choice for jobs with significant I/O requirements;
 * /rscratch, a useful half-way house, which offers some of the convenience of a home directory with only some of the disadvantages of a local /scratch disk.

If you have a requirement for the long-term storage of large quantities of data (100GB and beyond), it may be beneficial to discuss individual arrangements additional to the above scheme.
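
Before starting a job with heavy I/O it is worth checking how much space a disk has free; df reports this (a standard Linux command, nothing TCM-specific):

pc99:~$ df -h /scratch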

Memory

Computers dislike running jobs which attempt to use more memory than they physically have. They will try to use their hard disks to simulate the extra memory, and can slow down enormously: a factor of a hundred or more is quite possible. In this state they become unusable for interactive work, so anyone attempting to use the machine as a desktop terminal will be upset.

There are many ways of checking whether too much memory is being used. One can compare the output of status with the sum of the sizes in the VIRT column of top's output (press q to exit top), amongst other methods. Note that the free memory reported by top, status or any other command will always be positive, so a positive value in itself proves nothing; if it is not at least 200MB, it is too low. If the computer is likely to be used as a terminal, then one should leave at least 2GB for the interactive user's processes.
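
For a quick check of a machine's memory, the standard free command summarises the position (the exact columns vary with the version installed; on recent systems the 'available' column is the most useful estimate):

pc99:~$ free -m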

CPU cores

If one runs more serial jobs on a computer than it has CPU cores, then the jobs start to slow down, but, to a first approximation, the factor by which they slow down is simply the number of jobs divided by the number of cores, and the total throughput is constant: eight serial jobs on a four-core machine will each run at about half speed.
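
The number of cores a machine has can be found with the standard nproc command:

pc99:~$ nproc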

For parallel jobs (MPI or OpenMP) which do a lot of communication, this need not be true. For two threads to communicate efficiently, they must run in timeslices at the same time on different CPU cores. If competing processes make it unlikely that two threads trying to communicate will receive timeslices which overlap well, performance may fall by a significant factor. I recently saw Castep, compiled in serial mode but with a threaded MKL library, slow down by a factor of five when there were just twice as many threads running as there were cores available (due to another job).

Note too that commands such as ps list processes, not threads, and the number of threads which a process has may vary during its execution. Currently we have sufficient computers that the advice is simply to take care when adding a job to a computer which is already running one.
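
To see individual threads rather than processes, ps can be asked explicitly (standard procps options; the LWP column identifies each thread, here filtering for a hypothetical castep job):

pc99:~$ ps -eLf | grep castep

Pressing H within top toggles a similar per-thread display.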

Logging Off

Long-running jobs do not absolve you from the good practice of logging off each evening. Well, long-running Mathematica and Matlab jobs might. But for simple text jobs (including Castep), it should be possible to launch them, log off, and for them to keep running.

pc99:/scratch/spqr1/work$ ./castep CO2 &> out.log &

Some will find that

pc99:/scratch/spqr1/work$ nohup ./castep CO2 &> out.log &

works rather better. This also means that one can launch jobs remotely from a laptop / home computer, and turn that off without disrupting the remote job.
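
After logging back in, one can check that a job is still alive by listing one's own processes (spqr1 being the example username from the prompts above):

pc99:~$ ps -fu spqr1 | grep castep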

nohup

Two things cause trouble for long-running jobs after logging out. The first is that the job may attempt to write to the terminal it was launched from. Once that terminal has been closed, this will produce an error, and may cause the process to exit; it is certain to cause any such output to be lost. Redirecting all output, including errors, to a file with &> file (bash's shorthand for > file 2>&1) avoids this problem.

The second problem is that, if the terminal from which the job was launched is closed before the shell running in it has exited, the job will probably be sent a signal, SIGHUP, to tell it that it no longer has a terminal connected to it. The default behaviour of a Fortran or C/C++ program on receiving SIGHUP is to exit immediately.

So if you always close your xterms (or equivalent) by typing exit (or equivalent) into the shell, the shell closes before the terminal, and no signal is sent. If you close terminals by using the buttons on the title bar, or simply logging out and letting the window manager, session manager or X server close them, then the terminal will be killed before the shell running in it, and SIGHUP is likely to be passed on to your process. The use of the nohup command stops this from happening.
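
Should one forget nohup when launching a job, a bash-specific alternative is to tell the shell afterwards not to pass SIGHUP on to it:

pc99:/scratch/spqr1/work$ ./castep CO2 &> out.log &
pc99:/scratch/spqr1/work$ disown -h %1

Here disown -h (a bash builtin) marks job 1 so that the shell will not send it SIGHUP on exit; output must still have been redirected away from the terminal, as above.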

Time

Computers do not run for ever. They suffer power cuts. They need rebooting to apply security patches. They may even crash. It is therefore sensible to ensure that long-running jobs (more than a few days) can be restarted if they are unexpectedly interrupted. This can be hard if one's job is running in Matlab or Mathematica. Codes such as Castep and Casino are much saner in this regard.

Enthusiasm

Finally, there is nothing wrong with using several computers at once. There is something wrong with hogging the whole of the Group's resources at once. An attempt at defining the dividing line can be found in our guidelines for the use of TCM's computers.