TCM's Computing Resources - Big jobs

The ideas of paging and virtual memory are common to many operating systems (almost all Unixes, Windows 3.x enhanced mode, Windows95, WinNT, OS/2, and MacOS too), yet are surprisingly little understood by most users.

The concept is simple: disk space is cheaper than RAM, so parts of programs which have not been used recently are moved out to disk to free up more memory. When they are required, they are paged back in. This process is transparent to the programmer, and the only way the user notices is because it is slow (and potentially noisy if you are sitting beside the machine!). It is not necessarily inefficient - that editor you are not currently using might just as well be paged out to disk at the expense of a couple of seconds delay when you want it back again.
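The idea of moving out whatever has not been used recently can be illustrated with a toy model. The sketch below assumes a strict least-recently-used (LRU) policy, which is a simplification chosen for illustration: real kernels use more elaborate approximations. The function name and the access patterns are invented for the example.

```python
from collections import OrderedDict

def count_page_faults(accesses, frames):
    """Count page faults for a sequence of page numbers under LRU eviction."""
    resident = OrderedDict()  # pages currently in RAM, least recently used first
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)  # a hit: mark as most recently used
            continue
        faults += 1  # a miss: the page must be read back in from disk
        if len(resident) >= frames:
            resident.popitem(last=False)  # page out the least recently used page
        resident[page] = True
    return faults

# A loop touching four pages in turn, repeated three times.
loop = [1, 2, 3, 4] * 3
print(count_page_faults(loop, frames=4))  # enough memory: only the initial loads fault
print(count_page_faults(loop, frames=3))  # one frame short: every single access faults
```

In this model, being just one page short of what the loop needs turns every access into a disk access, which is why paging can go from harmless to crippling so abruptly.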

The problem is that if excessive use is made of this feature, the machine will all but stop: sequential access from disk is over thirty times slower than from memory, and random access is about ten thousand times slower. Having a machine slow down by a factor of a thousand or so is not fun.
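A rough piece of arithmetic shows where a factor of a thousand can come from. The sketch below assumes a deliberately crude model in which every memory access costs either one unit (RAM) or a fixed penalty (disk). The penalties are the figures quoted above; the fractions are invented for illustration.

```python
def effective_slowdown(fraction_from_disk, disk_penalty):
    """Average cost per access relative to a machine that never pages."""
    if not 0.0 <= fraction_from_disk <= 1.0:
        raise ValueError("fraction_from_disk must lie between 0 and 1")
    return (1.0 - fraction_from_disk) + fraction_from_disk * disk_penalty

SEQUENTIAL_PENALTY = 30      # "over thirty times slower", so a lower bound
RANDOM_PENALTY = 10_000      # "about ten thousand times slower"

# Hypothetical fractions of accesses that have to be satisfied from disk.
for fraction in (0.001, 0.01, 0.1):
    slowdown = effective_slowdown(fraction, RANDOM_PENALTY)
    print(f"{fraction:6.1%} of accesses paged: about {slowdown:.0f} times slower")
```

On this model, one random access in ten going to disk is already enough for a thousand-fold slowdown, and even one in a thousand costs a factor of ten.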

Detecting the problem is, regrettably, more of an art than a science. Paging is too transparent... The following tips may help, though.

How much memory does it have?

This is a notoriously difficult question to answer under UNIX, but the commands free (on Linux) and status (TCM only) both report it.
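On Linux the same figure can also be read from the file /proc/meminfo. The following is a minimal sketch of a parser for that format. The sample text is hypothetical, with numbers invented for illustration, and the function name is not part of any standard tool.

```python
def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

# Hypothetical sample in the /proc/meminfo format; the numbers are invented.
SAMPLE = """MemTotal:        4052992 kB
MemFree:          144384 kB
SwapTotal:       8388604 kB
"""

info = parse_meminfo(SAMPLE)
print(info["MemTotal"] // 1024, "MB total,", info["MemFree"] // 1024, "MB free")
```

On a real Linux machine one would pass in the contents of the file instead, for example `parse_meminfo(open("/proc/meminfo").read())`.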

How much does it need?

For codes which do dynamic memory allocation, which means all 21st century code, there is no a priori way of telling, other than knowledge of the particular program. Of course, memory requirements may change as the code runs - what else does dynamic memory allocation mean?

How full is it?

pc6:~$ status
pc6: 3958Mb Core2 DC 2x2400MHz x86_64. Up 153 days, load 5.6, 141Mb free

This machine would seem to be unhappy. A mere 141MB free is far too low for happiness, and a load of 5.6 on a machine with just two cores seems bad too. If there is not at least 300MB free, interactive use is likely to suffer. Further investigation reveals:
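These rules of thumb can be written down as a small check. The sketch below is illustrative only: the 300MB threshold comes from the text above, whereas treating any load greater than the number of cores as bad is an assumption made for the example, as is the function name.

```python
def unhappy_reasons(load, cores, free_mb, min_free_mb=300):
    """Return a list of reasons why a machine looks overloaded (empty if none)."""
    reasons = []
    if load > cores:  # assumption: more runnable jobs than cores is a bad sign
        reasons.append(f"load {load} exceeds {cores} cores")
    if free_mb < min_free_mb:  # threshold quoted in the text
        reasons.append(f"only {free_mb}MB free, want at least {min_free_mb}MB")
    return reasons

# The figures reported by status for pc6.
for reason in unhappy_reasons(load=5.6, cores=2, free_mb=141):
    print(reason)
```

On a live machine the load and core count could be taken from `os.getloadavg()` and `os.cpu_count()` rather than typed in by hand.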

pc6:~$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
spqr1     3230 14.3 23.8 1332352 968696 ?      Rsl  Oct10 1246:23 /usr/local/sha
spqr1     3427 29.1  1.3 691744 53500 ?        Sl   Oct10 2518:49 /a/scratch/rai
spqr2     4037 57.9  1.2 555632 50716 pts/1    Sl+  Sep10 30251:48 /a/scratch/ra
spqr2     4116  0.0 18.0 2083312 730344 pts/1  S+   Sep10  25:27 /a/scratch/raid
spqr2     4478 44.3  0.2 621332  9436 pts/8    Sl+  Aug24 34005:40 /a/scratch/ra
spqr2     4539  0.0 46.7 6411620 1896524 pts/8 S+   Aug24  34:13 /a/scratch/raid

The RSS field (Resident Set Size) is the amount of physical memory a process is using. The sum of these for all processes cannot be more than the amount of memory in the machine (apart from pages shared between processes, which are counted once for each process using them). The VSZ field (Virtual SiZe) is the amount of virtual address space allocated to the process. It is an upper bound on the amount of physical memory that the process would like to have. It is not unreasonable for the sum of the VSZ fields to exceed the physical memory of a machine, but it is generally unreasonable for them to exceed the physical memory by such a large extent.
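The comparison can be done mechanically by summing the two columns. The sketch below reuses the ps output shown above, and assumes that ps reports sizes in units of 1024 bytes and that the 3958Mb reported by status means 3958 x 1024 of those units. The function name is invented for the example.

```python
PS_OUTPUT = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
spqr1     3230 14.3 23.8 1332352 968696 ?      Rsl  Oct10 1246:23 /usr/local/sha
spqr1     3427 29.1  1.3 691744 53500 ?        Sl   Oct10 2518:49 /a/scratch/rai
spqr2     4037 57.9  1.2 555632 50716 pts/1    Sl+  Sep10 30251:48 /a/scratch/ra
spqr2     4116  0.0 18.0 2083312 730344 pts/1  S+   Sep10  25:27 /a/scratch/raid
spqr2     4478 44.3  0.2 621332  9436 pts/8    Sl+  Aug24 34005:40 /a/scratch/ra
spqr2     4539  0.0 46.7 6411620 1896524 pts/8 S+   Aug24  34:13 /a/scratch/raid
"""

def memory_totals(ps_text):
    """Return (total VSZ, total RSS) in KB from `ps aux` style output."""
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    vsz_col, rss_col = header.index("VSZ"), header.index("RSS")
    total_vsz = total_rss = 0
    for line in lines[1:]:
        # The command is the last column and may itself contain spaces.
        fields = line.split(None, len(header) - 1)
        total_vsz += int(fields[vsz_col])
        total_rss += int(fields[rss_col])
    return total_vsz, total_rss

PHYSICAL_KB = 3958 * 1024  # assumption: the status figure is in units of 1024KB
vsz, rss = memory_totals(PS_OUTPUT)
print(f"RSS total: {rss / PHYSICAL_KB:.2f} x physical memory")
print(f"VSZ total: {vsz / PHYSICAL_KB:.2f} x physical memory")
```

For these six processes the resident sets add up to a little over nine tenths of the physical memory, while the virtual sizes add up to nearly three times it.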

The vmstat command reveals how inefficient this is:

pc6:~$ vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0 1370736  35720   3008 149380    0    2     2    11    1    1 24 33 43  0  0
 5  0 1370736  35580   3008 149424    0    0     0     5 2025 1155132 32 68  0  0  0
 5  0 1370736  35580   3008 149420    0    0     0     0 2029 1153635 31 69  0  0  0
 6  0 1370736  35580   3008 149420    0    0     0     0 2027 1168723 30 70  0  0  0
 5  0 1370736  35580   3016 149424    0    0     0     4 2027 1155107 32 68  0  0  0

The first line of numbers is an average since last boot, and not very interesting. The next four lines are averages of four five-second intervals. They show very little free memory (35580KB), under a third of the CPU time going to user jobs (us field), and over two thirds being spent in "system" tasks, i.e. in kernel code (sy field). Time spent idle waiting for I/O is reported separately, in the wa field. us should generally be well over 95% (or zero if the machine is idle), and sy under 5%.

The number of interrupts per second (in) is rather high and in danger of disturbing the formatting of vmstat's output. The interrupts will be associated with I/O, and one might hope for figures of around 200, not 2000. The number of context switches per second (cs), over a million, is very impressive. It should be a few thousand or less, and again is a symptom of lots of I/O being handled badly by the kernel.
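Reading these columns by eye becomes tedious, so the averaging can be scripted. The sketch below parses the vmstat output shown above for pc6 and discards the first sample, as it is the since-boot average. The 5% limit on sy comes from the text; the limit of 5000 context switches per second is an assumption standing in for "a few thousand".

```python
VMSTAT_OUTPUT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0 1370736  35720   3008 149380    0    2     2    11    1    1 24 33 43  0  0
 5  0 1370736  35580   3008 149424    0    0     0     5 2025 1155132 32 68  0  0  0
 5  0 1370736  35580   3008 149420    0    0     0     0 2029 1153635 31 69  0  0  0
 6  0 1370736  35580   3008 149420    0    0     0     0 2027 1168723 30 70  0  0  0
 5  0 1370736  35580   3016 149424    0    0     0     4 2027 1155107 32 68  0  0  0
"""

def parse_vmstat(text):
    """Return one {column: value} dict per sample line of vmstat output."""
    lines = text.strip().splitlines()
    names = lines[1].split()  # the second header line holds the column names
    return [dict(zip(names, map(int, line.split()))) for line in lines[2:]]

def mean(samples, field):
    """Average of one column over the given samples."""
    return sum(sample[field] for sample in samples) / len(samples)

samples = parse_vmstat(VMSTAT_OUTPUT)[1:]  # drop the since-boot average
for field in ("us", "sy", "in", "cs"):
    print(f"{field}: {mean(samples, field):.2f}")

MAX_SY = 5      # per cent, from the text
MAX_CS = 5000   # per second, an assumed reading of "a few thousand"
if mean(samples, "sy") > MAX_SY or mean(samples, "cs") > MAX_CS:
    print("this machine is spending too much time in the kernel")
```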

In this case, it is likely that if these processes had been run on a machine with 12GB not 4GB, but otherwise identical, they would have run three times faster, and interactive users would not have noticed their presence.

An example of a happy machine would be:

pc54:~$ vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0  35592 13414456 185020 1383856    0    2     6    14    1    5 38  1 61  0  0
 3  0  35592 13414520 185020 1383852    0    0     0     0 3214  614 75  0 25  0  0
 3  0  35592 13414644 185028 1383852    0    0     0    10 3188  620 75  0 24  0  0
 3  0  35592 13414676 185028 1383852    0    0     0    14 3136  614 75  0 25  0  0
 3  0  35592 13414644 185028 1383852    0    0     0     1 3071  618 75  0 25  0  0

This is a quad-core machine running only three processes. Hence it reports itself as 25% idle, with 75% of its time spent in user code and 0% in system calls. The formatting of the output of vmstat has been destroyed by the enormous amount of free memory the machine has.

Unfortunately, it is difficult to distinguish excessive paging from innocent I/O using vmstat. However, even "innocent" I/O might seem less than innocent to anyone using the computer interactively. Well-designed programs try to avoid I/O as much as possible.

nice

The commands nice and renice reduce the priority of a job and mark it as non-interactive for scheduling (if the scheduler is intelligent). Both therefore allow the machine to be used more efficiently and reduce the impact of background jobs on interactive use. Simply using nice as a prefix when launching a program has this effect. Unfortunately, priority can only be lowered using nice - it is not possible for a user to raise it to what it was before!
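When jobs are launched from a script rather than typed by hand, the prefix can be added automatically. The sketch below assumes the standard `nice -n increment command` form. The function names, the default increment of 10 and the program name ./bigjob are all invented for illustration.

```python
import subprocess

def niced_command(cmd, increment=10):
    """Return cmd prefixed so that it starts with its niceness raised by increment."""
    if increment < 0:
        # An ordinary user may only lower priority, i.e. raise niceness.
        raise ValueError("increment must not be negative")
    return ["nice", "-n", str(increment), *cmd]

def run_niced(cmd, increment=10):
    """Run cmd at reduced priority and wait for it to finish."""
    return subprocess.run(niced_command(cmd, increment), check=True)

# "./bigjob" is a hypothetical program name.
print(niced_command(["./bigjob", "input.dat"]))
```

Calling `run_niced(["./bigjob", "input.dat"])` would then start the job at reduced priority from the outset, which matters because the user cannot undo an overly generous priority later.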

The summary

Unfortunately paging does matter: a machine paging excessively will be inefficient, and very annoying for anyone trying to use it interactively, as it will "freeze" for about thirty seconds from time to time. The art of telling how heavily loaded a machine is can best be acquired by practice, or, in extremis, by calling round to listen to its hard disk...

It is unfair to blame Linux too heavily for this mess. MacOS X seems at least as bad. Tru64 was no better in its default configuration (but rather more configurable).

TCM, October 2012