Clusters
TCM currently runs two clusters: cluster for RJN and BM, and cluster2 for the Winton Fellows (AJM, AC, AAL). They are (almost) identically configured, and this page gives a brief description of how to use them.
Both consist of a number of dual-socket compute nodes, named node1 to nodeX, which hide behind a master or head node. The master node provides home directories, application directories, and the single point of entry for logging in to the compute nodes. The master node is not intended for running jobs other than compilations and trivial analysis. The network between the master and the compute nodes is entirely isolated from the rest of TCM.
Neither cluster imports one's TCM home directory; each uses its own local home directory. This is not backed up -- it is about as (in)secure as /rscratch. However, it is bigger than it would be if it were backed up. Each compute node has a /scratch disk that should be used when running any jobs that require I/O.
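For instance, an I/O-heavy job's working files might live in a per-user directory on the node's /scratch (the username and program name here are purely illustrative):
node3:~$ mkdir -p /scratch/abc123
node3:~$ cd /scratch/abc123
node3:/scratch/abc123$ ~/bin/my_code > run.log &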
Cluster runs Ubuntu 18.04, and cluster2 runs OpenSuSE 13.1. Both are configured almost identically to TCM's desktops. However, /rscratch does not exist. Another difference is that the rbusy command will iterate over the compute nodes, not the TCM desktops.
Currently both their password files and their applications are synchronised with TCM manually and irregularly.
This means that if one changes one's password, one should do so on both the TCM desktops and the cluster(s). Otherwise, at some point the cluster password will be unexpectedly reset to the desktop one.
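In other words, run passwd in both places, once on a TCM desktop and once on the relevant cluster's head node (the desktop hostname below is illustrative):
tcmpc:~$ passwd
cluster2:~$ passwd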
Hardware
See this separate page of cluster hardware.
Queueing
There is none. It is hoped that these clusters will have a small number of users who will be able to co-exist peacefully and efficiently.
On the Winton cluster only, the command nkill allows users to kill other users' jobs. It must be run from the desired node, with syntax nkill process_id, which will search for and destroy any process launched in the same session as the supplied process_id, whilst preserving shells. On nodes owned by AJM only people in the morris group can run it, and on nodes owned by AC only people in the chin group can run it.
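For example, to remove a runaway job on node3 (the process id here is invented), log in to that node, find one of the offending processes, and pass its PID to nkill:
node3:~$ ps aux | less
node3:~$ nkill 12345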
Extra commands
On the head node only, the ssh_pc and scp_pc commands iterate over the compute nodes, not TCM's PCs.
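Assuming ssh_pc takes a command to run on each machine in turn, as its desktop counterpart does, a quick load check across the compute nodes might look like this sketch:
cluster:~$ ssh_pc uptime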
On any node the command ipmi_sensors lists the current values of various temperature and fan rpm sensors. It is equivalent to typing ipmitool sensors as root.
On any node the command ipmi_log lists the complete log kept by the remote management module. It is equivalent to typing ipmitool sel list as root. Its long output may be best piped into less, and will show corrected DIMM ECC errors, power loss events, and other hardware issues. Note that on the Winton cluster the timestamps are unsynchronised, but the error is less than a minute. On RJN's cluster, the clocks refuse to be set, and are about a decade out, simply counting from 1/1/2000.
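For example, to check a node's temperatures and fans and then page through its hardware event log:
node7:~$ ipmi_sensors
node7:~$ ipmi_log | less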
Farming
It is assumed that most of the use of the nodes will be as a compute farm, a little like the desktop machines in TCM. Each has its own /scratch disk, and one can log in directly to the nodes via ssh from the corresponding master. The one thing which will not work on the compute nodes is compilation using Intel's compilers, due to complications involving contacting licence servers over the network. The master node should be a good place for compiling.
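A minimal sketch of this workflow, with hypothetical file and directory names: compile on the master (here with Intel's ifort, which will not work on the compute nodes), then ssh to a node and run from its /scratch as described above:
cluster:~/src$ ifort -O2 my_code.f90 -o my_code
cluster:~/src$ ssh node5
node5:~$ cp ~/src/my_code /scratch/abc123/
node5:~$ cd /scratch/abc123
node5:/scratch/abc123$ ./my_code > run.log &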
MPI between nodes
Are you sure you wish to do this? The network is only 1GBit/s.
If you must, set the environment variable MPI to OpenMPIt (OpenMPI with TCP) when compiling, linking, and using mpirun.
cluster:~/pingpong$ export MPI=OpenMPIt
cluster:~/pingpong$ mpifort pingpong.f
cluster:~/pingpong$ mpiman mpirun
[...]
cluster:~/pingpong$ mpirun -np 2 -host node7,node8 ./a.out
[output suggesting 18us latency, 230MB/s bandwidth]
cluster:~/pingpong$ ssh node7
node7:~$ cd pingpong/
node7:~/pingpong$ mpirun -np 2 ./a.out
[output suggesting 0.5us latency, 3.1GB/s bandwidth]
For large transfers (>64KB) OpenMPI has found and used both ethernet networks on this cluster. To keep one reserved for NFS, as originally intended,
cluster:~/pingpong$ mpirun -np 2 -host node7,node8 --mca btl_tcp_if_include eth0 ./a.out
The ultimate bandwidth is now 117MB/s, but the story for small transfers is complicated. Latency jumps to 26us, yet for transfer sizes of 8KB to 32KB the single link is actually faster than the pair, by 50% at 8KB.
Castep
Running Castep on a single node writing to the local scratch disk should be fine and sane. It could be run thus with up to eight MPI processes, but note that, for a given calculation, the total memory requirement will increase with increasing numbers of MPI processes, due to the replication of data.
Castep spends a significant amount of time in BLAS-like libraries, more so if Intel's MKL is used for the FFTs. These parallelise using OpenMP threads. It is not clear to me which of
mpirun -np 8 ./castep biggish
and
OMP_NUM_THREADS=2 mpirun -np 4 ./castep biggish
would give the best performance (on an eight-core node), though the second should use slightly less memory.
For larger Castep jobs, using k-point parallelism between nodes is probably sensible, but possibly not g-vector parallelisation between nodes.
For a rather small example of 64-atom silicon, symmetrised, with 4 k-points, running on a Nehalem node of RJN's cluster:
One core: 214.5s
One node (OMP_NUM_THREADS=8): 100.2s
Four nodes (-np 16 -npernode 4 -bynode): 47.5s
One node (-np 8): 46.5s
Four nodes (-np 16 -npernode 4 -loadbalance): 21.1s
Four nodes (-np 32 -npernode 8 -loadbalance): 15.4s
The option -loadbalance should be the default.
The "four node" runs additionally had the option
-host node5,node6,node7,node8
and were run on node8.
The binary used in this test can currently be found as /rscratch/CASTEP/bin/6.1.1/castep-6.1.1_MPIt_ifort12_mklfft, and remember you will need to set MPI to OpenMPIt to run it.
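Since /rscratch is not mounted on the clusters, the binary first has to be copied across from a desktop (this assumes the head node is reachable as cluster from the desktops), and MPI set before launching it:
tcmpc:~$ scp /rscratch/CASTEP/bin/6.1.1/castep-6.1.1_MPIt_ifort12_mklfft cluster:
cluster:~$ export MPI=OpenMPIt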
(A Sandy Bridge node of the Winton cluster managed 144.2s on one core, and 20.0s on one node (-np 16).)
Very bad things happen if the product of the number of MPI processes per node and the number of OMP threads per process exceeds the number of cores. How bad? The time for the final run above rises from 15.4s to 1802.6s if OMP_NUM_THREADS is set to 8. Yes, over a hundred times slower. Currently the mpirun command always adds -x OMP_NUM_THREADS to its options to ensure that this variable is set to the same value on all nodes; if unset, it defaults to 1.
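So, to reproduce the fastest run above safely, with eight processes on each eight-core node, the thread count should be pinned to one explicitly (the seedname and working directory are illustrative):
node8:~/si64$ export OMP_NUM_THREADS=1
node8:~/si64$ mpirun -np 32 -npernode 8 -loadbalance -host node5,node6,node7,node8 ./castep si64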
It is important to keep k-points wholly on a node. Timings for the above 4 k-point run when spread across 2, 3, 4 and 5 nodes, always with 8 processes per node, were 25.6s, 34.1s, 15.3s and 51.5s, i.e. fast for the cases (2 and 4 nodes) for which k-points were not split between nodes. With a better interconnect, this would matter less.
For ideal performance, care should be taken over where Castep writes its temporary files "fortoR7wsx" etc, and its large .check file. The local /scratch disk will be faster than the remote home directory.
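One simple recipe (usernames, paths and seedname all illustrative) is to copy the input files to the node's /scratch, run there so the temporary and .check files stay local, and copy only the results home afterwards:
node7:~$ mkdir -p /scratch/abc123/si64
node7:~$ cp ~/si64/si64.cell ~/si64/si64.param /scratch/abc123/si64/
node7:~$ cd /scratch/abc123/si64
node7:/scratch/abc123/si64$ mpirun -np 8 ~/bin/castep si64
node7:/scratch/abc123/si64$ cp si64.castep si64.check ~/si64/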