Using the new ARC/HTC Environment
Contents
Introduction
Operating system
The new ARC/HTC systems both use CentOS 8.1 as their main operating system. This is an upgrade from CentOS 6.6 on ARCUS-B and CentOS 7.7 on ARCUS-HTC. See the Application Software & Modules section for details of the software environment.
Note that Singularity is installed as part of the operating system; there is no longer a requirement to load it as a separate module.
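You can confirm this with a quick check on a login or interactive node (no module load required; the version reported will depend on the installed release):

singularity --version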
Cluster Description
ARC operates two compute clusters: arc, our parallel workloads system and the replacement for ARCUS-B, and htc, our High Throughput cluster and the replacement for ARCUS-HTC. The main differences between them are:
| Cluster | Description | Login Node | Compute Nodes | Minimum Job Size | Notes |
|---|---|---|---|---|---|
| arc | Replacement for ARCUS-B. Our largest compute cluster, optimised for large parallel jobs spanning multiple nodes. The scheduler prefers large jobs. Offers a low-latency interconnect (Mellanox HDR 100). | arc-login | CPU: 48-core Cascade Lake (Intel Xeon Platinum 8268 CPU @ 2.90GHz); Memory: 392GB | 1 core | Non-blocking island size is 2212 cores. |
| htc | Replacement for ARCUS-HTC. Optimised for single-core jobs and for SMP jobs up to one node in size. The scheduler prefers small jobs. Also caters for jobs requiring resources other than CPU cores (e.g. GPUs). | htc-login | CPUs: mix of Broadwell, Haswell, Cascade Lake; GPUs: P100, V100, A100, RTX; Novel architectures: KNL | 1 core | Jobs will only be scheduled onto a GPU node if a GPU resource is requested. |
Node Types
Previously, on ARCUS-B/HTC, login nodes were used both for the preparation/submission of batch jobs and for the pre/post-processing of application data. On ARC/HTC, different node types are used for these workflows:
Login nodes
These are only to be used for accessing the cluster and submitting jobs. Login nodes are not designed for building software or running compute/analysis tasks: they do not have the same CPU architecture as the compute nodes, nor do they run the same operating system. Please use the interactive nodes for any software builds (see below).
We have an explicit policy that user processes can only use a maximum of 1 hour of CPU time on login nodes.
Interactive nodes
These nodes should be used for pre/post processing of data and for building software to be used on the ARC clusters. See the section on interactive jobs for more information.
Compute nodes
Typical arc compute nodes have 48 cores and 375GB of memory per node available to jobs. This is an increase over what was available on ARCUS-B.
Access
The clusters can be accessed by SSH connection to the login nodes (arc-login or htc-login) from the Oxford University network (including VPN).
Access from outside the University network is via the ARC SSH gateway server: gateway.arc.ox.ac.uk.
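For example, connections could be made as follows (a sketch; the fully qualified login node hostname and the ProxyJump approach are assumptions and may need adjusting for your SSH client, with <username> replaced by your ARC username):

# From within the University network (or VPN): connect directly to a login node
ssh <username>@arc-login.arc.ox.ac.uk

# From outside the University network: jump via the ARC SSH gateway
ssh -J <username>@gateway.arc.ox.ac.uk <username>@arc-login.arc.ox.ac.uk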
Scheduler
The ARC/HTC systems use SLURM as their resource manager (or scheduler). This is the same system used on ARCUS-B and ARCUS-HTC, so users will be familiar with its commands and submission script syntax.
As a reminder, to do work on ARC's clusters you will need to submit a job to the job scheduler; the login nodes are for preparing and submitting scheduler jobs and should not be used for performing computational work. If you need to run interactive computational work, such as pre/post-processing data or building your own code, this must be performed on the interactive nodes.
Unlike on ARCUS-B, nodes on ARC/HTC are not allocated exclusively to jobs; jobs are allocated the requested number of cores and may share nodes with other jobs. The default number of cores allocated is 1 (as is the default number of nodes), and the default amount of memory per CPU is 8000 MB. You will not be able to use resources you have not requested in your job submission; this includes memory and CPU cores.
Thus if you need more than 1 CPU core, you will need to explicitly ask for them. At its simplest this can be specified by requesting a specific number of tasks, e.g.:
#SBATCH --ntasks-per-node=8
to request 8 tasks.
For MPI job submissions this would normally be changed to asking for a number of nodes and specifying the number of tasks per node, e.g.:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
to request two nodes with 48 tasks each.
For a hybrid MPI/OpenMP job, where an MPI task spawns multiple CPU threads, the specification also needs to state how many CPUs per task the job will need. For example, to request 2 nodes with two MPI tasks per node, each starting 24 compute threads, you need to request:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24
The default number of CPUs per task is 1.
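For hybrid jobs it is also common to pass the per-task CPU count on to the threading runtime. A minimal sketch, placed after the directives above and assuming an OpenMP-based application with the hypothetical name my_hybrid_app:

# One OpenMP thread per CPU allocated to each MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./my_hybrid_app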
It is possible to request exclusive access to a node by adding "--exclusive" to your sbatch command or the following line to your submit script:
#SBATCH --exclusive
However, we strongly advise being specific about required resources rather than using exclusive node access; homogeneity of resources (or CPU features) cannot be assumed. This is especially true on the htc cluster, or when submitting to multiple clusters (see the Job Scheduling section).
This is a very short overview; SLURM offers various ways to specify resource requirements; please see 'man sbatch' for details.
Partitions
Both clusters have the following time-based scheduling partitions available:
- short (default run time 1hr, maximum run time 12hrs)
- medium (default run time 12hrs, maximum run time 48hrs)
- long (default run time 24hrs, no run time limit)
- devel (maximum run time 10 minutes - for batch job testing only)
- interactive (default run time 1hr, maximum run time 4hrs, can oversubscribe, for pre/post-processing and building software)
Jobs in the short and medium partitions are scheduled with higher priority than those in the long partition; however, they will not be able to run for longer than the time allowed on those partitions.
On the previous clusters (ARCUS-B, ARCUS-HTC), users who wanted to submit long-running jobs needed to submit the jobs to the scheduler specifying an acceptable time limit and then, once the job had started running, request that the job's walltime be extended. On the new ARC/HTC clusters this is no longer required; users can submit jobs with long time limits to the long partition. Note: the default time limit on the long partition is 1 day; users must specify a time limit if a longer runtime is required.
We will no longer extend jobs.
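For example, a job expected to run for five days could be submitted directly to the long partition with an explicit time limit (a minimal sketch; the five-day value is illustrative):

#SBATCH --partition=long
#SBATCH --time=5-00:00:00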
The htc cluster has an additional partition available named legacy. This partition contains a number of nodes which have CentOS 7.7 installed in order to maintain compatibility with some legacy commercial applications. Access to the legacy partition is restricted to users with a requirement to use legacy software and will be enabled by the ARC team for specific users when it has been demonstrated that using a more recent version of the software application is not possible.
Job Scheduling
By default, jobs are scheduled based on the login node you are using: if you are logged into arc-login, the jobs you submit will be queued on the arc cluster; if you are logged into htc-login, jobs will be queued on the htc cluster.
However, both clusters are accessible from either login node, and the target cluster can be specified by passing the --clusters=arc or --clusters=htc SLURM option. Additionally, squeue can report the status of jobs on either cluster or both (using the option --clusters=all).
It is possible for jobs to target either cluster or both clusters using the --clusters specification in job scripts, for example:
#SBATCH --clusters=arc
or
#SBATCH --clusters=htc
or
#SBATCH --clusters=all
If submitted with --clusters=all, a job will simply be run on the first available resource, regardless of which cluster this is on.
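For example, to list your own jobs across both clusters from either login node (a minimal sketch; $USER expands to your username):

squeue --clusters=all -u $USER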
Submission Scripts
As an example, to request two compute nodes running 48 MPI processes per node, with one CPU per task (the default), 2GB of memory per CPU, and a two-hour wall time, the following submission script would be used:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --mem-per-cpu=2G
#SBATCH --time=02:00:00
#SBATCH --job-name=myjob
#SBATCH --partition=short

module load mpitest/1.0

mpirun mpihello
To request a single core for 10 minutes, with one task on the node (and one CPU per task), requiring 8GB memory, a typical submission script would be:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --job-name=single_core
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=8G
#SBATCH --partition=short

module purge
module load testapp/1.0

# Calculate number of primes from 2 to 10000
prime 2 10000
Interactive Jobs
To start an interactive session, you need to use the srun command, e.g.
srun -p interactive --pty /bin/bash
or for a session that allows graphical interfaces (via X forwarding):
srun -p interactive --x11 --pty /bin/bash
This allocates one core on an interactive node and gives you a shell on that node. Multiple cores, memory, or other resources can be requested in the same way as for sbatch.
Exiting the shell ends the job; the job will also be aborted once it exceeds its time limit.
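For example, a larger interactive session could be requested as follows (a sketch; the resource amounts are illustrative):

srun -p interactive --cpus-per-task=4 --mem-per-cpu=8G --time=04:00:00 --pty /bin/bash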
GPU Resources
The most basic way you can access a GPU is by requesting a GPU device using the gres option in your submission script:
#SBATCH --gres=gpu:1
You may also request a specific type of GPU device, for example:
#SBATCH --gres=gpu:v100:1
to request one V100 device, or:
#SBATCH --gres=gpu:rtx8000:2
to request two RTX8000 devices. Available devices are P100, V100, RTX (Titan RTX), RTX8000, and A100.
Alternatively you can request a GPU (--gres=gpu:1) and specify the type via a constraint on the GPU SKU, generation, compute capability, memory, or NVLink availability:
#SBATCH --gres=gpu:1 --constraint='gpu_sku:V100'
#SBATCH --gres=gpu:1 --constraint='gpu_gen:Pascal'
#SBATCH --gres=gpu:1 --constraint='gpu_cc:3.7'
#SBATCH --gres=gpu:1 --constraint='gpu_mem:32GB'
#SBATCH --gres=gpu:1 --constraint='nvlink:2.0'
Configured GPU-related constraints are:

| Constraint | Description |
|---|---|
| gpu_gen | GPU generation (Pascal, Volta, Turing, Ampere) |
| gpu_sku | GPU model (P100, V100, RTX, RTX8000, A100) |
| gpu_cc | CUDA compute capability |
| gpu_mem | GPU memory |
| nvlink | device has NVLink; the constraint exists both in simple form (-C nvlink) and with a version (-C 'nvlink:2.0') |
For details on available options/combinations see the table of available GPUs.
Please note that co-investment GPU nodes are limited to the short partition, i.e. the maximum job run time is 12 hours. No such restrictions apply to ARC-owned GPUs. See the table of available GPUs for more information.
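Putting this together, a minimal GPU submission script might look like the following sketch; the CUDA module name and the application binary are illustrative placeholders (use module spider CUDA to find the versions actually installed):

#!/bin/bash
#SBATCH --partition=short
#SBATCH --time=06:00:00
#SBATCH --job-name=gpu_job
#SBATCH --gres=gpu:v100:1

# Illustrative module name; check 'module spider CUDA' for available versions
module load CUDA/11.0.2-GCC-9.3.0

./my_gpu_application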
Application Software & Modules
The ARC/HTC software environment comprises a mixture of commercial applications, software built using the EasyBuild framework and software built using our own local build recipes. As with ARCUS-B/HTC we use the environment modules system (via the module command) to load applications into the environment on ARC/HTC.
The application module names have changed on ARC/HTC. You will therefore need to look up the new module name to include in your submission script; the best way to search for an application is with the module spider command. For example, to search for the GROMACS application:
module spider gromacs
------------------------------------------------------------------------------------------------------------------------------
  GROMACS:
------------------------------------------------------------------------------------------------------------------------------
    Description:
      GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for
      systems with hundreds to millions of particles. This is a CPU only build, containing both MPI and threadMPI builds.

     Versions:
        GROMACS/2020-fosscuda-2019b
        GROMACS/2020.4-foss-2020a-PLUMED-2.6.2
        GROMACS/2020.4-foss-2020a

------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "GROMACS" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider GROMACS/2020.4-foss-2020a
------------------------------------------------------------------------------------------------------------------------------
The module spider command gives you a list of available GROMACS packages. Please note, module spider is NOT case-sensitive for searching, so:
module spider GROMACS
module spider gromacs
module spider Gromacs
... are all equivalent. However, when loading the module using module load you must use the correct case, e.g.
module load GROMACS/2020.4-foss-2020a
You will find more detailed advice on how to run some of the more popular applications under the applications & software section of our support pages.
You can also build your own software in your home or data directories using one of the compilers provided (these are also available through the environment modules system). Typically, the compiler toolchains, including maths libraries and MPI, can be loaded using the modules named foss (e.g. foss/2020a) for free open-source software (i.e. GCC) or intel (e.g. intel/2020a) for the Intel compiler suite. For example, loading foss/2020a includes the following modules:
module load foss/2020a
module list

Currently Loaded Modules:
  1) GCCcore/9.3.0                  4) GCC/9.3.0                        7) libxml2/2.9.10-GCCcore-9.3.0      10) OpenMPI/4.0.3-GCC-9.3.0    13) FFTW/3.3.8-gompi-2020a
  2) zlib/1.2.11-GCCcore-9.3.0      5) numactl/2.0.13-GCCcore-9.3.0     8) libpciaccess/0.16-GCCcore-9.3.0   11) OpenBLAS/0.3.9-GCC-9.3.0   14) ScaLAPACK/2.1.0-gompi-2020a
  3) binutils/2.34-GCCcore-9.3.0    6) XZ/5.2.5-GCCcore-9.3.0           9) hwloc/2.2.0-GCCcore-9.3.0         12) gompi/2020a                15) foss/2020a
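With a toolchain loaded you can compile your own code against it. A minimal sketch, assuming an MPI C source file with the hypothetical name hello.c:

module load foss/2020a
# mpicc wraps GCC and links against the OpenMPI library from the loaded toolchain
mpicc -O2 -o hello hello.c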
Some commercial applications only support CentOS 8.1 at their latest release level, so only the newest versions of these applications are loaded in the main software repository on ARC/HTC. To give some backwards compatibility we have a small number of CentOS 7.7 nodes which can use older releases of the affected applications (also see Partitions section).
Storage
The example submission script below stages input data from $DATA to $SCRATCH, runs the software, and copies the results back to $DATA once the run has finished:

#!/bin/bash
cd $SCRATCH || exit 1

rsync -av $DATA/myproject/input ./
rsync -av $DATA/myproject/bin ./

module load foss/2020b

mpirun ./bin/my_software

rsync -av --exclude=input --exclude=bin ./ $DATA/myproject/
This example copies the directories '$DATA/myproject/input' and '$DATA/myproject/bin' into $SCRATCH (which will then contain the directories 'input' and 'bin'), runs './bin/my_software', and copies all files in the $SCRATCH directory - excluding the directories 'input' and 'bin' - back to $DATA/myproject/ once the mpirun finishes.