Tier-1 Zenobe quickstart guide
Note: this information is obsolete. Zenobe has been replaced by Lucia.
Access
Access to Zenobe is only granted to users who submitted a project as described here.
Zenobe is accessible directly only from a CÉCI university network:
$ ssh -i ~/.ssh/id_rsa.ceci my_ceci_login@zenobe.hpc.cenaero.be
Make sure to configure your SSH client as for the other CÉCI clusters, so that you do not have to specify your SSH key and login each time and can simply type:
$ ssh zenobe
From other locations, access to Zenobe is possible through a gateway named hpc.cenaero.be. Your CÉCI key is installed on the gateway, and access from the gateway to Zenobe itself is automatic.
Manual process
The two-step process is typically as follows:
$ ssh -i ~/.ssh/id_rsa.ceci my_ceci_login@hpc.cenaero.be
which leads to
================================================================================
You're connected to hpc.cenaero.be - Authorized access only!
================================================================================
Dear HPC user,
You're connected to the new hpc.cenaero.be gateway, as always, please limit
file storage on this server to transient transfer only though this is enforced
by user quota (soft limit: 10GB - hard limit: 40GB - grace period: 7 days).
Thank you for you collaboration,
The Cenaero HPC & Infrastructure team.
--------------------------------------------------------------------------------
Last updated on Feb. 21, 2014
================================================================================
$
and then
$ ssh zenobe
so that you arrive on Zenobe:
Last login: Wed Jan 14 14:17:48 2015 from hades.cenaero.be
================================================================================
dMMMMMP dMMMMMP dMMMMb .aMMMb dMMMMb dMMMMMP
.dMP" dMP dMP dMP dMP dMP dMP dMP dMP
.dMP" dMMMP dMP dMP dMP dMP dMMMMK´ dMMMP
.dMP" dMP dMP dMP dMP aMP dMP aMF dMP
dMMMMMP dMMMMMP dMP dMP VMMMP" dMMMMP" dMMMMMP
dMMMMMMMMMMMMMMMMMMMMMMP" frontal node dMMMMMP"
Warning : Only authorized users may access this system.
================================================================================
Automated process
The process can be automated by properly configuring your SSH client. Simply add the following to your ~/.ssh/config file:
# Configuration for the gateway access using the CÉCI key
Host hades
User YOURCECILOGIN
ForwardX11 yes
ForwardAgent yes
Hostname hpc.cenaero.be
IdentityFile ~/.ssh/id_rsa.ceci
# Configuration for the transparent access to zenobe through hades
Host zenobe
User YOURCECILOGIN
Hostname zenobe.hpc.cenaero.be
ForwardX11 yes
ForwardAgent yes
IdentityFile ~/.ssh/id_rsa.ceci
ProxyCommand ssh -q hades nc %h %p
Make sure to replace YOURCECILOGIN with your actual CÉCI login (twice). Once done, you can simply issue the command
$ ssh zenobe
to connect to the supercomputer. Make sure your SSH agent is correctly set up (see here for Linux, or here for Windows) to avoid having to enter your passphrase each time you connect.
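On Linux, for instance, a minimal agent setup for the current shell session could look like the following sketch (standard OpenSSH commands; the key path matches the configuration above):
$ eval $(ssh-agent)            # start an agent for this shell session if none is running
$ ssh-add ~/.ssh/id_rsa.ceci   # add the CÉCI key; the passphrase is asked only once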
With such a configuration, you can also simply transfer files with scp or rsync:
$ echo 'test file' > ./myfile
$ scp myfile zenobe:
myfile 100% 10 0.0KB/s 00:00
$ ssh zenobe
Last login: Fri Jan 16 11:59:54 2015 from hades.cenaero.be
[...]
$ ls myfile
myfile
$
or run commands directly:
$ ssh zenobe hostname
frontal1
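The rsync command mentioned above goes through the same configuration; a hedged example, with hypothetical local and remote paths:
$ rsync -av ./results/ zenobe:results/
The trailing slash on the local path makes rsync copy the directory contents rather than the directory itself.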
Disk space
You have read/write access to four directories, in the following filesystems:
- /home: for users' personal codes, scripts, configuration files, and small datasets (quota 50GB)
- /project: for data and results that are stored for the whole project duration (for current usage and/or quota: contact the support team -- see below)
- /SCRATCH: for temporary results during the course of one job. Users have access to /SCRATCH/acad/projectname and /SCRATCH/primarygroup (see your primary group with the groups command). You can get your current usage and quota with mmlsquota -g projectname or mmlsquota -g primarygroup.
Note that the setgid bit is set on the project and scratch directories. This ensures that data you place in those directories is owned by the project group rather than your own personal group. In ls -l listings, those directories appear with an s rather than an x for the group permissions. This bit needs to be set on the subdirectories too. If you have removed that bit, you can set it back with chmod g+s <dir_name>. You can also use the newgrp <group_name> command to set the default group on all files you create in the current session.
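For instance, assuming a hypothetical project group named myproject:
$ chmod g+s /SCRATCH/acad/myproject/my_subdir   # restore the setgid bit on a subdirectory
$ ls -ld /SCRATCH/acad/myproject/my_subdir      # the group permissions now show 's' instead of 'x'
$ newgrp myproject                              # files created in this session now get the project group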
Job preparation
Zenobe does not use Slurm as its resource manager. Jobs are orchestrated by PBSPro version 13. The main differences are listed below. Note that what Slurm calls a 'partition' is called a 'queue' in PBS.
Commands
Getting general information about the queues is done with either qstat -q (resources and limits) or qstat -Q (jobs).
All jobs are listed with qstat, and you can list only your own jobs with the -u $USER parameter. To see full information about a specific job, use qstat -f jobid, and qstat -awT to get an estimation of when a pending job should start (equivalent to squeue --start).
All nodes and their features can be listed with pbsnodes -a, and information about down nodes is available through pbsnodes -l.
Jobs are submitted with qsub and canceled with qdel.
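A typical submit/monitor/cancel sequence therefore looks like the following sketch (the script name job.pbs and the job id 1234567 are placeholders):
$ qsub job.pbs              # submit the script; PBS prints the job id
$ qstat -u $USER            # list only your own jobs
$ qstat -f 1234567          # full information about one job
$ qdel 1234567              # cancel it if needed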
To compile your program, use preferably the following modules:
module load compiler/intel/2015.5.223
module load mkl/lp64/11.2.4.223
module load intelmpi/5.0.3.049/64
First compile with -O1 and check the results; only then use -O2 or -O3, and make sure there is no regression.
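As an illustration, a hedged compilation sketch using the modules above (the source file my_code.c and the Intel MPI wrapper mpiicc are assumptions; adapt to your own code and language):
$ module load compiler/intel/2015.5.223 mkl/lp64/11.2.4.223 intelmpi/5.0.3.049/64
$ mpiicc -O1 -o my_app my_code.c    # first build at -O1 and validate the results
$ mpiicc -O3 -o my_app my_code.c    # then rebuild with higher optimization and check for regressions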
Queues
Zenobe offers two queues for CÉCI users:
- large: this is the queue with the largest number of nodes, equipped with Ivybridge processors (8208 cores in total, 24 cores per node). Jobs there are limited to a 24-hour walltime, and must use at least 96 CPUs and at most 4320. Nodes are allocated exclusively to jobs on that queue (no node sharing). Select it with #PBS -q large, and make sure to use at most 2625MB of memory per core to respect the general RAM/core ratio of the nodes on that queue (see the example script after this list).
- main: this is the queue with 5760 last-generation Haswell cores (24 cores per node), without the time limitations. Select it with #PBS -q main.
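For instance, a minimal submission script for the large queue could look like the following sketch (the job name, email address, module versions and executable ./my_app are placeholders; 96 cores = 4 nodes of 24 cores, with 24 x 2625MB = 63000MB per node):
#!/bin/bash
#PBS -q large
#PBS -N my_job
#PBS -l select=4:ncpus=24:mpiprocs=24:mem=63000mb
#PBS -l walltime=24:00:00
#PBS -M my.name@university.be

cd ${PBS_O_WORKDIR}
exec > my_job.log 2>&1

module load compiler/intel/2015.5.223 intelmpi/5.0.3.049/64
mpirun -np 96 ./my_app
The cd and exec lines are explained in the 'Working directory' and 'Standard I/Os' sections below.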
The scheduling policy is a fairshare per category, then per project. The category fairshare is set to the agreed-on distribution of compute time among the categories (see this document in French). All jobs from the same project have the same priority, which is based on the past usage of the cluster by the project. The fairshare is configured with a decrease factor of 0.5 and a period of one week.
Scripts
PBS scripts are very similar to Slurm scripts in that most Slurm parameters have a direct PBS equivalent. Still, there are some differences that must be taken into account.
Chunks
PBS resources are allocated (and thus requested) by chunks. A chunk is an allocation unit, defined in terms of number of CPUs and amount of memory, that will be allocated entirely on one node (it cannot be split across nodes). Chunks are requested with the #PBS -l select= construct. For instance:
#PBS -l select=4:ncpus=1:mem=512mb
requests 4 chunks of one CPU each with 512MB of RAM. It is equivalent to
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=512
The CPUs will be allocated freely by PBS so you could end up with 4 CPUs on the same node, 4 CPUs on distinct nodes, or any combination on two or three nodes.
To have all CPUs allocated on a single node, you will request one chunk with 4 CPUs and 2GB of memory:
#PBS -l select=1:ncpus=4:mem=2048mb
which corresponds to
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=512
You can also define the number of MPI processes and OpenMP threads to create in a chunk, like this:
#PBS -l select=4:ncpus=24:mpiprocs=2:ompthreads=12:mem=63000mb
so that PBS knows that you want a total of 8 MPI processes, each with 12 threads, and that you want them allocated two per chunk. For more complete examples, refer to the script generation wizard.
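To give an idea of how such a request is typically consumed, here is a hedged sketch of the corresponding launch lines inside the job script (the executable ./my_hybrid_app is a placeholder, and Intel MPI's mpirun is assumed to pick up the PBS allocation):
export OMP_NUM_THREADS=12        # matches ompthreads=12 in the chunk request
cd ${PBS_O_WORKDIR}
mpirun -np 8 ./my_hybrid_app     # 8 MPI processes = 4 chunks x 2 mpiprocs, 12 threads each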
Working directory
By contrast with Slurm, which starts your job in the same directory it was submitted from, i.e. the current directory when you typed the sbatch command, PBS executes the scripts in a special directory by default. It is therefore common to start each job script with a cd ${PBS_O_WORKDIR} command.
Standard I/Os
Similarly, the output of your job is redirected to a file that will be available in the working directory only at the end of the job. To monitor a job, you consequently need to redirect the output to a specific file in your home or working directory. You can do this once, for all the commands in your script, by starting it with
exec > filename
Again, see the script generation wizard for a practical example.
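Put together, the beginning of a PBS script (as in the example given for the large queue above) therefore often looks like this short sketch (the file name job_output.log is a placeholder):
cd ${PBS_O_WORKDIR}                # come back to the submission directory
exec > job_output.log 2>&1         # all subsequent output goes to this file, readable during the job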
Important advice and common mistakes
Use queue 'large' rather than 'main'
You should mainly use the 'large' queue. That is where the majority of the compute nodes are located. It is also where the newest nodes are located. The 'main' queue should be used for tests and smaller jobs.
Do not run jobs on the frontend
Make sure to submit your jobs to PBS properly. If they end up running on the frontend, they will be killed. The frontend is only there to submit and manage jobs, handle files, and compile your code.
Use 24 CPUs per node on queue 'large'
As nodes on the 'large' queue are allocated exclusively to one job, it is best to use all 24 cores of each node, since cores you leave idle cannot be used by other jobs.
Stay as close as possible to 2625 MB/core on queue 'large'
If your jobs scale properly, choose the parameters so as to use at most 2625MB per core to ensure optimal usage of the compute nodes.
Use job arrays when you have many similar jobs
When you have a large number of jobs doing nearly identical computations, you should use the job array capabilities of PBS. See Chapter 9 of the PBSPro manual.
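As an illustration, a hedged job array sketch (the index range and the file naming scheme are placeholders; PBSPro exposes the index of each sub-job in $PBS_ARRAY_INDEX):
#PBS -q main
#PBS -J 1-100                       # 100 sub-jobs with indices 1..100
#PBS -l select=1:ncpus=1:mem=2048mb

cd ${PBS_O_WORKDIR}
./my_app input_${PBS_ARRAY_INDEX}.dat > output_${PBS_ARRAY_INDEX}.log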
Make sure to use your canonical email address in #PBS -M
Alias email addresses may not work with PBS, so make sure to only give your main email address when using the -M PBS parameter. Otherwise, your email will most probably not be sent to you and the system administrator will receive an error email.
Be concerned with how much memory you request
If you request more memory than you really need, you prevent other users from using CPUs that could be used if you had estimated the memory requirements of your job properly. Your jobs are then scheduled later than when they actually could run because the scheduler is waiting for the resources you have reserved to become available. The larger the resource, the longer the average waiting time will be. The whole throughput of jobs is slower than it could be, and the (costly) resources are wasted.
To estimate how much memory your job needs, you can run test jobs and connect with SSH to the compute node allocated to your job (which you can find with qstat -f JOBID | grep exec_host), then use the top command, for instance, to get that information. If you notice you have requested too much memory, you can also reduce it with the qalter command.
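For example, a hedged sketch, with a placeholder job id, of checking a test job and then lowering the memory request of a job that is still queued:
$ qstat -f 1234567 | grep exec_host                  # find the node(s) of a running test job
$ qalter -l select=1:ncpus=24:mem=48000mb 1234567    # reduce the memory request of a queued job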
On queue 'large', the optimal memory usage is 2625 MB per core. Please use this value as upper bound on your memory request.
Make some effort estimating the running time of your jobs
If you request more time than you really need, you prevent other users from using CPUs that could be used if you had estimated the time requirements of your job properly. Your jobs are then scheduled later than when they actually could run because the scheduler does not consider them for backfilling. The longer the running time, the larger the average waiting time will be. The whole throughput of jobs is slower than it could be, and the (costly) resources are wasted.
To estimate how much time your job needs, you can run test jobs and have a look at the information sent by email by the system at the end of the job. If you notice you have requested too much time, you can also reduce it with the qalter command.
On queue 'large', the maximum allowed running time is 24 hours. Please favor underestimating the running time (and using checkpointing) over overestimating it.
Do not use SSH to launch jobs on the compute nodes
All compute processes should be managed by PBS so that PBS is able to control them if needed. SSH to the compute nodes should only be used to monitor your jobs. Running jobs manually with SSH can leave lots of ghost processes (processes that do not belong to a PBS job) that must be cleaned up manually by the sysadmin.
Do not use qstat too much
The qstat command imposes some load on the scheduler, and users who use watch to monitor the output of qstat every second put a large load on the scheduler, to the point that the proper scheduling of the jobs is affected. Running a lot of qstat commands to see when your job starts actually results in delaying the start of your job...
Reservations
Reservations to meet deadlines or to run debugging jobs can be requested by email to ceci-logist@lists.ulg.ac.be.
The following rules apply:
- maximum 311040 core-hours per reservation (e.g. 4320 cores for 3 days)
- maximum wall time of 10 days per reservation
- maximum 4320 cores reserved at any time
Reservations are granted so as to best organize the load of the machine and be fair to all CÉCI users. See also https://tier1.cenaero.be/en/reservations.
More information ...
- Official documentation by Cenaero
- Cenaero Support: it@cenaero.be
- PBSPro documentation from Altair