Survey 2023: Summary of findings
At the end of 2023, CÉCI users were invited to respond to an anonymous satisfaction survey.
The text below summarises all comments and suggestions made in the responses. This other document offers a synthetic overview of the responses.
Documentation
One user felt lost while trying to learn Linux and the command line and did not know where to begin.
One page in the CÉCI documentation provides links to tutorials about basic terminal commands: https://support.ceci-hpc.be/doc/_contents/QuickStart/Linux/IntroLinux.html, and suggests two "serious games" designed to teach the foundations of Bash.
It might nevertheless still be too large a step for complete beginners.
We will enhance that page with information for users who know about Windows and graphical user interfaces but have never used Linux or the command line.
Furthermore, as we settled on Gameshell for the training sessions, we will remove the link to the other game listed on that page and add information on how to start the game on the CÉCI clusters where it is installed.
One respondent requested that more information be provided about the use of $LOCALSCRATCH when submitting jobs
How to use the different types of filesystems (local scratch, global scratch, home, etc.) is described in detail in a dedicated training session: Efficient data storage on CÉCI clusters.
The slides of that session contain example submission scripts. We will add them to the documentation about Slurm submission scripts.
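In the meantime, here is a minimal sketch of such a script (the input, program and directory names are placeholders, and the exact behaviour of $LOCALSCRATCH varies between clusters):

    #!/bin/bash
    #SBATCH --job-name=localscratch-demo
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00

    # Stage the input on the node-local scratch, compute there, then copy the
    # results back to persistent storage before the job ends.
    WORKDIR="$LOCALSCRATCH/$SLURM_JOB_ID"
    mkdir -p "$WORKDIR"
    cp "$HOME/inputs/data.in" "$WORKDIR/"
    cd "$WORKDIR"

    ./my_program data.in > data.out

    mkdir -p "$GLOBALSCRATCH/results"
    cp data.out "$GLOBALSCRATCH/results/"   # local scratch may be wiped when the job ends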
One respondent asked for more information about how to connect to storage at UCLouvain, and in general, about storage management.
One training session, https://indico.cism.ucl.ac.be/event/143/contributions/143/, overviews many aspects of data management on the clusters. Regarding the storage solution of one specific university, users are invited to consult that university's own documentation, which in the present case can be found at https://www.cism.ucl.ac.be/doc.
One respondent requested tips on how to best use the clusters (logging, benchmarking, etc.)
This is a very vast topic that is encompassed by multiple training sessions and pages in the documentation. Users who have suggestions about specific topics that could be developed in the documentation can send us an email at ceci-logist@lists.uliege.be.
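As one concrete example of the kind of tip covered there, the resource usage of a finished job can be inspected with Slurm's accounting tools (the job ID below is a placeholder, and seff is only available where it is installed):

    # Summarise how much of the requested CPU time and memory a job actually used:
    seff 123456

    # The raw accounting data is also available through sacct:
    sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem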
Connecting
One respondent complained that idle connections were closed automatically
Automatic disconnection in SSH is controlled by the SSH client.
- For the command-line client, use the SSH configuration wizard.
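For reference, OpenSSH clients can keep idle connections alive with the following options in ~/.ssh/config (the host aliases below are placeholders; reuse the ones generated by the wizard):

    Host lemaitre4 hercules
        ServerAliveInterval 60    # send a keep-alive probe every 60 seconds
        ServerAliveCountMax 5     # only give up after 5 unanswered probes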
Jobs
A respondent complained that the different software packages needed for their job are not all in the same release.
The modules on the clusters are organised by "releases" to reduce the risks of software version incompatibilities; all software within one release is known to be compatible with the toolchain of that release. To avoid issues, only one release can be active at a time.
If the software you need is spread across multiple releases, you can switch releases in the job submission script.
If that is not possible, for instance because the software packages are "glued" together by a workflow management system, feel free to ask the system administrators to install them in a specific release.
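As a sketch of the first option (the release and module names below are placeholders; check module avail on your cluster for the actual ones):

    #!/bin/bash
    #SBATCH --time=02:00:00

    # First part of the workflow, with a tool from an older release
    module purge
    module load releases/2021b
    module load SomeTool
    ./step_one.sh

    # Second part, with a tool that only exists in a newer release
    module purge
    module load releases/2022a
    module load OtherTool
    ./step_two.sh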
A respondent complained that a configured limit would prevent them from submitting large job arrays at once.
If you regularly hit the maximum user limits, you may consider switching from Tier-2 clusters to Tier-1, in order to ensure fair usage of resources among CÉCI clusters and users. Currently, Lucia offers such an option.
If you wish to remain on Tier-2 clusters, then using a workflow management system, more specifically atools in this case, will help.
One respondent wondered how to avoid monopolizing the cluster and leave resources available for others
As with all shared resources, ensuring fair access for all users is important. That is the role of Slurm, and it is made possible by the limited walltime. Users should not worry about fairness; if their jobs start, it means they had the highest priority (based on prior usage) or they were backfilled (and did not delay higher-priority jobs).
If, nevertheless, a user wants to be "nice" to others, they can use the --nice option of sbatch, which voluntarily lowers the job priority. In the case of job arrays, the % sign can also be used to limit the number of jobs in the array that are allowed to run concurrently. See the sbatch man page for more details.
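For example (the script name, nice value and array size below are arbitrary placeholders):

    # Voluntarily lower the priority of a job:
    sbatch --nice=100 job.sh

    # Submit a 1000-task array, but let at most 20 tasks run at the same time:
    sbatch --array=1-1000%20 job.sh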
The same user suggested that A100 GPU usage be taken into account for fairshare (and hence priority) calculations
That is the case on Lucia, where past usage is measured in "billing units"; one "billing unit" is a composite measure of CPU, memory and GPU usage.
One respondent complained that when, for instance, Python is upgraded on the cluster, it forces them to re-run heavy compilation processes (for instance in Madgraph jobs)
Keeping the software up to date is important for many users who depend on the latest features or optimisations for their jobs. But previous versions are often just as important, to avoid breaking existing jobs.
That is why software is never actually upgraded; only the default release is updated every year, and all previous releases remain accessible. If, after the maintenance week, the default Python version has changed and does not fit your needs, you should simply load the previous release and save it as your default with lmod. See the lmod documentation for details.
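For instance, a possible sequence (the release and module names are placeholders; module save and module restore are standard Lmod commands):

    module purge
    module load releases/2022a    # the previous release you still need
    module load Python            # plus the other modules your jobs rely on
    module save                   # store this set as your default collection

    # Later sessions and job scripts can then bring it back with:
    module restore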
One respondent explained that large stiff ODE problems cannot be benchmarked at small scale, so estimating the time for a job is difficult, and a lot of these jobs end up timed out with logs wiped.
Estimating the resources and time needed for a job is often very difficult. Yet it is important to make sure the expensive hardware is well used and shared.
Any log that is written to the local scratch is indeed erased when the job reaches its end. To avoid that, make sure to write logs to persistent storage such as the $GLOBALSCRATCH filesystem.
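A minimal sketch of that idea (paths and the ode_solver name are placeholders): keep the heavy I/O on the local scratch, but point the log file directly at $GLOBALSCRATCH:

    #!/bin/bash
    #SBATCH --time=48:00:00

    LOGFILE="$GLOBALSCRATCH/logs/${SLURM_JOB_ID}.log"
    mkdir -p "$(dirname "$LOGFILE")" "$GLOBALSCRATCH/results"

    cd "$LOCALSCRATCH"
    cp "$HOME/inputs/problem.dat" .
    ./ode_solver problem.dat > "$LOGFILE" 2>&1   # the log survives even if the job times out

    cp results.dat "$GLOBALSCRATCH/results/"     # only reached if the job finishes in time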
And this is a reminder that checkpointing is a very important feature of scientific software. If you are the developer, make sure to include that feature in your software; it is often rather easy to implement. If you are a non-developing user, look for that functionality in your software's documentation. If it is not available, you can resort to tools such as DMTCP.
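As a rough sketch of that last option (the module name, program name and checkpoint interval are assumptions; consult the DMTCP documentation for details):

    # Run the application under DMTCP and write a checkpoint every hour:
    module load DMTCP
    dmtcp_launch --interval 3600 ./my_solver input.dat

    # If the job times out, resubmit and resume from the last checkpoint using
    # the restart script that DMTCP generates alongside the checkpoint images:
    ./dmtcp_restart_script.sh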
One respondent complained that their jobs are sometimes killed by the administrators for a reason they do not understand, often at a time very close to completion.
If you realise a job of yours has been killed by an administrator, please check your email: you should have received a message explaining why. Do not hesitate to ask further questions if you do not fully understand its contents. Most of the time, it will be because your job performs IO operations (typically file writes) that make the whole system unstable.
Storage
One respondent noticed that when files on the common filesystem are deleted on a cluster, it can take some time before they appear as deleted on the other clusters
This is expected on the current CÉCI common storage, as explained in the documentation. This allows a very fast local reaction time.
The next generation of CÉCI common storage will offer the opposite trade-off; files will disappear from every cluster at the same time. From the other clusters, the operation will appear faster than with the previous infrastructure, but from the cluster where the files were deleted, it will appear a bit slower.
One respondent suggested that the common storage be organised like the home filesystem, with one directory per user, rather than allowing everyone to create their own directories.
The CÉCI home filesystem was organised as such from the beginning, but the Transfer partition was initially left open and managed like a /tmp directory. We have now re-organised it with per-user directories, which are created automatically like the home directories.
A respondent complained that the paths to some directories can change from one cluster to another, making it difficult to have scripts that run on all of them
Although most of the time, the paths are chosen so as to be as uniform as possible on the clusters, there may be technicalities that make this impossible. Therefore, the recommendation is always to use the environment variables described in the documentation: $HOME, $CECIHOME, $GLOBALSCRATCH, etc. Those are guaranteed to be identical on all CÉCI clusters.
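A submission script that refers to storage only through those variables, as in the hypothetical sketch below, will therefore run unchanged on every cluster (directory and program names are placeholders):

    #!/bin/bash
    #SBATCH --time=01:00:00

    INPUT="$CECIHOME/myproject/input.dat"
    WORKDIR="$GLOBALSCRATCH/myproject/run_$SLURM_JOB_ID"

    mkdir -p "$WORKDIR"
    cp "$INPUT" "$WORKDIR/"
    cd "$WORKDIR"
    ./my_program input.dat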
A respondent complained that Lucia did not have access to the CÉCI common filesystem.
Lucia was not integrated into the current CÉCI common filesystem, but the plan for the next generation of the common storage is to include it. This poses technical challenges for our network provider Belnet, because Lucia is located in a new building that needs to be equipped accordingly. All the issues should be resolved at some point to enable fast access from Lucia to the common storage.
A respondent explained that they have problems choosing the correct permissions to be able to share files with others
A dedicated page in the documentation was written on the subject: https://support.ceci-hpc.be/doc/_contents/ManagingFiles/SharingFiles.html. The topic is also addressed in the training session Introduction to data storage and access.
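As a quick illustration of typical permission commands (not necessarily the exact recipe from that page; directory and user names are placeholders, and ACL support depends on the filesystem):

    # Let members of your Unix group read, but not modify, a results directory:
    chmod -R g+rX "$CECIHOME/myproject/results"     # capital X adds execute only to directories
                                                    # (and already-executable files)

    # Grant read access to a single colleague with an access-control list:
    setfacl -R -m u:colleague:rX "$CECIHOME/myproject/results"

    # Note: the parent directories must also be traversable for the path to be reachable.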
Suggestions
One respondent noted that ColabFold multimer used to work on Hercules, but does not anymore, probably after a CUDA upgrade.
While we always try to make sure upgrades do not introduce regressions, that can happen. As soon as you notice that something that used to work does not work anymore, make sure to contact the system administrators.
That particular problem should be solved by now, but if it breaks again, contact us!
One respondent reported that some of their jobs using Singularity sometimes fail, while other similar jobs succeed.
If you observe that, in a set of similar jobs, some fail while others do not, you can use the sacct Slurm command to try and correlate the errors with the nodes the jobs ran on. If you observe that the failing jobs were all scheduled on the same compute node, you should notify the system administrators, and add the --exclude <NODELIST> option to your jobs that are currently pending to make sure they do not fail.
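For instance (the job IDs and node name below are placeholders):

    # Compare the state, exit code and node of each job in the problematic set:
    sacct -j 123456,123457,123458 --format=JobID,JobName%20,State,ExitCode,NodeList

    # While waiting for the administrators, keep new submissions away from a suspect node:
    sbatch --exclude=node042 job.sh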
A respondent explained that their docker-compose-based workflow cannot easily be transposed to Singularity and that this prevents them from using the clusters.
Indeed, Docker and docker-compose are technologies that were developed for a virtualisation-based cloud infrastructure rather than for an HPC infrastructure. That difference has major impacts on the security of the cluster: with cloud technologies, user isolation is provided by the virtualisation, while on the cluster, user isolation really depends on the fact that no user has root access.
The good news is that the Slurm developers are currently working on making cloud-originating workflows compatible with Slurm, but this is not fully ready yet.
One respondent complained that the maintenance week of Manneback was just before a paper submission deadline and it would be better to check with the researchers before fixing the dates.
The maintenance dates are decided based on many constraints; user deadlines are one type of constraint, but not the only one. Often, the maintenance periods are chosen to coincide with maintenance of the hosting infrastructure (building, electricity, cooling system, etc.) or of the network infrastructure, so as to minimize downtime. Those also depend on many other factors, which are often outside of our control.
As for the specific case of Manneback (which is actually not a CÉCI cluster), additional constraints come from the fact that the cluster is part of the LHC computing grid. The maintenance dates are always discussed during the CISM "Comité des utilisateurs", so you should make sure your representative in that committee is aware of, and reports, your deadlines.
One respondent requested that shells other than Bash be configurable as the login shell on the clusters
Using a shell other than Bash is not supported on the CÉCI clusters for the reasons listed in the documentation, but the same page lists workarounds to still use ZSH. Beware, however, that it might break some software modules.
One respondent complained that the information for citing the CÉCI and other clusters in the acknowledgment sections of papers is not easy to find.
Acknowledgments are indeed super important, as they allow us to compile publication lists for the funding entities when we write the funding requests for renewing the clusters. The information is available from the FAQ page (do not look for it in the technical documentation).
One respondent mentioned it would be great to have a physical person with whom to chat
The main contact channels for the system administrators are the website and the support contact form, but that does not mean you cannot request a meeting with the system administrators of your university, either in person or through Teams or Zoom. Make sure to explain the purpose of the meeting and invite all relevant colleagues to make the most of it.
One respondent requested that the project management tool report resource usage and other statistics.
The project management tool allows creating and updating projects, but indeed does not offer any reporting about the projects. That would of course be a very interesting feature, but it would require further development of the tool, for which we currently, unfortunately, have neither the financial nor the human resources.
One respondent requested that CÉCI host a git and git-annex server for long-term secure storage of versioned data
Long-term availability of data is important for the credibility of publications that are based on that data. Unfortunately, that is not the mission of CÉCI, so we cannot offer any guarantees for long-term storage; it is the responsibility of the universities. All the member universities of CÉCI have a solution based on Dataverse, Zenodo, or similar. You should enquire with your local administrators, who will point you to the right resources.
One respondent noted that a visualisation partition for software such as pymol would be useful
Visualisation is an important part of research in several fields, but it mainly requires GPUs to be effective. At the time of writing, GPUs are available mainly on Lucia, where a visualisation partition is present. The next CÉCI cluster to offer GPUs will be Lyra, so this will be discussed during its setup.
One respondent requested that Octave with its GUI be available on the clusters so they could modify code and data without the need to download and re-upload data at every change.
The HPC clusters are really not designed for interactive usage of GUI-based software. When run on the frontend, such software might disrupt its proper functioning. Submitting an interactive job would be possible, but multiple tunnels would be needed, and such interactive jobs could leave resources unused for long periods of time. Furthermore, remote GUIs can be very slow. They could make sense on a visualisation partition, though.
Training
One respondent complained that the introductory session about the command line was not introductory enough and was lacking exercises.
This comment probably relates to older sessions, as since 2023 the session about the command line is purely exercise-based. It is actually a text game (which you can play anytime on Lemaitre4 with module load gameshell) designed to introduce the Linux command line to real beginners through metaphors and small exercises.
One respondent noted that training sessions were missing information on good practice on the clusters. For instance some users train neural networks on CPU rather than on GPUs.
It is true that "good practices" are not elaborated on during the sessions, as they are often not universal. For instance, training neural networks on CPUs could be part of a study of hardware performance in machine learning and be perfectly valid in that context. The system administrators often do not know enough about the context to be sure that some behaviour is "bad practice". Only when it has an impact on the system itself do we know for sure that it is.
Nevertheless, from 2024 on, there will be a short session about the good practices that are expected from users on the clusters; a kind of CÉCI good citizenship manual.
One respondent complained that the Python tutorial was focussed on code optimisation while they were hoping for an introduction to the language.
The respondent probably attended the second Python session, which is indeed dedicated to high-performance Python. The first Python session, however, is dedicated to the language itself: it does not address optimisation but teaches participants how to write simple programs in Python. It is never too late to attend it!
One respondent suggested that sessions be different for very beginners and more advanced users because when a lecture tries to address both at the same time, it just does not work.
Participants in the training sessions can indeed have very different prior knowledge, and that makes the organisation of the sessions very complicated. That is why we have, for each session, a description of the prerequisites and a detailed list of topics on the registration website, in the hope that participants make sure they meet the requirements for the session.
Note that currently many sessions are split into introductory and advanced parts, e.g. Python, Julia, GPU, Slurm, and some others. On the other hand, some cannot be split because of the general difficulty people have in assessing their own level of skill. For instance, many users think they know everything they need to know about SSH and yet fail at connecting to the clusters, or complain that the passphrase must be typed twice at every connection (it does not have to be).
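Incidentally, on that last point: with an SSH agent, the passphrase only needs to be typed once per local session. A minimal sketch, where the key file name and host alias are placeholders:

    eval "$(ssh-agent)"          # start an agent for this session
    ssh-add ~/.ssh/id_rsa.ceci   # type the passphrase once
    ssh lemaitre4                # subsequent connections reuse the unlocked key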
One respondent suggested that training sessions take place on different campuses and also be offered online.
At the time of writing, most training sessions are organised in Louvain-la-Neuve, and some of them are in Namur, both universities being in a central position in Wallonia. But indeed, for participants from Liège or Mons, that is further away.
We try to make sure the beginning and ending times of the sessions are compatible with train travel from Liège or Mons. If you feel that some sessions should be organised there, please contact your local system administrator to start a discussion about that.
As for bringing the sessions online, that is something we tested and are not willing to do anymore, for multiple reasons. First, it makes hands-on sessions much more difficult, and in the end most remote participants are unable to actively participate. Second, there is no added value to online training compared to the already-recorded videos that are available on our YouTube channel. Third, it requires twice the number of trainers per session and more logistics, while at the same time enabling people to mindlessly follow the session from their office, doing something else and without really focussing. In that situation, we cannot deliver participation certificates that are actually worth anything. "Knowledge isn't free. You have to pay attention." (Richard P. Feynman)