Status
Future events
Current issues
None. If you notice something wrong, please notify us.
Past events
2024-09-21 NIC5: At 3:47 AM, the master node of NIC5, along with all services running on it (including the Slurm controller), experienced a failure. Normal operations were restored by 7:00 AM. No impact of this outage has been observed on active jobs.
2024-09-01 Lemaitre3: Deactivation of the full system, which will become completely unavailable.
2024-08-08 Dragon2: Due to unforeseen problems during the Dragon2 maintenance, Dragon2 is currently in a degraded state. The cluster is up and you can submit jobs, subject to the precautions explained in the email sent about this event.
2024-07-29 / 2024-08-05 Dragon1/2: Maintenance week with cleaning of global scratches.
2024-07-01 Lemaitre3: Cleaning of the global scratch, deactivation of Slurm, and freezing of the home directories (read-only).
2024-07-01 Lemaitre4: Some short disruptions of services are to be expected from time to time during the maintenance week.
2024-06-24 07:00 NIC5: Start of the urgent unplanned maintenance; NIC5 unavailable until 13:00. Because network problems are disrupting access to /home or /CECI on some compute nodes, the cluster has to be drained over the weekend so that it is empty of jobs on Monday morning for a reboot of the InfiniBand switches. NIC5 was back at 13:00 as forecast.
2024-06-10 Hercules2: Planned maintenance week
2024-05-13 09:30 NIC5: The second /scratch server is up, and the faulty disk has been replaced and is slowly rebuilding. To ensure data safety, the size and number of jobs per user are strictly limited until tonight.
2024-05-12 23:30 NIC5: One of the two /scratch fileservers is down. Data are safe and available, but performance is degraded. Submission of jobs is temporarily suspended.
2024-04-04 14:00 Hercules2: Due to a power outage, the GPU nodes on Hercules2 are unavailable. They are expected to be back in service in the next few days.
2024-04-08 16:00 Hercules2: Hercules2 is back in service.
2024-04-04 14:00 Hercules2: Due to a power outage, Hercules2 is down. Service is expected to resume on Monday, April 8th.
2024-04-05 09:00 UNamur CÉCI gateway: The UNamur CÉCI gateway is back online.
2024-04-04 14:00 UNamur CÉCI gateway: Due to a power outage, the UNamur CÉCI gateway is down.
2024-03-19 Lemaitre3 and Lemaitre4: Planned power cut
2024-01-29 Manneback: Planned maintenance week (New date!)
2024-02-19 Lucia: Planned maintenance (7:00-19:00)
2024-01-31 Lemaitre3: Planned power outage (7:00-19:00)
2023-10-12 07:00 NIC5: The scheduled maintenance went well and ended sooner than expected.
2023-10-12 08:36 NIC5: The CECI common file system gateway of NIC5 has been rebooted. Access to all /CECI partitions has been restored.
2023-10-12 00:49 NIC5: The CECI common file system gateway of NIC5 failed. As a consequence, access to all /CECI partitions was lost. Jobs using one of these partitions may have failed.
2023-10-02 10:00-12:00 NIC5 and CECI websites: inaccessible due to a networking issue
2023-09-24 11:00 Hercules: Home filesystem back online.
2023-09-23 13:56 Hercules: Home filesystem unavailable, preventing login.
2023-09-20 14:45 Lemaitre3: The BeeGFS global scratch /scratch is back online after replacement of the failing hardware.
2023-09-20 13:20 Lemaitre3: The BeeGFS global scratch /scratch is currently unavailable.
2023-09-17 17:45 Lemaitre3 and gwceci.cism.ucl.ac.be: Network connectivity has been restored.
2023-09-16 16:45 Lemaitre3 and gwceci.cism.ucl.ac.be: UCLouvain HPC infrastructure inaccessible due to a networking issue.
2023-09-05 16:04 Hercules2: Workaround implemented to mitigate the slowdowns.
2023-09-05 16:04 Hercules2: Cluster stability issues detected due to a defective network device.
2023-08-10 11:08 NIC5: NIC5 is up and running again
2023-08-10 09:00 NIC5: Login node memory replacement and reboot
2023-08-06 16:04 NIC5: Hardware memory problem on login node detected
Legend
Everything is running as expected.
The system status is degraded. Some functionality might be missing or slower than usual.
The system is unavailable; we are working to make it functional again.
The system is undergoing planned maintenance operations.
The system is not maintained anymore.
Beginning of the event/issue
Resolution of the event/issue
Information and status update
Future announcements and "save the date" info