Barcelona, February 8th 2023 – HPCNow! has announced a real-time HPC cluster monitoring capability.
The monitoring stack combines open-source tools such as Grafana, Elasticsearch and Prometheus for visualization and data storage with Slurm plugins and custom scripts that gather the information system administrators need. It is delivered with Docker Compose for single-node monitoring scenarios, or with Docker Swarm when the customer requires high availability.
It also includes dashboards that display the gathered information, including:
- Slurm jobs: accounts for all Slurm jobs over a period of time.
- Job detail: shows the details of each job (submission, start and end dates, CPUs used and their efficiency, memory used and its efficiency, Slurm script, etc.).
- Slurm accounting: general overview of the HPC workload.
- Job efficiency monitoring (CPU and memory): resources requested, used and wasted.
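As a rough illustration of what the efficiency dashboards report, the sketch below computes CPU and memory efficiency for a single job. It is not HPCNow!'s implementation: the `JobUsage` record, its field names and the sample values are hypothetical, and the formulas follow the common convention of dividing the resources used by the resources allocated.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Hypothetical per-job record; field names are illustrative only."""
    allocated_cpus: int
    elapsed_seconds: float    # wall-clock run time of the job
    cpu_seconds_used: float   # total CPU time consumed across all cores
    requested_mem_mb: float   # memory requested at submission
    peak_mem_mb: float        # peak memory actually used


def cpu_efficiency(job: JobUsage) -> float:
    """Fraction of the allocated core-time that the job actually used."""
    allocated = job.allocated_cpus * job.elapsed_seconds
    return job.cpu_seconds_used / allocated if allocated > 0 else 0.0


def memory_efficiency(job: JobUsage) -> float:
    """Fraction of the requested memory that the job actually used."""
    if job.requested_mem_mb <= 0:
        return 0.0
    return job.peak_mem_mb / job.requested_mem_mb


def wasted_core_hours(job: JobUsage) -> float:
    """Core-hours that were allocated to the job but left idle."""
    allocated = job.allocated_cpus * job.elapsed_seconds
    return max(allocated - job.cpu_seconds_used, 0.0) / 3600.0


# Sample job (made-up numbers): 8 cores for one hour, 2 core-hours of work.
job = JobUsage(
    allocated_cpus=8,
    elapsed_seconds=3600,
    cpu_seconds_used=7200,
    requested_mem_mb=16000,
    peak_mem_mb=4000,
)
print(f"CPU efficiency:    {cpu_efficiency(job):.0%}")
print(f"Memory efficiency: {memory_efficiency(job):.0%}")
print(f"Wasted core-hours: {wasted_core_hours(job):.1f}")
```

In this made-up case the job used a quarter of its allocated CPU time and memory, the kind of over-allocation such a dashboard is meant to surface.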
The HPCNow! monitoring solution is flexible: it is tailored to each customer's needs in terms of availability, the variables to monitor and visualization.
“This new technology is a must for those institutions facing cluster congestion issues, that want to maximize their return on investment, and/or to keep the cloud bursting budget under control,” the company said. “Additionally, it helps the HPC center to draw a line to define what is reasonable regarding resource usage and educate users on using the cluster properly if they are allocating more resources than needed.”