[SPONSORED GUEST ARTICLE] In HPC, leveraging compute resources to the maximum is a constant goal and a constant source of pressure. The higher the usage rate, the more jobs get done, the fewer resources sit idle, and the greater the return on the HPC investment. Greater power efficiency means greater organizational efficiency. The bottom line: HPC resource management is a very big deal.
At Lenovo, with its HPC-class ThinkSystem server product line, workload maximization is a top priority. For several years, the company has partnered with SchedMD, developer of the popular open source workload manager software, Slurm. Together Lenovo and SchedMD have built an optimized orchestration solution for all Lenovo HPC ThinkSystem servers via Lenovo’s Intelligent Computing Orchestration (LiCO).
Slurm is by far the most widely used workload manager in HPC with market penetration of more than 70 percent.
Slurm began in 2002 as a collaboration by Lawrence Livermore National Laboratory, Linux NetworX, Hewlett-Packard and Groupe Bull. In 2010, SchedMD LLC was incorporated to develop and market the application.
Slurm maximizes workload throughput, scale, and reliability, delivering results in the fastest possible time while optimizing resource utilization and task allocation at a highly granular level. Slurm’s HPC and AI job automation capabilities are designed to simplify administration, accelerate job execution, and improve end-user productivity, all while reducing cost and error margins.
Building on those capabilities, a key aim of the Lenovo-SchedMD partnership is to simplify the use of Slurm. Ana Irimiea, Lenovo AI Systems & Solutions Product Manager, explains how the integration works.
“HPC and AI users give their inputs, things like scripts, containers and other resources, through the Lenovo LiCO interface on ThinkSystem servers,” she said, “and LiCO creates Slurm batch scripts underneath based on the inputs to deploy and manage the workload. The user doesn’t need to have all those skills or know the commands for Slurm. They just select from the user interface what they want to achieve, and that is going to be done in the background by Slurm through the interface with LiCO.”
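As a rough illustration of what happens behind the LiCO interface, the Python sketch below turns a few user-supplied fields into an sbatch script and submits it. This is not LiCO’s implementation, just a minimal example of the pattern; the script fields, the temporary-file handling, and the assumption that the `sbatch` command is available on the submission host are all illustrative.

```python
# Minimal sketch of how a front end might render user inputs into a Slurm
# batch script and submit it with sbatch. Illustrative only; not LiCO code.
import subprocess
import tempfile


def build_batch_script(job_name: str, command: str, nodes: int = 1,
                       cpus_per_task: int = 4, time_limit: str = "01:00:00") -> str:
    """Render a simple sbatch script from a handful of user-supplied fields."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
        "",
    ])


def submit(script_text: str) -> str:
    """Write the script to a temporary file and hand it to sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"


if __name__ == "__main__":
    script = build_batch_script("demo", "srun python train.py")
    print(submit(script))
```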
Slurm is known for its massive scalability, handling the requirements of large clusters and leadership-class supercomputers. Along with those capabilities, a major Slurm distinction is its ability to manage GPUs. Victoria Hobson, Vice President of Marketing at SchedMD, details this distinction.
“When submitting jobs, Slurm allows users to request GPU resources alongside CPUs. Slurm can allocate CPUs and GPUs together. This flexibility allows administrators to configure features according to the specific requirements of their site’s complex business policies. By effectively managing both CPUs and GPUs, Slurm ensures that jobs are executed quickly and efficiently, while maximizing resource utilization.”
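The snippet below sketches what a GPU-aware batch header might look like when CPUs and GPUs are requested together. The `--gres` and `--cpus-per-task` directives are standard sbatch options; the partition name and resource counts are illustrative assumptions, not values taken from any particular Lenovo or SchedMD configuration.

```python
# Sketch of sbatch directives that request GPUs alongside CPUs.
def gpu_job_header(gpus: int = 2, cpus_per_gpu: int = 4) -> str:
    """Build sbatch directives pairing a GPU request with CPU cores."""
    return "\n".join([
        "#!/bin/bash",
        "#SBATCH --job-name=gpu-train",
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus}",                       # GPUs on the allocated node
        f"#SBATCH --cpus-per-task={gpus * cpus_per_gpu}",   # CPU cores paired with the GPUs
        "#SBATCH --partition=gpu",                          # hypothetical partition name
        "#SBATCH --time=04:00:00",
    ])


if __name__ == "__main__":
    print(gpu_job_header())
```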
“For everyone in organizations utilizing HPC-class resources, be it for scientific simulations or AI, gaining access to HPC is closely managed and tightly controlled, and uptime must be fully leveraged to get jobs done,” said Ana. “Our integration with Slurm is designed to be an easy-to-use tool so that end users can focus on developing and running HPC workloads, not configuring, requesting and managing compute resources. Slurm capabilities include real-time accounting at the task level, power consumption and API usage tracking, as well as automatically re-queuing jobs. Through LiCO, we let Slurm automate all those tasks.”
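As a hedged example of the accounting side, the sketch below pulls per-job records with Slurm’s `sacct` command. The format fields used are standard sacct fields; the `ConsumedEnergy` column is only populated on clusters where an energy accounting plugin is configured, and the job id shown is hypothetical.

```python
# Minimal sketch of retrieving job accounting data via Slurm's sacct command.
import subprocess

FIELDS = ["JobID", "JobName", "State", "Elapsed", "MaxRSS", "ConsumedEnergy"]


def job_accounting(job_id: str) -> list[dict]:
    """Return one dict per accounting record for the given job id."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--parsable2", "--noheader",
         "--format=" + ",".join(FIELDS)],
        capture_output=True, text=True, check=True,
    ).stdout
    return [dict(zip(FIELDS, line.split("|"))) for line in out.splitlines()]


if __name__ == "__main__":
    for record in job_accounting("12345"):  # hypothetical job id
        print(record)
```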
To further support a holistic and sustainable approach via software, Lenovo has entered a long-term partnership with Energy Aware Software (EAS) around Energy Aware Runtime (EAR) to provide a solution optimized to deliver high-performance sustainability for the exascale era.
EAR is an energy management framework created by Luigi Brochard and Julita Corbalan through a BSC-Lenovo collaboration project, which started in 2016. EAR also provides energy optimization and smart power capping capabilities with integration into workload managers and job schedulers.
EAR is designed to work independently of any scheduler; however, Lenovo provides a Slurm SPANK plugin that makes it easy to use EAR with Slurm. The plugin is fully integrated into LiCO, allowing customers to select their energy strategy for running jobs.
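The sketch below shows how a job might be launched with an EAR energy policy selected through the SPANK-provided srun options. The option and policy names follow EAR’s public documentation but can vary with the EAR version and site configuration, so treat them as assumptions rather than a definitive recipe.

```python
# Hedged sketch of selecting an EAR energy policy at launch time through the
# srun options added by the EAR SPANK plugin. Option names are assumptions
# that may differ by EAR version and site configuration.
import subprocess


def run_with_ear(policy: str, *command: str) -> None:
    """Launch a job step through srun with an EAR energy policy enabled."""
    srun_cmd = [
        "srun",
        "--ear=on",                # enable the EAR library for this step (assumed option)
        f"--ear-policy={policy}",  # e.g. "min_energy" or "min_time" (assumed policy names)
        *command,
    ]
    subprocess.run(srun_cmd, check=True)


if __name__ == "__main__":
    run_with_ear("min_energy", "./my_hpc_app")  # hypothetical application binary
```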
SchedMD provides consultation, training, support and migration services for Slurm. For more information about these offerings, visit this page.
To learn more about Lenovo compute orchestration in HPC data centers using Slurm, visit this page.