Lawrence Livermore National Laboratory (LLNL) has deployed a 170-node HPC cluster from Penguin Computing. Based on AMD EPYC processors and Radeon Instinct GPUs, the new Corona cluster will support the NNSA Advanced Simulation and Computing (ASC) program at an unclassified site dedicated to partnerships with American industry.
In searching for a commercial processor that could handle the demands of HPC and data analytics, Matt Leininger, Deputy for Advanced Technology Projects at LLNL, said several factors influenced the choice of CPU, including single-core performance, the number of cores, and memory performance per socket. All of these factors drove LLNL to seriously consider the EPYC processor.
“Our simulations require a balance of memory and compute capabilities. The number of high-performance memory channels and CPU cores on each AMD EPYC socket is a solution for our mission needs,” he said.
The lab’s latest HPC cluster deployment—named Corona—is built with AMD EPYC CPUs and Radeon Instinct MI25 GPUs. “We are excited to have these high-end products and to apply them to our challenging HPC simulations,” said Leininger.
Simulations requiring hundreds of petaFLOPS (quadrillions of floating-point operations per second) run on the largest supercomputers at LLNL, which are among the fastest in the world. Supporting those larger systems are the Commodity Technology Systems (CTS), which Leininger calls “everyday workhorses” serving the LLNL user community.
The Corona cluster will bring new machine learning capabilities to the CTS resources. The integration of HPC simulation and machine learning into a cognitive simulation capability is an active area of research at LLNL.
“Coupling large-scale deep learning with our traditional scientific simulation workload will allow us to dramatically increase scientific output and utilize our HPC resources more efficiently,” said LLNL Informatics Group leader and computer scientist Brian Van Essen. “These new deep learning-enabled HPC systems are critical as we develop new machine learning algorithms and architectures that are optimized for scientific computing.”
The Corona HPC cluster is powered by AMD EPYC CPUs, peaks at 383 teraFLOPS across its 170 nodes, and has room to expand to a total of 300 to 400 nodes. Each node contains two AMD EPYC 7401 processors in a dual-socket configuration, 256GB of RAM, and a 1.6TB solid-state drive (SSD). Half of the nodes are also equipped with four AMD Radeon Instinct MI25 GPUs.
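For a rough sense of how a peak figure of that magnitude can arise from this node count, the back-of-the-envelope sketch below tallies CPU and GPU contributions. The clock rate and FLOP-per-cycle figures are assumptions drawn from typical spec-sheet values for the EPYC 7401 and MI25, not LLNL’s own accounting, so the total only approximates the quoted 383 teraFLOPS.

```python
# Rough, illustrative estimate of Corona's peak double-precision throughput.
# All rates below are assumed spec-sheet values, not official LLNL figures.

NODES = 170
GPU_NODES = NODES // 2          # half of the nodes carry GPUs
GPUS_PER_NODE = 4

# CPU side: two 24-core EPYC 7401 sockets per node.
# Assume ~2.0 GHz and 8 double-precision FLOPs per cycle per core.
CORES_PER_NODE = 2 * 24
CPU_GFLOPS_PER_NODE = CORES_PER_NODE * 2.0 * 8      # ~768 GFLOPS per node

# GPU side: assume ~768 GFLOPS FP64 per Radeon Instinct MI25.
GPU_GFLOPS = 768

cpu_total_tflops = NODES * CPU_GFLOPS_PER_NODE / 1000
gpu_total_tflops = GPU_NODES * GPUS_PER_NODE * GPU_GFLOPS / 1000

print(f"CPU peak  ~{cpu_total_tflops:.0f} TFLOPS")
print(f"GPU peak  ~{gpu_total_tflops:.0f} TFLOPS")
print(f"Combined  ~{cpu_total_tflops + gpu_total_tflops:.0f} TFLOPS "
      f"(article cites 383 TFLOPS)")
```

Under these assumptions the CPUs and GPUs land in the same ballpark as the quoted peak, which illustrates how much of the cluster’s throughput comes from the GPU-equipped half of the nodes.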
The cluster is built on Penguin Computing’s XO1114GT platform, with nodes connected by Mellanox HDR InfiniBand networking technology.
“We have folks thinking about what they can pull off on this machine that they couldn’t have done before,” Leininger said.
CPU/GPU Powering Machine Learning in HPC
“We’ve been working to understand how to enable HPC simulation using GPUs, and also how to use machine learning in combination with HPC to solve some of our most challenging scientific problems,” Leininger said. “Even as we do more of our computing on GPUs, many of our codes have serial aspects that need really good single-core performance. That lines up well with AMD EPYC.”
The EPYC processor-based Corona cluster will help LLNL use machine learning to run its simulations more efficiently through an active-learning approach, called cognitive simulation, that can optimize solutions while significantly reducing compute requirements. Leininger explained that multi-physics simulations, which involve extensive modeling and calculation of the hydrodynamic and materials problems important to NNSA, are the lab’s most complicated. These simulations produce results across a range of parameter space that are used to construct error bars; the uncertainty those error bars represent must be understood and reduced.
“We are looking to use some machine learning techniques where the machine would figure out how much of the parameter space we really need to explore, or what part of it we need to explore more than others,” Van Essen said.
Using EPYC-powered servers with Radeon Instinct MI25 GPUs for machine learning, LLNL will be able to determine exactly where to explore further, detect which components drive most of the error bars, and significantly reduce time on task to achieve better science.
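As a minimal sketch of the uncertainty-driven sampling Van Essen describes, the example below fits a surrogate model to a handful of simulation results and keeps choosing the parameter point where the surrogate is least certain. The run_multiphysics_sim() wrapper, the Gaussian-process surrogate, and the parameter grid are hypothetical stand-ins for illustration, not LLNL’s cognitive simulation code.

```python
# Illustrative active-learning loop: run the expensive "simulation" only
# where the surrogate model's predicted uncertainty (the error bar) is largest.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_multiphysics_sim(x):
    """Hypothetical placeholder for an expensive simulation returning a scalar."""
    return np.sin(3 * x[0]) + 0.5 * x[1] ** 2   # toy stand-in, not real physics

# Candidate parameter space (a 2-D sample) and a small initial design.
rng = np.random.default_rng(0)
candidates = rng.uniform(0, 1, size=(500, 2))
X = candidates[:8]
y = np.array([run_multiphysics_sim(x) for x in X])

surrogate = GaussianProcessRegressor()
for step in range(20):
    surrogate.fit(X, y)
    # Predict mean and standard deviation across the whole candidate space.
    mean, std = surrogate.predict(candidates, return_std=True)
    next_idx = int(np.argmax(std))        # point contributing most to the error bars
    x_next = candidates[next_idx]
    y_next = run_multiphysics_sim(x_next) # spend compute only where it matters
    X = np.vstack([X, x_next])
    y = np.append(y, y_next)
```

The design choice this sketch captures is the one described above: instead of sweeping the full parameter space, the loop concentrates expensive simulation runs on the regions that dominate the uncertainty, which is where the GPU-accelerated machine learning on Corona is intended to pay off.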