The NCRC was established in 2009 as part of a strategic partnership between NOAA and the U.S. Department of Energy and is responsible for the procurement, installation, testing and operation of several supercomputers dedicated to climate modeling and simulations. The goal of the partnership is to increase NOAA’s climate modeling capabilities to further critical climate research. To that end, the NCRC has installed a series of increasingly powerful computers since 2010, each of them formally named Gaea. The latest system, also referred to as C5, is an HPE Cray machine with over 10 petaflops of peak theoretical performance, or more than 10 million billion calculations per second. That is almost double the power of the two previous systems combined.
C5 is one of three NOAA computers operating at ORNL. Typically, the NCRC only operates two supercomputers at a time for NOAA users. They are replaced on a rotating schedule to provide NOAA users with uninterrupted access to more powerful machines while also minimizing operational and maintenance costs.
“The power efficiency, cooling efficiency, and CPU power all increase significantly over time. We can replace all of the computational power of C3 with a single cabinet of C5, which has eight cabinets total,” explained Paul Peltz, the ORNL technical lead for Gaea.
C5 was originally scheduled to arrive in the fall of 2021, but supply chain issues delayed its delivery and installation by several months.
“It was a unique period of time that made purchasing a system of this size very challenging,” said Chris Fuson, the NOAA program manager at ORNL.
When the hardware arrived and C5 was assembled in the summer of 2022, the team began the testing and acceptance process, which is a standard, but critical, phase that pushes the system to test its reliability, stability and performance under various workloads. This work was led by Verónica Melesse Vergara, group leader for the System Acceptance and User Environment group. Working with her were ORNL staff members Tom Papatheodore, Dan Dietz and Nick Hagerty. Initial tests, which find faulty hardware and confirm basic functionality, were followed by benchmarks and applications provided by NOAA that were representative of actual workloads.
“We load up the system with the application benchmarks and ensure the system can run with the expected performance,” said Dietz, a high-performance computing, or HPC, engineer at ORNL. “We slowly loaded up the number of copies of each benchmark running at once, easing on the gas to ensure the system doesn’t run into any issues under heavy load. We want to see consistent performance among all copies of the benchmark.”
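The load test Dietz describes can be illustrated with a minimal sketch. This is not the ORNL acceptance suite. The function names, the doubling schedule and the 5% tolerance are all hypothetical, chosen only to show the idea of raising the number of concurrent benchmark copies and checking that every copy performs consistently.

```python
from statistics import mean


def consistent(runtimes, tolerance=0.05):
    """Return True if every copy's runtime is within `tolerance` of the mean.

    The 5% default is an illustrative assumption, not a NOAA or ORNL threshold.
    """
    avg = mean(runtimes)
    return all(abs(t - avg) / avg <= tolerance for t in runtimes)


def ramp_up(run_copies, max_copies, tolerance=0.05):
    """Gradually increase the load and report where consistency breaks.

    `run_copies(n)` is a hypothetical callable that launches n concurrent
    copies of a benchmark and returns a list of their runtimes in seconds.
    Returns the first copy count whose runtimes are inconsistent, or None
    if every load level up to `max_copies` passes.
    """
    copies = 1
    while copies <= max_copies:
        runtimes = run_copies(copies)
        if not consistent(runtimes, tolerance):
            return copies
        copies *= 2  # doubling is an assumed schedule for this sketch
    return None
```

In practice, `run_copies` would submit jobs through the system's batch scheduler and collect timings. Here it is left as a placeholder so the consistency check itself stays easy to follow.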
“Finding problems and fixing them before we open the system to users is rewarding,” said Melesse Vergara. “If we did our jobs correctly, then users will be able to run without major challenges; so often they are unaware of the bugs that were fixed before they had access.”
When Gaea goes into full production and is open to NOAA users, the ORNL team will take a step back and focus on system maintenance while preparations for the next system begin.
“ORNL is a custodian of the machine for NOAA,” said Peltz. “We provide strong HPC knowledge and top-class facilities, and we invest heavily in our ability to house these machines in a secure manner. Those are things that NOAA doesn’t have to worry about. This interoperability between agencies is great.”
UT-Battelle manages ORNL for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.
source: Betsy V. Sonewald, Oak Ridge National Laboratory