We’re accustomed to the massive scale of everything associated with exascale supercomputing. Now we’re getting details on the file system that will support Frontier, the world’s first exascale-certified system, housed at the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory and supported by the facility’s High-Performance Computing Storage and Archive Group.
The file system, called Orion, consists of 50 cabinets with capacity for up to 700 petabytes of data spread across a three-tiered system of flash memory, spinning disk drives and other nonvolatile media, built on the open-source Lustre and ZFS file system technologies.
“We have one user with a single job on Frontier that will generate 80 petabytes of data,” said Dustin Leverman, group leader for HPC Storage and Archive in the Oak Ridge Leadership Computing Facility’s Systems Section, who is overseeing Orion’s installation and acceptance. “One petabyte equals about 500 billion pages of text, so do the math and that gives you some idea of the scale of data produced by these studies. If the user needs to refine the model, there may be even more data generated. We need ways to keep this kind of data available for verification and for future study.”
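Doing that math, using only the two figures Leverman gives, yields a striking number (a back-of-envelope illustration, not an official ORNL figure):

```python
# Back-of-envelope scale check using the figures quoted above.
PAGES_PER_PETABYTE = 500_000_000_000  # ~500 billion pages, per Leverman
job_petabytes = 80                    # single-job output cited above

total_pages = job_petabytes * PAGES_PER_PETABYTE
print(f"{total_pages:,} pages")  # → 40,000,000,000,000 pages
```

That is 40 trillion pages of text from a single job.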
With Orion, files will be distributed across:
- A flash-based performance tier of 5,400 nonvolatile memory express, or NVMe, devices providing 11.5 petabytes of capacity at peak read-write speeds of 10 terabytes per second with more than 2 million random-read input-output operations per second;
- A hard disk–based capacity tier of 47,700 perpendicular magnetic recording devices providing 679 petabytes of capacity at peak read speeds of 4.6 terabytes per second and peak write speeds of 5.5 terabytes per second; and
- A flash-based metadata tier of 480 NVMe devices providing an additional capacity of 10 petabytes.
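Summing the per-tier figures listed above shows how they add up to roughly the 700-petabyte total cited earlier (a quick consistency check, not an official specification):

```python
# Consistency check: sum the device counts and capacities of the three
# Orion tiers as listed in the article.
tiers = {
    "flash performance": {"devices": 5_400,  "capacity_pb": 11.5},
    "disk capacity":     {"devices": 47_700, "capacity_pb": 679.0},
    "flash metadata":    {"devices": 480,    "capacity_pb": 10.0},
}

total_devices = sum(t["devices"] for t in tiers.values())
total_capacity = sum(t["capacity_pb"] for t in tiers.values())
print(f"{total_devices:,} devices, {total_capacity:.1f} PB total")
# → 53,580 devices, 700.5 PB total
```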
Files older than 90 days will be moved to a data archive.
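In spirit, a 90-day policy like this amounts to a periodic sweep over file modification times. The sketch below is purely illustrative: production HPC centers drive migration with the file system's own policy tools and purpose-built archive systems, not a user-space walk, and the paths involved here are hypothetical.

```python
# Illustrative sketch only: find files whose modification time is older
# than 90 days, i.e., candidates for migration to an archive tier.
import os
import time

AGE_LIMIT_SECONDS = 90 * 24 * 60 * 60  # 90 days

def files_older_than_limit(root, now=None):
    """Yield paths under `root` not modified within the age limit."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > AGE_LIMIT_SECONDS:
                yield path  # a real system would migrate this file
```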
Frontier set a new record for computing speed when the HPE Cray EX system debuted in May 2022 at an average speed of 1.1 exaflops, or 1.1 quintillion calculations per second. Engineers at the OLCF, which houses Frontier and its predecessor, Summit, expect the system to run even faster after final tuning.
“We’re taking a hybrid approach to this system because these spinning mechanical drives by themselves aren’t fast enough to keep up with the processing speeds,” Leverman said. “The more time a computer burns while it stops to save data, the less time you have for computational workloads. We don’t want 20 percent of anyone’s time on Frontier to be spent on just writing data.”
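Leverman's point can be made concrete with a rough calculation using the peak write rates listed earlier. The 1-petabyte checkpoint size is a made-up example; only the bandwidth figures come from the article:

```python
# Rough illustration of why the flash tier matters: time to write a
# hypothetical 1 PB checkpoint at each tier's peak write rate.
checkpoint_tb = 1_000          # hypothetical 1-petabyte checkpoint
flash_write_tb_s = 10.0        # peak flash-tier write rate (TB/s)
disk_write_tb_s = 5.5          # peak disk-tier write rate (TB/s)

flash_seconds = checkpoint_tb / flash_write_tb_s  # 100 s
disk_seconds = checkpoint_tb / disk_write_tb_s    # ~182 s
print(f"flash: {flash_seconds:.0f} s, disk: {disk_seconds:.0f} s")
```

Every second saved writing a checkpoint is a second returned to computation, which is the trade-off the hybrid design targets.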
Frontier marks the fourth leadership-class supercomputing system Leverman has helped deploy. He joined the OLCF in 2009 and worked on Summit and its predecessors Titan and Jaguar, each the world’s fastest HPC system at its launch.
“There are always challenges throughout the process, and this has been no different,” Leverman said. “Each of these machines has helped change the future of science, and we’re excited to help make that possible.”
source: Matt Lakin, Oak Ridge National Laboratory