Exascale: Rumors Circulate in HPC Community Regarding Frontier’s Status

By now you may have expected a triumphant announcement from the U.S. Department of Energy that the Frontier supercomputer, slated to be installed by the end of 2021 as the first U.S. exascale-class system, has been stood up with all systems go. But as of now, DOE (whose Oak Ridge National Laboratory will house Frontier) is forgoing a “mission accomplished” announcement and instead has issued a somewhat formal statement about Frontier’s status. Left unaddressed are rumors circulating through the HPC community of difficulties encountered in the late stages of Frontier system integration and fine-tuning.

Here’s the official statement on the state of Frontier, issued by Mike Bernhardt, communications lead for DOE’s Exascale Computing Project: “ORNL’s partners in the exascale effort, HPE and AMD, have delivered the new Frontier system to ORNL ahead of the schedule for this fall. The installation and integration of Frontier, a massive, complex effort, is now underway, and current progress indicates everything is on track to have Frontier available to users for open science next year — as anticipated.”

Yet rumors are also circulating through the HPC community that Frontier is not as close to end-user readiness as had been hoped. While Frontier is said to be showing impressive runs on some codes, we also hear that the Slingshot interconnect, tasked with tying the mammoth HPE cluster together, is proving troublesome. Where, specifically, the fabric problems may lie is unclear, but there is speculation that they are related to integrating the HPE Cray-based Slingshot with the AMD EPYC CPUs and Radeon Instinct GPUs that will power Frontier. It’s possible DOE has decided to delay announcing that it has stood up the country’s first exascale system until the rumored interconnect issues are resolved.

insideHPC asked HPE and AMD to comment on the Slingshot rumors but has not yet received a response. We will update this story as appropriate.

Semantics may be in play here: it’s easy to conflate words like “delivered,” “deployed,” “installed,” “stood up” and other terms used to describe a new system’s operational status. DOE has consistently said Frontier would be in place at Oak Ridge by year’s end and available for users next year. In a Dec. 23, 2021, article in Nextgov, Justin Whitt, OLCF program director, said, “Some early users will get access to Frontier this summer to help harden the system for full user operations on Jan. 1, 2023.”

DOE and the Oak Ridge Leadership Computing Facility have spotlighted their success in preparing the OLCF facility for Frontier, such as recognition of the installation team leaders with ORNL Director’s Awards (see Frontier Exascale Install Teams Win ORNL Director’s Award, Dec. 10, 2021) and accounts of the immense infrastructure required to accommodate Frontier (see A Look Inside the US’s 1st Exascale Supercomputer Facility, Sept. 30, 2021).

Having said all of this, it’s also important and fair to state that Frontier is pushing supercomputing into uncharted waters, and that those tasked with deploying a system capable of a billion billion calculations per second have taken on systems integration complexities on a scale never before seen (see Getting to Exascale: Nothing Is Easy, Oct. 18, 2020).

“Frontier is a first-of-its-kind system and it requires a thoughtful, deliberate process to bring a machine of its magnitude online,” Whitt told Nextgov last week. “Thanks to the hard work and dedication of our world-class team, we’re exactly where we thought we’d be when we put the plan in place two years ago.”