Doug Kothe knows about transitions, including transitions on a grand scale. The outgoing director of the U.S. Department of Energy’s Exascale Computing Project helped transition the supercomputing industry into the exascale era. He took the ECP helm in October 2017, and in May 2022, the project reached a milestone with the certification of the Frontier supercomputer surpassing the exascale performance barrier. By any objective measure, it was a watershed moment in the history of HPC.
Now Kothe – who also was Associate Laboratory Director, Computing and Computational Science Directorate at Oak Ridge National Laboratory – is transitioning to the position of Chief Research Officer and associate laboratories director of advanced science and technology at Sandia National Laboratories, titles he assumed last week.
We caught up with Kothe on a Zoom call shortly before his last day in his Oak Ridge office, and the background scene caught on camera revealed a desk and shelves mostly bare, while the odd stack of papers here and three-ring binders there remained. Kothe told us that the moving process has caused him to realize he’s something of a hoarder.
“I’ve concluded that I hoard three things instead of just the two that I’d thought,” said Kothe, who is both earnest and affable. “I always thought it was tools and electronics. But apparently, it’s a lot of old magazines and papers because I’m having a hard time deciding what to throw away.”
Of his departure from Oak Ridge, Kothe called it a difficult decision.
“This was really not easy, it’s very bittersweet,” he said with a tinge of wistfulness. “I’ve been at Oak Ridge for about 17-and-a-half years. It seems like I’m doing ‘the lab tour’ because I spent 20 years at Los Alamos and a year at Livermore. And I didn’t ever plan my last gig, meaning I see this as my last position before I retire. I never really planned to go to Sandia.”
This career transition came about after Kothe was approached by Sandia Lab Director Dr. James Peery, whom Kothe has worked with and for in different positions over the course of his career. “And I really have tremendous respect for him, his vision and his leadership,” Kothe said. “And so a chance to re-engage working closely with James Peery was part of the decision.
“The position at Sandia looked to be a great opportunity,” Kothe continued. “My biggest difficulty was that I love Oak Ridge, I love the computing organization, I believe it’s the best in the world. So I struggled with this for a while before I visited Sandia. I interviewed there, and it just felt like it was the right thing for me at the right time.”
That said, Kothe acknowledged that gaining a full understanding of his new role at Sandia will involve an orientation period.
“The learning curve will be steep,” he said, “and I’ll be immersed in that role in the months to come, so what I’m saying now is what I envision will happen. It’s a substantial role and one that is really exciting.”
At Sandia, Kothe explained, CROs are expected to have a line responsibility, and for Kothe this means he’ll be responsible for almost 2,000 employees in the Advanced Science and Technology Division. “It’s advanced science and technology that really underpins most of their national security mission,” he said. “And the computing organization is part of it.”
The new role will involve significant engineering analysis that uses simulation tools to support the national security goals and objectives of the lab. “There’s also several key experimental facilities,” he said, “and that will be new to me, I’m not an experimentalist. I’ve certainly collaborated closely with experimentalists over my whole career, but here’s where I’m going to have to learn a lot and rely on the leaders of those facilities to do what they do well.”
In the end, the role at Sandia fit with his expectations of what a career transition should offer.
“There are two or three boxes that I like to check: it has to be fun, it has to be challenging and it has to be a learning opportunity. And those three are checked here,” he said. “The national security mission is also very compelling. And certainly that’s a mission here at Oak Ridge as well. But I’ll go squarely back into that space at Sandia.”
This career transition means Kothe is moving out of DOE’s Office of Science, whose purview includes oversight of the Oak Ridge Lab, and over to the DOE’s National Nuclear Security Administration (NNSA), which oversees Sandia, a science and engineering laboratory for national security. That said, Kothe won’t be moving away from HPC but within it, because the NNSA labs rely as heavily on HPC as the science labs do.
On the other hand, the HPC community probably won’t hear as much from Kothe compared with his tenure as ECP leader because the work Sandia does is necessarily more confidential.
“A lot of the scope of the labs is sensitive and not open,” he said. “And that’s the case for this division as well. Basically, a lot of the engineering aspects of nuclear deterrence are scoped in this group. It’s fundamental material science, including engineering and radiological sciences. Sandia has a tremendous responsibility for the engineering aspects of nuclear deterrence. I’m an engineer by training, so this is something that resonates with my background.”
The lower profile of HPC at Sandia will come as a marked contrast with Kothe’s work in directing ECP, which involved the efforts of roughly 1,000 people and the development of the software and application ecosystem for the three exascale systems: Frontier at Oak Ridge, El Capitan at Lawrence Livermore National Laboratory and Aurora at Argonne National Laboratory. Not just a significant technical achievement, standing up Frontier and the two other exascale systems has been a monumental project management challenge encompassing hardware and software engineers, coordinating the work of HPE, AMD and other vendors along with input from HPC experts at the national labs and DOE’s Office of Science and a slew of other exascale stakeholders.
The overriding goal: build what ECP called “capable exascale” – beyond machines that could exceed the exascale throughput barrier, ECP and DOE wanted systems with a built-out exascale ecosystem. In the case of Frontier, that meant a usable, stable machine able to leverage its compute resources (74 HPE Cray EX supercomputer cabinets with more than 9,400 AMD-powered nodes and 90 miles of networking cables) with a library of ported and optimized software applications.
“With Frontier, the thing some people don’t realize is that we unleashed the ECP horde on that system,” Kothe said, “hundreds of people in the early science period, which I don’t think has ever been done. That group of very seasoned developers really helped to harden the system. And putting a much larger breadth and depth of a software stack on a system like that … more quickly and readily elucidated features that were needed and bugs that need to be fixed.”
A key to the project’s success was systematic, daily communications and updates from the various development teams.
“How to coordinate the work everyone was doing, how to process the issue tickets that are being put out, having kind of daily communications with the ECP users and facility users was critically important and it worked out really well,” Kothe said. “But it took a lot of planning. We said we’re going to have daily stand ups because with a project like this, things change every day. So the apps teams, and the software teams were essentially logging their status daily. And the application development and the software technology leadership teams were constantly checking in on who needs to do what, when, and where things needed to be done. That put a lot of pressure … on the OLCF and the Frontier crowd, and they did a great job. And, of course, the vendors HPE and AMD, you know, really were up to the task as well.”
Kothe remains impressed with the accomplishments on the software side of the project.
“This was new and different in terms of deploying a system where instead of, say, eight or 10 early science applications involving maybe 50 people, now we’ve got 400 or 500 people. So that was challenging,” he said. “But in the last year, we’ve seen teams roll out applications and they have matured to where the first simulation out of the box works, or if it doesn’t, they know exactly what the problem is and what to do about it. It’s the result of six years of R&D, these teams really being down deep on the metal.”
Kothe said the steady progress made in deploying Frontier, the lessons learned, and the project methodology put in place and then honed have given him confidence that the next leadership-class supercomputer developed at Oak Ridge, internally called OLCF-6 (meaning it will be the sixth such system built at the Oak Ridge Leadership Computing Facility), will be a success.
“There’s a great leadership team in this directorate for computing and across the lab, and the computing organization is in very good hands,” Kothe said. “Some of that is because we’re constantly deploying, procuring and operating these large systems. That really does bring people along in how to handle high pressure, high profile leadership positions very quickly.”
For a senior lab manager like Kothe, that is part of a successful transition, too. It’s not just transitioning yourself effectively into a new role at a different organization, it’s enabling your old organization to succeed after your departure.