Data researchers, developers, data managers and program managers from the Department of Energy (DOE) national laboratories visited Lawrence Livermore National Laboratory (LLNL) Oct. 24-26 to discuss the latest in data management, sharing and accessibility at the 2023 DOE Data Days (D3) workshop.
The three-day event, sponsored by the National Nuclear Security Administration’s (NNSA) Office of Defense Nuclear Nonproliferation and hosted annually by LLNL, featured more than two dozen speakers from across the DOE/NNSA complex. More than 200 attendees met to explore wide-ranging topics in the data sphere, including common challenges the national labs face in operating in a secure environment and maintaining an ever-increasing amount of data generated in the age of artificial intelligence (AI), machine learning (ML) and other advanced high performance computing (HPC) technologies.
Common threads throughout the workshop were the importance of building a “data community” across DOE, the need for a shared language around data and the value of national lab personnel who are working on data issues.
LLNL Geophysics Data Specialist Rebecca Rodd, a member of the organizing committee and host for the workshop, said she hoped D3, held in person for the first time since 2019, brought the DOE data management community together to advance solutions in a fast-changing landscape.
“Technology changes rapidly, which can both provide challenges and opportunities in our research; it is critical, as data managers across the spectrum of specializations, that we have a forum to address challenges, learn from one another, engage in thoughtful discussions and drive change and innovation,” Rodd said.
Rodd called the event “a great success” that “brought people from various levels and across the DOE agencies together in a room to share their data management efforts, discuss common and ongoing challenges and identify actionable items to progress forward.”
“I’m excited to see how this community continues to engage throughout the year and in future DOE Data Days events,” she added.
The workshop’s daily sessions covered data access, sharing and sensitivity; data curation and metadata standards; data governance and policy; cloud and hybrid data management; and data-intensive computing. Each session concluded with a question-and-answer (Q&A) period. Attendees shared their labs’ latest data management tools and platforms, as well as best practices and possible avenues for improving data literacy and accessibility. Speakers from DOE, NNSA and the national labs also highlighted emerging trends being applied to DOE workloads across the complex, such as generative AI, large language models and converged computing.
DOE’s Chief Data Officer (CDO) Rob King, who gave the event’s opening keynote on enterprise data management, remarked on the importance of “speaking the same language” across labs and of sharing best data management practices.
“This event is fantastic,” King said during the Data Leadership panel discussion. “There is nothing that replaces sitting and talking with peers, that’s why we push for this event to bring this community together and grow it.”
Building a foundation of standardized data management
The 2023 D3 opened with a session on data governance and policy, with a focus on improving DOE’s geospatial data repositories. The afternoon session carried a theme of data curation and metadata standards, with a keynote by Alex May and Olga Kuchar of Oak Ridge National Laboratory (ORNL). Following talks on data curation strategies at NNSA, the Pacific Northwest National Laboratory (PNNL), the National Energy Technology Laboratory (NETL) and Los Alamos National Laboratory (LANL), LLNL researchers Camille Mathieu and Wayne Kool discussed findability and accessibility issues around weapons program data.
Mathieu, a knowledge management program manager in the Strategic Deterrence directorate, said her team wants to “make it easier to look across silos” and give users the ability to query data easily by leveraging existing tools. She discussed efforts to create a secure central authority for controlled information and a “Google Scholar for the weapons program,” emphasizing publishing, while providing a mechanism for authors to share their research more efficiently.
“The work that we’re doing is about resiliency,” Mathieu said. “It’s about being able to structure information that otherwise might be in Word documents or in people’s heads. Once we do that in a way that’s both human and computer-readable, using ontology and taxonomy management solutions, we’re able to provide a foundation where we can do a lot of cool stuff, but we need to start there.”
During the afternoon’s Data Leadership panel, moderated by DOE Geospatial Information Officer Josh Linard, panelists discussed “establishing a data culture” at the DOE labs, common challenges with maintaining metadata and data literacy, and what changes might be on the horizon given sizable investments in AI and ML across the complex. The panelists were DOE’s King, NNSA Office of the Chief Information Officer Data Program Lead Erica Vosseller, PNNL Data Architect Larry Seid, Sandia National Laboratories (SNL) Chief Data Officer Tom Trodden and Nevada National Security Sites Data Lead Krams Ramasubramanian.
SNL’s Trodden remarked on the “real opportunity to start bringing down the friction of collaboration, as we begin to build a common lexicon around what does it really mean to share data and manage data.” Vosseller added that solutions would come from the biggest strength of the community: the people who are working in data across the DOE/NNSA labs and sites.
“They’re amazingly creative, curious and really focused on solving problems,” Vosseller said. “People at the sites and labs have worked together in a grassroots way to build a community where they can solve a problem. One of the big challenges is the complexity. We have lots of different orientations where our organization has split points, and I think being able to bridge those is one of the biggest challenges.”
D3’s second day began with a session on data access, sharing and sensitivity, highlighted by a keynote by Argonne National Laboratory’s (ANL) Rachana Ananthakrishnan on federated science data at DOE and Globus, a software-as-a-service tool for research data management. LLNL computer scientist Daniel Gardner and Angeline Lee discussed architecture for digital engineering and NNSA’s program for sharing information across the complex, the Product Realization Integrated Digital Enterprise (PRIDE). Experts from other DOE national labs followed with talks on security and privacy concerns unique to DOE and NNSA.
The afternoon’s data-intensive computing session began with a keynote by ANL’s Arvind Ramanathan on combining generative AI with robotics and simulations for DOE applications, with a focus on biological data. Ramanathan described some of the ways ANL uses generative AI and robots to run autonomous experiments that design antimicrobial peptides capable of killing bacterial strains without harming host cells.
Subsequent talks in the session covered AI and ML technologies that are being used or could be used at the DOE national labs, including large language models and converged computing. One of the highlights of the session was a talk by Charles Doutriaux, a computer scientist in LLNL’s Weapon Simulation and Computing program. Doutriaux presented tools the program uses to enable workflows and decrease time to solution, such as Maestro and Merlin. Other tools used at LLNL include Sina, which increases the availability and discoverability of simulation data with minimal effort; Kosh, a tool built on top of Sina that lets users access and process data in a consistent fashion without worrying about format or path; and IBIS (Interactive Bayesian Inference and Sensitivity), which generates statistical models and other verification, validation and uncertainty quantification tools.
PNNL computer scientist Luanzheng “Lenny” Guo also discussed the next possible steps for DOE after HPC, in the form of converged computing: the coordination of distributed resources such as HPC, edge and cloud computing. Guo talked about work PNNL is doing with LLNL researchers on Resilience for Converged Computing Systems, including the Fault Tolerance 500 (FT-500) benchmark to identify shortcomings of converged computing and find bottlenecks in data movement.
LLNL research scientist Alexx Perloff then described the data challenges involved in creating the Geospatial Information System for Knowledge and Rapid Decisions (GISKARD), a multi-layered “digital twin” of the world used for conflict decision-making. Generating a comprehensive, low-latency dynamic map of the globe that incorporates static and dynamic elements such as terrain, atmosphere, regional properties, demographics and sensor data requires potentially hundreds of terabytes of data, making management and stewardship extremely complex, Perloff explained.
“We want to provide humans with some strategies to achieve their goals, and this is really important because in this day and age, the human is being bombarded with information and being asked to make decisions in a much smaller timeframe,” Perloff said. “Here at Lawrence Livermore National Laboratory, we are combining novel computational approaches, machine learning and high performance computing to achieve decision superiority.”
The final day of the event featured a session on cloud and hybrid data management, with talks on managing large-scale drug discovery data in hybrid environments by LLNL software and data architect Jie Deng and on modernizing the Earth System Grid Federation (ESGF) data platform by LLNL Center for Applied Scientific Computing computer scientist Sasha Ames.
Participants also held an engaging Q&A on the future of cloud data management at the national laboratories. In addition to the presentations, attendees met in small breakout groups to discuss questions and issues related to the themes covered throughout the workshop. Topics included the tools labs use to monitor, secure or otherwise reduce the risk of cloud environments; how to facilitate good practices for documenting AI-related data artifacts and access to potentially biased information; how to confidently use private data with open models; and how to use AI/ML in data curation processes.
The results of the discussions will be summarized in a briefing report, along with the workshop proceedings, and made available to the public in early 2024 via OSTI.gov. Details on next year’s DOE Data Days will be released at a future date.