By Preston Smith, Director of Research Services and Support at Purdue University
At Purdue University, our community cluster program has supported scientists from every corner of our campus since 2004, building upon decades of experience at Purdue in scientific computing. Batch high-performance computing remains our scientists’ bread and butter, with current community clusters delivering hundreds of millions of CPU-hours each year, but cloud technology, in the form of Azure’s AMD EPYC™ processor-based HBv3 instances, is providing us with new opportunities for innovation.
In the higher education HPC sector, the role of the commercial cloud in the nation’s “seamlessly integrated spectrum of [cyberinfrastructure] resources, tools, services” continues to be a hot topic. Previous work here at Purdue, a recent study from members of the Coalition for Academic Scientific Computation, the Chronicle of Higher Education, Virginia Tech, and the IceCube experiment have all contributed a great deal of analysis of the cloud’s role, largely focusing on the financial implications of cloud vs. on-premise supercomputers.
While understanding cost-effectiveness is a critical concern for academic center directors, I propose that cost considerations are not the primary reason to include the cloud in a center’s toolbox. Cloud technologies offer benefits in flexibility and access to new or niche capabilities, which make them an attractive way to augment our existing on-premise supercomputers, and they have served as an inspiration for how resources are bought and provisioned in our campus ecosystem, with a cloud-like “service at the click of a button”.
Increasingly, providing only batch computing is not enough. Our campus partners require access to an ever-broadening set of tools as they look to build and operate within a larger ecosystem around their high-performance computing workflows.
For example, a Purdue biochemist builds upon the RNA-seq pipelines that she runs on our community clusters by loading data into a NoSQL database for other tools to consume, uploading genomes to a genome browser, or even hosting container-based science gateways to provide collaborators with seamless access to her tools and data, all without her valuable research data ever leaving the campus ecosystem.
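To make the database step concrete, here is a minimal sketch of loading pipeline output into a NoSQL store, assuming a MongoDB instance hosted within the campus ecosystem; the input file layout, host, database, and collection names are all hypothetical placeholders rather than our biochemist’s actual tooling:

```python
# Illustrative sketch only: push per-gene RNA-seq counts into a campus-hosted
# MongoDB instance so that downstream tools can query them. The host, database,
# collection, and results.tsv layout are hypothetical placeholders.
import csv

from pymongo import MongoClient


def load_counts(tsv_path: str, mongo_uri: str) -> int:
    """Parse a tab-separated quantification file and insert one document per row."""
    client = MongoClient(mongo_uri)
    collection = client["rnaseq"]["gene_counts"]

    with open(tsv_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        docs = [
            {"gene_id": row["gene_id"], "sample": row["sample"], "count": int(row["count"])}
            for row in reader
        ]

    if docs:
        collection.insert_many(docs)  # downstream tools consume these documents
    return len(docs)


if __name__ == "__main__":
    inserted = load_counts("results.tsv", "mongodb://db.campus.example:27017")
    print(f"Inserted {inserted} gene-count documents")
```

A step like this would typically run as the final stage of the batch pipeline itself, which is what keeps the research data inside the campus network from end to end.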
With this experience in mind, our Purdue team proposed and was awarded a $10M grant from the National Science Foundation to build and operate the “Anvil” system for the national science and engineering community within the NSF’s XSEDE program. Equipped with AMD 3rd Gen EPYC processors, the CPU-based partition of Anvil will provide 1,000 nodes of HPC capacity for traditional batch high-performance computing. In addition, to better serve the increasingly diverse needs of modern researchers, Anvil will also provide easy-to-use interactive interfaces, innovative composable capabilities provided through an on-premise Kubernetes resource operated alongside the cluster, and an on-ramp to the cloud to access higher-level tools and HBv3 instances from Microsoft Azure.
Anvil was designed and proposed before the 3rd Gen EPYC processors existed, and the system’s anticipated performance was based on projections using 2nd Gen AMD EPYC processors. At the time of this writing, our team is hard at work on the preparation necessary for Anvil’s deployment at Purdue. The recent availability of HBv3 instances on Azure – based on the same 64-core AMD EPYC processors that will underpin Anvil – has given us an amazing opportunity to get a head start on application-level work before our Dell hardware even arrives.
Early access to 3rd Gen EPYC processors allows us to study the performance characteristics of key applications from the different scientific domains which we expect to support, validate the projections made during the design and procurement phase of Anvil, and launch into building and optimizing our user environments and software toolchains.
Based on early results from our applications team, the 3rd Generation EPYC processors which we have used in the Azure HBv3 VMs have performed largely in line with our projections, with some molecular dynamics applications even exceeding our expectations. We hope to present the detailed results of this work at various venues this spring and summer.
Looking ahead, after Anvil enters its production phase, we plan further work to continue leveraging Azure, HBv3, and the full family of H-series instances that offer InfiniBand in the cloud. First, we will use CycleCloud to tap the flexibility and higher-level services of Azure, offering burst capacity for loosely-coupled workloads through HBv3 instances directly from Anvil’s batch queues. At the application level, we will support access to Azure’s higher-level machine learning toolkits, as well as an on-ramp to the scale and flexibility of Azure’s Kubernetes service for scientists who build applications with Anvil’s composable subsystem.
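As a rough illustration of that application-level on-ramp, the sketch below submits a training run to Azure Machine Learning on HBv3-class compute using the Azure ML Python SDK. This is a sketch under assumptions rather than Anvil’s actual integration: the workspace configuration, the “hbv3-burst” compute-target name, train.py, and environment.yml are hypothetical placeholders.

```python
# Hypothetical sketch, not Anvil's production tooling: submit a training job
# to Azure Machine Learning on an auto-scaling HBv3-class compute target.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # reads a local config.json describing the workspace

# Reuse the compute target if it already exists; otherwise provision one.
if "hbv3-burst" in ws.compute_targets:
    target = ws.compute_targets["hbv3-burst"]
else:
    cfg = AmlCompute.provisioning_configuration(
        vm_size="Standard_HB120rs_v3",  # HBv3: AMD EPYC with InfiniBand
        min_nodes=0,                    # scale to zero between bursts
        max_nodes=4,
    )
    target = ComputeTarget.create(ws, "hbv3-burst", cfg)
    target.wait_for_completion(show_output=True)

env = Environment.from_conda_specification("train-env", "environment.yml")
run = Experiment(ws, "hbv3-demo").submit(
    ScriptRunConfig(source_directory=".", script="train.py",
                    compute_target=target, environment=env)
)
run.wait_for_completion(show_output=True)
```

Setting min_nodes=0 lets the compute target scale down to nothing between bursts, matching the pay-for-what-you-use model that makes cloud burst capacity attractive in the first place.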
This is an exciting time to assist the national community by enabling their science, and we look forward to forging the future of computing with Anvil and the ecosystem surrounding it.
Anvil is supported by the National Science Foundation under Grant No. 2005632. The Purdue/Azure HPC Center of Excellence was funded by Microsoft in partnership with AMD.