HPC DevOps: Powered by the Cloud


Bruce Moxon, Sr. HPC Architect, Microsoft Azure Global

Dynamic infrastructure, DevOps, and Infrastructure-as-code (IaC) – terms historically at odds with High Performance Computing – are finding new life in today’s Cloud. And the implications can be transformational, both for organizations and for applications long thought to be firmly embedded in on-premises infrastructure.

Efforts that once required a long process of collective requirements gathering, architectural compromises, and five-year capital budget commitments to procure a multi-departmental compute cluster can now be addressed more effectively – and more quickly – in a manner better suited to the needs of individual departments or projects.

Some of the recent Cloud trends contributing to this more agile, DevOps-oriented approach to HPC include:

Timely access to state-of-the-art server infrastructure.  High core counts, large high-bandwidth memory configurations, and the latest GPU and FPGA accelerator architectures are now available to fuel the most aggressive scientific, technical, and AI applications.  Microsoft’s recent announcement of its Azure HBv3 virtual machines based on the AMD EPYC™ 7003 series CPU ushers in a new era of “first day” availability of such hardware.
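
As a small illustration of this “first day” availability, the following sketch uses the Azure SDK for Python (azure-identity and azure-mgmt-compute) to list the HB-series VM sizes deployable in a region; the subscription ID and region shown are placeholders, not a recommended configuration.

```python
# Sketch: enumerate HPC-class (HB-series) VM sizes available in a region.
# Assumes azure-identity and azure-mgmt-compute are installed and that
# DefaultAzureCredential can reach the subscription; IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # placeholder
REGION = "southcentralus"                    # any region offering HB-series VMs

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# virtual_machine_sizes.list() returns the SKUs deployable in the region;
# filter for the HB family (e.g., HB120rs_v3) to see what is available today.
for size in compute.virtual_machine_sizes.list(location=REGION):
    if size.name.startswith("Standard_HB"):
        print(size.name, size.number_of_cores, "cores,",
              size.memory_in_mb // 1024, "GiB RAM")
```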

Integrated High Performance Networking.  These VMs are configurable with high-performance, low-latency networking such as NVIDIA Mellanox HDR 200 Gb/s InfiniBand within proximity placement groups that ensure consistent, minimal latency for tightly coupled (MPI) parallel HPC applications of the kind used in large-scale sensor processing and physical simulation, including seismic processing, structural analysis, and computational fluid dynamics.
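
For context, the sketch below shows the kind of tightly coupled communication pattern – a global reduction across all ranks – whose scalability is bounded directly by interconnect latency. It assumes an HPC VM image with mpi4py and NumPy installed; the script name and rank count in the comment are illustrative.

```python
# Sketch: a minimal tightly coupled MPI pattern (global reduction) of the kind
# that benefits from low-latency InfiniBand between co-located VMs.
# Assumes mpi4py on the VM image; launch with e.g.
#   mpirun -np 120 python allreduce_demo.py   (hypothetical script name)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a partial result; Allreduce makes every rank wait on
# every other rank, so network latency directly limits scalability.
local = np.full(1_000_000, rank, dtype=np.float64)
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print("ranks:", comm.Get_size(), "sum of first element:", total[0])
```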

High Performance Storage. High-performance managed storage offerings, such as Cray® ClusterStor™ in Azure and Azure NetApp Files, are readily available in many geographies.  And software-defined storage solutions built on high-performance SSD-based VMs, along with storage acceleration and caching solutions like Microsoft Azure HPC Cache, can be deployed nearly instantly in front of large, cost-effective object storage, in a model reminiscent of what LANL’s Gary Grider first termed “campaign storage”.

Software. HPC-ready software stacks in the Cloud include optimized VM images with the latest MPI and GPU libraries, NVIDIA Mellanox InfiniBand drivers, and high performance storage clients.  And many of the traditional HPC commercial packages now support pay-for-use SaaS licensing models, including ANSYS Cloud on Azure Marketplace.

Job Scheduling and Dynamic Clusters. Both traditional HPC job schedulers (e.g., Slurm, LSF, PBS) and cloud-native schedulers like Azure Batch are readily available and can be used in conjunction with dynamically scaled clusters built on Azure CycleCloud and Azure VM Scale Sets, providing access to thousands of cores nearly instantly, and only for the duration of the job.  And loosely coupled application architectures can further leverage spot instance pricing to stretch development dollars or trade cost for time-to-results.
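
A minimal sketch of this model using the Azure Batch SDK for Python follows: a fixed-size pool mixing a few dedicated nodes with cheaper low-priority (spot-priced) capacity, plus a job and a trivial task. The account URL, keys, pool and job IDs, and the VM image/agent SKU shown are illustrative placeholders, not a prescribed configuration.

```python
# Sketch: a Batch pool that mixes dedicated and low-priority (spot-priced)
# nodes, then a short job/task. Assumes the azure-batch package; all names,
# credentials, and the image/agent SKU below are placeholders.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

creds = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
client = BatchServiceClient(
    creds, batch_url="https://<account>.<region>.batch.azure.com")

# Pool: a handful of dedicated nodes plus cheaper low-priority capacity.
client.pool.add(batchmodels.PoolAddParameter(
    id="hpc-demo-pool",
    vm_size="Standard_HB120rs_v3",
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(      # illustrative image
            publisher="microsoft-azure-batch",
            offer="ubuntu-server-container",
            sku="20-04-lts",
            version="latest"),
        node_agent_sku_id="batch.node.ubuntu 20.04"),
    target_dedicated_nodes=2,
    target_low_priority_nodes=8))

# Job and task: the pool is only billed while work actually needs it.
client.job.add(batchmodels.JobAddParameter(
    id="hpc-demo-job",
    pool_info=batchmodels.PoolInformation(pool_id="hpc-demo-pool")))
client.task.add(
    job_id="hpc-demo-job",
    task=batchmodels.TaskAddParameter(
        id="task-1", command_line="/bin/bash -c 'hostname'"))
```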

Dynamic Deployment and Orchestration (IaC). Both cloud-specific deployment tools (e.g., Azure Resource Manager) and open-source, cross-cloud infrastructure-as-code approaches, including HashiCorp Terraform and Ansible, are increasingly being used to dynamically instantiate HPC infrastructure when needed, to scale it for production runs, and to relinquish it when no longer needed.
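
The same instantiate/scale/relinquish lifecycle can be expressed programmatically. The minimal sketch below uses the Azure SDK for Python (azure-mgmt-resource) as a stand-in for a full ARM template or Terraform plan: it creates a resource group to hold a short-lived cluster and deletes it, and everything in it, when the run is done. The subscription ID, resource group name, and region are placeholders.

```python
# Sketch: instantiate/relinquish lifecycle with the Azure SDK for Python, as a
# minimal stand-in for the ARM/Terraform deployments described above.
# Assumes azure-identity and azure-mgmt-resource; names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"
rm = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# 1. Stand up an empty resource group to hold a short-lived HPC cluster
#    (in practice a template deployment or Terraform plan would populate it).
rm.resource_groups.create_or_update("hpc-run-0421", {"location": "eastus"})

# ... deploy cluster, run jobs, collect results ...

# 2. Relinquish everything in one call when the run is finished, so nothing
#    keeps accruing cost between campaigns.
rm.resource_groups.begin_delete("hpc-run-0421").wait()
```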

Azure Global’s Specialized Workloads Customer Enablement team develops and publishes reference architectures and associated deployment scripts as examples of HPC-on-demand infrastructure.  These are available in GitHub repositories and can be used to seed customer POCs or serve as a foundation for HPC DevOps initiatives.  The team works closely with customers and ISV partners to accelerate development projects, and with Azure Global Engineering to drive additional product needs.

The first of these is available at http://github.com/Azure/azurehpc.  More application-specific scenarios are also in development, as is an extensible HPC on-demand platform.

Together, these developments are ushering in a new era of high performance computing with the speed and agility of the Cloud. A well-planned, cloud-native HPC strategy complements traditional on-premises HPC investments.  It leverages a DevOps model to improve access to the right infrastructure at the right time in the development cycle, to opportunistically incorporate rapid technology advances, and to accelerate innovation and improve time-to-results.

Author’s Bio

Bruce Moxon is a Senior HPC Architect in the Specialized Workloads Customer Enablement group, where he works with customers in deploying dynamic HPC infrastructure for applications in Finance and Life Sciences.

References

More performance and choice with new Azure HBv3 virtual machines for HPC | Azure Blog and Updates | Microsoft Azure

Cray ClusterStor in Azure | Microsoft Azure

Azure HPC Cache | Microsoft Azure

CycleCloud – HPC Cluster & Workload Management | Microsoft Azure

Batch – Compute job scheduling service | Microsoft Azure

Gary Grider (2016), 2016-grider.pdf (ahsc-nm.org)

ANSYS Cloud (microsoft.com)