In this video from GTC 2019 in San Jose, Harvey Skinner, Distinguished Technologist, discusses the advent of production AI and how the HPE AI Data Node offers a building block for AI storage.
Commercial enterprises have been investigating and exploring how AI can improve their business. Now they’re ready to move from investigation into production. However, research or Proof of Concept (PoC) environments are relatively small, both in the number of servers and the amount of data. In contrast, a production AI environment may have racks of GPU servers in a complex clustered environment with large numbers of NVIDIA GPUs and multiple petabytes of data.
To reduce deep learning training time, companies will likely turn to distributed computing with GPU clusters. HPE partnered with NVIDIA, Mellanox, and WekaIO to run benchmarks exploring the throughput requirements of a distributed training environment, using four HPE Apollo 6500 Gen10 systems, up to 32 NVIDIA® Tesla® V100 SXM2 32 GB GPUs, Mellanox 100 Gb EDR InfiniBand, and the WekaIO Matrix file system. The benchmarks show that NVIDIA GPUs scale nearly linearly in both scale-up and scale-out configurations. Because throughput rises with the number of GPUs, high-performance storage and network fabrics are recommended, particularly for inferencing. See the link below to read about the benchmark results.
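To give a feel for why storage and fabric throughput matter at this scale, the sketch below shows minimal distributed data-parallel training in PyTorch. This is an illustrative assumption, not the benchmark code: each GPU runs one process, and gradients are synchronized across the cluster on every step, so both the interconnect and the data pipeline must keep pace as GPU counts grow. The model, batch size, and hyperparameters here are toy placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Toy model standing in for a real deep learning workload.
    model = nn.Linear(1024, 10).to(device)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Synthetic batch; a real job would stream data from shared storage,
        # which is where storage throughput becomes the bottleneck.
        inputs = torch.randn(64, 1024, device=device)
        labels = torch.randint(0, 10, (64,), device=device)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()   # gradients are all-reduced across all GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nnodes=4 --nproc_per_node=8 train.py`, this would span 32 GPUs across four nodes, mirroring the shape of the scale-out benchmark configuration described above.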
The HPE AI Data Node is an HPE reference configuration for a storage solution that provides both a capacity tier for data and a performance tier that meets the throughput requirements of GPU servers. The HPE Apollo 4200 Gen10 density-optimized data server provides the hardware platform for the WekaIO Matrix flash-optimized parallel file system and the Scality RING object store. This configuration combines both software stacks into a single cluster of nodes that is tested and optimized by HPE. The converged hybrid solution minimizes rack footprint, complexity, and power and cooling requirements, and allows faster deployment.
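To make the hot/cold data flow concrete: in the reference configuration, Matrix handles tiering between the flash tier and the object store, so no manual copying is needed. The sketch below only illustrates the general pattern of staging a dataset shard from an S3-compatible capacity tier (RING exposes an S3-compatible API) onto a fast POSIX mount before training. The endpoint, bucket, object key, and mount point are all hypothetical placeholders.

```python
import os
import boto3  # Scality RING exposes an S3-compatible API

# Hypothetical names for illustration only: the parallel file system is
# assumed to be mounted at /mnt/weka, and the endpoint, bucket, and key
# are placeholders, not part of the reference configuration.
FAST_TIER = "/mnt/weka/datasets"
S3_ENDPOINT = "http://ring.example.com"
BUCKET = "training-data"
KEY = "imagenet/shard-0001.tar"

def stage_to_fast_tier(key: str) -> str:
    """Copy an object from the capacity tier to the performance tier
    if it has not already been staged."""
    local_path = os.path.join(FAST_TIER, key)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3 = boto3.client("s3", endpoint_url=S3_ENDPOINT)
        s3.download_file(BUCKET, key, local_path)
    return local_path

if __name__ == "__main__":
    path = stage_to_fast_tier(KEY)
    # GPU training jobs then read shards from the flash tier at
    # parallel-file-system throughput.
    print(f"dataset shard available at {path}")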
To learn about the throughput requirements for production AI, read about our distributed AI benchmarks in this gated white paper: Accelerate performance for production AI.
To learn more about the HPE AI Data Node, please read the white paper: HPE Reference Configuration for HPE AI Data Node.