Tech giants Lenovo, NetApp & NVIDIA team up to get small
Experience tells us that there is a relationship between organizational size and technology adoption: larger, more resource-rich enterprises generally adopt new technologies first, while smaller, more resource-constrained organizations follow afterward (provided that the small organization isn’t in the technology business). This pattern has repeated itself for multiple generations across a myriad of technologies. However, once smaller organizations get hold of a technology, their creativity can drive it in ways that nobody ever imagined. Case in point: the personal computer, which was originally deployed at large companies primarily to do word processing and spreadsheets, but within a decade was being used to compose music, control buildings, front-end complex medical devices, and run thousands of other applications. Lenovo, NetApp and NVIDIA have teamed up to help drive Artificial Intelligence (AI) into smaller organizations and hopefully seed that creative garden.
NetApp uses a nifty description of AI as a data pipeline. Since we’re partners, I’m going to shamelessly steal that description. The pipeline starts with (1) data creation and ingestion at the edge. The data is then (2) moved to a centrally located cleaning and preparation stage, where it is aggregated for (3) the training stage, which is the most resource-intensive stage of the process. If the data were crude oil, think of training as the refinery that turns it into gasoline. Finally comes (4) the deployment and inference stage, where the trained model is sent back out to the edge to run in inference mode and to collect more data, restarting the pipeline with ingestion.
Our solution is focused on that critical third stage, training. It is a reference architecture (RA) that combines Lenovo ThinkSystem servers equipped with NVIDIA GPUs and Lenovo ThinkSystem storage based on NetApp technology. By providing customers and partners with a “best recipe”, Lenovo, NetApp and NVIDIA have taken the guesswork out of configuring and optimizing a training platform. It can work as a single, scale-up instance where multiple users run jobs on individual nodes using the shared storage, or as a multiple-node, scale-out cluster where jobs are executed sequentially across all of the nodes, with the nodes accessing the shared storage simultaneously. This is a key distinction because most smaller organizations will start with scale-up, but may eventually migrate toward a scale-out approach.
Training, specifically Deep Learning, which relies on neural networks to do the training, can require anywhere from hundreds of GBs up to PBs of storage. During the cleaning and prep phase, the data is assembled into large, pre-packaged files such as TFRecords (TensorFlow Records), which are then read through sequentially. Crucial to any workload that utilizes GPUs is keeping them fed with data to process. This makes system-wide throughput critical to keeping all of the compute resources humming.
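To make the idea of pre-packaged files concrete, here is a minimal Python sketch of packing many small samples into one file and reading them back in a single sequential pass. It is an illustration only: the length-prefixed layout below is a simplified stand-in and is not the actual TFRecord format, and the names `write_records`, `read_records`, and `train.records` are hypothetical.

```python
import os
import struct
import tempfile


def write_records(path, records):
    """Pack byte strings into one file, each preceded by an 8-byte length."""
    with open(path, "wb") as f:
        for payload in records:
            f.write(struct.pack("<Q", len(payload)))  # little-endian uint64
            f.write(payload)


def read_records(path):
    """Yield the payloads in order using one front-to-back pass over the file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated record")
            yield payload


# Stand-ins for cleaned and prepared training samples (hypothetical data).
samples = [b"image-0", b"image-1", b"image-2"]

with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "train.records")
    write_records(packed, samples)
    restored = list(read_records(packed))

print(restored == samples)
```

Because the reader never seeks, the storage system sees one long streaming read per file instead of many small random reads, which is the access pattern the paragraph above describes.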
Because the DM5000F is all-flash storage, and the SR670 is designed to keep data moving to the GPUs, the benchmark results matched those obtained with local storage on the nodes, even when using a 10GbE fabric. The DM5000F combined with the ONTAP software also provides data management and protection capabilities, such as deduplication and volume-level encryption, that local storage on the nodes does not have.
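For a feel of what “keeping the GPUs fed” means in numbers, the back-of-the-envelope Python sketch below estimates the aggregate read rate a training job needs and compares it with a 10 Gb/s link. The GPU count, per-GPU sample rate, and sample size are made-up illustrative values, not figures from the RA benchmarks, and the function names are hypothetical.

```python
def required_throughput_gbps(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read rate, in gigabits per second, needed to keep every GPU busy."""
    bytes_per_sec = num_gpus * samples_per_sec_per_gpu * bytes_per_sample
    return bytes_per_sec * 8 / 1e9


def link_utilization(required_gbps, link_gbps=10.0):
    """Fraction of the link's nominal bandwidth the job would consume."""
    return required_gbps / link_gbps


# Assumed workload: 4 GPUs, 1,000 samples/s each, 110 KB per sample.
need = required_throughput_gbps(
    num_gpus=4, samples_per_sec_per_gpu=1000, bytes_per_sample=110_000
)
share = link_utilization(need)

print(f"required: {need:.2f} Gb/s, share of a 10 Gb/s link: {share:.0%}")
```

Plugging in your own model’s sample rate and sample size gives a quick first check on whether the storage and fabric, rather than the GPUs, are likely to become the bottleneck. Nominal link speed is an upper bound; real-world throughput will be somewhat lower.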
The major components of the RA are:
The Lenovo ThinkSystem SR670 is a two-socket system designed to house either four large or eight small GPUs in its 2U chassis. The award-winning SR670 was built on the concept of “Integrated Modularity”, which eschews hard-wired PCI slots for flexible PCI cabling to allow customers to dynamically configure how their CPUs connect to the GPUs as needed.
NVIDIA Tesla V100 32GB or T4 Tensor Core GPUs deliver the computational acceleration needed. According to NVIDIA, a single V100 delivers the performance of up to 100 CPUs. The T4 is specifically designed for scale-out workloads like inference and edge applications, in a small form-factor package.
The Lenovo ThinkSystem DM5000F All Flash Array based on NetApp technology is a unified, all flash storage system that is designed to provide performance, simplicity, capacity, security, and high availability. Powered by NetApp’s ONTAP software, the ThinkSystem DM5000F delivers enterprise-class storage management capabilities with a wide choice of host connectivity options and enhanced data management features.
The above base configuration was chosen as the most economical combination of server, storage, and fabric for the workload. Our teams also tested multiple configurations, varying the components and rerunning the benchmark. This was done for customers who want to run multiple types of workloads on the same resources or want to use other components for higher performance or additional features. This could mean a different server, such as a ThinkSystem SR650, or an alternate storage controller, like the DM7000F.
The technical details of the RA, including the benchmarking results and all the substitute components that were tested, can be found at this link: https://lenovopress.com/lp1348.pdf. Additionally, we have a video presentation in conjunction with NetApp titled “Affordable AI: Practical Advice for Small and Medium Businesses” which is posted here:
Because affordability was a key factor in the development of the RA, small and medium enterprises, as well as departments within larger organizations, should find the solution within budgetary reach. This should give customers a solid, high-performance base that can scale up or out as needed and start them on their AI journey. As our experience tells us, smaller organizations can drive amazing changes when their creativity is unleashed with technologies. If they didn’t, when we turn on our laptops every morning, we would all still be staring at this: C:\
Authors:
David Arnette, Principal Technical Marketing Engineer, NetApp
Miroslav Hodak PhD, Global AI Architect, Lenovo
Patrick Moakley, Director of Marketing, HPC & AI, Lenovo