Meta Announces AI Research SuperCluster: 16,000 GPUs, 1 Exabyte of Storage, 5 ExaFLOPS of Compute

Meta, formerly Facebook, has announced the AI Research SuperCluster (RSC), a system it designed and built and describes as among “the fastest AI supercomputers running today and will be the fastest AI supercomputer in the world when it’s fully built out in mid-2022.” At full scale, RSC is expected to deliver nearly 5 exaFLOPS of mixed-precision compute, the company said in a blog post announcing the system.

“Once we complete phase two of building out RSC, we believe it will be the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed precision compute,” Meta’s Kevin Lee, technical program manager, and Shubho Sengupta, software engineer, wrote in the announcement blog. “Through 2022, we’ll work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5x. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.”
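The scale-up figures in that statement can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the GPU counts come from the announcement, while the per-GPU peak of 312 teraFLOPS is an assumption taken from NVIDIA's published dense FP16/BF16 Tensor Core rating for the A100, not a number Meta states. The function names are hypothetical.

```python
# GPU counts quoted in Meta's announcement.
GPUS_PHASE_ONE = 6_080
GPUS_PHASE_TWO = 16_000

# Assumption (not from the article): NVIDIA's published peak for one A100,
# dense FP16/BF16 Tensor Core throughput, in teraFLOPS.
A100_PEAK_TFLOPS = 312


def gpu_scale_factor(before: int, after: int) -> float:
    """Ratio of the phase-two GPU count to the phase-one GPU count."""
    return after / before


def peak_exaflops(gpu_count: int, tflops_per_gpu: float) -> float:
    """Theoretical aggregate peak, converting teraFLOPS to exaFLOPS."""
    return gpu_count * tflops_per_gpu / 1_000_000


if __name__ == "__main__":
    scale = gpu_scale_factor(GPUS_PHASE_ONE, GPUS_PHASE_TWO)
    peak = peak_exaflops(GPUS_PHASE_TWO, A100_PEAK_TFLOPS)
    print(f"GPU count grows {scale:.2f}x")
    print(f"Phase-two theoretical peak: {peak:.3f} exaFLOPS")
```

Under that assumption, the GPU count grows about 2.63x, in line with the "more than 2.5x" claim, and the theoretical peak is 4.992 exaFLOPS, in line with "nearly 5 exaflops." These are peak figures; sustained training throughput would be lower.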

RSC is being built in partnership with Penguin Computing and Pure Storage. In its current incarnation, RSC comprises 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs, which communicate over an NVIDIA Quantum 200 Gb/s InfiniBand two-level Clos fabric with no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
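As a quick consistency check on the current configuration, the node, GPU, and storage figures above can be tallied as follows. This is a minimal sketch using only the numbers quoted in this article; the dictionary keys and function names are hypothetical labels.

```python
# Figures quoted for the current (phase-one) RSC configuration.
DGX_NODES = 760
TOTAL_GPUS = 6_080
STORAGE_TIERS_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Altus cache": 46,
    "Pure Storage FlashBlade": 10,
}


def gpus_per_node(total_gpus: int, nodes: int) -> int:
    """GPUs per compute node; raises if the totals do not divide evenly."""
    if total_gpus % nodes != 0:
        raise ValueError("GPU total is not a whole multiple of node count")
    return total_gpus // nodes


def total_storage_pb(tiers: dict[str, int]) -> int:
    """Sum of all storage tiers, in petabytes."""
    return sum(tiers.values())


if __name__ == "__main__":
    print(f"GPUs per DGX A100 node: {gpus_per_node(TOTAL_GPUS, DGX_NODES)}")
    print(f"Total storage today: {total_storage_pb(STORAGE_TIERS_PB)} PB")
```

The result is 8 GPUs per node, which matches the standard DGX A100 configuration, and 231 petabytes across the three storage tiers, well short of the exabyte-scale capacity targeted for the full build-out.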

Meta said early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure (see below), “have shown that it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks compared with nine weeks before.”

“Meta’s impressive plan aims to create not only one of the world’s fastest AI supercomputers, but one of the fastest supercomputers of any kind,” said Steve Conway, senior adviser, HPC Market Dynamics, at HPC industry analyst firm Hyperion Research. “General-purpose supercomputers are designed to run problems at up to 64-bit precision, while 16-bit precision is typically adequate for AI supercomputers. Meta’s news release doesn’t specify how fast the supercomputer under development will be, but the company has one supercomputer on the latest (November 2021) list of the world’s 500 most powerful supercomputers (www.top500.org), ranked number 139. I’d expect the full, mid-2022 version of Meta’s new supercomputer to improve that ranking among all types of supercomputers, while being at or near the top for the subcategory of AI supercomputers. Many important AI problems rely on some combination of GPUs and CPUs. While the news release doesn’t reveal the GPU/CPU ratio of the new supercomputer, Meta’s selection of Penguin Computing as its architecture partner makes me confident that the chosen ratio makes good sense.”
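Conway's distinction between 64-bit and 16-bit precision can be made concrete with a short sketch. The example below uses only the Python standard library's `struct` module, whose `'e'` format code packs an IEEE 754 half-precision (16-bit) value; the helper name `to_half` is hypothetical. It illustrates the general trade-off and says nothing about the specific mixed-precision formats Meta uses on RSC.

```python
import struct


def to_half(value: float) -> float:
    """Round a 64-bit Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    for sample in (1.0, 0.1, 2049.0):
        rounded = to_half(sample)
        # The error is zero when the value is exactly representable in 16 bits.
        print(f"{sample!r} -> {rounded!r} (error {abs(sample - rounded):.3g})")
```

Values such as 1.0 survive the round trip exactly, while 0.1 and 2049.0 do not, because half precision keeps only 11 significant bits. Many neural-network training workloads tolerate errors of this size, which is why lower-precision throughput is the headline figure for AI systems, while simulation codes often need the full 64 bits.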

Veteran HPC vendor Penguin Computing, whose market positioning is built around its “AI infrastructure” strategy (see insideHPC interview with Penguin CEO Sid Mair), is serving as Meta’s architecture and managed services partner for RSC. Penguin is working with Meta’s operations team on hardware integration to deploy the cluster and set up the control plane, Meta said.

Meta began exploring AI in 2013 with the formation of the Facebook AI Research Lab. It added to its HPC infrastructure in 2017, in partnership with Penguin and Pure, with a cluster incorporating 22,000 NVIDIA V100 Tensor Core GPUs that performs 35,000 training jobs daily with more than a trillion parameters on data sets, some of which are as large as an exabyte, according to the company.

“RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more,” wrote Lee and Sengupta. “Our researchers will be able to train the largest models needed to develop advanced AI for computer vision, NLP, speech recognition, and more. We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together. Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”