Sponsored Post
Today’s IT organizations must maximize resource utilization to deliver computing capabilities when and where they are needed. This has led many organizations to build multi-purpose clusters, which often compromises performance.
Even worse from an ROI perspective, once resources are no longer required for a particular project, they often cannot be redeployed to another workload with precision and efficiency. Composable disaggregated infrastructure (CDI) can hold the key to solving this optimization problem while also delivering bare-metal performance.
What is CDI?
At its core, CDI is the concept of using a set of disaggregated resources connected by an NVMe over Fabrics (NVMe-oF) solution so that you can dynamically provision hardware at any scale. This infrastructure design provides the flexibility of the cloud and the value of virtualization with the performance of bare metal. Because it decouples applications and workloads from the underlying hardware, CDI lets you run diverse workloads on a single cluster while still optimizing for each workload, and it even supports multi-tenant environments.
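To make the idea concrete, here is a minimal sketch of what dynamic provisioning against a composability manager could look like. The host name, endpoint paths, and payload fields are all invented for illustration; real CDI platforms (such as those discussed below) each have their own APIs.

```python
# Illustrative sketch only: composes a bare-metal node from disaggregated
# resource pools via a hypothetical CDI management REST API. The host name,
# endpoints, and payload fields are assumptions, not a real vendor API.
import requests

CDI_API = "https://cdi-manager.example.com/api/v1"  # hypothetical manager

def compose_node(name: str, cpus: int, gpus: int, nvme_tb: int) -> dict:
    """Request a bare-metal node assembled from pooled CPU, GPU, and NVMe."""
    spec = {
        "name": name,
        "cpu_count": cpus,            # CPUs drawn from the compute pool
        "gpu_count": gpus,            # GPUs attached over the fabric
        "nvme_capacity_tb": nvme_tb,  # NVMe-oF storage bound to the node
    }
    resp = requests.post(f"{CDI_API}/nodes", json=spec, timeout=30)
    resp.raise_for_status()
    return resp.json()

def release_node(node_id: str) -> None:
    """Return the node's devices to their pools for the next workload."""
    requests.delete(f"{CDI_API}/nodes/{node_id}", timeout=30).raise_for_status()

# Example: a GPU-heavy node for a training run, released when the job ends.
node = compose_node("ml-train-01", cpus=2, gpus=4, nvme_tb=8)
release_node(node["id"])
```

The point of the sketch is the lifecycle: resources are bound to a node only for as long as a workload needs them, then returned to the pool.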
Software providers often used in CDI-based clusters include Liqid and GigaIO. Liqid Command Center™ is a powerful management software platform that dynamically composes physical servers on demand from pools of bare-metal resources. GigaIO FabreX is an enterprise-class, open-standard solution that enables complete disaggregation and composition of all resources in the rack.
What are the technical and business benefits of clusters that include CDI?
The disaggregated resources in CDI allow you to dynamically provision clusters with best-fit hardware, without the performance penalty you would see in a cloud-based environment. For HPC and AI, the value of CDI comes from the flexibility to match the underlying hardware to different workloads and environments. This makes CDI more cost-effective and scalable than cloud services, improving ROI and lowering total cost of ownership.
For AI and HPC workloads, performance is still the top priority, and on-premises hardware provides better performance, with the ability to burst to the cloud on an as-needed basis. A well-designed cluster built with commercial off-the-shelf (COTS) hardware and connected with PCIe, Ethernet, and InfiniBand can increase the utilization, flexibility, and effective use of valuable data center assets. On average, organizations that implement CDI realize a 2x to 4x increase in data center resource utilization.
Beyond optimizing resource allocation, CDI also provides several additional benefits for your dynamically configured system:
- Support for multiple workloads with different technical requirements, without major administrative effort
- Cost-effective, scalable performance to support workloads beyond the capabilities of cloud service providers
- Future-proofing against strategic direction changes, new project requirements, and other shifts
- A central resource for a diverse user base and workload set
What are ideal use cases for CDI?
A wide variety of technology areas can benefit from CDI. These include:
- Multi-tenant environments
- HPC and simulation
- AI and machine learning (ML)
- Cloud-like computing
- Engineering and visualization
- VFX and digital production
For deep learning, it is best to keep clusters on-premises because on-premises computing can be more cost-effective than cloud-based computing when highly utilized. It’s also advisable to keep primary storage close to on-premises compute resources to maximize network bandwidth while limiting latency.
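As a sanity check on that cost claim, the back-of-the-envelope calculation below estimates the utilization level at which an on-premises GPU undercuts a cloud-rented one. Every figure in it (cloud hourly rate, hardware cost, operating cost, lifetime) is an assumed placeholder; substitute your own numbers.

```python
# Back-of-the-envelope break-even check for the "on-prem beats cloud when
# highly utilized" claim. All prices and lifetimes are illustrative
# assumptions, not quotes.
CLOUD_GPU_HOURLY = 3.00        # assumed cloud price per GPU-hour (USD)
ONPREM_GPU_CAPEX = 30_000.00   # assumed per-GPU hardware cost (USD)
ONPREM_OPEX_HOURLY = 0.50      # assumed power/cooling/admin per GPU-hour
LIFETIME_YEARS = 4

hours = LIFETIME_YEARS * 365 * 24
onprem_hourly = ONPREM_GPU_CAPEX / hours + ONPREM_OPEX_HOURLY

# Utilization above which on-prem cost per *useful* GPU-hour undercuts cloud:
breakeven_utilization = onprem_hourly / CLOUD_GPU_HOURLY
print(f"On-prem effective rate: ${onprem_hourly:.2f}/GPU-hour at full utilization")
print(f"Break-even utilization vs. cloud: {breakeven_utilization:.0%}")
```

Under these assumed figures, on-premises GPUs become the cheaper option once they stay busy more than roughly 45% of the time, which is exactly the utilization problem CDI is designed to solve.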
What are the key components of a CDI cluster?
There are two critical factors in deploying a successful CDI-based cluster. The first is a design that properly integrates leading-edge CDI software.
As mentioned above, two software platforms often used in CDI clusters are Liqid Command Center and GigaIO FabreX. Both are technologies Silicon Mechanics has worked with and uses in its CDI-based clusters.
Liqid Command Center is fabric management software for bare-metal machine orchestration. Command Center provides:
- Policy-based automation and dynamic provisioning of resources
- Advanced cluster, machine, and device statistics and monitoring
- Scalable architecture supporting high availability (HA)
- Multiple control methods, including a GUI and a RESTful API (see the sketch after this list)
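As a concrete illustration of what policy-based automation over such an API can look like, here is a minimal rebalancing loop that hands free GPUs to the machines with the largest deficit. The endpoints, payload fields, and demand metric are invented placeholders, not Liqid's actual API.

```python
# Minimal sketch of policy-based reallocation, the style of automation a
# fabric manager exposes. Endpoints and fields are hypothetical.
import requests

API = "https://fabric-manager.example.com/api"  # hypothetical manager

def gpu_deficit(machine: dict) -> int:
    """Hypothetical per-machine metric: GPUs requested minus GPUs attached."""
    return machine["gpus_requested"] - machine["gpus_attached"]

def rebalance() -> None:
    machines = requests.get(f"{API}/machines", timeout=30).json()
    free_gpus = requests.get(f"{API}/pools/gpu/free", timeout=30).json()
    # Simple policy: serve the machines with the largest deficit first.
    for m in sorted(machines, key=gpu_deficit, reverse=True):
        while gpu_deficit(m) > 0 and free_gpus:
            gpu = free_gpus.pop()
            requests.post(
                f"{API}/machines/{m['id']}/devices",
                json={"device_id": gpu["id"]},
                timeout=30,
            ).raise_for_status()
            m["gpus_attached"] += 1

rebalance()
```

In production, a policy like this would run continuously or on job-scheduler events rather than as a one-shot script.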
GigaIO FabreX is an open-standard solution that allows you to use your preferred vendor and model for servers, GPUs, FPGAs, storage, and any other PCIe resource in your rack. In addition to composing resources to servers, FabreX can compose servers over PCIe, enabling true server-to-server communication and cluster-scale computing, with direct memory access by an individual server to the system memory of every other server in the cluster fabric.
High-performance, low-latency networking, such as InfiniBand from NVIDIA Networking, is the second critical element in how CDI operates. It's possible to disaggregate just about everything: compute (Intel, AMD, FPGAs), data storage (NVMe, SSD, Intel Optane, etc.), and GPU accelerators (NVIDIA GPUs). You can rearrange those components however you see fit, but the network underneath them stays the same. Think of networking as a fixed resource with a fixed effect on performance, in contrast to the resources being disaggregated.
It is important to plan an optimal network strategy for a CDI deployment. InfiniBand is ideal for large-scale or high-performance clusters, while Ethernet is a strong choice for smaller ones. If you expand over time, that underlying network is already in place to support whatever comes up in the lifecycle of the system.
How can CDI help handle demanding HPC and AI workflows?
Today, many organizations run demanding and complex workflows, such as HPC and AI, that require massive levels of costly resources. This drives IT departments to find flexible and agile solutions that effectively manage the on-premises data center while delivering the flexibility typically provided by the cloud. CDI is quickly emerging as a compelling option to meet the demands for deploying applications that incorporate advanced technologies.
Silicon Mechanics is an engineering firm providing custom, best-in-class solutions for HPC/AI, storage, and networking, based on open standards. The Silicon Mechanics Miranda CDI Cluster is a Linux-based reference architecture that provides a strong foundation for building disaggregated environments.
Get a comprehensive understanding of CDI clusters and what they can do for your organization by downloading the insideHPC white paper on CDI.