John Spiers, Chief Strategy Officer, Liqid
Resource utilization, and the soaring costs that surround it, is a constant push and pull for IT departments. With the emergence of AI and machine learning, utilization is more front and center than it has ever been. Managing legacy hardware in a hyperconverged environment the way you always have will not cut it, because the people and hardware costs associated with these extremely demanding workloads are tremendous. Intelligent fabrics and composable infrastructure software solve the problem by letting IT organizations pool and deploy hardware resources to match the workload at hand, then redeploy them as required, yielding a balanced system that can address the demands of AI and machine learning.
No Balance for a Broken System
Despite countless hours of upfront resource planning, IT departments find themselves monitoring critical systems for potential performance issues, troubleshooting or reconfiguring systems based on usage patterns, and rebalancing resources to optimize performance and availability. System administrators spend still more time deploying and configuring physical servers, creating and managing VMs, and mapping and visualizing the relationships between physical servers and virtual machines. Over- and under-utilized resources are common, and managing more complexity with more people means ever-expanding IT budgets.
The economics of this legacy infrastructure model drive IT leaders to explore public cloud options, with their promise of eliminating capital equipment purchases and hardware management and reducing IT administration costs by leveraging the cloud's inherent ability to dynamically scale resources.
The reality is that IT departments find themselves right back in the same mode: troubleshooting issues in what feels like a black box, managing cloud software policies and limits, and making stepwise adjustments. Costs soar from applications and testbeds that are abandoned and forgotten, along with all the other services that were spun up but never powered down. Conversely, if a cloud service deployed within a company becomes popular and well-used, it can quickly be over-utilized, consuming ever-growing amounts of storage and compute. When the bill shows up, sticker shock sets in, and the IT department shifts into the mode of optimizing cloud resource consumption to lower costs.
AI+ML Hit the Wall of Traditional Architecture
Whether it’s private or public cloud, resource management and planning in a business environment with rapidly growing, variable workloads is always a challenge. There is tremendous potential in having AI+ML learn workload patterns and their related security and performance requirements, then dynamically configure resources on demand to deliver on them as service-level commitments. The goal is for AI and machine learning applications to take over the tasks typically performed by system administrators, becoming predictive rather than reactive to constantly changing requirements.
However, advancements in AI+ML operational models are constrained by the limits of the underlying physical resources. For example, when a server runs out of memory, it’s difficult to add more, and most applications aren’t designed to consume memory from more than one physical server at a time. There is very little an AI algorithm can do to dynamically provision resources when it is roadblocked by the limitations of the resources themselves. The same is true for networking, storage, and accelerators.
Clustering technology helps solve these problems, but most applications don’t dynamically scale to take advantage of a growing cluster of server nodes; they are typically relegated to the resources inside a single physical server. Data can be striped across servers, but network latency adds overhead, resolving the reliability and availability issues that come with failing devices gets complex, and not knowing where your data physically resides can easily become a regulatory compliance nightmare.
Resources Revolutionized: Intelligent Fabrics and Composable Software
For the full potential of AI+ML to be achieved in IT, the physical constraints of IT infrastructure must be removed. Consider allocating GPU resources for AI+ML workloads: a typical standard server is physically constrained and may accommodate only two GPUs. If GPUs are spread across multiple servers, peer-to-peer computing is hampered by network latency and by the difficulty of getting GPUs on different nodes to communicate in software.
With composable infrastructure, a pool of disaggregated, bare-metal GPUs can be assigned to a single server through software, without regard for that server’s physical constraints, and peer-to-peer functionality is easily achieved within the pool over a high-speed, low-latency intelligent fabric. GPU resources can be managed in tandem with NVMe, CPU, NIC, FPGA, and other PCIe-deployed accelerator technologies. Once a task is complete, the resources can be reconfigured for a different application. This can all be done on demand and automated through APIs, delivering a more balanced system for AI+ML that takes advantage of every hardware resource at hand.
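To make the API-driven workflow concrete, here is a minimal sketch of what composing and releasing GPUs might look like in practice. The fabric-manager endpoint, resource names, and payloads below are hypothetical placeholders for illustration, not any specific vendor’s API.

import requests

# Hypothetical fabric-manager endpoint; not a real product URL.
FABRIC_API = "https://fabric-manager.example.com/api/v1"

def compose_gpus(server_id: str, gpu_count: int) -> list[str]:
    """Pull free GPUs from the disaggregated pool and attach them
    to the target server over the PCIe fabric."""
    free = requests.get(f"{FABRIC_API}/pool/gpus",
                        params={"state": "free"}).json()
    selected = [gpu["id"] for gpu in free[:gpu_count]]
    requests.post(f"{FABRIC_API}/servers/{server_id}/attach",
                  json={"resources": selected}).raise_for_status()
    return selected

def release_gpus(server_id: str, gpu_ids: list[str]) -> None:
    """Return GPUs to the pool once the job is done, so another
    workload can compose them."""
    requests.post(f"{FABRIC_API}/servers/{server_id}/detach",
                  json={"resources": gpu_ids}).raise_for_status()

# Example: grow a training node to 8 GPUs for a job, then give them back.
gpus = compose_gpus("server-42", gpu_count=8)
try:
    pass  # launch the training job against the newly attached GPUs
finally:
    release_gpus("server-42", gpus)

In a real deployment the same calls would be driven by an orchestrator or scheduler rather than a script, but the shape of the workflow (attach, run, detach) stays the same.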
This strategy also allows for disaggregated growth, meaning you add only the hardware you need, when you need it, versus deploying tightly bundled solutions that include resources you don’t need.
Composable infrastructure thus removes the physical constraints of server silos and enables servers to be built, or composed, on the fly by AI+ML algorithms to meet workload demand. Composable software can assemble servers from pools of memory, storage, accelerators, and network devices on demand, so AI+ML algorithms are no longer constrained by whatever hardware happens to sit in a given physical server. As composable technology becomes the standard way to deploy infrastructure, AI+ML software becomes resource- and environment-aware, allowing infrastructure to be managed intelligently in software.
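As a rough illustration of that closing idea, the sketch below shows a control loop that recomposes hardware as demand shifts. The predict_gpu_demand stand-in and the telemetry stub are assumptions made for illustration, and compose_gpus and release_gpus are the hypothetical helpers from the earlier sketch.

import random
import time

def read_job_queue_depth() -> int:
    # Stub for real telemetry, e.g. a scheduler's pending-job count.
    return random.randint(0, 16)

def predict_gpu_demand(history: list[int]) -> int:
    # Stand-in for an ML model: a moving average of recent queue depth,
    # rounded to whole GPUs with a floor of one.
    recent = history[-5:]
    return max(1, round(sum(recent) / len(recent)))

attached: list[str] = []  # GPU IDs currently composed to the server
history: list[int] = []

while True:
    history.append(read_job_queue_depth())
    want = predict_gpu_demand(history)
    if want > len(attached):
        # Grow: pull the shortfall from the disaggregated pool.
        attached += compose_gpus("server-42", want - len(attached))
    elif want < len(attached):
        # Shrink: return surplus GPUs so other workloads can use them.
        release_gpus("server-42", attached[want:])
        attached = attached[:want]
    time.sleep(60)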