By Tim Miller, Vice President, Product Marketing, One Stop Systems
Most AI inferencing takes place outside the datacenter, at the edge, where data is sourced and inferencing queries are generated. Inferencing effectiveness is measured by the speed and accuracy with which answers are delivered, and many applications demand real-time response. To meet the objectives of these applications, a very large number of inferencing queries must be serviced simultaneously, and often many different inferencing models answering different types of queries must operate and coordinate in parallel.
Autonomous trucking is a prime example. Achieving Level 4 autonomy (no driver) in trucks requires powerful AI inference hardware supporting many different inferencing engines operating and coordinating simultaneously. A long-haul truck leaving Boston with cargo destined for the west coast will spend two days operating autonomously across the country, saving its owner cost and time. Driving day and night, it will encounter a range of variable weather and traffic conditions, as well as unexpected events like animal crossings, accidents, construction detours or debris in the road.
The truck is outfitted with many sensors, including lidar (light detection and ranging), cameras and radar, along with a powerful, rugged AI inferencing compute server. Data continuously streams in from each independent sensor, and the captured data generates thousands of queries per second to the on-board inference engines. The resulting responses are synthesized to answer three questions: What is the truck seeing? Is it normal? What is the correct response?
The first task is environmental awareness: understanding what is going on around the truck and how that relates to where it is located. The truck checks its acquired understanding against the expected view derived from GPS mapping coordinates. The next step is decision making. Based on the perceived world, what actions are required? Thousands of micro-decisions need to be made, such as whether the truck needs to turn, accelerate or brake. Finally, the action stage is where instructions are relayed to the steering and braking systems, with fine-tuned adjustments streaming in continuously. Different inference models are tasked with different sets of queries, some answers generate new queries to different engines, and all of it operates in real time.
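To make the flow concrete, the minimal Python sketch below shows one way such a perception-decision-action pipeline could be organized, with several inference stages running concurrently and the output of one stage generating queries for the next. The model functions and their return values are placeholders invented for illustration, not any vendor's API.

```python
import asyncio

# Hypothetical stand-ins for trained inference models; in practice each would
# wrap a GPU-resident network (object detection, scene classification, planning, ...).
async def detect_objects(frame):      return {"objects": ["car", "debris"]}
async def classify_scene(frame):      return {"normal": False}
async def plan_maneuver(perception):  return {"action": "brake", "intensity": 0.4}

async def perception_stage(sensor_q, decision_q):
    # Each incoming sensor frame fans out to several models running concurrently.
    while True:
        frame = await sensor_q.get()
        objects, scene = await asyncio.gather(detect_objects(frame),
                                              classify_scene(frame))
        await decision_q.put({**objects, **scene})

async def decision_stage(decision_q, action_q):
    # Perception results generate new queries to the planning engine.
    while True:
        perception = await decision_q.get()
        await action_q.put(await plan_maneuver(perception))

async def action_stage(action_q):
    # Fine-tuned adjustments are relayed to the steering and braking systems.
    while True:
        command = await action_q.get()
        print("actuate:", command)

async def main():
    sensor_q, decision_q, action_q = (asyncio.Queue() for _ in range(3))
    stages = [perception_stage(sensor_q, decision_q),
              decision_stage(decision_q, action_q),
              action_stage(action_q)]
    # Simulate a short burst of incoming sensor frames.
    for i in range(5):
        await sensor_q.put({"frame_id": i})
    await asyncio.wait([asyncio.create_task(s) for s in stages], timeout=1.0)

asyncio.run(main())
```

In a real system each stage would be a separate query stream to a dedicated inference engine on the GPU; the point of the sketch is simply that the stages run simultaneously and feed one another rather than executing as a single serial query.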
The operation of the autonomous truck depends not on a single inference query but on thousands, all requiring simultaneous processing. The system is made up of many inference models, each holding a different piece of the puzzle, that need to run in parallel. A powerful GPU-based compute server is ideal for handling all of this with low latency, but it must be capable of supporting many inference models running concurrently.
Today a single high-end GPU from NVIDIA can provide over 600 TOPS (trillion operations per second), which is the measure of performance relevant for inferencing. But it would be inefficient to dedicate this entire GPU to a single inference model, which even when servicing thousands of queries per second will consume only a fraction of this performance. To optimize utilization, it is essential to partition the GPU into independent elements that can each execute its own dedicated inferencing engine. NVIDIA provides Multi-Instance GPU (MIG) technology that allows a single A100 to be partitioned into up to seven GPU instances, each with its own dedicated compute cores and memory. To fully optimize utilization, it is also useful to fine-tune the resources allocated to each inferencing engine. Here, software-enabled GPU partitioning capabilities that complement NVIDIA MIG are beneficial. For example, fractional GPU capabilities from Run:AI are built on a Kubernetes foundation and allow dynamic, software-controlled partitioning of the GPUs, with each partition sized appropriately for a specific inference model. Not surprisingly, the more GPU power that can be efficiently utilized, the more sophisticated the response capability.
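As a rough sketch of how such partitioning is consumed by the inference software: once MIG instances have been created with NVIDIA's tooling, each instance is listed by nvidia-smi with its own UUID, and a worker process can be restricted to one instance by setting CUDA_VISIBLE_DEVICES before it creates a CUDA context. The parsing below assumes the typical nvidia-smi -L output format, and load_model / serve_queries are hypothetical placeholders for whatever inference runtime is in use.

```python
import os
import subprocess
from multiprocessing import Process

def list_mig_uuids():
    # `nvidia-smi -L` lists GPUs and any MIG instances; MIG lines look like
    # "  MIG 1g.5gb Device 0: (UUID: MIG-xxxxxxxx-....)".
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
    return [line.split("UUID: ")[1].rstrip(")") for line in out.splitlines()
            if "MIG-" in line and "UUID:" in line]

def load_model(name):
    # Placeholder: load a TensorRT/ONNX/PyTorch model onto the visible MIG slice.
    return name

def serve_queries(model):
    # Placeholder: pull queries from the on-board pipeline and run inference.
    print(model, "serving on", os.environ.get("CUDA_VISIBLE_DEVICES"))

def inference_worker(mig_uuid, model_name):
    # Pin this process to a single MIG instance before any CUDA work begins.
    os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid
    serve_queries(load_model(model_name))

if __name__ == "__main__":
    # One dedicated inferencing engine per MIG slice, all running in parallel.
    models = ["object_detection", "lane_estimation", "traffic_sign", "planning"]
    workers = [Process(target=inference_worker, args=(uuid, name))
               for uuid, name in zip(list_mig_uuids(), models)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()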
Whereas AI training will consume entire GPUs for hours or even days, AI inferencing requires powerful GPUs partitioned effectively for real-time responsiveness. Inference engines in an IoT device can run on a low-power embedded processor or GPU, but long-haul autonomous trucking requires efficiently utilized high-end GPUs with many inferencing engines running simultaneously.