By Sree Ganesan, d-Matrix.ai
[CONTRIBUTED THOUGHT PIECE] Generative AI is unlocking remarkable opportunities for business efficiency, but a formidable challenge still undermines widespread adoption: the exorbitant cost of running inference.
You’ve heard of the staggering expenses incurred in training Large Language Models (LLMs): the multitude of GPUs, the enormous electricity bills. Analysts estimate Meta may spend $15 billion on GPUs in 2024. Generative AI demands copious memory and bandwidth for weight calculations and data handling, presenting a major obstacle to running models at scale. As OpenAI’s Sam Altman put it, “there’s no way to get there without a breakthrough. It motivates us to go invest more in fusion.”
CPUs, GPUs and many custom-designed accelerators have been the main choices for AI so far, but even the most advanced solutions are slowed by the traditional von Neumann architecture. Custom ASICs, FPGAs and APUs offer specialized controllers with higher memory bandwidth, yet they still require large amounts of RAM to hold the model, making generative AI almost impossible to deliver economically.
Despite these costs and compute limitations, enterprises will soon deploy multiple AI models broadly, expanding the demand for inference and calling for another significant increase in compute power. Generative AI inference requires more compute and memory capacity because today’s models are far larger than the non-generative ML models of the past.
The compute required for inference depends not just on model size but also on the length of the user’s input prompts. For instance, when the prompt length increases from 8K to 32K tokens, the cost of operating OpenAI’s GPT-4 doubles. Meanwhile, the speed at which the model generates each output token is a key user-experience metric, and that speed is governed by memory bandwidth. The bandwidth limitations of legacy architectures therefore exacerbate both the cost and the power pain points of inference.
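To make that concrete, here is a back-of-envelope sketch in Python. The model size, precision and bandwidth figures are illustrative assumptions rather than measurements of any particular system; the point is only that per-token latency is floored by how fast weights can stream out of memory, while prompt processing scales with prompt length.

```python
# Back-of-envelope: why token generation is memory-bandwidth bound.
# All figures below are illustrative assumptions, not vendor specifications.

model_params = 70e9                      # assume a 70B-parameter model
bytes_per_param = 2                      # FP16/BF16 weights
weight_bytes = model_params * bytes_per_param   # ~140 GB of weights

mem_bandwidth = 3.0e12                   # assume ~3 TB/s of device memory bandwidth

# Every generated token streams essentially all weights past the compute units,
# so memory bandwidth, not raw FLOPS, sets the floor on per-token latency.
seconds_per_token = weight_bytes / mem_bandwidth
print(f"Per-token floor: {seconds_per_token * 1e3:.0f} ms "
      f"(~{1 / seconds_per_token:.0f} tokens/s at best)")

# Prompt processing, by contrast, grows with prompt length
# (roughly 2 * params floating-point operations per prompt token).
for prompt_len in (8_000, 32_000):
    prefill_flops = 2 * model_params * prompt_len
    print(f"{prompt_len:>6}-token prompt: ~{prefill_flops:.1e} FLOPs before the first token")
```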
The AI community is actively looking for solutions to these challenges, including climate-friendly energy sources (e.g., wind and solar), new hardware design approaches and algorithmic optimizations. Today I’d like to discuss one such approach.
The Memory and Energy Walls
Amir Gholami, a research scientist at BAIR/SkyLab, UC Berkeley, and his colleagues describe the challenges of training and serving transformer models in their article “AI and the Memory Wall.” The term, coined by William Wulf and Sally McKee in 1995, covers both the limited capacity of memory and the limited bandwidth of memory transfer. Distributed inference may sidestep a single accelerator’s limited capacity and bandwidth, but it runs into its own memory wall: the communication bottleneck of moving data between accelerators, which is even slower and less efficient than on-chip data movement.
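To see why the chip-to-chip hop is the slowest one, compare ballpark transfer rates at each level of the hierarchy. The bandwidth figures below are rough orders of magnitude for modern accelerators, used here purely for illustration:

```python
# Illustrative data-movement rates at each level of the memory hierarchy.
# Bandwidth figures are ballpark orders of magnitude, not any vendor's spec sheet.
ON_CHIP_SRAM_BW = 50e12     # aggregate on-chip SRAM: tens of TB/s
HBM_BW          = 3e12      # off-chip high-bandwidth memory: ~3 TB/s
CHIP_TO_CHIP_BW = 0.9e12    # accelerator-to-accelerator link: ~900 GB/s

activation_bytes = 32 * 4096 * 2   # a batch of 32 FP16 activation vectors, hidden size 4096

for name, bw in [("on-chip SRAM", ON_CHIP_SRAM_BW),
                 ("HBM (off-chip DRAM)", HBM_BW),
                 ("chip-to-chip link", CHIP_TO_CHIP_BW)]:
    ns = activation_bytes / bw * 1e9
    print(f"{name:>20}: {ns:6.2f} ns to move one activation block")
```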
Generating content from trained weights requires an enormous number of tiny calculations, organized as GEMM (General Matrix Multiply) operations. GEMM doesn’t require massive processors; it demands that many small multiply-accumulate calculations be carried out rapidly and efficiently, as the sketch below illustrates.
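In code terms, each of those calculations is a tiny multiply-accumulate inside a matrix product. Here is a minimal NumPy sketch of one such GEMM, with arbitrary illustrative shapes rather than the dimensions of any particular model:

```python
import numpy as np

# A single GEMM: activations (batch x d_in) times a trained weight matrix (d_in x d_out).
# Transformer inference is dominated by thousands of these multiplies per generated token.
batch, d_in, d_out = 1, 4096, 4096        # illustrative sizes
activations = np.random.randn(batch, d_in).astype(np.float16)
weights = np.random.randn(d_in, d_out).astype(np.float16)   # fixed after training

output = activations @ weights            # GEMM: a huge number of small multiply-accumulates

# Each output element is just a dot product -- a long chain of tiny MAC operations:
# output[i, j] = sum over k of activations[i, k] * weights[k, j]
print(output.shape, f"{2 * batch * d_in * d_out:,} FLOPs for this one projection")
```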
This is where the memory wall gets in the way. With each operation, data must travel between DDR RAM and the processor, as well as between processors. Even though this journey covers only millimeters, it costs time and energy because of the separation of storage and compute inherent in the von Neumann architecture. And there are so many calculations!
Each time data moves over the memory bus, DRAM access consumes about 60 picojoules per byte, while computation requires only 50-60 femtojoules per operation. This means it takes a thousand times more energy to move the data back and forth than it does to actually use the data. As these energy costs accumulate across millions of user prompts, hundreds of watts per GPU, and thousands of servers and data centers worldwide, you can see why Altman thinks we’ll need a nuclear energy breakthrough.
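Plugging those per-byte and per-operation figures into a quick calculation shows the imbalance. The energy numbers are the ones quoted above; the model size is an illustrative assumption:

```python
# Rough energy comparison using the figures quoted above.
DRAM_ENERGY_PER_BYTE = 60e-12     # ~60 picojoules per byte moved over the memory bus
COMPUTE_ENERGY_PER_OP = 55e-15    # ~50-60 femtojoules per arithmetic operation

# Assume an illustrative 70B-parameter model held in FP16 (2 bytes per weight).
weight_bytes = 70e9 * 2
move_energy = weight_bytes * DRAM_ENERGY_PER_BYTE       # streaming the weights once
compute_energy = (2 * 70e9) * COMPUTE_ENERGY_PER_OP     # ~2 ops per weight per token

print(f"Energy to move the weights once: {move_energy:.2f} J")
print(f"Energy to compute on them:       {compute_energy:.3f} J")
print(f"Movement / compute ratio:        {move_energy / compute_energy:.0f}x")
```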
In-Memory Computing
If it’s so inefficient to move data back and forth between storage and processing, can we flip the script and move the compute into memory instead? In recent years, in-memory computing (IMC) has emerged as a promising alternative, performing the MAC (multiply-accumulate) operations directly in or near the memory cells.
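Conceptually, IMC keeps the weights stationary in the memory array and performs the multiply-accumulate where they sit. The toy model below is purely illustrative of that dataflow, not a description of how any particular chip is wired:

```python
import numpy as np

# Toy model of the in-memory-computing dataflow: weights stay put in the memory
# array, activations are broadcast to it, and each column accumulates its own MACs.
# Purely illustrative -- real IMC macros do this in (or next to) the bitcells.

class IMCMacro:
    def __init__(self, weight_tile):
        # Weights are written into the array once and never move afterwards.
        self.weight_tile = weight_tile            # shape: (rows, cols)

    def mac(self, activations):
        # Broadcast the activation vector across the rows; every column performs its
        # multiply-accumulate locally instead of shipping weights to a distant ALU.
        return activations @ self.weight_tile     # shape: (cols,)

rows, cols = 256, 64                              # illustrative tile size
macro = IMCMacro(np.random.randn(rows, cols).astype(np.float32))
partial_sums = macro.mac(np.random.randn(rows).astype(np.float32))
print(partial_sums.shape)                         # one accumulated result per column
```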
Purdue University’s research shows that in-memory computing architectures can run ML inference at as little as 0.12 times the energy of established baselines. The MICAS Center at KU Leuven in Belgium champions IMC, emphasizing its ability to reduce access overheads and enable massive parallelization, potentially yielding orders-of-magnitude improvements in energy efficiency and throughput.
Early explorations in the field looked at analog IMC as an efficient way to evaluate weights and run inference on pre-trained LLMs. However, this approach requires costly conversion between the digital and analog domains and additional error checking.
Digital In-Memory Computing (DIMC) offers an alternative that sidesteps the challenges of analog IMC, providing noise-free computation and greater flexibility in spatial mapping. DIMC sacrifices some area efficiency compared to analog but opens up much more flexibility and power to handle future AI needs. KU Leuven’s research also supports SRAM as the preferred solution for IMC due to its robustness and reliability compared to NVM-based solutions.
DIMC promises to revolutionize AI inference, lowering costs and improving performance. Given the rapid pace of generative AI adoption, it only makes sense to pursue a new approach that reduces cost and power consumption by bringing compute into memory. By flipping the script and eliminating unnecessary data movement, we can make dramatic gains in AI efficiency and improve the economics of AI going forward.
Sree Ganesan is Vice President of Product at d-Matrix, a startup building AI chips for generative AI inference. To read more about the company’s DIMC-based AI accelerators, go to https://www.d-matrix.ai/.