Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model, and combines beam search with dynamic tree attention to speed up LLM token generation by up to 3.5 tokens per generation step for open source models, surpassing the performance of prior speculative decoding techniques.
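To make the draft-and-verify idea concrete, here is a minimal, generic speculative decoding sketch in PyTorch. It deliberately omits ReDrafter's RNN draft head, beam search, and dynamic tree attention, drafting a single greedy candidate sequence instead; `target` and `draft` are stand-in callables (any functions mapping token IDs to next-token logits), not ReDrafter's actual interfaces.

```python
import torch

@torch.no_grad()
def speculative_generate(target, draft, prompt, max_new_tokens, k=4):
    """Generic greedy draft-and-verify loop (a simplification of ReDrafter).

    `target` and `draft` map a [1, seq_len] token tensor to
    [1, seq_len, vocab] logits; `prompt` is a [1, seq_len] tensor.
    """
    seq = prompt
    final_len = prompt.shape[-1] + max_new_tokens
    while seq.shape[-1] < final_len:
        # 1) Draft k candidate tokens auto-regressively with the cheap model.
        draft_seq = seq
        for _ in range(k):
            next_tok = draft(draft_seq)[:, -1].argmax(-1, keepdim=True)
            draft_seq = torch.cat([draft_seq, next_tok], dim=-1)

        # 2) Score every drafted position with ONE target forward pass.
        logits = target(draft_seq)
        # The target's greedy prediction at each drafted position.
        verify = logits[:, seq.shape[-1] - 1 : -1].argmax(-1)
        drafted = draft_seq[:, seq.shape[-1]:]

        # 3) Keep the longest drafted prefix the target agrees with, plus the
        #    target's own token right after the last match (empty when all k
        #    drafted tokens are accepted -- a simplification).
        n_ok = int((verify == drafted)[0].long().cumprod(0).sum())
        seq = torch.cat(
            [seq, drafted[:, :n_ok], verify[:, n_ok : n_ok + 1]], dim=-1
        )
    return seq
```

The key property is that a single forward pass of the expensive target model validates several cheaply drafted tokens at once, which is where the per-step token gains come from.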
Productionizing ReDrafter to Accelerate NVIDIA TensorRT-LLM
This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM helps quite a few open supply LLMs and the Medusa speculative decoding methodology, ReDrafter’s beam search and tree consideration algorithms depend on operators that had by no means been utilized in earlier functions. To allow the combination of ReDrafter, NVIDIA added new operators or uncovered current ones, which significantly improved TensorRT-LLM’s functionality to accommodate refined fashions and decoding strategies. ML builders utilizing NVIDIA GPUs can now simply profit from ReDrafter’s accelerated token technology for his or her manufacturing LLM functions with TensorRT-LLM.
In benchmarking a tens-of-billions-parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate this technique could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
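As a back-of-the-envelope check on how the per-step figure relates to the wall-clock figure: each ReDrafter step does extra work (drafting, beam search, verification), so tokens per second grows less than tokens per step. The overhead factor below is purely a hypothetical illustration, not a measured value from the benchmark.

```python
# Hypothetical arithmetic only -- not the benchmark itself.
baseline_tokens_per_step = 1.0   # plain auto-regressive greedy decoding
redrafter_tokens_per_step = 3.5  # up to 3.5 tokens accepted per step (see above)
step_cost_ratio = 1.3            # assumed: a ReDrafter step costs ~1.3x a baseline step

speedup = (redrafter_tokens_per_step / baseline_tokens_per_step) / step_cost_ratio
print(f"tokens/sec speed-up: {speedup:.1f}x")  # -> 2.7x
```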
For additional detail, see this post on the NVIDIA developer blog.
Conclusion
LLMs are increasingly being used to power production applications, and improving inference efficiency can both reduce computational costs and lower latency for users. With ReDrafter's novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgements
Many people contributed to this project, including Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and our collaborators at NVIDIA.