Scientists from ETH Zurich found that large language models (LLMs) only need to use a small fraction of their neurons for individual inferences. Their new approach promises to make LLMs run a lot faster.
To understand how they managed to speed up AI models, we first need a rough idea of the technical building blocks that make up an AI language model.
AI models like GPT or Llama are built largely from feedforward networks, a type of artificial neural network.
Feedforward networks (FF) are typically organized into layers, with each layer of neurons receiving input from the previous layer and sending its output to the next layer.
This involves dense matrix multiplication (DMM), which requires every neuron in the FF layer to perform calculations on all the inputs from the previous layer. This process takes a lot of processing power, and it is a big part of why Nvidia sells so many of its GPUs.
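To make that concrete, here is a minimal sketch of a single dense feedforward layer in Python. It is purely illustrative (the function name and sizes are made up, not taken from any model's actual code), but it shows why the cost is what it is: every neuron multiplies its weights against the full input from the previous layer.

```python
import numpy as np

def dense_ff_layer(x, weights, bias):
    """One dense feedforward layer: every neuron processes the full input.

    x:       shape (input_dim,)              output of the previous layer
    weights: shape (input_dim, num_neurons)  one weight column per neuron
    bias:    shape (num_neurons,)
    """
    # Dense matrix multiplication (DMM): input_dim * num_neurons multiply-adds.
    return np.maximum(0, x @ weights + bias)  # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=768)             # a BERT-base-sized hidden state
W = rng.normal(size=(768, 3072))     # all 3072 neurons take part in every inference
b = np.zeros(3072)
out = dense_ff_layer(x, W, b)        # roughly 2.4 million multiply-adds for one input
```

Every input pays that full cost at every layer, which is exactly the kind of work GPUs are built to churn through.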
The researchers used Fast Feedforward Networks (FFF) to make this process a lot faster. An FFF takes each layer of neurons, breaks it up into blocks, and then selects only the most relevant blocks based on the input. This process amounts to performing conditional matrix multiplication (CMM).
This means that instead of all the neurons of a layer being involved in the calculation, only a very small fraction is involved.
Think of it like sorting a pile of mail to find a letter meant for you. Instead of reading the name and address on every single letter, you could first sort them by zip code and then only focus on the ones for your area.
In the same way, FFFs identify just the few neurons needed for each computation, so only a fraction of the processing is required compared to traditional FFs.
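In code, the idea looks roughly like the sketch below. This is a simplified illustration of conditional matrix multiplication rather than the researchers' actual FFF implementation, and the names and block sizes are invented for the example: score the blocks, keep the most relevant one, and multiply the input only against that block's weights.

```python
import numpy as np

def conditional_ff_layer(x, blocks, router):
    """Conditional matrix multiplication (CMM), heavily simplified.

    x:      shape (input_dim,)                          input vector
    blocks: shape (num_blocks, input_dim, block_size)   layer weights, split into blocks
    router: shape (input_dim, num_blocks)               scores the blocks for this input
    """
    scores = x @ router                        # which block looks most relevant?
    chosen = int(np.argmax(scores))            # the real FFF makes this selection step
                                               # far cheaper than scoring every block
    return np.maximum(0, x @ blocks[chosen])   # compute only the chosen block's neurons

rng = np.random.default_rng(0)
x = rng.normal(size=768)
blocks = rng.normal(size=(256, 768, 12))       # 256 blocks of 12 neurons each
router = rng.normal(size=(768, 256))
out = conditional_ff_layer(x, blocks, router)  # the heavy step touches 768 * 12 weights,
                                               # not 768 * 3072
```

The expensive multiplication now involves a dozen neurons instead of thousands, and that is where the speedup comes from.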
How much faster?
The researchers tested their method on a variant of Google’s BERT model that they called UltraFastBERT. Each of UltraFastBERT’s feedforward layers contains 4095 neurons, but the model selectively engages just 12 of them for each layer inference.
This means that UltraFastBERT needs only around 0.3% of its neurons (12 out of 4095) to be involved in processing during inference, while regular BERT needs 100% of its neurons involved in the calculation.
Theoretically, this means UltraFastBERT’s feedforward computations could run 341x faster than those of models like BERT or GPT-3.
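Those figures are easy to sanity-check: the 341x number lines up with the ratio of total neurons to the neurons actually used in each layer.

```python
total_neurons = 4095    # neurons in each UltraFastBERT feedforward layer
active_neurons = 12     # neurons actually engaged per layer inference

fraction_used = active_neurons / total_neurons        # ~0.0029
theoretical_speedup = total_neurons / active_neurons  # ~341.25

print(f"Neurons used per layer: {fraction_used:.2%}")        # -> 0.29%
print(f"Theoretical speedup:    {theoretical_speedup:.0f}x")  # -> 341x
```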
Why do we say “theoretically” when the researchers assure us that their method works? Because no efficient low-level software support for CMM exists yet, so they had to create a workaround to get their FFF to work with BERT, and in real testing it achieved only a 78x speed improvement.
It’s a secret
The research paper explained that “Dense matrix multiplication is the most optimized mathematical operation in the history of computing. A tremendous effort has been put into designing memories, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements have been…kept confidential and exposed to the end user only through powerful but restrictive programming interfaces.”
Basically, they’re saying that the engineers who worked out the most efficient ways to run the math behind traditional FF networks keep their low-level software and algorithms secret and won’t let you look at their code.
If the people who design Intel and Nvidia chips enabled low-level code access to implement FFF networks in AI models, then the 341x speed improvement could become a reality.
But will they? If you could engineer your GPUs so that people could buy 99.7% fewer of them to do the same amount of processing, would you do it? Economics will have some say in this, but FFF networks may represent the next giant leap in AI.