The release of smaller and more efficient AI models like Mistral’s groundbreaking Mixtral 8x7B model has seen the concepts of “Mixture of Experts” (MoE) and “Sparsity” become hot topics.
These terms have moved from the realm of complex AI research papers to news articles reporting on rapidly improving Large Language Models (LLMs).
Fortunately, you don’t have to be a data scientist to have a broad idea of what MoE and Sparsity are and why these concepts are a big deal.
Mixture of Experts
LLMs like GPT-3 are based on a dense network architecture. These models are made up of layers of neural networks where each neuron in a layer is connected to every neuron in the preceding and subsequent layers.
All the neurons are involved during training as well as during inference, the process of generating a response to your prompt. These models are great for tackling a wide variety of tasks but use a lot of computing power because every part of their network takes part in the processing of an input.
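As a rough illustration (toy NumPy code, not how GPT-3 is actually written), here is a single dense feed-forward layer. Notice that every weight participates in the computation for every token:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy dense feed-forward layer: 512 inputs fully connected to 2048 outputs.
d_in, d_hidden = 512, 2048
W = rng.standard_normal((d_in, d_hidden)) * 0.02
b = np.zeros(d_hidden)

def dense_forward(x):
    """Every one of the 512 * 2048 weights is used, no matter what the input is."""
    return np.maximum(x @ W + b, 0.0)  # ReLU activation

token = rng.standard_normal(d_in)   # one token's hidden state
out = dense_forward(token)
print(W.size, "weights used for this single token")  # 1,048,576
```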
A model based on an MoE architecture splits parts of those layers into a number of "experts," where each expert is a smaller neural network. The name Mixtral 8x7B means each of its MoE layers contains 8 experts, each roughly the size of the corresponding block in a 7-billion-parameter model. Because the experts share the rest of the network, the model's total size is about 47 billion parameters rather than the 56 billion the name suggests.
Each expert ends up being very good at a narrow slice of the overall problem, much like specialists in a field; in practice this specialization emerges during training rather than being assigned by hand.
Once prompted, the model breaks your prompt into tokens, and a gating network (also called a router) decides which experts are best suited to process each token. The outputs of the chosen experts are then combined, weighted by the router's scores, to produce the final output.
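Here's a minimal, illustrative sketch of that routing idea, a toy top-2 gate in NumPy rather than an actual Mixtral layer (which does this inside every transformer block):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden, n_experts, top_k = 64, 256, 8, 2

# Each "expert" here is just a small two-layer feed-forward network.
experts = [
    (rng.standard_normal((d_model, d_hidden)) * 0.02,
     rng.standard_normal((d_hidden, d_model)) * 0.02)
    for _ in range(n_experts)
]
# The gating network is a single linear layer that scores every expert.
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(token):
    scores = softmax(token @ W_gate)        # one score per expert
    chosen = np.argsort(scores)[-top_k:]    # keep only the top-2 experts
    weights = scores[chosen] / scores[chosen].sum()
    out = np.zeros(d_model)
    for w, idx in zip(weights, chosen):
        W1, W2 = experts[idx]
        out += w * (np.maximum(token @ W1, 0.0) @ W2)  # only 2 of 8 experts run
    return out

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (64,)
```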
Think of MoE as having a group of tradesmen with very specific skill sets to do your home renovation. Instead of hiring a general handyman (dense network) to do everything, you ask John the plumber to do the plumbing and Peter the electrician to do the electrics.
These models are cheaper to train because only a few experts are active for each token, so every training step touches just a fraction of the total parameters.
MoE models also run inference faster than dense models with the same total number of parameters, again because only a few experts are active per token. This is why Mixtral 8x7B, with roughly 47 billion total parameters but only about 13 billion active for any given token, can match or beat GPT-3.5, widely reported to be far larger at around 175 billion parameters.
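Some back-of-the-envelope arithmetic with the figures Mistral published for Mixtral makes the point:

```python
# Rough arithmetic using the figures Mistral published for Mixtral 8x7B.
total_params  = 46.7e9   # all 8 experts plus the shared layers
active_params = 12.9e9   # what actually runs for each token (top-2 experts)

print(f"Active fraction per token: {active_params / total_params:.0%}")  # ~28%
# A dense model uses 100% of its weights for every token, so at inference
# time Mixtral does roughly the work of a ~13B dense model while storing
# the knowledge of a ~47B one.
```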
It's rumored that GPT-4 uses an MoE architecture with 16 experts, while Gemini is generally thought to use a dense architecture.
Sparsity
Sparsity refers to the idea of reducing the number of active elements in a model, like the neurons or weights, without significantly compromising its performance.
If the data an AI model works with, such as text or images, contains a lot of zeros, sparse data representations avoid wasting effort storing and processing those zeros.
In a sparse neural network, many of the weights, the strengths of the connections between neurons, are zero. Pruning removes those weights so they aren't included during processing. An MoE model is also naturally sparse because only a few experts are involved in processing a token while the rest sit idle.
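Here's a quick illustration of one common technique, magnitude pruning, where the smallest weights are simply zeroed out (a toy sketch, not any specific library's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy weight matrix for one layer of a network.
W = rng.standard_normal((1024, 1024))

# Magnitude pruning: zero out the 90% of weights that are closest to zero.
threshold = np.quantile(np.abs(W), 0.90)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

kept = np.count_nonzero(W_pruned)
print(f"Weights kept: {kept:,} of {W.size:,} ({kept / W.size:.0%})")
# A sparse storage format only needs to record those ~10% of weights
# (plus their positions), and a sparse kernel can skip the zeros entirely
# during multiplication.
```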
Sparsity can lead to models that are less computationally intensive and require less storage. The AI models that eventually run on your device will rely heavily on Sparsity.
You can think of Sparsity like going to a library to get an answer to a question. If the library has billions of books you could open each book in the library and eventually find relevant answers in some of the books. That’s what a non-sparse model does.
If we get rid of the books that are mostly blank pages or irrelevant information, it becomes easier to find the books relevant to our question, so we open fewer of them and find the answer faster.
If you enjoy staying up to date with the latest AI developments then expect to see MoE and Sparsity mentioned more often. LLMs are about to get a lot smaller and faster.