Mission: Generate
A Mixture of Experts
Show Notes:
- Mistral’s paper explaining its Mixture of Experts architecture: https://arxiv.org/abs/2401.04088
- Hugging Face explaining the purpose and advantages of Mixture of Experts: https://huggingface.co/blog/moe
- A blog examining the Mixtral and Grok architectures as MOE models: https://www.unite.ai/the-rise-of-mixture-of-experts-for-efficient-large-language-models/
- Our episode on training, which we referenced: https://www.missioncloud.com/podcasts/training
Official Transcript:
Ryan:
Welcome to Mission Generate, the podcast where we explore the power of AWS and the cutting-edge techniques we use with real customers to build generative AI solutions for them. Join us for each episode as we dig into the technical details, cut through the hype, and uncover the business possibilities.
I'm Ryan Ries, our chief data science strategist, and your host.
In this episode, we're going to have a conversation about a peculiar architecture common to LLMs which you may have heard of: the so-called "Mixture of Experts."
Why does a Mixture of Experts matter when developing language models? How does this change the way you should think about developing your own AI solutions or training your own model? We'll explore questions like these and many more today.
With that, I'll hand it off to Casey, our cohost and leader of Product Marketing at Mission.
Casey:
Hey, everyone, I know it's been a while since you've heard my voice on this pod, though in fact, you've been listening to my words all this time if you've been tuning in.
I'm the writer of Mission Generate and, because these are two synthetic voices talking to you, in a somewhat profound way, that means you're actually hearing me talk, even when Ryan is speaking.
Ryan:
I prefer to speak for myself, Casey.
Casey:
Ah, but Ryan, even your irreverent wit is mine to wield in this brave new world we're entering.
Ryan:
He's right, folks. If you haven't listened before, Casey and I are in fact not in a recording studio or even recording anything at all. We are both synthesized voices, generated by AI with a little help from a large language model and natural language processing to sound as lifelike and convincing as our human counterparts.
Casey:
Exactly. Our podcast's particular mixture of experts is entirely generated, much like the architecture we're going to be talking about today. Speaking of which—Ryan, what's the deal with this architecture and why should our listeners care about it?
Ryan:
For the sake of time, I'm going to let your clever pun stand and pretend we're both experts here.
Casey:
That's right, folks. Just like Ryan got his PhD from Caltech in bio-physical chemistry, I went to clown college. Truly, we are peers in all regards.
Ryan:
Okay okay, quit clowning around and let's get back to the topic, Casey. Let's start here: Mixture of Experts isn't anything new. It's actually been around for a long time, longer even than the lifespan of generative AI. But it is a clever concept that takes us into the inner workings of these large language models and may help you have a better grasp of what's going on inside them.
Mixture of Experts can be easily visualized as a panel of, well, experts. To stick with the visual metaphor for a bit, imagine these experts, let's say eight of them, sitting at a long table, attentively watching a door at the back of the room. When that door opens, a courier enters bringing a message. The courier leaves the message in front of the table with a moderator, and that moderator then delivers it to the right expert depending on whom he thinks will best answer it. That's Mixture of Experts in a nutshell.
Casey:
Cool. So in this metaphor, I take it the courier and the message, that would be the prompt, right?
Ryan:
You got it, Casey. I can see your inference engine is really working today.
Casey:
Well, let me infer a bit more in that case. The panel's moderator would be what's called an "orchestrator," which functions as the router to our panel of experts and makes sure the prompt goes to the right one. As for the panel, those are the parts of the model that reply with an output based on that message's input: the other half of our chat conversation, let's say, if we were talking to a chatbot.
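The panel metaphor can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any production model is built: the names `make_expert`, `moderator`, and `panel` are hypothetical, the "experts" are stand-in functions, and the moderator's scores come from a seeded random generator rather than a trained layer. Real models also tend to route individual tokens inside each layer rather than whole prompts.

```python
import random

NUM_EXPERTS = 8  # the eight panelists from the metaphor


def make_expert(index):
    """Return a stand-in expert: a function that tags the message it handles."""
    def expert(message):
        return f"expert {index} handled: {message}"
    return expert


def moderator(message, num_experts=NUM_EXPERTS):
    """Hypothetical router: score every expert for this message, pick the best.

    A real router is a small learned layer. Here the scores come from a
    random generator seeded with the message, so the same message always
    lands on the same expert.
    """
    rng = random.Random(message)
    scores = [rng.random() for _ in range(num_experts)]
    return max(range(num_experts), key=lambda i: scores[i])


def panel(message, experts):
    """Deliver the message to one expert; the others do no work at all."""
    chosen = moderator(message, len(experts))
    return chosen, experts[chosen](message)


experts = [make_expert(i) for i in range(NUM_EXPERTS)]
chosen, reply = panel("What is a mixture of experts?", experts)
print(chosen, reply)
```

The point of the sketch is that only one of the eight functions is ever called per message, which is where the compute savings discussed later in the episode come from.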
Ryan:
Very good, Casey. If you hadn't written all of this yourself, I'd be saying right now that it seems like you finally learned something. But I guess you did since you're able to invent this conversation.
Casey:
I'm not half-bad for a product marketer, eh? But Ryan, let's point out something here. Why even bother with the panel and the moderator? Wouldn't a much more effective model architecture be a panel with one member who just knew the answers you were looking for to every question?
Ryan:
Yes, it often seems to customers as though they're interacting with one monolithic entity when working with a model, but that perception can actually be something of a mirage. As we said earlier, Mixture of Experts is an old architecture. And while this hasn't gotten official confirmation, starting with GPT-4 we were already interacting with a Mixture of Experts architecture out in the wild, for instance.
Casey:
So those chat windows, even though they made it appear as though we were talking to one very large model, were actually putting us in touch with this experts panel, so to speak?
Ryan:
That's right, Casey. Right from the outset, our perceptions of what the model was were actually rather skewed.
Casey:
So, is the right way to think about this that there were actually several GPTs behind the scenes, re-routing our questions as appropriate?
Ryan:
Yes and no. Intelligence is a problematic word when talking about interacting with a model, but it would be more accurate to say you were dealing with a distributed intelligence which could infer which parts of itself were encoded to best answer your questions and then let those parts reply.
Casey:
Right, so this is where the panel metaphor gets a bit misleading, doesn't it? Because we think of a panel as composed of individuals, like individual experts in their fields, but what you're saying is that you can't necessarily separate those parts into their own models?
Ryan:
Not entirely, at least. In some ways, they aren't that neatly divisible. But computationally speaking, you can separate them, which actually gets to one of the first reasons you'd use a Mixture of Experts architecture—you can run a very large model while only paying for the computation of a small part of it at any time.
Casey:
Right, and that's because, even while the model itself might be something like, a hundred billion parameters, the Expert replying at any one time might only be twelve billion, so when it's generating those tokens to reply to you, it requires less overall power and expense to run.
Which means there are economic benefits to this division of labor. But what about the model's efficacy? Is it truly advantageous to have these different experts who specialize in different subjects to help formulate the answer you're looking for?
Ryan:
I'm gonna stop you right there, Casey, because this is where the first fallacy around Mixture of Experts comes in. So far, you may have gotten the impression that this architecture is somehow organized from the top down, that machine learning scientists are creating one expert for math, one for biology, et cetera. But that is not what's happening here, at all. You're in for something far weirder.
This is where we come back to something we said in our training episode. Just like with the embedding process, creating a mixture of experts happens during the training of a model. Something is encoded within the model to correctly assign different parts of itself to the task of generating. But what that encoding is is not straightforward; in fact, it's beyond human definition at the moment. So the creation of the experts and what it is they have so-called expertise in is actually rather mysterious still.
And this is why the name "Mixture of Experts" is ultimately a bit misleading. There's no math expert or writing expert or anything so neatly categorizable as all that—just as with embedding, how the model is determining which part of itself should answer or what that part is attuned to is not really explainable. Kinda spooky, right?
Yes, all these different parts are specialized, but what those specializations are... that has no comprehensible explanation as of yet. All we know is that the model encodes a way of distributing the work among itself according to a form of probability. This is how the orchestrator is routing work.
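To make "a form of probability" concrete, here is a minimal sketch of one common gating scheme: the router produces a score per expert, a softmax turns the scores into probabilities, and only the top few experts are kept and reweighted. The scores in `router_logits` are made up for illustration, and `top_k_gate` is a hypothetical helper, not an API from any library. In a trained model the scores come from learned weights, which is why nobody hand-assigns a specialty to each expert.

```python
import math


def softmax(logits):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k=2):
    """Keep the k most probable experts and renormalize their weights."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


# Hypothetical router scores for one token across eight experts.
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]
gate = top_k_gate(router_logits, k=2)
print(gate)
```

With these made-up scores, experts 1 and 4 are selected and their two weights sum to 1. The selected experts' outputs would then be blended using those weights.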
Casey:
One of the things that occurs to me as I hear you say this is, Why? Why go through the trouble of embedding all of this into a model when training? What are we improving here?
Ryan:
Yes, Casey, fair question. It does seem like a big song and dance to create this extra complexity under the hood of the model, so to speak. But as I like to say on this podcast, you've got to think about money.
The cost of running these multi-billion parameter models is substantial. A lot of what we focus on with customers when building solutions boils down to optimally managing the surrounding infrastructure just to ensure we're not wasting resources. Because model costs can be exorbitant if you don't design carefully.
For this very reason, in general we try to steer customers away from self-hosting models. You should avoid this added expense if you can get the necessary performance with a vector database and retrieval-augmented generation, for instance. And there are many use cases where we see that perform well enough to fit the problem. But if you find yourself in the position of wanting or needing to host your own model because you need to finetune it, for example, or you have some other kind of infrastructural consideration, then "Mixture of Experts" matters greatly to how you think about your model's cost versus performance.
By dividing the computational labor up among the panel of those eight experts from our thought experiment, you are now lowering the computational cost of the model itself. Instead of needing to run all the parameters of a model for every request, you end up running a fraction of them at any one time. And that has a huge effect on choosing the kind of server it can run on.
Casey:
It's always about the cash, isn't it Ryan?
Ryan:
Yes, Casey. No one likes spending money they don't have to. And this also opens up some new vistas for us as far as how large a model can become. Above a certain parameter size, no matter how many Nvidia GPUs you've got chugging, you're going to run out of processing power to host a model, let alone do so cost-effectively. There is no single GPU big enough for a trillion-parameter model.
But in a world where Mixture of Experts exists, you can now pay for a computational fraction of that model to run it. And that has very interesting performance considerations.
Casey:
Ryan, I know one model that's gotten a lot of attention for going this route is Mixtral. That's a portmanteau of Mistral, the name of the company that created it, and Mixture, as in Mixture of Experts. Great branding, by the way. This model happens to be both on Amazon Bedrock and available to self-host because they released its weights openly. Can you tell us why this model has gotten so much attention lately?
Ryan:
Sure thing. Mixtral is a model that's been making waves for openly advertising that it's an 8 by 22 billion parameter model. Effectively, that means you get 176 billion parameters of power at the cost of running only 22 billion of those at any one time. We're simplifying here, but if you've been listening well so far, the cost equation should already be forming in your mind.
Running 22 billion parameters at any one time is small enough that it can be done effectively by a single GPU with enough memory, or a CPU if you want to quantize the model, which is a whole other discussion. But there are additional cost advantages to doing that. What really matters is the parameter count, though. The larger the number of parameters, the more RAM you need to run the model. So far, Mixture of Experts does not alleviate this pressure on RAM, since you're going to need all that memory to hold the 176 billion parameters at the same time. But the compute efficiency has vastly increased, because you're only inferencing with 22 billion parameters at a time.
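The simplified arithmetic here can be written out as a back-of-the-envelope calculation. The sketch below takes the episode's numbers at face value and adds two assumptions of its own: each expert is treated as a fully separate block with exactly one active at a time, and weights are stored at 2 bytes per parameter (16-bit). Real Mixture of Experts models share some layers between experts and may activate more than one expert per token, so published totals will not match this naive multiplication. `moe_footprint` is a hypothetical helper name.

```python
def moe_footprint(num_experts, params_per_expert_b, active_experts=1, bytes_per_param=2):
    """Rough numbers for the simplified picture, in billions of parameters.

    Assumes fully separate experts with no shared layers, and weights
    stored at bytes_per_param bytes each (2 bytes = 16-bit).
    """
    total_b = num_experts * params_per_expert_b
    active_b = active_experts * params_per_expert_b
    return {
        "total_params_b": total_b,
        "active_params_b": active_b,
        # billions of parameters times bytes per parameter gives gigabytes
        "weights_memory_gb": total_b * bytes_per_param,
        "active_fraction": active_b / total_b,
    }


numbers = moe_footprint(num_experts=8, params_per_expert_b=22)
print(numbers)
```

Under those assumptions, memory has to hold all 176 billion parameters (about 352 GB at 16-bit), while each step of generation computes with only one eighth of them.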
The reason folks are excited about Mixtral is that, when compared to other models, we know substantially more about how it was trained, which means you can effectively finetune and alter the model for your purposes with less guesswork and, hopefully, less effort. In applications where this kind of additional training and model realignment really matter, this is a great starting place.
On top of that, the model is quite performant... If we go off Chatbot Arena's current ranking, it's beating Claude 2 and is currently hovering just outside of the top ten models ranked globally. That puts it in the same realm as Claude 3, Llama 3, and GPT-4 in terms of capability. Combine that kind of performance baseline with effective training and a cost-efficient architecture, and you have a fantastic, highly tunable model for a lot of use cases.
Casey:
Ryan, to be honest, when you put it that way, it sounds like a clear winner. But I suppose from what we've covered on today's episode, we need to clarify that all of this only matters in the context that you're hosting the model yourself. If you're just running a model as is, out of the box, whether or not it uses Mixture of Experts is an implementation detail you don't have to worry about.
Ryan:
That's right. We are big believers in only paying for what you use. And for lots of use cases, we've found there's little reason to go through the trouble of hosting and fine-tuning a model. The technology is already so capable that the more important parts tend to be things like data architecture, appropriately cleaning and storing your data for easy retrieval, or using agentic frameworks, like LangChain, to embed engineered prompts into your input model, like we do with Ragnarock, our approach to AI best practices.
With that said, don't get me wrong—Mixture of Experts is great news for the field of AI. It represents another step toward packing model power into a smaller and smaller compute envelope. It means we're going to keep seeing more capable models and at the same time that the computational power needed to run those models is shrinking.
So there's something very much like Moore's Law going on here. I don't know if this already exists out there somewhere, but if I had to coin a Ries' Law, I would predict that there's some ratio of compute to capability we're going to keep improving. Don't hold me to this, but for the sake of simplicity, we could say something like: every eighteen months we should be able to attain the same model performance with half the computational throughput.
Mixture of Experts is just the beginning of that. Where we go from here? That's going to be an exciting story, no doubt.
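As a rough illustration of the halving rule floated above, the relationship can be written as a simple exponential decay. This is only a sketch of the stated rule of thumb, not a measured trend: the eighteen-month period comes from the episode, while the function name `compute_needed` and the baseline of 1.0 are assumptions made here.

```python
def compute_needed(months, baseline=1.0, halving_months=18):
    """Relative compute for the same model performance after `months`."""
    return baseline * 0.5 ** (months / halving_months)


# Compare the compute needed today with 18, 36, and 54 months from now.
for months in (0, 18, 36, 54):
    print(months, compute_needed(months))
```

Each eighteen-month step halves the figure, so three steps would bring the same performance down to one eighth of the original compute.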
Casey:
Awesome stuff, Ryan. As we end here, we hope you enjoyed listening and learning more about the architecture of LLMs today.
And if you've been considering finetuning your own model or developing a self-hosted solution and want someone to bounce ideas off of, we'd love to chat with you.
At Mission, we believe that the best kind of consulting is honest, low on hype, and leaves the customer more knowledgeable than they were at the outset. And that's why we'll offer you an hour of our time, free of charge, just to learn what you're working on and explain what we think the ideal solution is. Yes, it sounds crazy but we will gladly help you without any money changing hands.
We know that just giving away advice like that without any obligation might seem a bit too good to be true. But we've found that's exactly why customers trust us and use us to build their solutions. So if that sounds like a conversation you'd like to have, why not head on over to mission cloud dot com and drop us a line?
That's it for today's episode! Like Ryan always says, Good luck out there, and happy building!