Mission: Generate

Getting Gen AI Solutions into Production, Part 1

In this episode, which is part 1 of 2 on this topic, we talk about getting gen AI solutions into production. We discuss cost and scale in terms of data architecture, running and training models, dealing with their unpredictability, and the challenges all of this can introduce when building a full-scale gen AI solution.


Show notes:

Our AWS marketplace offer for scaling your generative AI solution - https://aws.amazon.com/marketplace/pp/prodview-rhor3y55mosui

Our blog on Frequently Asked Questions we hear from our customers on generative AI - https://www.missioncloud.com/blog

Fine-tuning advice and cost considerations from Hugging Face and AWS - https://aws.amazon.com/blogs/machine-learning/train-a-large-language-model-on-a-single-amazon-sagemaker-gpu-with-hugging-face-and-lora/

AWS's guidelines for training large language models using SageMaker - https://aws.amazon.com/blogs/machine-learning/training-large-language-models-on-amazon-sagemaker-best-practices/

Llama 2, Meta's openly available models that you can run locally, at 7B, 13B, and 70B parameters - https://ai.meta.com/llama/

Official Transcript

Ryan:

Welcome to Mission Generate, the podcast where we explore the power of AWS and the cutting-edge techniques we use with real customers to build generative AI solutions for them. Join us each episode as we dig into the technical details, cut through the hype, and uncover the business possibilities.

I'm Ryan Ries, our generative AI practice lead and your host.

In today's episode, which is part 1 of 2 on this subject, we're going to talk about the differences between a proof-of-concept solution and one that's operating in production.

There's a lot to consider here. Your number of users; whether it's going to be exposed to your customers or reserved for internal use; how much it will cost you to run at the scale you expect; and how you plan to deal with training, inaccuracy, and hallucinations.

We're going to explore these through two problems: cost and scale.

Okay, let's begin.

Casey:

Hey! Casey here, our senior product marketing manager and the writer of this podcast.

Now, of course I'm not actually here, physically speaking—my voice is synthesized by an AI just like Ryan's—but I did write this, so I'm here in spirit, I suppose. You'll sometimes hear me jumping in on episodes like this one where a more dynamic format can benefit the conversation.

Getting a Gen AI solution into production is a huge and evolving topic, so you should expect we'll be revisiting this. But what we're aiming to do here is equip you with what we've learned and introduce you to the landscape of decisions you'll want to be thinking about as you move a solution into production.

With that out of the way, let's get back to grilling Ryan, one of my favorite pastimes.

Ryan, what do we even mean by that word, "production"? Aren't generative AI models, like GPT or Claude from Anthropic, already being used by companies in the real world? What distinction are we making here?

Ryan:

Yeah, that's a good question, Casey. Because, in one sense, you could argue that something like GPT-4 or Anthropic's Claude, which are already out on the internet and being used by millions of people, are really already production-grade systems.

Now, without getting too spicy and verging into the territory of how accurate those models are when performing tasks, what I'd point out here is that having a model that you can work with via an API is one thing; and having the data architecture to feed that model, the automation surrounding it to get it to perform your business task, and the training it needs to do so reliably—that's something else.

At this point, the maturity of a given model is important, but it's certainly not the only thing that determines the production-readiness of the solution you're building with it.

So when we say, "getting into production," what we're really talking about is incorporating a model into a specific solution you've engineered and launching that to your users, whoever they are.

Casey:

Right, so by "production," what we really mean is: launching your specific solution and operating it at the scale it needs to solve your specific problem.

Ryan:

Exactly.

Casey:

Okay, so, then what makes that so hard? Why is this something that we think customers of AWS should care about?

Ryan:

Sounds like you need to work on your customer empathy, Casey! Isn't that supposed to be a product marketer's specialty?

Casey:

Well, I'm exactly as empathetic as a synthesized AI voice can sound!

... I guess you've got me on this one, Ryan. But explain this problem. Why is scaling a solution such a big deal?

Ryan:

Yeah, well, not to make too broad a point here, but I'm going to start with the obvious one.

It's money.

Running a generative AI solution at scale can cost a lot of money. Like, A LOT. Let's say you need a large GPU instance type for the kind of training you're doing. That right there can easily run you 50 grand a month if it's on 24/7.

And maybe that's fine for your goals as a business. But, more than likely, what that should suggest immediately is like, "Oh, I probably want to make sure I've only got that instance up and running when I ABSOLUTELY need it." Which means now you've got to think about how you're feeding it its inputs. How are you going to get the data to it? How are you going to prepare that data ahead of time? How are you going to make sure that all your training is effective? Because you don't want to waste a ton of compute time for a non-usable result.
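To put rough numbers behind that, here's a back-of-envelope sketch. The $70/hour rate is an assumed figure for a large multi-GPU training instance, not a quoted AWS price; substitute the real on-demand rate for whatever instance type you're considering.

```python
# Back-of-envelope GPU cost math. The $70/hour rate is an assumption,
# not an AWS price quote; plug in your instance type's real rate.
HOURLY_RATE_USD = 70.0

always_on = HOURLY_RATE_USD * 24 * 30  # up 24/7: ~$50,400/month
scheduled = HOURLY_RATE_USD * 6 * 22   # ~6 hours/day, weekdays: ~$9,240/month

print(f"always-on: ${always_on:,.0f}/month")
print(f"scheduled: ${scheduled:,.0f}/month")
print(f"savings:   ${always_on - scheduled:,.0f}/month")
```

Even this crude math shows why scheduling your expensive compute around actual training runs, rather than leaving it on around the clock, is usually the first cost lever to pull.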

So you can see how even with just this one concern—money—we've actually opened up a series of larger questions about things like data architecture, data quality, experimentation, iteration time, risk, and business value.

The scale of these models, the amount of compute it takes to get them to work, all of that is going to put these questions in play when building a Gen AI solution.

Casey:

Yeah, I've heard you say that before and it is shocking how much money and compute these models can chew up.

Which I think raises the question, "Do I have to be a billion-dollar company to afford working with this technology? Am I committing out of the gate to millions of dollars spent just to get something working?"

Ryan:

Yeah, I think there can be a sense of risk to this technology. But you definitely don't have to be a billion-dollar enterprise to pull off a solution, and there are ways to run your model that are cost-efficient. But there are two hard problems here, for sure.

One is the money, like I said. If you're not careful, you're going to burn cash. But I would say the other major issue here is that generative AI models are probabilistic, not deterministic. Sure, you can set the temperature of your model really low, but if you're doing large-scale training of your model, feeding it all this text, data, and so on, there's a dimension of uncertainty in how it's going to react to that data when it sees it for the first time.
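For context, here's a minimal sketch of what pinning temperature looks like on an AWS model call. The model ID and request body follow the Anthropic Claude 2 text-completions format that Amazon Bedrock has used; verify both against the current Bedrock documentation before relying on them. And note that even at temperature 0 the model is still a probabilistic system; low temperature just makes it consistently pick its most likely output.

```python
import json
import boto3

# Minimal sketch: invoking Claude 2 via Amazon Bedrock with a pinned
# temperature. Model ID and body shape are assumptions based on the
# Anthropic text-completions format; check current Bedrock docs.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Summarize our refund policy in one sentence.\n\nAssistant:",
    "max_tokens_to_sample": 200,
    "temperature": 0,  # near-greedy decoding: favor the most likely token
})

response = client.invoke_model(modelId="anthropic.claude-v2", body=body)
print(json.loads(response["body"].read())["completion"])
```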

The goal of training, in some ways, is to remove that uncertainty. You're basically trying to teach the model, "When you see input A, I want you to respond with output B." So training is fundamental to this equation.
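As a toy illustration of that "input A, output B" framing: fine-tuning data boils down to prompt/completion pairs. The field names and examples below are illustrative, not any specific framework's required schema.

```python
import json

# Toy fine-tuning dataset: each record teaches the model that a given
# input should map to a given output. Field names are illustrative.
examples = [
    {"prompt": "Customer asks: Where is order #1234?",
     "completion": "Route to: order-tracking"},
    {"prompt": "Customer asks: I want a refund for a damaged item.",
     "completion": "Route to: returns-and-refunds"},
]

# JSONL (one JSON object per line) is a common format for training data.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```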

But what you brought up also raises a few other questions, namely, are you selecting the right model for your problem space? This one isn't AS MUCH of an issue right now as it could be in the future, but as models get more specialized, you're going to have to think about it more. But also, different models come in different parameter sizes. You may have heard, for example, that Meta released a 70-billion-parameter model and a seven-billion-parameter model at the same time. Well, the reason for that is that the 70-billion-parameter model benchmarks better on the ways we try to measure AI problem-solving. But as you can probably guess, it takes an order of magnitude more hardware to run, which means cost.
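To make the hardware point concrete, a common rule of thumb is roughly two bytes per parameter for 16-bit weights. The sketch below applies that rule, ignoring activations, KV cache, and other runtime overhead.

```python
# Rule-of-thumb memory footprint for 16-bit model weights:
# ~2 bytes per parameter, ignoring activations and runtime overhead.
BYTES_PER_PARAM_FP16 = 2

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    gigabytes = params * BYTES_PER_PARAM_FP16 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB of accelerator memory for weights alone")

# 7B -> ~14 GB (fits on one large GPU); 70B -> ~140 GB (a multi-GPU server)
```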

Casey:

Yeah, and it does seem like there's a new language model getting released every week at this point. I guess some of that comes down to getting better performance out of a smaller number of parameters, and hence lower cost?

Ryan:

Look at you, Casey! You're learning!

If I wasn't an AI voice, I might even sound proud at this point... In fact, I kind of do, don't I? Huh!

Casey:

Thanks, Ryan. I have the occasional moment of clarity...

I'm going to steal the outro from our traditional custodian and end today's episode here. As we wrap up part one, we hope you've started to think about the cost efficiency of scaling your solution and moving to production for your customers.

At Mission, we guide our customers all the time on these kinds of questions. And if that's a conversation you'd like to have with Ryan and the other experts on his team, reach out to us! We'll give you an hour of our time, free of charge, to talk through your challenges and how you can meet them.

You can always find us on the web at mission cloud dot com. And stay tuned for part 2 of this episode: working with APIs, training your models, and dealing with the unpredictability of users.

We hope you've enjoyed this episode. Good luck out there and happy building!

 
