Mission: Generate

Aired March 21, 2024

Getting Gen AI Solutions into Production, Part 2

In this episode, which is part 2 of 2 on this topic, we continue our conversation about getting gen AI solutions into production. We discuss tuning and prompting, API request scaling, Amazon Bedrock, and the problems of probabilistic models + unpredictable humans.

Show notes:

AWS’s explanation of what Bedrock is and how it works - https://aws.amazon.com/bedrock/

An example of a real solution we built with Bedrock for a customer - https://www.missioncloud.com/case-studies/magellantv-uses-generative-ai-for-expansion

Anyscale's blog on "Numbers every LLM Developer should know," including the average 1.3 tokens per word - https://www.anyscale.com/blog/num-every-llm-developer-should-know

Our blog on Frequently Asked Questions we hear from our customers on generative AI - https://www.missioncloud.com/blog/frequently-asked-questions-about-generative-ai

Our offer of 1 free hour of consulting for scaling your generative AI solution - https://aws.amazon.com/marketplace/pp/prodview-rhor3y55mosui

Official Transcript:

Ryan:

Welcome to Mission Generate, the podcast where we explore the power of AWS and the cutting-edge techniques we use with real customers to build generative AI solutions for them. Join us each episode as we dig into the technical details, cut through the hype, and uncover the business possibilities...

I'm Ryan Ries, our generative AI practice lead and your host.

In today's episode, which is part 2 of this 2-part series, we're continuing our conversation about getting your solution into production.

Now, in part 1, we focused a lot on cost. But we're going to switch gears a bit and talk about efficacy.

Efficacy is how you might measure a solution's ability to solve a problem consistently and efficiently. And this is another area where we see teams struggle to scale. Because handling the uncertainties of imperfect data, unpredictable users and models, and complex real world scenarios--those are all major challenges.

Production means crafting a solution that can do all of that. So let's talk about how you can get there and jump back in with part 2!

Casey:

Hey there! Casey again, jumping back in for this episode to continue the conversation. A quick disclaimer before we begin:

Just in case you missed our previous episodes, while Ryan and I may sound like we're in a room together, we're actually two synthetic voices, empowered by generative AI, to tell the stories of this podcast. Hopefully, we've sounded realistic enough so far that we fooled you if you didn't already know that.

But this isn't just a parlor trick--it's another example of the amazing capabilities of this technology. So we thought we'd show it off to you by building a whole podcast with it. Cool, right?

Okay, back to the episode. I want to pick up where we left off.

I know that a lot of our previous conversation was focused on training and running your own model on AWS hardware. But what about if you plan to access a model entirely via an API, like through Bedrock, for example? Ryan, can you talk to us about that? What does that mean for a team that's looking to scale?

Ryan:

Yeah, that can be a great choice for some kinds of problems. The truth is, you probably don't want to be paying for the cost of running or training a model on your own hardware, not unless it's providing real business value to you.

But, let's take your Bedrock example, and explore that a bit further.

Bedrock just recently launched publicly on AWS, but we've actually been using it for a couple of months here at Mission. For those of you that haven't heard, it's Amazon's native way of working with large language models. It gives you some real advantages in terms of speed and security that are worth considering--especially if you're concerned about the efficiency of your solution.

Let's look at a classic bottleneck in production to illustrate this: the number of API requests you're making.

The traditional way, just sending JSON over the internet, can be slow. Your requests may time out when a model is under heavy load, and if you need a long or complex answer--let's say an answer involving multiple steps in a chain-of-thought solution--then you may be left waiting a long time for your response.

Now take those seconds per request and multiply them by the number of users and how often they'll be using your solution. See where I'm going with this? If you're getting an untenably large number from that math, or you realize that user experience would suffer from latency issues, you may need to rework.
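As a rough illustration of that math, here is a minimal Python sketch. The function names and the example figures (3,600 users, 10 requests per user per hour, 6 seconds per request) are assumptions made up for illustration, not numbers from the episode. It multiplies seconds per request by the number of users and how often they call the model, then divides by the length of the time window to estimate how many requests are in flight at once on average.

```python
SECONDS_PER_HOUR = 3600


def request_seconds_per_hour(users: int,
                             requests_per_user_per_hour: float,
                             seconds_per_request: float) -> float:
    """Total time spent waiting on the model in one hour, summed over all users."""
    return users * requests_per_user_per_hour * seconds_per_request


def average_in_flight(users: int,
                      requests_per_user_per_hour: float,
                      seconds_per_request: float) -> float:
    """Average number of requests open at the same moment.

    This is arrival rate times time per request: the hourly total of
    request-seconds divided by the number of seconds in an hour.
    """
    total = request_seconds_per_hour(users, requests_per_user_per_hour,
                                     seconds_per_request)
    return total / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Hypothetical workload, chosen only to make the arithmetic easy to follow.
    users, per_hour, latency = 3600, 10, 6.0
    print("request-seconds per hour:",
          request_seconds_per_hour(users, per_hour, latency))
    print("average requests in flight:",
          average_in_flight(users, per_hour, latency))
```

If the in-flight figure is higher than what the model endpoint or your rate limits can comfortably absorb, that is a sign the design may need rework before production.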

And this is one area where Bedrock has a large advantage. Because Bedrock lets you work with a model's API while keeping the traffic internal to AWS. So when you're talking to a model with this service, your requests and responses are much faster--that's not to say it will magically fix all of these problems, by the way. But it can make a real difference for getting to production.

Casey:

Yeah, and since we're on this subject, here's a bit of helpful math for our listeners.

The consensus is that there's an average cost of about 1.3 tokens per word across different models. You'll want to remember this number when you're working with an API because they charge by the token and a token itself is not a word.

And as you can see, this can be a useful way to forecast costs. Most models will price in a way that looks really pretty, by the way. I don't want to say it's disingenuous, but their pricing page will invariably show you some tiny fraction of a cent for ten thousand tokens. But you can see how that can add up quickly.
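To show how that forecast could work in practice, here is a minimal Python sketch. The 1.3 tokens-per-word average comes from the episode; everything else is an assumption for illustration, including the function names, the word counts, the request volume, and especially the per-1,000-token prices, which are invented and do not reflect any real provider's pricing.

```python
TOKENS_PER_WORD = 1.3  # average cited in the episode


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count, using the 1.3 average."""
    return round(word_count * TOKENS_PER_WORD)


def estimate_cost(prompt_words: int,
                  response_words: int,
                  requests: int,
                  input_price_per_1k: float,
                  output_price_per_1k: float) -> float:
    """Estimated spend for `requests` calls, given prices per 1,000 tokens.

    Input and output tokens are priced separately because many APIs
    charge different rates for each (an assumption about your provider).
    """
    input_tokens = estimate_tokens(prompt_words)
    output_tokens = estimate_tokens(response_words)
    per_request = (input_tokens * input_price_per_1k
                   + output_tokens * output_price_per_1k) / 1000
    return per_request * requests


if __name__ == "__main__":
    # Hypothetical prices in dollars per 1,000 tokens; check your provider's page.
    cost = estimate_cost(prompt_words=500, response_words=200,
                         requests=1_000_000,
                         input_price_per_1k=0.001,
                         output_price_per_1k=0.002)
    print(f"estimated cost for one million requests: ${cost:,.2f}")
```

A fraction of a cent per request looks small on a pricing page, but multiplied by a million requests it becomes a line item worth forecasting.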

Ryan:

Yeah, it can add up. And you're always going to have to be aware of that kind of math if you're going to be using a model's API instead of running your own. Like, these companies are trying to make money. And, if we're being totally honest, a lot of them are priced right now as loss-leaders.

Everything I've heard about ChatGPT's API, for example, suggests to me that they're losing money per API call. So that pricing may not last forever.

And of course, this is an issue of scale, as well.

Let's go back to our example solution and say that it's sending a relatively modest number of tokens and asking for a modest number. If that's you, then great! You can hit that model's API as many times as you need to and not worry about cost. But if you're solving a problem where you have to take advantage of a model's full context window, let's say, like feeding it a huge amount of text every prompt--well now you may have built something that won't scale appropriately for the number of people that are going to be using it.

Casey:

Totally. The pricing models are a bit up in the air, I think; like we're still in very early days here. And there's a lot of VC money that's been poured into the space in the hopes of capturing users. We're not even seeing the actual cost for running these APIs because these companies are incentivized to keep their pricing deflated for user acquisition. We can't expect that forever.

But I want to make sure we hit another topic you brought up in the intro, which we haven't covered yet. Hallucinations. That's when a model generates something it thinks you want, but it's entirely made up. Like it predicts the text you're expecting, but when it doesn't know the answer, it just makes up an answer anyway.

This is still a real problem with all the models out there, right? So how does that affect getting a Gen AI solution into production?

Ryan:

Yeah, no one has found the answer to the hallucination problem. I think the jury's still out on the actual solvability of it. But certainly if you're using a model you haven't trained, you're going to be at risk of getting some occasionally wacky responses.

We've even seen things happen where something goes awry when you're making a long series of API calls, and even though you're asking it the same thing on each one, you know, one in one hundred or two hundred might just reply back, "Hey, I'm a generative AI model and I can't do that" even though it's the exact same prompt as the previous one.

The fact is, there's just a level of unpredictability to this technology and you're going to have to account for that in any serious production-grade solution. That's especially true if your use case involves your customers interacting with it.
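One common way to account for that kind of occasional odd reply is to check each response and retry when it looks like a refusal. The Python below is a minimal sketch of that idea, not a description of how Mission handles it. The `call_model` argument is a hypothetical stand-in for whatever function sends your prompt to a model, and the refusal phrases are assumptions you would replace with patterns seen in your own logs.

```python
from typing import Callable

# Hypothetical phrases that suggest the model declined instead of answering.
REFUSAL_MARKERS = ("i'm a generative ai model", "i can't do that")


def looks_like_refusal(reply: str) -> bool:
    """Return True if the reply contains any known refusal phrase."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def call_with_retry(call_model: Callable[[str], str],
                    prompt: str,
                    max_attempts: int = 3) -> str:
    """Send the same prompt up to `max_attempts` times, skipping refusals.

    `call_model` is a placeholder for your own API wrapper. It takes a
    prompt string and returns the model's reply as a string.
    """
    for _ in range(max_attempts):
        reply = call_model(prompt)
        if not looks_like_refusal(reply):
            return reply
    raise RuntimeError(f"model refused {max_attempts} times in a row")


if __name__ == "__main__":
    # Fake model for demonstration: refuses once, then answers.
    canned = iter(["Hey, I'm a generative AI model and I can't do that",
                   "Here is the summary you asked for."])
    print(call_with_retry(lambda prompt: next(canned), "Summarize this text."))
```

String matching like this only catches refusals you already know about. It does nothing for confident but wrong answers, which need checks specific to your use case.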

The last thing you want to have happen is you build a customer support bot, let's say, and then it goes and tells your customer something that's nonsense or even something that might damage their configuration, or cause them to lose data, or something crazy like that.

So when we talk about production, we're also talking about the reality of dealing with users, who may ask things of a model that surprise it. And we're talking about the probabilistic dimensions of these models themselves.

But there are ways around this. You can improve accuracy through techniques like retrieval-augmented generation for example.

Casey:

Now that's when you take an LLM and you give it a vector database full of information you want it to refer to. And then, when a prompt touches on information in that database, it's like it's using those vectors to try and predict the answer. Is that right?

Ryan:

That's a decent summary, I guess, for a product marketer. Ha ha. Just kidding!

I enjoy the meta-irony here that you wrote this script so you're dunking on yourself. I guess I do dunk on you sometimes, huh?

But yeah, retrieval augmented generation can help with accuracy. But you'll also need to think about the quality of data you're feeding it. We say a lot on my team, "Garbage in; garbage out." Which means, if you give your solution bad data, data that hasn't been appropriately cleaned or structured, for instance, your results are going to suffer.
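For readers who want to see the shape of retrieval-augmented generation, here is a toy Python sketch. It is heavily simplified and rests on stated assumptions: word counts stand in for the learned embeddings a real system would use, a plain list stands in for the vector database, and the sample documents and function names are invented for illustration. The pattern it shows is the usual one: find the stored text most similar to the question, then place that text in the prompt so the model can answer from it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (real systems use a learned model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list, top_k: int = 1) -> list:
    """Return the `top_k` documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents,
                    key=lambda doc: cosine(query_vector, embed(doc)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, documents: list, top_k: int = 1) -> str:
    """Put the retrieved text into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    # Invented knowledge base for demonstration only.
    docs = ["Refunds are processed within five business days.",
            "Our support line is open on weekdays.",
            "Passwords must be reset every ninety days."]
    print(build_prompt("How long do refunds take?", docs))
```

This also makes the "garbage in; garbage out" point concrete: whatever sits in the document list is what ends up in the prompt, so messy or wrong source data flows straight into the answer.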

So once again, you can see how getting to production is often just as much about the architecture surrounding the model as it is whatever in particular you're trying to do with that model.

Casey:

I think we've reached a good stopping place with that answer.

To summarize both parts of this series:

When you're thinking about moving a Gen AI solution into production, you're going to need to account for a lot more than what you can get an MVP to do for you.

You need to start thinking about the cost of the underlying infrastructure, and if you're making API calls to your model, are you engineering your prompts to be maximally efficient? You've got to think about how you're supporting your model if you're going to be training it and feeding it data; and you've got to think about the quality of that data itself. Lastly, you've got to think about your users; if you're putting your solution up on the internet, for example, you have a different set of risks associated with the model doing something weird or unpredictable.

Basically, you've got to come up with a plan for how you're going to scale to meet these challenges.

Ryan:

I think that's right, Casey.

So as we close here, I want to end the way we do each episode and make a pitch for my team. A lot of what we've talked about blends strategic and technical thinking, which is something we do a lot of at Mission. If you're having these kinds of conversations at the moment or facing these kinds of decisions and want to talk it through with some experts, reach out to us.

We'll give you an hour of our time, free of charge, to discuss what you've built and how to take it to the next level of scale. You can find us on the web at mission cloud dot com. Thanks for listening to this episode, and, hey, if you think of something you'd like us to cover, tell us!

Best of luck out there and, as always... happy building!