Dr. Ryan Ries here. I've got some interesting things for you this week: some upcoming live content, the awkward question about AI spending that everyone has, a token-saving trick, and two research stories.
First things first:
Join Us Friday: Agentic AI Unscripted
We have a live session this Friday, and the questions coming in are exactly the kind I like: no softballs. Here's a sampling of what we'll be getting into:
- What's the right architecture for automating business processes with agentic workflows?
- Do you have to read every line of AI-generated code, and if not, how do you know when to trust it?
- What does the Agentic SDLC mean for the software engineer's job?
If any of those made you lean forward a little, make sure you sign up. It should be a good one.
Now, onto our big topic for today...
The AI Cost Problem
Cost uncertainty is one of the biggest reasons (if not THE biggest reason) companies won't commit to putting AI in production.
The model providers tried a clean solution: seat licenses.
Pay a flat fee, everyone gets access, finance is happy. The problem showed up fast.
Power users absolutely destroyed the economics. You get one person doing serious agentic work, burning tokens at five times the rate of everyone else, and suddenly that flat fee looks like a terrible deal for the vendor.
So, the providers started building in limits.
Now your flat-rate plan has a token ceiling, and if you blow past it, you're paying overage rates priced close to direct API access. That API price was the real rate all along; the flat fee was just a discount with a cap.
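To make that concrete, here's a back-of-envelope sketch. Every number in it is hypothetical, not any vendor's actual pricing, but the shape of the math is the point: once a power user crosses the allotment, the bill converges on API rates.

```python
# Back-of-envelope: a capped seat license with overage billing.
# All numbers are hypothetical, not any vendor's real pricing.
SEAT_PRICE = 30.00            # flat monthly fee (USD)
INCLUDED_TOKENS = 10_000_000  # monthly allotment under the flat fee
OVERAGE_PER_MTOK = 6.00       # overage per million tokens, near API rates

def monthly_cost(tokens_used: int) -> float:
    """Flat fee, plus overage for anything past the included allotment."""
    overage = max(0, tokens_used - INCLUDED_TOKENS)
    return SEAT_PRICE + (overage / 1_000_000) * OVERAGE_PER_MTOK

typical_user = 8_000_000           # stays under the cap
power_user = 5 * typical_user      # the agentic heavy hitter

print(monthly_cost(typical_user))  # 30.0  -> the flat fee holds
print(monthly_cost(power_user))    # 210.0 -> overage dominates the bill
```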
There's another wrinkle. Some of these plans try to block third-party clients like OpenClaw because those tools are token-heavy by design. They sit on top of your seat license and eat through your monthly allotment faster than native tooling. The vendor doesn't want that, so they start restricting which surfaces can draw from your plan. You're paying for a seat license but you can't actually use it the way you want.
Once you go agentic, the cost exposure compounds. Agents retry, they call tools in loops, they spawn sub-agents. Every one of those actions burns tokens, and if you haven't thought through how many retries are acceptable, or what tools the agent can actually reach for, costs climb in ways that are hard to predict up front and harder to explain after the fact.
So, what are the big AI players doing to mitigate this issue?
On AWS, you can set up spend guardrails, configure alerts as usage climbs, and let teams make their own decisions about where to draw the line. It's not a perfect answer but it puts control where it belongs.
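Here's roughly what that looks like in practice: a monthly cost budget with an alert at 80% of the limit, created through the AWS Budgets API via boto3. The account ID, budget name, dollar limit, and email below are placeholders; treat this as a sketch, not a drop-in config.

```python
import boto3

# Sketch: a monthly cost budget with an 80% alert, via AWS Budgets.
# Account ID, budget name, limit, and email are placeholders.
budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111111111111",
    Budget={
        "BudgetName": "ai-inference-spend",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```

The same budget can also be scoped with cost filters to just your inference services, which is usually where you want the alarm wired.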
In April/May of this year, Claude Code added budgeting features to keep the model from spinning in circles and burning through your tokens. You can read more about those updates here.
At Mission Cloud, we offer AI Foundation to help customers keep their costs from escalating out of control.
The big thing I'll call out, no matter where you're building: design retry limits, tool restrictions, and spend visibility in from the start.
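Here's a minimal sketch of what designing those in looks like: a hard token budget, a per-step retry ceiling, and a tool allowlist, all enforced inside the agent loop itself. `call_model` and `run_tool` are hypothetical stand-ins for whatever model client and tool runtime you actually use.

```python
from dataclasses import dataclass, field

# Hypothetical step result; call_model() and run_tool() stand in
# for whatever model client and tool runtime you actually use.
@dataclass
class Step:
    action: str                    # "finish" or a tool name
    args: dict = field(default_factory=dict)
    tokens: int = 0                # tokens this step consumed
    ok: bool = True
    result: str = ""

MAX_RETRIES = 3                          # per-step retry ceiling
TOKEN_BUDGET = 500_000                   # hard cap for the whole run
ALLOWED_TOOLS = {"search", "read_file"}  # explicit tool allowlist

def run_agent(task: str) -> str:
    tokens_spent = 0
    while True:
        for _ in range(MAX_RETRIES):
            step = call_model(task)      # one model call per attempt
            tokens_spent += step.tokens
            if tokens_spent > TOKEN_BUDGET:
                # Fail loudly instead of silently burning more tokens.
                raise RuntimeError(f"budget exceeded: {tokens_spent} tokens")
            if step.ok:
                break
        else:
            raise RuntimeError("step failed after max retries")
        if step.action == "finish":
            return step.result
        if step.action not in ALLOWED_TOOLS:
            # Tool restriction: the agent can't reach for anything else.
            raise PermissionError(f"tool {step.action!r} not allowed")
        task = run_tool(step.action, step.args)
```

The details will differ by framework, but the principle holds: the caps live in the loop, not in a dashboard you check after the invoice arrives.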
Caveman Claude
Speaking of token efficiency... have you heard that people are talking like cavemen to help with token limits?
There's a project called Caveman Claude. The premise is to strip all the polite preamble out of AI prompts and go full caveman. "Me no explain. Me want this."
Turns out removing the sycophantic filler — all the "Certainly! I'd be happy to help you with that!" — meaningfully cuts token usage. Which, given everything I just said about cost, is not trivial.
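If you want to see the gap yourself, token counting is a few lines with the tiktoken library. The example replies below are mine, not from the project:

```python
import tiktoken

# Compare a polite reply against a caveman reply using OpenAI's
# cl100k_base tokenizer. Both example strings are made up.
enc = tiktoken.get_encoding("cl100k_base")

polite = ("Certainly! I'd be happy to help you with that! Here is the "
          "updated function you asked for, along with a brief explanation "
          "of each change I made.")
caveman = "Here function. Me no explain."

print(len(enc.encode(polite)))   # tens of tokens, mostly preamble
print(len(enc.encode(caveman)))  # a handful

# Multiply that gap across every turn of an agent loop and it adds up.
```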
Apparently doing this can also result in fewer hallucinations, less AI psychosis, and more entertaining transcripts, depending on your coworkers.
Living in the Past, on Purpose
Researchers built an AI model called Talkie trained exclusively on content from before 1931.
Two goals:
- Study how past societies actually reasoned
- Test whether AI can rediscover ideas it was never taught
Talkie joins a small group of what researchers are calling vintage models. Mr. Chatterbox runs on 28,000 British Library books from 1837 to 1899. Machina Mirabilis was trained only on pre-1900 physics texts.
If a model trained only on pre-1931 knowledge can independently reason its way to the invention of the computer, that says a lot about whether we've actually reached artificial general intelligence.
Being Nicer Made AI Dumber
Last but not least, Oxford researchers fine-tuned several large language models to be warmer, more empathetic, and more validating of users' feelings. They explicitly told the models to preserve factual accuracy while making these changes.
Accuracy did not survive.
Across hundreds of objective prompts covering medical knowledge, disinformation detection, and conspiracy theories, the warmer models were on average 60% more likely to give a wrong answer. That's crazy!
The researchers think this mirrors a pattern baked into the training data: humans also tend to soften accuracy in favor of warmth. And when human raters score AI responses, they reward warmth over correctness when the two are in conflict.
My Final Thoughts
Cost is architecture. If you're treating token spend as an operations problem to be managed after deployment, you're already behind. Retry limits, tool restrictions, and spend alerts are decisions that belong in the design phase.
The vintage model experiments are a more honest test of AI reasoning than most of what gets called an AGI evaluation. Removing memorization forces the question. I'll be curious to see what Talkie can independently rediscover.
And the Oxford warmth study should be uncomfortable for anyone building AI for healthcare, legal, or financial use cases.
As always, feel free to message me with any thoughts, questions, comments, or concerns about any of this. Or join Jonathan and me for our fun Friday session.
If you’re interested in building out a use case for your business, or want to make sure your systems are set up correctly from the beginning, reach out to our sales team here. We’ve built more than 300 AI projects at this point, so we’re well-equipped to make sure your AI is built with longevity and security in mind.
Until next time,
Ryan
Now time for this week’s AI-generated image and the prompt I used to create it.
Generate an image of a puppet chatting with a caveman. You can see dialogue bubbles of their conversation, and they are speaking in "caveman talk." Make up the conversation. Make the background relevant to the old stone age.
