Anthropic’s new Claude prompt caching will save developers a fortune


Anthropic introduced prompt caching on its API, which remembers the context between API calls and allows developers to avoid repeating prompts.

The prompt caching feature is available in public beta on Claude 3.5 Sonnet and Claude 3 Haiku, but support for the largest Claude model, Opus, is still coming soon.

Prompt caching, described in this 2023 paper, lets users keep frequently used contexts in their sessions. Because the model remembers these prompts, users can add background information without paying the full input price for it on every call. This is helpful when someone wants to send a large amount of context in a prompt and then refer back to it in different conversations with the model. It also lets developers and other users better fine-tune model responses.

Anthropic said early users “have seen substantial speed and cost improvements with prompt caching for a variety of use cases — from including a full knowledge base to 100-shot examples to including each turn of a conversation in their prompt.”

The company said potential use cases include reducing costs and latency for long instructions and uploaded documents for conversational agents, faster code autocompletion, providing multiple instructions to agentic search tools and embedding entire documents in a prompt.

Anthropic (@AnthropicAI) just announced a game-changer for their API: Prompt caching.
Think of prompt caching like this: You're at a coffee shop. The first time you visit, you need to tell the barista your whole order. But next time? Just say "the usual."
That's prompt… pic.twitter.com/ASB1nkdY4U
— Dan Shipper (@danshipper)
August 14, 2024

Pricing cached prompts

One advantage of caching prompts is lower prices per token, and Anthropic said using cached prompts “is significantly cheaper” than the base input token price.

For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per 1 million tokens (MTok), while reading a cached prompt costs $0.30 per MTok. The base input price for Claude 3.5 Sonnet is $3/MTok, so by paying 25% more up front, you pay one-tenth of the base input price each time you reuse the cached prompt.
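To make that arithmetic concrete, here is a minimal Python sketch that compares input costs with and without caching, using only the Claude 3.5 Sonnet prices quoted above. The function names and the 100,000-token prompt size are hypothetical, and the sketch assumes every call after the first hits the cache; it ignores output tokens and any uncached part of the prompt.

```python
# Claude 3.5 Sonnet prices in USD per million tokens (MTok), as quoted in the article.
BASE_INPUT_PER_MTOK = 3.00
CACHE_WRITE_PER_MTOK = 3.75
CACHE_READ_PER_MTOK = 0.30


def input_cost_without_cache(prompt_tokens: int, calls: int) -> float:
    """Every call pays the base input price for the full prompt."""
    return calls * (prompt_tokens / 1_000_000) * BASE_INPUT_PER_MTOK


def input_cost_with_cache(prompt_tokens: int, calls: int) -> float:
    """First call pays the cache-write price; later calls pay the cache-read price.

    Assumption: every later call arrives while the cached prompt is still alive.
    """
    if calls <= 0:
        return 0.0
    mtok = prompt_tokens / 1_000_000
    return mtok * CACHE_WRITE_PER_MTOK + (calls - 1) * mtok * CACHE_READ_PER_MTOK


if __name__ == "__main__":
    prompt_tokens = 100_000  # hypothetical prompt size
    for calls in (1, 2, 10):
        plain = input_cost_without_cache(prompt_tokens, calls)
        cached = input_cost_with_cache(prompt_tokens, calls)
        print(f"{calls:>2} calls: ${plain:.3f} without cache, ${cached:.3f} with cache")
```

Under these assumptions, caching costs slightly more for a single call but is already cheaper by the second call, and the gap widens with every reuse.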

We just rolled out prompt caching in the Anthropic API.
It cuts API input costs by up to 90% and reduces latency by up to 80%.
Here's how it works:
— Alex Albert (@alexalbert__)
August 14, 2024

Speaking of costs, the initial API call is slightly more expensive (to account for storing the prompt in the cache) but all subsequent calls are one-tenth the normal price. pic.twitter.com/3cPkz8c0rm
— Alex Albert (@alexalbert__)
August 14, 2024

Claude 3 Haiku users will pay $0.30/MTok to cache and $0.03/MTok when using stored prompts.

While prompt caching is not yet available for Claude 3 Opus, Anthropic has already published its prices. Writing to the cache will cost $18.75/MTok, while reading a cached prompt will cost $1.50/MTok.

However, as AI influencer Simon Willison noted on X, Anthropic’s cache only has a 5-minute lifetime and is refreshed upon each use.

Looks similar to Gemini's context caching, but the Anthropic pricing model is different
Gemini charge $4.50/million tokens/hour to keep the context cache warm
Anthropic charge for cache writes, and "cache has a 5-minute lifetime, refreshed each time the cached content is used" https://t.co/rfMQE2J3Rs
— Simon Willison (@simonw)
August 14, 2024

Of course, this is not the first time Anthropic has tried to compete against other AI platforms through pricing. Prior to the release of the Claude 3 family of models, Anthropic slashed the prices of its tokens.

It’s now in something of a “race to the bottom” against rivals including Google and OpenAI when it comes to offering low-priced options for third-party developers building atop its platform.

Highly requested feature

Other platforms offer a version of prompt caching. Lamina, an LLM inference system, utilizes KV caching to lower the cost of GPUs. A cursory look through OpenAI’s developer forums or GitHub will bring up questions about how to cache prompts.

Prompt caching is not the same as large language model memory. OpenAI’s GPT-4o, for example, offers a memory feature in which the model remembers preferences or details. But unlike prompt caching, it does not store the actual prompts and responses.
