Prompt caching 101: cut Claude and GPT costs by 90% without changing your prompts
Prompt caching is the easiest cost win in AI. Here's how it works, when to use it, and how JJAPI handles it for you.
If you’re paying full price for your AI API calls in 2026, you’re leaving money on the table. Prompt caching — supported natively by both Anthropic and OpenAI — gives you up to 90% off the input cost of repeated prefixes.
This post explains how it works and how to use it through JJAPI without writing cache logic.
The intuition
Most production AI calls have a long stable prefix:
[ 5000 tokens of system prompt + few-shot examples + reference docs ]
+
[ 50 tokens of "current user query" ]
You’re paying input cost on 5050 tokens every call. But the first 5000 are identical between calls. The vendors realized this and added caching: if the prefix matches a recent call, they charge a discounted “cache hit” rate for that portion.
Pricing typically looks like:
| Vendor | Cache write | Cache read |
|---|---|---|
| Anthropic | 1.25× input price | 0.1× input price |
| OpenAI | Same as input | 0.5× input price |
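On Anthropic, the first call (the cache write) costs slightly more than an uncached call; on OpenAI it costs the same. Every subsequent call that hits the cache, typically within ~5 minutes of the last use, is dramatically cheaper on the prefix portion.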
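To make the trade-off concrete, here is a minimal sketch that applies the multipliers from the table to a prefix sent repeatedly inside the cache window. The function names are illustrative, and the multipliers are the typical values quoted above rather than a guarantee for any particular model.

```python
# Relative cost of a repeated prefix, in units of "one uncached prefix".
# Multipliers come from the table above; real prices vary by model.
MULTIPLIERS = {
    "anthropic": {"write": 1.25, "read": 0.1},
    "openai": {"write": 1.0, "read": 0.5},
}


def prefix_cost(vendor: str, n_calls: int) -> float:
    """Cost of sending the same prefix n_calls times inside the cache window."""
    if n_calls <= 0:
        return 0.0
    m = MULTIPLIERS[vendor]
    # The first call writes the cache; every later call reads it.
    return m["write"] + (n_calls - 1) * m["read"]


def savings(vendor: str, n_calls: int) -> float:
    """Fraction saved versus paying full input price on every call."""
    return 1 - prefix_cost(vendor, n_calls) / n_calls


for vendor in MULTIPLIERS:
    for n in (1, 2, 10, 100):
        print(f"{vendor:9s} n={n:3d} savings={savings(vendor, n):+.1%}")
```

With these numbers a single Anthropic call costs 25% more than an uncached one, the second call already puts you ahead, and the savings approach the read discount as the number of hits grows.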
Native, automatic on JJAPI
When you call Claude through JJAPI, the cache_control field ({"type": "ephemeral"}) on your content blocks is passed through to Anthropic transparently. You opt in by marking which content blocks are cacheable:
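from openai import OpenAI

# Placeholders: substitute your own prompt and query.
LONG_SYSTEM_PROMPT = "..."  # your stable ~5000-token system prompt
user_query = "..."          # the per-request user input

# Point the OpenAI SDK at JJAPI's endpoint.
client = OpenAI(base_url="https://api.jjapi.net/v1", api_key="sk-jjapi-...")

response = client.chat.completions.create(
    model="claude-3-5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": LONG_SYSTEM_PROMPT,  # 5000 tokens
                    # Marks this block as the end of the cacheable prefix.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_query},
    ],
)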
For GPT-4o, OpenAI does this automatically — you don’t need to mark anything. As long as your prefix is at least 1024 tokens and identical across calls, cache discounts apply.
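One way to check that the discount is actually applying is to look at the usage block of the response. The sketch below assumes an OpenAI-style usage payload, where cached tokens are reported under prompt_tokens_details.cached_tokens; whether JJAPI returns the same field for every vendor is an assumption here, and cached_fraction is a hypothetical helper name.

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from cache, given an OpenAI-style usage dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # The details object may be missing or null when nothing was cached.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Illustrative payload shaped like a Chat Completions usage object (made-up numbers).
example_usage = {
    "prompt_tokens": 5050,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 4992},
}

print(f"{cached_fraction(example_usage):.1%} of the prompt was a cache hit")
```

With the openai Python SDK you could pass response.usage.model_dump() to this helper. A fraction that stays at zero across repeated calls usually means the prefix is too short or is changing between requests.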
When NOT to use caching
- One-shot calls. If a prompt isn’t reused within the cache window, Anthropic’s cache-write surcharge makes it more expensive, not less; on OpenAI it simply saves nothing.
- Sub-1024-token prefixes. OpenAI requires ≥1024 tokens; Anthropic requires ≥1024 for Sonnet and ≥2048 for Haiku.
- Highly variable prompts. If you template per-request values into the prefix, it stops matching from the first changed token onward, and the resulting misses kill the cache.
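The last point is easy to check before you ship. The sketch below compares two rendered prompts and reports how many leading characters they share. Real caches match on tokens rather than characters, so this is only a rough proxy, and the templates and helper names are hypothetical.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


# Hypothetical stable content standing in for reference docs.
DOCS = "Reference docs... " * 50


def variable_first(name: str) -> str:
    # Per-request value at the top: the prompts diverge almost immediately.
    return f"You are helping {name}. " + DOCS


def variable_last(name: str) -> str:
    # Per-request value at the end: the whole docs block stays identical.
    return DOCS + f"You are helping {name}."


print(shared_prefix_len(variable_first("Ana"), variable_first("Bo")))
print(shared_prefix_len(variable_last("Ana"), variable_last("Bo")))
```

The first layout shares only the opening words, so nothing useful can be cached; the second shares the entire stable block.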
Practical optimization
Reorder your prompt so all stable content comes first:
[ System role + persona ]
[ Stable knowledge base / few-shot examples ]
[ ----- cache boundary ----- ]
[ Variable per-request context ]
[ User query ]
This pattern alone often delivers a 70-90% reduction in input cost for chat applications with rich system prompts.
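As a rough sketch of that layout in code, the helper below puts the stable block first with the cache marker on it and appends everything request-specific afterwards. It reuses the message shape from the Claude example earlier in this post; build_messages and its parameter names are illustrative, not part of any SDK.

```python
def build_messages(stable_prefix: str, request_context: str, user_query: str) -> list:
    """Assemble messages with the cacheable block first and variable content last."""
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": stable_prefix,
                    # Cache boundary: everything up to and including this block.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        # Per-request material sits after the boundary so it never changes the prefix.
        {"role": "user", "content": f"{request_context}\n\n{user_query}"},
    ]


first = build_messages("persona + examples", "order #1 details", "Where is my order?")
second = build_messages("persona + examples", "order #2 details", "Can I get a refund?")
print(first[0] == second[0])  # the cacheable block is identical across requests
```

Keeping the per-request context out of the system block means two different requests still present a byte-identical prefix to the vendor.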
What’s automatic vs manual on JJAPI
| Vendor | Caching | Your action |
|---|---|---|
| OpenAI (GPT-4o, mini) | Automatic | None — just keep prefixes stable |
| Anthropic (all Claude) | Opt-in | Set cache_control: {"type": "ephemeral"} on the cacheable block |
| Google Gemini | Manual via context caching | Use cached_content param (we support it) |
| DeepSeek | Automatic | None |
For a chatbot with a 4000-token system prompt and 1000 daily users averaging 10 turns each, prompt caching can cut the monthly input bill for that system prompt from ~$300 to ~$30. Free money: go take it.
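The arithmetic behind an estimate like this can be sketched as follows. The post does not state a model or price, so the figures here are assumptions: 10 turns per user per day, a 30-day month, an input price of $0.25 per million tokens (chosen because it reproduces the ~$300 figure), and every call billed as a cache read at 0.1× input price. Cache writes and growing conversation history are ignored, so treat the result as a best case.

```python
def monthly_prefix_cost(
    users: int,
    turns_per_day: int,
    days: int,
    prefix_tokens: int,
    usd_per_mtok: float,
    multiplier: float = 1.0,
) -> float:
    """Monthly spend on the repeated prefix alone, in USD."""
    calls = users * turns_per_day * days
    return calls * prefix_tokens / 1_000_000 * usd_per_mtok * multiplier


# Assumed inputs (see the note above); only the prefix size and user counts come from the post.
full = monthly_prefix_cost(1000, 10, 30, 4000, usd_per_mtok=0.25)
cached = monthly_prefix_cost(1000, 10, 30, 4000, usd_per_mtok=0.25, multiplier=0.1)

print(f"uncached: ${full:,.0f}/month, all cache reads: ${cached:,.0f}/month")
```

Swapping in your own model's input price and read multiplier gives a quick upper bound on what caching can save on the prefix.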