← Back to blog May 2, 2026

Prompt caching 101: cut Claude and GPT costs by 90% without changing your prompts

Prompt caching is the easiest cost win in AI. Here's how it works, when to use it, and how JJAPI handles it for you.

If you’re paying full price for your AI API calls in 2026, you’re leaving money on the table. Prompt caching — supported natively by both Anthropic and OpenAI — gives you up to 90% off the input cost of repeated prefixes.

This post explains how it works and how to use it through JJAPI without writing cache logic.

The intuition

Most production AI calls have a long stable prefix:

[ 5000 tokens of system prompt + few-shot examples + reference docs ]
+
[ 50 tokens of "current user query" ]

You’re paying input cost on 5050 tokens every call. But the first 5000 are identical between calls. The vendors realized this and added caching: if the prefix matches a recent call, they charge a discounted “cache hit” rate for that portion.

Pricing typically looks like:

	Cache write	Cache read
Anthropic	1.25× input price	0.1× input price
OpenAI	Same as input	0.5× input price

The first call (cache write) is slightly more expensive. Every subsequent call within ~5 minutes is dramatically cheaper.

Native, automatic on JJAPI

When you call Claude through JJAPI, prompt caching headers (cache_control: ephemeral) are passed through to Anthropic transparently. You opt in by marking which message chunks are cacheable:

from openai import OpenAI
client = OpenAI(base_url="https://api.jjapi.net/v1", api_key="sk-jjapi-...")

response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": LONG_SYSTEM_PROMPT,  # 5000 tokens
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_query},
    ],
)

For GPT-4o, OpenAI does this automatically — you don’t need to mark anything. As long as your prefix is at least 1024 tokens and identical across calls, cache discounts apply.

When NOT to use caching

One-shot calls. If a prompt isn’t reused, the cache-write surcharge makes it more expensive, not less.
Sub-1024-token prefixes. OpenAI requires ≥1024 tokens; Anthropic requires ≥1024 for Sonnet and ≥2048 for Haiku.
Highly variable prompts. If you template lots of values into the prefix, hash misses kill the cache.

Practical optimization

Reorder your prompt so all stable content comes first:

[ System role + persona ]
[ Stable knowledge base / few-shot examples ]
[ ----- cache boundary ----- ]
[ Variable per-request context ]
[ User query ]

This pattern alone often delivers 70-90% cost reduction for chat applications with rich system prompts.

What’s automatic vs manual on JJAPI

Vendor	Caching	Your action
OpenAI (GPT-4o, mini)	Automatic	None — just keep prefixes stable
Anthropic (all Claude)	Opt-in	Set `cache_control: ephemeral` on the cacheable block
Google Gemini	Manual via context caching	Use `cached_content` param (we support it)
DeepSeek	Automatic	None

For a chatbot with a 4000-token system prompt and 1000 daily users averaging 10 turns each, prompt caching can cut your monthly bill from ~$300 to ~$30. Free money — go take it.

Ready to apply this in your app?

Get a JJAPI key — $18 →