May 15, 2026

How to fail over between AI vendors without breaking your app

Every major LLM provider had a multi-hour outage in 2026. Here's the failover pattern that keeps your product alive when theirs goes down.

Pop quiz: how does your app respond if Anthropic returns 503 for the next two hours?

If the answer is “users see an error” or “users see nothing at all,” you have a single point of failure. This post is about how to fix that, using one short routing function instead of a dual-SDK nightmare.

What outages actually look like

Real production data from the last twelve months:

OpenAI: ~2 hours of degraded service in January (token rate-limit explosion), 6 hours of partial outage in March
Anthropic: 3 separate Claude outages totaling ~8 hours
Google Gemini: Several rate-limit cliffs during launches
DeepSeek: Capacity issues during early R1 release

These aren’t theoretical. They happen roughly once a quarter. If you have a single-vendor app, every minute your vendor is down is a minute your app is down.
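To put numbers on that, here is a back-of-the-envelope sketch using the hour counts from the list above. It rests on two loud assumptions that are not in the data: every degraded hour counts as fully down, and the two vendors fail independently of each other. Real outages can be correlated, so treat the combined figure as a best case, not a guarantee.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def availability(hours_down: float) -> float:
    """Fraction of the year a vendor was usable."""
    return 1 - hours_down / HOURS_PER_YEAR

def hours_both_down(hours_a: float, hours_b: float) -> float:
    """Expected yearly overlap of two outages, ASSUMING independence."""
    p_a = hours_a / HOURS_PER_YEAR
    p_b = hours_b / HOURS_PER_YEAR
    return p_a * p_b * HOURS_PER_YEAR

primary_down = 8.0       # Anthropic: ~8 hours, from the list above
fallback_down = 2.0 + 6.0  # OpenAI: ~2 hours + 6 hours, from the list above

print(f"single vendor: {availability(primary_down):.4%} available")
overlap_seconds = hours_both_down(primary_down, fallback_down) * 3600
print(f"with fallback: about {overlap_seconds:.0f} seconds/year with both down")
```

Under those assumptions, eight hours of yearly downtime shrinks to well under a minute once a second vendor is in the chain.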

The pattern: primary + fallback + budget fallback

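from openai import OpenAI, APIError

# Note: the SDK retries failed requests twice by default, so a tier can take
# longer than its timeout. Pass max_retries=0 to OpenAI(...) to disable that.
client = OpenAI(base_url="https://api.jjapi.net/v1", api_key="sk-jjapi-...")

# (model, per-request timeout in seconds), tried in order
CHAIN = [
    ("claude-sonnet-4-6", 15.0),   # primary
    ("gpt-5.5", 15.0),             # equivalent quality fallback
    ("deepseek-chat", 30.0),       # cheaper, slower, still good enough
]

def call_with_failover(messages, **kwargs):
    """Try each model in CHAIN in order and return the first success."""
    last_error = None
    for model, timeout in CHAIN:
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=timeout,
                **kwargs,
            )
        # APIError is the SDK's base error class, so this also covers
        # RateLimitError and APITimeoutError.
        except APIError as e:
            last_error = e
            continue
    # Every tier failed: surface the last error to the caller.
    raise last_error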

That’s the whole pattern. Three tiers, a longer timeout on the last one, and errors swallowed until every tier has failed. Your users see slightly different prose quality during an outage, not a broken page.
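One possible refinement: the function above fails over on every API error, including ones a second vendor cannot fix, such as a malformed request or a bad API key. The sketch below shows one way to separate the two cases. `should_fail_over` is a hypothetical helper, not part of any SDK. It assumes the exception carries an HTTP `status_code` attribute when a response was received (as the OpenAI Python SDK's status errors do) and has none for connection failures and timeouts.

```python
# Status codes that usually mean "the vendor is struggling", not "you sent
# a bad request". This set is an assumption; tune it to your own traffic.
RETRYABLE_STATUS = {408, 429}

def should_fail_over(exc: Exception) -> bool:
    """Decide whether the next model in the chain is worth trying."""
    status = getattr(exc, "status_code", None)
    if status is None:
        # No HTTP response at all: connection error or timeout.
        return True
    return status in RETRYABLE_STATUS or status >= 500


# Stand-in exception for demonstration only (hypothetical, not an SDK class).
class FakeStatusError(Exception):
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

print(should_fail_over(FakeStatusError(503)))  # vendor outage
print(should_fail_over(FakeStatusError(401)))  # bad API key
```

Inside the `except` block you would then re-raise immediately when `should_fail_over(e)` is false, so configuration bugs surface right away instead of burning through the whole chain.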

What “equivalent quality” actually means

These models are close enough in capability that switching mid-conversation is rarely visible to users:

claude-sonnet-4-6 ⇄ gpt-5.5 (best ⇄ best)
claude-haiku-4-5 ⇄ gpt-5.4-mini ⇄ deepseek-chat (cheap ⇄ cheap)
gemini-2.5-pro ⇄ claude-sonnet-4-6 (when long context matters)

These are NOT equivalent and will visibly degrade:

o1 → anything (o1’s reasoning chains aren’t replicable)
claude-opus-4-7 → smaller models (Opus output style is distinctive)
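The pairings above can be kept as data so that a chain is derived from the primary model instead of being hand-written. This is only a sketch: `TIERS`, `NO_EQUIVALENT`, and `chain_for` are hypothetical names, and the model IDs are copied from the lists above. A model that appears in more than one tier uses the first tier that contains it.

```python
# Groups of models treated as interchangeable, taken from the lists above.
TIERS = [
    ["claude-sonnet-4-6", "gpt-5.5"],                       # best <-> best
    ["claude-haiku-4-5", "gpt-5.4-mini", "deepseek-chat"],  # cheap <-> cheap
    ["gemini-2.5-pro", "claude-sonnet-4-6"],                # long context
]

# Models with no like-for-like substitute: never fall back silently.
NO_EQUIVALENT = {"o1", "claude-opus-4-7"}

def chain_for(primary: str) -> list[str]:
    """Return the primary model followed by its same-tier fallbacks."""
    if primary in NO_EQUIVALENT:
        return [primary]
    for tier in TIERS:
        if primary in tier:
            return [primary] + [m for m in tier if m != primary]
    # Unknown model: no fallbacks rather than a guess.
    return [primary]

print(chain_for("claude-haiku-4-5"))
print(chain_for("o1"))
```

Returning a single-element chain for `o1` and `claude-opus-4-7` makes the degradation an explicit product decision: the caller has to opt in to a weaker substitute.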

Why JJAPI makes this easy

If you implemented the above with three separate SDKs (anthropic, openai, deepseek), you’d write around 60 lines of adapter code. With JJAPI, you write only the short function above. The wire format is identical because everything goes through one OpenAI-compatible endpoint.

We also do upstream-level failover automatically: if Anthropic returns 5xx for a claude-sonnet-4-6 request, JJAPI retries against alternate Anthropic regions before your code sees an error. Combine that with app-level failover above and you have two layers of protection.

A note on streaming failover

Streaming responses are trickier because some tokens have already been sent. The pattern there is:

Buffer the first 50ms of the stream before flushing to the user.
If the upstream fails before you flush, switch silently.
If it fails after flushing, append a graceful “—continuing on backup—” marker and start the second stream.

Most users tolerate this. None tolerate a half-rendered response that stops with [stream error].
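The three streaming steps above can be sketched with plain generators. This is a simplified illustration, not a production implementation: `stream_with_failover` and its parameters are hypothetical, each source is any zero-argument callable that returns an iterable of text chunks, and the buffer window is only checked when a chunk arrives (a stalled stream will not flush on a timer). It also catches every `Exception`; real code should narrow that to the errors your client raises.

```python
import time
from typing import Callable, Iterable, Iterator, Sequence

MARKER = "\n—continuing on backup—\n"

def stream_with_failover(
    sources: Sequence[Callable[[], Iterable[str]]],
    buffer_seconds: float = 0.05,
    marker: str = MARKER,
) -> Iterator[str]:
    """Yield chunks from the first source that works, switching on failure."""
    last_error = None
    for open_stream in sources:
        held = []        # chunks received but not yet shown to the user
        flushed = False  # has anything from THIS stream reached the user?
        started = time.monotonic()
        try:
            for chunk in open_stream():
                if flushed:
                    yield chunk
                    continue
                held.append(chunk)
                if time.monotonic() - started >= buffer_seconds:
                    yield from held  # buffer window over: flush
                    held.clear()
                    flushed = True
            yield from held  # stream ended cleanly inside the window
            return
        except Exception as exc:
            last_error = exc
            if flushed:
                # User already saw partial output: say so, then continue.
                yield marker
            # Otherwise nothing was shown: drop the buffer, switch silently.
    if last_error is None:
        raise ValueError("no sources given")
    raise last_error
```

A quick way to try it is to pass two small generator functions, one that raises partway through and one that completes, then compare the output with `buffer_seconds=0` (the marker appears) against a large window (the switch is silent).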
