Let’s say your app has a chat feature, and the API route imports one provider SDK directly. Maybe it calls Claude Sonnet for everything: formatting a name into JSON, summarizing a paragraph, writing a React hook, or reasoning through a hard architecture question. The model does not care that some of those tasks are trivial. Your bill does.
The fix is not simply switching to the cheapest model, because that can drop response quality for tasks that actually need a stronger model. The better fix is to stop picking one model for everything.
In this article, we’ll build a routing layer that classifies what the user is asking for, sends the prompt to the cheapest model that can handle the job, and falls back automatically when a model fails. Formatting can go to Gemini Flash, coding can go to Claude Sonnet, and deep reasoning can hit Claude Opus only when it is actually needed. The exact savings depend on your traffic and token usage, but in the sample run below, routing cut estimated model costs by roughly 80 percent.
We’ll build this with OpenRouter and TanStack AI in a Next.js app. OpenRouter gives us a unified gateway for models from providers like OpenAI, Anthropic, Google, Meta, and Mistral, while TanStack AI handles streaming and chat state. For a broader comparison of TanStack AI’s architecture, see LogRocket’s guide to TanStack AI vs. Vercel AI SDK.
Here is what routing looks like in practice:
- "Format this as JSON" → Gemini 2.0 Flash
- "Write a React hook" → Claude Sonnet 4
- "Reason about P vs. NP" → Claude Opus 4

And here is the cost argument from the test run:
| Scenario | Estimated total for five prompts |
|---|---|
| Everything routed to Claude Opus 4 | ~$0.15 |
| Intelligent routing | ~$0.03 |
| Savings | ~80 percent |
At thousands of requests per day, that difference compounds quickly.
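Four files do most of the work. Everything else is UI:
- lib/models.ts: the model registry (tiers, prices, capabilities, fallbacks)
- lib/router.ts: task classification and model selection
- lib/fallback.ts: the streaming wrapper with model-level failover
- app/api/chat/route.ts: the API route that ties them together

The basic flow is: classify the latest user message, select the cheapest capable model, then stream the response with fallback (classifyTask → selectModel → chatWithFallback).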
Here is the pattern many teams start with:
// app/api/chat/route.ts — the pattern everyone copies
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function POST(request: Request) {
const { messages } = await request.json();
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages,
});
return Response.json(response);
}
This is fine for a prototype, but it becomes painful once the feature grows beyond a demo. If you are building this inside the App Router, LogRocket’s guide to using Next.js Route Handlers is a useful reference for the API route pattern itself.
The hardcoded SDK approach has three problems:
| Problem | Why it matters |
|---|---|
| Vendor lock-in | Every route imports one provider SDK directly. Switching providers means rewriting endpoints instead of changing configuration. |
| No cost optimization | A trivial formatting prompt and a complex reasoning prompt go to the same model at the same price. |
| No fault tolerance | If the provider rate-limits your app or a model fails, the user-facing chat experience fails with it. |
You are not really choosing a model in this setup. You are choosing a fixed integration path.
The fix is an abstraction layer between your Next.js route and the model providers. Instead of importing a provider SDK directly, the route imports a routing function that decides which model to use based on the prompt.
OpenRouter is the gateway in this example. One API key gives you access to hundreds of models through a unified interface. You change a model string rather than rewiring every route.
On the client side, TanStack AI manages streaming and chat state. Its OpenRouter adapter is the direct fit here because it lets us call OpenRouter models through TanStack AI’s chat() API. If you have used the Vercel AI SDK before, the concepts are similar, but TanStack AI uses explicit adapters and connection configuration rather than hiding the transport behind framework-specific conventions. For more on streaming AI responses in a Next.js app, see LogRocket’s guide to real-time AI in Next.js with the Vercel AI SDK.
Create a new Next.js project:
npx create-next-app@latest llm-router \
--typescript --tailwind --eslint --app \
--src-dir=false --import-alias="@/*" \
--turbopack --use-npm
cd llm-router
npm install @tanstack/ai @tanstack/ai-react @tanstack/ai-openrouter zod
Create your .env.local file:
echo "OPENROUTER_API_KEY=sk-or-v1-your-key-here" > .env.local
Then create the directory structure:
mkdir -p lib app/api/chat app/components
The project has three files in lib/ that do the routing work, plus one API route that ties everything together.
First, define the models available to the router. Each model gets a tier, cost metadata, a list of supported task types, and fallbacks to try if the primary model fails:
// lib/models.ts
export type ModelTier = "light" | "standard" | "heavy";
export type TaskType =
| "formatting"
| "summarization"
| "coding"
| "research"
| "reasoning";
export interface ModelConfig {
id: string; // OpenRouter slug
name: string;
tier: ModelTier;
costPer1MInput: number; // USD
costPer1MOutput: number;
capabilities: TaskType[];
fallbacks: string[]; // Slugs to try if this model fails
}
// The key insight: with these example prices, input cost differs by 5x
// (standard to heavy) up to 150x (light to heavy) between tiers.
// A formatting task does not need a $15/1M-token model.
export const MODEL_REGISTRY: ModelConfig[] = [
{
id: "google/gemini-2.0-flash-001",
name: "Gemini 2.0 Flash",
tier: "light",
costPer1MInput: 0.1,
costPer1MOutput: 0.4,
capabilities: ["formatting", "summarization"],
fallbacks: ["google/gemini-2.5-flash"],
},
{
id: "anthropic/claude-sonnet-4",
name: "Claude Sonnet 4",
tier: "standard",
costPer1MInput: 3,
costPer1MOutput: 15,
capabilities: ["coding", "research"],
fallbacks: ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"],
},
{
id: "anthropic/claude-opus-4",
name: "Claude Opus 4",
tier: "heavy",
costPer1MInput: 15,
costPer1MOutput: 75,
capabilities: ["reasoning"],
fallbacks: ["openai/o3", "anthropic/claude-sonnet-4"],
},
];
export function getModelById(id: string): ModelConfig | undefined {
return MODEL_REGISTRY.find((model) => model.id === id);
}
export const TIER_META: Record<
ModelTier,
{ label: string; color: string; icon: string }
> = {
light: { label: "Light", color: "green", icon: "⚡" },
standard: { label: "Standard", color: "amber", icon: "⚙️" },
heavy: { label: "Heavy", color: "red", icon: "🧠" },
};
The important part is not the exact list of models. It is the separation of routing policy from request handling. Adding or replacing a model becomes a config change instead of an API route rewrite.
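Because the registry is plain data, it can be checked before any request is served. The sketch below is a hypothetical helper, `validateRegistry`, which is not part of the article's code. It reports task types with no capable model and fallback slugs with no registry entry of their own, which matters because the fallback wrapper later in this article looks fallback slugs up in MODEL_REGISTRY. The types are trimmed copies of the ones in lib/models.ts so the example runs on its own.

```typescript
// Trimmed copies of the types in lib/models.ts, so the sketch is self-contained.
type TaskType = "formatting" | "summarization" | "coding" | "research" | "reasoning";

interface RegistryEntry {
  id: string;
  capabilities: TaskType[];
  fallbacks: string[];
}

const ALL_TASK_TYPES: TaskType[] = [
  "formatting",
  "summarization",
  "coding",
  "research",
  "reasoning",
];

// Hypothetical helper: returns readable problems instead of throwing,
// so it can run in a unit test or once at startup.
function validateRegistry(registry: RegistryEntry[]): string[] {
  const problems: string[] = [];
  const knownIds = new Set(registry.map((entry) => entry.id));

  // Every task type needs at least one capable model.
  for (const task of ALL_TASK_TYPES) {
    if (!registry.some((entry) => entry.capabilities.includes(task))) {
      problems.push(`No model handles task type "${task}"`);
    }
  }

  // Report fallback slugs that have no registry entry of their own.
  for (const entry of registry) {
    for (const fallbackId of entry.fallbacks) {
      if (!knownIds.has(fallbackId)) {
        problems.push(`${entry.id}: fallback "${fallbackId}" is not in the registry`);
      }
    }
  }
  return problems;
}

// The slugs and fallbacks from the registry above.
const registryProblems = validateRegistry([
  {
    id: "google/gemini-2.0-flash-001",
    capabilities: ["formatting", "summarization"],
    fallbacks: ["google/gemini-2.5-flash"],
  },
  {
    id: "anthropic/claude-sonnet-4",
    capabilities: ["coding", "research"],
    fallbacks: ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"],
  },
  {
    id: "anthropic/claude-opus-4",
    capabilities: ["reasoning"],
    fallbacks: ["openai/o3", "anthropic/claude-sonnet-4"],
  },
]);
console.log(registryProblems);
```

Run against the example registry, the check reports four fallback slugs without an entry. Whether that is a problem depends on how your fallback code resolves slugs.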
A quick pricing note: model prices change, so treat these numbers as example values and verify them against OpenRouter before publishing or deploying. You can also track actual cost from OpenRouter’s response metadata instead of estimating locally.
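For local estimates, the arithmetic is just tokens divided by one million, multiplied by the per-million price, summed over input and output. The sketch below uses a hypothetical `estimateCostUsd` helper and made-up token counts (500 input, 300 output). The prices are the example values from the registry, not verified current prices.

```typescript
interface ModelPricing {
  costPer1MInput: number; // USD per one million input tokens
  costPer1MOutput: number; // USD per one million output tokens
}

// Hypothetical helper: a local estimate only. Actual billing comes from the provider.
function estimateCostUsd(
  pricing: ModelPricing,
  inputTokens: number,
  outputTokens: number,
): number {
  const inputCost = (inputTokens / 1_000_000) * pricing.costPer1MInput;
  const outputCost = (outputTokens / 1_000_000) * pricing.costPer1MOutput;
  return inputCost + outputCost;
}

// Example prices copied from the registry above.
const flashPricing: ModelPricing = { costPer1MInput: 0.1, costPer1MOutput: 0.4 };
const opusPricing: ModelPricing = { costPer1MInput: 15, costPer1MOutput: 75 };

// Assumed request size for illustration: 500 input tokens, 300 output tokens.
console.log(estimateCostUsd(flashPricing, 500, 300).toFixed(5)); // about $0.00017
console.log(estimateCostUsd(opusPricing, 500, 300).toFixed(5)); // about $0.03
```

Output tokens usually dominate the estimate because the output price is several times the input price in every tier of the example registry.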
Next, build a function that looks at the latest user message and decides what kind of task it is. This first version is intentionally keyword-based. It does not call another LLM just to classify the LLM call.
// lib/router.ts
import { MODEL_REGISTRY, type ModelConfig, type TaskType } from "./models";
const TASK_KEYWORDS: Record<TaskType, { words: string[]; weight: number }> = {
formatting: {
words: [
"format",
"json",
"convert",
"list",
"table",
"csv",
"xml",
"markdown",
"restructure",
"transform",
"parse",
],
weight: 1.0,
},
summarization: {
words: [
"summarize",
"summary",
"tldr",
"brief",
"shorten",
"condense",
"key points",
"overview",
"recap",
"digest",
],
weight: 1.0,
},
coding: {
words: [
"code",
"function",
"component",
"hook",
"debug",
"fix",
"refactor",
"implement",
"typescript",
"react",
"api",
"endpoint",
"bug",
"error",
"script",
],
weight: 1.1,
},
research: {
words: [
"research",
"compare",
"analyze",
"differences",
"versus",
"pros and cons",
"evaluate",
"review",
"alternatives",
"landscape",
"ecosystem",
"benchmark",
],
weight: 1.3,
},
reasoning: {
words: [
"reason",
"explain why",
"implications",
"prove",
"think through",
"step by step",
"logic",
"philosophical",
"theoretical",
"what would happen if",
"argue",
],
weight: 1.5,
},
};
export function classifyTask(message: string): TaskType {
const lower = message.toLowerCase();
let bestTask: TaskType = "coding"; // Default to the standard tier
let bestScore = 0;
for (const [task, config] of Object.entries(TASK_KEYWORDS)) {
const matches = config.words.filter((word) => lower.includes(word)).length;
const score = matches * config.weight;
if (score > bestScore) {
bestScore = score;
bestTask = task as TaskType;
}
}
return bestTask;
}
Keyword routing is not perfect, but it is cheap, transparent, and debuggable. You can log exactly why a prompt went to a model, which is useful when tuning the router.
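To make that logging concrete, here is a minimal sketch of a debugging helper that uses the same substring matching and weighting as classifyTask but returns every non-zero score instead of only the winner. The name `explainClassification` and the trimmed `SAMPLE_KEYWORDS` table are assumptions for illustration. In the real router you would read from TASK_KEYWORDS.

```typescript
type TaskType = "formatting" | "summarization" | "coding" | "research" | "reasoning";

// Trimmed keyword table for the example. The full lists live in TASK_KEYWORDS.
const SAMPLE_KEYWORDS: Record<TaskType, { words: string[]; weight: number }> = {
  formatting: { words: ["format", "json", "convert"], weight: 1.0 },
  summarization: { words: ["summarize", "summary", "tldr"], weight: 1.0 },
  coding: { words: ["hook", "react", "api", "component"], weight: 1.1 },
  research: { words: ["compare", "versus", "evaluate"], weight: 1.3 },
  reasoning: { words: ["prove", "step by step", "logic"], weight: 1.5 },
};

interface TaskScore {
  task: TaskType;
  matched: string[];
  score: number;
}

// Hypothetical helper: returns all matching tasks, highest score first.
function explainClassification(message: string): TaskScore[] {
  const lower = message.toLowerCase();
  const scores: TaskScore[] = [];
  for (const task of Object.keys(SAMPLE_KEYWORDS) as TaskType[]) {
    const { words, weight } = SAMPLE_KEYWORDS[task];
    const matched = words.filter((word) => lower.includes(word));
    if (matched.length > 0) {
      scores.push({ task, matched, score: matched.length * weight });
    }
  }
  return scores.sort((a, b) => b.score - a.score);
}

// Three coding keywords match: "hook", "react", and "api".
console.log(explainClassification("Write a React hook that debounces API calls"));

// Substring matching also finds "api" inside "rapid".
console.log(explainClassification("Give me a rapid overview"));
```

The second call shows one limitation worth knowing when tuning: `includes()` matches substrings, so short keywords such as "api" or "fix" can fire inside unrelated words. Matching on word boundaries is one possible refinement.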
Then map the task to the cheapest capable model:
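// lib/router.ts (continued)
export function selectModel(taskType: TaskType): ModelConfig {
// filter() returns a new array, so sorting it does not reorder MODEL_REGISTRY.
const capable = MODEL_REGISTRY.filter((model) =>
model.capabilities.includes(taskType),
);
if (capable.length === 0) {
throw new Error(`No model is registered for task type "${taskType}"`);
}
capable.sort((a, b) => a.costPer1MInput - b.costPer1MInput);
return capable[0];
}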
This is the core of the cost optimization. You are not guessing at request time. You are encoding a policy: use the cheapest model that can handle the task.
Now replace the hardcoded SDK call with a route that classifies, selects, and streams:
// app/api/chat/route.ts
import { toServerSentEventsResponse } from "@tanstack/ai";
import { classifyTask, selectModel } from "@/lib/router";
import { chatWithFallback } from "@/lib/fallback";
export async function POST(request: Request) {
const { messages } = await request.json();
// 1. Extract text from the latest user message.
// TanStack AI messages may arrive as { role, parts: [...] }, while raw
// requests may use { role, content }. Handle both shapes.
const lastMessage = messages[messages.length - 1];
const content =
typeof lastMessage.content === "string"
? lastMessage.content
: Array.isArray(lastMessage.parts)
? lastMessage.parts
.filter((part: any) => part.type === "text")
.map((part: any) => part.content)
.join(" ")
: "";
// 2. Classify the task
const taskType = classifyTask(content);
// 3. Select the cheapest capable model
const model = selectModel(taskType);
// 4. Stream with automatic failover
const { stream, model: actualModel } = await chatWithFallback(messages, model);
// 5. Return SSE with routing metadata in the headers
const response = toServerSentEventsResponse(stream);
response.headers.set("X-Model-Used", actualModel.name);
response.headers.set("X-Task-Type", taskType);
response.headers.set("X-Model-Tier", actualModel.tier);
response.headers.set(
"X-Cost-Per-1M",
`$${actualModel.costPer1MInput}/$${actualModel.costPer1MOutput}`,
);
return response;
}
The useful detail is that chatWithFallback returns the model that actually served the response. If the primary model fails and a fallback catches the request, the headers show the real model used, not just the model the router intended to use.
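On the client, those headers can be read as soon as the response starts, before the streamed body finishes. The sketch below assumes a hypothetical helper named `readRoutingMetadata`. It accepts anything with a `headers.get()` method, so a real fetch `Response` works and the example can run with a small stand-in object.

```typescript
// Minimal structural type, so the sketch does not depend on a specific fetch implementation.
interface HasHeaders {
  headers: { get(headerName: string): string | null };
}

interface RoutingMetadata {
  modelUsed: string | null;
  taskType: string | null;
  modelTier: string | null;
  costPer1M: string | null;
}

// Hypothetical client-side helper: collects the routing headers set by the route above.
function readRoutingMetadata(response: HasHeaders): RoutingMetadata {
  return {
    modelUsed: response.headers.get("X-Model-Used"),
    taskType: response.headers.get("X-Task-Type"),
    modelTier: response.headers.get("X-Model-Tier"),
    costPer1M: response.headers.get("X-Cost-Per-1M"),
  };
}

// Stand-in for a real Response: a Map with lowercase keys as the header store.
const fakeHeaders = new Map<string, string>([
  ["x-model-used", "Claude Sonnet 4"],
  ["x-task-type", "coding"],
  ["x-model-tier", "standard"],
  ["x-cost-per-1m", "$3/$15"],
]);
const fakeResponse: HasHeaders = {
  headers: {
    get: (headerName) => fakeHeaders.get(headerName.toLowerCase()) ?? null,
  },
};

console.log(readRoutingMetadata(fakeResponse));
```

A missing header comes back as `null`, so the UI can hide a model badge instead of showing a wrong one.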
The as any cast you will see in the fallback wrapper is a practical TypeScript issue. TanStack AI’s OpenRouter adapter uses generated model types, and newly released or renamed model slugs can appear before those types catch up.
Failover happens at two layers, and the distinction matters.
Layer 1: Provider-level routing. OpenRouter can route a selected model through different providers and sort provider options by price, latency, or throughput. In this example, we opt into provider fallback behavior through modelOptions:
modelOptions: {
provider: {
allow_fallbacks: true,
sort: "price",
},
}
This gives you provider-level redundancy without changing your application code.
Layer 2: Model-level fallback. Provider fallback does not fully replace application-level fallback. If the selected model is rate-limited, unavailable, or returns an error, you may still want to retry a different model from your own registry. OpenRouter also supports gateway-side model fallbacks through a models array, but an application-level wrapper gives you more control over routing metadata, retries, and per-tier behavior.
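For comparison, here is a rough sketch of what the gateway-side alternative could look like as a raw request body for OpenRouter's OpenAI-compatible chat completions endpoint. The `models` array and the `provider` object are based on OpenRouter's documented request parameters as I understand them, and `buildGatewayFallbackBody` is a hypothetical helper. Treat the exact shape as an assumption and check it against the current API reference before relying on it.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: builds a JSON body that asks the gateway to try models in order.
// Assumption: OpenRouter accepts a `models` array for model-level fallback and a
// `provider` object for provider preferences.
function buildGatewayFallbackBody(
  primaryId: string,
  fallbackIds: string[],
  messages: ChatMessage[],
) {
  return {
    models: [primaryId, ...fallbackIds], // primary first, then fallbacks
    messages,
    provider: { allow_fallbacks: true, sort: "price" as const },
  };
}

const requestBody = buildGatewayFallbackBody(
  "anthropic/claude-opus-4",
  ["openai/o3", "anthropic/claude-sonnet-4"],
  [{ role: "user", content: "Reason step by step about why P = NP is unlikely" }],
);

console.log(JSON.stringify(requestBody, null, 2));
```

With this approach the gateway decides when to move on, so your application has to learn which model served the request from the response rather than from its own fallback loop.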
Here is the full wrapper with a timeout and stream-error handling:
// lib/fallback.ts
import { chat } from "@tanstack/ai";
import { openRouterText } from "@tanstack/ai-openrouter";
import { MODEL_REGISTRY, type ModelConfig } from "./models";
const TIMEOUT_MS = 15_000;
async function tryModel(modelId: string, messages: any[]) {
const stream = chat({
adapter: openRouterText(modelId as any),
messages,
modelOptions: {
provider: { allow_fallbacks: true, sort: "price" },
},
});
let timer: ReturnType<typeof setTimeout> | undefined;
const timeoutPromise = new Promise<never>((_, reject) => {
timer = setTimeout(
() => reject(new Error(`Timeout after ${TIMEOUT_MS}ms`)),
TIMEOUT_MS,
);
});
// Consume the first chunk to verify the stream is healthy.
// Some provider errors can arrive inside the SSE stream rather than as
// immediate HTTP errors, so a try/catch around chat() is not enough.
const reader = stream[Symbol.asyncIterator]();
// Clear the timer once the race settles so it does not stay pending.
const firstChunk = await Promise.race([reader.next(), timeoutPromise]).finally(
() => clearTimeout(timer),
);
if (firstChunk.done) {
throw new Error("Stream ended immediately, likely due to an error response");
}
async function* replayStream() {
yield firstChunk.value;
for await (const chunk of { [Symbol.asyncIterator]: () => reader }) {
yield chunk;
}
}
return replayStream();
}
export async function chatWithFallback(
messages: any[],
primaryModel: ModelConfig,
) {
try {
const stream = await tryModel(primaryModel.id, messages);
return { stream, model: primaryModel };
} catch (error) {
console.warn(
`Primary model ${primaryModel.id} failed: ${error}. Trying fallbacks.`,
);
}
for (const fallbackId of primaryModel.fallbacks) {
const fallbackModel = MODEL_REGISTRY.find(
(model) => model.id === fallbackId,
);
if (!fallbackModel) {
// Only registered models can be tried, because the route needs their metadata.
console.warn(`Fallback ${fallbackId} is not in MODEL_REGISTRY. Skipping.`);
continue;
}
try {
const stream = await tryModel(fallbackId, messages);
return { stream, model: fallbackModel };
} catch (error) {
console.warn(`Fallback ${fallbackId} failed: ${error}`);
}
}
throw new Error("All models failed; no fallback is available");
}
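The key detail is that tryModel() reads the first stream chunk before returning. Without that check, your route can successfully create a stream, hand it to the client, and only then surface a provider error instead of tokens. The timeout applies to the wait for that first chunk, so one slow model cannot block the entire fallback chain. One caveat: chatWithFallback skips any fallback slug that has no entry of its own in MODEL_REGISTRY. With the registry shown earlier, only Claude Opus 4 has a fallback that will actually be tried (Claude Sonnet 4), so add registry entries for the other slugs if you want them used.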
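The peek-then-replay step is easier to reason about in isolation. The following is a minimal sketch with no SDK dependency: `peekFirstChunk` is a hypothetical generic version of the same idea, and the two async generators are stand-ins for a healthy stream and one that ends immediately.

```typescript
// Hypothetical generic helper: waits for the first chunk (or a timeout),
// then returns a stream that replays that chunk followed by the rest.
async function peekFirstChunk<T>(
  source: AsyncIterable<T>,
  timeoutMs: number,
): Promise<AsyncGenerator<T>> {
  const iterator = source[Symbol.asyncIterator]();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs);
  });

  // Clear the timer whichever side of the race settles first.
  const first = await Promise.race([iterator.next(), timeout]).finally(() =>
    clearTimeout(timer),
  );
  if (first.done) {
    throw new Error("Stream ended before the first chunk");
  }
  const firstValue = first.value;

  async function* replay(): AsyncGenerator<T> {
    try {
      yield firstValue;
      while (true) {
        const next = await iterator.next();
        if (next.done) return;
        yield next.value;
      }
    } finally {
      // Let the source clean up if the consumer stops early.
      await iterator.return?.();
    }
  }
  return replay();
}

// Stand-in streams for the demo.
async function* healthyStream(): AsyncGenerator<string> {
  yield "Hello";
  yield " world";
}
async function* emptyStream(): AsyncGenerator<string> {}

async function runPeekDemo(): Promise<string[]> {
  const results: string[] = [];
  let text = "";
  for await (const chunk of await peekFirstChunk(healthyStream(), 1000)) {
    text += chunk;
  }
  results.push(text);
  try {
    await peekFirstChunk(emptyStream(), 1000);
  } catch (error) {
    results.push((error as Error).message);
  }
  return results;
}

runPeekDemo().then((results) => console.log(results));
```

The healthy stream is delivered in full, including the chunk that was consumed during the check, while the empty stream is rejected before anything reaches the client.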
For production, you would usually add structured logging around each attempt: requested model, actual model, task type, latency, and final cost. LogRocket has also covered environment-aware model routing as a broader pattern for choosing models based on runtime context.
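One way to collect those fields is to wrap each attempt in a small timing helper. The sketch below is an assumption about how that could look, not the article's implementation: `withAttemptLog` and the `AttemptLog` shape are hypothetical names, and cost is left out because it depends on token usage reported by the provider.

```typescript
interface AttemptLog {
  model: string; // the model tried in this attempt
  taskType: string;
  latencyMs: number;
  ok: boolean;
  errorMessage?: string;
}

// Hypothetical helper: runs one attempt, measures it, and never throws.
// The caller decides whether to move on to the next fallback.
async function withAttemptLog<T>(
  model: string,
  taskType: string,
  attempt: () => Promise<T>,
): Promise<{ result: T | null; log: AttemptLog }> {
  const startedAt = Date.now();
  try {
    const result = await attempt();
    return {
      result,
      log: { model, taskType, latencyMs: Date.now() - startedAt, ok: true },
    };
  } catch (error) {
    return {
      result: null,
      log: {
        model,
        taskType,
        latencyMs: Date.now() - startedAt,
        ok: false,
        errorMessage: String(error),
      },
    };
  }
}

// Demo with a stubbed attempt that fails the way a rate-limited call might.
withAttemptLog("anthropic/claude-opus-4", "reasoning", async () => {
  throw new Error("429 rate limited");
}).then(({ log }) => console.log(JSON.stringify(log)));
```

Emitting one record per attempt, rather than one per request, keeps the requested model and the model that finally served the response both visible in the logs.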
When you run npm run dev and send prompts across different complexity levels, the routing becomes visible through model badges in the UI:
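| Prompt | Task | Model | Tier | Input cost per 1M tokens |
|---|---|---|---|---|
| "Convert this to JSON: name John, age 30" | formatting | Gemini 2.0 Flash | light | $0.10 |
| "Summarize this in two sentences" | summarization | Gemini 2.0 Flash | light | $0.10 |
| "Write a React hook that debounces API calls" | coding | Claude Sonnet 4 | standard | $3.00 |
| "Compare React Server Components vs. Astro Islands" | research | Claude Sonnet 4 | standard | $3.00 |
| "Reason step by step about why P = NP is unlikely" | reasoning | Claude Opus 4 | heavy | $15.00 |

One caveat for the fourth row: with the keyword lists shown earlier, that prompt matches two coding keywords ("react" and "component", 2 × 1.1 = 2.2) and one research keyword ("compare", 1 × 1.3 = 1.3), so the keyword classifier labels it coding rather than research. Both task types map to Claude Sonnet 4 in this registry, so the model, tier, and cost are unchanged.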
Five prompts. If all five hit the heavy model, you are looking at about $0.15 in estimated token costs. With routing, the same five prompts cost about $0.03. That is roughly 80 percent savings in this small test, and the gap can widen as you scale because many production prompts are simple tasks that do not need frontier-model reasoning.
At this point, the router is working. The API route is no longer tied to one provider SDK, and the model decision is visible enough to debug.
This bug cost us time. When we first tested the demo, every prompt routed to Claude Sonnet 4: "format this as JSON," "reason step by step about P vs. NP," everything. The classifier was not broken. The model registry was fine. The bug was in the API route:
// ❌ This reads undefined for TanStack AI messages
const taskType = classifyTask(lastMessage.content);
TanStack AI’s useChat hook can send messages as { role, parts: [{ type: "text", content: "..." }] }, not only as { role, content: "..." }. For the parts shape, lastMessage.content is undefined. If the call site turns that into an empty string (for example with a ?? "" default), no keywords match and the classifier falls through to the default "coding" task, which maps to the standard tier. Passed through unchanged, undefined would instead make classifyTask throw on message.toLowerCase().
That means your routing layer exists, but every prompt still goes to the same model.
The fix is to extract text from both possible shapes:
// ✅ Handle both formats: parts from TanStack AI and content from raw requests
const content =
typeof lastMessage.content === "string"
? lastMessage.content
: Array.isArray(lastMessage.parts)
? lastMessage.parts
.filter((part: any) => part.type === "text")
.map((part: any) => part.content)
.join(" ")
: "";
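The same extraction can be pulled into a small typed helper so the route does not need `any`. This is a sketch under the assumption that text parts look like { type: "text", content: string }, as described above. `extractText` and `isTextPart` are hypothetical names.

```typescript
interface TextPart {
  type: "text";
  content: string;
}

// Loose input type: the route cannot trust the shape of incoming JSON.
interface IncomingMessage {
  role: string;
  content?: unknown;
  parts?: unknown;
}

// Type guard for the assumed text-part shape.
function isTextPart(part: unknown): part is TextPart {
  return (
    typeof part === "object" &&
    part !== null &&
    (part as { type?: unknown }).type === "text" &&
    typeof (part as { content?: unknown }).content === "string"
  );
}

// Hypothetical helper: returns the message text for either shape, or "" if there is none.
function extractText(message: IncomingMessage | undefined): string {
  if (!message) return "";
  if (typeof message.content === "string") return message.content;
  if (Array.isArray(message.parts)) {
    return message.parts
      .filter(isTextPart)
      .map((part) => part.content)
      .join(" ");
  }
  return "";
}

// Raw request shape.
console.log(extractText({ role: "user", content: "Format this as JSON" }));

// Parts shape, with a non-text part that should be ignored.
console.log(
  extractText({
    role: "user",
    parts: [
      { type: "text", content: "Write a React hook" },
      { type: "tool-call" },
      { type: "text", content: "that debounces API calls" },
    ],
  }),
);
```

Accepting `undefined` also covers the case where the request arrives with an empty messages array.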
If your AI UI streams responses into React state, also make sure the client owns the stream correctly. LogRocket’s article on why useEffect breaks AI streaming responses in React covers a common class of UI-side streaming bugs.
To use this pattern in a real project, replace your hardcoded model call with this flow:
classifyTask → selectModel → chatWithFallback
For a prototype, keyword classification is enough to prove the architecture. For production, you may want an LLM-based classifier instead. It costs a tiny amount per request and adds latency, but it handles ambiguous prompts much better than keywords.
Keyword classifiers are fast and free, but they break on ambiguous prompts. They match patterns, not intent.
| Prompt | Keyword result | Better result |
|---|---|---|
| "What are the tradeoffs between microservices and monoliths?" | coding, because no keyword matched | research |
| "Help me think about how to structure my database" | coding | coding |
| "Explain the difference between TCP and UDP" | coding, because no research keyword matched | research |
| "Can you make this paragraph shorter and punchier?" | coding, because no summarization keyword matched | summarization |
| "What happens to entropy in a closed system over time?" | coding, because no keyword matched | reasoning |
The keyword classifier only works when users phrase things the way you expect. Real users do not.
A simple upgrade is to use your cheapest capable model as the classifier before routing to the real model. In this example, Gemini 2.0 Flash reads the prompt and returns one word: formatting, summarization, coding, research, or reasoning.
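The flow gains one step: the classifier model labels the prompt first, and selectModel and chatWithFallback then run exactly as before.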
The trade-off is straightforward:
| Factor | Impact |
|---|---|
| Cost | Usually negligible because the classifier response is one short token sequence |
| Latency | Adds one extra model call before the real response starts |
| Accuracy | Handles synonyms, rephrased questions, and ambiguous prompts better than keyword matching |
| Reliability | Needs a fallback path in case the classifier fails |
You are now making two LLM calls per request: one cheap classification call and one actual response call. The classification cost is only worth it if the routing savings and quality improvements outweigh the added latency.
You need two changes: a new classifyTaskWithLLM() function in lib/router.ts, and a one-line swap in the API route.
Gemini 2.0 Flash receives the user prompt with a system prompt that forces a single-word classification response. The token cap and temperature settings keep the response short and deterministic. If anything goes wrong, the function silently falls back to keyword matching.
// lib/router.ts
import { chat } from "@tanstack/ai";
import { openRouterText } from "@tanstack/ai-openrouter";
export const CLASSIFIER_MODE: "keyword" | "llm" = "llm";
const VALID_TASK_TYPES: TaskType[] = [
"formatting",
"summarization",
"coding",
"research",
"reasoning",
];
export async function classifyTaskWithLLM(
message: string,
): Promise<TaskType> {
try {
const adapter = openRouterText("google/gemini-2.0-flash-001" as any);
const result = await chat({
adapter,
messages: [{ role: "user" as const, content: message }],
systemPrompts: [
`You are a task classifier. Classify the user's message into exactly one category.
Respond with ONLY one of these words, nothing else:
- formatting (converting data formats, JSON, CSV, restructuring text)
- summarization (condensing text, TLDR, key points, briefs)
- coding (writing code, debugging, building components, technical implementation)
- research (comparing options, analyzing topics, evaluating alternatives)
- reasoning (logical arguments, proofs, philosophical questions, step-by-step thinking, implications)
Examples:
"convert this to JSON" → formatting
"summarize this article" → summarization
"write a React hook" → coding
"compare Next.js vs. Remix" → research
"why is P unlikely to equal NP" → reasoning
Respond with exactly one word.`,
],
stream: false,
modelOptions: {
maxCompletionTokens: 10,
temperature: 0,
},
});
const classification = String(result).trim().toLowerCase() as TaskType;
if (VALID_TASK_TYPES.includes(classification)) {
return classification;
}
return classifyTask(message);
} catch {
return classifyTask(message);
}
}
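The validation at the end of classifyTaskWithLLM only accepts an exact one-word answer. Models sometimes add punctuation or a short phrase, so a slightly more tolerant parser can reduce unnecessary keyword fallbacks. The sketch below is one possible approach, with `parseClassification` as a hypothetical helper: it accepts the output only when exactly one distinct label appears in it.

```typescript
type TaskType = "formatting" | "summarization" | "coding" | "research" | "reasoning";

const KNOWN_TASK_TYPES: readonly TaskType[] = [
  "formatting",
  "summarization",
  "coding",
  "research",
  "reasoning",
];

// Hypothetical helper: returns the label if the raw output names exactly one
// task type, or null so the caller can fall back to keyword matching.
function parseClassification(raw: string): TaskType | null {
  const words = raw.toLowerCase().match(/[a-z]+/g) ?? [];
  let label: TaskType | null = null;
  for (const word of words) {
    const match = KNOWN_TASK_TYPES.find((task) => task === word);
    if (!match) continue;
    if (label !== null && label !== match) {
      return null; // two different labels: treat as ambiguous
    }
    label = match;
  }
  return label;
}

console.log(parseClassification("Reasoning.")); // "reasoning"
console.log(parseClassification("  coding\n")); // "coding"
console.log(parseClassification("coding or research")); // null
console.log(parseClassification("I am not sure")); // null
```

In classifyTaskWithLLM, a `null` result would take the same path as an invalid answer and fall back to classifyTask.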
There are a few design decisions here:
- stream: false: The router needs the full classification before choosing the response model.
- temperature: 0: Classification should be deterministic, not creative.
- maxCompletionTokens: 10: The response should be one word, so cap the output.

Add one wrapper function that chooses the classifier based on the current mode:
export async function classifyTaskSmart(message: string): Promise<TaskType> {
if (CLASSIFIER_MODE === "llm") {
return classifyTaskWithLLM(message);
}
return classifyTask(message);
}
Then update the API route to call the smart classifier:
- import { classifyTask, selectModel } from "@/lib/router";
+ import { classifyTaskSmart, selectModel } from "@/lib/router";
- const taskType = classifyTask(content);
+ const taskType = await classifyTaskSmart(content);
The rest of the routing pipeline stays the same: model selection, fallback, streaming, and response metadata do not need to change.
With LLM classification, the flow looks like this:
User: "What are the tradeoffs between microservices and monoliths?"
│
├─→ Gemini Flash classifier: "research"
│
├─→ selectModel("research"): Claude Sonnet 4
│
└─→ chatWithFallback(): streamed response
The classifier adds one extra step, but it prevents the router from treating every ambiguous prompt as coding.
In lib/router.ts, change the classifier mode:
// Use "keyword" for the fastest path or "llm" for better intent detection
export const CLASSIFIER_MODE: "keyword" | "llm" = "llm";
For debugging, it is useful to test both modes against the same prompt.
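A small comparison harness makes that side-by-side check repeatable. The sketch below takes the two classifiers as parameters, so the real classifyTask and classifyTaskWithLLM can be passed in. The stubs used here are placeholders for illustration, and `compareClassifiers` is a hypothetical helper.

```typescript
type TaskType = "formatting" | "summarization" | "coding" | "research" | "reasoning";
type Classifier = (message: string) => TaskType | Promise<TaskType>;

interface Disagreement {
  promptText: string;
  keywordResult: TaskType;
  llmResult: TaskType;
}

// Hypothetical helper: runs both classifiers on each prompt and keeps the mismatches.
async function compareClassifiers(
  prompts: string[],
  keywordClassifier: Classifier,
  llmClassifier: Classifier,
): Promise<Disagreement[]> {
  const disagreements: Disagreement[] = [];
  for (const promptText of prompts) {
    const [keywordResult, llmResult] = await Promise.all([
      keywordClassifier(promptText),
      llmClassifier(promptText),
    ]);
    if (keywordResult !== llmResult) {
      disagreements.push({ promptText, keywordResult, llmResult });
    }
  }
  return disagreements;
}

// Stub classifiers standing in for classifyTask and classifyTaskWithLLM.
const stubKeyword: Classifier = () => "coding";
const stubLlm: Classifier = async (message) =>
  message.includes("handmade") ? "reasoning" : "coding";

compareClassifiers(
  [
    "Write a React hook that debounces API calls",
    "What happens when you mass produce something that was meant to be handmade?",
  ],
  stubKeyword,
  stubLlm,
).then((disagreements) => console.log(disagreements));
```

Keeping a list of prompts where the two modes disagree gives you a regression set to rerun whenever you change the keyword lists or the classifier prompt.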
You can clone the demo repo and try prompts that expose the difference between keyword and LLM classification.
First, switch back to keyword mode:
export const CLASSIFIER_MODE: "keyword" | "llm" = "keyword";
Then send this prompt:
What happens when you mass produce something that was meant to be handmade?

The expected task is closer to reasoning, but because there is no keyword match, the classifier defaults to coding, which uses Claude Sonnet in the standard tier.
Now switch to LLM mode:
export const CLASSIFIER_MODE: "keyword" | "llm" = "llm";
Send the same prompt again:

The LLM classifier recognizes the prompt as reasoning, so the router selects Claude Opus in the heavy tier.
You do not need every prompt to go to your most expensive model. You need a routing layer that understands the shape of the request, picks the cheapest capable model, and falls back when something fails.
OpenRouter makes the provider side easier by giving you one gateway across many models. TanStack AI makes the application side easier by keeping streaming, adapters, and chat state explicit. The router in this article sits between them: it classifies the prompt, selects the model, and keeps the user experience moving even when a provider or model has a bad day.
Start with keyword classification if you want something transparent and free. Add an LLM classifier when ambiguous prompts start routing to the wrong tier. Either way, the biggest architectural win is the same: stop hardcoding a single LLM SDK into every route, and treat model choice as a policy you can change.