4-Tier AI Model Routing: Real Cost Data After 3 Months

Part of: aria-progress

#aria #ai #cost #llm #devtools

For the first two months of ARIA, I used Claude Sonnet for everything. It worked great. It was also dumb.

Not dumb in the “bad results” sense — Sonnet is excellent. Dumb in the “using a table saw to cut bread” sense. When ARIA writes a git commit message, it doesn’t need a frontier model reasoning through multi-step problems. It needs to read a diff and produce a one-liner. A far cheaper model — or a free local one — does that just fine.

I started tracking costs in November. By January I had enough data to build a routing system. Here’s what I found.

The Problem With One Model for Everything

Before routing, my average monthly Claude API spend was around $28–32. That sounds low in absolute terms, but consider what it was buying:

  • ~40% of spend on commit messages, PR descriptions, and summaries
  • ~25% on morning/evening briefings
  • ~20% on actual coding tasks
  • ~15% on architecture discussions and complex debugging

The first two categories — commit messages and briefings — are high-volume, low-complexity tasks. They run daily, sometimes multiple times a day. They need good formatting and coherent output, but they don’t need deep reasoning. I was spending Sonnet-level money on tasks that a much cheaper model handles equally well.
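Put in dollars, the breakdown looks like this — a quick sketch using $30 as the midpoint of the $28–32 range above; nothing here is assumed beyond that rounding:

```typescript
// Monthly spend split by the percentages quoted above, at ~$30/month.
const monthlySpend = 30;
const dollars = {
  commitsAndSummaries: monthlySpend * 0.4, // ≈ $12/month — cheapest work to deflect
  briefings: monthlySpend * 0.25,          // ≈ $7.50/month
  coding: monthlySpend * 0.2,              // ≈ $6/month
  architecture: monthlySpend * 0.15,       // ≈ $4.50/month
};
console.log(dollars);
```

Roughly $12 a month — 40% of the bill — was going to tasks a free local model can handle.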

The 4-Tier System

I settled on four tiers after experimenting with three (no local model initially):

| Tier | Model | Cost per 1M tokens (in/out) | Used for |
|------|-------|-----------------------------|----------|
| 0 | Ollama llama3.2:3b (local) | Free | Commits, summaries, translations, simple explanations |
| 1 | Claude Haiku 3.5 | $0.80 / $4.00 | Quick lookups, yes/no classification, short Q&A |
| 2 | Claude Sonnet (default) | $3.00 / $15.00 | Coding, features, bug fixes, briefings |
| 3 | Claude Opus | $15.00 / $75.00 | Architecture, security audits, complex refactors |

Tier 0 is local, runs on my machine, costs nothing per token. The constraint is quality and context window (llama3.2:3b has an 8k context). Tier 3 rarely runs — I invoke it explicitly for specific sessions, never automatically.

Routing Rules

The routing lives in two places: the aria_route_query MCP tool (for ARIA-initiated tasks) and a model-suggest pre-tool hook in Claude Code (for interactive sessions).

Tier 0 — Ollama local:

  • Git commit messages (reads diff, writes message)
  • PR descriptions (reads commits, writes summary)
  • Summarizing a file or a set of logs
  • Translating short text (PT ↔ EN)
  • Explaining what a command does
  • Categorizing a transaction description in Neutron

The test: “Could a smart intern do this with 30 seconds of reading?” If yes, Tier 0.

Tier 1 — Haiku:

  • “Is this a security issue or just a warning?”
  • Quick regex generation
  • Classifying a WhatsApp message (task request? question? spam?)
  • Checking if a string matches a pattern
  • Generating a short template from a known format

The test: “Does this need actual reasoning, or just pattern matching?” Haiku is surprisingly good at pattern matching on short inputs.

Tier 2 — Sonnet (default):

Everything that doesn’t fit Tier 0 or 1 and isn’t in Tier 3. This is the workhorse. Feature implementation, debugging, refactoring, writing tests, the morning briefing (which requires synthesizing 8 tool call results into coherent output), code review.

Tier 3 — Opus:

  • System architecture decisions where I need to think through trade-offs carefully
  • Security audits (reviewing auth flows, checking for injection vulnerabilities)
  • Complex refactors that touch many files and need to reason about state across the whole codebase
  • Anything where I’ve already tried Sonnet and it’s getting confused

I invoke Opus explicitly, never automatically. It’s a deliberate choice, like calling a senior consultant instead of your regular dev.

The aria_route_query Tool

For ARIA’s own tasks (not interactive Claude Code sessions), routing is handled by the aria_route_query MCP tool. It classifies a task before deciding which model to use:

import { z } from "zod"; // zod provides the input schema below

// Assumes `server` is the MCP server instance, and that TIER_MODELS /
// TIER_REASONING are tier-indexed lookup tables (definitions not shown).
server.tool(
  "aria_route_query",
  "Classify a task and suggest the appropriate model tier (0-3). Use before delegating work to determine cost-efficient routing.",
  {
    task: z.string().describe("Description of the task to classify"),
    context_size: z.number().optional().describe("Approximate token count of context"),
  },
  async ({ task, context_size }) => {
    const tier = classifyTask(task, context_size ?? 0);
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          tier,
          model: TIER_MODELS[tier],
          reasoning: TIER_REASONING[tier],
        })
      }]
    };
  }
);

function classifyTask(task: string, contextSize: number): 0 | 1 | 2 | 3 {
  const lower = task.toLowerCase();

  // Tier 0: free local
  if (contextSize < 4000 && (
    lower.includes("commit") ||
    lower.includes("summarize") ||
    lower.includes("translate") ||
    lower.includes("describe this") ||
    lower.includes("explain what")
  )) return 0;

  // Tier 1: cheap classification
  if (contextSize < 2000 && (
    lower.includes("classify") ||
    lower.includes("is this") ||
    lower.includes("categorize") ||
    lower.includes("yes or no")
  )) return 1;

  // Tier 3: expensive
  if (
    lower.includes("architecture") ||
    lower.includes("security audit") ||
    lower.includes("refactor entire") ||
    lower.includes("design system")
  ) return 3;

  // Default: Sonnet
  return 2;
}

This runs before ARIA delegates any sub-task to a model. The classification itself is pure logic — no LLM call for the router, which would defeat the purpose.
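To make the routing behavior concrete, here is the same decision logic as a standalone sketch, exercised against a few representative tasks (keyword lists condensed into regexes, behavior unchanged):

```typescript
// Standalone copy of classifyTask's decision logic, for illustration.
function classifyTask(task: string, contextSize: number): 0 | 1 | 2 | 3 {
  const lower = task.toLowerCase();
  // Tier 0: free local, small context only
  if (contextSize < 4000 && /commit|summarize|translate|describe this|explain what/.test(lower)) return 0;
  // Tier 1: cheap classification on short inputs
  if (contextSize < 2000 && /classify|is this|categorize|yes or no/.test(lower)) return 1;
  // Tier 3: expensive, high-stakes
  if (/architecture|security audit|refactor entire|design system/.test(lower)) return 3;
  // Default: Sonnet
  return 2;
}

console.log(classifyTask("write a commit message for this diff", 1200)); // 0
console.log(classifyTask("is this a security issue or a warning?", 500)); // 1
console.log(classifyTask("security audit of the webhook handler", 6000)); // 3
console.log(classifyTask("implement pagination for the tasks API", 3000)); // 2
```

Note that the context-size guards run first: a "summarize" task with a huge input falls through to the Sonnet default rather than overwhelming the local model.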

Real Numbers: Before vs After

Monthly API spend, approximate:

| Month | Spend | Notes |
|-------|-------|-------|
| October (baseline) | $31 | Sonnet-only, no routing |
| November | $26 | Started routing commits to Haiku |
| December | $19 | Added Ollama for local tasks |
| January | $13 | Routing fully tuned, ~40% Ollama deflection |

That’s roughly a 58% reduction. The tasks I care about — coding sessions, architecture work — still run on Sonnet or Opus. The high-frequency low-complexity tasks (commits, briefing summaries, transaction categorization) moved to Ollama.

To put the Ollama impact in concrete terms: ARIA writes a commit message for every significant git commit I make. In January that was around 85 commits across all projects. At Sonnet pricing, that’s maybe $0.80–$1.20 in commits alone. At Ollama: $0.00.
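As a sanity check on that estimate — assuming a typical diff of ~4k input tokens and ~80 output tokens per commit message (my assumptions, not measured figures from the article), Sonnet pricing lands inside the quoted band:

```typescript
// Back-of-envelope cost of 85 commit messages at Sonnet pricing
// ($3/$15 per 1M tokens in/out). Token counts per commit are assumptions.
const commits = 85;
const inputTokens = 4000;  // assumed: diff plus prompt
const outputTokens = 80;   // assumed: one-line message
const perCommit = inputTokens * (3 / 1_000_000) + outputTokens * (15 / 1_000_000);
const monthly = commits * perCommit;
console.log(monthly.toFixed(2)); // ≈ 1.12 — within the $0.80–$1.20 band
```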

Surprising Findings

Haiku handles more than I expected. I started with Haiku only for yes/no classification. It turned out to handle short-context summarization well too — well enough that I moved “summarize this changelog” tasks from Sonnet to Haiku. The output is slightly less polished but entirely usable.

Ollama quality for commits is fine. I expected to cringe at llama3.2:3b commit messages. I mostly don’t. The model reads the diff, understands the change type, and writes something like “fix: correct null check in user session handler.” That’s all I need. I reviewed 30 of them; 27 were usable without editing.

Opus is rare but irreplaceable. I used Opus maybe 8 times in three months. Three of those were security-related: reviewing the auth flow in Menthos, auditing webhook handling, checking the credit consumption logic for race conditions. In all three cases, Opus caught things Sonnet missed. The $0.50–$1.00 per session is worth it for high-stakes work.

The briefing is trickier than it looks. The morning briefing synthesizes 8+ tool call results. I tried downgrading it to Haiku. The output was technically correct but felt mechanical — it listed facts without prioritization or narrative. Sonnet adds a layer of judgment (“the Software budget hit 82%, worth watching”) that Haiku doesn’t. Briefings stay on Sonnet.

The model-suggest Hook

For interactive Claude Code sessions, I use a pre-tool hook that fires before the first model call and suggests a tier based on what I typed:

#!/bin/bash
# ~/.claude/hooks/model-suggest.sh
# Runs before Claude Code starts a session

INPUT="$1"
INPUT_LOWER=$(echo "$INPUT" | tr '[:upper:]' '[:lower:]')

suggest_tier() {
  if echo "$INPUT_LOWER" | grep -qE "commit|summary|summarize|translate|explain what"; then
    echo "0 (Ollama — free local)"
  elif echo "$INPUT_LOWER" | grep -qE "architecture|security audit|design system"; then
    echo "3 (Opus — expensive, confirm?)"
  else
    echo "2 (Sonnet — default)"
  fi
}

TIER=$(suggest_tier)
echo "[ARIA] Suggested tier: $TIER" >&2

This doesn’t force a model. It prints a suggestion to stderr that I can act on — or ignore. Most of the time I’m already using the right model by habit, but it’s a useful nudge when I’m about to ask something trivial.

Trade-offs

Ollama latency. llama3.2:3b on my machine (Ryzen 5, 32GB RAM, no discrete GPU) takes 3–8 seconds for a commit message. Sonnet via API takes ~1–2 seconds with network. For commits it’s fine — I’m not staring at the terminal waiting. For interactive tasks it would be unacceptable.

Context window limits. Ollama’s 8k context means I can’t send it large diffs or long files. Anything over ~3k tokens gets bumped to Haiku or Sonnet automatically. The context_size check in classifyTask handles this.
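The bump can be sketched as a small guard layered on top of the classifier — names and exact thresholds here are illustrative, not ARIA's actual code:

```typescript
// If a task would route to the local model but its context exceeds a safe
// budget (~3k tokens against llama3.2:3b's 8k window), bump it up a tier.
type Tier = 0 | 1 | 2 | 3;

const LOCAL_CONTEXT_BUDGET = 3000; // tokens; conservative vs. the 8k window

function applyContextGuard(tier: Tier, contextSize: number): Tier {
  if (tier === 0 && contextSize > LOCAL_CONTEXT_BUDGET) {
    // Modest overflows fit Haiku cheaply; large ones go straight to Sonnet.
    return contextSize <= 8000 ? 1 : 2;
  }
  return tier;
}

console.log(applyContextGuard(0, 1500));  // 0 — stays local
console.log(applyContextGuard(0, 5000));  // 1 — bumped to Haiku
console.log(applyContextGuard(0, 20000)); // 2 — bumped to Sonnet
```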

Cold start on Ollama. First call after system restart takes ~15 seconds to load the model. After that it’s warm and fast. I have Ollama set to start on boot so the warm-up happens before I need it.

Routing isn’t perfect. The classifier is keyword-based, not itself an LLM call. It makes mistakes — sometimes routing a complex “explain” question to Ollama when it should go to Sonnet. I catch these and fix the classifier. After three months it’s good enough that I rarely notice the misroutes.

The Actual Lesson

Using one model for everything is the lazy default and it’s expensive. The routing logic is not complicated — a few hundred lines of Node.js and a shell script. The savings compound daily because the high-frequency tasks (commits, summaries, briefings) are exactly the ones you can deflect cheapest.

The mental model I’d suggest: think about your task distribution. If you’re doing AI-assisted development every day, a large fraction of your calls are probably low-complexity. Route those to free or cheap models. Save the expensive reasoning capacity for problems that actually need it.

You don’t need a sophisticated ML routing system. Keyword classification plus context length is enough to get 80% of the value. Start there.


Next in the series: The Offline Queue Problem: Building Resilient AI Workflows