4-Tier AI Model Routing: Real Cost Data After 3 Months

Part of: aria-progress

#aria #ai #cost #llm #devtools

For the first two months of ARIA, I used Claude Sonnet for everything. It worked great. It was also dumb.

Not dumb in the “bad results” sense — Sonnet is excellent. Dumb in the “using a table saw to cut bread” sense. When ARIA writes a git commit message, it doesn’t need a frontier model reasoning through multi-step problems. It needs to read a diff and produce a one-liner. A far cheaper model — or a free local one — does that just fine.

I started tracking costs in November. By January I had enough data to build a routing system. Here’s what I found.

The Problem With One Model for Everything

Before routing, my average monthly Claude API spend was around $28–32. That sounds low in absolute terms, but consider what it was buying:

  • ~40% of spend on commit messages, PR descriptions, and summaries
  • ~25% on morning/evening briefings
  • ~20% on actual coding tasks
  • ~15% on architecture discussions and complex debugging

The first two categories — commit messages and briefings — are high-volume, low-complexity tasks. They run daily, sometimes multiple times a day. They need good formatting and coherent output, but they don’t need deep reasoning. I was spending Sonnet-level money on tasks that a much cheaper model handles equally well.
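Put in dollars, the breakdown looks like this — a quick sketch using $30 as the midpoint of the $28–32 range above; nothing here is assumed beyond that rounding:

```typescript
// Monthly spend split by the percentages quoted above, at ~$30/month.
const monthlySpend = 30;
const dollars = {
  commitsAndSummaries: monthlySpend * 0.4, // ≈ $12/month — cheapest work to deflect
  briefings: monthlySpend * 0.25,          // ≈ $7.50/month
  coding: monthlySpend * 0.2,              // ≈ $6/month
  architecture: monthlySpend * 0.15,       // ≈ $4.50/month
};
console.log(dollars);
```

Roughly $12 a month — 40% of the bill — was going to tasks a free local model can handle.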

The 4-Tier System

I settled on four tiers after experimenting with three (no local model initially):

| Tier | Model | Cost per 1M tokens (in/out) | Used for |
|------|-------|-----------------------------|----------|
| 0 | Ollama llama3.2:3b (local) | Free | Commits, summaries, translations, simple explanations |
| 1 | Claude Haiku 3.5 | $0.80 / $4.00 | Quick lookups, yes/no classification, short Q&A |
| 2 | Claude Sonnet (default) | $3.00 / $15.00 | Coding, features, bug fixes, briefings |
| 3 | Claude Opus | $15.00 / $75.00 | Architecture, security audits, complex refactors |

Tier 0 is local, runs on my machine, costs nothing per token. The constraint is quality and context window (llama3.2:3b has an 8k context). Tier 3 rarely runs — I invoke it explicitly for specific sessions, never automatically.

Routing Rules

The routing lives in two places: the aria_route_query MCP tool (for ARIA-initiated tasks) and a model-suggest pre-tool hook in Claude Code (for interactive sessions).

Tier 0 — Ollama local:

  • Git commit messages (reads diff, writes message)
  • PR descriptions (reads commits, writes summary)
  • Summarizing a file or a set of logs
  • Translating short text (PT ↔ EN)
  • Explaining what a command does
  • Categorizing a transaction description in Neutron

The test: “Could a smart intern do this with 30 seconds of reading?” If yes, Tier 0.

Tier 1 — Haiku:

  • “Is this a security issue or just a warning?”
  • Quick regex generation
  • Classifying a WhatsApp message (task request? question? spam?)
  • Checking if a string matches a pattern
  • Generating a short template from a known format

The test: “Does this need actual reasoning, or just pattern matching?” Haiku is surprisingly good at pattern matching on short inputs.

Tier 2 — Sonnet (default):

Everything that doesn’t fit Tier 0 or 1 and isn’t in Tier 3. This is the workhorse. Feature implementation, debugging, refactoring, writing tests, the morning briefing (which requires synthesizing 8 tool call results into coherent output), code review.

Tier 3 — Opus:

  • System architecture decisions where I need to think through trade-offs carefully
  • Security audits (reviewing auth flows, checking for injection vulnerabilities)
  • Complex refactors that touch many files and need to reason about state across the whole codebase
  • Anything where I’ve already tried Sonnet and it’s getting confused

I invoke Opus explicitly, never automatically. It’s a deliberate choice, like calling a senior consultant instead of your regular dev.

The aria_route_query Tool

For ARIA’s own tasks (not interactive Claude Code sessions), routing is handled by the aria_route_query MCP tool. It classifies a task before deciding which model to use:

import { z } from "zod"; // zod provides the input schema below

// Assumes `server` is the MCP server instance, and that TIER_MODELS /
// TIER_REASONING are tier-indexed lookup tables (definitions not shown).
server.tool(
  "aria_route_query",
  "Classify a task and suggest the appropriate model tier (0-3). Use before delegating work to determine cost-efficient routing.",
  {
    task: z.string().describe("Description of the task to classify"),
    context_size: z.number().optional().describe("Approximate token count of context"),
  },
  async ({ task, context_size }) => {
    const tier = classifyTask(task, context_size ?? 0);
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          tier,
          model: TIER_MODELS[tier],
          reasoning: TIER_REASONING[tier],
        })
      }]
    };
  }
);

function classifyTask(task: string, contextSize: number): 0 | 1 | 2 | 3 {
  const lower = task.toLowerCase();

  // Tier 0: free local
  if (contextSize < 4000 && (
    lower.includes("commit") ||
    lower.includes("summarize") ||
    lower.includes("translate") ||
    lower.includes("describe this") ||
    lower.includes("explain what")
  )) return 0;

  // Tier 1: cheap classification
  if (contextSize < 2000 && (
    lower.includes("classify") ||
    lower.includes("is this") ||
    lower.includes("categorize") ||
    lower.includes("yes or no")
  )) return 1;

  // Tier 3: expensive
  if (
    lower.includes("architecture") ||
    lower.includes("security audit") ||
    lower.includes("refactor entire") ||
    lower.includes("design system")
  ) return 3;

  // Default: Sonnet
  return 2;
}

This runs before ARIA delegates any sub-task to a model. The classification itself is pure logic — no LLM call for the router, which would defeat the purpose.
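To make the routing behavior concrete, here is the same decision logic as a standalone sketch, exercised against a few representative tasks (keyword lists condensed into regexes, behavior unchanged):

```typescript
// Standalone copy of classifyTask's decision logic, for illustration.
function classifyTask(task: string, contextSize: number): 0 | 1 | 2 | 3 {
  const lower = task.toLowerCase();
  // Tier 0: free local, small context only
  if (contextSize < 4000 && /commit|summarize|translate|describe this|explain what/.test(lower)) return 0;
  // Tier 1: cheap classification on short inputs
  if (contextSize < 2000 && /classify|is this|categorize|yes or no/.test(lower)) return 1;
  // Tier 3: expensive, high-stakes
  if (/architecture|security audit|refactor entire|design system/.test(lower)) return 3;
  // Default: Sonnet
  return 2;
}

console.log(classifyTask("write a commit message for this diff", 1200)); // 0
console.log(classifyTask("is this a security issue or a warning?", 500)); // 1
console.log(classifyTask("security audit of the webhook handler", 6000)); // 3
console.log(classifyTask("implement pagination for the tasks API", 3000)); // 2
```

Note that the context-size guards run first: a "summarize" task with a huge input falls through to the Sonnet default rather than overwhelming the local model.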

Real Numbers: Before vs After

Monthly API spend, approximate:

| Month | Spend | Notes |
|-------|-------|-------|
| October (baseline) | $31 | Sonnet-only, no routing |
| November | $26 | Started routing commits to Haiku |
| December | $19 | Added Ollama for local tasks |
| January | $13 | Routing fully tuned, ~40% Ollama deflection |

That’s roughly a 58% reduction. The tasks I care about — coding sessions, architecture work — still run on Sonnet or Opus. The high-frequency low-complexity tasks (commits, briefing summaries, transaction categorization) moved to Ollama.

To put the Ollama impact in concrete terms: ARIA writes a commit message for every significant git commit I make. In January that was around 85 commits across all projects. At Sonnet pricing, that’s maybe $0.80–$1.20 in commits alone. At Ollama: $0.00.
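As a sanity check on that estimate — assuming a typical diff of ~4k input tokens and ~80 output tokens per commit message (my assumptions, not measured figures from the article), Sonnet pricing lands inside the quoted band:

```typescript
// Back-of-envelope cost of 85 commit messages at Sonnet pricing
// ($3/$15 per 1M tokens in/out). Token counts per commit are assumptions.
const commits = 85;
const inputTokens = 4000;  // assumed: diff plus prompt
const outputTokens = 80;   // assumed: one-line message
const perCommit = inputTokens * (3 / 1_000_000) + outputTokens * (15 / 1_000_000);
const monthly = commits * perCommit;
console.log(monthly.toFixed(2)); // ≈ 1.12 — within the $0.80–$1.20 band
```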

Surprising Findings

Haiku handles more than I expected. I started with Haiku only for yes/no classification. It turned out to handle short-context summarization well too — well enough that I moved “summarize this changelog” tasks from Sonnet to Haiku. The output is slightly less polished but entirely usable.

Ollama quality for commits is fine. I expected to cringe at llama3.2:3b commit messages. I mostly don’t. The model reads the diff, understands the change type, and writes something like “fix: correct null check in user session handler.” That’s all I need. I reviewed 30 of them; 27 were usable without editing.

Opus is rare but irreplaceable. I used Opus maybe 8 times in three months. Three of those were security-related: reviewing the auth flow in Menthos, auditing webhook handling, checking the credit consumption logic for race conditions. In all three cases, Opus caught things Sonnet missed. The $0.50–$1.00 per session is worth it for high-stakes work.

The briefing is trickier than it looks. The morning briefing synthesizes 8+ tool call results. I tried downgrading it to Haiku. The output was technically correct but felt mechanical — it listed facts without prioritization or narrative. Sonnet adds a layer of judgment (“the Software budget hit 82%, worth watching”) that Haiku doesn’t. Briefings stay on Sonnet.

The model-suggest Hook

For interactive Claude Code sessions, I use a pre-tool hook that fires before the first model call and suggests a tier based on what I typed:

#!/bin/bash
# ~/.claude/hooks/model-suggest.sh
# Runs before Claude Code starts a session

INPUT="$1"
INPUT_LOWER=$(echo "$INPUT" | tr '[:upper:]' '[:lower:]')

suggest_tier() {
  if echo "$INPUT_LOWER" | grep -qE "commit|summary|summarize|translate|explain what"; then
    echo "0 (Ollama — free local)"
  elif echo "$INPUT_LOWER" | grep -qE "architecture|security audit|design system"; then
    echo "3 (Opus — expensive, confirm?)"
  else
    echo "2 (Sonnet — default)"
  fi
}

TIER=$(suggest_tier)
echo "[ARIA] Suggested tier: $TIER" >&2

This doesn’t force a model. It prints a suggestion to stderr that I can act on — or ignore. Most of the time I’m already using the right model by habit, but it’s a useful nudge when I’m about to ask something trivial.

Trade-offs

Ollama latency. llama3.2:3b on my machine (Ryzen 5, 32GB RAM, no discrete GPU) takes 3–8 seconds for a commit message. Sonnet via API takes ~1–2 seconds with network. For commits it’s fine — I’m not staring at the terminal waiting. For interactive tasks it would be unacceptable.

Context window limits. Ollama’s 8k context means I can’t send it large diffs or long files. Anything over ~3k tokens gets bumped to Haiku or Sonnet automatically. The context_size check in classifyTask handles this.
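The bump can be sketched as a small guard layered on top of the classifier — names and exact thresholds here are illustrative, not ARIA's actual code:

```typescript
// If a task would route to the local model but its context exceeds a safe
// budget (~3k tokens against llama3.2:3b's 8k window), bump it up a tier.
type Tier = 0 | 1 | 2 | 3;

const LOCAL_CONTEXT_BUDGET = 3000; // tokens; conservative vs. the 8k window

function applyContextGuard(tier: Tier, contextSize: number): Tier {
  if (tier === 0 && contextSize > LOCAL_CONTEXT_BUDGET) {
    // Modest overflows fit Haiku cheaply; large ones go straight to Sonnet.
    return contextSize <= 8000 ? 1 : 2;
  }
  return tier;
}

console.log(applyContextGuard(0, 1500));  // 0 — stays local
console.log(applyContextGuard(0, 5000));  // 1 — bumped to Haiku
console.log(applyContextGuard(0, 20000)); // 2 — bumped to Sonnet
```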

Cold start on Ollama. First call after system restart takes ~15 seconds to load the model. After that it’s warm and fast. I have Ollama set to start on boot so the warm-up happens before I need it.

Routing isn’t perfect. The classifier is keyword-based, not itself an LLM call. It makes mistakes — sometimes routing a complex “explain” question to Ollama when it should go to Sonnet. I catch these and fix the classifier. After three months it’s good enough that I rarely notice the misroutes.

The Actual Lesson

Using one model for everything is the lazy default and it’s expensive. The routing logic is not complicated — a few hundred lines of Node.js and a shell script. The savings compound daily because the high-frequency tasks (commits, summaries, briefings) are exactly the ones you can deflect cheapest.

The mental model I’d suggest: think about your task distribution. If you’re doing AI-assisted development every day, a large fraction of your calls are probably low-complexity. Route those to free or cheap models. Save the expensive reasoning capacity for problems that actually need it.

You don’t need a sophisticated ML routing system. Keyword classification plus context length is enough to get 80% of the value. Start there.


Next in the series: The Offline Queue Problem: Building Resilient AI Workflows