Budget Controls & Cost Optimization

Set hard token and dollar limits that automatically switch models or stop execution before you get a surprise bill.

Intermediate 15 min read

1. The Cost Problem

AI coding agents can be expensive. A Cursor Pro subscription costs $240/year, but power users report spending $7,000+ annually with usage-based pricing. One poorly scoped prompt asking an agent to "refactor the entire codebase" can burn through $50 in tokens in minutes.

⚠ Cost Risk

Without budget controls, a single command such as harness --permission bypass 'Refactor entire codebase' could cost hundreds of dollars. The model has no way to stop itself once it starts.

Harness solves this with a three-layer approach: per-session token limits, per-session cost limits, and automatic model fallback when limits are reached.

⚡ Harness Advantage

Token and cost budgets with automatic fallback are unique to Harness. Claude Code, Cursor, and Copilot have no programmatic budget controls — you only find out what you spent after the invoice arrives.

2. Model Cost Catalog

Different models have drastically different price points. Use the table below to understand what a typical 100K-token task costs across providers. The 100K task cost assumes roughly 70K input tokens and 30K output tokens.

Provider	Model	Input $/1M	Output $/1M	100K Task Cost (70K in / 30K out)
Anthropic	claude-opus-4	$15.00	$75.00	~$3.30
Anthropic	claude-sonnet-4	$3.00	$15.00	~$0.66
OpenAI	gpt-4o	$2.50	$10.00	~$0.48
OpenAI	o3-mini	$1.10	$4.40	~$0.21
Google	gemini-2.5-pro	$1.25	$10.00	~$0.39
Ollama	any local model	Free	Free	$0.00
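The per-task figure follows directly from the per-million rates and the token split. As a minimal sketch (the PRICES dictionary and the task_cost helper are illustrative names, not part of Harness), the arithmetic for a 70K-input / 30K-output task looks like this:

```python
# Illustrative helper, not part of the Harness API.
# Prices are the approximate list prices from the table, in $ per 1M tokens.
PRICES = {
    "claude-opus-4": (15.00, 75.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
    "gemini-2.5-pro": (1.25, 10.00),
}


def task_cost(model: str, input_tokens: int = 70_000, output_tokens: int = 30_000) -> float:
    """Return the dollar cost of one task at the given token split."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for model in PRICES:
    print(f"{model}: ${task_cost(model):.2f}")
```

Because output tokens cost several times more than input tokens for every paid model listed, the input/output split affects the bill as much as the total token count does.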

ℹ Pricing Note

Prices shown are approximate list prices as of early 2025. Check each provider's pricing page for current rates. Harness cost tracking uses the actual token counts reported by each API.

3. TOML Budget Configuration

The simplest way to set budgets is in your .harness/config.toml. These limits are enforced automatically on every harness invocation without any code changes.

TOML .harness/config.toml

[router]
strategy = "cost_optimized"
fallback_chain = ["anthropic", "openai", "google"]
max_cost_per_session = 1.00     # Hard limit: $1 per session
max_tokens_per_session = 500000  # Hard limit: 500K tokens
simple_task_model = "claude-haiku-4-5-20251001"  # Use cheap model for simple tasks

When max_cost_per_session or max_tokens_per_session is reached, Harness raises a BudgetExhaustedError and stops the agent loop cleanly before issuing another API call.

The TOML [router] section maps directly to RouterConfigData:

Python harness/config.py

from dataclasses import dataclass

@dataclass
class RouterConfigData:
    strategy: str = "manual"
    fallback_chain: tuple[str, ...] = ()
    max_cost_per_session: float = 0.0   # 0.0 means unlimited
    max_tokens_per_session: int = 0     # 0 means unlimited
    simple_task_model: str | None = None

A value of 0 (default) means unlimited — Harness will not enforce a budget unless you explicitly set a positive value.

4. TokenBudgetTracker in the SDK

For programmatic control, use TokenBudgetTracker directly. It tracks cumulative usage across multiple API calls, and check_budget() raises BudgetExhaustedError once a limit has been exceeded.

Class Reference

Python harness/providers/budget.py

from dataclasses import dataclass

# Public names in this module, imported by callers with:
#   from harness.providers.budget import (
#       BudgetSnapshot, BudgetExhaustedError, TokenBudgetTracker
#   )

# BudgetSnapshot — immutable point-in-time view
@dataclass
class BudgetSnapshot:
    input_tokens_used: int = 0
    output_tokens_used: int = 0
    total_tokens_used: int = 0
    cost_used: float = 0.0
    tokens_remaining: int = 0   # 0 when no limit is configured
    cost_remaining: float = 0.0  # 0.0 when no limit is configured

# TokenBudgetTracker — stateful accumulator
class TokenBudgetTracker:
    def __init__(self, *, max_tokens: int = 0, max_cost: float = 0.0): ...
    @property
    def total_tokens(self) -> int: ...       # Tokens used so far
    @property
    def total_cost(self) -> float: ...       # Cost accrued so far
    def record_usage(
        self, input_tokens=0, output_tokens=0, cost=0.0
    ) -> BudgetSnapshot: ...                 # Returns snapshot after recording
    def snapshot(self) -> BudgetSnapshot: ... # Current state without modifying
    def is_exhausted(self) -> bool: ...      # True if any limit exceeded
    def check_budget(self) -> None: ...      # Raises BudgetExhaustedError if exhausted
    def reset(self) -> None: ...             # Reset counters (new session)
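To show how these methods fit together, the following is a small stand-in that mirrors the documented interface. It is a sketch written for illustration, not the Harness source: the names SketchTracker, Snapshot, and BudgetExhausted are hypothetical, and the choice to treat a limit as exhausted once usage reaches it (rather than strictly exceeds it) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    total_tokens_used: int = 0
    cost_used: float = 0.0
    tokens_remaining: int = 0    # 0 when no limit is configured
    cost_remaining: float = 0.0  # 0.0 when no limit is configured


class BudgetExhausted(Exception):
    def __init__(self, snapshot: Snapshot):
        super().__init__(f"budget exhausted: {snapshot}")
        self.snapshot = snapshot


class SketchTracker:
    """Stand-in mirroring the documented interface; not the Harness source."""

    def __init__(self, *, max_tokens: int = 0, max_cost: float = 0.0):
        self.max_tokens, self.max_cost = max_tokens, max_cost
        self.total_tokens, self.total_cost = 0, 0.0

    def record_usage(self, input_tokens=0, output_tokens=0, cost=0.0) -> Snapshot:
        self.total_tokens += input_tokens + output_tokens
        self.total_cost += cost
        return self.snapshot()

    def snapshot(self) -> Snapshot:
        return Snapshot(
            total_tokens_used=self.total_tokens,
            cost_used=self.total_cost,
            tokens_remaining=max(self.max_tokens - self.total_tokens, 0) if self.max_tokens else 0,
            cost_remaining=max(self.max_cost - self.total_cost, 0.0) if self.max_cost else 0.0,
        )

    def is_exhausted(self) -> bool:
        # Assumption: a limit counts as exhausted once usage reaches it.
        over_tokens = self.max_tokens > 0 and self.total_tokens >= self.max_tokens
        over_cost = self.max_cost > 0 and self.total_cost >= self.max_cost
        return over_tokens or over_cost

    def check_budget(self) -> None:
        if self.is_exhausted():
            raise BudgetExhausted(self.snapshot())


tracker = SketchTracker(max_tokens=1_000)
tracker.record_usage(input_tokens=400, output_tokens=300, cost=0.01)
print(tracker.is_exhausted())  # 700 of 1,000 tokens used
tracker.record_usage(input_tokens=200, output_tokens=200)
print(tracker.is_exhausted())  # 1,100 of 1,000 tokens used
```

Note that recording usage never raises in this sketch; the caller decides when to enforce the budget by calling check_budget().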

Full Example

Python cost_runner.py

import asyncio
import harness
from harness.providers.budget import TokenBudgetTracker, BudgetExhaustedError

async def main():
    # Create a budget: max $0.50 or 100K tokens (whichever hits first)
    budget = TokenBudgetTracker(max_tokens=100_000, max_cost=0.50)

    try:
        async for msg in harness.run(
            "Analyze and document this codebase",
            provider="anthropic",
            model="claude-sonnet-4-20250514",
        ):
            if isinstance(msg, harness.Result):
                # Record usage from each Result message
                budget.record_usage(
                    # Note: total_tokens doesn't split input/output.
                    # This 50/50 approximation is for budget tracking only.
                    input_tokens=msg.total_tokens // 2,
                    output_tokens=msg.total_tokens // 2,
                    cost=msg.total_cost,
                )

                snap = budget.snapshot()
                print(f"Used: ${snap.cost_used:.4f} / ${snap.cost_remaining:.4f} remaining")
                print(f"Tokens: {snap.total_tokens_used:,} / {snap.tokens_remaining:,} remaining")

                # Manually check before issuing the next turn
                budget.check_budget()

    except BudgetExhaustedError as e:
        print(f"Budget exceeded! {e}")
        print(f"Final: ${e.snapshot.cost_used:.4f} spent, "
              f"{e.snapshot.total_tokens_used:,} tokens used")

asyncio.run(main())

ℹ Result Message

harness.Result is yielded at the end of each agent turn. It contains total_tokens and total_cost attributes that reflect the entire turn's usage.

5. Handling BudgetExhaustedError

BudgetExhaustedError carries a snapshot attribute so you can inspect why the budget was exhausted and take different recovery actions.

Python

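import harness
from harness.providers.budget import BudgetExhaustedError, TokenBudgetTracker

MAX_COST = 0.50
MAX_TOKENS = 100_000
budget = TokenBudgetTracker(max_tokens=MAX_TOKENS, max_cost=MAX_COST)

async def check_or_recover(task, results_so_far):
    try:
        budget.check_budget()
    except BudgetExhaustedError as e:
        snap = e.snapshot

        # cost_remaining and tokens_remaining are also 0 when no limit is
        # configured, so compare usage against the limits you set.
        if MAX_COST > 0 and snap.cost_used >= MAX_COST:
            print(f"Cost limit hit! Spent ${snap.cost_used:.4f}")
            print("Switching to free local model via Ollama...")
            # Re-run with Ollama provider (free)
            async for msg in harness.run(task, provider="ollama", model="llama3.2"):
                ...

        elif MAX_TOKENS > 0 and snap.total_tokens_used >= MAX_TOKENS:
            print(f"Token limit hit! Used {snap.total_tokens_used:,} tokens")
            print("Compacting context and retrying with summary...")
            # Summarize results so far and continue with compact context.
            # summarize_progress is your own helper, not part of Harness.
            summary = summarize_progress(results_so_far)
            budget.reset()
            async for msg in harness.run(
                f"Continue from: {summary}",
                provider="anthropic",
                model="claude-haiku-4-5-20251001",  # Cheaper model for remainder
            ):
                ...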

✓ Best Practice

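Always work out which limit tripped before choosing a recovery strategy. Because cost_remaining and tokens_remaining are also 0 when no limit is configured, compare snap.cost_used and snap.total_tokens_used against the limits you set. A cost limit often means switching to a free model; a token limit often means compacting context.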
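The snapshot reports 0 for a remaining value both when a limit is used up and when no limit was configured, so a check based only on "remaining <= 0" can pick the wrong branch. One way to disambiguate, sketched below, is to compare usage against the limits you configured. UsageView and tripped_limit are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageView:
    """Stand-in for the two snapshot fields this check needs."""
    cost_used: float
    total_tokens_used: int


def tripped_limit(usage: UsageView, *, max_cost: float = 0.0, max_tokens: int = 0) -> str:
    """Return "cost", "tokens", or "none"; a limit of 0 means unlimited."""
    if max_cost > 0 and usage.cost_used >= max_cost:
        return "cost"
    if max_tokens > 0 and usage.total_tokens_used >= max_tokens:
        return "tokens"
    return "none"


# Only a token limit is configured, so the cost branch must not fire
print(tripped_limit(UsageView(cost_used=0.12, total_tokens_used=120_000), max_tokens=100_000))
```

The returned label can then drive the recovery choice: "cost" suggests moving to a free local model, "tokens" suggests compacting context.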

6. Model Router Strategies

ModelRouter wraps a primary provider and automatically selects the right model based on your chosen strategy. This lets you save money on simple tasks without sacrificing quality for complex ones.

With RoutingStrategy.MANUAL, you explicitly choose the model for every run. This is useful when you need deterministic behavior.

Python

from harness.providers.router import ModelRouter, RoutingStrategy

router = ModelRouter(
    primary=anthropic_provider,
    strategy=RoutingStrategy.MANUAL,
    # No automatic switching — you control the model
)

With RoutingStrategy.COST_OPTIMIZED, Harness analyzes each task and routes simple requests (short prompts, single-file edits) to the cheaper simple_task_provider.

Python

from harness.providers.router import ModelRouter, RoutingStrategy
from harness.providers.budget import TokenBudgetTracker

budget = TokenBudgetTracker(max_cost=5.00)

router = ModelRouter(
    primary=anthropic_provider,           # Sonnet 4 for complex tasks
    strategy=RoutingStrategy.COST_OPTIMIZED,
    simple_task_provider=haiku_provider,  # Haiku for simple tasks
    budget=budget,
)

With RoutingStrategy.QUALITY_FIRST, the router always uses the highest-quality available model, regardless of cost, until the budget is exhausted. It then falls back.

Python

from harness.providers.budget import TokenBudgetTracker
from harness.providers.router import ModelRouter, RoutingStrategy

budget = TokenBudgetTracker(max_cost=5.00)

router = ModelRouter(
    primary=opus_provider,   # Always try Opus first
    strategy=RoutingStrategy.QUALITY_FIRST,
    budget=budget,           # Falls back when budget is hit
)

With RoutingStrategy.LATENCY_FIRST, the router selects the fastest model (typically a smaller, cheaper one) for each task to minimize response time in latency-sensitive workflows.

Python

from harness.providers.router import ModelRouter, RoutingStrategy

router = ModelRouter(
    primary=haiku_provider,  # Fast model as primary
    strategy=RoutingStrategy.LATENCY_FIRST,
    # Optimizes for time-to-first-token
)

RoutingStrategy Enum

Python harness/providers/router.py

from enum import Enum

class RoutingStrategy(Enum):
    MANUAL         = "manual"          # You choose the model explicitly
    COST_OPTIMIZED = "cost_optimized"  # Routes simple tasks to cheaper models
    QUALITY_FIRST  = "quality_first"   # Always uses best model, budget permitting
    LATENCY_FIRST  = "latency_first"   # Uses fastest model for each task

class ModelRouter:
    def __init__(
        self,
        primary,                         # Primary ProviderAdapter
        *,
        strategy=RoutingStrategy.MANUAL,
        simple_task_provider=None,       # Cheap provider for simple tasks
        budget=None,                     # TokenBudgetTracker instance
    ): ...
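The source does not show how Harness decides that a task is "simple". Purely as an illustration of cost-optimized routing, the sketch below uses an assumed heuristic (a word-count threshold and a single-file check); looks_simple, choose_provider, and the thresholds are all hypothetical and do not describe the actual classifier.

```python
def looks_simple(prompt: str, files_touched: int, *, max_words: int = 40) -> bool:
    """Assumed heuristic: a short prompt that touches at most one file."""
    return len(prompt.split()) <= max_words and files_touched <= 1


def choose_provider(prompt: str, files_touched: int, *, has_simple_provider: bool = True) -> str:
    """Return which provider slot a cost-optimized router might pick."""
    if has_simple_provider and looks_simple(prompt, files_touched):
        return "simple_task_provider"
    return "primary"


print(choose_provider("Rename variable x to count", files_touched=1))
print(choose_provider("Refactor the entire codebase", files_touched=250))
```

Whatever the real classifier looks like, the design trade-off is the same: a looser definition of "simple" saves more money but sends harder tasks to the weaker model.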

7. Automatic Fallback

FallbackProvider wraps a list of providers and tries each in order if the previous one fails (network error, rate limit, service outage). Your agent keeps running even if a provider goes down.

Python fallback_example.py

import asyncio
import harness
from harness.providers.fallback import FallbackProvider

# anthropic_provider, openai_provider, and google_provider are
# pre-configured ProviderAdapter instances created elsewhere.

# Try Anthropic first, then OpenAI, then Google
fallback = FallbackProvider([
    anthropic_provider,
    openai_provider,
    google_provider,
])

# active_provider reflects whichever is currently serving requests
print(f"Active: {fallback.active_provider}")

async def main():
    # If Anthropic is down, automatically tries OpenAI
    # If OpenAI is down, automatically tries Google
    async for msg in harness.run(
        "Review this PR",
        # Advanced: pass a pre-configured provider instance
        # (internal API; prefer TOML fallback_chain config instead)
        _provider=fallback,
    ):
        print(msg)

asyncio.run(main())

TOML Equivalent

The same fallback chain can be configured in TOML — no code required:

TOML .harness/config.toml

[router]
fallback_chain = ["anthropic", "openai", "google"]
# Harness will try providers in this order on failure

⚡ No Other Agent Has This

No other coding agent has automatic provider failover. When Anthropic has an outage, Claude Code stops working entirely — there is no fallback. Cursor is locked to a single provider. Harness keeps running by switching to the next healthy provider in your chain automatically.

FallbackProvider catches ConnectionError, OSError, and TimeoutError. (In Python 3, ConnectionError and TimeoutError are themselves subclasses of OSError.) If the current provider fails with any of these, it automatically tries the next provider in the chain. It does not catch BudgetExhaustedError; budget limits always propagate to the caller.

Python harness/providers/fallback.py

class FallbackProvider:
    def __init__(self, providers: list[ProviderAdapter]): ...

    @property
    def active_provider(self) -> ProviderAdapter:
        """Returns the currently active (healthy) provider."""
        ...
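The failover behavior described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the FallbackProvider source: providers are modeled as plain async callables, and call_with_fallback, BudgetExhausted, down, and healthy are hypothetical names.

```python
import asyncio


class BudgetExhausted(Exception):
    """Stand-in for BudgetExhaustedError; deliberately not an OSError."""


async def call_with_fallback(providers, prompt):
    """Try each provider in order, moving on only for network-level errors."""
    last_error = None
    for provider in providers:
        try:
            return await provider(prompt)
        except OSError as exc:
            # ConnectionError and TimeoutError are subclasses of OSError
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


async def down(prompt):
    raise ConnectionError("provider unavailable")


async def healthy(prompt):
    return f"ok: {prompt}"


print(asyncio.run(call_with_fallback([down, healthy], "Review this PR")))
```

Because only OSError and its subclasses are caught, a budget error raised by a provider leaves the loop immediately instead of triggering a retry against the next provider.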

8. Next Steps

You now have full control over what your agents spend. Here is a recommended workflow:

1. Set max_cost_per_session = 1.00 in .harness/config.toml as a safety net
2. Use strategy = "cost_optimized" with a cheap simple_task_model to reduce spend by 60-80%
3. Add a fallback_chain so your workflows survive provider outages
4. In scripts, wrap harness.run() with a TokenBudgetTracker and handle BudgetExhaustedError

▶ Try It Now

Add this to your .harness/config.toml and run any harness command — you will see a cost summary printed at the end of each session:

TOML

[router]
max_cost_per_session = 0.50
max_tokens_per_session = 100000

Next, learn how to govern what your agent can do (not just how much it costs) with Policy-as-Code.