Budget Controls & Cost Optimization
Set hard token and dollar limits that automatically switch models or stop execution before you get a surprise bill.
1. The Cost Problem
AI coding agents can be expensive. A Cursor Pro subscription costs $240/year, but power users report spending $7,000+ annually with usage-based pricing. One poorly scoped prompt asking an agent to "refactor the entire codebase" can burn through $50 in tokens in minutes.
Running harness --permission bypass 'Refactor entire codebase' could cost hundreds of dollars. The model has no way to stop itself once it starts.
Harness solves this with a three-layer approach: per-session token limits, per-session cost limits, and automatic model fallback when limits are reached.
2. Model Cost Catalog
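Different models have drastically different price points. Use the table below to compare what a typical task costs across providers. The figures in the 100K Task Cost column correspond to 100K input tokens plus 100K output tokens at the listed rates; a task that splits 100K total tokens as roughly 70K input and 30K output costs less (about $0.66 on claude-sonnet-4).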
| Provider | Model | Input $/1M | Output $/1M | 100K Task Cost |
|---|---|---|---|---|
| Anthropic | claude-opus-4 | $15.00 | $75.00 | ~$9.00 |
| Anthropic | claude-sonnet-4 | $3.00 | $15.00 | ~$1.80 |
| OpenAI | gpt-4o | $2.50 | $10.00 | ~$1.25 |
| OpenAI | o3-mini | $1.10 | $4.40 | ~$0.55 |
| Google | gemini-2.5-pro | $1.25 | $10.00 | ~$1.13 |
| Ollama | any local model | Free | Free | $0.00 |
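The arithmetic behind the last column can be checked with a few lines of Python. This is a standalone sketch, not part of Harness; `RATES` and `task_cost` are illustrative names, and the rates are copied from the table. With 100K input and 100K output tokens it reproduces the table's figures, and it also prints the cost of a 70K/30K split for comparison.

```python
# Per-million-token rates (input, output) in USD, copied from the table above.
RATES = {
    "claude-opus-4": (15.00, 75.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
    "gemini-2.5-pro": (1.25, 10.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at the listed per-1M-token rates."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


for model in RATES:
    even = task_cost(model, 100_000, 100_000)   # 100K input + 100K output
    skewed = task_cost(model, 70_000, 30_000)   # 100K total, 70/30 split
    print(f"{model:<16} 100K+100K: ${even:.2f}   70K+30K: ${skewed:.2f}")
```

Output tokens dominate the bill on every paid model here, so tasks that generate a lot of text cost disproportionately more than tasks that mostly read.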
3. TOML Budget Configuration
The simplest way to set budgets is in your .harness/config.toml. These limits are enforced
automatically on every harness invocation without any code changes.
[router]
strategy = "cost_optimized"
fallback_chain = ["anthropic", "openai", "google"]
max_cost_per_session = 1.00 # Hard limit: $1 per session
max_tokens_per_session = 500000 # Hard limit: 500K tokens
simple_task_model = "claude-haiku-4-5-20251001" # Use cheap model for simple tasks
When max_cost_per_session or max_tokens_per_session is reached, Harness raises
a BudgetExhaustedError and stops the agent loop cleanly before issuing another API call.
The TOML [router] section maps directly to RouterConfigData:
@dataclass
class RouterConfigData:
    strategy: str = "manual"
    fallback_chain: tuple[str, ...] = ()
    max_cost_per_session: float = 0.0   # 0.0 means unlimited
    max_tokens_per_session: int = 0     # 0 means unlimited
    simple_task_model: str | None = None
A value of 0 (default) means unlimited — Harness will not enforce a budget unless you explicitly set a positive value.
4. TokenBudgetTracker in the SDK
For programmatic control, use TokenBudgetTracker directly. It tracks cumulative usage across
multiple API calls; call check_budget() after recording usage to raise BudgetExhaustedError once a limit is exceeded.
Class Reference
from harness.providers.budget import (
    BudgetSnapshot,
    BudgetExhaustedError,
    TokenBudgetTracker,
)

# BudgetSnapshot: immutable point-in-time view
@dataclass
class BudgetSnapshot:
    input_tokens_used: int = 0
    output_tokens_used: int = 0
    total_tokens_used: int = 0
    cost_used: float = 0.0
    tokens_remaining: int = 0     # 0 when no limit is configured
    cost_remaining: float = 0.0   # 0.0 when no limit is configured

# TokenBudgetTracker: stateful accumulator
class TokenBudgetTracker:
    def __init__(self, *, max_tokens: int = 0, max_cost: float = 0.0): ...

    @property
    def total_tokens(self) -> int: ...      # Tokens used so far

    @property
    def total_cost(self) -> float: ...      # Cost accrued so far

    def record_usage(
        self, input_tokens=0, output_tokens=0, cost=0.0
    ) -> BudgetSnapshot: ...                # Returns snapshot after recording

    def snapshot(self) -> BudgetSnapshot: ...   # Current state without modifying
    def is_exhausted(self) -> bool: ...         # True if any limit exceeded
    def check_budget(self) -> None: ...         # Raises BudgetExhaustedError if exhausted
    def reset(self) -> None: ...                # Reset counters (new session)
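The signatures above leave the accounting implicit, so the following is a minimal stand-in that behaves the way the reference describes. It is a sketch for illustration, not the Harness source. Two details are assumptions: a limit counts as exhausted as soon as usage reaches it (`>=`), and the remaining fields are clamped at 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetSnapshot:
    input_tokens_used: int = 0
    output_tokens_used: int = 0
    total_tokens_used: int = 0
    cost_used: float = 0.0
    tokens_remaining: int = 0     # 0 when no limit is configured
    cost_remaining: float = 0.0   # 0.0 when no limit is configured


class BudgetExhaustedError(Exception):
    def __init__(self, snapshot: BudgetSnapshot):
        super().__init__(
            f"{snapshot.total_tokens_used} tokens, ${snapshot.cost_used:.4f} used"
        )
        self.snapshot = snapshot


class TokenBudgetTracker:
    """Illustrative stand-in; the real class may differ in its details."""

    def __init__(self, *, max_tokens: int = 0, max_cost: float = 0.0):
        self._max_tokens = max_tokens
        self._max_cost = max_cost
        self.reset()

    def reset(self) -> None:
        self._input = 0
        self._output = 0
        self._cost = 0.0

    @property
    def total_tokens(self) -> int:
        return self._input + self._output

    @property
    def total_cost(self) -> float:
        return self._cost

    def record_usage(self, input_tokens=0, output_tokens=0, cost=0.0) -> BudgetSnapshot:
        self._input += input_tokens
        self._output += output_tokens
        self._cost += cost
        return self.snapshot()

    def snapshot(self) -> BudgetSnapshot:
        tokens_left = max(self._max_tokens - self.total_tokens, 0) if self._max_tokens else 0
        cost_left = max(self._max_cost - self._cost, 0.0) if self._max_cost else 0.0
        return BudgetSnapshot(
            self._input, self._output, self.total_tokens, self._cost, tokens_left, cost_left
        )

    def is_exhausted(self) -> bool:
        # Assumption: reaching a limit counts as exhausted; 0 disables a limit.
        tokens_hit = self._max_tokens > 0 and self.total_tokens >= self._max_tokens
        cost_hit = self._max_cost > 0 and self._cost >= self._max_cost
        return tokens_hit or cost_hit

    def check_budget(self) -> None:
        if self.is_exhausted():
            raise BudgetExhaustedError(self.snapshot())


tracker = TokenBudgetTracker(max_tokens=1_000, max_cost=0.50)
first = tracker.record_usage(input_tokens=300, output_tokens=200, cost=0.10)
print(first.tokens_remaining, tracker.is_exhausted())
```

Note that recording usage never raises in this sketch; only check_budget() does, which is why callers check the budget before starting another turn.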
Full Example
import asyncio
import harness
from harness.providers.budget import TokenBudgetTracker, BudgetExhaustedError

async def main():
    # Create a budget: max $0.50 or 100K tokens (whichever hits first)
    budget = TokenBudgetTracker(max_tokens=100_000, max_cost=0.50)
    try:
        async for msg in harness.run(
            "Analyze and document this codebase",
            provider="anthropic",
            model="claude-sonnet-4-20250514",
        ):
            if isinstance(msg, harness.Result):
                # Record usage from each Result message.
                # Note: total_tokens doesn't split input/output.
                # This 50/50 approximation is for budget tracking only.
                half = msg.total_tokens // 2
                budget.record_usage(
                    input_tokens=half,
                    output_tokens=msg.total_tokens - half,  # keeps the sum exact for odd totals
                    cost=msg.total_cost,
                )
                snap = budget.snapshot()
                print(f"Used: ${snap.cost_used:.4f} / ${snap.cost_remaining:.4f} remaining")
                print(f"Tokens: {snap.total_tokens_used:,} / {snap.tokens_remaining:,} remaining")
                # Manually check before issuing the next turn
                budget.check_budget()
    except BudgetExhaustedError as e:
        print(f"Budget exceeded! {e}")
        print(f"Final: ${e.snapshot.cost_used:.4f} spent, "
              f"{e.snapshot.total_tokens_used:,} tokens used")

asyncio.run(main())
harness.Result is yielded at the end of each agent turn. It contains total_tokens
and total_cost attributes that reflect the entire turn's usage.
5. Handling BudgetExhaustedError
BudgetExhaustedError carries a snapshot attribute so you can inspect
why the budget was exhausted and take different recovery actions.
from harness.providers.budget import BudgetExhaustedError

# This fragment runs inside an async function. budget, task, results_so_far,
# and summarize_progress are supplied by the surrounding application.
try:
    budget.check_budget()
except BudgetExhaustedError as e:
    snap = e.snapshot
    if snap.cost_remaining <= 0:
        print(f"Cost limit hit! Spent ${snap.cost_used:.4f}")
        print("Switching to free local model via Ollama...")
        # Re-run with Ollama provider (free)
        async for msg in harness.run(task, provider="ollama", model="llama3.2"):
            ...
    elif snap.tokens_remaining <= 0:
        print(f"Token limit hit! Used {snap.total_tokens_used:,} tokens")
        print("Compacting context and retrying with summary...")
        # Summarize results so far and continue with compact context
        summary = summarize_progress(results_so_far)
        budget.reset()
        async for msg in harness.run(
            f"Continue from: {summary}",
            provider="anthropic",
            model="claude-haiku-4-5-20251001",  # Cheaper model for remainder
        ):
            ...
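Compare snap.cost_remaining and snap.tokens_remaining to choose the right
recovery strategy. A cost limit often means switching to a free model; a token limit often means compacting context. Because both fields are also 0 when the corresponding limit is not configured, these checks are only unambiguous when both limits are set.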
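Since a remaining value of 0 can mean either "limit reached" or "no limit configured", one way to decide which limit actually fired is to compare usage against the limits you passed to the tracker. The helper below is a hypothetical sketch, not a Harness API: `exhausted_limits` is an invented name, and BudgetSnapshot is a local stand-in with the documented fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetSnapshot:
    # Local stand-in with the fields from the class reference.
    input_tokens_used: int = 0
    output_tokens_used: int = 0
    total_tokens_used: int = 0
    cost_used: float = 0.0
    tokens_remaining: int = 0
    cost_remaining: float = 0.0


def exhausted_limits(
    snap: BudgetSnapshot, *, max_tokens: int = 0, max_cost: float = 0.0
) -> set[str]:
    """Return the configured limits that usage has reached: 'cost', 'tokens', or both."""
    hit = set()
    if max_cost > 0 and snap.cost_used >= max_cost:
        hit.add("cost")
    if max_tokens > 0 and snap.total_tokens_used >= max_tokens:
        hit.add("tokens")
    return hit


# A token-only budget: cost_remaining is 0.0 because no cost limit exists,
# so checking `snap.cost_remaining <= 0` first would pick the wrong branch.
snap = BudgetSnapshot(total_tokens_used=120_000, cost_used=0.30)
print(exhausted_limits(snap, max_tokens=100_000))
```

Branching on the returned set keeps the recovery logic correct for budgets that configure only one of the two limits.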
6. Model Router Strategies
ModelRouter wraps a primary provider and automatically selects the right model based on
your chosen strategy. This lets you save money on simple tasks without sacrificing quality for complex ones.
MANUAL: you explicitly choose the model for every run. Useful when you need deterministic behavior.
from harness.providers.router import ModelRouter, RoutingStrategy

router = ModelRouter(
    primary=anthropic_provider,
    strategy=RoutingStrategy.MANUAL,
    # No automatic switching: you control the model
)
COST_OPTIMIZED: Harness analyzes each task and routes simple requests (short prompts, single-file edits) to the cheaper simple_task_provider.
from harness.providers.router import ModelRouter, RoutingStrategy
from harness.providers.budget import TokenBudgetTracker

budget = TokenBudgetTracker(max_cost=5.00)
router = ModelRouter(
    primary=anthropic_provider,            # Sonnet 4 for complex tasks
    strategy=RoutingStrategy.COST_OPTIMIZED,
    simple_task_provider=haiku_provider,   # Haiku for simple tasks
    budget=budget,
)
QUALITY_FIRST: always uses the highest-quality available model, regardless of cost, until the budget is exhausted, then falls back.
from harness.providers.router import ModelRouter, RoutingStrategy

router = ModelRouter(
    primary=opus_provider,   # Always try Opus first
    strategy=RoutingStrategy.QUALITY_FIRST,
    budget=budget,           # Falls back when budget is hit
)
LATENCY_FIRST: selects the fastest model (typically a smaller, cheaper model) for each task to minimize response time in latency-sensitive workflows.
from harness.providers.router import ModelRouter, RoutingStrategy

router = ModelRouter(
    primary=haiku_provider,  # Fast model as primary
    strategy=RoutingStrategy.LATENCY_FIRST,
    # Optimizes for time-to-first-token
)
RoutingStrategy Enum
from enum import Enum

class RoutingStrategy(Enum):
    MANUAL = "manual"                  # You choose the model explicitly
    COST_OPTIMIZED = "cost_optimized"  # Routes simple tasks to cheaper models
    QUALITY_FIRST = "quality_first"    # Always uses best model, budget permitting
    LATENCY_FIRST = "latency_first"    # Uses fastest model for each task

class ModelRouter:
    def __init__(
        self,
        primary,                       # Primary ProviderAdapter
        *,
        strategy=RoutingStrategy.MANUAL,
        simple_task_provider=None,     # Cheap provider for simple tasks
        budget=None,                   # TokenBudgetTracker instance
    ): ...
7. Automatic Fallback
FallbackProvider wraps a list of providers and tries each in order if the previous one
fails (network error, rate limit, service outage). Your agent keeps running even if a provider goes down.
from harness.providers.fallback import FallbackProvider

# Try Anthropic first, then OpenAI, then Google
fallback = FallbackProvider([
    anthropic_provider,
    openai_provider,
    google_provider,
])

# active_provider reflects whichever is currently serving requests
print(f"Active: {fallback.active_provider}")

# If Anthropic is down, automatically tries OpenAI
# If OpenAI is down, automatically tries Google
async for msg in harness.run(
    "Review this PR",
    # Advanced: pass a pre-configured provider instance
    # (internal API; prefer TOML fallback_chain config instead)
    _provider=fallback,
):
    print(msg)
TOML Equivalent
The same fallback chain can be configured in TOML — no code required:
[router]
fallback_chain = ["anthropic", "openai", "google"]
# Harness will try providers in this order on failure
FallbackProvider catches ConnectionError, OSError, and TimeoutError — if the current provider fails with any of these, it automatically tries the next provider in the chain. It does not catch
BudgetExhaustedError — budget limits always propagate to the caller.
class FallbackProvider:
    def __init__(self, providers: list[ProviderAdapter]): ...

    @property
    def active_provider(self) -> ProviderAdapter:
        """Returns the currently active (healthy) provider."""
        ...
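The fallback behaviour described above amounts to a loop that moves to the next provider only on connection-type errors. The sketch below shows that pattern in isolation. It does not use Harness: providers are plain async callables, BudgetExhaustedError is a local stand-in, and `complete_with_fallback` is a hypothetical name.

```python
import asyncio

# The error types the text says FallbackProvider catches.
RECOVERABLE = (ConnectionError, OSError, TimeoutError)


class BudgetExhaustedError(Exception):
    """Local stand-in for harness.providers.budget.BudgetExhaustedError."""


async def complete_with_fallback(providers, prompt):
    """Try each provider in order; return (index, reply) from the first success."""
    last_error = None
    for index, provider in enumerate(providers):
        try:
            return index, await provider(prompt)
        except RECOVERABLE as exc:
            last_error = exc  # provider unavailable: try the next one
    raise RuntimeError("all providers failed") from last_error


async def down(prompt):
    raise ConnectionError("provider unreachable")


async def healthy(prompt):
    return f"reviewed: {prompt}"


async def over_budget(prompt):
    raise BudgetExhaustedError("cost limit reached")


active_index, reply = asyncio.run(
    complete_with_fallback([down, healthy], "Review this PR")
)
print(active_index, reply)
```

Because BudgetExhaustedError is not in the caught tuple, it passes straight through the loop to the caller instead of triggering a switch to the next provider.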
8. Next Steps
You now have full control over what your agents spend. Here is a recommended workflow:
- Set max_cost_per_session = 1.00 in .harness/config.toml as a safety net
- Use strategy = "cost_optimized" with a cheap simple_task_model to reduce spend by 60-80%
- Add a fallback_chain so your workflows survive provider outages
- In scripts, wrap harness.run() with a TokenBudgetTracker and handle BudgetExhaustedError
To try it, add the following to .harness/config.toml and run any harness command; you will see a cost
summary printed at the end of each session:
[router]
max_cost_per_session = 0.50
max_tokens_per_session = 100000
Next, learn how to govern what your agent can do (not just how much it costs) with Policy-as-Code.