Optimizing Session Costs

This guide provides strategies for reducing costs and improving efficiency in Radium agent sessions.

Understanding Session Reports

Session reports provide detailed metrics to help you understand where costs are incurred:

Key Metrics

  • Total Cost: Sum of all model API costs for the session (see the sketch after this list)
  • Input Tokens: Tokens sent to models (typically the largest share of session cost due to their volume)
  • Output Tokens: Tokens generated by models
  • Cached Tokens: Tokens served from cache (free or reduced cost)
  • Tool Calls: Number of tool executions (affects total time)
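
To make the relationship between these metrics concrete, here is a minimal sketch of how a session's total cost is derived from them. The per-million-token prices and the cache discount are hypothetical placeholders; actual pricing is provider- and model-specific.

// Rough cost model for a session. Prices and the cache discount are
// hypothetical; substitute your provider's published rates.
struct TokenUsage {
    input_tokens: u64,  // all tokens sent to the model
    cached_tokens: u64, // assumed here to be the subset of input served from cache
    output_tokens: u64, // tokens generated by the model
}

fn estimate_cost(u: &TokenUsage, input_per_m: f64, output_per_m: f64, cache_discount: f64) -> f64 {
    let uncached = (u.input_tokens - u.cached_tokens) as f64 / 1_000_000.0 * input_per_m;
    let cached = u.cached_tokens as f64 / 1_000_000.0 * input_per_m * (1.0 - cache_discount);
    let output = u.output_tokens as f64 / 1_000_000.0 * output_per_m;
    uncached + cached + output
}

A higher share of cached tokens shifts input cost into the discounted term, which is why the cache hit rate matters so much for total cost.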

Reading a Session Report

rad stats session

Look for:

  1. Model Usage: Which models are being used and their token counts
  2. Cache Hit Rate: Percentage of tokens served from cache
  3. Tool Success Rate: Percentage of successful tool calls
  4. Performance Breakdown: Where time is spent (API vs tools)

Cost Optimization Strategies

1. Leverage Caching

Cache effectiveness is shown in session reports:

  • Cache Hit Rate: Higher is better (aim for >50%)
  • Cached Tokens: These tokens are free or significantly cheaper

Tips:

  • Use context files that can be cached
  • Reuse prompts and templates when possible
  • Enable caching in your agent configuration

Context Caching

Context caching stores processed prompt tokens at the provider level, reducing token costs by 50% or more when the same context is sent repeatedly. This is different from model instance caching: it caches the actual prompt tokens, not the model objects.

Enable Context Caching:

use radium_models::{ModelConfig, ModelType};
use std::time::Duration;

// Enable provider-level context caching with a 5-minute TTL.
let config = ModelConfig::new(ModelType::Claude, "claude-3-sonnet".to_string())
    .with_context_caching(true)
    .with_cache_ttl(Duration::from_secs(300));

Provider-Specific Notes:

  • Claude: Use cache breakpoints to mark stable context boundaries
  • OpenAI: Automatic for GPT-4+ models (just enable caching)
  • Gemini: Use cache identifiers to reuse cached content

Monitor Cache Performance:

Check cache_usage in ModelResponse to see cache hit rates and cost savings. Aim for >50% cache hit rate for optimal cost reduction.
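
A minimal sketch of what that check might look like, assuming cache_usage is an optional struct on ModelResponse carrying cached and total input token counts (the import path and those field names are assumptions for illustration, not a confirmed API):

use radium_models::ModelResponse;

// Compute a cache hit rate from a response's cache usage, if reported.
// The field names on `cache_usage` are assumed for this sketch.
fn cache_hit_rate(response: &ModelResponse) -> Option<f64> {
    let usage = response.cache_usage.as_ref()?;
    if usage.input_tokens == 0 {
        return None;
    }
    Some(usage.cached_tokens as f64 / usage.input_tokens as f64)
}

// Flag responses that fall below the >50% target mentioned above.
fn report(response: &ModelResponse) {
    match cache_hit_rate(response) {
        Some(rate) if rate >= 0.5 => println!("cache hit rate OK: {:.0}%", rate * 100.0),
        Some(rate) => println!("cache hit rate low: {:.0}%", rate * 100.0),
        None => println!("no cache usage reported"),
    }
}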

See Context Caching Documentation for detailed information.

2. Model Selection

Different models have different costs:

  • Flash models: Lower cost, faster, good for simple tasks
  • Pro models: Higher cost, more capable, use for complex reasoning

Strategy:

  • Use flash models for routine operations
  • Reserve pro models for complex reasoning tasks
  • Check rad stats model to see which models you're using most

Example:

Model                          Requests  Input Tokens  Output Tokens  Cost
─────────────────────────────────────────────────────────────────────────
gemini-3-pro-preview                168    31,056,954         44,268  $0.1250
gemini-2.5-flash-lite                28        60,389          2,422  $0.0025

In this example, switching more requests to flash-lite could significantly reduce costs.
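
One way to act on this is to route work to a model tier by task type. The sketch below reuses the ModelConfig API shown earlier; the ModelType::Gemini variant and the task categories are assumptions for illustration.

use radium_models::{ModelConfig, ModelType};

// Hypothetical task classification; adapt to however you categorize work.
enum TaskKind {
    Routine, // extraction, formatting, simple edits
    Complex, // multi-step reasoning, planning, review
}

// Pick the cheaper flash model for routine work and reserve the pro model
// for complex reasoning. `ModelType::Gemini` is assumed here.
fn config_for(task: TaskKind) -> ModelConfig {
    match task {
        TaskKind::Routine => ModelConfig::new(ModelType::Gemini, "gemini-2.5-flash-lite".to_string()),
        TaskKind::Complex => ModelConfig::new(ModelType::Gemini, "gemini-3-pro-preview".to_string()),
    }
}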

3. Reduce Tool Calls

Tool calls consume time and can increase costs indirectly:

  • Success Rate: Higher success rate means fewer retries
  • Tool Calls: Fewer calls = faster sessions = lower costs

Strategies:

  • Improve tool reliability to reduce failures
  • Batch operations when possible (see the sketch after this list)
  • Use more efficient tools that require fewer calls
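
As a sketch of what batching buys you, here is a contrast between per-file calls and a single batched call. The Tool trait below is hypothetical and used only for illustration; Radium's actual tool interface may differ.

// Hypothetical tool interface used only to illustrate batching.
trait Tool {
    fn call(&self, args: &str) -> Result<String, String>;
}

// N paths = N tool calls = N round trips through the agent loop.
fn read_one_by_one(tool: &dyn Tool, paths: &[&str]) -> Result<Vec<String>, String> {
    paths.iter().map(|p| tool.call(p)).collect()
}

// One batched request covering all paths = a single round trip.
fn read_batched(tool: &dyn Tool, paths: &[&str]) -> Result<String, String> {
    tool.call(&paths.join("\n"))
}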

4. Optimize Prompts

Shorter, more focused prompts reduce input tokens:

  • Be specific in your instructions
  • Remove unnecessary context
  • Use structured formats (JSON, YAML) when possible
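
For example, the same request can be phrased as free prose or as a compact structured instruction; the structured form usually costs fewer input tokens and is easier for the model to follow consistently. The field names below are purely illustrative.

fn main() {
    // Same intent, two framings.
    let prose = "Please look through the repository, find every place the logging \
                 level is configured, and tell me the file and line for each one, \
                 plus which logging library is used in each case.";
    let structured = r#"{"task":"find_logging_config","report":["file","line","logging_library"]}"#;

    // The structured form carries the same request in far fewer characters (and tokens).
    println!("prose: {} bytes, structured: {} bytes", prose.len(), structured.len());
}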

5. Monitor and Compare

Use session comparison to identify improvements:

rad stats compare <old-session> <new-session>

Look for:

  • Token Delta: Reduction in total tokens
  • Cost Delta: Reduction in total cost
  • Success Rate: Improvement in tool success rate

Example Analysis:

Token Usage
───────────
Delta: -5200 (-10.0%) ← Good: 10% reduction

Cost
────
Delta: -0.0150 (-12.0%) ← Good: 12% cost reduction

Success Rate: 92.6% → 95.2% (+2.6%) ← Good: Fewer failures

Identifying Expensive Operations

High Token Usage

Check which models are consuming the most tokens:

rad stats model

Look for:

  • Models with very high input token counts
  • Models with low cache hit rates
  • Models used frequently but inefficiently

Long-Running Sessions

Check performance metrics:

rad stats session

Look for:

  • High wall time relative to agent active time (indicates waiting/idle time)
  • High API time percentage (indicates model calls are slow)
  • High tool time percentage (indicates tools are slow)

Frequent Failures

Check success rates:

rad stats session

Look for:

  • Low success rate (<90% indicates problems)
  • High number of failed tool calls
  • Patterns in failures (specific tools or operations)

Best Practices

Regular Monitoring

  1. Weekly Review: Check rad stats history weekly to spot trends
  2. Compare Sessions: Use rad stats compare after making changes
  3. Export Data: Use rad stats export to analyze trends over time

Cost Budgeting

  1. Set Limits: Monitor costs and set budgets per project
  2. Track Trends: Export data and track cost trends over time
  3. Alert on Spikes: Watch for sudden cost increases

Optimization Workflow

  1. Baseline: Establish baseline with rad stats session
  2. Identify Issues: Look for high costs, low cache rates, frequent failures
  3. Make Changes: Adjust models, prompts, or tools
  4. Compare: Use rad stats compare to verify improvements
  5. Iterate: Continue optimizing based on results

Example Optimization Scenario

Initial State

$ rad stats session
Total Cost: $0.1250
Model Usage: gemini-3-pro-preview (168 requests, 31M input tokens)
Cache Hit Rate: 15%
Success Rate: 92.6%

Issues Identified

  1. Low cache hit rate (15% - should be >50%)
  2. All requests using expensive pro model
  3. Some tool failures (7.4% failure rate)

Changes Made

  1. Enabled caching for context files
  2. Switched routine operations to flash-lite model
  3. Improved tool error handling

Results

$ rad stats compare <old-session> <new-session>
Cost Delta: -0.0350 (-28.0%) ← 28% cost reduction!
Cache Hit Rate: 15% → 58% ← Much better caching
Success Rate: 92.6% → 96.1% ← Fewer failures

Additional Resources