# Optimizing Session Costs
This guide provides strategies for reducing costs and improving efficiency in Radium agent sessions.
## Understanding Session Reports
Session reports provide detailed metrics to help you understand where costs are incurred:
### Key Metrics
- Total Cost: Sum of all model API costs for the session
- Input Tokens: Tokens sent to models (typically more expensive)
- Output Tokens: Tokens generated by models
- Cached Tokens: Tokens served from cache (free or reduced cost)
- Tool Calls: Number of tool executions (affects total time)
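As a rough mental model of how these metrics combine into the total, here is a sketch with hypothetical per-million-token prices (actual prices vary by model and provider, and `estimate_cost` is not a Radium API):

```rust
// Rough per-session cost model. Prices are hypothetical placeholders,
// expressed in dollars per million tokens; check your provider's pricing.
fn estimate_cost(
    input_tokens: u64,
    output_tokens: u64,
    cached_tokens: u64,
    input_price: f64,  // $/1M uncached input tokens
    output_price: f64, // $/1M output tokens
    cached_price: f64, // $/1M cached input tokens (often ~10% of input_price)
) -> f64 {
    let uncached_input = input_tokens.saturating_sub(cached_tokens) as f64;
    (uncached_input * input_price
        + output_tokens as f64 * output_price
        + cached_tokens as f64 * cached_price)
        / 1_000_000.0
}

fn main() {
    // 1M input tokens, 40% served from cache, 50k output tokens.
    let cost = estimate_cost(1_000_000, 50_000, 400_000, 3.0, 15.0, 0.3);
    println!("estimated cost: ${:.4}", cost); // prints "estimated cost: $2.6700"
}
```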
### Reading a Session Report

```shell
rad stats session
```
Look for:
- Model Usage: Which models are being used and their token counts
- Cache Hit Rate: Percentage of tokens served from cache
- Tool Success Rate: Percentage of successful tool calls
- Performance Breakdown: Where time is spent (API vs tools)
## Cost Optimization Strategies

### 1. Leverage Caching
Cache effectiveness is shown in session reports:
- Cache Hit Rate: Higher is better (aim for >50%)
- Cached Tokens: These tokens are free or significantly cheaper
Tips:
- Use context files that can be cached
- Reuse prompts and templates when possible
- Enable caching in your agent configuration
#### Context Caching

Context caching reduces token costs by 50% or more for repeated context by caching processed tokens at the provider level. This differs from model instance caching: it caches the actual prompt tokens, not the model objects.
Enable Context Caching:
```rust
use radium_models::{ModelConfig, ModelType};
use std::time::Duration;

let config = ModelConfig::new(ModelType::Claude, "claude-3-sonnet".to_string())
    .with_context_caching(true)
    .with_cache_ttl(Duration::from_secs(300));
```
Provider-Specific Notes:
- Claude: Use cache breakpoints to mark stable context boundaries
- OpenAI: Automatic for GPT-4+ models (just enable caching)
- Gemini: Use cache identifiers to reuse cached content
Monitor Cache Performance:
Check cache_usage in ModelResponse to see cache hit rates and cost savings. Aim for >50% cache hit rate for optimal cost reduction.
See Context Caching Documentation for detailed information.
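To see why the >50% target matters, the sketch below computes the effective input-token price at a given hit rate. The base price is hypothetical, and the assumption that cached tokens bill at 10% of the uncached rate varies by provider:

```rust
// Effective input price as a function of cache hit rate, assuming cached
// tokens are billed at 10% of the uncached rate (provider-dependent).
fn effective_input_price(base_price: f64, hit_rate: f64) -> f64 {
    base_price * (1.0 - hit_rate) + base_price * 0.1 * hit_rate
}

fn main() {
    let base = 3.0; // hypothetical $ per 1M input tokens
    for hit_rate in [0.0, 0.15, 0.5, 0.58] {
        println!(
            "hit rate {:>4.0}% -> effective ${:.2}/1M input tokens",
            hit_rate * 100.0,
            effective_input_price(base, hit_rate)
        );
    }
}
```

At a 50% hit rate the effective price drops from $3.00 to $1.65 per million input tokens, a 45% saving on input cost alone.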
### 2. Model Selection
Different models have different costs:
- Flash models: Lower cost, faster, good for simple tasks
- Pro models: Higher cost, more capable, use for complex reasoning
Strategy:
- Use flash models for routine operations
- Reserve pro models for complex reasoning tasks
- Check `rad stats model` to see which models you're using most
Example:

```
Model                   Requests   Input Tokens   Output Tokens   Cost
──────────────────────────────────────────────────────────────────────
gemini-3-pro-preview         168     31,056,954          44,268   $0.1250
gemini-2.5-flash-lite         28         60,389           2,422   $0.0025
```
In this example, switching more requests to flash-lite could significantly reduce costs.
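Back-of-envelope, using the totals from the table above (averages hide per-request token variance, and the number of rerouted requests is hypothetical):

```rust
// Average cost per request for a model, from session-report totals.
fn cost_per_request(total_cost: f64, requests: f64) -> f64 {
    total_cost / requests
}

fn main() {
    let pro = cost_per_request(0.1250, 168.0); // gemini-3-pro-preview
    let flash = cost_per_request(0.0025, 28.0); // gemini-2.5-flash-lite
    let moved = 100.0; // hypothetical number of requests rerouted to flash-lite
    println!("pro: ${:.5}/req, flash-lite: ${:.5}/req", pro, flash);
    println!("moving {moved} requests saves ~${:.4}", moved * (pro - flash));
}
```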
### 3. Reduce Tool Calls
Tool calls consume time and can increase costs indirectly:
- Success Rate: Higher success rate means fewer retries
- Tool Calls: Fewer calls = faster sessions = lower costs
Strategies:
- Improve tool reliability to reduce failures
- Batch operations when possible
- Use more efficient tools that require fewer calls
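Retries compound this: if failed calls are retried until they succeed, and each attempt succeeds independently with probability p, the expected number of attempts per operation is 1/p:

```rust
// Expected attempts per operation if failures are retried until success,
// assuming each attempt succeeds independently with probability p.
fn expected_attempts(p: f64) -> f64 {
    1.0 / p
}

fn main() {
    for p in [0.90, 0.95, 0.99] {
        println!(
            "success rate {:.0}% -> {:.3} attempts per operation on average",
            p * 100.0,
            expected_attempts(p)
        );
    }
}
```

Raising tool success from 90% to 99% cuts average attempts from about 1.11 to 1.01 per operation, which also removes the extra model turns that follow each failure.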
### 4. Optimize Prompts
Shorter, more focused prompts reduce input tokens:
- Be specific in your instructions
- Remove unnecessary context
- Use structured formats (JSON, YAML) when possible
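For comparing prompt variants before sending, a crude token estimate can help; the ~4-characters-per-token heuristic below is only an approximation and differs from real tokenizers:

```rust
// Crude token estimate (~4 characters per token); useful only for
// comparing prompt variants, not for exact billing.
fn approx_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

fn main() {
    let verbose = "Please could you kindly take a look at the following file and, \
                   if it is not too much trouble, summarize what it does in detail.";
    let focused = "Summarize what this file does.";
    println!(
        "verbose ~{} tokens, focused ~{} tokens",
        approx_tokens(verbose),
        approx_tokens(focused)
    );
}
```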
### 5. Monitor and Compare
Use session comparison to identify improvements:
```shell
rad stats compare <old-session> <new-session>
```
Look for:
- Token Delta: Reduction in total tokens
- Cost Delta: Reduction in total cost
- Success Rate: Improvement in tool success rate
Example Analysis:

```
Token Usage
───────────
Delta: -5200 (-10.0%)      ← Good: 10% reduction

Cost
────
Delta: -0.0150 (-12.0%)    ← Good: 12% cost reduction

Success Rate: 92.6% → 95.2% (+2.6%)   ← Good: Fewer failures
```
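The deltas are plain differences against the baseline; the sketch below reproduces the example's numbers (a -5200 token delta at -10.0% implies a 52,000-token baseline):

```rust
// Delta and percentage change between two sessions' metrics.
fn delta_pct(old: f64, new: f64) -> (f64, f64) {
    let delta = new - old;
    (delta, delta / old * 100.0)
}

fn main() {
    let (tok_delta, tok_pct) = delta_pct(52_000.0, 46_800.0);
    println!("token delta: {} ({:.1}%)", tok_delta, tok_pct);
    let (cost_delta, cost_pct) = delta_pct(0.1250, 0.1100);
    println!("cost delta: {:.4} ({:.1}%)", cost_delta, cost_pct);
}
```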
## Identifying Expensive Operations

### High Token Usage
Check which models are consuming the most tokens:
```shell
rad stats model
```
Look for:
- Models with very high input token counts
- Models with low cache hit rates
- Models used frequently but inefficiently
### Long-Running Sessions
Check performance metrics:
```shell
rad stats session
```
Look for:
- High wall time relative to agent active time (indicates waiting/idle time)
- High API time percentage (indicates model calls are slow)
- High tool time percentage (indicates tools are slow)
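These percentages are just shares of wall time; a minimal sketch with illustrative numbers:

```rust
// Share of wall time spent in a phase, as a percentage.
fn pct(part_secs: f64, wall_secs: f64) -> f64 {
    part_secs / wall_secs * 100.0
}

fn main() {
    let (wall, api, tool) = (600.0, 420.0, 120.0); // illustrative seconds
    let idle = wall - api - tool;
    println!("API:  {:.1}%", pct(api, wall));
    println!("Tool: {:.1}%", pct(tool, wall));
    println!("Idle: {:.1}%", pct(idle, wall));
}
```

In this illustration 70% of wall time is spent waiting on the API, 20% on tools, and the remaining 10% is idle time: the API is the first place to look.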
### Frequent Failures
Check success rates:
```shell
rad stats session
```
Look for:
- Low success rate (<90% indicates problems)
- High number of failed tool calls
- Patterns in failures (specific tools or operations)
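To surface those patterns, group failed calls by tool name; a minimal sketch with hypothetical tool names and log entries:

```rust
use std::collections::HashMap;

// Count failures per tool to find the ones worth fixing first.
fn failure_counts<'a>(failed_calls: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for tool in failed_calls {
        *counts.entry(*tool).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Hypothetical failed tool calls pulled from a session log.
    let failed = ["shell", "edit_file", "shell", "shell", "web_fetch"];
    let mut sorted: Vec<_> = failure_counts(&failed).into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (tool, n) in sorted {
        println!("{tool}: {n} failures");
    }
}
```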
## Best Practices

### Regular Monitoring
- Weekly Review: Check `rad stats history` weekly to spot trends
- Compare Sessions: Use `rad stats compare` after making changes
- Export Data: Use `rad stats export` to analyze trends over time
### Cost Budgeting
- Set Limits: Monitor costs and set budgets per project
- Track Trends: Export data and track cost trends over time
- Alert on Spikes: Watch for sudden cost increases
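A simple spike check compares the latest session's cost to the recent average; the threshold factor here is an arbitrary choice:

```rust
// Flag a session whose cost exceeds the recent average by a chosen factor.
fn is_cost_spike(recent_costs: &[f64], latest: f64, factor: f64) -> bool {
    if recent_costs.is_empty() {
        return false;
    }
    let mean = recent_costs.iter().sum::<f64>() / recent_costs.len() as f64;
    latest > mean * factor
}

fn main() {
    let history = [0.10, 0.12, 0.09, 0.11]; // recent session costs in dollars
    // 0.35 is more than 2x the 0.105 average, so this flags a spike.
    println!("spike? {}", is_cost_spike(&history, 0.35, 2.0)); // prints "spike? true"
}
```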
### Optimization Workflow

- Baseline: Establish a baseline with `rad stats session`
- Identify Issues: Look for high costs, low cache rates, frequent failures
- Make Changes: Adjust models, prompts, or tools
- Compare: Use `rad stats compare` to verify improvements
- Iterate: Continue optimizing based on results
## Example Optimization Scenario

### Initial State
```
$ rad stats session
Total Cost: $0.1250
Model Usage: gemini-3-pro-preview (168 requests, 31M input tokens)
Cache Hit Rate: 15%
Success Rate: 92.6%
```
### Issues Identified
- Low cache hit rate (15% - should be >50%)
- All requests using expensive pro model
- Some tool failures (7.4% failure rate)
### Changes Made
- Enabled caching for context files
- Switched routine operations to flash-lite model
- Improved tool error handling
### Results
```
$ rad stats compare <old-session> <new-session>
Cost Delta: -0.0350 (-28.0%)   ← 28% cost reduction!
Cache Hit Rate: 15% → 58%      ← Much better caching
Success Rate: 92.6% → 96.1%    ← Fewer failures
```
## Additional Resources
- Session Analytics Documentation - Complete feature documentation
- Agent Configuration - Configure agents for efficiency