ADR-001: YOLO Mode Autonomous Orchestration Architecture
Status: Accepted
Date: 2025-12-07
Decision Makers: Radium Core Team
Related REQs: REQ-165, REQ-166, REQ-167, REQ-168, REQ-169, REQ-170, REQ-171
Context
Implement a fully autonomous execution mode in which Radium can complete an entire implementation from a high-level goal (e.g., "complete the implementation in REQ-123") without user intervention. The system must decide on its own about:
- Agent selection based on task requirements and agent capabilities
- Resource allocation across multiple agents and AI providers
- Error recovery when tasks fail or providers are exhausted
- Multi-agent coordination for complex tasks requiring specialized skills
User Story
"As a user, I would like to open Radium desktop, CLI, or TUI and simply execute a single command: 'please complete the entire implementation found in <source> for me.' The Radium orchestrator would then verify and analyze the source, which could be Braingrid, Jira, local MD files, etc. Once it had verified that it could access and read all source material, it would operate in a mode similar to YOLO mode in Gemini/Claude and oversee execution entirely to completion."
Decision
1. Workflow Generation Strategy
Leverage Existing: PlanGenerator (crates/radium-core/src/planning/mod.rs)
Enhancements Required:
- Extend plan generation to create full workflow DAGs with dependencies
- Add task dependency analysis using existing planning infrastructure
- Integrate with WorkflowEngine (crates/radium-core/src/workflow/engine.rs)
Implementation:
```rust
use std::collections::HashMap;

// Extend PlanGenerator with dependency tracking
pub struct WorkflowPlan {
    pub tasks: Vec<TaskNode>,
    pub dependencies: HashMap<TaskId, Vec<TaskId>>,
    pub estimated_cost: TokenBudget,
}

impl PlanGenerator {
    pub async fn generate_autonomous_workflow(
        &self,
        goal: &str,
        sources: Vec<String>,
    ) -> Result<WorkflowPlan> {
        // 1. Verify all sources are accessible (REQ-165)
        // 2. Parse the goal into a task tree
        // 3. Analyze dependencies
        // 4. Estimate resource requirements
        // 5. Generate the workflow DAG
        todo!("design sketch")
    }
}
```
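The ADR does not say how the generated DAG is ordered for execution. The sketch below is one possibility, not the PlanGenerator or WorkflowEngine API: it assumes dependencies maps each task to the tasks it must wait for, uses plain String IDs in place of TaskId (a hypothetical alias here), and applies Kahn's algorithm to produce a valid execution order or report the tasks caught in a cycle.

```rust
use std::collections::{BTreeMap, HashMap, VecDeque};

/// Hypothetical stand-in for the ADR's TaskId type.
pub type TaskId = String;

/// Returns tasks ordered so every task follows its prerequisites, or Err with
/// the tasks that could not be scheduled because of a cycle.
/// Assumes `deps[t]` lists the tasks that `t` depends on.
pub fn execution_order(
    tasks: &[TaskId],
    deps: &HashMap<TaskId, Vec<TaskId>>,
) -> Result<Vec<TaskId>, Vec<TaskId>> {
    // In-degree = number of unfinished prerequisites; BTreeMap keeps output deterministic.
    let mut in_degree: BTreeMap<&str, usize> = tasks.iter().map(|t| (t.as_str(), 0)).collect();
    // Reverse edges: prerequisite -> tasks waiting on it.
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (task, prereqs) in deps {
        for p in prereqs {
            *in_degree.entry(task.as_str()).or_insert(0) += 1;
            in_degree.entry(p.as_str()).or_insert(0);
            dependents.entry(p.as_str()).or_default().push(task.as_str());
        }
    }
    // Seed the queue with tasks that have no prerequisites.
    let mut ready: VecDeque<&str> = in_degree
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(t, _)| *t)
        .collect();
    let mut order = Vec::new();
    while let Some(t) = ready.pop_front() {
        order.push(t.to_string());
        if let Some(children) = dependents.get(t) {
            for &c in children {
                // One prerequisite of `c` is done; schedule it once all are done.
                let d = in_degree.get_mut(c).expect("every dependent has an in-degree entry");
                *d -= 1;
                if *d == 0 {
                    ready.push_back(c);
                }
            }
        }
    }
    if order.len() == in_degree.len() {
        Ok(order)
    } else {
        // Tasks still waiting are part of, or blocked by, a cycle.
        Err(in_degree.into_iter().filter(|(_, d)| *d > 0).map(|(t, _)| t.to_string()).collect())
    }
}
```

Rejecting cyclic plans at generation time means a malformed decomposition fails fast instead of stalling an unattended run.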
2. Agent Selection Algorithm
Leverage Existing: AgentMetadata system (crates/radium-core/src/agents/metadata.rs)
Strategy:
- Capability Matching: Parse task requirements → match against agent capabilities
- Cost Optimization: Use ModelSelector logic for budget-aware selection
- Performance Tracking: Track agent success rates and execution times
- Dynamic Reassignment: Switch agents on repeated failures
Implementation:
```rust
use std::sync::Arc;

pub struct AgentSelector {
    registry: Arc<AgentRegistry>,
    model_selector: Arc<ModelSelector>,
    performance_tracker: Arc<PerformanceTracker>,
}

impl AgentSelector {
    pub async fn select_agent(&self, task: &Task, context: &SelectionContext) -> Result<AgentConfig> {
        // 1. Extract task requirements (code, docs, testing, etc.)
        // 2. Filter agents by capability match
        // 3. Score candidates by performance, cost, and availability
        // 4. Apply the fallback chain if the primary agent is unavailable
        todo!("design sketch")
    }
}
```
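One way to realize steps 2 through 4 of select_agent is a weighted score over capable agents. In this sketch the Candidate type, the 0.7/0.3 weights, and the cost normalization are illustrative assumptions rather than the AgentMetadata or ModelSelector API; availability is treated as a hard filter instead of a weighted term, and the ranked list doubles as the fallback chain.

```rust
use std::collections::HashSet;

/// Hypothetical candidate record; real AgentMetadata fields may differ.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub id: String,
    pub capabilities: HashSet<String>,
    pub success_rate: f64,  // 0.0..=1.0, e.g. from a performance tracker
    pub cost_per_call: f64, // relative cost, lower is cheaper
    pub available: bool,
}

// Illustrative weights, not taken from the ADR.
const W_PERFORMANCE: f64 = 0.7;
const W_COST: f64 = 0.3;

/// Returns capable, available agents ranked best-first. The first entry is the
/// primary choice; the rest form the fallback chain.
pub fn rank_agents(required: &[&str], candidates: &[Candidate]) -> Vec<String> {
    // Capability match: every required capability must be present.
    let capable: Vec<&Candidate> = candidates
        .iter()
        .filter(|c| c.available && required.iter().all(|r| c.capabilities.contains(*r)))
        .collect();
    // Normalize cost against the most expensive capable agent so it lands in 0..=1.
    let max_cost = capable.iter().map(|c| c.cost_per_call).fold(0.0_f64, f64::max);
    let score = |c: &Candidate| {
        let cost_score = if max_cost > 0.0 { 1.0 - c.cost_per_call / max_cost } else { 1.0 };
        W_PERFORMANCE * c.success_rate + W_COST * cost_score
    };
    let mut scored: Vec<(f64, &Candidate)> = capable.into_iter().map(|c| (score(c), c)).collect();
    // Highest score first; ties broken by id for deterministic output.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0).then_with(|| a.1.id.cmp(&b.1.id)));
    scored.into_iter().map(|(_, c)| c.id.clone()).collect()
}
```

For example, given two capable coding agents where one has a 0.95 success rate at cost 10 and the other 0.80 at cost 2, the cheaper agent scores 0.80 against 0.665 and is selected first, with the premium agent as its fallback.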
3. Error Recovery Strategy
Leverage Existing:
- Checkpoint system (crates/radium-core/src/checkpoint/mod.rs)
- Hook system (crates/radium-orchestrator/src/executor.rs)
- Model fallback (crates/radium-core/src/models/selector.rs)
Recovery Hierarchy:
- Retry with exponential backoff (max 3 attempts)
- Checkpoint restoration for workflow state recovery
- Agent fallback: Primary → Specialized fallback → General-purpose agent
- Provider switching when quotas are exhausted
Implementation:
```rust
pub struct RecoveryStrategy {
    max_retries: u32,
    backoff_ms: u64,
    checkpoint_interval: u32, // Create a checkpoint every N steps
}

impl RecoveryStrategy {
    pub async fn handle_failure(&self, error: &ExecutionError, context: &WorkflowContext) -> RecoveryAction {
        match error {
            // Retry transient errors with exponential backoff until max_retries is reached
            ExecutionError::TransientError(_) if context.retries < self.max_retries => RecoveryAction::Retry {
                attempt: context.retries + 1,
                delay_ms: self.backoff_ms.saturating_mul(2_u64.saturating_pow(context.retries)),
            },
            ExecutionError::AgentFailure(_) => RecoveryAction::SwitchAgent { fallback_agent: self.select_fallback(context) },
            ExecutionError::QuotaExhausted(_) => RecoveryAction::SwitchProvider { fallback_provider: self.next_provider(context) },
            // Permanent errors, and transient errors that used up their retries,
            // fall back to the last checkpoint
            ExecutionError::TransientError(_) | ExecutionError::PermanentError(_) => {
                RecoveryAction::Checkpoint { restore_to: context.last_checkpoint }
            }
        }
    }
}
```
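To make the retry tier of the hierarchy concrete, the helper below computes the exponential backoff delays for a failing task and stops once the retry cap is hit. The function names and the 500 ms base delay used in the example are assumptions; the ADR fixes only the cap of 3 attempts.

```rust
/// Hypothetical helper: delay before the next attempt, or None once
/// `max_retries` attempts have already been made.
pub fn next_retry_delay_ms(retries_so_far: u32, max_retries: u32, backoff_ms: u64) -> Option<u64> {
    if retries_so_far >= max_retries {
        return None; // retries exhausted: escalate, e.g. restore a checkpoint
    }
    // backoff_ms * 2^retries, saturating instead of overflowing on large inputs
    Some(backoff_ms.saturating_mul(2_u64.saturating_pow(retries_so_far)))
}

/// Full delay schedule for one failing task, e.g. for logging or display.
pub fn retry_schedule(max_retries: u32, backoff_ms: u64) -> Vec<u64> {
    (0..)
        .map_while(|n| next_retry_delay_ms(n, max_retries, backoff_ms))
        .collect()
}
```

With max_retries = 3 and a 500 ms base, retry_schedule(3, 500) yields delays of 500, 1000, and 2000 ms; the fourth failure returns None, which is the point to escalate to checkpoint restoration.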
4. Resource Management
Leverage Existing:
- TelemetryRecord (crates/radium-core/src/monitoring/telemetry.rs)
- ModelSelector budget tracking (crates/radium-core/src/models/selector.rs)
Critical Error Detection:
- Track total tokens consumed across all providers
- Implement a circuit breaker that trips when a provider returns HTTP 429 (quota exceeded)
- Pause execution when all authenticated providers are exhausted
- Alert the user with a clear, actionable message
Implementation:
```rust
use std::collections::HashMap;

pub struct ResourceManager {
    budget: TokenBudget,
    provider_quotas: HashMap<ProviderId, QuotaStatus>,
    circuit_breakers: HashMap<ProviderId, CircuitBreaker>,
}

impl ResourceManager {
    pub fn check_critical_errors(&self) -> Option<CriticalError> {
        // Note: `all` is vacuously true for an empty map, so having no providers
        // configured is also reported as exhaustion (same as a count-based check).
        let all_exhausted = self.provider_quotas.values().all(|q| q.is_exhausted());
        if all_exhausted {
            return Some(CriticalError::AllProvidersExhausted {
                message: "All AI providers have exhausted their quotas. Please add credits or wait for quota reset.",
                providers: self.provider_quotas.keys().cloned().collect(),
            });
        }
        None
    }
}
```
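ResourceManager keeps a CircuitBreaker per provider, but the ADR does not define one. Below is a minimal closed/open/half-open breaker that trips on HTTP 429, assuming a fixed cooldown and caller-supplied millisecond timestamps (to keep it deterministic); the field names and the single-probe half-open behavior are illustrative choices, not existing Radium code.

```rust
/// Hypothetical breaker states for one provider.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BreakerState {
    Closed,
    Open { until_ms: u64 },
    HalfOpen, // one probe request is in flight
}

#[derive(Debug)]
pub struct CircuitBreaker {
    state: BreakerState,
    cooldown_ms: u64,
}

impl CircuitBreaker {
    pub fn new(cooldown_ms: u64) -> Self {
        Self { state: BreakerState::Closed, cooldown_ms }
    }

    /// May a request be sent to this provider right now?
    pub fn allow_request(&mut self, now_ms: u64) -> bool {
        match self.state {
            BreakerState::Closed => true,
            // A probe is already outstanding; wait for its result.
            BreakerState::HalfOpen => false,
            BreakerState::Open { until_ms } if now_ms >= until_ms => {
                // Cooldown elapsed: let a single probe request through.
                self.state = BreakerState::HalfOpen;
                true
            }
            BreakerState::Open { .. } => false,
        }
    }

    /// Record the HTTP status of a completed request.
    pub fn record_status(&mut self, status: u16, now_ms: u64) {
        let success = (200..300).contains(&status);
        let probe_failed = self.state == BreakerState::HalfOpen && !success;
        if status == 429 || probe_failed {
            // Quota exceeded or failed probe: (re)open for the cooldown period.
            self.state = BreakerState::Open { until_ms: now_ms.saturating_add(self.cooldown_ms) };
        } else if success {
            self.state = BreakerState::Closed;
        }
    }

    pub fn is_open(&self) -> bool {
        matches!(self.state, BreakerState::Open { .. })
    }
}
```

Keeping the breaker separate from quota accounting lets a provider recover automatically after its cooldown, while the all-providers-exhausted check still decides when to pause the whole run.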
5. Multi-Agent Coordination (REQ-171)
New System Required: Agent-to-agent communication bus
Key Components:
- Message Bus: Pub/sub system for agent communication
- Shared Workspace: Read/write access to plan artifacts
- Task Delegation: Agents can spawn sub-tasks for other agents
- Conflict Resolution: Detect duplicate work and coordinate
Implementation:
```rust
use std::sync::Arc;

pub struct AgentCoordinator {
    message_bus: Arc<MessageBus>,
    shared_workspace: Arc<Workspace>,
    task_queue: Arc<TaskQueue>,
}

impl AgentCoordinator {
    pub async fn delegate_subtask(&self, from: AgentId, to: AgentId, task: SubTask) -> Result<TaskHandle> {
        // 1. Validate that the target agent has the required capabilities
        // 2. Submit to the task queue with dependency tracking
        // 3. Subscribe to task completion events
        // 4. Return a handle for monitoring
        todo!("design sketch")
    }

    pub async fn resolve_conflict(&self, agents: Vec<AgentId>, resource: ResourceId) -> Resolution {
        // 1. Check task priorities
        // 2. Apply the conflict resolution policy (first-come, priority-based, etc.)
        // 3. Notify losing agents to skip duplicate work
        todo!("design sketch")
    }
}
```
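The resolution policies named in resolve_conflict (first-come, priority-based) can be expressed as a small pure function. Claim, Policy, and ConflictOutcome are hypothetical types for this sketch; in the ADR's design the coordinator would then notify the losing agents over the message bus.

```rust
use std::cmp::Reverse;

/// Hypothetical claim by an agent on a shared resource (file, task, artifact).
#[derive(Debug, Clone)]
pub struct Claim {
    pub agent: String,
    pub priority: u8,    // higher wins under PriorityBased
    pub claimed_at: u64, // monotonic sequence number or timestamp
}

#[derive(Debug, Clone, Copy)]
pub enum Policy {
    FirstCome,
    PriorityBased,
}

/// The agent that proceeds, plus the agents told to skip duplicate work.
#[derive(Debug, PartialEq)]
pub struct ConflictOutcome {
    pub winner: String,
    pub skip: Vec<String>,
}

pub fn resolve(claims: &[Claim], policy: Policy) -> Option<ConflictOutcome> {
    // min_by_key returns the first minimum, so remaining ties fall back to slice order.
    let winner = match policy {
        Policy::FirstCome => claims.iter().min_by_key(|c| c.claimed_at)?,
        // Highest priority wins; the earlier claim breaks ties.
        Policy::PriorityBased => claims.iter().min_by_key(|c| (Reverse(c.priority), c.claimed_at))?,
    };
    let skip = claims
        .iter()
        .filter(|c| c.agent != winner.agent)
        .map(|c| c.agent.clone())
        .collect();
    Some(ConflictOutcome { winner: winner.agent.clone(), skip })
}
```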
Consequences
Positive
- ✅ Leverages Existing Infrastructure: 90% of required systems already exist
- ✅ Backward Compatible: Manual agent execution still works; YOLO mode is opt-in via the policy engine
- ✅ Incremental Rollout: Features can be deployed progressively (REQ-165 → REQ-170)
- ✅ Production-Ready Design: Built on battle-tested patterns (checkpoints, hooks, policies)
Negative
- ⚠️ Increased Complexity: Adds an orchestration layer on top of existing systems
- ⚠️ Resource Intensive: Autonomous mode consumes more tokens than guided execution
- ⚠️ Debugging Challenges: Multi-agent coordination failures can be hard to trace
Risks
- 🚨 Provider Costs: Autonomous mode could incur significant API costs if not monitored
- 🚨 Runaway Execution: A kill switch is needed for infinite loops or repeated failures
- 🚨 Quality Control: Without oversight, autonomous decisions might not match user intent
Mitigation Strategies
- Cost Controls: Enforce hard token budgets and alert when usage reaches 80% of the budget
- Kill Switch: The user can abort at any time; maximum iteration limits are enforced
- Quality Gates: Use VibeCheck behavior (crates/radium-core/src/workflow/behaviors/vibe_check.rs) at checkpoints
- Audit Trail: Full telemetry of all decisions, agent switches, and retries
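The cost-control and kill-switch mitigations can share one guard consulted after every iteration. This sketch assumes token usage is reported per iteration; the type names, the warn-once behavior, and the halt reasons are illustrative, with alert_percent = 80 encoding the 80% threshold from the ADR. Integer arithmetic avoids floating-point edge cases at the threshold.

```rust
/// Hypothetical decision returned after each autonomous iteration.
#[derive(Debug, PartialEq)]
pub enum GuardDecision {
    Continue,
    /// Usage crossed the alert threshold: notify the user but keep running.
    Warn { used: u64, limit: u64 },
    /// Hard budget or iteration cap reached: stop autonomous execution.
    Halt(&'static str),
}

pub struct ExecutionGuard {
    token_limit: u64,
    alert_percent: u64,
    max_iterations: u32,
    tokens_used: u64,
    iterations: u32,
    warned: bool,
}

impl ExecutionGuard {
    pub fn new(token_limit: u64, max_iterations: u32) -> Self {
        Self { token_limit, alert_percent: 80, max_iterations, tokens_used: 0, iterations: 0, warned: false }
    }

    /// Record one iteration's token usage and decide whether another may start.
    pub fn record_iteration(&mut self, tokens: u64) -> GuardDecision {
        self.tokens_used = self.tokens_used.saturating_add(tokens);
        self.iterations = self.iterations.saturating_add(1);
        if self.tokens_used >= self.token_limit {
            return GuardDecision::Halt("hard token budget reached");
        }
        if self.iterations >= self.max_iterations {
            return GuardDecision::Halt("max iteration limit reached");
        }
        // Warn once, the first time usage crosses alert_percent of the budget.
        let over_threshold =
            self.tokens_used.saturating_mul(100) >= self.token_limit.saturating_mul(self.alert_percent);
        if over_threshold && !self.warned {
            self.warned = true;
            return GuardDecision::Warn { used: self.tokens_used, limit: self.token_limit };
        }
        GuardDecision::Continue
    }
}
```

For a 1,000-token budget, cumulative usage of 800 tokens triggers a single warning and reaching 1,000 halts the run. A user-initiated abort would sit outside this guard, for example as a cancellation flag checked alongside it.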
Implementation Plan
Phase 1: Foundation (Weeks 1-2)
- REQ-164: Complete test coverage ✅
- REQ-165: Source verification (6 tasks) 🚧 17% complete
- REQ-168: Error handling & circuit breaker
Phase 2: Intelligence (Weeks 3-4)
- REQ-170: Workflow decomposition with DAG
- REQ-166: Dynamic agent selection
- REQ-169: Unified rad complete command
Phase 3: Coordination (Weeks 5-6)
- REQ-171: Multi-agent collaboration
- REQ-167: Continuous execution mode (YOLO)
- Integration testing and refinement
References
- Existing Systems:
  - Plan Generator: crates/radium-core/src/planning/mod.rs
  - Workflow Engine: crates/radium-core/src/workflow/engine.rs
  - Model Selector: crates/radium-core/src/models/selector.rs
  - Policy Engine: crates/radium-core/src/policy/types.rs:L56
  - Checkpoint System: crates/radium-core/src/checkpoint/mod.rs
  - Learning System: crates/radium-core/src/learning/mod.rs
- Related Documentation:
  - Integration Map: docs/yolo-mode/integration-map.md
  - User Story: Autonomous Orchestration Request (2025-12-07)