# Cloud-to-Self-Hosted Migration Guide

## Overview
This guide walks you through migrating your Radium workspace from cloud-based AI providers (Gemini, OpenAI, Claude) to self-hosted model infrastructure. The migration can be done gradually, allowing you to test and validate at each step.
## Pre-Migration Assessment

### Checklist
Before starting migration, assess your current setup:
- **Inventory Agents**: List all agents using cloud providers
- **Assess Hardware**: Verify you have sufficient resources for self-hosted models
- **Plan Timeline**: Schedule the migration during low-usage periods
- **Backup Configurations**: Save current agent configurations
- **Test Environment**: Set up a test environment if possible
- **Identify Critical Agents**: Determine which agents are low-risk enough to migrate first, and which should wait
### Agent Configuration Audit
Find all cloud-based agents:
```bash
# Find agents using cloud providers
grep -r "engine = \"gemini\"" agents/
grep -r "engine = \"openai\"" agents/
grep -r "engine = \"claude\"" agents/

# List all agent files
find agents/ -name "*.toml" -type f
```
Document current configuration:
```bash
# Create a backup
cp -r agents/ agents-backup-$(date +%Y%m%d)/

# Export the list of agent files
find agents/ -name "*.toml" > agent-list.txt
```
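To summarize the audit at a glance, a small loop can count agents per provider (a minimal sketch; it assumes your agent files use the `engine = "..."` syntax shown above):

```bash
# Count agents per cloud provider (assumes the engine = "..." syntax above)
for provider in gemini openai claude; do
  count=$(grep -rl "engine = \"$provider\"" agents/ | wc -l)
  echo "$provider: $count agent(s)"
done
```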
### Hardware Assessment

**Minimum Requirements:**
- **Ollama**: 8GB RAM (16GB recommended)
- **vLLM**: NVIDIA GPU with 16GB+ VRAM
- **LocalAI**: 8GB RAM (16GB recommended)
Check your system:
```bash
# System memory
free -h

# GPU (if available)
nvidia-smi

# Disk space (for models)
df -h
```
## Model Equivalency Mapping

### Cloud to Self-Hosted Equivalents
| Cloud Model | Self-Hosted Equivalent | Quality Match | Notes |
|---|---|---|---|
| GPT-4 | Llama-3-70B (vLLM) | ~85-90% | Best quality match |
| GPT-4 | Mixtral 8x7B | ~80-85% | Good alternative |
| GPT-3.5-turbo | Llama-3-8B | ~80-85% | Good match for most tasks |
| GPT-3.5-turbo | Mistral 7B | ~75-80% | Faster alternative |
| Claude 3 Opus | Llama-3-70B | ~80-85% | Close match |
| Claude 3 Sonnet | Llama-3-13B | ~75-80% | Good match |
| Gemini Pro | Llama-3-8B | ~75-80% | Reasonable match |
| Gemini Flash | Llama-3.2-3B | ~70-75% | Faster, lower quality |
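If you plan to serve these with Ollama, you can pre-pull candidate models before migrating. The tags below are examples; check the Ollama model library for the exact names available to you:

```bash
# Pre-pull candidate models (tags are examples; verify against the Ollama library)
ollama pull llama3:8b     # GPT-3.5-turbo / Gemini Pro tier
ollama pull llama3:70b    # GPT-4 / Claude 3 Opus tier
ollama pull mistral:7b    # fast, efficient alternative
```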
### Quality vs Performance Trade-offs
**High Quality (Slower):**
- Llama-3-70B (vLLM) - Best quality, requires a GPU
- Mixtral 8x7B - High quality, requires significant VRAM

**Balanced:**
- Llama-3-13B - Good balance of quality and speed
- Llama-3-8B - Fast and capable

**Fast (Lower Quality):**
- Llama-3.2-3B - Very fast, good for simple tasks
- Mistral 7B - Fast and efficient
## Migration Strategy

### Gradual Migration Approach
**Phase 1: Low-Risk Agents (Week 1)**
- Migrate non-critical agents first
- Test agents with simple tasks
- Validate output quality

**Phase 2: Medium-Risk Agents (Weeks 2-3)**
- Migrate agents of moderate importance
- Use the multi-tier strategy (local primary, cloud fallback)
- Monitor performance and quality

**Phase 3: Critical Agents (Week 4+)**
- Migrate high-importance agents
- Perform full testing and validation
- Keep the cloud as a premium tier
### Multi-Tier Safety Net
Use Radium's multi-tier model strategy during migration:
```toml
[agent.persona.models]
primary = "llama3.2"        # Self-hosted (new)
fallback = "gpt-4o-mini"    # Cloud (safety net)
premium = "gpt-4o"          # Cloud (when needed)
```
This allows you to:
- Test self-hosted models as the primary tier
- Fall back to the cloud automatically if issues occur
- Build confidence gradually
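One way to build that confidence is to stop the local server and confirm the agent still answers via the cloud fallback. A rough check, assuming a systemd-managed Ollama install (adjust the service commands to your setup):

```bash
# Stop the local model server (systemd example; adjust for your install)
sudo systemctl stop ollama

# The agent should now be served by the cloud fallback tier
rad run code-agent "Say hello"

# Bring the local server back up
sudo systemctl start ollama
```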
## Step-by-Step Migration

### Step 1: Set Up Self-Hosted Model Server
**Choose a provider:**
- **Ollama**: Easiest setup, good for testing
- **vLLM**: Best performance, requires a GPU
- **LocalAI**: Most flexible, good for experimentation
Follow the setup guide for your chosen provider (see Next Steps), then verify the server is running:
```bash
# Ollama
curl http://localhost:11434/api/tags

# vLLM
curl http://localhost:8000/v1/models

# LocalAI
curl http://localhost:8080/v1/models
```
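Rather than probing each endpoint by hand, a short loop can check all three defaults and report which server responds (a sketch using the endpoints above):

```bash
# Probe the default endpoints and report which servers respond
for url in \
  "http://localhost:11434/api/tags" \
  "http://localhost:8000/v1/models" \
  "http://localhost:8080/v1/models"; do
  if curl -sf --max-time 2 "$url" > /dev/null; then
    echo "UP:   $url"
  else
    echo "DOWN: $url"
  fi
done
```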
### Step 2: Configure Environment
Set environment variables:
```bash
# Ollama
export UNIVERSAL_BASE_URL="http://localhost:11434/v1"

# vLLM
export UNIVERSAL_BASE_URL="http://localhost:8000/v1"

# LocalAI
export UNIVERSAL_BASE_URL="http://localhost:8080/v1"
```
### Step 3: Migrate Single Agent

**Example: Migrating a code agent**

**Before (Cloud):**
```toml
[agent]
id = "code-agent"
name = "Code Agent"
description = "Code implementation agent"
prompt_path = "prompts/agents/core/code-agent.md"
engine = "gemini"
model = "gemini-2.0-flash-exp"

[agent.persona.models]
primary = "gemini-2.0-flash-exp"
fallback = "gpt-4o-mini"
```
**After (Self-Hosted with Fallback):**
```toml
[agent]
id = "code-agent"
name = "Code Agent"
description = "Code implementation agent"
prompt_path = "prompts/agents/core/code-agent.md"
engine = "universal"
model = "llama3.2"

[agent.persona.models]
primary = "llama3.2"                 # Self-hosted (new)
fallback = "gemini-2.0-flash-exp"    # Cloud (safety net)
premium = "gpt-4o"                   # Cloud (when needed)
```
**Steps:**
- Update the agent TOML file
- Set environment variables
- Test agent execution
- Compare outputs with the cloud version
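A quick smoke test after editing confirms the agent runs against the local server (a sketch using the Ollama URL from Step 2):

```bash
# Point the universal engine at the local server and smoke-test the agent
export UNIVERSAL_BASE_URL="http://localhost:11434/v1"
rad run code-agent "Write a one-line hello world in Python"
```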
### Step 4: Test and Validate

**Functional Testing:**
```bash
# Test the agent with the same prompts as before
rad run code-agent "Implement a function to sort a list"

# Compare outputs:
# - Quality: Is the output acceptable?
# - Completeness: Does it meet requirements?
# - Style: Is it consistent with previous outputs?
```
**Performance Testing:**
```bash
# Measure response time
time rad run code-agent "Test prompt"

# Compare with the cloud version:
# - Latency: Acceptable delay?
# - Throughput: Can it handle the load?
```
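A single `time` run can be noisy; averaging several runs gives a steadier number (a rough sketch; wall-clock times include CLI startup overhead):

```bash
# Average wall-clock latency over 5 runs (rough; includes CLI startup)
total=0
for i in 1 2 3 4 5; do
  start=$(date +%s%N)
  rad run code-agent "Test prompt" > /dev/null
  end=$(date +%s%N)
  total=$(( total + (end - start) / 1000000 ))
done
echo "average: $(( total / 5 )) ms"
```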
### Step 5: Monitor and Adjust

**Monitor for issues:**
- Check error rates
- Monitor response times
- Review output quality
- Track fallback usage
**Adjust configuration:**
- Tune model parameters
- Adjust reasoning effort
- Optimize hardware usage
- Fine-tune model selection
## Testing Methodology

### Output Comparison

**Side-by-Side Testing:**
- Run the same prompt on both cloud and self-hosted
- Compare outputs for:
  - Accuracy
  - Completeness
  - Style consistency
  - Code quality (if applicable)
**Example Test:**
```bash
# Cloud version
rad run code-agent-cloud "Implement quicksort"

# Self-hosted version
rad run code-agent "Implement quicksort"

# Compare outputs
```
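Capturing both outputs to files makes the comparison repeatable and diffable (a sketch building on the commands above):

```bash
# Capture both outputs and inspect the differences side by side
rad run code-agent-cloud "Implement quicksort" > quicksort-cloud.txt
rad run code-agent "Implement quicksort" > quicksort-local.txt
diff --side-by-side quicksort-cloud.txt quicksort-local.txt | less
```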
### Performance Comparison

**Metrics to Track:**
- **Latency**: Time to first token, total response time
- **Throughput**: Requests per second
- **Resource Usage**: CPU, memory, GPU utilization
- **Cost**: Cloud costs vs hardware costs
**Measurement:**
```bash
# Time execution
time rad run <agent> "<prompt>"

# Monitor resources
htop        # CPU/memory
nvidia-smi  # GPU (if applicable)
```
### Quality Assessment

**Criteria:**
- **Correctness**: Is the output correct?
- **Completeness**: Does it address all requirements?
- **Quality**: Is it production-ready?
- **Consistency**: Is the quality similar to the cloud version?
## Rollback Procedure

### Quick Rollback

Revert the agent configuration:
```bash
# Restore from backup
cp agents-backup-YYYYMMDD/<agent>.toml agents/<agent>.toml

# Or edit manually:
# - change engine back to the cloud provider
# - change model back to the cloud model
```
**Restart Radium:**
```bash
# Restart the server if needed, or just re-run the agent
rad run <agent> "<prompt>"
```
### Full Rollback

If the migration fails completely:
- Restore all agent configurations from backup
- Remove environment variables
- Restart Radium services
- Verify cloud agents work
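The full rollback can be scripted (a sketch; substitute your actual backup date):

```bash
# Full rollback: restore all agent configs from backup
cp -r agents-backup-YYYYMMDD/. agents/

# Remove the self-hosted environment override
unset UNIVERSAL_BASE_URL

# Verify a cloud agent works again
rad run code-agent "Sanity check"
```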
## Post-Migration Optimization

### Performance Tuning

**Optimize model selection:**
- Use smaller models for simple tasks
- Use larger models for complex tasks
- Balance quality vs speed
**Tune parameters:**
```toml
[agent]
reasoning_effort = "medium"  # Adjust based on needs

[agent.persona.performance]
profile = "balanced"  # speed, balanced, thinking, expert
```
### Cost Optimization

**Maximize local usage:**
- Use self-hosted as primary for all agents
- Keep cloud only as premium tier
- Monitor cloud usage to minimize costs
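A back-of-envelope break-even calculation makes the trade-off concrete. All figures below are placeholders to replace with your own numbers:

```bash
# Break-even sketch: months until hardware pays for itself
# (all figures are placeholders; substitute your own)
hardware_cost=2000    # one-time, e.g. a GPU workstation
monthly_cloud=150     # current monthly cloud API spend
monthly_power=20      # estimated electricity for local serving
echo "break-even after $(( hardware_cost / (monthly_cloud - monthly_power) )) months"
```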
**Hardware optimization:**
- Right-size hardware for workload
- Use GPU when available
- Optimize model selection
### Monitoring Setup

**Set up monitoring:**
- Track model server health
- Monitor agent performance
- Alert on failures
- Track cost savings
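A minimal health check that can run from cron is sketched below. The endpoint, log path, and alerting hook are placeholders; wire it into whatever alerting you already use:

```bash
#!/usr/bin/env bash
# Minimal model-server health check, suitable for cron.
# URL, log path, and alert mechanism are placeholders.
URL="${UNIVERSAL_BASE_URL:-http://localhost:11434/v1}/models"
if ! curl -sf --max-time 5 "$URL" > /dev/null; then
  echo "$(date -Is) model server unreachable at $URL" >> ~/radium-health.log
  # e.g. send a Slack webhook or mail notification here
fi
```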
## Success Metrics

### Migration Success Indicators
- **Functionality**: All agents work as before
- **Quality**: Output quality maintained or improved
- **Performance**: Acceptable latency and throughput
- **Cost**: Reduced cloud costs
- **Reliability**: Stable operation
### Measuring Success

**Before Migration:**
- Document current cloud costs
- Measure current performance
- Capture sample outputs
**After Migration:**
- Compare costs (cloud vs hardware)
- Compare performance metrics
- Compare output quality
- Track error rates
## Common Migration Issues

### Quality Degradation

**Problem:** Self-hosted model output quality is lower.

**Solutions:**
- Use a larger/better model
- Adjust reasoning effort
- Fine-tune prompts
- Use cloud as fallback for critical tasks
### Performance Issues

**Problem:** The self-hosted model is too slow.

**Solutions:**
- Use a smaller/faster model
- Optimize hardware (GPU, more RAM)
- Reduce context length
- Use cloud for time-sensitive tasks
### Compatibility Issues

**Problem:** Some features don't work with self-hosted models.

**Solutions:**
- Check model capabilities
- Use cloud for unsupported features
- Update to newer model version
- Report issues for future support
## Best Practices

- **Start Small**: Migrate one agent at a time
- **Test Thoroughly**: Validate before full migration
- **Keep Fallback**: Use the multi-tier strategy
- **Monitor Closely**: Watch for issues early
- **Document Changes**: Keep track of what works
- **Optimize Gradually**: Fine-tune over time
## Next Steps
- Review Setup Guides for provider installation
- Check Configuration Guide for agent setup
- See Troubleshooting Guide for issues
- Explore Advanced Configuration for optimization