Troubleshooting Guide

Overview

This guide helps you diagnose and resolve common issues when using self-hosted models with Radium. Issues are organized by symptom to help you quickly find solutions.

Quick Diagnostic Commands

Check Model Server Status

# Ollama
curl http://localhost:11434/api/tags

# vLLM
curl http://localhost:8000/v1/models

# LocalAI
curl http://localhost:8080/v1/models
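
If you are not sure which server is running, a quick probe of the default endpoints narrows it down. A minimal sketch, assuming the default ports listed above:

# Probe each default endpoint; -sf is silent and fails on HTTP errors
for url in http://localhost:11434/api/tags \
           http://localhost:8000/v1/models \
           http://localhost:8080/v1/models; do
  curl -sf --max-time 3 "$url" > /dev/null \
    && echo "reachable:   $url" \
    || echo "no response: $url"
done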

Check Network Connectivity

# Test port accessibility
telnet localhost 11434 # Ollama
telnet localhost 8000 # vLLM
telnet localhost 8080 # LocalAI

# Check if port is in use
netstat -an | grep 11434
lsof -i :11434
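
If telnet is not installed, nc (netcat) performs the same port check. A minimal equivalent, assuming the default ports:

nc -zv localhost 11434 # Ollama
nc -zv localhost 8000 # vLLM
nc -zv localhost 8080 # LocalAI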

Check Radium Configuration

# List agents
rad agents list

# Test agent execution
rad run <agent-id> "test"

Common Errors

Connection Refused

Error Message:

RequestError: Connection refused
RequestError: Network error: ... connection refused

Diagnosis:

  1. Check if the model server is running:

    # Ollama
    ps aux | grep ollama
    docker ps | grep ollama

    # vLLM
    docker ps | grep vllm

    # LocalAI
    docker ps | grep localai
  2. Verify the port is correct:

    # Ollama: 11434
    # vLLM: 8000
    # LocalAI: 8080
  3. Check firewall settings:

    # Linux
    sudo ufw status
    sudo iptables -L

    # macOS
    # Check System Preferences → Security & Privacy → Firewall

Solutions:

  1. Start the model server:

    # Ollama
    ollama serve

    # vLLM (Docker)
    docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model <model>

    # LocalAI (Docker)
    docker-compose up -d
  2. Verify environment variables:

    echo $UNIVERSAL_BASE_URL
    # Should be: http://localhost:11434/v1 (Ollama)
    # http://localhost:8000/v1 (vLLM)
    # http://localhost:8080/v1 (LocalAI)
  3. Check that the base URL includes /v1 (a quick verification is sketched after this list):

    # Correct
    export UNIVERSAL_BASE_URL="http://localhost:11434/v1"

    # Incorrect (missing /v1)
    export UNIVERSAL_BASE_URL="http://localhost:11434"
  4. For remote servers, check network access:

    ping <server-ip>
    curl http://<server-ip>:11434/api/tags
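
To confirm that the base URL from steps 2 and 3 points at a live OpenAI-compatible endpoint, a quick check like the following helps. This is a sketch that assumes UNIVERSAL_BASE_URL is set and ends in /v1:

echo "Base URL: $UNIVERSAL_BASE_URL"
# Appending /models should return the server's model list if the URL is correct
curl -sf "$UNIVERSAL_BASE_URL/models" > /dev/null \
  && echo "endpoint reachable" \
  || echo "endpoint NOT reachable - recheck host, port, and the /v1 suffix"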

Model Not Found

Error Message:

ModelResponseError: Model not found
UnsupportedModelProvider: Model 'xxx' not available

Diagnosis:

  1. Ollama - List available models:

    ollama list
  2. vLLM - Check loaded models:

    curl http://localhost:8000/v1/models
  3. LocalAI - Check configured models:

    curl http://localhost:8080/v1/models
    ls -la config/ # Check model config files

Solutions:

  1. Download the model:

    # Ollama
    ollama pull llama3.2

    # vLLM - Model loads automatically from Hugging Face
    # Check server logs for download progress

    # LocalAI - Install via gallery or manually
    curl http://localhost:8080/models/apply -d '{"id": "ggml-gpt4all-j"}'
  2. Verify model name matches exactly:

    # Case-sensitive, must match exactly
    # Correct: llama3.2
    # Incorrect: Llama3.2, llama-3.2, llama3
  3. Check the agent configuration (the sketch after this list compares it against the server's model list):

    [agent]
    model = "llama3.2" # Must match model name on server

Timeout Errors

Error Message:

RequestError: Request timeout
RequestError: Operation timed out

Diagnosis:

  1. Check server response time (a timed generation request is sketched after this list):

    time curl http://localhost:11434/v1/models
  2. Check hardware resources:

    # CPU usage
    top
    htop

    # Memory usage
    free -h

    # GPU usage (if applicable)
    nvidia-smi
  3. Check model server logs for errors
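
The /v1/models check above measures only connection latency. To see how long actual generation takes, and whether it exceeds your timeout, time a small completion request directly. A sketch, assuming Ollama's default port and the llama3.2 model; substitute your own:

time curl -sf --max-time 300 http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 32
  }'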

Solutions:

  1. Increase timeout (if configurable):

    export UNIVERSAL_TIMEOUT=120  # Increase from default 60s
  2. Reduce request complexity:

    • Use smaller max_tokens
    • Reduce context length
    • Use a faster/smaller model
  3. Optimize hardware:

    • Ensure sufficient RAM/VRAM
    • Use GPU if available
    • Close other applications
  4. Check for resource constraints:

    # Check swap usage (indicates memory pressure)
    swapon --show
    free -h

Out of Memory

Error Message:

ModelResponseError: Out of memory
RequestError: Insufficient memory

Diagnosis:

  1. Check available memory:

    # System memory
    free -h

    # GPU memory (if using GPU)
    nvidia-smi
  2. Check model size against available memory (a quick VRAM check follows this list):

    • 7B model: ~14GB RAM/VRAM
    • 13B model: ~26GB RAM/VRAM
    • 30B+ model: 40GB+ VRAM
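
For GPU inference, a rough sufficiency check against the figures above can be scripted. This is a sketch: the 14GB threshold illustrates a 7B model at fp16, an NVIDIA GPU is assumed, and quantized models need far less:

# Free VRAM in MiB on the first GPU
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
needed_mib=$((14 * 1024)) # ~14GB for a 7B model at fp16 (illustrative)
if [ "$free_mib" -ge "$needed_mib" ]; then
  echo "enough free VRAM: ${free_mib} MiB"
else
  echo "only ${free_mib} MiB free - use a smaller or quantized model"
fi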

Solutions:

  1. Use a smaller model:

    # Ollama - Use quantized model
    ollama pull llama3.2 # ~2GB
    # Instead of
    ollama pull llama3.2:13b # ~7GB
  2. Reduce memory usage:

    # vLLM - Reduce GPU memory utilization
    vllm serve <model> --gpu-memory-utilization 0.7

    # LocalAI - Reduce context size
    # Edit config YAML: context_size: 2048
  3. Close other applications to free memory

  4. Use CPU-only inference (slower, but avoids GPU memory limits)

API Compatibility Issues

Error Message:

SerializationError: Failed to parse response
ModelResponseError: Invalid API response format

Diagnosis:

  1. Test API endpoint directly:

    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "test"}]
      }'
  2. Check response format matches OpenAI spec
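
A quick way to check the shape is to pull the fields an OpenAI-style client expects and see whether they exist. A sketch that reuses the request from step 1 and requires jq:

curl -sf http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "test"}]}' \
  | jq '{id, model, content: .choices[0].message.content, usage}'
# Nulls or jq errors here usually mean the server is not returning
# an OpenAI-style response at this path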

Solutions:

  1. Verify endpoint path:

    # Must include /v1
    export UNIVERSAL_BASE_URL="http://localhost:11434/v1"
  2. Check server supports OpenAI API:

    • Ollama: OpenAI-compatible endpoint is built in and enabled by default
    • vLLM: Native OpenAI compatibility
    • LocalAI: Configured via model YAML
  3. Update server version if using outdated software

Authentication Errors

Error Message:

UnsupportedModelProvider: Authentication failed
RequestError: 401 Unauthorized

Diagnosis:

  1. Check if the server requires authentication:

    # Most local servers don't require auth
    curl http://localhost:11434/v1/models
  2. Verify API key if required:

    echo $UNIVERSAL_API_KEY

Solutions:

  1. Remove API key for local servers:

    unset UNIVERSAL_API_KEY
    # Or use without_auth() constructor
  2. Set the correct API key if one is required (a direct test request is sketched after this list):

    export UNIVERSAL_API_KEY="your-api-key"
  3. Check server authentication settings
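
If the server does require a key, you can test it outside Radium with a direct request. A sketch; the Bearer header follows the usual OpenAI convention, which most OpenAI-compatible servers accept:

# A 200 response means the key is accepted; 401 means it is wrong or missing
curl -i http://localhost:11434/v1/models \
  -H "Authorization: Bearer $UNIVERSAL_API_KEY"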

Performance Issues

Slow Inference

Symptoms:

  • Long response times
  • Low tokens/second (a rough measurement is sketched at the end of this subsection)

Diagnosis:

  1. Check hardware utilization:

    # CPU
    top

    # GPU
    nvidia-smi -l 1
  2. Check model server logs for warnings

Solutions:

  1. Use GPU if available:

    # Ollama - Automatically uses GPU if detected
    # vLLM - Requires GPU
    # LocalAI - Configure gpu_layers in model config
  2. Use a faster model:

    • Smaller models are faster
    • Quantized models (Q4, Q8) are faster
  3. Optimize server settings:

    # vLLM - Increase batch size
    vllm serve <model> --max-num-seqs 512

    # LocalAI - Increase threads
    # Edit config: threads: 8
  4. Reduce context length if not needed
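
To put a number on "slow", a rough tokens-per-second estimate can be taken from a single request. A sketch for bash; it assumes the server returns an OpenAI-style usage object and that jq is installed:

SECONDS=0 # bash builtin that counts elapsed seconds
tokens=$(curl -sf http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Write a short paragraph."}]}' \
  | jq '.usage.completion_tokens')
echo "generated ${tokens} tokens in ${SECONDS}s"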

High Latency

Symptoms:

  • Long time to first token
  • Slow initial response

Solutions:

  1. Pre-warm the model (a complete request example follows this list):

    # Make a test request to load model
    curl http://localhost:11434/v1/chat/completions ...
  2. Use streaming for better perceived performance

  3. Reduce model size for faster loading
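
A complete pre-warm request might look like this. A sketch; the model name and port are assumptions, and the goal is only to force the model into memory before real traffic arrives:

curl -sf http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}' \
  > /dev/null && echo "model loaded"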

Network Issues

Cannot Access Remote Server

Symptoms:

  • Connection works locally but not remotely
  • Timeout when accessing from another machine

Solutions:

  1. Check the server binding (a quick way to confirm it is sketched after this list):

    # Ollama - Bind to 0.0.0.0
    OLLAMA_HOST=0.0.0.0:11434 ollama serve
  2. Configure firewall:

    # Allow port in firewall
    sudo ufw allow 11434
  3. Check network connectivity:

    ping <server-ip>
    telnet <server-ip> 11434
  4. Verify base URL:

    # Use server IP or hostname
    export UNIVERSAL_BASE_URL="http://192.168.1.100:11434/v1"

Diagnostic Decision Tree

Is the model server running?
├─ No → Start the server
└─ Yes → Can you connect to the endpoint?
   ├─ No → Check firewall/network
   └─ Yes → Is the model available?
      ├─ No → Download/configure the model
      └─ Yes → Check agent configuration
         ├─ Wrong model name → Fix model name
         ├─ Wrong endpoint → Fix base URL
         └─ Other → Check logs

Log Analysis

Radium Logs

Location:

  • Default: logs/radium-core.log
  • Or check console output

What to look for:

  • Connection errors
  • Model creation failures
  • Request/response details
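
A quick filter for those patterns, assuming the default log path above:

tail -n 200 logs/radium-core.log | grep -iE "error|failed|refused|timeout"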

Model Server Logs

Ollama:

# Check service logs
journalctl -u ollama -f

# Docker logs
docker logs ollama -f

vLLM:

# Docker logs
docker logs vllm -f

LocalAI:

# Docker logs
docker logs localai -f

Common Log Patterns

Connection Refused:

ERROR: Connection refused
ERROR: Failed to connect to http://localhost:11434

Model Not Found:

ERROR: Model 'xxx' not found
ERROR: Model not available

Timeout:

ERROR: Request timeout
ERROR: Operation timed out
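
To pull just these patterns out of a noisy log, a grep over recent output is usually enough. A sketch; the service and container names assume the setups shown above:

# systemd-managed Ollama
journalctl -u ollama --since "1 hour ago" | grep -iE "error|refused|timeout" | tail -n 50

# Docker-based servers (docker logs writes to stderr, hence 2>&1)
docker logs vllm 2>&1 | grep -iE "error|refused|timeout" | tail -n 50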

Still Stuck?

Additional Resources

  1. Check Provider Documentation:

  2. Review Setup Guides:

  3. Check Configuration:

  4. Community Support:

    • GitHub Issues
    • Discord/Slack (if available)
    • Stack Overflow

Collecting Debug Information

When seeking help, provide the following (the script after this list collects most of it in one pass):

  1. Error message (exact text)
  2. Model server logs (last 50 lines)
  3. Radium logs (relevant sections)
  4. Configuration:
    # Agent config
    cat agents/my-agents/<agent>.toml

    # Environment variables
    env | grep UNIVERSAL
  5. System information:
    # OS
    uname -a

    # Docker (if used)
    docker version

    # GPU (if applicable)
    nvidia-smi
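
A small script can gather most of this in one pass. A sketch; the paths and commands mirror the examples above, and the output may include UNIVERSAL_API_KEY, so review it before sharing:

{
  echo "== environment =="; env | grep UNIVERSAL
  echo "== system ==";      uname -a
  echo "== docker ==";      docker version 2>/dev/null
  echo "== gpu ==";         nvidia-smi 2>/dev/null
  echo "== radium log ==";  tail -n 50 logs/radium-core.log 2>/dev/null
} > debug-info.txt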

Next Steps