Advanced Configuration Guide

Overview​

This guide covers advanced deployment patterns for self-hosted models, including load balancing, high availability, health checks, and performance optimization.

Load Balancing​

nginx Load Balancer for Ollama​

Configuration: nginx.conf

upstream ollama_backend {
    least_conn;
    # Compose service names from the docker-compose.yml below; each instance
    # listens on its container port 11434.
    server ollama1:11434;
    server ollama2:11434;
    server ollama3:11434;
}

server {
    listen 11434;

    location / {
        proxy_pass http://ollama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }
}

Docker Compose:

version: '3.8'

services:
  ollama1:
    image: ollama/ollama:latest
    # No host ports published; instances are reached through nginx on 11434
    volumes:
      - ollama1-data:/root/.ollama

  ollama2:
    image: ollama/ollama:latest
    volumes:
      - ollama2-data:/root/.ollama

  ollama3:
    image: ollama/ollama:latest
    volumes:
      - ollama3-data:/root/.ollama

  nginx:
    image: nginx:alpine
    ports:
      - "11434:11434"   # single entry point for clients
    volumes:
      # Mounted into conf.d so the upstream/server blocks above are included
      # inside the stock http {} context of the default nginx.conf.
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - ollama1
      - ollama2
      - ollama3

volumes:
  ollama1-data:
  ollama2-data:
  ollama3-data:
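
To bring the stack up and confirm that requests reach the backends through nginx, a quick smoke test along these lines should work (assuming the Compose file and nginx.conf above sit in the current directory and Docker Compose v2 is installed):

docker compose up -d

# Each request goes through nginx on port 11434; repeat a few times to
# exercise the least_conn balancing.
for i in 1 2 3; do
  curl -s http://localhost:11434/api/tags > /dev/null && echo "request $i ok"
done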

HAProxy Load Balancer for vLLM​

Configuration: haproxy.cfg

global
    log stdout format raw local0
    maxconn 4096

defaults
    mode http
    timeout connect 60s
    timeout client 60s
    timeout server 60s

frontend vllm_frontend
    bind *:8000
    default_backend vllm_backend

backend vllm_backend
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    # vLLM instances listen on 8001-8003 so they don't collide with the
    # frontend's bind on port 8000.
    server vllm1 localhost:8001 check
    server vllm2 localhost:8002 check
    server vllm3 localhost:8003 check
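
The backend above expects three vLLM instances on ports 8001-8003. A minimal sketch of starting them, assuming one GPU per instance (adjust CUDA_VISIBLE_DEVICES and the model name to your hardware and deployment):

CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3-8B-Instruct --port 8001 &
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3-8B-Instruct --port 8002 &
CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3-8B-Instruct --port 8003 &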

High Availability​

Active-Passive Failover​

Agent Configuration:

[agent]
id = "ha-agent"
name = "High Availability Agent"
description = "Agent with automatic failover"
prompt_path = "prompts/agents/my-agents/ha-agent.md"
engine = "universal"
model = "llama3.2"

[agent.persona.models]
primary = "llama3.2" # Primary endpoint
fallback = "llama3.2" # Secondary endpoint (same model, different server)
premium = "gpt-4o-mini" # Cloud fallback

Environment Setup:

# Primary endpoint
export UNIVERSAL_BASE_URL="http://ollama-primary:11434/v1"

# Fallback endpoint (configured in engine system)
# Note: This may require custom engine configuration
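
Until fallback routing is handled natively by the engine system, one stopgap is a small wrapper script that probes the primary endpoint and exports the secondary's URL when the primary is down. This is only a sketch: the ollama-primary/ollama-secondary hostnames are placeholders, and how UNIVERSAL_BASE_URL is consumed depends on your deployment.

#!/bin/bash
# Hypothetical failover wrapper: probe the primary, fall back to the secondary,
# then run the wrapped command with the selected endpoint.
PRIMARY="http://ollama-primary:11434"
SECONDARY="http://ollama-secondary:11434"

if curl -sf --max-time 5 "$PRIMARY/v1/models" > /dev/null; then
    export UNIVERSAL_BASE_URL="$PRIMARY/v1"
else
    echo "Primary unreachable, using secondary endpoint" >&2
    export UNIVERSAL_BASE_URL="$SECONDARY/v1"
fi

exec "$@"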

Health Checks​

Simple Health Check Script: health-check.sh

#!/bin/bash

ENDPOINT="${1:-http://localhost:11434/v1/models}"
TIMEOUT=5

response=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT" "$ENDPOINT")

if [ "$response" -eq 200 ]; then
    echo "Healthy"
    exit 0
else
    echo "Unhealthy (HTTP $response)"
    exit 1
fi
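
Usage (the second example assumes a vLLM instance exposing the /health endpoint used in the HAProxy check above):

chmod +x health-check.sh
./health-check.sh                                # default: local Ollama
./health-check.sh http://localhost:8000/health   # a vLLM instance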

Cron Job for Monitoring:

# Check every minute
* * * * * /path/to/health-check.sh http://localhost:11434/v1/models
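
To act on a failed check rather than just record it, the entry can be extended; the example below assumes the instance runs in a Docker container named ollama, which is a placeholder for your own restart mechanism:

# Log results and restart the (hypothetical) "ollama" container when the check fails
* * * * * /path/to/health-check.sh http://localhost:11434/v1/models >> /var/log/ollama-health.log 2>&1 || docker restart ollama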

Kubernetes Health Checks​

Deployment with Liveness/Readiness Probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 5
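
Assuming the manifest is saved as ollama-deployment.yaml, rolling it out and confirming that the probes pass looks like this:

kubectl apply -f ollama-deployment.yaml
kubectl rollout status deployment/ollama
kubectl get pods -l app=ollama   # pods report READY 1/1 once the readiness probe passes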

Performance Optimization​

Connection Pooling​

Optimized HTTP Client Configuration:

# Connection pooling for UniversalModel is handled internally,
# but request behavior can be tuned through environment variables:
export UNIVERSAL_TIMEOUT=120      # Request timeout (seconds)
export UNIVERSAL_MAX_RETRIES=3    # Retry attempts
export UNIVERSAL_RETRY_DELAY=1    # Retry delay (seconds)

Batch Processing​

Agent Configuration for Batch:

[agent]
id = "batch-agent"
name = "Batch Processing Agent"
description = "Optimized for batch processing"
prompt_path = "prompts/agents/my-agents/batch-agent.md"
engine = "universal"
model = "llama3.2"
reasoning_effort = "low"

[agent.persona.performance]
profile = "speed"
estimated_tokens = 1000
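
A minimal batch-driver sketch, sending one prompt per line of prompts.txt to the local OpenAI-compatible endpoint. The endpoint and model name mirror the examples above and are assumptions; adapt them to however your agents are actually invoked. Requires jq for safe JSON construction.

while IFS= read -r prompt; do
  jq -n --arg p "$prompt" \
    '{model: "llama3.2", messages: [{role: "user", content: $p}]}' |
    curl -s http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d @- >> responses.jsonl
done < prompts.txt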

Model-Specific Tuning​

vLLM Performance Tuning:

# High throughput
vllm serve meta-llama/Llama-3-8B-Instruct \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 1

# Low latency
vllm serve meta-llama/Llama-3-8B-Instruct \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.8 \
  --tensor-parallel-size 1
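
Either configuration serves the standard OpenAI-compatible API; a quick check that the server responds (port and model name as in the examples above):

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'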

Ollama Performance Tuning:

# Use GPU layers
OLLAMA_NUM_GPU=1 ollama serve

# Set thread count
OLLAMA_NUM_THREAD=8 ollama serve
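
Two related knobs that often matter for throughput are keeping models resident between requests and verifying where a model actually loaded; for example:

# Keep models loaded between requests (avoids reload latency)
OLLAMA_KEEP_ALIVE=30m ollama serve

# Check whether a loaded model is running on GPU or CPU
ollama ps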

Cost Optimization​

Local-First Strategy​

Agent Configuration:

[agent]
id = "cost-optimized"
name = "Cost-Optimized Agent"
description = "Maximize local usage"
prompt_path = "prompts/agents/my-agents/cost-optimized.md"
engine = "universal"
model = "llama3.2"

[agent.persona.models]
primary = "llama3.2" # Free local
fallback = "llama3.2:13b" # Free local (better)
premium = "gpt-4o-mini" # Cheap cloud ($0.15/1M tokens)

Hybrid Cost Strategy​

Agent Configuration:

[agent]
id = "hybrid-cost"
name = "Hybrid Cost Agent"
description = "Balance cost and quality"
prompt_path = "prompts/agents/my-agents/hybrid-cost.md"
engine = "universal"
model = "llama3.2"

[agent.persona.models]
primary = "llama3.2" # Free local (most requests)
fallback = "gemini-2.0-flash-exp" # Cheap cloud ($0.075/1M tokens)
premium = "gpt-4o" # Expensive cloud (only when needed)

Monitoring and Observability​

Prometheus Metrics​

Example Metrics to Track:

  • Request latency (p50, p95, p99)
  • Request rate (requests/second)
  • Error rate (errors/requests)
  • Model availability (uptime)
  • Token usage (tokens/second)
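
As an example, the latency percentiles can be pulled straight from Prometheus's HTTP API; the metric name below (request_duration_seconds) is a placeholder for whatever your gateway or exporter actually exposes, and the address assumes a Prometheus server on its default port:

# p95 request latency over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le))'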

Grafana Dashboard​

Key Metrics to Display:

  1. Request Rate: Requests per second by model
  2. Latency: Response time percentiles
  3. Error Rate: Failed requests percentage
  4. Model Health: Uptime and availability
  5. Resource Usage: CPU, memory, GPU utilization

Logging​

Structured Logging:

# Enable debug logging
export RUST_LOG=radium_core::engines=debug

# Log to file
export RUST_LOG=radium_core::engines=debug,file=radium.log

Security​

Network Isolation​

Docker Network Configuration:

version: '3.8'

networks:
  internal:
    driver: bridge
    internal: true   # No external access

services:
  ollama:
    image: ollama/ollama:latest
    networks:
      - internal
    # No ports exposed externally

  radium:
    image: radium:latest
    networks:
      - internal
    # Can access ollama via the internal network
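
A quick way to verify the isolation, assuming curl is available inside the radium image:

# Reachable from inside the internal network...
docker compose exec radium curl -s http://ollama:11434/api/tags

# ...but not from the host (no published ports, internal network)
curl -s --max-time 2 http://localhost:11434/api/tags || echo "not reachable from host (expected)"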

Authentication​

API Key Protection:

# Use secrets management
export UNIVERSAL_API_KEY=$(cat /run/secrets/universal_api_key)

# Or use environment file with restricted permissions
chmod 600 .env

Firewall Rules​

iptables Example:

# Allow only localhost access
iptables -A INPUT -p tcp --dport 11434 -s 127.0.0.1 -j ACCEPT
iptables -A INPUT -p tcp --dport 11434 -j DROP
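
These rules do not survive a reboot on their own; on Debian/Ubuntu, for example, the iptables-persistent package restores rules saved like this:

# Persist the rules across reboots (Debian/Ubuntu with iptables-persistent)
iptables-save > /etc/iptables/rules.v4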

Scaling Strategies​

Horizontal Scaling​

Multiple Model Instances:

version: '3.8'

services:
  ollama-1:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"

  ollama-2:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"

  ollama-3:
    image: ollama/ollama:latest
    ports:
      - "11436:11434"

  nginx:
    image: nginx:alpine
    # Load balance across instances (see the nginx configuration above)

Vertical Scaling​

Resource Allocation:

services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 32G
        reservations:
          cpus: '4'
          memory: 16G
          # GPU access in Compose is requested via device reservations;
          # the nvidia.com/gpu resource key is Kubernetes syntax.
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
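
With the NVIDIA Container Toolkit installed on the host, the reservation above should make the GPU visible inside the container, which can be confirmed with:

docker compose up -d vllm
docker compose exec vllm nvidia-smi   # should list the reserved GPU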

Best Practices​

  1. Start Simple: Begin with a single instance, then scale as needed
  2. Monitor First: Set up monitoring before optimizing
  3. Test Failover: Regularly test failover scenarios
  4. Document Configurations: Keep track of what works
  5. Version Control: Store configurations in version control
  6. Backup Models: Regularly backup model files
  7. Security First: Apply security best practices from the start

Next Steps​