vLLM Setup Guide
Overview
vLLM is a high-performance LLM inference server optimized for high throughput and low latency. It's ideal for production deployments that require GPU acceleration and serve many concurrent requests.
Setup Time: ~15 minutes
Best For: Production deployments, GPU inference, high throughput
Prerequisites
- NVIDIA GPU with CUDA support (required)
- CUDA 11.8+ or CUDA 12.1+
- Python 3.8+ (if not using Docker)
- Docker (recommended), optionally with Docker Compose
- 16GB+ VRAM (for 7B models), 40GB+ VRAM (for 13B+ models)
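If you're unsure whether the machine meets these requirements, a quick sanity check (assuming the NVIDIA driver and a system Python are already installed) looks like this:

```bash
# Confirm the GPU, driver, and CUDA version are visible
nvidia-smi

# Check the Python version (only needed for non-Docker installs)
python3 --version
```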
Installation
Docker (Recommended)
The easiest way to run vLLM is with Docker:
docker run --gpus all \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--port 8000
This command:
- Uses all available GPUs (--gpus all)
- Exposes port 8000
- Mounts Hugging Face cache for model persistence
- Loads the Llama-3-8B-Instruct model
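Note that the Llama 3 weights are gated on Hugging Face, so the container needs an access token to download them on first run. One way to pass it, assuming your token is exported in an HF_TOKEN shell variable, is:

```bash
# Pass a Hugging Face access token so gated models (e.g., Llama 3) can be downloaded
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --port 8000
```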
Docker Compose
For easier management, use Docker Compose:
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --port 8000
      --max-num-seqs 256
      --gpu-memory-utilization 0.9
Save as docker-compose.yml and run:
docker-compose up -d
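The first start can take several minutes while the model downloads and loads. To follow progress (the service name vllm matches the compose file above):

```bash
# Tail the server logs until the API reports it is listening on port 8000
docker-compose logs -f vllm
```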
Python Installation (Advanced)
For bare metal installation:
# Install CUDA toolkit first (if not already installed)
# Then install vLLM
pip install vllm
# Or from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
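To confirm a bare-metal install worked, a minimal import check is usually enough:

```bash
# Verify vLLM imports and print the installed version
python -c "import vllm; print(vllm.__version__)"
```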
Starting the Server
Basic Command
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000
With Custom Parameters
vllm serve meta-llama/Llama-3-8B-Instruct \
--port 8000 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 1
Common Parameters
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| --port | Server port | 8000 | 8000 |
| --max-num-seqs | Max concurrent sequences | 256 | 256 (adjust based on VRAM) |
| --gpu-memory-utilization | GPU memory usage (0-1) | 0.9 | 0.9 |
| --tensor-parallel-size | Number of GPUs for tensor parallelism | 1 | 1 (single GPU) or 2+ (multi-GPU) |
| --max-model-len | Max sequence length | Auto | 4096 or 8192 |
| --quantization | Quantization method | None | awq or gptq for smaller VRAM |
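As an illustration of how these flags combine, the following is one plausible starting point for a single 24 GB GPU; the values are examples to tune, not official recommendations:

```bash
# Example: cap context length and concurrency to fit a 24 GB card
vllm serve meta-llama/Llama-3-8B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.9
```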
Kubernetes Deployment
For production Kubernetes deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Llama-3-8B-Instruct
            - --port
            - "8000"
            - --max-num-seqs
            - "256"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: LoadBalancer
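Assuming you save the manifests above as vllm.yaml and have already created a PersistentVolumeClaim named vllm-model-cache, a typical rollout looks like:

```bash
# Deploy, wait for the pod to become ready, then test locally via port-forward
kubectl apply -f vllm.yaml
kubectl get pods -l app=vllm -w
kubectl port-forward svc/vllm 8000:8000
```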
GPU Requirements
Minimum Requirements
| Model Size | VRAM Required | GPU Examples |
|---|---|---|
| 7B | 16GB | RTX 3090, RTX 4090, A100 40GB |
| 13B | 24GB | RTX 4090, A100 40GB |
| 30B+ | 40GB+ | A100 40GB, A100 80GB |
Recommended GPUs
- NVIDIA A100 (40GB or 80GB) - Best for production
- NVIDIA RTX 4090 (24GB) - Good for 7B-13B models
- NVIDIA RTX 3090 (24GB) - Budget option for 7B models
Multi-GPU Setup
For models requiring multiple GPUs:
vllm serve meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000
This distributes the model across 4 GPUs using tensor parallelism.
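The tensor-parallel size cannot exceed the number of GPUs visible to the process (and it generally needs to divide the model's attention head count evenly). To see what's available:

```bash
# List the GPUs vLLM can use for tensor parallelism
nvidia-smi -L
```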
Model Loading
Supported Models
vLLM supports models from Hugging Face:
- Llama: meta-llama/Llama-3-8B-Instruct, meta-llama/Llama-3-70B-Instruct
- Mistral: mistralai/Mistral-7B-Instruct-v0.2
- CodeLlama: codellama/CodeLlama-7b-Instruct-hf
- Mixtral: mistralai/Mixtral-8x7B-Instruct-v0.1
Model Download
Models are automatically downloaded from Hugging Face on first use. They're cached in ~/.cache/huggingface/.
To pre-download a model:
python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3-8B-Instruct')"
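Alternatively, if the huggingface_hub CLI is installed, you can fetch the files without loading the weights into memory; gated models such as Llama 3 require logging in first:

```bash
# Authenticate once for gated repositories, then download into the local cache
huggingface-cli login
huggingface-cli download meta-llama/Llama-3-8B-Instruct
```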
Performance Tuning
Throughput Optimization
Increase concurrent requests:
vllm serve meta-llama/Llama-3-8B-Instruct \
--max-num-seqs 512 \
--gpu-memory-utilization 0.95
Latency Optimization
Reduce concurrent requests for lower latency:
vllm serve meta-llama/Llama-3-8B-Instruct \
--max-num-seqs 64 \
--gpu-memory-utilization 0.8
Memory Optimization
Use quantization for smaller VRAM footprint:
vllm serve meta-llama/Llama-3-8B-Instruct \
--quantization awq \
--gpu-memory-utilization 0.9
Note that --quantization awq (and gptq) generally expects weights that were already quantized with that method, so point --model at a pre-quantized checkpoint when using these flags.
Verifying Installation
Test that vLLM is running:
# Check server health
curl http://localhost:8000/health
# List available models
curl http://localhost:8000/v1/models
# Test a completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
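Because the API is OpenAI-compatible, you can also ask for a streamed response; setting stream to true returns the completion incrementally as server-sent events:

```bash
# Request a streaming chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```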
Using with Radium
Configuration via Universal Provider
vLLM provides an OpenAI-compatible API. Configure Radium to use it:
Environment Variables:
export UNIVERSAL_BASE_URL="http://localhost:8000/v1"
export UNIVERSAL_MODEL_ID="meta-llama/Llama-3-8B-Instruct"
Agent Configuration (TOML):
[agent]
id = "vllm-agent"
name = "vLLM Agent"
description = "Agent using vLLM for high-performance inference"
prompt_path = "prompts/agents/my-agents/vllm-agent.md"
engine = "universal"
model = "meta-llama/Llama-3-8B-Instruct"
Example Agent Configuration
Create agents/my-agents/vllm-agent.toml:
[agent]
id = "vllm-agent"
name = "vLLM Agent"
description = "High-performance agent using vLLM"
prompt_path = "prompts/agents/my-agents/vllm-agent.md"
engine = "universal"
model = "meta-llama/Llama-3-8B-Instruct"
[agent.persona.models]
primary = "meta-llama/Llama-3-8B-Instruct"
fallback = "meta-llama/Llama-3-70B-Instruct"
Note: The exact configuration format may vary based on how Radium's engine system resolves Universal provider endpoints. Check the agent configuration guide for the latest patterns.
Troubleshooting
GPU Not Detected
Problem: vLLM doesn't detect GPU.
Solutions:
- Verify CUDA is installed: nvidia-smi
- Check Docker GPU access: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
- Ensure the NVIDIA Container Toolkit is installed
Out of Memory
Problem: Model fails to load due to insufficient VRAM.
Solutions:
- Use a smaller model
- Reduce --gpu-memory-utilization (e.g., 0.7)
- Use quantization: --quantization awq
- Reduce --max-num-seqs
Slow Performance
Problem: Inference is slower than expected.
Solutions:
- Check GPU utilization: nvidia-smi
- Increase --max-num-seqs for better throughput
- Verify the model is loaded on GPU (not CPU)
- Check for thermal throttling
Connection Refused
Problem: Can't connect to vLLM server.
Solutions:
- Verify the server is running: curl http://localhost:8000/health
- Check the port is correct: netstat -an | grep 8000
- Verify firewall settings
- Check Docker port mapping
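For any of the issues above, the server logs and live GPU stats are usually the quickest diagnostics (replace the placeholder container ID with the one shown by docker ps):

```bash
# Inspect vLLM's startup logs for errors, and watch GPU usage while sending requests
docker ps
docker logs <container-id>
watch -n 1 nvidia-smi
```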
Next Steps
- Configure Your Agent: See the agent configuration guide
- Optimize Performance: Tune parameters based on your workload
- Set Up Monitoring: Monitor GPU usage and throughput
- Scale Deployment: Consider Kubernetes for production