# LocalAI Setup Guide
## Overview
LocalAI is a versatile local inference server that supports multiple model backends (llama.cpp, transformers, etc.) and provides an OpenAI-compatible API. It's ideal for flexible deployments with CPU or GPU support.
Setup Time: ~15 minutes
Best For: Flexible deployments, multiple backends, CPU/GPU support
## Prerequisites
- Docker and Docker Compose (recommended)
- 8GB+ RAM (for 7B models), 16GB+ RAM (for 13B+ models)
- CPU: Modern x86_64 or ARM64 processor
- GPU: Optional, but improves performance (NVIDIA with CUDA or Apple Silicon)
## Installation
### Docker Compose (Recommended)
The easiest way to run LocalAI is with Docker Compose:
```yaml
version: '3.8'
services:
  localai:
    image: localai/localai:latest-aio-cuda
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
      - ./config:/config
    environment:
      - MODELS_PATH=/models
      - CONFIG_PATH=/config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Save this as `docker-compose.yml` and run:

```bash
docker-compose up -d
```
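The first start can take a while if the image still needs to download weights. If you want to script the wait, here is a minimal readiness check that polls the `/readyz` endpoint used later in this guide; it assumes Python with the `requests` package installed and the default port mapping above:

```python
import time

import requests

BASE_URL = "http://localhost:8080"


def wait_until_ready(timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll LocalAI's /readyz endpoint until it returns HTTP 200 or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/readyz", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False


if __name__ == "__main__":
    print("LocalAI is ready" if wait_until_ready() else "Timed out waiting for LocalAI")
```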
### Docker (Standalone)
For a simple Docker deployment:
```bash
docker run -d \
  -p 8080:8080 \
  -v $(pwd)/models:/models \
  -v $(pwd)/config:/config \
  -e MODELS_PATH=/models \
  -e CONFIG_PATH=/config \
  localai/localai:latest-aio-cuda
```

To give the container access to NVIDIA GPUs, add `--gpus all` to the command.
### Standalone Binary
For systems without Docker:
- Download the binary from the LocalAI releases page
- Extract it and make it executable: `chmod +x localai`
- Run it: `./localai`
## Model Configuration
### Model Gallery
LocalAI supports a model gallery for easy model installation:
```bash
# List available models
curl http://localhost:8080/models/available

# Install a model from the gallery
curl http://localhost:8080/models/apply -d '{
  "id": "ggml-gpt4all-j"
}'
```
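The gallery endpoints can also be scripted. The sketch below simply mirrors the two curl calls above using Python and the `requests` package (an assumption; any HTTP client works), with the same example gallery id:

```python
import requests

BASE_URL = "http://localhost:8080"

# List models available in the gallery (same as the first curl call above).
available = requests.get(f"{BASE_URL}/models/available", timeout=30).json()
print(f"Gallery entries returned: {len(available)}")

# Ask LocalAI to install a gallery model (same as the second curl call above).
resp = requests.post(f"{BASE_URL}/models/apply", json={"id": "ggml-gpt4all-j"}, timeout=30)
print(resp.json())  # installation runs in the background; progress appears in the server logs
```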
### Manual Model Configuration
Create model configuration files in the `config` directory.

Example: `config/gpt-3.5-turbo.yaml`
```yaml
name: gpt-3.5-turbo
backend: llama-cpp
parameters:
  model: gpt-3.5-turbo.gguf
context_size: 4096
f16: true
threads: 4
gpu_layers: 35
```
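If you manage several model files, it can be convenient to generate them from a script rather than by hand. A small sketch, assuming PyYAML is installed, that writes the same configuration shown above:

```python
from pathlib import Path

import yaml  # PyYAML: pip install pyyaml

# The same configuration as the YAML example above, expressed as a dict.
config = {
    "name": "gpt-3.5-turbo",
    "backend": "llama-cpp",
    "parameters": {"model": "gpt-3.5-turbo.gguf"},
    "context_size": 4096,
    "f16": True,
    "threads": 4,
    "gpu_layers": 35,
}

Path("config").mkdir(exist_ok=True)
with open("config/gpt-3.5-turbo.yaml", "w") as fh:
    yaml.safe_dump(config, fh, sort_keys=False)
```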
### Downloading Models
Download models manually:
```bash
# Create the models directory
mkdir -p models

# Download a model (example: GPT4All-J); the target filename must match the
# `model:` entry in your model configuration YAML
wget https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin -O models/gpt-3.5-turbo.gguf
```
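The same download can be scripted. A sketch using the `requests` package (an assumption; `wget` or `curl` work just as well) that streams the example file above into the models directory under the filename referenced by the model config:

```python
from pathlib import Path

import requests

# Same example model as the wget command above; the target filename must match
# the `model:` entry in your model configuration YAML.
URL = "https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin"
TARGET = Path("models/gpt-3.5-turbo.gguf")

TARGET.parent.mkdir(parents=True, exist_ok=True)
with requests.get(URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open(TARGET, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            fh.write(chunk)
print(f"Saved {TARGET} ({TARGET.stat().st_size / 1e9:.2f} GB)")
```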
### Supported Model Formats
- GGUF (llama.cpp) - Recommended for CPU inference
- GGML (legacy llama.cpp format)
- Transformers (Hugging Face) - Requires GPU
## Backend Selection
### llama.cpp (CPU, Recommended)
Best for CPU inference and most models:
```yaml
backend: llama-cpp
parameters:
  model: model.gguf
threads: 4
f16: true
```
### Transformers (GPU)
For GPU acceleration with Hugging Face models:
```yaml
backend: transformers
parameters:
  model: meta-llama/Llama-3-8B-Instruct
gpu_layers: 35
```
### Whisper (Audio)
For audio transcription:
```yaml
backend: whisper
parameters:
  model: whisper-base
```
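Once a whisper-backed model is configured, audio can be transcribed through LocalAI's OpenAI-compatible transcription route. A sketch assuming the `requests` package, a local `sample.wav` placeholder file, and that your LocalAI version exposes `/v1/audio/transcriptions`:

```python
import requests

BASE_URL = "http://localhost:8080"

# Send a local audio file to the whisper-backed model defined above.
with open("sample.wav", "rb") as audio:  # placeholder file name
    resp = requests.post(
        f"{BASE_URL}/v1/audio/transcriptions",
        files={"file": ("sample.wav", audio, "audio/wav")},
        data={"model": "whisper-base"},
        timeout=300,
    )
resp.raise_for_status()
print(resp.json().get("text", ""))
```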
## Hardware Requirements
### CPU-Only Setup
| Model Size | RAM Required | CPU Cores | Performance |
|---|---|---|---|
| 7B | 8GB | 4+ | Slow (5-10 tokens/s) |
| 13B | 16GB | 8+ | Very Slow (2-5 tokens/s) |
### GPU Setup
| Model Size | VRAM Required | GPU Examples |
|---|---|---|
| 7B | 8GB | RTX 3060, RTX 4060 |
| 13B | 16GB | RTX 3090, RTX 4090 |
| 30B+ | 24GB+ | RTX 4090, A100 |
## Configuration Options
### Environment Variables
```bash
# Model storage path
MODELS_PATH=/models

# Configuration path
CONFIG_PATH=/config

# Backend selection
BACKEND=llama-cpp

# GPU settings
CUDA_VISIBLE_DEVICES=0

# Thread count (CPU)
THREADS=4
```
### Model Parameters
Common parameters in model YAML files:
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `threads` | CPU threads | Auto | 4-8 |
| `gpu_layers` | GPU layers (llama.cpp) | 0 | 35 (for 7B models) |
| `context_size` | Context window | 512 | 4096 or 8192 |
| `f16` | Use FP16 | false | true (if supported) |
| `batch_size` | Batch size | 512 | 512 |
## Verifying Installation
Test that LocalAI is running:
```bash
# Check server health
curl http://localhost:8080/readyz

# List available models
curl http://localhost:8080/v1/models

# Test a completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
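Because the API is OpenAI-compatible, the official `openai` Python package (v1 or later, an assumed dependency) can talk to LocalAI directly; the API key is a placeholder unless you have configured authentication:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # must match the `name:` in your model config
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```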
## Using with Radium
### Configuration via Universal Provider
LocalAI provides an OpenAI-compatible API. Configure Radium to use it:
Environment Variables:
```bash
export UNIVERSAL_BASE_URL="http://localhost:8080/v1"
export UNIVERSAL_MODEL_ID="gpt-3.5-turbo"
```
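Before wiring up the agent, it can help to confirm that the endpoint Radium will use actually answers. The sketch below reads the same two environment variables and sends a test chat completion (assumes Python with the `requests` package):

```python
import os

import requests

# Read the same variables the Universal provider is configured with above.
base_url = os.environ.get("UNIVERSAL_BASE_URL", "http://localhost:8080/v1")
model_id = os.environ.get("UNIVERSAL_MODEL_ID", "gpt-3.5-turbo")

resp = requests.post(
    f"{base_url}/chat/completions",
    json={"model": model_id, "messages": [{"role": "user", "content": "ping"}]},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```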
Agent Configuration (TOML):
```toml
[agent]
id = "localai-agent"
name = "LocalAI Agent"
description = "Agent using LocalAI for flexible inference"
prompt_path = "prompts/agents/my-agents/localai-agent.md"
engine = "universal"
model = "gpt-3.5-turbo"
```
### Example Agent Configuration
Create `agents/my-agents/localai-agent.toml`:

```toml
[agent]
id = "localai-agent"
name = "LocalAI Agent"
description = "Flexible agent using LocalAI"
prompt_path = "prompts/agents/my-agents/localai-agent.md"
engine = "universal"
model = "gpt-3.5-turbo"

[agent.persona.models]
primary = "gpt-3.5-turbo"
fallback = "gpt-4"
```
Note: The exact configuration format may vary based on how Radium's engine system resolves Universal provider endpoints. Check the agent configuration guide for the latest patterns.
## Multi-Model Serving
LocalAI can serve multiple models simultaneously. Configure each model in separate YAML files:
`config/model1.yaml`:

```yaml
name: model1
backend: llama-cpp
parameters:
  model: model1.gguf
```
`config/model2.yaml`:

```yaml
name: model2
backend: llama-cpp
parameters:
  model: model2.gguf
```
Both models will be available via the API.
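To confirm both are served, you can list the models and send the same prompt to each one. A sketch assuming the `requests` package and the endpoints shown earlier in this guide:

```python
import requests

BASE_URL = "http://localhost:8080"

# Discover every model LocalAI is serving, then send the same prompt to each.
models = requests.get(f"{BASE_URL}/v1/models", timeout=30).json()["data"]

for model in models:
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": model["id"],
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        },
        timeout=300,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    print(f"--- {model['id']} ---\n{answer}\n")
```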
## Troubleshooting
### Model Not Found
Problem: Model not available when making requests.
Solutions:
- Verify the model file exists in `MODELS_PATH`
- Check that the model configuration YAML is correct
- Verify the model name in your request matches the configuration (see the check below)
- Check the LocalAI logs: `docker logs localai`
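To see quickly which names LocalAI is actually serving, compare against `/v1/models`. A sketch assuming the `requests` package; `gpt-3.5-turbo` is just the example name from this guide:

```python
import requests

BASE_URL = "http://localhost:8080"
EXPECTED = "gpt-3.5-turbo"  # the `name:` from your model configuration

served = {m["id"] for m in requests.get(f"{BASE_URL}/v1/models", timeout=30).json()["data"]}
if EXPECTED in served:
    print(f"'{EXPECTED}' is loaded and ready to use")
else:
    print(f"'{EXPECTED}' not found; currently served: {sorted(served)}")
```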
### Out of Memory
Problem: Model fails to load due to insufficient RAM/VRAM.
Solutions:
- Use a smaller model
- Reduce `context_size` in the model config
- Use quantization (GGUF Q4 or Q8)
- Close other applications
### Slow Performance
Problem: Inference is very slow.
Solutions:
- Use a GPU if available (set `gpu_layers`)
- Increase `threads` for CPU inference
- Use a smaller/faster model
- Enable `f16` if supported
- Check CPU/GPU utilization; the timing sketch below gives a rough tokens/s number
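For a rough throughput number, time a single completion and divide by the completion tokens reported in the response. A sketch assuming the `requests` package and the example model name from this guide; if your LocalAI build does not return a `usage` block, only the wall-clock time is printed:

```python
import time

import requests

BASE_URL = "http://localhost:8080"

start = time.time()
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Write three sentences about llamas."}],
    },
    timeout=600,
).json()
elapsed = time.time() - start

tokens = resp.get("usage", {}).get("completion_tokens")
if tokens:
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")
else:
    print(f"Completed in {elapsed:.1f}s (no usage data in the response)")
```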
### Connection Refused
Problem: Can't connect to LocalAI server.
Solutions:
- Verify the server is running: `docker ps`, or check the process
- Check the port is correct: `netstat -an | grep 8080`
- Verify firewall settings
- Check Docker port mapping
## Next Steps
- Configure Your Agent: See the agent configuration guide
- Optimize Performance: Tune backend and model parameters
- Add More Models: Configure additional models for different use cases
- Set Up Monitoring: Monitor resource usage and performance