Running LLMs through a third-party API means every prompt you send may be logged, potentially used for training, and is subject to rate limits and outages outside your control. For production applications, privacy-sensitive use cases, or any workload that needs consistent low-latency inference, running your own model on a VPS is the right call. And with Ollama, it's genuinely a 10-minute setup.
Why Run LLMs on Your Own VPS?
- Privacy: Your prompts never leave your infrastructure. Critical for legal, medical, financial, or proprietary data.
- No API costs: Run Llama 3.2, Mistral, or Qwen2.5 indefinitely for the fixed cost of your VPS, with no per-token billing.
- Full control: Customize system prompts, fine-tune on your data, adjust temperature and parameters without restrictions.
- Consistent latency: No shared infrastructure, no traffic spikes from other users, no rate limits throttling your application.
- Offline capability: Once deployed, your stack works independently of external APIs.
What Specs Do You Actually Need?
Model size (in parameters) determines your minimum RAM/VRAM. Here's the practical breakdown:
| Model Size | Examples | Min RAM (CPU) | VRAM (GPU) | Speed (CPU) |
|---|---|---|---|---|
| 1-3B | Llama 3.2 1B, Phi-3 Mini | 4 GB | 4 GB | Fast (10-30 tok/s) |
| 7-8B | Llama 3.1 8B, Mistral 7B | 8 GB | 8 GB | OK (5-15 tok/s) |
| 13B | Llama 2 13B, CodeLlama 13B | 16 GB | 12 GB | Slow (2-5 tok/s) |
| 30-34B | CodeLlama 34B, Qwen 32B | 32 GB | 24 GB | Very slow (<2 tok/s) |
| 70B+ | Llama 3.1 70B, Qwen 72B | 64 GB | 48 GB | Impractical on CPU |
For most use cases, an 8B model on a 16 GB RAM VPS is the sweet spot: GPT-3.5-level quality at zero per-token cost. ZentisLabs AI Pro (8 vCPU, 64 GB RAM) comfortably runs 70B models with CPU inference.
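The RAM figures above follow a simple rule of thumb: weight memory is roughly parameter count times bytes per weight, plus overhead for the KV cache and runtime. Here is a rough sketch of that arithmetic (the 4-bit default and the 20% overhead factor are assumptions for illustration, not Ollama internals):

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough RAM needed for a quantized model: weights plus ~20% runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 4 bits ≈ 0.5 GB
    return round(weight_gb * 1.2, 1)

print(estimated_ram_gb(8))   # 8B model at 4-bit: 4.8 GB of weights + overhead
print(estimated_ram_gb(70))  # 70B model at 4-bit: 42.0 GB
```

This is weights-plus-overhead only; leave headroom for the OS and other services, which is why an 8B model wants 16 GB total and a 70B model needs the 64 GB tier.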
Step 1: Choose Your VPS
ZentisLabs AI VPS plans are pre-optimized for LLM inference: high-RAM configurations, fast NVMe storage for model files, and unmetered bandwidth for API traffic:
AI Starter
$29/mo
- 4 vCPU
- 16 GB RAM
- 200 GB NVMe
- Up to 8B models
AI Pro
$89/mo
- 8 vCPU
- 64 GB RAM
- 500 GB NVMe
- Up to 70B models
AI Max
$179/mo
- 16 vCPU
- 128 GB RAM
- 1 TB NVMe
- Multiple 70B models
Step 2: Install Ollama
SSH into your VPS and run the one-line installer:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama service (runs on port 11434 by default)
ollama serve &
Step 3: Pull and Run a Model
# Pull Llama 3.2 (3B fast, 2GB download)
ollama pull llama3.2
# Pull Llama 3.1 8B (recommended starting point)
ollama pull llama3.1
# Pull Mistral 7B (excellent for code/reasoning)
ollama pull mistral
# Pull Qwen2.5 Coder (best for coding tasks)
ollama pull qwen2.5-coder
# Run interactively
ollama run llama3.1
# Run via API (from another terminal)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain proxy rotation in one paragraph",
"stream": false
}'
Step 4: Set Up Open WebUI (Optional but Recommended)
Open WebUI gives you a ChatGPT-like interface for your models. Install with Docker:
# Install Docker if not already installed
curl -fsSL https://get.docker.com | sh
# Run Open WebUI (connects to local Ollama automatically)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
# Access at http://YOUR_VPS_IP:3000
Step 5: Configure Nginx Reverse Proxy with SSL
Expose your setup at a proper domain with HTTPS:
# Install nginx and certbot
apt install nginx certbot python3-certbot-nginx -y
# Create nginx config
cat > /etc/nginx/sites-available/ollama << 'EOF'
server {
    server_name your-domain.com;

    # Proxy to Open WebUI
    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }

    # Proxy Ollama API (optional; secure this with auth!)
    location /api/ {
        proxy_pass http://localhost:11434/api/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
EOF
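Note that this config leaves the Ollama API reachable by anyone who finds your domain. One possible hardening step (a sketch, not part of the standard setup) is HTTP basic auth on the /api/ location, using a password file created with `htpasswd -c /etc/nginx/.htpasswd youruser` from the apache2-utils package:

```nginx
# Drop-in replacement for the /api/ location block, with basic auth added
location /api/ {
    auth_basic           "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://localhost:11434/api/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```

Clients then pass the credentials as standard basic auth (e.g. `curl -u youruser:password https://your-domain.com/api/generate ...`).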
# Enable site
ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
# Get SSL certificate
certbot --nginx -d your-domain.com
Step 6: Systemd Auto-Restart
# Create systemd service for Ollama
cat > /etc/systemd/system/ollama.service << 'EOF'
[Unit]
Description=Ollama LLM Server
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=5
Environment=OLLAMA_HOST=0.0.0.0
[Install]
WantedBy=multi-user.target
EOF
# Enable and start
systemctl daemon-reload
systemctl enable ollama
systemctl start ollama
# Check status
systemctl status ollama
Using Your Deployment from Code
import requests
OLLAMA_URL = "https://your-domain.com/api" # Or http://your-vps-ip:11434/api
def chat(prompt, model="llama3.1"):
    response = requests.post(
        f"{OLLAMA_URL}/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return response.json()["response"]
# Use like any LLM API
reply = chat("Write a Python function to rotate proxies with retry logic")
print(reply)
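The chat() helper waits for the whole completion before returning. For long generations you may prefer streaming: with "stream": true, Ollama returns one JSON object per line, each carrying a response fragment and a done flag. A minimal parser for that line-delimited format (collect_stream and the sample chunks are illustrative, not part of Ollama's client libraries):

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments from Ollama's line-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:          # requests yields empty keep-alive lines; skip them
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With requests: collect_stream(requests.post(url, json=payload, stream=True).iter_lines())
sample = [
    b'{"response": "Proxy rotation ", "done": false}',
    b'{"response": "spreads requests.", "done": true}',
]
print(collect_stream(sample))  # Proxy rotation spreads requests.
```

In a real handler you would print or forward each fragment as it arrives instead of joining them at the end.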
# OpenAI-compatible endpoint (Ollama v0.1.24+)
from openai import OpenAI
client = OpenAI(
base_url="https://your-domain.com/v1",
api_key="ollama" # Not validated, just required by the SDK
)
response = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Bonus: ZentisLabs One-Click LLM Stack
Don't want to run through the steps above? ZentisLabs' AI VPS plans include a one-click LLM Stack that automates everything: the Ollama install, Open WebUI, nginx with SSL, and the systemd service. Deploy a private ChatGPT alternative in under 5 minutes from the ZentisLabs dashboard.
ZentisLabs AI Pro benchmarks: Llama 3.1 8B averages 18 tokens/sec; Llama 3.1 70B averages 3.2 tokens/sec (CPU inference). For GPU inference on 70B models, contact us about dedicated GPU VPS options.
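Those throughput numbers translate directly into response latency: generation time is roughly output tokens divided by tokens/sec, ignoring prompt processing. A quick back-of-the-envelope helper (the 500-token response length is an arbitrary example, not a benchmark figure):

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate wall-clock time to generate a response, ignoring prompt processing."""
    return round(output_tokens / tokens_per_sec, 1)

print(generation_seconds(500, 18))   # 8B at 18 tok/s: ~27.8 s
print(generation_seconds(500, 3.2))  # 70B on CPU at 3.2 tok/s: ~156.2 s
```

That gap is why 8B models are the practical choice for interactive chat on CPU, while 70B CPU inference is better suited to batch or background jobs.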
