Running AI workloads on a VPS gives you full control over your models, data privacy, and costs. No API rate limits, no per-token billing, no data leaving your infrastructure. But choosing the wrong VPS spec means either burning money on idle resources or bottlenecking your models.
What Makes a VPS "AI-Ready"?
- RAM — LLMs need massive memory. A 7B parameter model needs ~4GB RAM (quantized) or ~14GB (FP16)
- CPU cores — CPU inference scales with core count. 8+ cores recommended
- NVMe storage — model files are 4-70GB. Fast disk means fast loading
- GPU (optional) — 10-50x faster inference, but significantly more expensive
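To see how an existing server measures up against these points, standard Linux tools are enough. The sketch below assumes a Linux VPS with `/proc/meminfo` and a `df` that accepts `-P` and `-h`; the function names are made up for illustration.

```shell
# Hypothetical helpers for checking a Linux server against the list above.

# Total RAM in GiB, read from a meminfo-style file (defaults to /proc/meminfo).
ram_gb() {
  awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Number of CPU cores available to this process.
cpu_cores() {
  nproc
}

# Free space on the filesystem holding the given path (defaults to /).
disk_free() {
  df -Ph "${1:-/}" | awk 'NR == 2 { print $4 }'
}

# Print all three values on one line.
spec_summary() {
  echo "RAM: $(ram_gb) GiB, cores: $(cpu_cores), free disk: $(disk_free)"
}
```

Running `spec_summary` on the server gives a quick baseline before you pick a model size.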
VPS Specs by Model Size
| Model Size | Min RAM | CPU | Examples |
|---|---|---|---|
| 1-3B params | 4 GB | 4 cores | Phi-3 Mini, TinyLlama |
| 7-8B params | 8 GB | 8 cores | Llama 3.1 8B, Mistral 7B |
| 13B params | 16 GB | 8+ cores | Llama 2 13B, CodeLlama 13B |
| 30-34B params | 32 GB | 12+ cores | DeepSeek Coder 33B, Yi 34B |
| 70B params | 64 GB+ | 16+ cores | Llama 3.1 70B (quantized) |
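The table can be turned into a small lookup for scripted checks. This is a minimal sketch: the function names are hypothetical, and treating each row as an upper bound (so a 5B model lands in the 8 GB tier) is an assumption, since the table leaves gaps between sizes.

```shell
# Minimum RAM in GB from the table above for a model of the given size,
# expressed in billions of parameters. Row boundaries are an assumption.
min_ram_gb() {
  awk -v p="$1" 'BEGIN {
    if      (p <= 3)  print 4
    else if (p <= 8)  print 8
    else if (p <= 13) print 16
    else if (p <= 34) print 32
    else              print 64
  }'
}

# Succeeds when the server RAM (whole GB) meets the table minimum.
# Usage: fits_in_ram <params_in_billions> <server_ram_gb>
fits_in_ram() {
  [ "$2" -ge "$(min_ram_gb "$1")" ]
}
```

For example, `fits_in_ram 13 8 || echo "13B needs a bigger plan"` prints the warning, because the table asks for 16 GB at that size.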
Running Ollama on ZentisLabs VPS
The fastest path from zero to running LLMs:

    # SSH into your ZentisLabs VPS
    ssh root@your-server-ip

    # Install Ollama (one command)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull and run a model
    ollama pull llama3.1:8b
    ollama run llama3.1:8b

    # Query the API. The installer normally starts the Ollama service;
    # run `ollama serve &` first only if nothing is listening on port 11434.
    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.1:8b",
      "prompt": "Explain proxy rotation in 3 sentences"
    }'

ZentisLabs AI VPS plans come with Ollama pre-installed, so you skip the setup entirely.
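By default `/api/generate` streams its answer as a series of JSON objects, which is awkward to consume from a script. Setting `"stream": false` makes Ollama return a single JSON object whose `response` field holds the full text. The wrapper below is a sketch: the function names and the `OLLAMA_URL` variable are hypothetical, and the escaping handles only backslashes and double quotes, not newlines.

```shell
# Escape backslashes and double quotes so a prompt can sit inside a JSON string.
# Assumption: the prompt is a single line (newlines are not escaped).
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for a non-streaming generate call.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' \
    "$1" "$(json_escape "$2")"
}

# Send the request; OLLAMA_URL is a hypothetical override for the default address.
ollama_generate() {
  curl -s "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
    -d "$(generate_payload "$1" "$2")"
}
```

A call such as `ollama_generate llama3.1:8b "Explain proxy rotation in 3 sentences"` returns one JSON document; if `jq` is installed, piping it through `jq -r .response` prints only the generated text.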
CPU vs GPU Inference
CPU inference is practical for quantized models up to 13B parameters, low-to-medium throughput (on the order of 1-10 requests/second, depending on model size and response length), and budget-conscious deployments.
GPU inference becomes the better fit for models above 30B parameters, high throughput (100+ requests/second), real-time applications, and training or fine-tuning.
For most use cases (internal tools, chatbots, document processing), CPU inference on a properly specced VPS is more cost-effective.
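Whether CPU inference is fast enough for a given workload is easiest to settle by measuring it. A non-streaming `/api/generate` response from Ollama reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds); the sketch below turns those into tokens per second and a rough sequential request rate. The function names are hypothetical, and the request estimate ignores prompt processing and concurrency.

```shell
# Tokens per second from Ollama's eval_count and eval_duration (nanoseconds).
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Rough sequential requests per second for a given generation speed and an
# assumed average response length in tokens.
requests_per_second() {
  awk -v tps="$1" -v len="$2" 'BEGIN { printf "%.2f\n", tps / len }'
}
```

For a quick interactive check, `ollama run llama3.1:8b --verbose` prints timing statistics after each reply, including the eval rate.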
Quantization: Run Bigger Models on Smaller Servers
Quantization reduces model precision from FP16 (2 bytes per parameter) to INT4 (0.5 bytes), cutting memory requirements by 4x with minimal quality loss.

    # Pull an explicit 4-bit build: ~5 GB of RAM instead of ~16 GB at FP16 for this 8B model
    ollama pull llama3.1:8b-instruct-q4_0

    # Check model size
    ollama list

| Quantization | RAM for 7B | Quality Loss | Speed |
|---|---|---|---|
| FP16 | ~14 GB | None | Baseline |
| Q8_0 | ~7 GB | Negligible | ~1.2x faster |
| Q5_K_M | ~5 GB | Minimal | ~1.5x faster |
| Q4_0 | ~4 GB | Slight | ~2x faster |
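The RAM column follows from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below reproduces it for a 7B model; `weights_gb` is a hypothetical helper name, and the result covers weights only, so the KV cache and runtime overhead come on top.

```shell
# Approximate weight memory in GB: parameters (billions) x bits per weight / 8.
weights_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

weights_gb 7 16   # FP16:  prints 14.0
weights_gb 7 8    # 8-bit: prints 7.0
weights_gb 7 4    # 4-bit: prints 3.5
```

Real quantized files come out somewhat larger than the pure 4-bit or 8-bit figure because the formats store per-block scaling data alongside the weights, which is why the table rounds up.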
Production AI Stack on VPS
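The commands that follow assume `curl` and Docker are already on the server: step 2 calls `docker run`, but no step installs Docker. A minimal pre-flight sketch, with hypothetical function names, that reports what is missing before you start:

```shell
# Print every command from the argument list that is not on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Succeeds only when the tools used by the stack are present.
preflight() {
  missing=$(missing_commands curl docker)
  # $missing is left unquoted on purpose so several names print on one line.
  [ -z "$missing" ] || { echo "Missing:" $missing >&2; return 1; }
}
```

Running `preflight && echo "ready"` prints `ready` only when both tools are found.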

    # 1. Install Ollama + pull models
    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull llama3.1:8b

    # 2. Install Open WebUI (chat interface)
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main

    # 3. Set up reverse proxy with SSL
    apt install nginx certbot python3-certbot-nginx -y

    # 4. Monitor resource usage
    apt install htop -y

Recommendations by Use Case
- Personal AI assistant: Starter plan (4 GB RAM), a 1-3B model such as Phi-3 Mini (Q4)
- Internal company chatbot: Business plan (8 GB RAM), Llama 3.1 8B
- Document processing pipeline: Advanced plan (16 GB RAM), Mistral 7B + embeddings
- Multi-model serving: Supreme plan (24 GB RAM), multiple quantized models
- Production API: Enterprise plan (32 GB RAM), load balancing, monitoring
🤖 ZentisLabs AI VPS plans include Ollama pre-installed, OpenClaw AI management, and non-expiring bandwidth. The stack is integrated: proxies for data collection and VPS for processing, all on one platform.
