Running LLMs through a third-party API means every prompt you send may be logged, potentially used for training, and is subject to rate limits and outages outside your control. For production applications, privacy-sensitive use cases, or any workload that needs consistent low-latency inference, running your own model on a VPS is the right call. And with Ollama, it's genuinely a 10-minute setup.
Why Run LLMs on Your Own VPS?
- Privacy: Your prompts never leave your infrastructure. Critical for legal, medical, financial, or proprietary data.
- No API costs: Run Llama 3.2, Mistral, or Qwen2.5 indefinitely for the fixed cost of your VPS, with no per-token billing.
- Full control: Customize system prompts, fine-tune on your data, adjust temperature and parameters without restrictions.
- Consistent latency: No shared infrastructure, no traffic spikes from other users, no rate limits throttling your application.
- Offline capability: Once deployed, your stack works independently of external APIs.
What Specs Do You Actually Need?
Model size (in parameters) determines your minimum RAM/VRAM. Here's the practical breakdown:
| Model Size | Examples | Min RAM (CPU) | VRAM (GPU) | Speed (CPU) |
|---|---|---|---|---|
| 1-3B | Llama 3.2 1B, Phi-3 Mini | 4 GB | 4 GB | Fast (10-30 tok/s) |
| 7-8B | Llama 3.1 8B, Mistral 7B | 8 GB | 8 GB | OK (5-15 tok/s) |
| 13B | Llama 2 13B, CodeLlama 13B | 16 GB | 12 GB | Slow (2-5 tok/s) |
| 30-34B | CodeLlama 34B, Qwen 32B | 32 GB | 24 GB | Very slow (<2 tok/s) |
| 70B+ | Llama 3.1 70B, Qwen 72B | 64 GB | 48 GB | Impractical on CPU |
For most use cases, an 8B model on a 16 GB RAM VPS is the sweet spot: GPT-3.5-level quality at zero per-token cost. ZentisLabs AI Pro (8 vCPU, 64 GB RAM) comfortably runs 70B models with CPU inference.
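The RAM figures above follow a simple rule of thumb: weight memory is roughly parameter count times bytes per weight, plus overhead for the KV cache and runtime. Here is a rough sketch of that arithmetic (the 4-bit default and the 20% overhead factor are assumptions for illustration, not Ollama internals):

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough RAM needed for a quantized model: weights plus ~20% runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 4 bits ≈ 0.5 GB
    return round(weight_gb * 1.2, 1)

print(estimated_ram_gb(8))   # 8B model at 4-bit: 4.8 GB of weights + overhead
print(estimated_ram_gb(70))  # 70B model at 4-bit: 42.0 GB
```

This is weights-plus-overhead only; leave headroom for the OS and other services, which is why an 8B model wants 16 GB total and a 70B model needs the 64 GB tier.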
Step 1: Choose Your VPS
ZentisLabs AI VPS plans are pre-optimized for LLM inference: high-RAM configurations, fast NVMe storage for model files, and unmetered bandwidth for API traffic:
AI Starter
$29/mo
- 4 vCPU
- 16 GB RAM
- 200 GB NVMe
- Up to 8B models
AI Pro
$89/mo
- 8 vCPU
- 64 GB RAM
- 500 GB NVMe
- Up to 70B models
AI Max
$179/mo
- 16 vCPU
- 128 GB RAM
- 1 TB NVMe
- Multiple 70B models
Step 2: Install Ollama
SSH into your VPS and run the one-line installer:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama service (runs on port 11434 by default)
ollama serve &
Step 3: Pull and Run a Model
# Pull Llama 3.2 (3B fast, 2GB download)
ollama pull llama3.2
# Pull Llama 3.1 8B (recommended starting point)
ollama pull llama3.1
# Pull Mistral 7B (excellent for code/reasoning)
ollama pull mistral
# Pull Qwen2.5 Coder (best for coding tasks)
ollama pull qwen2.5-coder
# Run interactively
ollama run llama3.1
# Run via API (from another terminal)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain proxy rotation in one paragraph",
"stream": false
}'
Step 4: Set Up Open WebUI (Optional but Recommended)
Open WebUI gives you a ChatGPT-like interface for your models. Install with Docker:
# Install Docker if not already installed
curl -fsSL https://get.docker.com | sh
# Run Open WebUI (connects to local Ollama automatically)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
# Access at http://YOUR_VPS_IP:3000
Step 5: Configure Nginx Reverse Proxy with SSL
Expose your setup at a proper domain with HTTPS:
# Install nginx and certbot
apt install nginx certbot python3-certbot-nginx -y
# Create nginx config
cat > /etc/nginx/sites-available/ollama << 'EOF'
server {
    server_name your-domain.com;

    # Proxy to Open WebUI
    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }

    # Proxy Ollama API (optional; secure this with auth!)
    location /api/ {
        proxy_pass http://localhost:11434/api/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
EOF
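Note that this config leaves the Ollama API reachable by anyone who finds your domain. One possible hardening step (a sketch, not part of the standard setup) is HTTP basic auth on the /api/ location, using a password file created with `htpasswd -c /etc/nginx/.htpasswd youruser` from the apache2-utils package:

```nginx
# Drop-in replacement for the /api/ location block, with basic auth added
location /api/ {
    auth_basic           "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://localhost:11434/api/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```

Clients then pass the credentials as standard basic auth (e.g. `curl -u youruser:password https://your-domain.com/api/generate ...`).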
# Enable site
ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
# Get SSL certificate
certbot --nginx -d your-domain.com
Step 6: Systemd Auto-Restart
# Create systemd service for Ollama
cat > /etc/systemd/system/ollama.service << 'EOF'
[Unit]
Description=Ollama LLM Server
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=5
Environment=OLLAMA_HOST=0.0.0.0
[Install]
WantedBy=multi-user.target
EOF
# Enable and start
systemctl daemon-reload
systemctl enable ollama
systemctl start ollama
# Check status
systemctl status ollama
Using Your Deployment from Code
import requests
OLLAMA_URL = "https://your-domain.com/api" # Or http://your-vps-ip:11434/api
def chat(prompt, model="llama3.1"):
    response = requests.post(
        f"{OLLAMA_URL}/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return response.json()["response"]
# Use like any LLM API
reply = chat("Write a Python function to rotate proxies with retry logic")
print(reply)
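The chat() helper waits for the whole completion before returning. For long generations you may prefer streaming: with "stream": true, Ollama returns one JSON object per line, each carrying a response fragment and a done flag. A minimal parser for that line-delimited format (collect_stream and the sample chunks are illustrative, not part of Ollama's client libraries):

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments from Ollama's line-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:          # requests yields empty keep-alive lines; skip them
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With requests: collect_stream(requests.post(url, json=payload, stream=True).iter_lines())
sample = [
    b'{"response": "Proxy rotation ", "done": false}',
    b'{"response": "spreads requests.", "done": true}',
]
print(collect_stream(sample))  # Proxy rotation spreads requests.
```

In a real handler you would print or forward each fragment as it arrives instead of joining them at the end.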
# OpenAI-compatible endpoint (Ollama v0.1.24+)
from openai import OpenAI
client = OpenAI(
base_url="https://your-domain.com/v1",
api_key="ollama" # Not validated, just required by the SDK
)
response = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Bonus: ZentisLabs One-Click LLM Stack
Don't want to run through the steps above? ZentisLabs' AI VPS plans include a one-click LLM Stack that automates everything: the Ollama install, Open WebUI, nginx with SSL, and the systemd service. Deploy a private ChatGPT alternative in under 5 minutes from the ZentisLabs dashboard.
ZentisLabs AI Pro benchmarks: Llama 3.1 8B averages 18 tokens/sec; Llama 3.1 70B averages 3.2 tokens/sec (CPU inference). For GPU inference on 70B models, contact us about dedicated GPU VPS options.
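Those throughput numbers translate directly into response latency: generation time is roughly output tokens divided by tokens/sec, ignoring prompt processing. A quick back-of-the-envelope helper (the 500-token response length is an arbitrary example, not a benchmark figure):

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate wall-clock time to generate a response, ignoring prompt processing."""
    return round(output_tokens / tokens_per_sec, 1)

print(generation_seconds(500, 18))   # 8B at 18 tok/s: ~27.8 s
print(generation_seconds(500, 3.2))  # 70B on CPU at 3.2 tok/s: ~156.2 s
```

That gap is why 8B models are the practical choice for interactive chat on CPU, while 70B CPU inference is better suited to batch or background jobs.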
