Self-Host Your Own AI: Ollama, Open WebUI, and LocalAI on a VPS

Why Self-Host Your Own AI?

Every time you send a message to ChatGPT, Claude, or Gemini, your data travels to someone else's server. For personal chats, that is fine. But when you are processing confidential business documents, proprietary code, customer data, or medical records, sending that information to a third-party API raises serious privacy and compliance concerns. Self-hosting your own AI solves this completely: your data never leaves your server.

Beyond privacy, self-hosting eliminates API costs that can spiral quickly. A team of developers using GPT-4 for coding assistance can easily spend $500-2000 per month on API calls. A self-hosted model on a $50/month VPS has zero per-token costs and no rate limits. You can query it thousands of times per day without worrying about a surprise bill at the end of the month.

Per-token cost with self-hosted AI: $0

Data privacy: 100% (never leaves your server)

Hardware Requirements

The most important thing to understand about running AI models locally is that model size directly determines RAM requirements. The model must fit entirely in memory (RAM for CPU inference, VRAM for GPU inference). Here is what you need for the most popular models:

Model	Parameters	Quantization	RAM Required	Quality
Phi-3 Mini	3.8B	Q4_K_M	3 GB	Good for simple tasks
Llama 3 8B	8B	Q4_K_M	5 GB	Great all-rounder
Mistral 7B	7B	Q4_K_M	5 GB	Excellent for its size
CodeLlama 13B	13B	Q4_K_M	9 GB	Specialized for code
Llama 3 70B	70B	Q4_K_M	42 GB	Near GPT-4 quality
Mixtral 8x7B	47B (MoE)	Q4_K_M	28 GB	Excellent mixture-of-experts

What is Quantization? Quantization reduces the precision of model weights from 16-bit floating point (FP16) to lower-bit representations (Q4, Q5, Q8). A Q4 quantized model uses roughly 4 bits per parameter, cutting memory requirements by 4x with minimal quality loss. For most use cases, Q4_K_M offers the best balance between size and quality.
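As a back-of-the-envelope check on the table above, the weight footprint is roughly parameters times bits per weight. The sketch below is only an illustration: the function name is made up for this article, it counts weights alone (no context cache, no runtime overhead), and it treats Q4 as exactly 4 bits. Real Q4_K_M files average somewhat more than 4 bits per weight, which is one reason the table's figures are higher.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare full FP16 weights with an idealized 4-bit quantization.
for name, params in [("Phi-3 Mini", 3.8), ("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    fp16 = estimate_weights_gb(params, 16)
    q4 = estimate_weights_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

Treat the result as a lower bound and leave headroom on top of it when sizing a VPS.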

CPU vs GPU Inference

CPU Inference

Works on any server with enough RAM
No special hardware required
Slower: 5-15 tokens/second for 7B models
Good enough for personal use and small teams
VPS-friendly (no GPU needed)

GPU Inference

Requires NVIDIA GPU with CUDA
Much faster: 30-100+ tokens/second
Needs VRAM to hold model weights
Essential for production workloads
More expensive VPS/dedicated servers
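To get a feel for what those token rates mean in practice, the following sketch converts tokens per second into waiting time for one reply. The 500-token reply length is a hypothetical figure chosen for illustration, and the calculation ignores prompt processing and network latency.

```python
def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring prompt processing and network latency."""
    return output_tokens / tokens_per_second


REPLY_TOKENS = 500  # hypothetical reply length, roughly a long answer

for label, speed in [("CPU, slow end", 5), ("CPU, fast end", 15),
                     ("GPU, slow end", 30), ("GPU, fast end", 100)]:
    print(f"{label}: {response_seconds(REPLY_TOKENS, speed):.0f} s")
```

At 5 tokens/second a 500-token reply takes 100 seconds; at 100 tokens/second it takes 5 seconds. That gap is the difference between "fine for occasional personal use" and "comfortable for interactive work".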

Installing Ollama

Ollama is the easiest way to run large language models locally. It handles model downloading, quantization, and serving through a simple CLI and a REST API that is compatible with many tools.

# Install Ollama (one-liner)
$ curl -fsSL https://ollama.com/install.sh | sh
>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Ollama is now installed!

# Verify installation
$ ollama --version
ollama version 0.6.2

# Check that the Ollama service is running
$ systemctl status ollama
ollama.service - Ollama Service
Active: active (running)

Downloading and Running Models

Ollama has a model library with hundreds of pre-quantized models ready to download and run. Let us start with a few popular ones:

# Download and run Llama 3 (8B parameters)
$ ollama run llama3
pulling manifest...
pulling dde5aa3fc5fc... 100%
>>> Send a message (/? for help)

>>> What is the capital of France?
The capital of France is Paris. It is the largest city in
France and serves as the country's political, economic, and
cultural center.

# Download other popular models
$ ollama pull mistral # Mistral 7B
$ ollama pull codellama # Code-specialized Llama
$ ollama pull phi3 # Microsoft Phi-3 Mini
$ ollama pull gemma2 # Google Gemma 2

# List downloaded models
$ ollama list
NAME SIZE MODIFIED
llama3:latest 4.7 GB 2 minutes ago
mistral:latest 4.1 GB 5 minutes ago
codellama:latest 3.8 GB 8 minutes ago

Using the Ollama API

Ollama exposes a REST API on port 11434, which you can use from any application:

# Generate a response
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Docker in one paragraph",
  "stream": false
}'

# Chat endpoint (with conversation history)
$ curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ],
  "stream": false
}'
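The same chat call can be made from application code. Here is a minimal sketch using only the Python standard library; the helper names build_chat_request and chat are made up for this example, and it assumes Ollama is listening on localhost:11434 as in the curl commands above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same host and port as the curl examples


def build_chat_request(model, messages, base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming POST to /api/chat."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, base_url=OLLAMA_URL, timeout=300):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, messages, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # With "stream": false the reply arrives as one JSON object.
    return body["message"]["content"]
```

With the service running, chat("llama3", [{"role": "user", "content": "Explain Docker in one paragraph"}]) returns the reply as a string. To keep a conversation going, append each reply to the messages list with the role "assistant" before sending the next question.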

Open WebUI: A ChatGPT-Like Interface

Ollama is powerful, but out of the box you reach it only through the command line or its API. Open WebUI (formerly Ollama WebUI) provides a beautiful, ChatGPT-like web interface that connects to your Ollama instance. It supports multiple models, conversation history, file uploads, and even multimodal (image) models.

# Run Open WebUI with Docker (connects to host Ollama)
$ docker run -d \
  --name open-webui \
  --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Access at http://your-server:8080

Docker Compose Setup (Ollama + Open WebUI)

For a production setup, use Docker Compose to manage both services together:

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: always
    deploy:
      resources:
        limits:
          memory: 16G

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    volumes:
      - webui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    restart: always

volumes:
  ollama_data:
  webui_data:

# Start the stack
$ docker compose up -d
Creating network "ai_default"
Creating ollama ... done
Creating open-webui ... done

# Pull a model inside the container
$ docker exec ollama ollama pull llama3
pulling manifest... done
success
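Once the stack is up, you may want to confirm from a script that the model really is available. The sketch below queries Ollama's /api/tags endpoint, which lists downloaded models. The function names are invented for this example, and it assumes port 11434 is published on the host as in the compose file above.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [entry["name"] for entry in tags_response.get("models", [])]


def list_models(base_url="http://localhost:11434", timeout=10):
    """Ask a running Ollama instance which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return model_names(json.load(response))


# Shape of a typical /api/tags reply, trimmed to the one field used here.
sample = {"models": [{"name": "llama3:latest"}, {"name": "mistral:latest"}]}
print(model_names(sample))
```

Calling list_models() against the running container should include llama3:latest after the pull above. A deployment script could use this as a simple readiness check.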

LocalAI: OpenAI-Compatible Server

LocalAI is an alternative to Ollama that provides a fully OpenAI-compatible API. This means you can point any application that works with the OpenAI API at your LocalAI instance by simply changing the base URL. No other code changes are required.

# Run LocalAI with Docker
$ docker run -d \
  --name localai \
  -p 8080:8080 \
  -v localai_models:/models \
  localai/localai:latest

# Use with OpenAI Python SDK (just change base_url)
from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:8080/v1",
  api_key="not-needed"
)

response = client.chat.completions.create(
  model="llama3",
  messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
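Because the only difference between backends is the base URL and key, one option is to read both from the environment so the calling code never changes. This is a sketch, not a LocalAI feature: LLM_BASE_URL and LLM_API_KEY are hypothetical variable names picked for this example, and the default points at the LocalAI container started above.

```python
import os

LOCALAI_BASE_URL = "http://localhost:8080/v1"  # LocalAI container from the docker run above


def client_settings(env=None):
    """Return keyword arguments for OpenAI(...), overridable via environment variables.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCALAI_BASE_URL),
        "api_key": env.get("LLM_API_KEY", "not-needed"),
    }


# With nothing set, the defaults target the local server.
print(client_settings({}))
```

The client is then created with OpenAI(**client_settings()), and switching between a local server and a hosted API becomes a deployment setting rather than a code edit.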

Setting Up a Reverse Proxy with SSL

To access your AI interface securely over the internet, set up Nginx as a reverse proxy with SSL certificates:

# /etc/nginx/sites-available/ai.example.com
server {
    listen 443 ssl http2;
    server_name ai.example.com;

    ssl_certificate /etc/letsencrypt/live/ai.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Long timeout for AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Authentication is critical: Open WebUI includes built-in user authentication. Always enable it and set a strong admin password. Never expose Ollama's API port (11434) directly to the internet: the API has no authentication of its own, so anyone who can reach it could use your server's resources for free.
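A quick way to confirm that the API port really is closed is to try connecting to it from a machine outside your server's network. The helper below is a generic TCP check written for this article, not part of Ollama or Open WebUI.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from your laptop, is_port_open("ai.example.com", 443) should be True, while is_port_open("ai.example.com", 11434) should be False. If the second call returns True, tighten the firewall or bind the port to 127.0.0.1 before going any further.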

Performance Tuning

Getting the best performance from self-hosted AI requires understanding a few key parameters:

Context Size

Context size (also called context window or context length) determines how much text the model can process at once. Larger context requires more RAM. For most use cases, the default 2048-4096 tokens is sufficient. Only increase it if you need to process long documents.

# Set a custom context size inside an interactive session
$ ollama run llama3
>>> /set parameter num_ctx 8192

# Or via API
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarize this document...",
  "options": { "num_ctx": 8192 }
}'
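To see why a larger context costs RAM, you can estimate the size of the key/value cache the model keeps for every token in the window. The sketch below uses the standard transformer estimate (keys plus values, per layer, per KV head). The Llama 3 8B shape used here (32 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions for illustration; actual usage depends on the runtime and its cache settings.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """Memory for keys and values across all layers at a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_ctx


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
for n_ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(32, 8, 128, n_ctx) / 2**30
    print(f"num_ctx={n_ctx}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions, each doubling of num_ctx doubles the cache: about 0.25 GiB at 2048 tokens, 0.5 GiB at 4096, and 1 GiB at 8192, all on top of the model weights.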

Quantization Levels Compared

Quantization	Bits/Param	Size (7B model)	Quality	Speed
FP16	16	14 GB	Best	Slowest
Q8_0	8	7.7 GB	Near-perfect	Medium
Q5_K_M	5	5.3 GB	Very good	Fast
Q4_K_M	4	4.4 GB	Good	Fast
Q3_K_M	3	3.5 GB	Acceptable	Fastest
Q2_K	2	2.8 GB	Degraded	Fastest

Cost Analysis: Self-Hosted vs API

Let us compare the real costs for a small development team of 5 people using AI daily:

Factor	OpenAI API (GPT-4)	Self-Hosted (Llama 3 8B)
Monthly server cost	$0 (no server needed)	$40-80/month VPS
Per-token cost	$0.03/1K input + $0.06/1K output	$0
50K queries/month	$800-1,500	$40-80 (VPS only)
Data privacy	Data goes to OpenAI	Data stays on your server
Model quality	State-of-the-art	Very good (smaller model)
Uptime/reliability	99.9% SLA	Self-managed
Rate limits	Yes	None

Break-even point: For teams making more than 10,000 API calls per month, self-hosting becomes cheaper than using the GPT-4 API. For individual developers making fewer calls, the API is more cost-effective. The sweet spot for self-hosting is small-to-medium teams with regular, predictable usage and strong privacy requirements.
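You can rerun this comparison with your own numbers. The sketch below uses the per-token prices from the table; the average query size (300 prompt tokens, 200 completion tokens) and the $50 VPS price are assumptions picked for illustration, so substitute your own measurements.

```python
import math

INPUT_PRICE_PER_1K = 0.03   # USD per 1K input tokens, from the table above
OUTPUT_PRICE_PER_1K = 0.06  # USD per 1K output tokens, from the table above


def api_cost_per_query(input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for one query of the given size."""
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)


def break_even_queries(vps_monthly_usd: float, input_tokens: int, output_tokens: int) -> int:
    """Smallest monthly query count at which the API bill reaches the VPS bill."""
    return math.ceil(vps_monthly_usd / api_cost_per_query(input_tokens, output_tokens))


# Hypothetical average query: 300 prompt tokens, 200 completion tokens.
per_query = api_cost_per_query(300, 200)
print(f"API cost per query: ${per_query:.3f}")
print(f"50K queries/month:  ${50_000 * per_query:,.0f}")
print(f"Break-even vs $50 VPS: {break_even_queries(50, 300, 200)} queries/month")
```

With these assumed token counts, 50K queries cost about $1,050, inside the table's range, and the raw break-even is a few thousand queries per month. The sketch prices in only the API bill and the VPS bill, not setup or maintenance time, so a higher threshold such as 10,000 calls leaves room for that effort.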

Practical Use Cases

Coding Assistant

Use CodeLlama or DeepSeek Coder as a private coding assistant. Integrate with VS Code via the Continue extension pointing at your local Ollama instance. Your proprietary code never leaves your network.

Document Q&A

Upload internal documents, contracts, or manuals and ask questions. Tools like PrivateGPT and RAGFlow connect to Ollama for retrieval-augmented generation on your private data.
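The core of retrieval-augmented generation is simple: find the document chunks closest to the question, then paste them into the prompt. The sketch below shows that step with toy three-dimensional vectors standing in for real embeddings. It is a generic illustration, not how PrivateGPT or RAGFlow are implemented, and all function names and sample texts are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec, indexed_chunks, k=2):
    """indexed_chunks is a list of (text, embedding) pairs; return the k closest texts."""
    ranked = sorted(indexed_chunks,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Place the retrieved chunks ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy 3-dimensional vectors stand in for real embeddings.
index = [
    ("Invoices are due within 30 days.", [0.9, 0.1, 0.0]),
    ("The office closes at 6 pm.", [0.0, 0.2, 0.9]),
    ("Late payments incur a 2% fee.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "When are invoices due?"
print(build_prompt("When are invoices due?", top_chunks(query, index)))
```

In a real setup the vectors would come from an embedding model and the resulting prompt would be sent to your local model, so both the documents and the questions stay on your server.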

Translation

Run multilingual models for real-time translation of internal communications, documentation, or customer messages without sending confidential content to external services.

Content Generation

Generate marketing copy, product descriptions, email templates, and social media posts. Steer the output toward your brand voice with custom system prompts and few-shot examples.
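A system prompt plus few-shot examples is just a carefully ordered message list. The sketch below assembles one; the function name, the brand voice, and the example pair are all hypothetical.

```python
def build_messages(system_prompt, examples, user_request):
    """Assemble a chat message list: system prompt, few-shot pairs, then the real request."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_request})
    return messages


# Hypothetical brand voice and examples, for illustration only.
messages = build_messages(
    "You write short, friendly product copy for an outdoor gear shop.",
    [("Describe: steel water bottle", "Keeps your drink cold on the longest trail.")],
    "Describe: lightweight rain jacket",
)
print(len(messages))
```

The resulting list can be sent as the "messages" field of a chat request to your local model. Adding more example pairs usually tightens the style, at the cost of a longer prompt.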

Managing Multiple Models

# List all downloaded models
$ ollama list
NAME SIZE MODIFIED
llama3:latest 4.7 GB 1 day ago
codellama:latest 3.8 GB 2 days ago
mistral:latest 4.1 GB 3 days ago
phi3:latest 2.3 GB 3 days ago

# Check running models
$ ollama ps
NAME SIZE PROCESSOR UNTIL
llama3:latest 5.5 GB 100% CPU 4 minutes from now

# Remove a model to free space
$ ollama rm phi3
deleted 'phi3'

# Create a custom model with a system prompt
$ cat Modelfile
FROM llama3
SYSTEM "You are a senior DevOps engineer. Answer questions about server management, Docker, and CI/CD."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

$ ollama create devops-assistant -f Modelfile
success

Deploy on a Panelica-Managed Server

Deploy Ollama and Open WebUI on any Panelica-managed server using Docker. The panel handles reverse proxy configuration, SSL certificates, and resource management for your AI stack. Create a domain like ai.yourdomain.com, point it to your server, enable SSL through the panel, and set up Docker containers, all without touching the command line if you prefer the GUI approach.

Security Checklist

Open WebUI authentication enabled with strong admin password
Ollama API port (11434) not exposed to the public internet
Nginx reverse proxy configured with SSL/TLS
Firewall rules restrict access to authorized IPs or VPN
Regular updates applied to Ollama and Open WebUI containers
Docker volumes backed up for conversation history
Resource limits set to prevent one user from consuming all RAM/CPU
Monitoring set up to alert on high memory or CPU usage

Conclusion

Self-hosting AI is no longer a bleeding-edge experiment. Tools like Ollama and Open WebUI have made it genuinely straightforward to run powerful language models on your own hardware. A $50/month VPS can run Llama 3 8B fast enough for a small team, and the privacy and cost benefits are compelling for anyone processing sensitive data or making heavy use of AI APIs.

Start simple: install Ollama, pull Llama 3, and play with it from the command line. When you are comfortable, add Open WebUI for a proper chat interface. Experiment with different models to find the right balance of quality and speed for your use case. And remember, you can always use self-hosted AI alongside commercial APIs: self-hosted for sensitive work, and APIs for tasks that need the absolute best model quality. The key is having the option, and now you do.