Tutorial

Self-Host Your Own AI: Ollama, Open WebUI, and LocalAI on a VPS

June 02, 2026


Why Self-Host Your Own AI?

Every time you send a message to ChatGPT, Claude, or Gemini, your data travels to someone else's server. For personal chats, that is fine. But when you are processing confidential business documents, proprietary code, customer data, or medical records, sending that information to a third-party API raises serious privacy and compliance concerns. Self-hosting your own AI solves this completely: your data never leaves your server.

Beyond privacy, self-hosting eliminates per-request API costs, which can spiral quickly. A team of developers using GPT-4 for coding assistance can easily spend $500-2,000 per month on API calls. A self-hosted model on a $50/month VPS has zero per-token costs and no rate limits. You can query it thousands of times per day without worrying about a surprise bill at the end of the month.

  • $0: per-token cost with self-hosted AI
  • 100%: data privacy (data never leaves your server)

Hardware Requirements

The most important thing to understand about running AI models locally is that model size directly determines RAM requirements. The model must fit entirely in memory (RAM for CPU inference, VRAM for GPU inference). Here is what you need for the most popular models:

| Model | Parameters | Quantization | RAM Required | Quality |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 3.8B | Q4_K_M | 3 GB | Good for simple tasks |
| Llama 3 8B | 8B | Q4_K_M | 5 GB | Great all-rounder |
| Mistral 7B | 7B | Q4_K_M | 5 GB | Excellent for its size |
| CodeLlama 13B | 13B | Q4_K_M | 9 GB | Specialized for code |
| Llama 3 70B | 70B | Q4_K_M | 42 GB | Near GPT-4 quality |
| Mixtral 8x7B | 47B (MoE) | Q4_K_M | 28 GB | Excellent mixture-of-experts |

What is Quantization? Quantization reduces the precision of model weights from 16-bit floating point (FP16) to lower-bit representations (Q4, Q5, Q8). A Q4 quantized model uses roughly 4 bits per parameter, cutting memory requirements about 4x relative to FP16 with little quality loss. For most use cases, Q4_K_M offers the best balance between size and quality.
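
The arithmetic behind quantization can be sketched in a few lines. The helper below is a hypothetical illustration (the function name and the idealized bit widths are assumptions, not something Ollama exposes). It multiplies parameter count by bits per parameter to get a lower bound on weight memory. Real quantized files tend to be somewhat larger, since formats such as Q4_K_M store scaling metadata alongside the weights, and inference needs extra room for the context cache.

```python
def estimate_weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Lower-bound estimate of weight memory in GB (1 GB = 10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_param


for name, params in [("7B model", 7), ("70B model", 70)]:
    fp16 = estimate_weight_gb(params, 16)
    q4 = estimate_weight_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

Treat the result as a floor when sizing a server, and leave headroom above it for the operating system and the model's context.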

CPU vs GPU Inference

CPU Inference

  • Works on any server with enough RAM
  • No special hardware required
  • Slower: 5-15 tokens/second for 7B models
  • Good enough for personal use and small teams
  • VPS-friendly (no GPU needed)

GPU Inference

  • Requires NVIDIA GPU with CUDA
  • Much faster: 30-100+ tokens/second
  • Needs VRAM to hold model weights
  • Essential for production workloads
  • More expensive VPS/dedicated servers
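
To make the throughput figures above concrete, here is a rough sketch of how long a reply takes at each speed. The 500-token response length is an assumption chosen for illustration, and the function name is hypothetical.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce a response, ignoring prompt processing and network latency."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 500  # assumed length of a longer answer

# Throughput ranges taken from the CPU and GPU lists above
for label, low, high in [("CPU", 5, 15), ("GPU", 30, 100)]:
    slowest = seconds_to_generate(RESPONSE_TOKENS, low)
    fastest = seconds_to_generate(RESPONSE_TOKENS, high)
    print(f"{label}: {fastest:.0f}-{slowest:.0f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions a long answer takes roughly 33 to 100 seconds on CPU and 5 to 17 seconds on GPU, which is why CPU inference suits personal use better than interactive production traffic.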

Installing Ollama

Ollama is one of the easiest ways to run large language models locally. It handles model downloading, quantization, and serving through a simple CLI and a REST API that many tools can talk to.

# Install Ollama (one-liner)
$ curl -fsSL https://ollama.com/install.sh | sh
>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Ollama is now installed!

# Verify installation
$ ollama --version
ollama version 0.6.2

# Confirm the Ollama service is running (the Linux installer registers it with systemd)
$ systemctl status ollama
ollama.service - Ollama Service
Active: active (running)

Downloading and Running Models

Ollama has a model library with hundreds of pre-quantized models ready to download and run. Let us start with a few popular ones:

# Download and run Llama 3 (8B parameters)
$ ollama run llama3
pulling manifest...
pulling dde5aa3fc5fc... 100%
>>> Send a message (/? for help)

>>> What is the capital of France?
The capital of France is Paris. It is the largest city in
France and serves as the country's political, economic, and
cultural center.

# Download other popular models
$ ollama pull mistral # Mistral 7B
$ ollama pull codellama # Code-specialized Llama
$ ollama pull phi3 # Microsoft Phi-3 Mini
$ ollama pull gemma2 # Google Gemma 2

# List downloaded models
$ ollama list
NAME SIZE MODIFIED
llama3:latest 4.7 GB 2 minutes ago
mistral:latest 4.1 GB 5 minutes ago
codellama:latest 3.8 GB 8 minutes ago

Using the Ollama API

Ollama exposes a REST API on port 11434, which you can use from any application:

# Generate a response
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Docker in one paragraph",
  "stream": false
}'

# Chat endpoint (with conversation history)
$ curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ],
  "stream": false
}'
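
The same chat call can be made from application code. Below is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the sketch assumes the non-streaming response carries the reply under `message.content`, as the chat endpoint returns when `"stream": false` is set.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port from the curl examples above


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With Ollama running and the model pulled, `chat("llama3", "Explain Docker in one paragraph")` should return the generated text. The generous timeout is deliberate, since CPU inference can take a while on long answers.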

Open WebUI: A ChatGPT-Like Interface

Ollama is powerful, but out of the box it offers only a CLI and an API. Open WebUI (formerly Ollama WebUI) provides a polished, ChatGPT-like web interface that connects to your Ollama instance. It supports multiple models, conversation history, file uploads, and even multimodal (image) models.

# Run Open WebUI with Docker (connects to host Ollama)
$ docker run -d \
  --name open-webui \
  --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Access at http://your-server:8080

Docker Compose Setup (Ollama + Open WebUI)

For a production setup, use Docker Compose to manage both services together:

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      # Publish on loopback only; Open WebUI reaches Ollama over the Compose network
      - "127.0.0.1:11434:11434"
    restart: always
    deploy:
      resources:
        limits:
          memory: 16G

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    volumes:
      - webui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    restart: always

volumes:
  ollama_data:
  webui_data:

# Start the stack
$ docker compose up -d
Creating network "ai_default"
Creating ollama ... done
Creating open-webui ... done

# Pull a model inside the container
$ docker exec ollama ollama pull llama3
pulling manifest... done
success

LocalAI: OpenAI-Compatible Server

LocalAI is an alternative to Ollama that provides an OpenAI-compatible API. This means you can point any application that works with the OpenAI API at your LocalAI instance by changing the base URL. No other code changes are required.

# Run LocalAI with Docker
$ docker run -d \
  --name localai \
  -p 8080:8080 \
  -v localai_models:/models \
  localai/localai:latest

# Use with the OpenAI Python SDK (only base_url changes)
from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:8080/v1",  # the LocalAI container started above
  api_key="not-needed"  # placeholder; the SDK requires some value
)

response = client.chat.completions.create(
  model="llama3",  # must match a model name available in your LocalAI instance
  messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Setting Up a Reverse Proxy with SSL

To access your AI interface securely over the internet, set up Nginx as a reverse proxy with SSL certificates:

# /etc/nginx/sites-available/ai.example.com
server {
    listen 443 ssl http2;
    server_name ai.example.com;

    ssl_certificate /etc/letsencrypt/live/ai.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Long timeout for AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Authentication is critical: Open WebUI includes built-in user authentication. Always enable it and set a strong admin password. Never expose Ollama's API port (11434) directly to the internet without authentication, because anyone who finds it could use your server's resources for free.

Performance Tuning

Getting the best performance from self-hosted AI requires understanding a few key parameters:

Context Size

Context size (also called context window or context length) determines how much text the model can process at once. Larger context requires more RAM. For most use cases, the default 2048-4096 tokens is sufficient. Only increase it if you need to process long documents.

# Set a custom context size inside an interactive session
$ ollama run llama3
>>> /set parameter num_ctx 8192

# Or via API
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarize this document...",
  "options": { "num_ctx": 8192 }
}'
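
Why larger context costs RAM: the model keeps a key and a value vector per layer for every token in the window. The sketch below estimates that cache under stated assumptions: a 16-bit cache and the published Llama 3 8B architecture (32 layers, 8 key/value heads, head dimension 128). The function name is hypothetical, and actual usage depends on the inference engine and its cache settings.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate key/value cache size for a full context window."""
    # Leading 2 = one key tensor plus one value tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Assumed architecture values for Llama 3 8B
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for num_ctx in (2048, 4096, 8192):
    mib = kv_cache_bytes(num_ctx, **LLAMA3_8B) / 2**20
    print(f"num_ctx={num_ctx}: ~{mib:.0f} MiB of cache")
```

Under these assumptions the cache grows linearly, from about 256 MiB at 2048 tokens to about 1 GiB at 8192, on top of the memory already needed for the weights.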

Quantization Levels Compared

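| Quantization | Bits/Param | Size (7B model) | Quality | Speed |
| --- | --- | --- | --- | --- |
| FP16 | 16 | 14 GB | Best | Slowest |
| Q8_0 | 8 | 7.7 GB | Near-perfect | Medium |
| Q5_K_M | 5 | 5.3 GB | Very good | Fast |
| Q4_K_M | 4 | 4.4 GB | Good | Fast |
| Q3_K_M | 3 | 3.5 GB | Acceptable | Fastest |
| Q2_K | 2 | 2.8 GB | Degraded | Fastest |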

Cost Analysis: Self-Hosted vs API

Let us compare the real costs for a small development team of 5 people using AI daily:

| Factor | OpenAI API (GPT-4) | Self-Hosted (Llama 3 8B) |
| --- | --- | --- |
| Monthly server cost | $0 (no server needed) | $40-80/month VPS |
| Per-token cost | $0.03/1K input + $0.06/1K output | $0 |
| 50K queries/month | $800-1,500 | $40-80 (VPS only) |
| Data privacy | Data goes to OpenAI | Data stays on your server |
| Model quality | State-of-the-art | Very good (smaller model) |
| Uptime/reliability | 99.9% SLA | Self-managed |
| Rate limits | Yes | None |

Break-even point: For teams making more than 10,000 API calls per month, self-hosting becomes cheaper than using the GPT-4 API. For individual developers making fewer calls, the API is more cost-effective. The sweet spot for self-hosting is small-to-medium teams with regular, predictable usage and strong privacy requirements.
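
The comparison can be reproduced with a short calculation. The per-token prices come from the cost comparison above. The token counts per query (300 in, 200 out) are assumptions picked for illustration, and the function names are hypothetical. Adjust them to match your own traffic.

```python
import math

INPUT_PRICE_PER_1K = 0.03   # USD per 1K input tokens, from the comparison above
OUTPUT_PRICE_PER_1K = 0.06  # USD per 1K output tokens, from the comparison above


def api_cost_per_query(input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for a single request."""
    return ((input_tokens / 1000) * INPUT_PRICE_PER_1K
            + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K)


def break_even_queries(vps_monthly_usd: float, cost_per_query: float) -> int:
    """Monthly query count at which API spend matches the VPS bill."""
    return math.ceil(vps_monthly_usd / cost_per_query)


per_query = api_cost_per_query(input_tokens=300, output_tokens=200)  # assumed sizes
print(f"API cost per query: ${per_query:.3f}")
print(f"50K queries/month: ${50_000 * per_query:,.0f}")
for vps in (40, 80):
    print(f"${vps}/month VPS breaks even at {break_even_queries(vps, per_query):,} queries")
```

With these assumed token counts, raw server cost alone breaks even at a few thousand queries per month. The 10,000-call threshold may therefore be a conservative figure that leaves room for setup and maintenance effort, which this sketch does not price in.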

Practical Use Cases

Coding Assistant

Use CodeLlama or DeepSeek Coder as a private coding assistant. Integrate with VS Code via the Continue extension pointing at your local Ollama instance. Your proprietary code never leaves your network.

Document Q&A

Upload internal documents, contracts, or manuals and ask questions. Tools like PrivateGPT and RAGFlow connect to Ollama for retrieval-augmented generation on your private data.

Translation

Run multilingual models for real-time translation of internal communications, documentation, or customer messages without sending confidential content to external services.

Content Generation

Generate marketing copy, product descriptions, email templates, and social media posts. Steer the output toward your brand voice with custom system prompts and few-shot examples.

Managing Multiple Models

# List all downloaded models
$ ollama list
NAME SIZE MODIFIED
llama3:latest 4.7 GB 1 day ago
codellama:latest 3.8 GB 2 days ago
mistral:latest 4.1 GB 3 days ago
phi3:latest 2.3 GB 3 days ago

# Check running models
$ ollama ps
NAME SIZE PROCESSOR UNTIL
llama3:latest 5.5 GB 100% CPU 4 minutes from now

# Remove a model to free space
$ ollama rm phi3
deleted 'phi3'

# Create a custom model with a system prompt
$ cat Modelfile
FROM llama3
SYSTEM "You are a senior DevOps engineer. Answer questions about server management, Docker, and CI/CD."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

$ ollama create devops-assistant -f Modelfile
success
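
If you maintain several custom models, generating their Modelfiles from code keeps them consistent. The sketch below is a hypothetical helper (`render_modelfile` is not part of Ollama) that produces the same three instruction types used above: FROM, SYSTEM, and PARAMETER. It wraps the system prompt in triple quotes so that prompts containing ordinary quotation marks stay valid.

```python
def render_modelfile(base_model: str, system_prompt: str, **parameters) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER instructions."""
    lines = [
        f"FROM {base_model}",
        f'SYSTEM """{system_prompt}"""',
    ]
    # One PARAMETER line per keyword argument, in the order given
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3",
    "You are a senior DevOps engineer. Answer questions about "
    "server management, Docker, and CI/CD.",
    temperature=0.7,
    num_ctx=4096,
)
print(text)
```

Write the returned text to a file and pass that file to `ollama create` with `-f`, exactly as in the manual example.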

Deploy on a Panelica-Managed Server

Deploy Ollama and Open WebUI on any Panelica-managed server using Docker. The panel handles reverse proxy configuration, SSL certificates, and resource management for your AI stack. Create a domain like ai.yourdomain.com, point it to your server, enable SSL through the panel, and set up Docker containers — all without touching the command line if you prefer the GUI approach.

Security Checklist

  • Open WebUI authentication enabled with strong admin password
  • Ollama API port (11434) not exposed to the public internet
  • Nginx reverse proxy configured with SSL/TLS
  • Firewall rules restrict access to authorized IPs or VPN
  • Regular updates applied to Ollama and Open WebUI containers
  • Docker volumes backed up for conversation history
  • Resource limits set to prevent one user from consuming all RAM/CPU
  • Monitoring set up to alert on high memory or CPU usage
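
One item on this checklist is easy to verify from code: whether the Ollama port is reachable from outside. The sketch below uses a plain TCP connection attempt. The function name is hypothetical, and a result of False covers both a refused connection and one silently dropped by a firewall.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Run it from a machine outside your server's network. For the setup in this guide, `is_port_open("ai.example.com", 443)` should be True and `is_port_open("ai.example.com", 11434)` should be False.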

Conclusion

Self-hosting AI is no longer a bleeding-edge experiment. Tools like Ollama and Open WebUI have made it genuinely straightforward to run powerful language models on your own hardware. A $50/month VPS can run Llama 3 8B fast enough for a small team, and the privacy and cost benefits are compelling for anyone processing sensitive data or making heavy use of AI APIs.

Start simple: install Ollama, pull Llama 3, and play with it from the command line. When you are comfortable, add Open WebUI for a proper chat interface. Experiment with different models to find the right balance of quality and speed for your use case. And remember, you can always use self-hosted AI alongside commercial APIs — use self-hosted for sensitive work and APIs for tasks that need the absolute best model quality. The key is having the option, and now you do.
