Running Production Services on a Self-Hosted Proxmox Cluster
How we serve real customers from a 3-node Proxmox VE cluster running 10+ VMs behind a single public IP, with automated backups, cloud-init provisioning, and power-conscious node management.
There is a persistent myth in IT that production workloads require cloud hosting or, at minimum, a colocated rack. We disagree. For the past several years, Lake Forest Computer Company has run every one of its production services — e-commerce platforms, SaaS applications, monitoring tools, and client websites — from a self-hosted Proxmox VE cluster. These are real services handling real transactions for real customers, and we have been serving customers from self-hosted infrastructure since 2003.
This article is a practical guide to running production infrastructure on a self-hosted Proxmox cluster. We will cover the architecture decisions, the problems we solved, and the lessons we learned. If you are an IT professional evaluating self-hosted infrastructure, this should give you a clear picture of what works, what does not, and what to watch out for.
Why Proxmox Over ESXi
We ran VMware ESXi for years before switching to Proxmox VE. The move was not ideological. It was practical. When Broadcom acquired VMware in late 2023 and began restructuring licensing, the free ESXi tier — which many small businesses and self-hosted environments depended on — was effectively killed. Overnight, the hypervisor that powered thousands of production environments became a subscription product with pricing designed for enterprise data centers.
Proxmox VE gave us everything ESXi provided and several things it did not:
- No licensing cost. Proxmox VE is free and open source. The optional subscription gets you the enterprise repository and support, but the software itself is fully functional without it.
- Native clustering. Proxmox clustering is built in. Creating a multi-node cluster with live migration, shared configuration, and a unified web interface takes minutes. ESXi required vCenter Server for equivalent functionality, which was itself a separately licensed product.
- KVM + LXC. Proxmox supports both full KVM virtual machines and lightweight LXC containers from the same interface. For services that do not need a full OS — DNS servers, lightweight web apps, monitoring agents — LXC containers use a fraction of the resources.
- Cloud-init integration. Proxmox has first-class cloud-init support, which means we can provision VMs from cloud images with static IPs, SSH keys, and hostnames baked in. No manual OS installation, no ISO mounting.
- ZFS and LVM-thin. Proxmox handles both ZFS pools and LVM-thin provisioning natively. We use local-lvm (LVM-thin) for VM disks, which gives us thin provisioning and fast snapshots without the overhead of a full ZFS stack on every node.
The migration was not trivial — we had to convert VMDK disks to qcow2, rebuild network configs, and re-establish backup schedules — but within a week every VM was running on Proxmox with zero downtime for end users.
The Cluster Architecture
Our Proxmox cluster consists of six physical nodes, but only three run at any given time. The other three are powered off. This is deliberate. Running six servers 24/7 when three can handle the load is a waste of electricity and hardware life. We keep the offline nodes available for failover, capacity bursts, or hardware rotation.
Every node sits on the same private subnet, bridged through a virtual network bridge. A pfSense firewall handles routing, DHCP reservations, NAT, and DNS resolution via Unbound. All VM IPs are statically assigned through cloud-init and registered in pfSense so that internal DNS resolution works without maintaining a separate DNS zone.
Quorum with Fewer Than Half the Nodes
Proxmox uses corosync for cluster communication, and corosync requires quorum — a majority of nodes must be online for the cluster to accept configuration changes. With a 6-node cluster, quorum requires 4 nodes. When you only want to run 3, you have a problem.
The solution is straightforward but important to understand: you set the expected votes manually. On each running node, you adjust /etc/pve/corosync.conf to reflect the actual number of participating nodes, or you use pvecm expected 3 to temporarily override quorum expectations. This gives the 3 active nodes a valid quorum of 2 and allows normal cluster operations. The key is to remember that powering on an offline node without updating quorum settings can cause a split-brain scenario, so we document the power-on procedure and never boot cold nodes without checking corosync state first.
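The check-then-override procedure described above takes only a couple of commands on any active node. This is a sketch of the sequence, not a prescription; `pvecm expected` is transient and resets when corosync restarts, which is exactly why the power-on procedure needs to be documented:

```
# Inspect quorum state before changing anything: look at
# "Expected votes", "Total votes", and "Quorate" in the output
pvecm status

# Tell corosync that only 3 nodes are expected to participate.
# With expected=3, a majority of 2 votes is sufficient for quorum.
pvecm expected 3

# Verify the cluster now reports Quorate: Yes
pvecm status
```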
The VM Inventory
Here is what actually runs on this cluster, all serving production traffic:
- Reverse Proxy VM — The SSL termination gateway. Apache with mod_proxy handles TLS certificates via Let's Encrypt and routes incoming HTTPS traffic to the correct backend VM based on domain name. Every public-facing site passes through this VM.
- GigDataServ — Our enterprise IT hardware e-commerce platform. Next.js 16, PostgreSQL, Redis, eBay API sync, Stripe payments. This is the largest VM in terms of resource allocation because it runs a full application stack with background workers.
- AI-Signed — The AI-Signed trust verification SaaS. Runs 43 automated checks across 8 parallel scanners, stores results in PostgreSQL, serves a Shopify embedded app, and processes Stripe subscriptions.
- UIPing — A real-time Ubiquiti stock monitoring platform that tracks ui.com for inventory changes and pushes instant notifications via ntfy, Discord, Slack, email, and webhooks.
- Immich — Self-hosted photo management running Immich with machine learning classification. Handles automatic iPhone photo backups and serves a custom photo slideshow to a Home Assistant-controlled display.
- Home Assistant — Home automation platform. Controls lights, cameras, climate, and the photo display tablet. Integrates with the Immich VM for photo rotation.
- lakeforestcomputer.com — This website. Debian 13, Caddy, Python contact form backend, file-based testimonials system. The simplest VM in the fleet: 2 CPU cores, 2 GB RAM, 20 GB disk.
Resource allocation follows a simple rule: give each VM the minimum it needs, and leave headroom on the host. Most website VMs get 2 CPU cores and 2 GB RAM. Database-heavy VMs like GigDataServ get more. We never overcommit memory because Linux's OOM killer does not care about your SLA.
Single WAN IP, Multiple Domains
This is the problem that trips up most self-hosted operators trying to go production: you have one public IP address from your ISP, but you need to serve seven different domains. Cloud providers solve this by assigning a unique IP per load balancer. On-premises, you solve it with a reverse proxy.
Our public IP is NAT'd by pfSense to a dedicated reverse proxy VM. That VM runs Apache with virtual hosts configured for each domain:
- lakeforestcomputer.com → website VM (Caddy)
- ai-signed.com → SaaS VM (Next.js)
- gigdataserv.com → e-commerce VM (Next.js)
- petturfdeoderizer.com → product VM
- itadtools.com → ITAD VM
- uiping.com → monitoring VM
- app.ai-signed.com → Shopify app VM
SSL termination happens at the reverse proxy using Let's Encrypt certificates managed by certbot. The backend VMs receive plain HTTP traffic over the private network. This means Caddy on the lakeforestcomputer.com VM listens on port 80 only — it never handles TLS. This architecture keeps certificate management centralized and lets each backend VM focus on serving its application without duplicating SSL configuration.
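As an illustration, one of the name-based virtual hosts on the reverse proxy might look like the following. The certificate paths match certbot's defaults, but the backend IP is hypothetical, and mod_ssl plus mod_proxy/mod_proxy_http must be enabled for these directives to work:

```
<VirtualHost *:443>
    ServerName lakeforestcomputer.com
    SSLEngine on
    SSLCertificateFile /etc/letsencrypt/live/lakeforestcomputer.com/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/lakeforestcomputer.com/privkey.pem

    # Forward decrypted traffic to the backend VM over the private LAN.
    # ProxyPreserveHost keeps the original Host header so the backend
    # (Caddy, in this case) can route by domain if it needs to.
    ProxyPreserveHost On
    ProxyPass        / http://10.0.0.10:80/
    ProxyPassReverse / http://10.0.0.10:80/
</VirtualHost>
```

One such block per domain, and certbot renews everything in one place.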
The downside is that the reverse proxy is a single point of failure. If that VM goes down, every public-facing site goes dark. We mitigate this with Proxmox's HA (high availability) feature, which can automatically restart a VM on a different node if the host fails. For planned maintenance, we live-migrate the reverse proxy to another node before touching the hardware.
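Both mitigations are one-liners on a Proxmox node. The VMID and node name below are illustrative:

```
# Register the reverse proxy as an HA resource so the cluster
# restarts it on a surviving node if its host fails
ha-manager add vm:100 --state started

# Planned maintenance: live-migrate it off the node first
qm migrate 100 node2 --online
```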
Cloud-Init: Treating VMs Like Cloud Instances
Every VM in our cluster is provisioned using cloud-init, and this single decision has saved us more time than any other tooling choice. Here is how it works in practice:
- Download a cloud image (we use Debian's genericcloud images from cloud.debian.org).
- Import it as a Proxmox VM template.
- Clone the template and attach a cloud-init drive.
- Set the IP address, gateway, DNS, SSH keys, and hostname in the Proxmox GUI or via qm set commands.
- Boot the VM. It comes up fully configured in under 30 seconds.
No ISO, no installer, no manual partitioning, no first-boot wizard. The VM boots with a static IP, the correct hostname, and your SSH keys already in authorized_keys. You SSH in and start deploying your application.
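The whole workflow maps to a handful of qm commands. VMIDs, the image filename, bridge name, and addresses below are illustrative, but the command sequence is the standard Proxmox cloud-init pattern:

```
# One-time: build a template from the downloaded cloud image
qm create 9000 --name debian-template --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket
qm template 9000

# Per-VM: clone the template and inject identity via cloud-init
qm clone 9000 110 --name web01
qm set 110 --ipconfig0 ip=10.0.0.110/24,gw=10.0.0.1 --sshkeys ~/.ssh/id_ed25519.pub
qm start 110
```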
This is the same workflow that cloud providers use. When you launch an EC2 instance or a DigitalOcean droplet, cloud-init is what configures it on first boot. Using the same tooling on Proxmox means our provisioning process is identical whether we are deploying locally or in the cloud. If we ever need to burst a workload to AWS, the cloud-init config transfers directly.
Backup Strategy
Backups are the difference between a hobby project and production infrastructure. Our backup policy is simple and aggressive:
- Daily backups at 3:00 AM for all standard VMs. Snapshot mode, zstd compression, keep-last=3. This gives us three days of rollback for any VM.
- Weekly backups for large VMs. Our largest VM, which carries over 400 GB of data, runs weekly with keep-last=1. Daily backups of a 400+ GB disk would eat through storage faster than we can provision it.
- NFS storage from a dedicated NAS on the same network. Backup files are written to an NFS mount, keeping them physically separate from the Proxmox nodes. If a node's local disk dies, the backups survive.
Proxmox's built-in vzdump handles the backup process. It supports three modes: stop (shut down the VM, back up, restart), suspend (pause RAM, back up disk, resume), and snapshot (back up a live snapshot with no downtime). We use snapshot mode for everything because our VMs serve production traffic and cannot tolerate even brief downtime windows.
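For reference, the scheduled job is equivalent to an invocation along these lines (the storage name is ours; yours will differ):

```
# Nightly backup of one VM: live snapshot, zstd compression,
# written to the NFS-backed storage, pruned to the last 3 copies
vzdump 110 --mode snapshot --compress zstd --storage nas-nfs --prune-backups keep-last=3
```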
Each node also has two local storage targets: local (a standard directory on the root filesystem for ISOs and templates) and local-lvm (an LVM-thin pool for VM disks). The thin provisioning means a VM with a 20 GB virtual disk only uses as much physical space as the data it has actually written. This is critical for efficient disk utilization when you are running 10+ VMs on consumer-grade SSDs.
Power Efficiency and Cost
One of the less-discussed advantages of on-premises infrastructure over cloud hosting is cost predictability. Our three active nodes consume roughly 150-200 watts combined. At California electricity rates (around $0.30/kWh), that is approximately $40-45 per month in power. Compare that to equivalent cloud resources:
- Seven VMs on DigitalOcean (mix of $12-48/mo droplets): $150-250/month
- Equivalent EC2 instances on AWS: $200-400/month
- Managed databases, load balancers, and storage would add another $100-200/month
Our total infrastructure cost — including power, internet, and occasional hardware replacement — runs about $80/month. We would spend that on a single managed PostgreSQL instance in the cloud.
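The power math is easy to sanity-check. This is a back-of-envelope sketch in POSIX sh integer arithmetic; it ignores UPS conversion losses and networking gear, which is why the measured bill lands a little above the raw node draw:

```shell
hours=720                              # ~24 h x 30 days
for watts in 150 200; do               # the combined-draw range quoted above
  kwh=$(( watts * hours / 1000 ))      # monthly energy in kWh
  cost=$(( kwh * 30 / 100 ))           # at $0.30/kWh, in whole dollars
  echo "${watts} W -> ${kwh} kWh -> ~\$${cost}/month"
done
```

At the top of the range that works out to roughly $43/month for the nodes alone, consistent with the $40-45 figure once the switch, NAS, and UPS overhead are included.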
Keeping three nodes powered off is part of this equation. Each idle server draws 30-50 watts even when doing nothing. Turning them off saves roughly $10-15/month per node and extends hardware lifespan. We boot them only when we need additional capacity or when rotating hardware for maintenance.
What Can Go Wrong
Running production from a self-hosted environment is not without risk, and it is important to be honest about the failure modes:
- ISP outages. You have one internet connection. When it goes down, everything goes down. We have had exactly two ISP outages longer than 30 minutes in the past year. For our use case, that is acceptable. For a business with an SLA promising 99.99% uptime, it is not.
- Power failures. UPS units buy you time, but they do not buy you hours. A sustained power outage means a graceful shutdown and waiting. We have a UPS on each node that provides 15-20 minutes of runtime — enough for automated shutdown scripts to run.
- Hardware failure. Consumer hardware fails. SSDs wear out, fans seize, RAM develops bit errors. The mitigation is redundancy: multiple nodes, daily backups to a separate NAS, and spare hardware on the shelf. We have recovered from a dead node SSD in under an hour by restoring from NFS backups to a different node.
- Dynamic IP. Most residential ISPs assign dynamic IPs. Ours has not changed in over a year, but it technically can. A business-class plan with a static IP or a simple dynamic DNS updater solves this.
Practical Advice for IT Professionals
If you are considering running production workloads on a self-hosted Proxmox cluster, here is what we would tell you after years of doing it:
- Start with backups, not VMs. Before you deploy your first production service, make sure your backup pipeline works end to end. Test a restore. Time it. Know exactly how long it takes to recover from a dead disk.
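A restore drill along those lines can be scripted in a few commands. Paths, VMIDs, and the storage name here are hypothetical:

```
# Grab the newest backup of VM 110 from the NFS store and restore it
# to an unused scratch VMID; `time` tells you your real recovery window
latest=$(ls -t /mnt/pve/nas-nfs/dump/vzdump-qemu-110-*.vma.zst | head -1)
time qmrestore "$latest" 910 --storage local-lvm

# Boot it, verify the service actually works, then clean up
qm start 910
qm stop 910 && qm destroy 910
```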
- Use cloud-init from day one. Manual VM installs create snowflake servers that are painful to rebuild. Cloud-init templates mean you can reprovision any VM in minutes.
- Centralize SSL termination. Do not manage certificates on every VM. Pick one reverse proxy, run certbot there, and let everything behind it speak plain HTTP on the private network.
- Document your quorum settings. If you run fewer nodes than your cluster expects, write down the procedure for adjusting quorum. Tape it to the server rack if you have to. Getting this wrong causes split-brain, and split-brain causes data loss.
- Right-size your VMs. The temptation is to allocate 8 cores and 16 GB to everything. Resist it. A static website needs 2 cores and 2 GB. A database might need 4 cores and 8 GB. Overprovisioning wastes resources and limits how many VMs you can run per node.
- Keep a spare node. Having at least one powered-off node ready to receive migrated VMs is worth the hardware cost. It turns "the server is down, panic" into "the server is down, migrate and investigate."
The Bottom Line
A self-hosted Proxmox cluster is not a compromise. It is a legitimate infrastructure choice for small businesses, consultancies, and independent software companies that need full control over their stack without cloud-scale bills. We run seven production domains, serve real customers, process real payments, and back everything up to NFS storage — all from three on-premises servers.
The total cost is under $80/month. The total control is absolute. And when Broadcom decides to change VMware's licensing again, or AWS bumps its prices by 20%, we will not even notice.
If you are an IT professional or a small business owner considering self-hosted infrastructure, reach out to us. We have been doing this since 2003, and we are happy to help you evaluate whether a self-hosted Proxmox cluster makes sense for your workloads.