Home Server Project: Architecting a 3-Node Proxmox HA Cluster

:: 4 MIN READ
proxmox homelab high-availability self-hosted infrastructure

How I architected a 3-node High Availability (HA) cluster for my home infrastructure.

Architecture

The hypervisor environment is powered by Proxmox VE, an open-source platform that clusters independent Linux servers for high availability, live migration, and centralized compute management.

flowchart TB
    subgraph PROXMOX["PROXMOX CLUSTER"]
        N1["Node 1<br/>LXC Containers<br/>━━━━━━━━<br/>Homebridge<br/>Scrypted (Ring)<br/>so-co (Sonos)"]
        N2["Node 2<br/>DNS Services"]
        N3["Node 3<br/>Virtual Machines"]
    end
    
    GPU["GPU Server<br/>Ollama API<br/>LLM Inference"]
    
    N1 -->|API Calls| GPU
    N1 -->|Passthrough| AppleHome[Apple Home]
    
    style GPU fill:#09090b,stroke:#06b6d4
    style PROXMOX fill:#09090b,stroke:#27272a

Hardware Setup

For compute, I’m using three modified Dell OptiPlex Micro machines equipped with Intel i5-10500T processors and 16 GB of RAM each. The “T” variant is a 35 W TDP part designed for strict power efficiency, drawing significantly less than the standard 65 W TDP i5-10500. Since raw compute is not the primary bottleneck for my workloads, optimizing for performance-per-watt was the main goal.

In testing, the idle power draw bottoms out at ~4 W on Node 1. The maximum sustained draw occurs on Node 3 due to VM overhead, peaking at 24 W. For comparison, the Windows 10 Pro VM alone would push a standard desktop CPU well past double that figure.

To handle AI workloads without burdening the primary cluster, I deployed a separate GPU server equipped with an RTX 3070. This machine runs bare-metal Ubuntu and hosts Ollama for LLM inference. The Proxmox containers make lightweight API calls to this server. Decoupling the GPU infrastructure from the hypervisor keeps the cluster lean and avoids the complexities of PCIe passthrough across clustered nodes.
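As a sketch of what those lightweight API calls look like: Ollama exposes an HTTP API on port 11434, and a container can hit its /api/generate endpoint with a small JSON payload. The host IP and model name below are placeholders, not my actual configuration.

```python
import json
import urllib.request

# Ollama listens on port 11434 by default; this IP is a placeholder.
OLLAMA_URL = "http://192.168.1.50:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("llama3", "Summarize today's sensor readings.")
# On a host that can reach the GPU server:
# resp = urllib.request.urlopen(req)
# print(json.loads(resp.read())["response"])
```

Keeping the client this thin is the point: the containers hold no model state and stay trivially migratable.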

Proxmox Deployment and Clustering

Setting up the cluster started with imaging each node with the Proxmox VE ISO (flashed via Rufus/dd) to establish the base Debian system. I assigned static IPs during the installation phase to ensure predictable routing.
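For reference, writing the installer from a Linux workstation is a one-liner; the device path is a placeholder, so verify it with lsblk before writing:

```shell
# Identify the USB stick first -- writing to the wrong device is destructive.
lsblk

# Write the Proxmox VE installer image; /dev/sdX is a placeholder.
sudo dd if=proxmox-ve.iso of=/dev/sdX bs=4M status=progress conv=fsync
```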

Once the individual environments were stable, I initialized the cluster from Node 1 and joined the remaining nodes. Proxmox relies on Corosync for cluster state and messaging. Because Corosync is highly sensitive to latency, all nodes must maintain robust, low-latency network communication.
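The cluster bootstrap maps to a handful of pvecm commands; the cluster name and IP below are placeholders:

```shell
# On Node 1: create the cluster
pvecm create homelab

# On Nodes 2 and 3: join using Node 1's static IP
pvecm add 192.168.1.11

# Verify membership and quorum from any node
pvecm status
```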

Deploying exactly 3 nodes makes quorum management straightforward: maintaining cluster state requires a simple majority (2 of 3 nodes online), and a tie vote is impossible. This sidesteps the dreaded “split-brain” scenario that plagues two-node and other even-numbered clusters.
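The quorum arithmetic is easy to verify: a simple majority is floor(n/2) + 1 votes, which also shows why an even-numbered cluster buys nothing. A quick sketch (node counts are illustrative):

```python
def quorum_votes(nodes: int) -> int:
    """Minimum votes Corosync needs to maintain quorum: a simple majority."""
    return nodes // 2 + 1

def tolerable_failures(nodes: int) -> int:
    """How many nodes can drop before the cluster loses quorum."""
    return nodes - quorum_votes(nodes)

# A 3-node cluster needs 2 votes and survives 1 failure;
# a 4-node cluster needs 3 votes and still only survives 1.
for n in (2, 3, 4, 5):
    print(n, quorum_votes(n), tolerable_failures(n))
```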

Networking Topology

All three Proxmox nodes are wired into a UniFi US-8-150W managed switch. While I am currently operating on a flat /24 network architecture (no VLAN segmentation for the management interface yet), network reliability is paramount. Corosync cluster communication and standard traffic share the primary Ethernet interface on each machine; secondary NICs were disabled at the OS level to prevent routing loops or Corosync confusion.

DNS resolution is handled by Node 2, which runs Pi-hole in a lightweight LXC container to provide network-wide ad blocking. I use AdGuard Sync to replicate the DNS configuration to a secondary instance, so if Node 2 drops offline, DNS queries automatically fail over at the application layer, completely independent of Proxmox’s hypervisor-level HA.

The smart home containers on Node 1 require highly stable, broadcast-capable networking for mDNS and HomeKit integration, so their virtual interfaces are bridged directly to the physical network rather than hidden behind NAT.
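That bridged setup corresponds to a standard Proxmox vmbr0 stanza in /etc/network/interfaces; the physical interface name and addresses here are examples, not my actual layout:

```
# /etc/network/interfaces (excerpt) -- eno1 and the IPs are placeholders
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
```

Guests attached to vmbr0 sit directly on the LAN, so mDNS broadcasts reach them without any NAT or proxying.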

Key Features

HA Failover

High Availability is the primary operational advantage of this architecture. Critical services, like the smart home stack, are configured for automatic HA failover. If a physical host experiences a kernel panic or power loss, Proxmox automatically spins up those containers on a surviving node.
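Enrolling a guest in HA is one command per resource via ha-manager; the VMID below is hypothetical:

```shell
# Register container 101 as an HA resource that should always be running
ha-manager add ct:101 --state started --max_restart 2 --max_relocate 1

# Check what the HA stack is tracking
ha-manager status
```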

This resilience requires storage replication, meaning I had to be strategic about which workloads actually warrant HA. For example, the legacy Windows 10 VM on Node 3 is restricted to a single node to save on I/O overhead.
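The replication side is handled by pvesr jobs that ship ZFS snapshots to a partner node on a schedule; the job ID, target, and interval below are examples:

```shell
# Replicate guest 101's disks to node2 every 15 minutes
pvesr create-local-job 101-0 node2 --schedule "*/15"

# List configured jobs and their last sync status
pvesr list
```

The schedule is the knob that trades I/O overhead against how much state you can lose on failover.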

Live Migration

The ability to move running guests between physical hosts is critical for zero-downtime maintenance: VMs live-migrate without dropping packets, while LXC containers use a brief restart migration. I regularly use this to patch and reboot nodes sequentially. Because I am utilizing local storage with ZFS replication rather than a centralized NAS, container migrations are nearly instantaneous, while larger VMs take slightly longer as the delta states sync over the network.
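Both migration flavors map to a single command; the VMIDs and node names are placeholders:

```shell
# Live-migrate a running VM (RAM state streams over the network)
qm migrate 103 node2 --online

# Containers use restart migration: brief stop, final sync, start on the target
pct migrate 101 node2 --restart
```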

Centralized Management

Proxmox provides a unified Web UI and a robust API to manage the entire datacenter footprint from a single pane of glass. Monitoring resource allocation, tailing logs, and deploying new LXC containers is super straightforward.
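The same API that backs the Web UI is reachable from any node via pvesh, which is handy for scripting:

```shell
# Dump every VM, container, node, and storage in the cluster as one view
pvesh get /cluster/resources --output-format json
```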

Challenges and Lessons Learned

The Rule of Three for Quorum: Opting for an odd number of nodes was the right call. Needing 2 out of 3 nodes for quorum means the cluster can comfortably tolerate a single-node hardware failure without entering a locked, read-only state.

Storage Replication I/O: Keeping virtual disks constantly replicated across local storage for HA introduces significant I/O overhead and reduces usable disk space. It forced me to categorize workloads strictly into “mission-critical” (needs HA) and “best-effort” (can tolerate downtime).

Efficiency over Brute Force: The i5-10500T CPUs are perfect for infrastructure tasks. DNS resolution, automated workflows, and reverse proxies do not require heavy compute; they require 24/7 uptime. Optimizing for low thermal output and minimal power draw really pays off over time.

LXC Containers > Full VMs: Whenever possible, I deploy services in LXC containers rather than full virtual machines. They share the host kernel, meaning the DNS container on Node 2 uses a tiny fraction of the CPU and memory footprint that a dedicated VM would require.
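Spinning up a new LXC service is a single command; the template version, VMID, and sizing below are illustrative:

```shell
# Create a minimal Debian container for a DNS service
pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
    --hostname dns-secondary \
    --cores 1 --memory 512 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp \
    --rootfs local-zfs:4
pct start 200
```

A container like this idles at a few dozen megabytes of RAM, versus the gigabyte-plus floor of even a stripped-down VM.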

The Network is the Computer: A distributed cluster is only as stable as its underlying network. When a switch momentarily dropped packets during early testing, Corosync lost quorum and fenced the nodes. You can’t cut corners here: wired, low-latency connections are a must if you want the cluster to actually stay stable.


This cluster has been in stable production for months. The redundancy, combined with the power efficiency of the micro-nodes, makes the initial setup complexity more than worth it, providing a fantastic sandbox for distributed systems engineering.