On my journey to making my homelab more reliable and operationally efficient, I've made some crucial mistakes. I attribute all of my failures to the learning process and write this blog in hopes that you don't repeat them.
Mistake #1: Running OS's on USBs
I know, I know, it sounds crazy, but I had my reasons. To this day, I cluster mini PCs in my homelab. They're more power efficient and cheaper, and NUCs are pretty powerful in spite of their size. But that's beside the point. The issue: an Intel NUC has one slot for an NVMe drive and one for a 2.5" SATA drive. With both drives in a mirrored ZFS data pool, I had nothing left to boot from. My solution was to mirror the boot pool across two external USB drives. At least I was redundant?
Mitigation #1
You don't know what you're missing out on until you have something else. The slow SSH logins, delayed UI load speeds, and dropped connections were all a result of this silly design choice. I moved all my server OSs to SSDs. Life is so much better and faster now. I've seen disk latency drop from milliseconds to microseconds.
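If you want to see this difference for yourself, you can get a rough read on synchronous write latency without any special tooling. This is a minimal Python sketch (the function name and defaults are mine, not from any benchmark suite); for serious numbers you'd reach for a real benchmarking tool, but this is enough to show the USB-vs-SSD gap:

```python
import os
import statistics
import tempfile
import time


def write_latency_us(path, iters=100, size=4096):
    """Median latency of a synchronous 4 KiB write, in microseconds."""
    buf = os.urandom(size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    samples = []
    try:
        for _ in range(iters):
            start = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # force the write down to the device, not just the page cache
            samples.append((time.perf_counter() - start) * 1e6)
    finally:
        os.close(fd)
    return statistics.median(samples)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        target = f.name
    print(f"median fsync write latency: {write_latency_us(target):.0f} us")
    os.unlink(target)
```

Run it once on the USB-backed filesystem and once on the SSD and compare the two numbers.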
Mistake #2: XCP-NG Instead of Proxmox
XCP-NG is not bad. There are some features in XCP-NG/Xen Orchestra that I would take over Proxmox today. Choosing XCP-NG as my hypervisor when starting my homelab simply came down to exploration. That being said, there are a few reasons I chose to move off:
Support - Proxmox has a larger community; basically anything I want to do in Proxmox is in someone's blog, a YouTube video, or the Proxmox documentation. KVM also has more support than XenServer: it is used by most hyperscalers and maintained as part of the Linux kernel.
USB and PCIe passthrough - These features just did not want to work for me. I jumped through a lot of hoops to no avail.
Container support - Proxmox can run LXC containers natively, without needing to spin up a "docker host".
Mitigation #2
Pretty obvious what the mitigation was here. Moving to Proxmox has saved me a lot of struggle. Proxmox comes with headaches of its own, but in my opinion its feature set is simply more complete than XCP-NG's at the moment. Maybe one day I'll move back or find another hypervisor.
Mistake #3: VM disks over NFS
I had good intentions behind this, but it was pretty stupid. My thought was that I could make a few VMs highly available by putting their disks on NFS. If their host goes down, they can be restarted on another host automatically, and their disk is still available. This is a bad idea because it turns my NAS into a single point of failure.
Sometimes we make bad decisions and never face the consequences. That was not the case for this decision. I went out of town for two weeks and my NAS went offline on day 3 of the trip. 11 days of my cluster being offline as a result.
Mitigation #3
I moved all VM disks onto local storage. If a service is critical (e.g., DNS), it should run redundantly across hosts and be backed up. Given fast enough disks and network, restores are rather quick. Another option is Ceph in Proxmox for highly available storage that survives a host going offline.
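To make the before/after concrete, here is a sketch of what the relevant entries in Proxmox's `/etc/pve/storage.cfg` might look like after the move. The storage names, pool path, server address, and export path are all illustrative, not from my actual setup:

```
# Local ZFS pool holds VM disks and container rootfs
zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

# NAS is demoted to a backup target only -- no VM images served from it
nfs: nas-backup
        server 192.168.1.50
        export /mnt/tank/backups
        content backup
```

The key detail is the `content` line: the NFS storage is restricted to `backup`, so nothing running day-to-day depends on the NAS being online.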
Mistake & Mitigation #3.5: Host Interdependence
This is just a continuation of the previous mistake. Turning my NAS strictly into a backup target instead of running VMs, containers, etc. off of it has saved me a lot of stress and taken my reliability score up a notch. Please do not run VM disks over NFS unless you have HA storage too.
Conclusion
Now that you've thrown up from disgust at my poor architectural choices, I also want to say that I've made some good choices too. However, I quickly overlook the good decisions and go back to the drawing board when my server goes offline. That being said, two weeks ago I woke up to etcd failing and did a full restore in 30 minutes, so I am getting somewhere.
TL;DR
Don't run your server OS off a USB drive.
Proxmox has more support than XCP-NG.
Don't run VM disks over NFS.
Avoid introducing single points of failure.