discussion: convert single villa Proxmox nodes to bare NixOS service hosts (clan + disko + nixos-anywhere) #5
Context & goal
Companion to docs/architecture/nixos-container-substrate-map.md and the clan.lol evaluation. The substrate map already concludes the migration is "convert a Proxmox node to a bare NixOS host (running nspawn and Incus), node by node… start at villa (3 PVE nodes → convert one while two carry load)" and that networking is the hardest part, not deploying the services.

This issue is a focused discussion + runbook for the host-conversion mechanics of doing that on one villa node, specifically around clan, disko, a boot image, and nixos-anywhere in-place ("from within").
Villa facts: 3 PVE nodes, villa-pve-01/02/03 at 10.1.10.11/.12/.13 (Mgmt VLAN 10); services on Server VLAN 40 (10.1.40.x, gw 10.1.40.1). Console reality: no out-of-band IPMI/iKVM. The only console is Proxmox noVNC, which lives on the other nodes, so the node being converted loses its own console the instant it kexecs away from Proxmox.

The decision in one paragraph
We want one villa node to stop being a Proxmox hypervisor and instead be a bare NixOS host that carries services directly (host-intrinsic infra as native systemd; cattle as nspawn with their own VLAN-40 IP; optional pets as Incus-LXC). Getting NixOS onto the bare metal is the new step. Three mechanisms exist — nixos-anywhere in-place (kexec), boot-image/USB then install, and clan (which wraps nixos-anywhere+disko). All three end at the same place; they differ in how much we bet on the node coming back without a console. Given no IPMI, the recommendation is drain-then-convert (evacuate services to the two peers first so the node is disposable) and use a boot image for the first conversion, keeping in-place kexec for later nodes once the disko layout and a booting closure are proven.
Options for getting NixOS onto a villa node
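- Option A: nixos-anywhere in-place (kexec) over root SSH.
- Option B: boot image/USB, then run disko-install/nixos-anywhere locally.
- Option C: clan (clan machines install), which wraps nixos-anywhere + disko and adds vars secret generators + a uniform CLI.

All three require: a correct disko layout for the node's real disk, a NixOS closure that boots on this hardware, and root SSH (A/C) or physical access (B).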
Feasibility of in-place "from within" given our console reality
Mechanically: yes — this is exactly what nixos-anywhere is designed for. Default phases:
kexec (boot a minimal NixOS installer that runs entirely from RAM, so the old OS is no longer mounted) → disko (destroy + create + mount the target disk: the same disk the OS booted from, now free because we're in RAM) → install (build/copy closure, nixos-install) → reboot. Budget ~1.5–2.5 GB RAM for the kexec installer (the 1 GB README floor OOMs in practice without zram); villa nodes easily clear this. Build the closure on the deployer or a builder, not on the target.

Operationally: high-risk here, because we have no out-of-band console. Failure modes that brick the node until someone walks to it:
- A wrong device path in the disko config → wipes the wrong disk.

What makes it acceptable: drain-then-convert. Evacuate the node's guests to the two peer PVE nodes first (live-migrate or restore from vzdump), so the node carries no load and is fully disposable. Then a failed in-place attempt costs a drive to the rack, not an outage. De-risk further with nixos-anywhere --vm-test (boot the config in a local VM first) and --generate-hardware-config (capture the real NIC/disk).

Full runbook — converting one drained villa node
Step 0 — Drain the node
vzdump + restore the node's CTs/VMs onto the other two villa PVE nodes. Confirm every service is healthy on its new home (Traefik route, VLAN-40 reachability) before touching the node. The node must carry zero load.

Step 1 — disko layout (read the real disk first!)
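A minimal sketch of what a single-disk disko layout could look like, following the shape of disko's own GPT example. Everything specific here is an assumption rather than a fact about the villa nodes: UEFI boot, one disk, ext4 root, a 512M ESP, and above all the device path, which is a placeholder that must be replaced with the by-id path read from the real node.

```nix
# disko.nix (hypothetical file name): single-disk GPT layout, UEFI assumed.
{
  disko.devices = {
    disk.main = {
      type = "disk";
      # PLACEHOLDER: replace with the node's real /dev/disk/by-id/... path.
      # A wrong value here means disko wipes the wrong disk.
      device = "/dev/disk/by-id/REPLACE-WITH-REAL-DISK-ID";
      content = {
        type = "gpt";
        partitions = {
          # EFI system partition (assumes the node boots via UEFI).
          ESP = {
            size = "512M";
            type = "EF00";
            content = {
              type = "filesystem";
              format = "vfat";
              mountpoint = "/boot";
              mountOptions = [ "umask=0077" ];
            };
          };
          # Root filesystem takes the rest of the disk (ext4 is an assumption).
          root = {
            size = "100%";
            content = {
              type = "filesystem";
              format = "ext4";
              mountpoint = "/";
            };
          };
        };
      };
    };
  };
}
```

Using a stable by-id path rather than a /dev/sdX or /dev/nvmeXnY name is the main design choice: kernel device names can change between the Proxmox boot and the kexec installer, while by-id names should not.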
Step 2 — host config skeleton (plain NixOS first; clan optional)
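One possible shape for this skeleton, as a plain NixOS module. It assumes a flake that passes its inputs to modules via specialArgs (so `inputs` is available), and that disko and sops-nix are flake inputs under those names. The imported file paths, the SSH key, and the stateVersion are hypothetical placeholders, not values from the repo.

```nix
# Hypothetical host module for the converted node; plain NixOS, no clan.
{ inputs, ... }:
{
  imports = [
    inputs.disko.nixosModules.disko      # disko module (flake input name assumed)
    inputs.sops-nix.nixosModules.sops    # sops-nix module (flake input name assumed)
    ./disko.nix                          # hypothetical path: disk layout from Step 1
    ./hardware-configuration.nix         # hypothetical path: generated on the real node
    ./network.nix                        # hypothetical path: systemd-networkd from Step 4
  ];

  # Bootloader: systemd-boot assumes UEFI; swap for GRUB on a BIOS node.
  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;

  networking.hostName = "villa-pve-01";

  # Without a console, SSH is the only way back in after the first boot.
  services.openssh.enable = true;
  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAA...placeholder deployer"  # placeholder, not a real key
  ];

  # Service feature modules (native systemd services, containers.<name>,
  # optionally virtualisation.incus.enable) are added once the bare host boots.

  # Assumption: set this to the NixOS release actually installed.
  system.stateVersion = "24.11";
}
```

Keeping the first closure this small is deliberate: the fewer moving parts in the image that has to boot unattended, the easier it is to tell a disk or bootloader problem from a service problem.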
A bare service host needs: the disko module, a bootloader, the systemd-networkd config (Step 4), sops-nix, and the service feature modules (native systemd + containers.<name> nspawn, optionally virtualisation.incus.enable).

Option C add-on — wrap with clan (only if we adopt clan for this tier):
clan machines install villa-pve-01 == nixos-anywhere + disko (Option A/C). clan machines update == nixos-rebuild switch --target-host (identical to today's homelab apply). clan's real win here is vars (secret generators on top of sops-nix); its mesh-VPN feature is redundant with our UniFi WireGuard and should stay disabled.

Step 3 — install
Step 4 — networking rebuild (the hard part)
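As a rough picture of the target, the sketch below shows the simpler per-VLAN shape: VLAN 10 and VLAN 40 netdevs on the uplink, a plain bridge on VLAN 40, and one nspawn guest attached to it. It is not the VLAN-aware bridge variant, and it is not taken from villa-router-01.nix or gateway.nix. The NIC name, the /24 prefixes, the guest name, and the guest address are all assumptions; the Mgmt gateway is not stated in this issue, so no default route is configured.

```nix
# network.nix (hypothetical file name): per-VLAN netdevs plus a VLAN-40 bridge.
{
  networking.useNetworkd = true;
  networking.useDHCP = false;

  systemd.network = {
    enable = true;

    netdevs = {
      "20-vlan10" = {
        netdevConfig = { Kind = "vlan"; Name = "vlan10"; };
        vlanConfig.Id = 10;
      };
      "20-vlan40" = {
        netdevConfig = { Kind = "vlan"; Name = "vlan40"; };
        vlanConfig.Id = 40;
      };
      # Untagged bridge that guests join; VLAN 40 tagging happens on vlan40.
      "25-br40" = {
        netdevConfig = { Kind = "bridge"; Name = "br40"; };
      };
    };

    networks = {
      "30-uplink" = {
        matchConfig.Name = "enp1s0";        # ASSUMPTION: real NIC name differs
        vlan = [ "vlan10" "vlan40" ];
        networkConfig.LinkLocalAddressing = "no";
      };
      "40-vlan10" = {
        matchConfig.Name = "vlan10";
        address = [ "10.1.10.11/24" ];      # prefix length is an assumption
        # Mgmt gateway/default route intentionally omitted (not stated here).
      };
      "40-vlan40" = {
        matchConfig.Name = "vlan40";
        networkConfig.Bridge = "br40";      # enslave VLAN 40 into the guest bridge
      };
      "50-br40" = {
        matchConfig.Name = "br40";
        networkConfig.LinkLocalAddressing = "no";
        linkConfig.RequiredForOnline = "no";
      };
    };
  };

  # Hypothetical guest: nspawn container whose veth is attached to br40.
  containers.example = {
    autoStart = true;
    privateNetwork = true;
    hostBridge = "br40";
    localAddress = "10.1.40.50/24";         # hypothetical VLAN-40 address
    config = {
      # The guest's default route via 10.1.40.1 still has to be configured.
      system.stateVersion = "24.11";        # assumption
    };
  };
}
```

This trades flexibility for simplicity: one bridge per VLAN avoids per-port VLAN filtering syntax entirely, at the cost of needing another netdev and bridge for every additional guest VLAN.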
Reproduce Proxmox's VLAN-aware vmbr0.10/.40 in systemd-networkd. The repo already has the VLAN-on-NIC pattern in modules/hosts/villa-router-01.nix and modules/features/gateway.nix; extend it with VLAN 40 and a bridge+veth for guests.

For nspawn/Incus guests that need their own 10.1.40.NN identity: a VLAN-aware bridge (VLANFiltering=yes) with veth pairs tagged into VLAN 40 (verify bridgeVLANs/PVID syntax against systemd.network(5)). This bridge+veth-per-guest is the genuinely new networking work the substrate map flags. Keep dnsmasq/router roles unchanged.

Step 5 — bring services back
- Host-intrinsic infra: native systemd (services.<name>.enable).
- Cattle: containers.<name> (nspawn) with /data bind-mounted from NAS, veth on VLAN 40, Traefik route + DNS/tunnel updated.
- Optional pets: Incus-LXC (incus launch from the lxc-container.nix image), restore /data; adopt clan vars/update here if desired.

Step 6 — validation
- Host reachable on Mgmt VLAN 10 (10.1.10.11) and each service reachable on VLAN 40 (10.1.40.x).
- /data mounted, nixos-rebuild switch clean.

Step 7 — rollback
Where clan helps vs where it doesn't
clan fits everywhere except nspawn. For the nspawn tier, plain NixOS + sops-nix remains the honest substrate; clan would only manage the host layer. clan's standalone win regardless of substrate is vars (cherry-pickable onto our existing sops-nix without full adoption).

Recommendation
vars for secret generation; consider clan as the management layer for the Incus-pet tier only; do not enable clan's mesh VPN; keep homelab apply for nspawn hosts.

Open questions
- VLANFiltering + per-veth bridgeVLANs, or simpler per-VLAN netdevs + macvlan for guests?
- Adopt clan (for vars), or just replicate the vars pattern on sops-nix?

Generated as an exploration/decision artifact. No infrastructure changed. Grounded in nixos-anywhere/disko/clan docs and the repo's existing villa-router-01.nix + gateway.nix systemd-networkd patterns.