discussion: pivot homelab to a netbird mesh + clan + NixOS/Incus (drop UniFi reliance) #6
Decision study (no migration committed): re-found the homelab on a single netbird mesh connectivity/trust/DNS plane managed by clan, with NixOS + Incus as the only runtime, and clan data-mesher/coredns for internal DNS. UniFi is demoted to dumb VLAN transport (it stays for coworking tenants/printers/client VLANs).
📄 Full study: docs/superpowers/specs/2026-06-01-homelab-mesh-clan-incus-pivot-design.md
Companions: substrate map · clan.lol eval · builds on #5.
The pivot in one picture
Why the "double-hop" worry doesn't apply
netbird splits control plane (Management/Signal/Relay = coordination only) from data plane (WireGuard peer-to-peer). ICE prefers Host↔Host LAN-direct → NAT hole-punch → relay (last resort, still E2E-encrypted). So a same-site NAS↔streaming (or you↔NAS) connection is direct over the LAN and never hairpins through the Hetzner control plane. Control-plane location ≠ data path. Self-hosting Management/Signal/Relay on util-01 gives an always-reachable coordinator; if it blips, existing tunnels keep flowing.
Key design decisions
internet/tor today). North-star: register netbird there so clan machines update uses it with fallback. Staged via a netbird clanService (deploy/enrol agent, targetHost = mesh name) that works on day one. This is "add netbird to clan's mesh options."
data-mesher (gossip host records) + coredns → retires router dnsmasq, UniFi static DNS, and the *.home.miskam.xyz split-horizon.
qmrestore retire. disko + clan machines install provision (per #5).
*.miskam.xyz via cloudflared-peer → Traefik → mesh backend. Coworking tenants still hit Traefik by hostname on the VLAN; Traefik is the single VLAN↔mesh bridge.
The experiment: one beefy, physically-reachable node (game-01)
Physical access ⇒ the #5 console risk is gone. Reinstall game-01 as bare NixOS+Incus → netbird agent + join mesh → clan-manage → 2–3 Incus guests → data-mesher/coredns names → mesh-native edge for one service.
Success criteria: (a) same-site peer reaches a guest over the LAN (endpoint is local IP, not relay); (b) cross-site reach to villa NAS; (c) control-plane down → existing tunnels survive; (d) clan machines update over the mesh; (e) one service published via cloudflared-peer→Traefik→mesh; (f) <service>.<tld> resolves via coredns/data-mesher.
Trade-offs (honest)
Simpler: no homelab inter-VLAN firewall, no site-to-site WG upkeep, no split-DNS, no Proxmox/qmrestore, one connectivity model, ACLs+secrets in git/clan.
Costs: a netbird control-plane to run/back-up (mitigated by self-host + P2P survival); overlay security surface (permissive default = flat network); per-guest identity/key management; learning curve (clan networking internals, netbird ACLs, data-mesher); and it adds a plane on top of the tenant VLANs that must persist — the simplification is for the homelab's mental model, not the building's.
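To make the "permissive default = flat network" cost concrete, here is a minimal Python sketch of group→group policy evaluation. It is a toy model, not netbird's API: the Policy class, the allowed helper, the group names, and the peer names are all hypothetical, and it assumes a default-deny evaluator in which a single all→all rule is what flattens the overlay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical group->group rule; an empty port set means all ports."""
    src_group: str
    dst_group: str
    ports: frozenset


def allowed(policies, groups, src, dst, port):
    """Default-deny: traffic passes only if at least one policy matches."""
    for p in policies:
        if (p.src_group in groups[src]
                and p.dst_group in groups[dst]
                and (not p.ports or port in p.ports)):
            return True
    return False


# Hypothetical peers and tag-based group membership (illustration only).
groups = {
    "laptop": {"all", "admins"},
    "nas": {"all", "storage"},
    "guest-web": {"all", "web"},
}

# Permissive default: one all->all rule, so every peer reaches every peer.
flat = [Policy("all", "all", frozenset())]

# Tag-based groups: only admins reach storage (445) and web (443).
tagged = [
    Policy("admins", "storage", frozenset({445})),
    Policy("admins", "web", frozenset({443})),
]

print(allowed(flat, groups, "guest-web", "nas", 445))    # True
print(allowed(tagged, groups, "guest-web", "nas", 445))  # False
print(allowed(tagged, groups, "laptop", "nas", 445))     # True
```

In this model the only difference between the flat and the segmented overlay is the rule set, which is why the per-guest identity and ACL upkeep shows up as a cost.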
Open questions (discussion)
targetHost cleanly, or is a clan-native mesh (mycelium) lower-friction for the deploy path while netbird is the service mesh?
*.miskam.xyz / *.home.miskam.xyz, or adopt a .clan-style internal TLD via coredns?
incus export/snapshot + /data→NAS→Storage Box).
Recommendation: run the game-01 experiment (low-risk, no production dependency); decide fleet migration only after (a)–(f) pass.
Thanks, folded these into the study (updated on main, new Resolutions + Still open sections). Point by point:
1. "Why more/less friction?" Fair catch, my framing was wrong. The friction isn't performance, it's clan-integration effort, and only ZeroTier is actually wired into clan's networking selector. mycelium and yggdrasil are not registered backends either, so mycelium is not lower-friction than netbird. Honest options for the clan deploy path:
targetHost = netbird addr
Plan: (a) for game-01, (b) as the north star. (c) is an escape hatch, not the plan.
2. "Is this a question for netbird?" Yes, entirely. netbird's model is groups + group→group policies + ports/port-ranges + posture checks, so per-service-tag = a netbird group. We use tag-based groups; not a granularity we have to design around.
3. "Whatever simplifies certs." That points at keeping the real *.miskam.xyz names (existing Let's Encrypt wildcard via Cloudflare DNS-01 just keeps working) and not adopting a .clan TLD (which would need an internal CA + trust distribution). coredns/data-mesher resolve those same real names to mesh IPs internally. Decided.
4. Migration ordering ("not sure"). Left open; default lean is the villa node (per #5) after the experiment passes. No commitment.
5. Backups. Adopted your split: service-level export is primary (portable, substrate-independent, NAS→Storage Box), incus export/snapshot is secondary for fast restore. The study no longer leans Incus-first.
Net: Q2/Q3/Q5 resolved, Q1 reframed, Q4 still open. Next move is the low-risk game-01 experiment; say the word and I'll turn §5 into an implementation plan.
Update: clan-maintainer signal reopens the mesh choice
Found the actual #clan:clan.lol discussion on netbird, straight from clan core devs, and it's a meaningful caution:
Impact on this design:
So I've reopened the mesh choice (it was implicitly "netbird") and reframed the experiment as a bake-off. The honest fork is data-plane quality (netbird) vs declarative-fit + clan-integration (clan-native):
Plan: on game-01, stand up netbird AND a clan-native mesh (mycelium as the local-first contender; zerotier as integrated baseline) side-by-side and score (a) local-direct, (b) cross-site, (g) declarativeness, (h) setup pain, (i) route/roaming self-healing. Decision rule: local-first is non-negotiable; netbird must clearly win the data plane to justify its declarative cost, else the clan-native mesh wins.
Full detail folded into the study (updated on main): new Community signal, Mesh choice (reopened), and bake-off §5 sections. Thanks for digging up that chat; it changed the experiment from "deploy netbird" to "find out which overlay actually fits."
Decision: start the experiment with ZeroTier
Going with the pragmatic path first. ZeroTier is the only mesh fully integrated into clan's networking selector, so it's the lowest-friction way to get a working clan-over-mesh node and, crucially, clan machines update over the mesh works out of the box, no backend to build.
Rationale: prove the substrate (bare NixOS + Incus + clan + data-mesher + mesh-native edge) before sinking time into netbird's non-declarative control plane or self-hosting a headscale DERP. If the substrate has issues, find out with the easy mesh. ZeroTier then becomes the baseline the challengers are measured against.
Setup notes for the experiment: self-host the ZeroTier controller (e.g. on util-01; clan supports a self-hosted controller that needn't be globally reachable), and run a moon if we want cross-site relay independence from ZeroTier's roots. Accept its knocks for a baseline: BSL license + controller SPOF.
Bake-off order on game-01: ZeroTier (baseline) → then headscale/tailscale (best NAT + declarative ACL file), netbird, mycelium as challengers. A challenger only displaces ZeroTier on clear evidence: better declarativeness, cross-site NAT without a third-party hop, or independence/license. If none clearly wins, ZeroTier stays (it already works in clan).
Folded into the study (updated on main): §5 now leads with ZeroTier, mesh-choice read + decision rule updated, headscale/tailscale added as a first-class candidate.
Next: when ready to actually run it, turn §5 into a ZeroTier-first implementation plan (reinstall game-01 → clan + ZeroTier controller → Incus guests → data-mesher/coredns → mesh-native edge → capture the (a)–(i) baseline).
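The bake-off decision rule can be sketched as a small Python function. This is illustrative only: the MeshResult fields, the boolean scoring, and the sample challengers are assumptions invented for the example, not measurements. Only the rule itself comes from this thread (local-first is non-negotiable, a challenger needs a clear win, otherwise ZeroTier stays). Picking the first qualifying challenger in bake-off order is also an assumption; the thread does not say how to break ties.

```python
from dataclasses import dataclass


@dataclass
class MeshResult:
    """Hypothetical per-mesh verdicts, each judged against the baseline."""
    name: str
    local_direct: bool             # criterion (a): same-site traffic stays on the LAN
    more_declarative: bool         # clearly more declarative than the baseline
    nat_without_third_party: bool  # cross-site NAT traversal with no third-party hop
    better_independence: bool      # independence/license advantage


def pick_mesh(baseline, challengers):
    """Return the winning mesh name; the baseline wins unless clearly beaten."""
    for c in challengers:  # evaluated in bake-off order
        if not c.local_direct:
            continue  # local-first is non-negotiable
        if (c.more_declarative
                or c.nat_without_third_party
                or c.better_independence):
            return c.name
    return baseline


# Placeholder inputs, NOT real bake-off results.
challengers = [
    MeshResult("challenger-x", True, False, False, False),  # works, no clear win
    MeshResult("challenger-y", False, True, True, True),    # fails local-first
]
print(pick_mesh("zerotier", challengers))  # zerotier
```

Keeping the verdicts as explicit booleans forces each "clear evidence" call to be written down per criterion before the rule is applied.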