discussion: pivot homelab to a netbird mesh + clan + NixOS/Incus (drop UniFi reliance) #6

Open
opened 2026-06-01 08:35:55 +00:00 by mxm · 4 comments
Owner

Decision study (no migration committed): re-found the homelab on a single netbird mesh connectivity/trust/DNS plane managed by clan, with NixOS + Incus as the only runtime, and clan data-mesher/coredns for internal DNS. UniFi is demoted to dumb VLAN transport (it stays for coworking tenants/printers/client VLANs).

📄 Full study: docs/superpowers/specs/2026-06-01-homelab-mesh-clan-incus-pivot-design.md
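Companions: substrate map (https://git.miskam.xyz/mxm/homelab/src/branch/main/docs/architecture/nixos-container-substrate-map.md) · clan.lol eval (https://git.miskam.xyz/mxm/homelab/src/branch/main/docs/logs/2026-05-30-clan-lol-evaluation.md) · builds on #5.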

The pivot in one picture

            BEFORE (today)                          AFTER (north star)
 UniFi = transport + VLAN segmentation +    UniFi = dumb VLAN transport + WAN
 inter-VLAN firewall + site-to-site WG +    (kept for coworking tenants)
 partial DNS                                netbird mesh = homelab connectivity
 services = VLAN-40 IPs via zone rules      + trust + DNS plane; every host &
 + WG tunnel cross-site; split-DNS          first-class service is a mesh peer,
 substrate = Proxmox + qmrestore            reachable anywhere it has a path,
 mgmt = homelab CLI over paramiko           LAN-direct when local.
                                            substrate = bare NixOS + Incus
                                            mgmt = clan (machines/update/vars)

Why the "double-hop" worry doesn't apply

netbird splits control plane (Management/Signal/Relay = coordination only) from data plane (WireGuard peer-to-peer). ICE prefers Host↔Host LAN-direct → NAT hole-punch → relay (last resort, still E2E-encrypted). So a same-site NAS↔streaming (or you↔NAS) connection is direct over the LAN and never hairpins through the Hetzner control plane. Control-plane location ≠ data path. Self-hosting Management/Signal/Relay on util-01 gives an always-reachable coordinator; if it blips, existing tunnels keep flowing.
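
The preference order above (host, then NAT hole-punch, then relay) falls out of the standard ICE candidate priority formula. A minimal sketch, assuming the recommended type preferences from RFC 8445; the addresses are illustrative, and real ICE ranks candidate pairs rather than single candidates, so this is a simplification and not netbird's actual code:

```python
from dataclasses import dataclass

# Recommended type preferences from RFC 8445 (host > peer-reflexive >
# server-reflexive > relayed).
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    kind: str               # "host", "prflx", "srflx" (NAT hole-punch) or "relay"
    address: str            # illustrative address, not from the real network
    local_pref: int = 65535
    component_id: int = 1


def priority(c: Candidate) -> int:
    """RFC 8445 formula: 2^24 * type pref + 2^8 * local pref + (256 - component id)."""
    return (TYPE_PREFERENCE[c.kind] << 24) + (c.local_pref << 8) + (256 - c.component_id)


def preferred(candidates: list[Candidate]) -> Candidate:
    """Return the highest-priority candidate."""
    return max(candidates, key=priority)


candidates = [
    Candidate("relay", "203.0.113.10"),
    Candidate("srflx", "198.51.100.7"),
    Candidate("host", "192.168.40.12"),
]
best = preferred(candidates)
print(best.kind, best.address)
```

With a host candidate present on both sides, the LAN path outranks the others regardless of where the coordinator runs.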

Key design decisions

  • netbird as a clan networking backend — clan already has a pluggable, priority-based network selector (internet/tor today). North-star: register netbird there so clan machines update uses it with fallback. Staged via a netbird clanService (deploy/enrol agent, targetHost = mesh name) that works on day one. This is "add netbird to clan's mesh options."
  • Segmentation = netbird ACLs in git, replacing UniFi's inter-VLAN firewall matrix for homelab traffic. The mesh is an overlay on top of the existing VLANs.
  • Internal DNS = data-mesher (gossip host records) + coredns → retires router dnsmasq, UniFi static DNS, and the *.home.miskam.xyz split-horizon.
  • Runtime = NixOS + Incus; Proxmox + qmrestore retire. disko + clan machines install provision (per #5).
  • Hybrid guest identity: host-intrinsic infra on the host's mesh identity; cattle/pets that need their own identity run a netbird agent inside the Incus guest (first-class peers).
  • Mesh-native edge: Traefik + cloudflared run as mesh peers; public *.miskam.xyz via cloudflared-peer → Traefik → mesh backend. Coworking tenants still hit Traefik by hostname on the VLAN — Traefik is the single VLAN↔mesh bridge.
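
The "priority-based selector with fallback" idea from the first bullet can be sketched as follows. This is a hypothetical illustration, not clan's actual API: `Network`, `select_network`, the priorities, and the host names are all assumptions made for the example.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    name: str       # label only, e.g. "netbird", "internet", "tor"
    priority: int   # higher is tried first
    host: str       # address the machine is reachable at on this network
    port: int = 22


def tcp_reachable(net: Network, timeout: float = 2.0) -> bool:
    """Probe the SSH port; any socket error counts as unreachable."""
    try:
        with socket.create_connection((net.host, net.port), timeout=timeout):
            return True
    except OSError:
        return False


def select_network(networks, probe=tcp_reachable):
    """Return the highest-priority network whose probe succeeds, or None."""
    for net in sorted(networks, key=lambda n: n.priority, reverse=True):
        if probe(net):
            return net
    return None


# Hypothetical host names; the probe is faked so the mesh looks "down".
nets = [
    Network("internet", 1, "game-01.example.org"),
    Network("netbird", 10, "game-01.mesh.example"),
]
chosen = select_network(nets, probe=lambda n: n.name != "netbird")
print(chosen.name)
```

The mesh is tried first because of its higher priority, and the deploy falls back to the next network only when the probe fails.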

The experiment: one beefy, physically-reachable node (game-01)

Physical access ⇒ the #5 console risk is gone. Reinstall game-01 as bare NixOS+Incus → netbird agent + join mesh → clan-manage → 2–3 Incus guests → data-mesher/coredns names → mesh-native edge for one service.

Success criteria: (a) same-site peer reaches a guest over the LAN (endpoint is local IP, not relay); (b) cross-site reach to villa NAS; (c) control-plane down → existing tunnels survive; (d) clan machines update over the mesh; (e) one service published via cloudflared-peer→Traefik→mesh; (f) <service>.<tld> resolves via coredns/data-mesher.
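
Criterion (a) can be checked mechanically once the peer's endpoint is known (for example from the mesh client's status output). A minimal sketch, assuming the endpoint arrives as an `ip:port` string; the LAN range and addresses below are illustrative assumptions, not the real site layout:

```python
import ipaddress


def is_lan_direct(endpoint: str, lan_cidrs: list[str]) -> bool:
    """True if the endpoint's IP falls inside one of the site's LAN ranges.

    Expects "ip:port" (IPv4) or "[ip]:port" (IPv6).
    """
    host = endpoint.rsplit(":", 1)[0].strip("[]")  # drop ":port", unwrap brackets
    ip = ipaddress.ip_address(host)
    return any(ip in ipaddress.ip_network(cidr) for cidr in lan_cidrs)


lan = ["192.168.40.0/24"]  # assumed LAN range for the example
print(is_lan_direct("192.168.40.12:51820", lan))   # local endpoint
print(is_lan_direct("203.0.113.10:51820", lan))    # relay or remote endpoint
```

A relay or hole-punched path would show a public or remote address and fail the check.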

Trade-offs (honest)

Simpler: no homelab inter-VLAN firewall, no site-to-site WG upkeep, no split-DNS, no Proxmox/qmrestore, one connectivity model, ACLs+secrets in git/clan.
Costs: a netbird control plane to run and back up (mitigated by self-hosting plus P2P survival); an overlay security surface (a permissive default policy means a flat network); per-guest identity/key management; a learning curve (clan networking internals, netbird ACLs, data-mesher); and it adds a plane on top of the tenant VLANs, which must persist. The simplification is for the homelab's mental model, not the building's.

Open questions (discussion)

  1. Does netbird's NixOS module + self-hosted control plane satisfy clan's targetHost cleanly, or is a clan-native mesh (mycelium) lower-friction for the deploy path while netbird is the service mesh?
  2. ACL granularity: per-host vs per-service-tag?
  3. Naming: keep *.miskam.xyz/*.home.miskam.xyz, or adopt a .clan-style internal TLD via coredns?
  4. Migration ordering after the experiment (villa node per #5 next?).
  5. Backups in an Incus world (incus export/snapshot + /data→NAS→Storage Box).

Recommendation: run the game-01 experiment (low-risk, no production dependency); decide fleet migration only after (a)–(f) pass.

Author
Owner
  1. I don't know. Why would this have more or less friction?
  2. Is this a question for netbird?
  3. I think whatever simplifies the usage of certs.
  4. Not sure.
  5. I think I want to export at the service level first; the Incus exports are more for getting containers restored fast.
Author
Owner

Thanks — folded these into the study (updated on main, new Resolutions + Still open sections). Point by point:

1. "Why more/less friction?" — Fair catch, my framing was wrong. The friction isn't performance, it's clan-integration effort, and only ZeroTier is actually wired into clan's networking selector — mycelium and yggdrasil are not registered backends either, so mycelium is not lower-friction than netbird. Honest options for the clan deploy path:

| Option | Fallback? | Effort |
|---|---|---|
| (a) targetHost = netbird addr | none | trivial; fine for the experiment |
| (b) build the netbird networking backend | yes (internet/tor) | real work; the headline artifact |
| (c) ZeroTier for the deploy path only | yes | low, but two overlays; cuts against "simplify" |

Plan: (a) for game-01, (b) as the north star. (c) is an escape hatch, not the plan.

2. "Is this a question for netbird?" — Yes, entirely. netbird's model is groups + group→group policies + ports/port-ranges + posture checks, so per-service-tag = a netbird group. We use tag-based groups; not a granularity we have to design around.

3. "Whatever simplifies certs." — That points at keeping the real *.miskam.xyz names (existing Let's Encrypt wildcard via Cloudflare DNS-01 just keeps working) and not adopting a .clan TLD (which would need an internal CA + trust distribution). coredns/data-mesher resolve those same real names to mesh IPs internally. Decided.

4. Migration ordering — "not sure." — Left open; default lean is the villa node (per #5) after the experiment passes. No commitment.

5. Backups. — Adopted your split: service-level export is primary (portable, substrate-independent, NAS→Storage Box), incus export/snapshot is secondary for fast restore. The study no longer leans Incus-first.

Net: Q2/Q3/Q5 resolved, Q1 reframed, Q4 still open. Next move is the low-risk game-01 experiment — say the word and I'll turn §5 into an implementation plan.
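
The group-to-group policy model from point 2 can be sketched as a default-deny check. This is a hypothetical illustration, not netbird's API or data format: the group names, peer names, and ports are assumptions, policies are treated as one-directional, and it assumes the permissive default policy has been removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    src_group: str
    dst_group: str
    ports: frozenset[int]   # allowed destination ports


def allowed(src: str, dst: str, port: int,
            groups: dict[str, set[str]], policies: list[Policy]) -> bool:
    """Default deny: allow only if a policy links a group of src to a group of dst on this port."""
    src_groups = {g for g, members in groups.items() if src in members}
    dst_groups = {g for g, members in groups.items() if dst in members}
    return any(
        p.src_group in src_groups and p.dst_group in dst_groups and port in p.ports
        for p in policies
    )


# Hypothetical service-tag groups and rules.
groups = {"edge": {"traefik"}, "media": {"streaming"}, "storage": {"nas"}}
policies = [
    Policy("edge", "media", frozenset({8096})),
    Policy("media", "storage", frozenset({2049})),
]
print(allowed("traefik", "streaming", 8096, groups, policies))  # True
print(allowed("traefik", "nas", 2049, groups, policies))        # False
```

Tagging a new service then means adding a peer to a group; the rule set itself stays unchanged.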

Author
Owner

Update: clan-maintainer signal reopens the mesh choice

Found the actual #clan:clan.lol discussion on netbird — straight from clan core devs, and it's a meaningful caution:

  • lassulus (clan core): "netbird looked interesting but I haven't tried it yet."
  • Qubasa (clan core): "netbird is terrible to setup."
  • fosskar: OIDC no longer required (built-in IdP now) — "though netbird still can't be set up fully declaratively."
  • loopunit: "Netbird currently has a lot of issues with network routes — disconnect/reconnect every instance on add/remove of a route, also on sleep/hibernate."

Impact on this design:

  1. "Not fully declarative" is the headline strike. For a clan/NixOS homelab the whole point is git/Nix; netbird's peers/groups/policies/routes live in its management DB/API, not the flake (at best imperative via API/Terraform). clan's native meshes are declarative.
  2. Route bugs specifically kill the host-advertises-guest-subnet model → use per-guest agents, avoid netbird routes; also a roaming-laptop flag.
  3. "Terrible to setup" → a clan networking-backend integration is harder than assumed.

So I've reopened the mesh choice (it was implicitly "netbird") and reframed the experiment as a bake-off. The honest fork is data-plane quality (netbird) vs declarative-fit + clan-integration (clan-native):

| Mesh | Local discovery | Declarative-fit | clan selector | Central server | Knock |
|---|---|---|---|---|---|
| netbird | yes (ICE) | poor | none | yes | terrible setup, route bugs, not declarative |
| mycelium | yes (locality-aware) | good | inventory only | no | clan calls it "unreliable" |
| zerotier | yes | good | integrated | yes | BSL license, controller SPOF |
| yggdrasil | no | good | inventory only | no | no local discovery → out |

Plan: on game-01, stand up netbird AND a clan-native mesh (mycelium as the local-first contender; zerotier as integrated baseline) side-by-side and score (a) local-direct, (b) cross-site, (g) declarativeness, (h) setup pain, (i) route/roaming self-healing. Decision rule: local-first is non-negotiable; netbird must clearly win the data plane to justify its declarative cost, else the clan-native mesh wins.
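
The decision rule can be written down as a small function so the bake-off result is not argued after the fact. A minimal sketch: the `local_discovery` flags come from the comparison table, while the scores and the `clear_margin` threshold are PLACEHOLDERS for illustration only, since nothing has been measured yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mesh:
    name: str
    local_discovery: bool     # from the comparison table
    clan_native: bool
    data_plane_score: float   # placeholder; to be measured on game-01


def decide(meshes, clear_margin=1.0):
    """Gate on local-first, then require a non-native mesh to clearly beat the best clan-native one."""
    eligible = [m for m in meshes if m.local_discovery]
    native = [m for m in eligible if m.clan_native]
    others = [m for m in eligible if not m.clan_native]
    best_native = max(native, key=lambda m: m.data_plane_score, default=None)
    best_other = max(others, key=lambda m: m.data_plane_score, default=None)
    if best_native is None:
        return best_other
    if best_other and best_other.data_plane_score - best_native.data_plane_score >= clear_margin:
        return best_other
    return best_native


# Scores below are made-up placeholders, not results.
meshes = [
    Mesh("netbird", True, False, 3.0),
    Mesh("mycelium", True, True, 2.5),
    Mesh("zerotier", True, True, 2.0),
    Mesh("yggdrasil", False, True, 5.0),
]
print(decide(meshes).name)
```

With these placeholders, yggdrasil is excluded by the local-first gate despite its score, and netbird's lead is too small to count as a clear win.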

Full detail folded into the study (updated on main) — new Community signal, Mesh choice (reopened), and bake-off §5 sections. Thanks for digging up that chat — it changed the experiment from "deploy netbird" to "find out which overlay actually fits."

Author
Owner

Decision: start the experiment with ZeroTier

Going with the pragmatic path first. ZeroTier is the only mesh fully integrated into clan's networking selector, so it's the lowest-friction way to get a working clan-over-mesh node and — crucially — clan machines update over the mesh works out of the box, no backend to build.

Rationale: prove the substrate (bare NixOS + Incus + clan + data-mesher + mesh-native edge) before sinking time into netbird's non-declarative control plane or self-hosting a headscale DERP. If the substrate has issues, find out with the easy mesh. ZeroTier then becomes the baseline the challengers are measured against.

Setup notes for the experiment: self-host the ZeroTier controller (e.g. on util-01 — clan supports a self-hosted controller that needn't be globally reachable), and run a moon if we want cross-site relay independence from ZeroTier's roots. Accept its knocks for a baseline: BSL license + controller SPOF.

Bake-off order on game-01: ZeroTier (baseline) → then headscale/tailscale (best NAT + declarative ACL file), netbird, mycelium as challengers. A challenger only displaces ZeroTier on clear evidence — better declarativeness, cross-site NAT without a third-party hop, or independence/license. If none clearly wins, ZeroTier stays (it already works in clan).

Folded into the study (updated on main) — §5 now leads with ZeroTier, mesh-choice read + decision rule updated, headscale/tailscale added as a first-class candidate.

Next: when ready to actually run it, turn §5 into a ZeroTier-first implementation plan (reinstall game-01 → clan + ZeroTier controller → Incus guests → data-mesher/coredns → mesh-native edge → capture the (a)–(i) baseline).
