feat(nix): deploy forgejo-01 with PostgreSQL 17 — Phase 4 #1

Open
mxm wants to merge 0 commits from 67-phase4-forgejo-n8n-matrix into main

Summary

Phase 4 continues: forgejo-01 (CT 116) migrated to NixOS with full SSH git access.

Changes

forgejo-01 (CT 116) — #67

  • services.forgejo with non-LTS package (v14.0.2) for migration version compatibility
  • PostgreSQL 17 with dataDir on /data/postgres/17/main
  • Runs as git user for SSH compatibility (git@git.miskam.xyz)
  • DB ownership transferred from the forgejo role to the git role
  • Secrets via sops (internal token, JWT, LFS JWT)
  • Data migrated from Debian rootfs to /data (postgres + forgejo + repos)
  • Z-type tmpfiles rules to fix ownership of files carried over with Debian UIDs

Also included (not yet deployed)

  • n8n-01 container config
  • matrix-01 container config (custom binary approach for continuwuity)

Verified

  • forgejo + postgresql active
  • External: https://git.miskam.xyz → 200
  • SSH: git@git.miskam.xyz authenticated, push works
  • Repositories intact and accessible
  • just test passes

Closes #67

mxm added 195 commits 2026-03-31 10:20:05 +00:00
Ansible playbooks for automated CT/HV setup (create, restore, migrate):
- ssh-keys, security-baseline, alert-scripts roles
- Inventory for all wow CTs (101-119), hv-04, villa HVs (01-03), nas-01
- Vault-encrypted Telegram + Matrix alert secrets
- Per-host UFW rules, unattended-upgrades with 3rd-party repo origins

Also includes:
- Site-to-site WireGuard VPN runbook and debug log
- Cross-site migration runbook (updated with Ansible references)
- CLAUDE.md: villa site, cross-site DNS, reverse proxy docs
- WiFi QR codes for new coworking SSIDs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in all uncommitted changes from main working directory:
- Updated hosts.yml with villa site containers
- Ansible roles (traefik, cloudflared, prometheus)
- Updated ansible.cfg, group_vars, security-baseline
- Updated migrate-ct.sh and documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add hostname, service_port, public fields to 10 service hosts.
Add site_prefix, site_gateway, site_traefik_ip, site_tunnel_name,
site_tunnel_id, site_primary_hv to wow and villa group_vars.

These fields make hosts.yml the single source of truth for
DNS, traefik routes, and Cloudflare tunnel config generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New role generates DNS overrides from inventory metadata:
- Service hostnames resolve to local/remote site_traefik_ip
- Host records for .wow/.villa suffixed infrastructure
- Non-inventory services (oi, hv-04, prometheus) hard-coded

mxm-infra.yml applies ssh-keys, security-baseline, alert-scripts,
dnsmasq, traefik, and cloudflared to a target traefik host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reads inventory to sync Cloudflare state:
- DNS CNAMEs: ensures public services point to correct tunnel
- Tunnel ingress: builds rules from inventory + special cases
- Supports --dry-run for safe previewing
- Uses flarectl for DNS, CF API for tunnel config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Route generation iterates all container hosts with hostname +
service_port fields, building routes automatically. Enabled via
traefik_generate_routes: true per traefik host.
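A minimal sketch of the generation loop described above, assuming hosts are plain dicts with an `ip` key and optional `hostname` and `service_port` keys. The `build_routes` name, the sample addresses, and the output shape are hypothetical and only illustrate the filtering rule, not the role's actual template.

```python
def build_routes(hosts: dict[str, dict]) -> dict[str, dict]:
    """Build one route per host that declares both hostname and service_port."""
    routes = {}
    for name, host in sorted(hosts.items()):
        hostname = host.get("hostname")
        port = host.get("service_port")
        if not hostname or not port:
            continue  # hosts without both fields get no generated route
        routes[name] = {
            "rule": f"Host(`{hostname}`)",
            "backend": f"http://{host['ip']}:{port}",
        }
    return routes


# Hypothetical inventory excerpt.
hosts = {
    "n8n-01": {"ip": "10.0.0.11", "hostname": "n8n.example.org", "service_port": 5678},
    "gw-01": {"ip": "10.0.0.1"},
}
routes = build_routes(hosts)
```

Only `n8n-01` produces a route here; `gw-01` is skipped because it declares neither field.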

Manual traefik_routes replaced with traefik_routes_extra for
non-inventory services (proxmox, oi, forgejo-ssh).

Extra routers (matrix well-known) moved to continuwuity host
as extra_routers field, picked up during generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
migrate-service.sh orchestrates full service migration:
1. Calls migrate-ct.sh for backup/restore/network
2. Runs mxm-infra.yml on both traefik hosts (routes + dnsmasq)
3. Runs cf-sync.sh for Cloudflare DNS + tunnel updates

migrate-ct.sh refactored to read site metadata from group_vars
via yq instead of hardcoded case functions, supporting future
third-site additions without code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip auto-generation for hosts with matching custom_route_files
  entry (prevents burgenlandkreis route overwrite)
- Use safe empty-array expansion pattern for bash < 4.4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ct-migration-cross-site.md: add migrate-service.sh section, update
  site conventions to reference group_vars, mark automated steps
- villa-traefik-tunnel-setup.md: update for mxm-infra.yml playbook,
  dnsmasq DNS, and automated migration checklist
- Add development journal documenting architecture and decisions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- media-vpn.md: qbt.home.miskam.xyz → qbt.miskam.xyz, jf.home.miskam.xyz → jf.miskam.xyz
- ct-migration-script log: traefik is per-site (not always hv-04.wow), add group_vars dep, add superseded-by note
- villa-traefik-tunnel-setup: rewrite for dnsmasq + mxm-infra.yml automation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete ansible.cfg (no roles/playbooks left), replace subprocess +
temp-file vault decryption with Ansible's VaultLib Python API, and
remove ansible-lint from dev shell.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace Ansible playbooks/roles and bash scripts with a unified Python CLI
(`homelab`) that handles converge, CT lifecycle, and Cloudflare sync.

- Add homelab CLI: converge engine, inventory loader, SSH pool, Proxmox/CF clients
- Add converge roles: ssh-keys, security-baseline, alert-scripts, postgres,
  prometheus, traefik, cloudflared, dnsmasq, cf-dns, cf-tunnel
- Add CT lifecycle commands: create, migrate, destroy (via Proxmox API)
- Consistent role-to-host assignment via boolean flags (traefik, cloudflared, prometheus)
- Fix DHCP host IPs (burgenlandkreis, openclaw) and CF API token in vault
- Remove all Ansible roles, playbooks, and shell scripts
- Update CLAUDE.md, docs, and runbooks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace per-role applies_to overrides and boolean flags (postgres,
traefik, cloudflared, dnsmasq, prometheus) with a unified roles list
on each host. default_roles in group_vars provides the baseline
(ssh-keys, security-baseline, alert-scripts), merged with per-host
roles. Hosts can set no_default_roles: true to skip defaults entirely
(e.g. gateways). Also adds ansible_modules abstraction layer, gateway-dns
role, and SSH connection pooling improvements.
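The merge rule above can be sketched as follows. This is an illustration under assumptions: the `resolve_roles` name is hypothetical, hosts are plain dicts, and the ordering (defaults first, then per-host roles without duplicates) is a guess rather than the loader's confirmed behaviour.

```python
def resolve_roles(host: dict, default_roles: list[str]) -> list[str]:
    """Merge default_roles with per-host roles, honouring no_default_roles."""
    own = list(host.get("roles", []))
    if host.get("no_default_roles"):
        return own  # e.g. gateways: only explicitly listed roles apply
    merged = list(default_roles)
    for role in own:
        if role not in merged:
            merged.append(role)
    return merged


defaults = ["ssh-keys", "security-baseline", "alert-scripts"]
db_roles = resolve_roles({"roles": ["postgres"]}, defaults)
gw_roles = resolve_roles({"no_default_roles": True}, defaults)
```

A host that ends up with an empty list, like the gateway example, has nothing to converge.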

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
proxmoxer's ssh_paramiko backend hardcodes allow_agent=True and
look_for_keys=True. With many SSH keys in the agent, this burns
through fail2ban's retry limit before trying the correct key.

Monkey-patch _connect() to use allow_agent=False, look_for_keys=False,
and strip the @pam realm from the Proxmox API username for SSH auth.
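As a rough illustration of the connection settings described above (not the monkey-patch itself), the sketch below builds keyword arguments that `paramiko.SSHClient.connect` accepts. The `ssh_connect_kwargs` helper and the key path are hypothetical.

```python
def ssh_connect_kwargs(api_user: str, key_path: str) -> dict:
    """Return connect() arguments that offer exactly one key to the server."""
    username = api_user.split("@", 1)[0]  # "root@pam" -> "root"
    return {
        "username": username,
        "key_filename": key_path,
        "allow_agent": False,    # do not try every key held by the SSH agent
        "look_for_keys": False,  # do not probe default key files in ~/.ssh
    }


kwargs = ssh_connect_kwargs("root@pam", "/path/to/site_key")
```

With both options disabled, only the key named in `key_filename` is offered, so the retry limit is not consumed by unrelated keys.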

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ct create: --postgres replaced with --role (repeatable)
- Document inventory_defaults() and manual_steps() classmethods
- Add ROLE_REGISTRY to engine docs
- Add pytest dev dependency to deps table
- Document ProxmoxClient SSH monkey-patch
- Fix stale file paths (cli_ct.py → ct/command.py, inventory.py → inventory/loader.py)
- Update monitoring-01 hostname and public status
- Note CT 110 (postgres) removed from inventory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Introduced `test_ct_integration.py` to validate the lifecycle of a container (CT) including creation, data persistence, rebuild, and destruction.
- Implemented various checks to ensure the correct state of the CT and its data volume throughout the lifecycle.
- Added `test_inventory_loader.py` to test YAML parsing, Jinja2 variable resolution, and role merging logic in the inventory loader.
- Updated `uv.lock` to include `pytest` and `ruff` as optional development dependencies.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Renames: gw-01.wow→gw-01, gw-01.villa→gw-02, hv-04.wow→hv-04,
hv-01..03.villa→hv-01..03. Site is determined by inventory group
structure; internal DNS auto-generates <name>.<site> records.

Also fixes migrate command to derive site from inventory host lookup
instead of parsing the dot-separated HV name.
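A small sketch of deriving the site from group membership instead of from the name, assuming the inventory is a mapping of site name to its hosts. The `site_of` and `internal_fqdn` helpers are hypothetical names used for illustration only.

```python
def site_of(host_name: str, inventory: dict[str, dict[str, dict]]) -> str:
    """Find the site whose group contains the host."""
    for site, hosts in inventory.items():
        if host_name in hosts:
            return site
    raise KeyError(f"{host_name} is not in the inventory")


def internal_fqdn(host_name: str, inventory: dict[str, dict[str, dict]]) -> str:
    """Build the auto-generated <name>.<site> record."""
    return f"{host_name}.{site_of(host_name, inventory)}"


inventory = {
    "wow": {"gw-01": {}, "hv-04": {}},
    "villa": {"gw-02": {}, "hv-01": {}},
}
record = internal_fqdn("gw-02", inventory)
```

Because the lookup never parses the host name, renaming `gw-01.villa` to `gw-02` does not change how its site is resolved.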

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set no_default_roles on mail-01 and util-01 (DNS-only hosts, not
managed by converge). Also skip hosts with no roles from converge
host selection to avoid unnecessary SSH attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Introduced ensure_service_running function to manage systemd services, ensuring they are enabled and active.
- Updated roles (cloudflared, postgres, prometheus, traefik, etc.) to utilize the new service management function.
- Refactored service checks and restarts across various roles for consistency and improved readability.
- Added Grafana role for installation and configuration, including datasource provisioning.
- Modified inventory loading to replace hostname with service_name for better clarity.
- Implemented SSH command functionality for easier remote access to inventory hosts.
Covers rebuild plan: add /data volume, migrate app data, upgrade
postgres 16→17 via dump/restore, static IP, converge postgres role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ticker role: manages symlinks, systemd units, service state
  for the burgenlandkreis.events app on CT 101
- Fix postgres role: early return if package install fails, preventing
  cascading chown error on missing postgres user
- Fix ct rebuild/migrate: auto-clear stale SSH known_hosts entries
- Fix security-baseline: generate en_US.UTF-8 locale on fresh CTs
- Update CT 101 inventory: add ticker role
- Update plan docs with lessons learned

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- postgres role: detect existing data on /data after rebuild, skip
  copying empty default cluster, just reconfigure data_directory
- ct rebuild/migrate: auto-accept new SSH host key via ssh-keyscan
  after CT starts (complements the known_hosts removal on destroy)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Install curl and rsync on all containers. Runs before ssh-keys
so utilities are available to subsequent roles.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Install python3-requests, deploy scrape-calendars runner script +
systemd timer (runs at :30, before ticker-update at :00), ensure
ICS output directory exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hypervisors only got bare name records (e.g. hv-04) while containers
got both bare and suffixed (e.g. n8n-01.villa). Add <name>.<site>
records for hypervisors so hv-04.wow and hv-01.villa resolve.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces declarative resource API for converge roles, replacing
verbose boilerplate with r.file(), r.dir(), and r.check_cmd() calls
that accumulate RoleResult entries and return Resource handles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Introduced new skill documentation for homelab role authoring, detailing role structure, resources API, and patterns for role implementation.
- Updated AGENTS.md to CLAUDE.md, providing guidance for Claude Code, including project overview, installation instructions, key files, dependencies, and repository structure.
- Enhanced sections on CLI usage, task runner, and inventory management, while adding detailed descriptions of roles and security measures across sites.
refactor(ct): integrate traefik DNS reconvergence into converge and migrate functions
chore: add shared YAML utilities for atomic dumping
- Introduced new documentation for homelab operations, including runbooks, experience logs, tool notes, and plans/specs.
- Created skills for drift checking, inventory management, investigation processes, networking configurations, role authoring, and updating documentation.
- Each skill provides structured guidance for specific tasks, ensuring consistency and clarity in infrastructure management.
- Enhanced the inventory system documentation, detailing file structures, loading pipelines, and common fields.
- Established protocols for auditing configuration consistency and mapping code changes to documentation updates.
- verify() now takes (self, r: Resources) instead of (host, ssh, ctx),
  using r.check_cmd() and r._result() to accumulate into the same
  Resources instance used by resources()
- LocalRole.converge(ctx) replaced with resources(r: LocalResources),
  new LocalResources class provides inventory, dry_run, and _result()
- Grafana: replace urllib.request download with r.apt_key()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add Rich Live progress table during convergence showing per-host status
- Group role results under role headers for better scannability
- Use Rich Syntax with diff lexer for colored dry-run diffs
- Buffer results and print in deterministic host order (not as-completed)
- Remove duplicate summary from command.py (engine already prints one)
- Remove shadow Console instance from command.py

Closes #2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
itertools.groupby only groups consecutive elements — if results from
the same role aren't contiguous, they'd appear as separate groups.
defaultdict(list) handles this correctly regardless of ordering.
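The difference is easy to reproduce with a small example. The result tuples below are made up for illustration; only the two standard-library tools are taken from the message above.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical (role, status) results arriving in non-contiguous order.
results = [("ssh-keys", "ok"), ("traefik", "changed"), ("ssh-keys", "changed")]

# groupby starts a new group every time the key changes.
grouped_consecutive = [
    (role, [status for _, status in items])
    for role, items in groupby(results, key=lambda r: r[0])
]

# defaultdict(list) collects by key regardless of input order.
by_role = defaultdict(list)
for role, status in results:
    by_role[role].append(status)
```

Here `grouped_consecutive` contains `ssh-keys` twice, while `by_role` holds a single `ssh-keys` entry with both statuses.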

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #3
Reviewed-on: #5
Reviewed-on: #7
Reviewed-on: #18
Reviewed-on: #20
Two new converge roles for CI/CD runner infrastructure:

- `docker`: installs Docker CE from official apt repo (key, repo, packages, service)
- `forgejo-runner`: installs binary v6.3.1, deploys config + systemd unit, handles
  offline registration via create-runner-file. Depends on docker role.

Also adds `forgejo_runner_secret` field to Host model and inventory loader.

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the tmux-based Claude Code development environment on util-01,
including session naming convention, directory layout, authentication,
and network reachability from WireGuard tunnel.

Closes #22

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #25
Verified from util-01 that both traefik CTs (wow + villa) respond to
ping and SSH on port 2222. The earlier "no return route" finding was
transient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chore: update settings.json with additional permissions for SSH commands
fix: remove .claude directory from .gitignore
Reviewed-on: #21
Reviewed-on: #27
create-runner-file writes .runner to CWD, which defaults to /root/ over SSH.
Prefix with cd to /var/lib/forgejo-runner so the file lands where the systemd
unit and config expect it.

Also adds the vault-encrypted runner secret to inventory and a setup runbook.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels were under container: instead of runner: in config template, causing
empty labels in .runner file. Added --connect flag to create-runner-file so
registration syncs with Forgejo server.

Switch CI checkout to code.forgejo.org/actions/checkout@v6 (Go-based) —
the uv container image has no Node.js, which actions/checkout@v4 requires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The uv container image has no Node.js, which all versions of
actions/checkout require. Use git clone directly instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
uv sync --dev uses PEP 735 [dependency-groups], not
[project.optional-dependencies]. Ruff was only in the latter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_remove_known_host and _accept_host_key call ssh-keygen/ssh-keyscan via
subprocess. CI container lacks openssh-client, causing FileNotFoundError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Plain python -c doesn't see venv packages. Use uv run to activate the venv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yaml.safe_load rejects unknown YAML tags. Register a passthrough
constructor for !vault so encrypted values are treated as strings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add converge role for lyrion-01 (CT 120, villa site) that handles apt repo
setup, package installation, /data volume migration, systemd BindPaths
override, and service management with health check verification.

Closes #17

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GITHUB_REF_NAME resolves to the PR number (not the branch name) on
Forgejo pull_request events, causing `git clone -b 30` to fail.
Use GITHUB_HEAD_REF (the actual branch name) with fallback to
GITHUB_REF_NAME for push events.
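The fallback can be expressed in a few lines. This sketch uses Python for illustration; the workflow itself may implement the same logic in shell, and the `clone_branch` name is hypothetical.

```python
def clone_branch(env: dict[str, str]) -> str:
    """Prefer GITHUB_HEAD_REF (set on pull_request events), else GITHUB_REF_NAME."""
    # An empty or missing GITHUB_HEAD_REF falls through to GITHUB_REF_NAME.
    return env.get("GITHUB_HEAD_REF") or env["GITHUB_REF_NAME"]


pr_branch = clone_branch({"GITHUB_HEAD_REF": "feature-x", "GITHUB_REF_NAME": "30"})
push_branch = clone_branch({"GITHUB_HEAD_REF": "", "GITHUB_REF_NAME": "main"})
```

In a real job the mapping would be `os.environ`; a plain dict keeps the example self-contained.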

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #29
Add converge role for n8n workflow automation (CT 111, villa site):
- NodeSource apt repo + Node.js 22.x + npm global n8n install
- Environment config with postgres connection, webhook URL
- /data mount check + user folder migration from ~/.n8n
- systemd unit with postgresql dependency
- Health check via /healthz endpoint
- inventory_defaults: UFW port 5678, NodeSource uu_extra_origins

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The n8n service should run as the dedicated n8n user, not root.
Also restores Restart=always and adds WorkingDirectory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #28
Adds openclaw, uptime-kuma, qbittorrent, jellyfin, dokuwiki, and
vaultwarden converge roles with /data migration, systemd management,
and verify steps. All live-tested via converge-dry and live converge.

Closes #10, Closes #11, Closes #12, Closes #13, Closes #14, Closes #15

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #30
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drops ports 80/443 from UFW (gateway is localhost-only on 18789).
Access token is a placeholder — needs vault encryption.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop git-clone source build approach. Install via pnpm with pinned version
to /opt/pnpm/. Dedicated openclaw system user runs the gateway service.
Config templated from inventory vars (Matrix credentials).
Symlink ~/.openclaw -> /data/openclaw for persistence.
Cleanup old /opt/openclaw source build and npm global install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Version-aware install check (grep version in pnpm list output)
- Guard chown with ownership check (skip if already openclaw)
- Validate symlink target, not just existence
- Validate all required Matrix fields, not just access_token
- Strip comments per project convention
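To illustrate the "validate symlink target, not just existence" item, here is a sketch that runs locally with the standard library. The role performs its checks over SSH, so the `symlink_ok` helper and the temporary directory are stand-ins, not the role's code.

```python
import os
import tempfile


def symlink_ok(link: str, expected_target: str) -> bool:
    """True only if link is a symlink and points at the expected target."""
    return os.path.islink(link) and os.readlink(link) == expected_target


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good")
    stale = os.path.join(tmp, "stale")
    os.symlink("/data/openclaw", good)   # points where persistence expects
    os.symlink("/opt/openclaw", stale)   # exists, but points at the old location
    good_result = symlink_ok(good, "/data/openclaw")
    stale_result = symlink_ok(stale, "/data/openclaw")
```

An existence-only check would accept both links; comparing the target rejects the stale one.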

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pnpm requires PNPM_HOME to be in PATH, not just set as env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing config has 5 API keys, Telegram channels, agent settings,
and room-specific Matrix config that can't be reasonably templated.
Config is now managed by `openclaw onboard` and lives on /data/ (persistent).

Remove openclaw_matrix_* inventory fields — only openclaw_version remains.
Role becomes pure scaffolding: binary + user + dirs + systemd.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v2026.3.22 has two upstream bugs:
- Plugin API version hardcoded as 1.2.0 (github #51494)
- Matrix plugin (@openclaw/matrix) incompatible with new plugin SDK

v2026.3.13 bundles the Matrix plugin and works correctly.
Also cleans up old systemd override dir from previous role version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the full rewrite, upstream bugs blocking v2026.3.22,
and criteria for when it's safe to upgrade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pnpm needs git to resolve @whiskeysockets/baileys GitHub ref.
Re-add openclaw-01 to inventory after CT rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bundled matrix extension in v2026.3.13 requires this peer dependency
but pnpm global install doesn't resolve it. Install via npm directly
into openclaw's package directory in the pnpm store.

Also adds git as apt dependency (needed by @whiskeysockets/baileys).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate flake.nix to the vimjoyer pattern (flake-parts + import-tree)
for composable, auto-discovered NixOS modules. Adds nixos-generators
and sops-nix inputs for the upcoming NixOS LXC migration.

Closes #38

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hostConfig reads hosts.yml at Nix eval time and returns per-container
config (site, system arch, SSH pub key, reboot time, postgres fields).
serviceManifests exports all containers' ports as JSON for routing.

Adds arch field to HVs and hv field to monitoring-01/n8n-01 in
hosts.yml. Checks in site SSH public keys under keys/.

Closes #39

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Minimal nixosConfiguration that declares homelab.services (grafana
+ prometheus) to verify serviceManifests extracts ports correctly.
Will be replaced by real container configs in later issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registers flake.nixosModules.base with: SSH key-only auth, nftables
firewall (deny-by-default, port 22), fail2ban, auto-upgrade with
configurable reboot time, nix-ld, and homelab.services option registry
for feature modules to declare their ports.

Updates test-stub to use the real base module instead of inlining the
homelab.services option.

Closes #40

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds .sops.yaml with homelab-wide age key (same key as dotfiles repo)
scoped to modules/secrets/*.yaml. Includes encrypted test secret to
verify round-trip encryption/decryption.

Age private key lives in existing sops/Bitwarden setup. Deployed to
/data/age.key on containers during homelab deploy.

Closes #41

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds unprivileged ops user (wheel + systemd-journal groups) with site
SSH key and passwordless sudo. Drops root SSH login entirely —
root only accessible via sudo or pct exec from HV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the ops user + no-root-SSH design decision and its
implications for CLI commands (ops SSHes as ops, deploy uses pct exec).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds nixos (bool) and hv (str) fields to the Host dataclass, inventory
loader, and inventory table output. Converge engine skips NixOS hosts
with a message. Adds hv field to all container entries in hosts.yml.

Closes #42

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements `homelab init <name> <ctid> <site>` which generates:
- hosts.yml entry (IP derived from site prefix + CTID, nixos: true)
- modules/containers/<name>.nix (hostConfig wiring + feature imports)
- modules/secrets/<name>.yaml (sops-encrypted placeholder)

Supports --public, --feature (repeatable), --memory, --swap, --cores,
--data-disk. Preflight rejects duplicate names, CTIDs, and existing
.nix files. Feature validation checks modules/features/ exists.
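A sketch of the IP derivation and the duplicate checks, under two loud assumptions: that the site prefix is the first three octets and the CTID becomes the last octet, and that hosts are dicts with a `ctid` key. The `derive_ip` and `preflight` names and the sample prefix are hypothetical.

```python
def derive_ip(site_prefix: str, ctid: int) -> str:
    """Assumed scheme: site prefix plus the CTID as the last octet."""
    return f"{site_prefix}.{ctid}"


def preflight(name: str, ctid: int, hosts: dict[str, dict]) -> list[str]:
    """Collect reasons to refuse creating the new entry."""
    errors = []
    if name in hosts:
        errors.append(f"host name already exists: {name}")
    if any(h.get("ctid") == ctid for h in hosts.values()):
        errors.append(f"CTID already in use: {ctid}")
    return errors


hosts = {"forgejo-01": {"ctid": 116}}
new_ip = derive_ip("10.0.0", 121)
clash = preflight("forgejo-01", 116, hosts)
clean = preflight("wiki-01", 121, hosts)
```

Returning a list of errors rather than raising on the first lets the command report every conflict at once.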

Closes #43

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wraps nix build to produce LXC tarballs. Validates host exists in
inventory with nixos: true, resolves architecture from hv → HV arch,
and runs nix build .#packages.<system>.<name>-lxc.
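The resolution chain can be sketched as below, assuming the hypervisor's `arch` field already holds the Nix system string (for example `x86_64-linux`). The `lxc_build_attr` helper and the sample data are hypothetical.

```python
def lxc_build_attr(name: str, hosts: dict[str, dict], hypervisors: dict[str, dict]) -> str:
    """Resolve host -> hv -> arch and return the flake attribute to build."""
    host = hosts[name]  # KeyError if the host is not in the inventory
    if not host.get("nixos"):
        raise ValueError(f"{name} is not marked nixos: true")
    system = hypervisors[host["hv"]]["arch"]
    return f".#packages.{system}.{name}-lxc"


hosts = {"monitoring-01": {"nixos": True, "hv": "hv-04"}}
hypervisors = {"hv-04": {"arch": "x86_64-linux"}}
attr = lxc_build_attr("monitoring-01", hosts, hypervisors)
```

The returned string is what would be passed to `nix build`.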

Closes #44

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements ops subapp with 10 commands:
- shell: interactive SSH (system ssh, PTY via execvp)
- status/restart/logs: service management via paramiko
- snapshot/resize/destroy: container lifecycle via PVE API
- migrate: intra-cluster via pct migrate, cross-site stubbed
- db create: PostgreSQL createdb/createuser via SSH
- cert rotate: ACME renewal trigger via SSH

NixOS containers SSH as ops user (not root). Cross-site migration
documented but not yet implemented (requires vzdump transfer + rebuild).

Closes #47

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements deploy with two scenarios: create (new CT) and rebuild
(swap rootfs, preserve /data). Uploads tarball via SFTP, seeds age
key via pct exec on first deploy. Supports --build and --dry-run.

Rebuild uses the shelter pattern from ct rebuild: detach + rename
/data LVM volume before destroy, reattach after recreate.

Closes #45

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Syncs traefik routes + Cloudflare DNS + tunnel ingress in 5 phases:
1. Collect service manifests from nix eval (NixOS containers)
2. Combine with inventory service_port (Debian containers)
3. Write route YAML files to traefik CTs via SSH + GC unmanaged
4. Ensure Cloudflare DNS CNAMEs for public services
5. Sync Cloudflare tunnel ingress rules per site

Supports --dry-run. Absorbs cf-dns and cf-tunnel converge role logic
for the unified routing pipeline.
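Phases 1 to 3 can be sketched as a merge followed by a set difference. This is an illustration under assumptions: manifests are reduced to a name-to-port mapping, route files are named `<host>.yml`, and NixOS manifests take precedence on conflict. None of these details are confirmed by the commit message, and both function names are hypothetical.

```python
def combine_services(nix_manifests: dict[str, int], inventory: dict[str, dict]) -> dict[str, int]:
    """Merge Debian service_port entries with ports from NixOS manifests."""
    services = {
        name: host["service_port"]
        for name, host in inventory.items()
        if not host.get("nixos") and "service_port" in host
    }
    services.update(nix_manifests)  # assumed precedence: NixOS manifests win
    return services


def stale_route_files(existing: set[str], services: dict[str, int]) -> set[str]:
    """Route files present on the traefik CT that no service accounts for."""
    wanted = {f"{name}.yml" for name in services}
    return existing - wanted


inventory = {"n8n-01": {"service_port": 5678}, "monitoring-01": {"nixos": True}}
services = combine_services({"monitoring-01": 3000}, inventory)
stale = stale_route_files({"n8n-01.yml", "old-app.yml"}, services)
```

With `--dry-run`, the stale set would be reported rather than deleted.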

Closes #46

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Creates grafana.nix and prometheus.nix feature modules with /data/
persistence, service port registration, and sops secret for Grafana's
secret_key. Prometheus uses file_sd_configs for runtime target
discovery (targets written to /data/prometheus/targets.json by CLI).

Adds monitoring-01.nix container config importing base + grafana +
prometheus. Sets nixos: true in hosts.yml. Adds LXC stub config to
base.nix (fileSystems + boot.loader for LXC compatibility).

Part of #48

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of overriding ExecStart or symlinking via tmpfiles, bind mount
/data/prometheus onto /var/lib/prometheus. This lets the NixOS module
manage its own ExecStart and config file generation while data persists
on /data across rebuilds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Grafana runs as the grafana user and needs read access to the
decrypted secret at /run/secrets/grafana_secret_key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NixOS defaults to useDHCP=true which overrides the static IP config
that Proxmox writes to /etc/systemd/network/eth0.network. Disable
NixOS DHCP and enable systemd-networkd so Proxmox-managed networking
works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds lib.prometheusTargets — collects all hosts with prometheus_node
from hosts.yml at Nix eval time. Prometheus module gets a
scrapeTargets option (list of {name, target}). Replaces file_sd_configs
with static_configs baked into the image.
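For orientation, the shape of the generated data can be sketched in Python, although the actual implementation lives in Nix. The list of `{name, target}` entries comes from the message above; the `static_configs` helper, the `host` label, and the sample targets are assumptions.

```python
def static_configs(scrape_targets: list[dict]) -> list[dict]:
    """Turn {name, target} entries into Prometheus static_configs entries."""
    return [
        {"targets": [t["target"]], "labels": {"host": t["name"]}}
        for t in scrape_targets
    ]


# Hypothetical targets collected from hosts.yml.
scrape_targets = [
    {"name": "hv-04", "target": "10.0.0.4:9100"},
    {"name": "n8n-01", "target": "10.0.0.11:9100"},
]
configs = static_configs(scrape_targets)
```

Because the list is computed at build time, it only changes when the image is rebuilt.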

Trade-off: adding a monitored host requires rebuild of monitoring-01.
Tech debt tracked in #49.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LXC first boot doesn't run NixOS activation scripts reliably.
sops-nix's useSystemdActivation creates a proper systemd service
that runs before dependent services (grafana, etc).

Also disable SSH key import — we use a dedicated age key at /data/age.key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unprivileged LXC containers can't mount debugfs, causing
systemctl is-system-running to report 'degraded'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #51
- Age key seeded via host-side mount (not pct exec) with UID 100000
  for unprivileged LXC containers
- Age key path reads from ~/.config/sops/age/keys.txt or SOPS_AGE_KEY_FILE
- Accept 'degraded' system status (sys-kernel-debug.mount expected failure)
- Add ProxmoxClient.seed_age_key() method
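A sketch of the key path lookup, assuming `SOPS_AGE_KEY_FILE` takes precedence over the default location when it is set. The `age_key_path` helper is a hypothetical name, and the environment and home directory are passed in to keep the example testable.

```python
from pathlib import Path


def age_key_path(env: dict[str, str], home: Path) -> Path:
    """Resolve the local age key: explicit override first, then the default."""
    override = env.get("SOPS_AGE_KEY_FILE")
    if override:
        return Path(override)
    return home / ".config" / "sops" / "age" / "keys.txt"


default_path = age_key_path({}, Path("/home/ops"))
override_path = age_key_path({"SOPS_AGE_KEY_FILE": "/secure/age.key"}, Path("/home/ops"))
```

In real use the arguments would be `os.environ` and `Path.home()`.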

Fixes #52

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix tarball path: resolve actual .tar.xz inside nix output dir
- Fix pve.node property call (was called as method)
- seed_age_key resolves data LV from mp0 config (not hardcoded disk-1)
- vzdump backs up to storagebox (not /tmp)
- Remove unused AGE_KEY_PATH constant
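The tarball path fix can be illustrated with a recursive search of the build output directory. The directory layout and file name below are made up for the example, and `find_tarball` is a hypothetical helper, not the CLI's function.

```python
import tempfile
from pathlib import Path


def find_tarball(out_dir: Path) -> Path:
    """Return the single .tar.xz inside a nix build output directory."""
    matches = sorted(out_dir.rglob("*.tar.xz"))
    if len(matches) != 1:
        raise FileNotFoundError(f"expected one .tar.xz under {out_dir}, found {len(matches)}")
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "result"
    (out_dir / "tarball").mkdir(parents=True)
    (out_dir / "tarball" / "nixos-system-x86_64-linux.tar.xz").touch()
    found_name = find_tarball(out_dir).name
```

Failing loudly on zero or multiple matches avoids uploading the wrong file.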

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use pct set instead of API for data volume (avoids proxmoxer 500
  on LVM thin pool warnings)
- Use full NixOS path for systemctl in pct exec verification
- Catch RuntimeError on verification to avoid crash

Tested: created and verified both NixOS CT 199 and Debian CT 198.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewed-on: #53
- Add modules/features/uptime-kuma.nix: service feature with Node.js systemd service
- Add modules/containers/uptime-kuma-01.nix: container configuration for CT 114
- Add modules/secrets/uptime-kuma-01.yaml: encrypted sops secrets file

Closes #54
- Set nixos: true for CT 114 (uptime-kuma-01) to enable NixOS build/deploy commands
- Phase 2 migration prerequisite
- Note uptime-kuma-01 (CT 114) and monitoring-01 (CT 115) are now NixOS
- Add comprehensive deployment/rollback runbook for Phase 2 migration
- Includes pre-deployment checklist, step-by-step deploy, verification, and rollback procedures
- Mark uptime-kuma-01 (CT 114) and monitoring-01 (CT 115) as NixOS
- Aligns with CLAUDE.md updates
- Remove nested /data/uptime-kuma fileSystems binding (LXC handles /data mount)
- Remove ProtectSystem=strict (too restrictive for unprivileged containers)
- Change ConditionPathIsMountPoint to ConditionPathExists (LXC compatible)
- Add data.mount to service dependencies for proper ordering
- Relax security constraints to suit the containerized environment
- Remove WorkingDirectory requirement (causes CHDIR failures)
- Use full path in ExecStart to server.js
- Run preStart as root to create /data directory and install dependencies
- Relax security constraints (NoNewPrivileges, PrivateTmp, ProtectHome) for LXC
- Add NODE_ENV=production
- Fix ownership/permissions before service start
- Add systemd.tmpfiles.rules to create /data/uptime-kuma with proper permissions
- Remove manual mkdir from preStart (now handled by tmpfiles)
- Add dependency on systemd-tmpfiles-setup.service
- Restore WorkingDirectory (will be created by tmpfiles first)
- This is the proper NixOS pattern for container directory initialization
- Replace npm-based uptime-kuma installation with simple HTTP stub server
- Stub server responds on port 3001 with basic status page
- Allows Phase 2 testing without external npm registry dependencies
- Note: Full uptime-kuma packaging should be done in later phase with proper nixpkgs integration
- Document that uptime-kuma in nixpkgs is broken
- Clarify what was validated (pattern) vs what is deferred (packaging)
- Recommend Phase 2 as 'Architecture Validation' not full service deployment
- Set proper expectations for Phase 3 uptime-kuma packaging work
Issue found during Phase 2:
- Deploying a different service to the same CT while the old /data is still attached causes corruption
- deploy.py preserves the old /data without checking whether it is compatible
- monitoring-01 data (grafana, prometheus) persisted when deploying uptime-kuma-01

This document proposes Option A (explicit data cleanup) + service markers:
- Check service identity before preserving /data
- Back up old /data if service changes
- Mount clean /data for new services
- Allow recovery via --restore-data flag

Implementation deferred to Phase 3+ (before production deployments)
Add reference to DEPLOYMENT-DATA-ISSUE.md in Phase 2 status.
Mark /data safety as HIGH PRIORITY blocker for Phase 3.
- Use upstream services.uptime-kuma module (v2.2.0) instead of stub
- Static user + tmpfiles + ReadWritePaths for /data persistence
- Open firewall port 3001, bind to 0.0.0.0 for traefik access
- Update runbook and phase notes with lessons learned
Reviewed-on: #56
- Use upstream services.dokuwiki module (nginx + php-fpm)
- stateDir on /data/dokuwiki/data for wiki content persistence
- usersFile on /data/dokuwiki/users.auth.php for user accounts
- ACL and settings declared in Nix config
- Firewall port 80 open for traefik access
- All wiki pages, users, and config preserved from Debian migration
Reviewed-on: #61
- Add service marker file /data/.homelab-service to track which service
  owns the /data volume
- On rebuild: read marker, detect service mismatch
- If service changed: archive old /data as LV, create fresh volume
- If same service: preserve /data (existing behavior)
- Legacy volumes (no marker): warn and preserve (safe default)
- Write marker on both create and rebuild paths
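
The rebuild decision listed above can be sketched as a small pure function. The names are hypothetical, and the sketch assumes the marker file holds the service name as plain text, which the commit does not specify.

```python
from enum import Enum
from typing import Optional


class DataAction(Enum):
    PRESERVE = "preserve"                          # same service: keep /data
    ARCHIVE_AND_RECREATE = "archive_and_recreate"  # service changed: archive LV, fresh volume
    WARN_AND_PRESERVE = "warn_and_preserve"        # legacy volume without a marker


def plan_data_action(marker: Optional[str], service: str) -> DataAction:
    """Decide what to do with /data given the marker content (None if the file is missing)."""
    if marker is None:
        return DataAction.WARN_AND_PRESERVE
    if marker.strip() == service:
        return DataAction.PRESERVE
    return DataAction.ARCHIVE_AND_RECREATE
```

Keeping the decision separate from the LVM operations makes each branch easy to unit test without a Proxmox host.
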
Reviewed-on: #62
Reviewed-on: #77
Reviewed-on: #78
tmpfiles Z rule ensures /data/vaultwarden files are owned by the NixOS
vaultwarden user on boot. Fixes 'readonly database' error caused by
Debian UID mismatch after migration.
The plugin bundles pre-built CryptX XS binaries for perl <=5.40, but
NixOS ships perl 5.42. XSLoader fails to find CryptX.so, leaving the
module half-loaded, then Perl refuses to reload it.

Fix: build ShairTunes2W as a Nix derivation that strips the bundled
elib/ and symlinks nixpkgs perlPackages.CryptX 0.087 (perl 5.42).

Also fix plugin preStart permissions: use a root ExecStartPre (+) script
with chown instead of running as the slimserver user.
Strict rpfilter drops cross-site VPN traffic because packets arrive
via WireGuard tunnel on the gateway but the container's routing table
doesn't have a direct reverse path. Loose mode only checks that any
route to the source exists, which works with site-to-site VPN.
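
The difference between the two modes can be illustrated with a toy reverse path check. The routing table, interface names, and addresses below are hypothetical and do not describe the actual network; the sketch models only the rule stated above.

```python
import ipaddress
from typing import Optional

Route = tuple[ipaddress.IPv4Network, str]  # (destination network, outgoing interface)


def best_route(routes: list[Route], src: str) -> Optional[str]:
    """Return the interface of the longest-prefix route that covers src, if any."""
    addr = ipaddress.ip_address(src)
    matches = [(net, iface) for net, iface in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda r: r[0].prefixlen)[1]


def rp_filter_accepts(routes: list[Route], src: str, in_iface: str, mode: str) -> bool:
    """Strict: the best route back to src must use the arrival interface.
    Loose: any route to src is enough."""
    iface = best_route(routes, src)
    if iface is None:
        return False
    if mode == "loose":
        return True
    return iface == in_iface
```

With routes for 10.0.0.0/24 via eth0 and 10.8.0.0/24 via wg0, a packet from 10.8.0.5 arriving on eth0 is dropped in strict mode and accepted in loose mode.
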
- Use services.forgejo with non-LTS package (v14.0.2, migration v44 compat)
- PostgreSQL 17 with dataDir on /data/postgres/17/main
- Secrets via sops (internal token, JWT secrets, LFS JWT)
- Data migrated from rootfs to /data (postgres + forgejo)
- Z tmpfiles for Debian UID ownership fix
- External access via git.miskam.xyz verified

Also: n8n-01 and matrix-01 container configs created (not yet deployed)
feat(nix): deploy forgejo-01 with PostgreSQL 17 — Phase 4 (#67)
Some checks failed
CI / lint (pull_request) Failing after 55s
CI / test (pull_request) Successful in 56s
CI / inventory-parse (pull_request) Successful in 4s
25ce65508e
- Use services.forgejo with non-LTS package (v14.0.2, migration v44 compat)
- Run as 'git' user for SSH compatibility (git@git.miskam.xyz)
- PostgreSQL 17 with dataDir on /data/postgres/17/main
- pg_ident map: OS user 'git' → DB user 'git', DB ownership transferred
- Secrets via sops (internal token, JWT secrets, LFS JWT)
- Data migrated from rootfs to /data (postgres + forgejo)
- Z tmpfiles for Debian UID ownership fix

Also: n8n-01 and matrix-01 container configs created (not yet deployed)
- Introduced a new workflow module to manage deployment steps using Plan and Step classes.
- Updated deploy.py to gather CT state, plan deployment actions, and execute them in a structured manner.
- Enhanced inventory loading by separating raw and decrypted data handling, ensuring better organization and clarity.
- Added support for migration inspections and planning in migrate.py, improving the migration process for containers.
- Implemented detailed inspection and synchronization of Traefik route files in routes.py, allowing for better management of service manifests.
- Introduced new data classes for better type handling and clarity in inventory loading and migration processes.
- Added unit tests for inventory loading to ensure robustness and error handling.
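
A rough idea of what a Plan/Step workflow might look like, for readers unfamiliar with the pattern. Only the class names Plan and Step come from the commit message; every field and method below is an assumption and may differ from the actual module.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One named deployment action (hypothetical shape)."""
    name: str
    action: Callable[[], None]


@dataclass
class Plan:
    """An ordered list of steps that can be inspected before it is executed."""
    steps: list[Step] = field(default_factory=list)

    def add(self, name: str, action: Callable[[], None]) -> "Plan":
        self.steps.append(Step(name, action))
        return self

    def describe(self) -> list[str]:
        """Names of the planned steps, e.g. for a dry run."""
        return [step.name for step in self.steps]

    def execute(self) -> list[str]:
        """Run the steps in order and return the names of those that completed.

        An exception raised by a step propagates and stops the plan.
        """
        done = []
        for step in self.steps:
            step.action()
            done.append(step.name)
        return done
```

Separating planning from execution lets the tool print the intended actions from the gathered CT state before anything is changed.
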
feat: update .gitignore to include specific mcp.json and remove unused Justfile commands
Some checks failed
CI / lint (pull_request) Failing after 3s
CI / inventory-parse (pull_request) Successful in 5s
CI / test (pull_request) Successful in 9s
36f6287954
This pull request has changes conflicting with the target branch.
  • infra/src/homelab/ct/command.py
  • infra/src/homelab/deploy.py
  • infra/src/homelab/proxmox.py