feat(examples): comprehensive node prerequisites audit (fixes #19) #27
Research across all Cozystack subsystems revealed the example prepare playbooks were installing only a fraction of what nodes actually need. Symptom reported in issue #19 (missing zfsutils-linux) was one of several silent-failure traps.

## Changes

### Required (installed on every node):

- LVM stack: lvm2 plus thin-provisioning-tools (or device-mapper-persistent-data on RHEL) for LINSTOR
- Kernel headers: linux-headers-generic / kernel-devel / kernel-default-devel. Piraeus operator compiles DRBD 9.x from source at runtime; only headers are needed on the host.
- Kernel modules via /etc/modules-load.d/cozystack.conf: overlay, br_netfilter, openvswitch, geneve, ip_tables, iptable_nat.
- Sysctl: net.bridge.bridge-nf-call-iptables/ip6tables and net.ipv6.conf.all.forwarding.
- Critical: multipathd DRBD blacklist. Without it LINSTOR volumes become inaccessible after reboot.

### Enabled by default (opt-out):

- cozystack_enable_zfs: installs zfsutils-linux (Ubuntu); RHEL auto-adds OpenZFS release RPM; openSUSE auto-adds OBS filesystems repo; loads the zfs kernel module.
- cozystack_enable_kubevirt: loads vhost_net, tun, and kvm_intel or kvm_amd kernel modules. QEMU and libvirt are bundled in KubeVirt pods, so no host userspace packages are installed.

### README

- New 'Node Prerequisites' section with per-subsystem tables and verified package names for Ubuntu 22.04/24.04, RHEL 9, openSUSE 15.6.
- Callouts explain why each prerequisite matters and which Cozystack component needs it.

### Other

- Bump galaxy.yml to 1.3.0 (new opt-out variables in examples).

Closes #19

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
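The multipathd DRBD blacklist mentioned above could be sketched as the play below. This is an illustration under assumptions, not the task from the PR: the drop-in path `/etc/multipath/conf.d/drbd.conf` assumes the host's multipathd reads drop-ins from its default `config_dir`, and the task names are made up. The `devnode "^drbd[0-9]+"` stanza is the usual way to keep multipath from claiming DRBD devices.

```yaml
# Sketch only: the drop-in path and task names are assumptions, not taken from the PR.
- hosts: all
  become: true
  tasks:
    - name: Ensure the multipath drop-in directory exists
      ansible.builtin.file:
        path: /etc/multipath/conf.d
        state: directory
        mode: "0755"

    - name: Blacklist DRBD devices so multipathd never claims them
      ansible.builtin.copy:
        dest: /etc/multipath/conf.d/drbd.conf
        mode: "0644"
        content: |
          blacklist {
              devnode "^drbd[0-9]+"
          }
```

multipathd has to be restarted or reloaded before the new blacklist takes effect; how the playbooks trigger that is not shown here.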
- Revert galaxy.yml version bump: collection version tracks upstream
Cozystack release, stays at 1.2.2 until next upstream bump. Changes
land under an 'Unreleased' CHANGELOG section instead.
- Ubuntu/Debian kernel headers: use linux-headers-${ansible_kernel}
so the package list works on both Ubuntu (linux-headers-<kernel>)
and Debian (linux-headers-<arch>); linux-headers-generic was
Ubuntu-only and broke Debian nodes.
- ZFS kernel module persistence: write /etc/modules-load.d/cozystack-zfs.conf
in all three prepare playbooks so modprobe survives reboot (KubeVirt
already had this; ZFS was missing).
- RHEL ZFS GPG: import OpenZFS project key before installing the release
RPM, dropping disable_gpg_check. Adds cozystack_zfs_gpg_key_url var.
- openSUSE ZFS OBS path: OBS uses canonical distro names
(openSUSE_Leap_15.6) as path segments, not bare versions. Introduces
cozystack_zfs_obs_path and builds the URL from it. Previous default
failed to resolve.
- README: restore fs.* and vm.swappiness sysctls to the documented
table (they are set by the playbook but had dropped out of the doc).
- README: cozystack_chart_version default fixed from stale 1.1.2 to
actual 1.2.2.
- CHANGELOG: backfill missing v1.1.3, v1.2.1, v1.2.2 entries so the
Unreleased block does not read as a jump from v1.1.2.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Review findings:
- Debian ZFS (P2): zfsutils-linux is in Debian 'contrib', disabled in stock sources.list. Ubuntu example now auto-enables contrib on Debian when cozystack_enable_zfs is true. README documents this.
- RHEL kernel-devel (P2): pin to ansible_kernel so Piraeus compiles DRBD against the running kernel, not a staged update.
- SUSE kernel-default-devel: same pin to ansible_kernel.
- SUSE OBS path (P2): auto-detect Leap / Tumbleweed / SLE so the default URL is valid on every supported openSUSE/SLE flavour.
- ZFS modprobe: failed_when: false in all three playbooks (Secure Boot and unsigned-module environments should not abort preparation).
- KVM modules persistence: add kvm_intel and kvm_amd to /etc/modules-load.d/cozystack-kubevirt.conf. Only one loads per CPU vendor; the other produces a harmless systemd-modules-load log line.
- README: kernel-headers row now shows the ansible_kernel-pinned package names that the playbooks actually install, and documents the Debian contrib auto-enable.
New CLAUDE.md captures the conventions we've been operating under:
- Release strategy: galaxy.yml version tracks upstream Cozystack chart version and does not bump on feature additions. Changes accumulate under an Unreleased CHANGELOG section until the next upstream bump.
- PR review workflow: both codex review (background) and /review-branch are mandatory.
- Planning discipline: research depth and package-name verification requirements.
- What NOT to install on hosts (bundled in pods).
- Critical silent-failure traps (multipath DRBD blacklist, etc).
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ux-modules-extra on Ubuntu
Review round 3 findings:
- Kernel module loading suppressed all failures with failed_when: false,
which masks a real problem: if br_netfilter fails to load, the next
sysctl task dies with a cryptic 'No such file or directory'. Split
the list into 'optional' (overlay — usually built into the kernel)
and 'required' (br_netfilter, openvswitch, geneve, ip_tables,
iptable_nat). Only the optional task keeps failed_when: false; the
required task aborts cleanly with a pointed error.
- Ubuntu cloud/minimal kernels ship openvswitch and geneve in
linux-modules-extra-${kernel}, not in the base modules package. The
playbook now installs linux-modules-extra-${ansible_kernel} on Ubuntu
(Debian keeps all modules in the base image and doesn't need this).
- CLAUDE.md: remove internal tool names from the review workflow
section; describe the pattern in generic terms so the file is safe
for a public repo.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
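The optional/required split described in this round can be sketched as two tasks. The variable names `cozystack_kernel_modules_optional` and `cozystack_kernel_modules_required` appear in the review snippets of this PR; their placement under `vars:` and the task names are assumptions for illustration.

```yaml
# Sketch only: list contents follow the commit message, layout is illustrative.
- hosts: all
  become: true
  vars:
    cozystack_kernel_modules_optional:
      - overlay
    cozystack_kernel_modules_required:
      - br_netfilter
      - openvswitch
      - geneve
      - ip_tables
      - iptable_nat
  tasks:
    - name: Load optional kernel modules (often built into the kernel)
      community.general.modprobe:
        name: "{{ item }}"
        state: present
      loop: "{{ cozystack_kernel_modules_optional }}"
      failed_when: false  # a missing optional module must not abort the play

    - name: Load required kernel modules (abort if any is unavailable)
      community.general.modprobe:
        name: "{{ item }}"
        state: present
      loop: "{{ cozystack_kernel_modules_required }}"
```

Failing in the second task keeps the error next to its cause, instead of surfacing later as a missing `/proc/sys/net/bridge` path in the sysctl task.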
Review round 4 findings:
- cozystack_enable_zfs / cozystack_enable_kubevirt were declared in the
play vars, which silently wins over inventory-supplied overrides.
Remove from vars and use | default(true) at every when: check so
users can opt out from host_vars/group_vars/--extra-vars.
- Debian ZFS: drop the contrib/apt_repository automation. Adding
contrib via apt_repository conflicts with Debian 12+ deb822
Signed-By, and zfsutils-linux alone does not provide zfs.ko on stock
Debian kernels (zfs-dkms is required). The playbook now skips ZFS
on Debian and emits a notice directing the operator to manual setup.
- SLE version parsing: guard split('.')[1] against SLE GA where
ansible_distribution_version has no dot (e.g. '15'). Default minor
to '0' instead of crashing with IndexError.
- README + CHANGELOG alignment:
- README KubeVirt modules-load.d block now shows all four modules
(vhost_net, tun, kvm_intel, kvm_amd) to match the playbook.
- CHANGELOG no longer claims ZFS is Ubuntu-only; lists per-distro
handling including Debian-manual.
- CHANGELOG kernel-headers wording includes -${ansible_kernel}
suffix (matches what the playbooks actually install).
- README documents cozystack_enable_zfs / cozystack_enable_kubevirt
as inventory-overridable opt-outs.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
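The precedence fix for the opt-out flags can be illustrated roughly as follows. The package task is a stand-in chosen for the example; the point is that the flag is absent from the play's `vars:` and defaulted at the point of use, so inventory values are honoured.

```yaml
# Sketch only: the flag is deliberately NOT declared under vars: in the play.
- hosts: all
  become: true
  tasks:
    - name: Install ZFS userspace tools unless the operator opted out
      ansible.builtin.apt:
        name: zfsutils-linux
        state: present
      when: cozystack_enable_zfs | default(true)

# Opting out from inventory, e.g. in group_vars/all.yml:
#   cozystack_enable_zfs: false
```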
Review round 5 findings:
- KubeVirt modprobe swallowed ALL failures, not just the CPU-vendor mismatch. On hosts missing vhost_net/tun the playbook would finish green while KubeVirt was unusable. Split into two tasks: required modules (vhost_net, tun) fail loudly; kvm_intel/kvm_amd keep failed_when: false because only one matches the CPU.
- ZFS modprobe also swallowed failures silently. Register the result and print a WARNING debug task when the load failed so Secure-Boot / unsigned-module environments surface the real problem instead of appearing to succeed.
- openSUSE SLE GA path: previous template produced SLE_15_SP0, which does not exist on OBS. Detect _minor == '0' and emit SLE_<major> without the SP suffix.
- README Debian ZFS row contradicted the playbook (claimed contrib is auto-enabled, but the playbook only warns). Aligned: 'Not automated. Install manually or set cozystack_enable_zfs: false.'
- RHEL GPG key URL tracks the master branch of an external repo; added a comment describing the failure mode (404 on rename) and suggesting a commit-pinned URL as a future improvement.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
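The SLE path handling from this round and the previous one (dot-less GA version, no `SP0` suffix) could be written along these lines. It is a sketch, not the playbook's template: `_parts`, `_major` and `_sle_path` are illustrative names, and the fallback is implemented here by appending `'0'` to the split result rather than by whatever guard the playbook uses.

```yaml
# Sketch only: variable names other than _minor are illustrative.
- hosts: all
  gather_facts: true
  vars:
    # Appending '0' guarantees index 1 exists even for a dot-less version such as '15'.
    _parts: "{{ ansible_distribution_version.split('.') + ['0'] }}"
    _major: "{{ _parts[0] }}"
    _minor: "{{ _parts[1] }}"
    # GA (minor 0) maps to SLE_<major>; service packs map to SLE_<major>_SP<minor>.
    _sle_path: "SLE_{{ _major }}{{ '' if (_minor | int) == 0 else '_SP' ~ _minor }}"
  tasks:
    - name: Show the derived OBS path segment
      ansible.builtin.debug:
        var: _sle_path
```

With a version fact of `15.6` this yields `SLE_15_SP6`, and with `15` it yields `SLE_15`.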
…or release RPM, Ubuntu-only scope
Review round 6 findings:
- Drop Debian support claim from prepare-ubuntu.yml header. Debian ZFS
needs contrib + zfs-dkms (not provided here) and linux-modules-extra
is Ubuntu-only. Ubuntu-only scope keeps the example honest.
- linux-modules-extra-${ansible_kernel} is now a mandatory package in
cozystack_packages (not a separate conditional task). Required for
openvswitch and geneve on Ubuntu cloud/minimal kernels.
- KVM module persistence: detect CPU vendor via ansible_processor and
write only kvm_intel OR kvm_amd into modules-load.d. Previous dual
listing left systemd-modules-load.service failing every boot on the
mismatched module. All three playbooks.
- ZFS modules-load.d: only create the conf file after modprobe zfs
succeeds. Writing it unconditionally made systemd-modules-load.service
fail every boot in Secure Boot / unsigned-module environments. All
three playbooks.
- SUSE GPG: fetch the OBS filesystems key explicitly via get_url +
rpm --import instead of auto_import_keys: true (trust on first use).
Matches the CLAUDE.md convention used for RHEL.
- RHEL OpenZFS RPM URL: replace hardcoded zfs-release-2-8 with a
per-major lookup table, and fail fast with a clear message when no
URL is configured for the running RHEL major.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
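The "persist only after a successful load" rule for ZFS can be sketched as below. One deliberate substitution: the playbooks use `failed_when: false`, whereas this sketch uses `ignore_errors: true` so that the registered result still tests as failed and the follow-up conditions stay simple. The register name `_cozystack_zfs_modprobe` is mentioned later in this PR; the warning text is made up.

```yaml
# Sketch only: ignore_errors is swapped in for the playbooks' failed_when: false.
- hosts: all
  become: true
  tasks:
    - name: Try to load the zfs module
      community.general.modprobe:
        name: zfs
        state: present
      register: _cozystack_zfs_modprobe
      ignore_errors: true  # Secure Boot / unsigned-module hosts must not abort the play

    - name: Warn when the zfs module could not be loaded
      ansible.builtin.debug:
        msg: "WARNING: modprobe zfs failed; ZFS will be unavailable on this host."
      when: _cozystack_zfs_modprobe is failed

    - name: Persist zfs in modules-load.d only after a successful load
      ansible.builtin.copy:
        dest: /etc/modules-load.d/cozystack-zfs.conf
        mode: "0644"
        content: "zfs\n"
      when: _cozystack_zfs_modprobe is succeeded
```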
…scipline
Review round 7 findings:
- CHANGELOG: drop retroactive v1.1.3 / v1.2.1 / v1.2.2 entries. The tags are immutable and CLAUDE.md documents not attributing new changes to already-tagged versions. Only Unreleased remains on top.
- README: move cozystack_enable_zfs and cozystack_enable_kubevirt into a dedicated 'Example playbook variables' subsection. They are not part of the role's API and should not sit next to role vars like cozystack_chart_version.
- SUSE GPG key: save under /etc/pki/rpm-gpg/ (canonical location for RPM GPG keys) instead of /etc/pki/trust/anchors/ (CA trust store). Switch the rpm --import invocation to argv form so the path is not shell-interpolated.
- Opt-out idempotency: when cozystack_enable_zfs or cozystack_enable_kubevirt flips from true to false on a host that had been prepared previously, the corresponding modules-load.d drop-in file is now removed. Without this, systemd-modules-load would keep loading the modules at every boot despite the opt-out. Applied to all three playbooks.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
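The opt-out cleanup described in this round amounts to removing the drop-in when the flag is false. A minimal sketch, using the two drop-in paths named in this PR; the task names and the `| bool` coercion are additions for the example.

```yaml
# Sketch only: removes stale drop-ins left by an earlier run with the feature enabled.
- hosts: all
  become: true
  tasks:
    - name: Remove the ZFS drop-in when ZFS is opted out
      ansible.builtin.file:
        path: /etc/modules-load.d/cozystack-zfs.conf
        state: absent
      when: not (cozystack_enable_zfs | default(true) | bool)

    - name: Remove the KubeVirt drop-in when KubeVirt is opted out
      ansible.builtin.file:
        path: /etc/modules-load.d/cozystack-kubevirt.conf
        state: absent
      when: not (cozystack_enable_kubevirt | default(true) | bool)
```

`state: absent` is a no-op when the file does not exist, so these tasks stay idempotent on hosts that were never prepared with the feature enabled.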
…drop-in cleanup, CHANGELOG scope
Review round 8:
- SUSE: replace get_url + rpm --import with ansible.builtin.rpm_key. It fetches, verifies, and imports in one idempotent step, and does not rely on /etc/pki/rpm-gpg/ existing on minimal installs.
- openvswitch is compiled into some kernels (certain cloud/HPC images). Move it from required to optional across all three playbooks so a built-in OVS does not abort the play. overlay was already handled this way; openvswitch is now consistent.
- ZFS stale drop-in on failure: previously the cleanup only ran when cozystack_enable_zfs was flipped to false. If the module used to load and then starts failing (kernel update, Secure Boot toggle), the old drop-in stayed and systemd-modules-load failed every boot. Extended the removal condition to include the 'enabled but modprobe failed' path. All three playbooks.
- CHANGELOG: drop 'Debian' from the kernel-headers bullet (Debian is explicitly out of scope for prepare-ubuntu.yml) and note the scope change for users reading the changelog.
- README 'Tested on' list now says 'Ubuntu 22.04 / 24.04' only for the Ubuntu example; Debian is removed from the supported list.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
- examples/ubuntu/site.yml: drop 'Debian' from header comment and play name. Consistent with prepare-ubuntu.yml scope.
- examples/rhel/prepare-rhel.yml: comment on cozystack_zfs_release_rpm_by_major mentions RHEL 10 so future maintainers know to add the entry.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ipathd tolerance
Validated against actual Lima VMs running openSUSE Leap 15.6 (ZFS
install, OBS repo with explicit GPG, kernel modules, sysctl). Findings:
- SUSE kernel headers: drop -${ansible_kernel} suffix. zypper does
not accept that format and errors with 'no provider found'.
kernel-default-devel (unversioned) lets zypper resolve to the
matching kernel. RHEL's kernel-devel-${ansible_kernel} is correct
there (dnf accepts that form).
- CPU vendor detection Jinja: switch from multi-line >- block to a
single-line ternary. The folded scalar was yielding None on hosts
where neither GenuineIntel nor AuthenticAMD matched (ARM64),
which then blew up | length > 0. Guards now use
(_cozystack_kvm_module | default('')) | length > 0.
- multipathd start: tolerate failure (register + warn) instead of
aborting. On VMs without multipath-capable storage and on minimal
kernels without dm_multipath, the service cannot start but the
DRBD blacklist config file is already in place, so LINSTOR still
benefits once the module is available. Real servers will start
multipathd normally.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
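The single-line ternary and the `default('')` guard described above might look like this. `_cozystack_kvm_module` is the name used in the commit message; the `default([])` on `ansible_processor` and the task names are extra assumptions for the sketch.

```yaml
# Sketch only: yields '' (not None) when neither vendor string is present, e.g. on ARM64.
- hosts: all
  become: true
  tasks:
    - name: Pick the KVM module for this CPU vendor
      ansible.builtin.set_fact:
        _cozystack_kvm_module: "{{ 'kvm_intel' if 'GenuineIntel' in (ansible_processor | default([])) else ('kvm_amd' if 'AuthenticAMD' in (ansible_processor | default([])) else '') }}"

    - name: Load the vendor KVM module only when one was detected
      community.general.modprobe:
        name: "{{ _cozystack_kvm_module }}"
        state: present
      failed_when: false
      when: (_cozystack_kvm_module | default('')) | length > 0
```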
…s-load split
Test-driven fixes discovered by running Rocky 9 aarch64 in Lima:
- cozystack_enable_* conditionals: add | bool so CLI --extra-vars parsing works (ansible-core 2.20 errors out on string conditionals). Strings like 'false' from CLI now correctly coerce.
Round 9 review findings:
- multipathd handler also gets failed_when: false. Previously the enable task tolerated multipathd unavailability but the blacklist handler would still hard-fail on the same environment.
- vhost_net and tun are now persisted in modules-load.d on every host, not just Intel/AMD. ARM hosts and other virt contexts still reach the KubeVirt module load, but the KVM_* lines are only added when a vendor is detected.
- README: drop every Ubuntu/Debian mention left behind by the Debian de-scoping (4 occurrences across tables and file refs).
- CLAUDE.md: call out SUSE kernel-default-devel as an exception to the ansible_kernel pinning rule, since zypper package names use a different NVR format than 'uname -r'.
- not (x | default(false)): explicit parens around the default filter so operator precedence is visible at a glance.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Validated against Rocky Linux 10.1 and Debian 12 (bookworm) Lima VMs:
- RHEL 9+ requires kernel-modules-extra-${ansible_kernel} for
br_netfilter, ip_tables, iptable_nat (previously part of
kernel-modules, split out into -extra). Rocky 10 aarch64 ships
with only kernel-modules by default, so those modprobe calls
failed until the extra package was installed.
- Debian 12 support restored in prepare-ubuntu.yml. Two distro-
specific branches:
- linux-modules-extra-${ansible_kernel} is Ubuntu-only; Debian
bundles openvswitch/geneve in the base linux-image package.
- ZFS automation stays Ubuntu-only. Debian needs contrib +
zfs-dkms and is not automated here. A debug task notes this.
- README and site.yml headers updated to reflect Ubuntu+Debian +
Rocky 10 test coverage.
- Fail-fast for Rocky 10 ZFS still works: cozystack_zfs_release_rpm_by_major
has no RHEL 10 entry, so the playbook raises a clear error when
ZFS is requested on that distro.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…mits
Validated on OCI x86_64 clusters (Ubuntu 22.04, Ubuntu 24.04, Oracle Linux 9 UEK, Oracle Linux 10 UEK). Findings:
- Oracle Linux ships kernel-uek by default. The package names for headers and extra modules are kernel-uek-devel / kernel-uek-modules-extra (not kernel-devel / kernel-modules-extra). Detect 'uek' in ansible_kernel and switch the prefix accordingly.
- OpenZFS does not publish kmod builds for UEK kernels. Skip the ZFS tasks on UEK hosts and emit an informational notice telling the user to switch to the stock RHEL kernel or opt out.
- README: new 'Known limitations' subsection documenting the distro matrix for ZFS automation (Ubuntu, Debian, RHEL 9/10, Oracle Linux UEK, openSUSE) and the Piraeus header-matching nuance.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
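The UEK prefix switch can be sketched as a single derived variable feeding the package names. `_kernel_pkg_prefix` is a hypothetical name; the package names and the `ansible_kernel` pinning follow the commit messages in this PR.

```yaml
# Sketch only: _kernel_pkg_prefix is an illustrative variable name.
- hosts: all
  become: true
  vars:
    # Oracle Linux UEK kernels carry 'uek' in `uname -r`; stock RHEL kernels do not.
    _kernel_pkg_prefix: "{{ 'kernel-uek' if 'uek' in ansible_kernel else 'kernel' }}"
  tasks:
    - name: Install headers and extra modules matching the running kernel
      ansible.builtin.dnf:
        name:
          - "{{ _kernel_pkg_prefix }}-devel-{{ ansible_kernel }}"
          - "{{ _kernel_pkg_prefix }}-modules-extra-{{ ansible_kernel }}"
        state: present
```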
Validated on OCI Ubuntu 22.04/24.04: the stock cloud image ships an INPUT chain with 'REJECT icmp-host-prohibited' as the last rule, so ports 2380 (etcd peer) and 6443 (kube-apiserver) are blocked between nodes and k3s multi-server bootstrap fails with 'connect: no route to host'. Oracle Linux cloud images on OCI do not have this; they ship with a clean INPUT chain. The prepare playbook already flushes iptables when cozystack_flush_iptables: true (disabled by default to stay safe on hand-managed clusters). Users running on Ubuntu-based cloud images need to set that flag on.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Replaces the loose 'Tested on' bullet list with an explicit table that distinguishes claimed-supported distros from those actually validated end-to-end on real hosts during the recent test pass (Ubuntu 22/24 + Oracle Linux 9/10 UEK on OCI; Rocky 9/10, openSUSE Leap 15.6, Debian 12 prepare-only on Lima). Cross-references the known limitations section so users can find the per-distro gotchas (UEK kernel, Debian ZFS, RHEL 10 OpenZFS, cloud iptables).
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Keep only end-to-end validated targets in the table so the matrix reflects what was actually tested on real infra (OCI Ubuntu 22/24, Oracle Linux 9/10 UEK), not local dev iterations.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Earlier commit overstated what was validated. Actual status:
- Ubuntu 22/24 on OCI: prepare-ubuntu.yml passes, but site.yml fails during k3s multi-server bootstrap: agent nodes cannot reach the server's etcd peer port 2380 between hosts in the same VCN. Flushing iptables unblocks port 6443 but 2380 stays blocked; OCI-level filter not yet identified.
- Oracle Linux 9/10 UEK on OCI: prepare-rhel.yml and k3s multi-server bootstrap both succeed (all 3 nodes joined as control-plane), but the Cozystack platform does not fully reconcile: kube-ovn stays in installing state and dependent HelmReleases never become Ready.
The matrix now reflects these two stages (prepare, k3s multi-node) and calls out that no distro currently has a green Cozystack platform in the validation set.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Validated: 3-node Ubuntu 24.04 cluster on OCI, all 3 control-plane, 87/87 HelmReleases Ready, 137/137 pods Running. Required setting cozystack_flush_iptables: true to clear the cloud image's default INPUT REJECT rule that was blocking etcd peer port 2380 between nodes. Documented the requirement.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Validated: 3-node Ubuntu 22.04 cluster on OCI, multi-master, 87/87 HelmReleases Ready. Same flow as Ubuntu 24.04; needs cozystack_flush_iptables: true.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
OL k3s multi-master fails on OCI with port 2380 (etcd peer) blocked between nodes, while identical OCI configuration works for Ubuntu. Root cause not yet identified; likely OS-level (SELinux, hidden iptables-legacy tables, or oracle-cloud-agent interference). Filed as a separate investigation issue.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Rocky 10 / Alma 10 (and other RHEL 10 rebuilds) do not install the iptables userspace binary by default, so the cozystack_flush_iptables task failed with 'Failed to find required executable iptables' on cloud images. Add iptables-nft to cozystack_packages on the RHEL example. Validated end-to-end on OCI: Rocky 10 3-node multi-master k3s with 87/87 Cozystack HelmReleases Ready.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Reviewers flagged that all three prepare playbooks wrote /etc/modules-load.d/cozystack-kubevirt.conf with the CPU-vendor KVM module before verifying it could load. On hosts where nested virt is disabled in BIOS or a container host does not expose /dev/kvm, modprobe kvm_intel/kvm_amd fails silently (failed_when: false) but the config file is still persisted, causing systemd-modules-load.service to fail on every boot.
Reorder: modprobe the CPU-vendor module first (best-effort, capture rc), then write the modules-load.d config with the kvm_* line only when rc == 0. vhost_net and tun remain required: the playbook fails hard if they cannot load, which is correct for a KubeVirt-enabled host.
Also:
- Fix CHANGELOG wording: prepare-ubuntu.yml still supports Debian 12 end-to-end; only ZFS automation is Ubuntu-only.
- Add missing v1.2.2 CHANGELOG section (synced with Cozystack v1.2.2).
- Flag required 'cozystack_enable_zfs: false' override for Rocky 10 / Alma 10 in README and CHANGELOG. OpenZFS has not yet published an el10 release RPM, so ZFS must stay disabled there until it does.
- Fold long CPU-vendor ternary into a block scalar (ansible-lint yaml[line-length]).
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Review round 2 findings:
- The previous fix used 'ansible.builtin.command: modprobe <m>' with
'changed_when: rc == 0' to capture the rc for the modules-load.d
gate. modprobe returns 0 both for a fresh load AND when the module
is already loaded, so this reported 'changed' on every rerun,
breaking idempotency. Restore 'community.general.modprobe' (which
handles idempotency correctly), keep 'failed_when: false' to
tolerate unsupported CPUs / disabled virt, and check
/sys/module/<name> via stat afterwards to decide whether to persist
the kvm_* line in /etc/modules-load.d/cozystack-kubevirt.conf.
- Revert the v1.2.2 CHANGELOG section. v1.2.2 is already tagged; per
project convention (CLAUDE.md) releases are immutable — the
Unreleased section is the right place for current changes, and the
missing v1.2.2 entry would be filled in a separate housekeeping PR
if desired.
- Fix CHANGELOG wording on openSUSE kernel-devel: the playbook
installs the *unversioned* 'kernel-default-devel' (zypper rejects
the 'kernel-default-devel-{{ ansible_kernel }}' form because SUSE's
NVR differs from 'uname -r'). Only Ubuntu and RHEL pin to
ansible_kernel.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
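The modprobe-then-stat gate from this round could be sketched as follows. The sketch assumes `_cozystack_kvm_module` has already been detected (it is hard-coded to `kvm_intel` here purely for illustration); `_kvm_stat`, `_kvm_loaded` and `_kvm_lines` are hypothetical names.

```yaml
# Sketch only: _cozystack_kvm_module is hard-coded here; the playbooks detect it per host.
- hosts: all
  become: true
  vars:
    _cozystack_kvm_module: kvm_intel
  tasks:
    - name: Try to load the vendor KVM module (idempotent, best effort)
      community.general.modprobe:
        name: "{{ _cozystack_kvm_module }}"
        state: present
      failed_when: false
      when: (_cozystack_kvm_module | default('')) | length > 0

    - name: Check whether the module is actually present
      ansible.builtin.stat:
        path: "/sys/module/{{ _cozystack_kvm_module }}"
      register: _kvm_stat
      when: (_cozystack_kvm_module | default('')) | length > 0

    - name: Persist KubeVirt modules, adding the kvm_* line only if it loaded
      vars:
        # A skipped stat task has no .stat key, hence the nested defaults.
        _kvm_loaded: "{{ (_kvm_stat.stat | default({})).exists | default(false) }}"
        _kvm_lines: "{{ [_cozystack_kvm_module] if (_kvm_loaded | bool) else [] }}"
      ansible.builtin.copy:
        dest: /etc/modules-load.d/cozystack-kubevirt.conf
        mode: "0644"
        content: "{{ (['vhost_net', 'tun'] + _kvm_lines) | join('\n') }}\n"
```

Checking `/sys/module/<name>` instead of the modprobe return code keeps reruns reporting "ok" rather than "changed" when the module is already loaded.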
- Add cozystack_flush_iptables to the example-playbook variables table; it was referenced in prose (intro, Known limitations, Customizing variables) but missing from the reference table.
- ZFS distribution table: widen RHEL row to 8+ (the playbook maps 8 → el8 RPM in cozystack_zfs_release_rpm_by_major, the table only listed 9). Note the missing el10 release RPM inline so readers see the constraint alongside the RHEL row.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…K note
Review round 3 findings:
- Replace 'edit cozystack_zfs_release_rpm_by_major' guidance with an additive override mechanism. Overriding a vars: dict from inventory replaces the whole dict: if a user set just the el10 entry from inventory the el8 and el9 entries would vanish and break those nodes. Keep the built-in mapping internal and expose cozystack_zfs_release_rpm_extra as the additive override; the two are merged via 'combine', so supplying one major does not wipe out the others.
- prepare-suse.yml: fail fast when ansible_distribution is not one of 'openSUSE Leap', 'openSUSE Tumbleweed', or 'SLES'. The OBS path template previously fell through to SLE_<major>[_SP<minor>] for any unrecognised distro and constructed a URL that would silently 404 at the rpm_key step. Explicit fail makes the misconfiguration obvious.
- README: explain the Oracle Linux / UEK code path in the kernel-headers section. Oracle Linux is not on the validated-end-to-end list but the prepare-rhel.yml UEK auto-detection is retained best-effort; document the behavior so the docs and code don't contradict each other.
- README: add cozystack_zfs_release_rpm_extra to the example-playbook variables table and update the Known limitations entry for RHEL 10 to point at the safe override variable.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
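The additive override via `combine` can be sketched like this. The URLs are placeholders on a reserved `.invalid` domain, not real OpenZFS locations; `_zfs_release_rpm_map` is a hypothetical name, and keeping the built-in mapping under the old `cozystack_zfs_release_rpm_by_major` name is an assumption about how the playbook is laid out.

```yaml
# Sketch only: URLs are placeholders; _zfs_release_rpm_map is an illustrative name.
- hosts: all
  become: true
  vars:
    cozystack_zfs_release_rpm_by_major:
      "8": "https://example.invalid/zfs-release.el8.noarch.rpm"
      "9": "https://example.invalid/zfs-release.el9.noarch.rpm"
    # Inventory supplies only the extra entries; built-in keys survive the merge.
    _zfs_release_rpm_map: >-
      {{ cozystack_zfs_release_rpm_by_major
         | combine(cozystack_zfs_release_rpm_extra | default({})) }}
  tasks:
    - name: Fail fast when no release RPM is known for this major version
      ansible.builtin.fail:
        msg: >-
          No OpenZFS release RPM configured for major version
          {{ ansible_distribution_major_version }}. Add an entry to
          cozystack_zfs_release_rpm_extra or set cozystack_enable_zfs to false.
      when: ansible_distribution_major_version not in _zfs_release_rpm_map

# Inventory override, e.g. in group_vars/all.yml:
#   cozystack_zfs_release_rpm_extra:
#     "10": "https://example.invalid/zfs-release.el10.noarch.rpm"
```

The keys are quoted because `ansible_distribution_major_version` is a string fact, so unquoted integer keys would never match.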
…10 note
Round-4 review findings addressed:
- SUSE fail message claimed users could override cozystack_zfs_obs_path from inventory to suppress the fail, but the fail task does not honour that override. Drop the misleading claim; the only supported escape hatch is cozystack_enable_zfs: false.
- Add explicit '"uek" not in ansible_kernel' guard to the 'Warn if ZFS module failed to load' task in prepare-rhel.yml. The warn task was previously correct only because _cozystack_zfs_modprobe is never registered on UEK hosts (the modprobe task itself has the UEK guard); the implicit is-defined check works but is fragile for future editors.
- README Known limitations RHEL 10 row: clarify that fail-fast applies only when cozystack_enable_zfs is left at its default true; setting it to false skips the block cleanly.
Dismissed (justified as false positives or user-approved design):
- Rename 'Unreleased' → 'v1.2.2': galaxy.yml version and cozystack_chart_version were NOT changed in this PR; v1.2.2 was already tagged on main before this branch. Per CLAUDE.md release strategy (galaxy.yml tracks the upstream chart), keeping this under 'Unreleased' is correct: the release workflow will not fire on this merge. Renaming would attribute new changes to an already-tagged release, which CLAUDE.md explicitly forbids.
- Optional kernel modules (overlay, openvswitch) written unconditionally to cozystack.conf: modprobe returns 0 for built-in modules, so systemd-modules-load reports success on distros where they are built into the kernel. End-to-end validation on Rocky 10 (where overlay is built-in since the stock 6.12 kernel) passed with systemd-modules-load clean and 87/87 HelmReleases Ready.
- vhost_net/tun hard-fail when cozystack_enable_kubevirt: true: this is the intended design. vhost_net and tun are standard modules present on every production Linux distro; failing early on their absence surfaces a misconfiguration before cluster state diverges. Users who don't want KubeVirt set cozystack_enable_kubevirt: false (documented).
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Walkthrough
This PR expands example preparation playbooks and docs to add distro-aware package installs, kernel module management (optional/required with persistent boot loading), multipath DRBD blacklisting, conditional ZFS and KubeVirt flows, and extended sysctl settings across RHEL, SUSE, and Ubuntu.
Estimated code review effort: 4 (Complex), ~70 minutes
Code Review
This pull request significantly enhances the node preparation playbooks for RHEL, SUSE, and Ubuntu/Debian to ensure all Cozystack prerequisites are met. Key changes include the installation of necessary storage and LVM packages, pinning kernel headers to the running kernel, applying a critical multipath blacklist for DRBD, and managing kernel modules for Kube-OVN and KubeVirt. Additionally, a new CLAUDE.md file establishes project conventions, and the documentation is updated to reflect these comprehensive requirements. Feedback focuses on improving security by pinning GPG key URLs to specific commits and ensuring that optional kernel modules are only persisted if they are actually available to prevent systemd service failures.

    # branch of the openzfs docs repo. If OpenZFS renames/moves the key
    # file this URL will 404; update to the current path (or pin to a
    # known-good commit SHA) when that happens.
    cozystack_zfs_gpg_key_url: "https://raw.githubusercontent.com/zfsonlinux/zfsonlinux.github.com/master/zfs-release/RPM-GPG-KEY-openzfs-key2"

This trade-off is already documented at lines 65–69 just above the variable. Pinning to a specific commit would protect against upstream moving the file, but it would also go stale the moment OpenZFS rotates the signing key — at which point rpm_key would silently import a superseded key and later install steps would fail GPG verification on new packages. Tracking master means the playbook follows whatever the OpenZFS project currently publishes, which is what matters for the rpm_key import step.
The fetch itself is over HTTPS against raw.githubusercontent.com, and the key is validated by rpm_key before use. I'd rather keep the current URL (with the inline note telling users what to do if it 404s) than add a moving target that needs manual bumping.
examples/rhel/prepare-rhel.yml

    ansible.builtin.copy:
      dest: /etc/modules-load.d/cozystack.conf
      mode: "0644"
      content: "{{ (cozystack_kernel_modules_optional + cozystack_kernel_modules_required) | join('\n') }}\n"

The cozystack_kernel_modules_optional list (which includes overlay and openvswitch) is persisted to /etc/modules-load.d/cozystack.conf without verifying if the modules are actually available or built-in. If a module is missing from the kernel, systemd-modules-load.service will fail on every boot. Given that you've implemented careful checks for ZFS and KVM modules to avoid this exact issue (e.g., lines 298 and 375), you should consider applying similar logic here—only persisting modules that are confirmed to be loadable or present in /sys/module.
Good catch — applied the same stat/warn/persist pattern used for ZFS and the KVM modules. The optional modules are now loaded with failed_when: false, then stat'd under /sys/module/<name>; only modules that are actually present (loadable or built-in) end up in /etc/modules-load.d/cozystack.conf. Missing modules emit a visible warning pointing at kernel-modules-extra so the user can fix the environment, and systemd-modules-load.service no longer fails every boot on an absent module.
examples/suse/prepare-suse.yml

    - name: Load kernel modules at boot (containerd, Kubernetes, Kube-OVN)
      ansible.builtin.copy:
        dest: /etc/modules-load.d/cozystack.conf

The cozystack_kernel_modules_optional list is persisted to /etc/modules-load.d/cozystack.conf without verifying if the modules are actually available. If a module is missing from the kernel, systemd-modules-load.service will fail on every boot. Consider only persisting modules that are confirmed to be loadable, consistent with the logic used for ZFS and KVM modules later in the playbook.
Fixed with the same pattern used elsewhere in the playbook: load optional modules with failed_when: false, stat each under /sys/module/<name>, then render /etc/modules-load.d/cozystack.conf with only the modules that are actually present. Missing ones emit a warning pointing at kernel-default-extra. This keeps systemd-modules-load.service from failing every boot on an absent module.
examples/ubuntu/prepare-ubuntu.yml
Outdated
    - name: Load kernel modules at boot (containerd, Kubernetes, Kube-OVN)
      ansible.builtin.copy:
        dest: /etc/modules-load.d/cozystack.conf
The cozystack_kernel_modules_optional list is persisted to /etc/modules-load.d/cozystack.conf without verifying if the modules are actually available. If a module is missing from the kernel, systemd-modules-load.service will fail on every boot. Consider only persisting modules that are confirmed to be loadable, consistent with the logic used for ZFS and KVM modules later in the playbook.
Fixed with the same pattern used for ZFS and KVM: optional modules are probed with failed_when: false, stat'd under /sys/module/<name>, and only the ones that are actually present end up in /etc/modules-load.d/cozystack.conf. Missing modules emit a warning pointing at linux-modules-extra so the user can fix the environment, and systemd-modules-load.service no longer fails every boot on an absent module.
Previously overlay and openvswitch were written to /etc/modules-load.d/cozystack.conf unconditionally. If either was unavailable on the host kernel, systemd-modules-load.service would fail on every boot. Now the playbooks probe each optional module, stat /sys/module/<name>, and persist only the modules that are actually present. Missing ones emit a visible warning pointing at the distro-specific extras package (kernel-modules-extra / kernel-default-extra / linux-modules-extra). This mirrors the pattern already used for ZFS and the KVM modules. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
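For readers who want to see the shape of the probe/stat/persist pattern described in this commit, here is a minimal sketch. It is not the code from the PR: the task names and the registered variables (optional_module_stat, optional_modules_present) are hypothetical, and only cozystack_kernel_modules_optional and the target file come from the discussion above. It assumes the community.general collection is installed for the modprobe module.

```yaml
# Sketch only: task names and registered variable names are assumptions.
- name: Probe optional kernel modules (best effort)
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop: "{{ cozystack_kernel_modules_optional }}"
  failed_when: false   # a missing module must not abort the play

- name: Check which optional modules are present under /sys/module
  ansible.builtin.stat:
    path: "/sys/module/{{ item }}"
  loop: "{{ cozystack_kernel_modules_optional }}"
  register: optional_module_stat

- name: Collect the modules that are actually present
  ansible.builtin.set_fact:
    optional_modules_present: >-
      {{ optional_module_stat.results
         | selectattr('stat.exists')
         | map(attribute='item')
         | list }}

- name: Warn about optional modules that are missing
  ansible.builtin.debug:
    msg: >-
      Kernel module {{ item }} is not available on this host;
      install the distro-specific extra modules package.
  loop: "{{ cozystack_kernel_modules_optional | difference(optional_modules_present) }}"

- name: Persist only the modules that are present
  ansible.builtin.copy:
    dest: /etc/modules-load.d/cozystack.conf
    mode: "0644"
    content: |
      {% for module in optional_modules_present %}
      {{ module }}
      {% endfor %}
```

The sketch writes only the optional modules to keep it short; a real playbook would presumably also include the always-required modules in the same file.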
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/rhel/prepare-rhel.yml`:
- Around line 57-63: The default mapping lacks an EL10 entry causing
cozystack_zfs_release_rpm to be empty and ZFS tasks to fail; add a safe default
and guard the ZFS install flow: add a "10": "" entry to
cozystack_zfs_release_rpm_by_major_default (so cozystack_zfs_release_rpm
resolves to empty string rather than undefined) and update the ZFS-related tasks
that reference cozystack_zfs_release_rpm to run only when
cozystack_zfs_release_rpm | length > 0, and emit a debug/warn message when it's
empty so unsupported majors are skipped non‑fatally (references: variables
cozystack_zfs_release_rpm_by_major_default, cozystack_zfs_release_rpm_by_major,
and cozystack_zfs_release_rpm and the ZFS install tasks that currently consume
cozystack_zfs_release_rpm).
📒 Files selected for processing (3)
- examples/rhel/prepare-rhel.yml
- examples/suse/prepare-suse.yml
- examples/ubuntu/prepare-ubuntu.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/ubuntu/prepare-ubuntu.yml
    cozystack_zfs_release_rpm_by_major_default:
      "8": "https://zfsonlinux.org/epel/zfs-release-2-3.el8.noarch.rpm"
      "9": "https://zfsonlinux.org/epel/zfs-release-2-8.el9.noarch.rpm"
    cozystack_zfs_release_rpm_by_major: >-
      {{ cozystack_zfs_release_rpm_by_major_default
      | combine(cozystack_zfs_release_rpm_extra | default({})) }}
    cozystack_zfs_release_rpm: "{{ cozystack_zfs_release_rpm_by_major[ansible_distribution_major_version] | default('') }}"
Default EL10 path hard-fails ZFS flow.
With defaults, cozystack_enable_zfs is true. On EL10, Line 63 resolves cozystack_zfs_release_rpm to empty (no "10" mapping), and Line 277 then fails the play. That makes default prepare behavior fail on EL10-class hosts unless users pre-override variables.
Please make unsupported-major behavior non-fatal by default (warn + skip ZFS tasks), or provide an EL10 mapping when available.
Also applies to: 277-287
This is intentional and documented. See the "Known limitations" table in README.md (RHEL 10 row) and the v1.2.2 entry in CHANGELOG.rst: on EL10 with the default cozystack_enable_zfs: true, prepare fails fast on the missing OpenZFS release RPM.
Silent warn-and-skip was rejected because ZFS is a primary LINSTOR backend in Cozystack. If prepare quietly skipped the ZFS install on EL10 while the toggle stayed at its default, LINSTOR would later refuse to create ZFS-backed volumes and the user would have no breadcrumb pointing back at prepare. Fail-fast forces an explicit cozystack_enable_zfs: false (or cozystack_zfs_release_rpm_extra: {"10": "<url>"} once OpenZFS ships an el10 release RPM).
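To make the fail-fast contract concrete, a guard along these lines would produce the behaviour described above. This is a hedged sketch, not the task at line 277 of the playbook: the task name and message wording are assumptions, while the variable names are the ones quoted in this thread.

```yaml
# Sketch only: task name and fail_msg wording are assumptions.
- name: Fail fast when no OpenZFS release RPM is mapped for this major version
  ansible.builtin.assert:
    that:
      - cozystack_zfs_release_rpm | length > 0
    fail_msg: >-
      No OpenZFS release RPM is mapped for major version
      {{ ansible_distribution_major_version }}. Set cozystack_enable_zfs: false
      or supply cozystack_zfs_release_rpm_extra.
  when: cozystack_enable_zfs | bool
```

The two escape hatches could then be expressed in inventory variables, for example (the file location is hypothetical, and "<url>" stays a placeholder until OpenZFS ships an el10 release RPM):

```yaml
# group_vars/all.yml (hypothetical location)
# Option 1: opt out of the ZFS backend on EL10 hosts.
cozystack_enable_zfs: false

# Option 2: keep ZFS enabled and add a mapping for major version 10.
# The extra dict is merged into the default map via combine(), so the
# existing "8" and "9" entries are preserved.
# cozystack_zfs_release_rpm_extra:
#   "10": "<url>"
```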
@lexfrei, understood — the fail-fast is a deliberate UX contract, not an oversight. Failing loudly with a clear message is preferable to silently skipping ZFS and leaving LINSTOR in a broken state without any diagnostic trail. The escape hatches (cozystack_enable_zfs: false or cozystack_zfs_release_rpm_extra) give users a clean path forward. I'll retract the concern.
✏️ Learnings added
Learnt from: lexfrei
Repo: cozystack/ansible-cozystack PR: 27
File: examples/rhel/prepare-rhel.yml:57-63
Timestamp: 2026-04-13T20:30:27.306Z
Learning: In `examples/rhel/prepare-rhel.yml`, the ZFS setup intentionally fails fast on RHEL 10 (EL10) when no OpenZFS release RPM URL is configured for that major version (cozystack_zfs_release_rpm resolves to empty string). This is by design: ZFS is a primary LINSTOR backend in Cozystack, so silently skipping it would leave LINSTOR unable to create ZFS-backed volumes with no diagnostic breadcrumb. Users must explicitly set `cozystack_enable_zfs: false` or supply `cozystack_zfs_release_rpm_extra: {"10": "<url>"}` once OpenZFS publishes an el10 release RPM. This design decision is documented in README.md "Known limitations" and CHANGELOG.rst.
Summary
Comprehensive node prerequisites audit for the three example prepare playbooks. Closes issue #19 (LINSTOR ZFS failure) and fills many adjacent gaps uncovered during research: kernel modules for containerd/Kube-OVN, sysctl for bridge netfilter, multipath DRBD blacklist, kernel headers pinned to the running kernel, ZFS backend + KubeVirt modules behind opt-out toggles, and a broadly rewritten README that documents every Cozystack node requirement per subsystem.
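As an illustration of two of the items in this summary (the bridge-netfilter sysctls and the multipath DRBD blacklist), a minimal sketch follows. It is not copied from the playbooks: the sysctl value "1", both drop-in file paths, and the exact multipath stanza are assumptions, and it presumes the ansible.posix collection is installed. Only the three sysctl keys come from this PR.

```yaml
# Sketch only: values, file paths, and the blacklist stanza are assumptions.
- name: Enable bridge netfilter and IPv6 forwarding
  ansible.posix.sysctl:
    name: "{{ item }}"
    value: "1"
    sysctl_file: /etc/sysctl.d/99-cozystack.conf
    state: present
    reload: true
  loop:
    - net.bridge.bridge-nf-call-iptables
    - net.bridge.bridge-nf-call-ip6tables
    - net.ipv6.conf.all.forwarding

- name: Keep multipathd from claiming DRBD devices
  ansible.builtin.copy:
    dest: /etc/multipath/conf.d/drbd-blacklist.conf
    mode: "0644"
    content: |
      blacklist {
          devnode "^drbd[0-9]+"
      }
```

The net.bridge.* keys only exist once br_netfilter is loaded, so the module-loading tasks would need to run before the sysctl task.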
The collection role itself is unchanged; all new work lives in examples/*/prepare-*.yml and the README.
Changes
- examples/ubuntu/prepare-ubuntu.yml, examples/rhel/prepare-rhel.yml, examples/suse/prepare-suse.yml: full prerequisite coverage:
  - lvm2, thin-provisioning-tools / device-mapper-persistent-data, and kernel headers pinned to the running kernel (Ubuntu/RHEL); openSUSE uses the unversioned kernel-default-devel (zypper rejects the version-suffixed form).
  - Ubuntu: linux-modules-extra-{{ ansible_kernel }}, which ships openvswitch and geneve on cloud/minimal kernels.
  - RHEL family: iptables-nft. Rocky 10 / Alma 10 do not ship the iptables binary by default, which blocks the optional flush task and the k3s kube-proxy replacement.
  - Kernel modules via /etc/modules-load.d/cozystack.conf: overlay, br_netfilter, openvswitch, geneve, ip_tables, iptable_nat.
  - Sysctl: net.bridge.bridge-nf-call-iptables, net.bridge.bridge-nf-call-ip6tables, net.ipv6.conf.all.forwarding.
  - Opt-out toggles (default true):
    - cozystack_enable_zfs: Ubuntu installs zfsutils-linux; RHEL imports the OpenZFS GPG key, then installs the release RPM from a per-major map (cozystack_zfs_release_rpm_by_major_default plus the additive cozystack_zfs_release_rpm_extra, so user overrides do not wipe the base dict); openSUSE adds the OBS filesystems repo with a distro-detected path. Debian is not automated (contrib + zfs-dkms required manually). UEK kernels skip with a visible notice (no OpenZFS kmod builds for UEK).
    - cozystack_enable_kubevirt: loads vhost_net, tun, and a best-effort kvm_intel / kvm_amd. The CPU-vendor module is persisted to modules-load.d only if it actually loaded (checked via /sys/module), so systemd-modules-load does not fail on every boot. No host userspace packages are installed: QEMU and libvirt are bundled in KubeVirt pods.
  - cozystack_flush_iptables (default false) is documented: set it to true on cloud images where the default INPUT chain blocks k3s inter-node traffic.
- README.md: new Node Prerequisites section with per-subsystem tables (verified package names for Ubuntu 22.04/24.04, RHEL 9, openSUSE Leap 15.6), a Known limitations matrix for ZFS per OS and kernel flavor, and a supported-targets/validated-end-to-end summary table.
- CHANGELOG.rst: expanded Unreleased entry.
Validated end-to-end on OCI (3-node multi-master k3s, 87/87 Cozystack HelmReleases Ready):
Test plan
- ansible-lint passes (production profile)
- ansible-playbook --syntax-check passes on all three prepare playbooks
Notes for reviewers
- galaxy.yml version is not bumped (per project convention it tracks the upstream Cozystack chart version). Current changes stay under the Unreleased CHANGELOG section until the next upstream chart bump.
- examples/rhel/prepare-rhel.yml retains best-effort Oracle Linux UEK handling (kernel-uek-devel, ZFS skip) even though Oracle Linux is not on the validated-end-to-end list; this is documented in the README.
Summary by CodeRabbit
New Features
Documentation
Chores