Changes from 6 commits
21 commits
81485a6
feat(podvm): opt-in NVIDIA 595.71.05 driver for B200 multi-GPU CC
alhassankhedr-cohere May 8, 2026
02f05f0
fix(podvm): build libnvat from source so nvidia-attester actually links
alhassankhedr-cohere May 12, 2026
d02fb9b
fix(podvm): refresh Cargo.lock before AA build so nvidia-attester res…
alhassankhedr-cohere May 12, 2026
6361ef8
fix: multi-GPU nvidia-attester detection and debug image sizing
alhassankhedr-cohere May 12, 2026
0f10cc0
chore: remove temporary detect_platform sed patch
alhassankhedr-cohere May 13, 2026
c174616
fix: install libnvat to /usr instead of /usr/local
alhassankhedr-cohere May 13, 2026
92af4d5
fix: copy libnvat from /usr/lib to match CMAKE_INSTALL_PREFIX
alhassankhedr-cohere May 13, 2026
8cec037
fix: force libnvat install to /usr/lib (not arch-specific subdir)
alhassankhedr-cohere May 13, 2026
da1222b
fix: address Bugbot findings in Dockerfile
alhassankhedr-cohere May 13, 2026
9f839bf
fix: add libssl-dev to NVAT build deps (prevents OpenSSL source build)
alhassankhedr-cohere May 13, 2026
ad4c921
ci: default guest_components_ref to alhassankhedr/sync-main-to-cohere
alhassankhedr-cohere May 15, 2026
7a46e77
fix(podvm): pin debug rootfs to 12 GiB for NVIDIA 595 stack
alhassankhedr-cohere May 15, 2026
928b7fd
build(podvm-b200-cc): bump NVIDIA stack to 595.71.05 + load ib_umad f…
alhassankhedr-cohere May 19, 2026
85f075d
fix(b200-cc): gate nvidia-persistenced on NVLink fabric readiness
alhassankhedr-cohere May 19, 2026
bb64832
build(podvm-b200-cc): add nvlink5-595 metapackage components + OFED d…
alhassankhedr-cohere May 20, 2026
fd50ec5
build(podvm-b200-cc): add NVIDIA DOCA-Host repo + restore CX7 NIC tools
alhassankhedr-cohere May 20, 2026
5e953fc
build(podvm-b200-cc): fix non-existent package names
alhassankhedr-cohere May 20, 2026
f4b67ed
build(podvm-b200-cc): bake fmctl-probe into the SVM image at build time
alhassankhedr-cohere May 20, 2026
b8a77ba
build(podvm-b200-cc): add image-side distro guard to mkosi.postinst
alhassankhedr-cohere May 20, 2026
de75f12
build(podvm-b200-cc): gate nvidia-imex.service on check-nvidia-gpu
alhassankhedr-cohere May 20, 2026
077f479
build(podvm-b200-cc): fmctl-probe — add resolve-by-physids; fix stdou…
alhassankhedr-cohere May 20, 2026
55 changes: 55 additions & 0 deletions .github/workflows/build-podvm-cohere.yaml
@@ -46,6 +46,11 @@ on:
required: false
type: boolean
default: false
b200_cc_drivers:
description: "Install NVIDIA 595.71.05 open driver (enables Confidential Computing on multi-GPU B200). EXPERIMENTAL"
required: false
type: boolean
default: false

permissions:
id-token: write # OIDC token for build provenance attestation
@@ -70,11 +75,13 @@ jobs:
image_name_debug: ${{ steps.compute.outputs.image_name_debug }}
image_tag_release: ${{ steps.compute.outputs.image_tag_release }}
image_tag_debug: ${{ steps.compute.outputs.image_tag_debug }}
b200_cc_drivers: ${{ steps.compute.outputs.b200_cc_drivers }}
steps:
- name: Compute tags and image names
id: compute
env:
DISTRO: ${{ inputs.distro || 'ubuntu' }}
B200_CC_DRIVERS: ${{ inputs.b200_cc_drivers && 'true' || 'false' }}
run: |
if [[ "$GITHUB_REF" == refs/tags/podvm-v* ]]; then
TAG="${GITHUB_REF#refs/tags/podvm-}"
@@ -84,10 +91,15 @@ jobs:
REPLACE_IMAGE="true"
fi
TAG="${TAG//./-}"
# Suffix CC-driver builds so they never collide with standard images
if [ "$B200_CC_DRIVERS" = "true" ]; then
TAG="${TAG}-cc595"
fi
{
echo "tag=$TAG"
echo "distro=$DISTRO"
echo "replace_image=$REPLACE_IMAGE"
echo "b200_cc_drivers=$B200_CC_DRIVERS"
echo "image_name_release=podvm-${DISTRO}-${TEE_PLATFORM}-release-${TAG}"
echo "image_name_debug=podvm-${DISTRO}-${TEE_PLATFORM}-debug-${TAG}"
echo "image_tag_release=${TAG}-${DISTRO}-release"
@@ -195,6 +207,47 @@ jobs:
echo "Disk after binaries build:"
df -h /

- name: Override NVIDIA driver to 595.71.05 (B200 multi-GPU CC)
if: needs.meta.outputs.b200_cc_drivers == 'true'
working-directory: src/cloud-api-adaptor/podvm-mkosi
run: |
set -euo pipefail
CONF=mkosi.presets/system/mkosi.conf.d/ubuntu.conf
# The 595 branch only ships the unversioned `nvidia-driver-open`
# metapackage in NVIDIA's CUDA repo (no `nvidia-driver-595-open`).
# Match by package name only so this survives future 580.x.y bumps.
sed -i -E \
-e 's|^([[:space:]]*)nvidia-driver-580-open=.*|\1nvidia-driver-open=595.71.05-1ubuntu1|' \
-e 's|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=595.71.05-1ubuntu1|' \
-e 's|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=595.71.05-1ubuntu1|' \
-e 's|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=595.71.05-1ubuntu1|' \
"$CONF"

Opt-in flag is non-functional; config already hardcodes 595

High Severity

The b200_cc_drivers flag is supposed to opt in to the 595 driver by sed-replacing the 580 package pins, but ubuntu.conf was committed with the 595 packages already in place (nvidia-driver-open=595.71.05-1ubuntu1), which makes the sed a no-op. Specifically, the first sed pattern looks for nvidia-driver-580-open=, which doesn't exist in the committed file (it has nvidia-driver-open= without 580), and patterns 2–4 match but replace the lines with the 595 values that are already present. As a result, all builds use the 595 driver regardless of the flag, contradicting the stated "Default behaviour is unchanged" and the PR's opt-in design. The flag's only actual effect is appending -cc595 to the image tag.

Reviewed by Cursor Bugbot for commit fd50ec5. Configure here.
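
One way to keep a pin rewrite like this from silently degrading into a no-op is to check that the expected line exists before substituting. The sketch below is illustrative only: `replace_pin` is a hypothetical helper that is not part of the workflow, and it assumes GNU `sed` and a prefix that contains no characters needing regex escaping.

```shell
# Hypothetical helper (not part of the workflow): rewrite a package pin and
# fail loudly if there is nothing to rewrite.
#
# replace_pin FILE OLD_PREFIX NEW_LINE
#   FILE        config file to edit in place (GNU sed -i is assumed)
#   OLD_PREFIX  start of the line to replace, used as an extended regex
#   NEW_LINE    replacement text; must not contain '|' or '&'
replace_pin() {
  local file="$1" old_prefix="$2" new_line="$3"
  # Refuse to continue when no line starts with the expected prefix.
  if ! grep -Eq "^[[:space:]]*${old_prefix}" "$file"; then
    echo "error: no line starting with '${old_prefix}' in ${file}" >&2
    return 1
  fi
  # Preserve the original indentation (capture group 1).
  sed -i -E "s|^([[:space:]]*)${old_prefix}.*|\1${new_line}|" "$file"
}
```

Called as `replace_pin "$CONF" 'nvidia-fabricmanager=' 'nvidia-fabricmanager=595.71.05-1ubuntu1'`, the step would stop with an error if the pin line were missing, instead of printing an unchanged package list. It would not catch a rewrite that replaces a line with an identical value, so it addresses only the first of the two cases described in the comment.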

echo "----- Updated NVIDIA package pins -----"
grep -E '^[[:space:]]*(nvidia|libnvidia)' "$CONF"

- name: Increase debug root partition for CC595 drivers
if: needs.meta.outputs.b200_cc_drivers == 'true' && matrix.profile == 'debug'
working-directory: src/cloud-api-adaptor/podvm-mkosi
run: |
set -euo pipefail
CONF=mkosi.presets/system/mkosi.repart-debug/10-root.conf
# NVIDIA 595 drivers make the root filesystem too large for
# systemd-repart's Minimize=guess estimation, causing mkfs.ext4
# "No space left on device" during the build.
printf '[Partition]\nType=root\nFormat=ext4\nCopyFiles=/\nMinimize=off\nSizeMinBytes=12G\nSizeMaxBytes=12G\n' > "$CONF"
echo "----- Updated repart config -----"
cat "$CONF"

Debug root partition override is redundant dead code

Low Severity

The "Increase debug root partition for CC595 drivers" workflow step (conditional on b200_cc_drivers == 'true') writes Minimize=off / SizeMinBytes=12G / SizeMaxBytes=12G to 10-root.conf. However, the base 10-root.conf was already directly changed in this commit from Minimize=guess to the identical Minimize=off + 12G content. The conditional override is redundant dead code that writes exactly what's already in the file.

Reviewed by Cursor Bugbot for commit de75f12. Configure here.


- name: Resolve installed NVIDIA driver version
working-directory: src/cloud-api-adaptor/podvm-mkosi
run: |
set -euo pipefail
CONF=mkosi.presets/system/mkosi.conf.d/ubuntu.conf
DRIVER_LINE=$(grep -E '^[[:space:]]*nvidia-driver(-580)?-open=' "$CONF" | head -n1)
DRIVER_VER=$(printf '%s' "$DRIVER_LINE" | sed -E 's|.*=([0-9]+\.[0-9]+\.[0-9]+).*|\1|')
echo "Resolved NVIDIA driver version: $DRIVER_VER"
echo "NVIDIA_DRIVER=$DRIVER_VER" >> "$GITHUB_ENV"

- name: Build OS image
working-directory: src/cloud-api-adaptor/podvm-mkosi
env:
@@ -265,6 +318,7 @@ jobs:
--arg distro "$DISTRO" \
--arg profile "$PROFILE" \
--arg tee_platform "$TEE_PLATFORM" \
--arg nvidia_driver "$NVIDIA_DRIVER" \
--arg build_date "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
'$ARGS.named' > /tmp/measurements.json

@@ -307,6 +361,7 @@ jobs:
--annotation "com.cohere.caa.commit=${CAA_COMMIT}" \
--annotation "com.cohere.caa.version=${GITHUB_REF_NAME}" \
--annotation "com.cohere.rtmr2=${RTMR2}" \
--annotation "com.cohere.nvidia.driver=${NVIDIA_DRIVER}" \
--format json > oras-output.json

cat oras-output.json

🔒 Agentic Security Review
Severity: HIGH

GC_REF now defaults to a mutable personal branch (alhassankhedr/sync-main-to-cohere) for push/tag-driven PodVM builds instead of an immutable, reviewed ref. That expands the build trust boundary to branch-head state that can change outside this repository’s review path.

Impact: If that branch is updated maliciously (or compromised), unreviewed guest-components code can be pulled into release artifacts and published as trusted PodVM images.
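
A possible mitigation is to reject any ref that is not a full commit SHA before the build starts. The following is a minimal sketch under that assumption; `is_commit_sha` and `require_pinned_ref` are hypothetical names, and the check covers 40-character SHA-1 IDs only.

```shell
# Hypothetical guard (names are illustrative): accept only full commit SHAs.

# is_commit_sha REF: succeed only if REF is exactly 40 lowercase hex digits.
is_commit_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 40 ]
}

# require_pinned_ref NAME REF: print an error and fail for branches and tags.
require_pinned_ref() {
  local name="$1" ref="$2"
  if ! is_commit_sha "$ref"; then
    echo "error: ${name}='${ref}' is not a full commit SHA" >&2
    return 1
  fi
}
```

With such a guard, `require_pinned_ref GC_REF "$GC_REF"` would fail for a branch name such as `alhassankhedr/sync-main-to-cohere` and pass only for a commit ID, which cannot be retargeted after review.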

@@ -30,6 +30,10 @@ Packages=
nvidia-fabricmanager=580.126.20-1
libnvidia-nscq=580.126.20-1
nvidia-container-toolkit=1.19.0-1
libcurl4t64
libxml2
libxmlsec1-openssl
pciutils

Duplicate pciutils entry in package list

Low Severity

pciutils appears twice in the Packages= list (lines 41 and 45). While the package manager handles duplicates gracefully, the repetition is unnecessary and suggests a copy-paste oversight.

Reviewed by Cursor Bugbot for commit 85f075d. Configure here.
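
Duplicates like this can be caught mechanically. The sketch below assumes the `Packages=` list is written as indented continuation lines that end at the next unindented line; `list_duplicate_packages` is a hypothetical helper, and version pins such as `=1.19.0-1` are stripped before comparing names.

```shell
# Hypothetical lint helper: print package names listed more than once in a
# Packages= block. Assumes one package per indented line after "Packages=".
list_duplicate_packages() {
  awk '
    /^Packages=/ { in_list = 1; next }   # start of a package list
    /^[^ \t]/    { in_list = 0 }         # any unindented line ends the list
    in_list && NF {
      sub(/=.*/, "", $1)                 # drop an optional "=version" pin
      print $1
    }
  ' "$1" | sort | uniq -d
}
```

An empty result means no package name is repeated; a CI step could fail the build whenever the output is non-empty.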


RemoveFiles=/etc/issue
RemoveFiles=/etc/issue.net
26 changes: 25 additions & 1 deletion src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu
@@ -17,6 +17,8 @@ ARG CUSTOM_GC_BINARIES=""
ARG AA_FEATURES=""
ARG GUEST_COMPONENTS_REF=""
ARG GUEST_COMPONENTS_REPO="https://github.com/confidential-containers/guest-components.git"
ARG NVAT_REPO="https://github.com/NVIDIA/attestation-sdk.git"
ARG NVAT_TAG="2026.03.02"
ARG DEBIAN_FRONTEND=noninteractive
RUN set -e; \
if [ -n "${CUSTOM_GC_BINARIES}" ] && [ -z "${GUEST_COMPONENTS_REF}" ]; then \
@@ -26,15 +28,29 @@ RUN set -e; \
if [ -n "${CUSTOM_GC_BINARIES}" ] && [ -n "${GUEST_COMPONENTS_REF}" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
- protobuf-compiler pkg-config clang libssl-dev libtss2-dev && \
+ protobuf-compiler pkg-config clang libclang-dev libssl-dev libtss2-dev \
+ cmake libcurl4-openssl-dev libxml2-dev libxmlsec1-dev libxmlsec1-openssl && \
apt-get clean && rm -rf /var/lib/apt/lists/* && \
mkdir -p /build/gc && cd /build/gc && \
git init && \
git remote add origin "${GUEST_COMPONENTS_REPO}" && \
git fetch --depth=1 origin "${GUEST_COMPONENTS_REF}" && \
git reset --hard FETCH_HEAD; \
fi
# Build NVIDIA Attestation SDK (libnvat) from source so the nvidia-attester
# feature can link against it. The AA binary will dynamically link libnvat.so,
# which must also be present in the final PodVM image at runtime.
RUN set -e; \
if echo "${AA_FEATURES}" | grep -q "nvidia-attester"; then \
git clone --depth 1 --branch "${NVAT_TAG}" "${NVAT_REPO}" /build/nvat && \

🔒 Agentic Security Review
Severity: MEDIUM

The new build step clones nvidia-attestation-sdk from a mutable Git tag (NVAT_TAG) and builds it directly without immutable pinning or integrity verification. This weakens the supply-chain trust boundary for PodVM artifacts.

Impact: If the upstream tag is retargeted or the source repo is compromised, malicious code could be compiled into libnvat and shipped in the resulting image.
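
One form of integrity verification is to compare a downloaded source archive against a checksum recorded in the repository. This swaps the `git clone` of a tag for an archive download, which is an assumption and not what the Dockerfile does today; `verify_sha256` is a hypothetical helper that relies on `sha256sum` from coreutils.

```shell
# Hypothetical helper: fail unless FILE has the expected SHA-256 digest.
#
# verify_sha256 FILE EXPECTED_HEX
verify_sha256() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  if [ "$actual" != "$expected" ]; then
    echo "error: checksum mismatch for ${file}" >&2
    echo "  expected: ${expected}" >&2
    echo "  actual:   ${actual}" >&2
    return 1
  fi
}
```

The expected digest would be committed next to the version pin, so changing either one requires a reviewed change in this repository. Pinning the clone to a full commit SHA is an alternative that keeps the existing git-based flow.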

cd /build/nvat/nv-attestation-sdk-cpp && \
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr && \
cmake --build build && \
cmake --install build && \
ldconfig; \
fi
COPY cloud-api-adaptor/podvm/build-guest-components.sh /build/
ENV NVAT_USE_SYSTEM_LIB=1
RUN /build/build-guest-components.sh "${CUSTOM_GC_BINARIES}" "${AA_FEATURES}"

# ubuntu:24.04
@@ -146,5 +162,13 @@ RUN for bin in /tmp/gc-overrides/*; do \
install -m0755 "$bin" /src/cloud-api-adaptor/podvm/files/usr/local/bin/"$(basename "$bin")"; \
done; true

# Copy libnvat shared library if it was built (needed at runtime by attestation-agent
# when compiled with nvidia-attester feature).
COPY --from=gc_builder /usr/local/lib/libnvat* /tmp/libnvat/
RUN if ls /tmp/libnvat/libnvat* 1>/dev/null 2>&1; then \
mkdir -p /src/cloud-api-adaptor/podvm/files/usr/local/lib/ && \
cp /tmp/libnvat/libnvat* /src/cloud-api-adaptor/podvm/files/usr/local/lib/; \
fi; true

FROM scratch
COPY --from=podvm_binaries_builder /src/cloud-api-adaptor/podvm/files /
4 changes: 4 additions & 0 deletions src/cloud-api-adaptor/podvm/build-guest-components.sh
@@ -30,6 +30,10 @@ for bin in "${BINS[@]}"; do
exit 1
fi
cd /build/gc/attestation-agent/attestation-agent
# Refresh lockfile so optional feature deps (e.g. nv-attestation-sdk
# for nvidia-attester) are resolved even if the checked-in Cargo.lock
# was generated without them.
cargo update --workspace

🔒 Agentic Security Review
Severity: HIGH

The new cargo update --workspace step rewrites dependency resolution from live registries during image builds, then cargo build --locked only enforces that freshly-updated lock state. This removes the protection of building from a pre-reviewed, committed dependency graph.

Impact: A malicious or compromised transitive crate release could be silently pulled into attestation-agent at build time and shipped in PodVM artifacts without an explicit dependency-pin change in this repository.
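
If the lockfile is meant to stay exactly as committed, a build step can be wrapped so that any modification is treated as an error. The sketch below is generic and makes no assumptions about cargo itself; `run_with_frozen_file` is a hypothetical helper that snapshots a file, runs a command, and compares the result with `cmp`.

```shell
# Hypothetical wrapper: run a command and fail if it modified FILE.
#
# run_with_frozen_file FILE CMD [ARGS...]
run_with_frozen_file() {
  local file="$1" snapshot
  shift
  snapshot=$(mktemp)
  cp "$file" "$snapshot"
  # Run the wrapped command; treat its failure as a failure of the wrapper.
  "$@" || { rm -f "$snapshot"; return 1; }
  # Byte-for-byte comparison against the snapshot taken before the command.
  if ! cmp -s "$file" "$snapshot"; then
    echo "error: ${file} was modified by: $*" >&2
    rm -f "$snapshot"
    return 1
  fi
  rm -f "$snapshot"
}
```

For example, `run_with_frozen_file Cargo.lock cargo build --release --locked ...` would surface any step that rewrites the dependency graph, so that new dependencies have to arrive through a reviewed lockfile change.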

cargo build --release --locked --no-default-features \
--features "$AA_FEATURES" --bin ttrpc-aa
cp /build/gc/target/release/ttrpc-aa "$OUTDIR/attestation-agent"