feat(podvm): opt-in NVIDIA 595.71.05 driver for B200 multi-GPU CC #29
base: cohere
Changes from 11 commits
81485a6
02f05f0
d02fb9b
6361ef8
0f10cc0
c174616
92af4d5
8cec037
da1222b
9f839bf
ad4c921
7a46e77
928b7fd
85f075d
bb64832
fd50ec5
5e953fc
f4b67ed
b8a77ba
de75f12
077f479
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,10 +22,20 @@ on: | |
| type: string | ||
| default: "https://github.com/cohere-ai/guest-components.git" | ||
| guest_components_ref: | ||
| description: "guest-components ref (default: cohere)" | ||
| description: | | ||
| guest-components ref (branch, tag, or SHA). | ||
|
|
||
| Default: alhassankhedr/sync-main-to-cohere (head of PR #9). | ||
| That branch carries upstream main's nvidia-attester rewrite | ||
| (NVAT SDK based, no `count == 1` guard) and is required for | ||
| multi-GPU evidence to work end-to-end on 8x B200 hosts. The | ||
| plain `cohere` branch still has the old NVML-based attester | ||
| which silently produces empty evidence on 2+ GPU systems | ||
| (mod a sed `s/count == 1/count >= 1/` patch the podvm-mkosi | ||
| Dockerfile applies). Switch back to `cohere` after PR #9 merges. | ||
| required: false | ||
| type: string | ||
| default: "cohere" | ||
| default: "alhassankhedr/sync-main-to-cohere" | ||
| custom_gc_binaries: | ||
| description: "guest-components binaries to build from source" | ||
| required: false | ||
|
|
@@ -46,6 +56,11 @@ on: | |
| required: false | ||
| type: boolean | ||
| default: false | ||
| b200_cc_drivers: | ||
| description: "Install NVIDIA 595.71.05 open driver (enables Confidential Computing on multi-GPU B200). EXPERIMENTAL" | ||
| required: false | ||
| type: boolean | ||
| default: false | ||
|
|
||
| permissions: | ||
| id-token: write # OIDC token for build provenance attestation | ||
|
|
@@ -70,11 +85,13 @@ jobs: | |
| image_name_debug: ${{ steps.compute.outputs.image_name_debug }} | ||
| image_tag_release: ${{ steps.compute.outputs.image_tag_release }} | ||
| image_tag_debug: ${{ steps.compute.outputs.image_tag_debug }} | ||
| b200_cc_drivers: ${{ steps.compute.outputs.b200_cc_drivers }} | ||
| steps: | ||
| - name: Compute tags and image names | ||
| id: compute | ||
| env: | ||
| DISTRO: ${{ inputs.distro || 'ubuntu' }} | ||
| B200_CC_DRIVERS: ${{ inputs.b200_cc_drivers && 'true' || 'false' }} | ||
| run: | | ||
| if [[ "$GITHUB_REF" == refs/tags/podvm-v* ]]; then | ||
| TAG="${GITHUB_REF#refs/tags/podvm-}" | ||
|
|
@@ -84,10 +101,15 @@ jobs: | |
| REPLACE_IMAGE="true" | ||
| fi | ||
| TAG="${TAG//./-}" | ||
| # Suffix CC-driver builds so they never collide with standard images | ||
| if [ "$B200_CC_DRIVERS" = "true" ]; then | ||
| TAG="${TAG}-cc595" | ||
| fi | ||
| { | ||
| echo "tag=$TAG" | ||
| echo "distro=$DISTRO" | ||
| echo "replace_image=$REPLACE_IMAGE" | ||
| echo "b200_cc_drivers=$B200_CC_DRIVERS" | ||
| echo "image_name_release=podvm-${DISTRO}-${TEE_PLATFORM}-release-${TAG}" | ||
| echo "image_name_debug=podvm-${DISTRO}-${TEE_PLATFORM}-debug-${TAG}" | ||
| echo "image_tag_release=${TAG}-${DISTRO}-release" | ||
|
|
@@ -176,7 +198,7 @@ jobs: | |
| PODVM_DISTRO: ${{ needs.meta.outputs.distro }} | ||
| AA_FEATURES: ${{ inputs.aa_features || 'bin,ttrpc,kbs,coco_as,rust-crypto,tdx-attester,nvidia-attester' }} | ||
| GC_REPO: ${{ inputs.guest_components_repo || 'https://github.com/cohere-ai/guest-components.git' }} | ||
| GC_REF: ${{ inputs.guest_components_ref || 'cohere' }} | ||
| GC_REF: ${{ inputs.guest_components_ref || 'alhassankhedr/sync-main-to-cohere' }} | ||
| GC_CUSTOM_BINARIES: ${{ inputs.custom_gc_binaries || 'attestation-agent,api-server-rest' }} | ||
| run: | | ||
| MAKE_ARGS=( | ||
|
|
@@ -195,6 +217,47 @@ jobs: | |
| echo "Disk after binaries build:" | ||
| df -h / | ||
|
|
||
| - name: Override NVIDIA driver to 595.71.05 (B200 multi-GPU CC) | ||
| if: needs.meta.outputs.b200_cc_drivers == 'true' | ||
| working-directory: src/cloud-api-adaptor/podvm-mkosi | ||
| run: | | ||
| set -euo pipefail | ||
| CONF=mkosi.presets/system/mkosi.conf.d/ubuntu.conf | ||
| # The 595 branch only ships the unversioned `nvidia-driver-open` | ||
| # metapackage in NVIDIA's CUDA repo (no `nvidia-driver-595-open`). | ||
| # Match by package name only so this survives future 580.x.y bumps. | ||
| sed -i -E \ | ||
| -e 's|^([[:space:]]*)nvidia-driver-580-open=.*|\1nvidia-driver-open=595.71.05-1ubuntu1|' \ | ||
| -e 's|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=595.71.05-1ubuntu1|' \ | ||
| -e 's|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=595.71.05-1ubuntu1|' \ | ||
| -e 's|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=595.71.05-1ubuntu1|' \ | ||
| "$CONF" | ||
| echo "----- Updated NVIDIA package pins -----" | ||
| grep -E '^[[:space:]]*(nvidia|libnvidia)' "$CONF" | ||
|
|
||
| - name: Increase debug root partition for CC595 drivers | ||
| if: needs.meta.outputs.b200_cc_drivers == 'true' && matrix.profile == 'debug' | ||
| working-directory: src/cloud-api-adaptor/podvm-mkosi | ||
| run: | | ||
| set -euo pipefail | ||
| CONF=mkosi.presets/system/mkosi.repart-debug/10-root.conf | ||
| # NVIDIA 595 drivers make the root filesystem too large for | ||
| # systemd-repart's Minimize=guess estimation, causing mkfs.ext4 | ||
| # "No space left on device" during the build. | ||
| printf '[Partition]\nType=root\nFormat=ext4\nCopyFiles=/\nMinimize=off\nSizeMinBytes=12G\nSizeMaxBytes=12G\n' > "$CONF" | ||
| echo "----- Updated repart config -----" | ||
| cat "$CONF" | ||
|
Debug root partition override is redundant dead code
Low Severity
The "Increase debug root partition for CC595 drivers" workflow step (conditional on
Additional Locations (1)
Reviewed by Cursor Bugbot for commit de75f12. |
||
|
|
||
| - name: Resolve installed NVIDIA driver version | ||
| working-directory: src/cloud-api-adaptor/podvm-mkosi | ||
| run: | | ||
| set -euo pipefail | ||
| CONF=mkosi.presets/system/mkosi.conf.d/ubuntu.conf | ||
| DRIVER_LINE=$(grep -E '^[[:space:]]*nvidia-driver(-580)?-open=' "$CONF" | head -n1) | ||
| DRIVER_VER=$(printf '%s' "$DRIVER_LINE" | sed -E 's|.*=([0-9]+\.[0-9]+\.[0-9]+).*|\1|') | ||
| echo "Resolved NVIDIA driver version: $DRIVER_VER" | ||
| echo "NVIDIA_DRIVER=$DRIVER_VER" >> "$GITHUB_ENV" | ||
|
|
||
| - name: Build OS image | ||
| working-directory: src/cloud-api-adaptor/podvm-mkosi | ||
| env: | ||
|
|
@@ -265,6 +328,7 @@ jobs: | |
| --arg distro "$DISTRO" \ | ||
| --arg profile "$PROFILE" \ | ||
| --arg tee_platform "$TEE_PLATFORM" \ | ||
| --arg nvidia_driver "$NVIDIA_DRIVER" \ | ||
| --arg build_date "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \ | ||
| '$ARGS.named' > /tmp/measurements.json | ||
|
|
||
|
|
@@ -307,6 +371,7 @@ jobs: | |
| --annotation "com.cohere.caa.commit=${CAA_COMMIT}" \ | ||
| --annotation "com.cohere.caa.version=${GITHUB_REF_NAME}" \ | ||
| --annotation "com.cohere.rtmr2=${RTMR2}" \ | ||
| --annotation "com.cohere.nvidia.driver=${NVIDIA_DRIVER}" \ | ||
| --format json > oras-output.json | ||
|
|
||
| cat oras-output.json | ||
|
🔒 Agentic Security Review
Impact: If that branch is updated maliciously (or compromised), unreviewed guest-components code can be pulled into release artifacts and published as trusted PodVM images. |
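One way to reason about the exposure described in this review is to separate immutable refs (full commit SHAs) from mutable ones (branch names such as the new `guest_components_ref` default). The following is a minimal sketch, not part of this PR; the helper names `is_pinned_sha` and `classify_ref` are hypothetical, and the 40-character hex value is a placeholder rather than a real guest-components commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): succeed only when the ref is a
# full 40-character lowercase hexadecimal commit SHA, i.e. an immutable ref.
is_pinned_sha() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# Report how a given ref value would be classified.
classify_ref() {
  if is_pinned_sha "$1"; then
    echo "pinned"
  else
    echo "mutable"
  fi
}

# The branch name used as the new default is a mutable ref.
classify_ref "alhassankhedr/sync-main-to-cohere"
# Placeholder SHA for illustration only.
classify_ref "0123456789abcdef0123456789abcdef01234567"
```

A release workflow could call a check like this on `GC_REF` and refuse to publish when the result is `mutable`; whether that trade-off suits this repository is a judgment call for the maintainers.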
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -30,6 +30,10 @@ Packages= | |
| nvidia-fabricmanager=580.126.20-1 | ||
| libnvidia-nscq=580.126.20-1 | ||
| nvidia-container-toolkit=1.19.0-1 | ||
| libcurl4t64 | ||
| libxml2 | ||
| libxmlsec1-openssl | ||
| pciutils | ||
|
Duplicate
|
||
|
|
||
| RemoveFiles=/etc/issue | ||
| RemoveFiles=/etc/issue.net | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,6 +17,8 @@ ARG CUSTOM_GC_BINARIES="" | |
| ARG AA_FEATURES="" | ||
| ARG GUEST_COMPONENTS_REF="" | ||
| ARG GUEST_COMPONENTS_REPO="https://github.com/confidential-containers/guest-components.git" | ||
| ARG NVAT_REPO="https://github.com/NVIDIA/attestation-sdk.git" | ||
| ARG NVAT_TAG="2026.03.02" | ||
| ARG DEBIAN_FRONTEND=noninteractive | ||
| RUN set -e; \ | ||
| if [ -n "${CUSTOM_GC_BINARIES}" ] && [ -z "${GUEST_COMPONENTS_REF}" ]; then \ | ||
|
|
@@ -26,15 +28,34 @@ RUN set -e; \ | |
| if [ -n "${CUSTOM_GC_BINARIES}" ] && [ -n "${GUEST_COMPONENTS_REF}" ]; then \ | ||
| apt-get update && \ | ||
| apt-get install -y --no-install-recommends \ | ||
| protobuf-compiler pkg-config clang libssl-dev libtss2-dev && \ | ||
| protobuf-compiler pkg-config clang libclang-dev libssl-dev libtss2-dev \ | ||
| cmake libcurl4-openssl-dev libxml2-dev libxmlsec1-dev libxmlsec1-openssl && \ | ||
| apt-get clean && rm -rf /var/lib/apt/lists/* && \ | ||
| mkdir -p /build/gc && cd /build/gc && \ | ||
| git init && \ | ||
| git remote add origin "${GUEST_COMPONENTS_REPO}" && \ | ||
| git fetch --depth=1 origin "${GUEST_COMPONENTS_REF}" && \ | ||
| git reset --hard FETCH_HEAD; \ | ||
| fi | ||
| # Build NVIDIA Attestation SDK (libnvat) from source so the nvidia-attester | ||
| # feature can link against it. The AA binary will dynamically link libnvat.so, | ||
| # which must also be present in the final PodVM image at runtime. | ||
| # Installs its own build deps so this works even without CUSTOM_GC_BINARIES. | ||
| RUN set -e; \ | ||
| if echo "${AA_FEATURES}" | grep -q "nvidia-attester"; then \ | ||
| apt-get update && \ | ||
| apt-get install -y --no-install-recommends cmake libssl-dev libcurl4-openssl-dev \ | ||
| libxml2-dev libxmlsec1-dev libxmlsec1-openssl && \ | ||
| apt-get clean && rm -rf /var/lib/apt/lists/* && \ | ||
| git clone --depth 1 --branch "${NVAT_TAG}" "${NVAT_REPO}" /build/nvat && \ | ||
|
🔒 Agentic Security Review
The new build step clones
Impact: If the upstream tag is retargeted or the source repo is compromised, malicious code could be compiled into |
||
| cd /build/nvat/nv-attestation-sdk-cpp && \ | ||
| cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=lib && \ | ||
| cmake --build build && \ | ||
| cmake --install build && \ | ||
| ldconfig; \ | ||
| fi | ||
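The tag-retargeting concern raised in the review of the `git clone --branch "${NVAT_TAG}"` step could be addressed by comparing the commit the tag resolved to against a SHA recorded when the tag was reviewed. A minimal sketch under that assumption follows; `check_resolved_commit` is a hypothetical helper, not something this PR defines, and the SHAs are placeholders rather than real attestation-sdk commits.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not part of this PR): compare the commit that a tag
# resolved to against a SHA recorded when the tag was reviewed.
check_resolved_commit() {
  expected="$1"
  actual="$2"
  if [ "$actual" = "$expected" ]; then
    echo "ok"
  else
    echo "mismatch: expected $expected, got $actual" >&2
    return 1
  fi
}

# Placeholder SHAs for illustration only; they are not real NVAT commits.
reviewed="1111111111111111111111111111111111111111"
check_resolved_commit "$reviewed" "$reviewed"
check_resolved_commit "$reviewed" "2222222222222222222222222222222222222222" \
  || echo "build would stop here"
```

In the Dockerfile step, the second argument could come from `git -C /build/nvat rev-parse HEAD` right after the clone, with the expected value passed in through an additional build argument (hypothetical, for example `NVAT_COMMIT`).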
|
cursor[bot] marked this conversation as resolved.
|
||
| COPY cloud-api-adaptor/podvm/build-guest-components.sh /build/ | ||
| ENV NVAT_USE_SYSTEM_LIB=1 | ||
| RUN /build/build-guest-components.sh "${CUSTOM_GC_BINARIES}" "${AA_FEATURES}" | ||
|
|
||
| # ubuntu:24.04 | ||
|
|
@@ -146,5 +167,14 @@ RUN for bin in /tmp/gc-overrides/*; do \ | |
| install -m0755 "$bin" /src/cloud-api-adaptor/podvm/files/usr/local/bin/"$(basename "$bin")"; \ | ||
| done; true | ||
|
|
||
| # Copy libnvat shared library if it was built (needed at runtime by attestation-agent | ||
| # when compiled with nvidia-attester feature). Uses a mount instead of COPY so | ||
| # builds without nvidia-attester don't fail on an empty glob. | ||
| RUN --mount=from=gc_builder,src=/usr/lib/,dst=/tmp/gc-lib/,readonly \ | ||
| if ls /tmp/gc-lib/libnvat* 1>/dev/null 2>&1; then \ | ||
| mkdir -p /src/cloud-api-adaptor/podvm/files/usr/lib/ && \ | ||
| cp /tmp/gc-lib/libnvat* /src/cloud-api-adaptor/podvm/files/usr/lib/; \ | ||
| fi; true | ||
|
cursor[bot] marked this conversation as resolved.
|
||
|
|
||
| FROM scratch | ||
| COPY --from=podvm_binaries_builder /src/cloud-api-adaptor/podvm/files / | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -30,6 +30,10 @@ for bin in "${BINS[@]}"; do | |
| exit 1 | ||
| fi | ||
| cd /build/gc/attestation-agent/attestation-agent | ||
| # Refresh lockfile so optional feature deps (e.g. nv-attestation-sdk | ||
| # for nvidia-attester) are resolved even if the checked-in Cargo.lock | ||
| # was generated without them. | ||
| cargo update --workspace | ||
|
🔒 Agentic Security Review
The new
Impact: A malicious or compromised transitive crate release could be silently pulled into |
||
| cargo build --release --locked --no-default-features \ | ||
| --features "$AA_FEATURES" --bin ttrpc-aa | ||
| cp /build/gc/target/release/ttrpc-aa "$OUTDIR/attestation-agent" | ||
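Because `cargo update --workspace` runs before `cargo build --locked`, the `--locked` flag only checks the build against the lockfile as it stands after the refresh, not against the checked-in copy. A build script could make any resulting drift visible by comparing the two versions of the file. The sketch below uses stand-in files rather than a real `Cargo.lock`, and `lock_drifted` is a hypothetical helper that is not part of this PR.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not part of this PR): report whether a refreshed
# lockfile differs from the copy that was checked in.
lock_drifted() {
  # cmp -s exits 0 when the files are byte-identical, so invert it:
  # this function succeeds only when the two files differ.
  ! cmp -s "$1" "$2"
}

# Self-contained demo with stand-in files instead of a real Cargo.lock.
workdir=$(mktemp -d)
printf 'name = "demo"\nversion = "1.0.0"\n' > "$workdir/Cargo.lock.orig"
cp "$workdir/Cargo.lock.orig" "$workdir/Cargo.lock"

if lock_drifted "$workdir/Cargo.lock.orig" "$workdir/Cargo.lock"; then
  before="drifted"
else
  before="unchanged"
fi

# Simulate what a lockfile refresh could do: bump a resolved version.
printf 'name = "demo"\nversion = "1.0.1"\n' > "$workdir/Cargo.lock"

if lock_drifted "$workdir/Cargo.lock.orig" "$workdir/Cargo.lock"; then
  after="drifted"
else
  after="unchanged"
fi

echo "before refresh: $before, after refresh: $after"
rm -rf "$workdir"
```

In the build script, the same comparison could be made between a saved copy of the checked-in lockfile and the file left behind by `cargo update --workspace`, logging or failing when they differ.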
|
|
||


Opt-in flag is non-functional; config already hardcodes 595
High Severity
The `b200_cc_drivers` flag is supposed to opt in to the 595 driver by sed-replacing 580 package pins, but `ubuntu.conf` was directly committed with 595 packages (`nvidia-driver-open=595.71.05-1ubuntu1`), making the sed a no-op. Specifically, the first sed pattern looks for `nvidia-driver-580-open=`, which doesn't exist in the committed file (it has `nvidia-driver-open=` without `580`), and patterns 2–4 match but replace with the same already-present 595 values. This means ALL builds use the 595 driver regardless of the flag, contradicting the stated "Default behaviour is unchanged" and the PR's "opt-in" design. The flag's only actual effect is appending `-cc595` to the image tag.
Additional Locations (1)
src/cloud-api-adaptor/podvm-mkosi/mkosi.presets/system/mkosi.conf.d/ubuntu.conf#L27-L31
Reviewed by Cursor Bugbot for commit fd50ec5.
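The no-op behaviour described in this comment can be reproduced in isolation. The sketch below applies two of the workflow's sed expressions to a stand-in config file; the file contents are an assumption reconstructed from the review text (pins already at 595.71.05), not a copy of the real `ubuntu.conf`.

```bash
#!/usr/bin/env bash
# Stand-in for ubuntu.conf as the review describes it at commit fd50ec5:
# the pins are ALREADY at 595.71.05 (assumed contents, not the real file).
conf=$(mktemp)
printf '    nvidia-driver-open=595.71.05-1ubuntu1\n' > "$conf"
printf '    nvidia-persistenced=595.71.05-1ubuntu1\n' >> "$conf"

# Apply the first two expressions from the workflow step, writing to a
# second file so the before/after contents can be compared.
patched=$(mktemp)
sed -E \
  -e 's|^([[:space:]]*)nvidia-driver-580-open=.*|\1nvidia-driver-open=595.71.05-1ubuntu1|' \
  -e 's|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=595.71.05-1ubuntu1|' \
  "$conf" > "$patched"

# The first pattern finds no 580 pin; the second rewrites a line to itself.
if cmp -s "$conf" "$patched"; then
  result="no-op"
else
  result="changed"
fi
echo "$result"   # prints "no-op"
rm -f "$conf" "$patched"
```

A check of this shape (compare the file before and after the sed, and fail if nothing changed) would also give the workflow step a way to detect when its patterns no longer match the committed pins.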