Rocm pixi env #175

Merged
jandom merged 30 commits into aqlaboratory:main from sdvillal:rocm-pixi-env
Apr 28, 2026

@Emrys-Merlin
Contributor

Summary

This PR introduces a ROCm pixi environment called openfold3-rocm7, in line with the cpu/cuda12/cuda13 environments. This unifies the usage pattern of openfold3 after the migration to the pixi package manager.

Changes

  • Added a pytorch-rocm pixi feature, which pulls pytorch and triton with ROCm support from the pytorch PyPI mirror. (Please note that we cannot yet pull pytorch-rocm dependencies from conda-forge.)

Related Issues

I tried to build the environment on our HPC cluster, but our proxy interfered with the resolution of the pytorch dependency. @sdvillal thankfully already opened an issue about that with the pixi developers, so hopefully this will be resolved soon. I spun up an AWS EC2 instance where the resolution worked without any issues.

Testing

I could only test that the environment resolves as I do not have access to an AMD accelerator. @singagan if you could help me out here, that would be highly appreciated :-)

The current output of the validate-openfold3-rocm command is as follows:

$ pixi run -e openfold3-rocm7 validate-openfold3-rocm
OpenFold3 ROCm environment check

  [PASS] PyTorch installed: 2.11.0+rocm7.2
  [PASS] PyTorch built with ROCm (HIP): 7.2.26015
  [FAIL] ROCm GPU visible: none
  [PASS] Triton installed: 3.6.0
  [FAIL] Triton backend is HIP: 0 active drivers ([]). There should only be one.
  [PASS] Triton evoformer kernel loaded

One or more checks failed. See above for details.
Installation instructions: https://github.com/aqlaboratory/openfold-3/blob/main/docs/source/Installation.md
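
For readers who want to reproduce the first three checks by hand, here is a minimal sketch. It is not the implementation behind validate-openfold3-rocm; the function name check_rocm_env is hypothetical, and the Triton checks are left out. It relies only on torch.__version__, torch.version.hip, and the torch.cuda namespace, through which ROCm builds of PyTorch expose AMD GPUs.

```python
import importlib.util


def check_rocm_env():
    """Return a list of (check name, passed, detail) tuples.

    Hypothetical helper for illustration; not the project's validation task.
    """
    if importlib.util.find_spec("torch") is None:
        return [("PyTorch installed", False, "torch not importable")]

    import torch

    results = [("PyTorch installed", True, torch.__version__)]

    # torch.version.hip is None unless PyTorch was built against ROCm/HIP.
    hip = torch.version.hip
    results.append(("PyTorch built with ROCm (HIP)", hip is not None, str(hip)))

    # ROCm builds report AMD GPUs through the torch.cuda API.
    gpu_visible = hip is not None and torch.cuda.is_available()
    detail = torch.cuda.get_device_name(0) if gpu_visible else "none"
    results.append(("ROCm GPU visible", gpu_visible, detail))
    return results


if __name__ == "__main__":
    for name, passed, detail in check_rocm_env():
        print(f"  [{'PASS' if passed else 'FAIL'}] {name}: {detail}")
```

On a machine without an AMD accelerator, the GPU check in this sketch fails while the build check can still pass, which matches the pattern in the output above.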

Other Notes
Note, as we need to pull pytorch from PyPI, we pull almost all dependencies from PyPI and not from conda-forge. This is necessary, because if any one of our dependencies were to pull pytorch from conda-forge, this would supersede our PyPI pytorch request and we would end up with a pytorch version without ROCm support. This is a known pixi limitation. If it gets resolved, we could think about pulling more of the dependencies from conda-forge, but this is optional and not a blocker.
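
One way to notice that the PyPI request was superseded is to inspect the installed version string. The sketch below assumes that ROCm wheels from the pytorch PyPI mirror carry a local version label starting with "rocm" (as in 2.11.0+rocm7.2 from the validation output above) and that other builds do not. The helper name is_rocm_build is hypothetical.

```python
def is_rocm_build(version: str) -> bool:
    """Return True if the local version label (the part after '+') starts with 'rocm'.

    Assumption: ROCm wheels are tagged like '2.11.0+rocm7.2'.
    """
    _, _, local = version.partition("+")
    return local.startswith("rocm")


# Example inputs; only the first string is taken from this PR.
for candidate in ("2.11.0+rocm7.2", "2.11.0", "2.11.0+cu128"):
    print(candidate, is_rocm_build(candidate))
```

In a live environment one would pass torch.__version__ to this helper; a False result in the ROCm environment would suggest that a different pytorch build was installed.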

@sdvillal, I would love to get your feedback. The environment setup is rather complex and I'm not completely convinced I assembled the rocm environment correctly (or if I pulled in unnecessary features).

@jnwei @jandom As discussed in #166, this is the draft to enable ROCm in the pixi setup.

sdvillal and others added 25 commits March 23, 2026 12:33
* Add initial pixi environment

all tests pass, predictions seem to be correct
corresponds to a modernized conda environment following best practices

* Reorder dependencies for easier read

* Add openfold3 as an editable dependency

* Sync cuda-python pin between pypi package and the conda environment

* Comments

Comments

Overcommenting issues

* Add explicitly a conda yml version of the pixi environment

* Improve some wordings

* Update pixi lockfile

* Vendoring pieces of deepspeed

incomplete, we might not need the native sources
from upstream commit df59f203f40c8a292dd019ae68c9e6c88f107026

* Swap ninja verification with pytorch's

* Vendoring pieces of deepspeed

incomplete, we might not need the native sources
from upstream commit df59f203f40c8a292dd019ae68c9e6c88f107026

* Use vendored deepspeed evoformer builder

Use vendored deepspeed in the attention primitives

* Add symlink to vendored deepspeed as in upstream

* Vendor also op_builder.__init__ from deepspeed

* Import explicitly EvoformerAttnBuilder, avoiding broken introspection magic

* Add a ignore mechanism for cutlass detection in vendored deepspeed

* Apply cutlass detection workaround and remove all nvidia-cutlass tricks from pixi environment

* Remove nvidia-cutlass from openfold-3 dependencies (fix later)

* Remove pypi ninja dependency in pixi workspace

* No need for cutlass hacks

* Add pixi config to .gitattributes

* Remove deepspeed hacks for good

* Update pixi lockfile

* Update pixi conda environment

* Remove MKL from pypi dependencies, as it is unused

* Remove aria2 from pypi dependencies, unused and not so much of a convenience

* Update lockfile

Update lockfile

* Re-enable pure PyPI install

* Disable hack when conda is active

* More comments on cutlass python API deprecation and pytorch

* Make pixi environments (CPU, CUDA12, CUDA13, for all major platforms)

* Increase LMDB map size to make test pass in osx-arm64

* Better comments of TODOs in pixi.toml

Better comments of TODOs in pixi.toml

Better comments of TODOs in pixi.toml

* Pin cuequivariance until test failure is investigated

* Move deepspeed to optional dependency also in pyproject

* Pyproject: extend python version support

* Pyproject: move dependencies table together with optional-dependencies

* Pyproject: document future decision on dependency-groups

* Pyproject: reformat to consolidate indent to 4 spaces

* Pyproject: reorder dependencies for easier read

* Pixi: add scipy

* Pixi: add comment on CUDA13

* Pixi: make cuequivariance CUDA generic for its conda packages

* Pixi: add reminder about devel install

* Pyproject: fix and improve readability, add URLs

* pixi.toml: make more readable by showing first envs, then base, then variants

* pixi.toml: pin deepspeed to 0.18.3, first one with ninja detection fixed

* pixi.toml: fully enable aarch64 and cuda13, revamp docs

* pixi.lock: update

* pixi.toml: add triton to cuequivariance dependencies for CUDA13

* pixi.lock: update

* pixi.toml: include pip to allow users to play

* pixi.toml: formatting for better readability

* pixi.toml: restrict cuequivariance-cu13 to linux-64 until we unpin to >=0.8

* pixi.toml: formatting for better readability

* pixi.toml: make pytorch-gpu an isolated environment feature

in this way we can more easily express when a package is not ready yet in CF

* pixi.toml: add environments that combine mostly pypi-based deps with CUDA from conda

* pixi.toml: add openfold3-editable-full and account for lack of cuequivariance for python=3.14

* pixi.toml: brief documentation of the pypi-dominant environments

* pixi.toml: add also the dev optional dependency group to openfold3-full

* pyproject.toml: pin cuequivariance to <0.8 until we adapt tests

* pixi.toml: add kalign to required non-pypi dependencies

* pixi.toml: add more bioinformatics tools to non-pypi

* pixi.toml: make env setup be part of the deepspeed-build feature

* pixi.toml: simplify management of pypi features

* pixi.lock: update, all tests pass A100,B300 x CUDA12,CUDA13

* pixi.toml: add table of what works and what needs test

* pixi.toml: add tasks for exporting to regular conda environment yamls

* conda environments: delete outdated modernized conda env, use new tasks instead

* pixi.toml: bump min pixi version

* pixi.toml: remove unnecessary comments

* pixi.toml: remove unnecessary envvar definition for isolating extension builds

* pixi.toml: better definition of maintenance environment

pixi.toml: better definition of maintenance environment

pixi.toml: better definition of maintenance environment

* pixi.toml: add simple task to run test and save results to an environment-specific dir

* of3: enable pickling regardless of forking strategy and platform

* of3: enable multiple data loader workers in osx mps backed

* Vendor improved deepspeed builder from upstream PR

See: deepspeedai/DeepSpeed#7760

* pixi.lock: update

* pixi.toml: remove some comment noise

* of3: fix multiprocessing configuration corner case in osx

* docker: move outdated example dockerfiles to docker/pixi-examples

* examples: add example runner for osx inference

* pixi.toml: ensure we get the right pytorch from pypi

something similar should actually be supported in pyproject.toml

* pixi.lock: update, fixed torch cuda mismatch in pypi environments

* pixi.toml: fix lock export + make default environment be maintenance

* pixi.toml: use a more consistent name for environment arg

* pixi.lock: update

* pixi.toml: workaround for no-default-feature breaking the test task (pixi bug)

* pixi.toml: issue with pixi pypi resolution seems solved

* Revert "pixi.toml: issue with pixi pypi resolution seems solved"

This reverts commit ded3482.

* pixi.toml: better document problem and workaround

* pixi.toml: make the test task present in all relevant environments

this I feel makes less surprising its use, as opposed to passing the environment as an arg to a dependent task

* pixi.toml: let CUDA13 flow freely

* pixi.lock: update for initial pytorch 2.10, cuda 13.1 support

* pixi.toml: add safe cuda environments (no accelerators)

* of3: remove deepspeed hacks

note that there are still some in __init__.py

* of3: unvendor deepspeed

* pixi.toml: simplify deepspeed dependency after our changes made it to CF/pypi

* pixi.toml: remove safe environments as we are not maintaining them

* pixi.toml: enable pytorch-coda in cuda 13 env after 2.10 release

* pyproject.toml: pin deepspeed to >0.18.5, improved evoformer compilation

* Add awscrt to dependencies, missing from recent PR

* pixi.toml: setup correctly path to PTXAS_BLACKWELL for triton >=3.6.0

* pixi.toml: add -safe environments, at the moment just without cuequivariance

these are also conda-pure environments

* pixi.lock: update after consolidation (no vendor, pytorch 2.10 + CF cuda13)

* pixi.toml: update outdated comments

* updates with GB10 tests (#2)

* updates with GB10 tests

* cleanup

* harmonize

* linting data_module.py

* speculative changes

* pixi.toml: remove safe environments

* pixi.lock: update after removal of safe environments

* Remove pixi docker examples, to rework

* Comment-out workaround for hard to reproduce ABI mismatch problem

* pixi.toml: bump pixi, improve conda export by including all env variables

* pixi.toml: unpin biotite

* pixi.toml: python has its own feature

* pixi.toml: bump deepspeed

* pyproject.toml: bump deepspeed to version without Evoformer build bug

* pixi.toml: detail on workaround

* pixi.lock: update

* pixi.toml: add example task to update safely the lockfile

* pixi.toml: remove kalign2

* tests: fix test depending on unspecified glob return order

* pixi.toml: better metadata

* docs: wip

* pixi.lock: update

* Allow to configure multiprocessing start and set safe defaults

We would still need to document this for users

* Fix capitalization error

* Fix capitalization error

* Fix typo

* pixi.lock: update

---------

Co-authored-by: Tim Adler <tim.adler@bayer.com>
Co-authored-by: Jan Domański <jan.domanski@omsf.io>
@jandom
Collaborator

jandom commented Apr 14, 2026

@Emrys-Merlin great contribution Tim :-)

@jandom
Collaborator

jandom commented Apr 16, 2026

Getting some test failures with this env on AMD

FAILED openfold3/tests/test_triangular_attention.py::test_shape[cuda-True] - AssertionError: Values are not sufficiently close.
FAILED openfold3/tests/test_triangular_attention.py::test_shape[cuda-False] - AssertionError: Values are not sufficiently close.
FAILED openfold3/tests/test_triangular_multiplicative_update.py::test_shape[cuda] - AssertionError: Values are not sufficiently close.

It could all be expected numerics, unclear. This is the chip

(openfold3:openfold3-rocm7) [jandom@k006-004-v2 openfold-3]$ amd-smi 
+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: 6.16.13  ROCm version: 7.2.0    |
| VBIOS version: 613661                                                        |
| Platform: Linux Guest (Passthrough)                                          |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:0c:00.0     AMD Instinct MI210 | 0 %      51 °C   0            43/300 W |
|   0       0     N/A             N/A | 0 %        N/A             10/65520 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|  No running processes found                                                  |
+------------------------------------------------------------------------------+

update

Looking in more detail, the test_triangular_multiplicative_update.py failure seems fine (minimal drift):

E       
E       output:
E         Shape: (2, 22, 22, 128)
E         Number of differences: 27 / 123904 (0.0%)
E         Statistics are computed for differing elements only.
E         Stats for abs(obtained - expected):
E           Max:     1.8238788470625877e-06
E           Mean:    1.1477516181912506e-06
E           Median:  1.0848743841052055e-06
E         Stats for abs(obtained - expected) / abs(expected):
E           Max:     0.003565334714949131
E           Mean:    0.0020918985828757286
E           Median:  0.001880077994428575
E         Individual errors:

The other two (both for test_triangular_attention.py) look more severe:

E       output:
E         Shape: (2, 22, 22, 128)
E         Number of differences: 57574 / 123904 (46.5%)
E         Statistics are computed for differing elements only.
E         Stats for abs(obtained - expected):
E           Max:     6.344435678329319e-05
E           Mean:    9.54591541812988e-06
E           Median:  8.05101626610849e-06
E         Stats for abs(obtained - expected) / abs(expected):
E           Max:     2778.166259765625
E           Mean:    1.3626092672348022
E           Median:  0.3236933946609497

E       output:
E         Shape: (2, 22, 22, 128)
E         Number of differences: 47210 / 123904 (38.1%)
E         Statistics are computed for differing elements only.
E         Stats for abs(obtained - expected):
E           Max:     5.022007826482877e-05
E           Mean:    1.0010324331233278e-05
E           Median:  8.413369869231246e-06
E         Stats for abs(obtained - expected) / abs(expected):
E           Max:     57885.04296875
E           Mean:    3.7973642349243164
E           Median:  0.34220370650291443
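
A note on reading these statistics: the absolute errors are small (around 1e-5), while the relative errors are huge. That combination is what one would expect when some expected values are very close to zero, because the relative error divides by abs(expected). The sketch below illustrates this with made-up numbers. The combined tolerance check mirrors the usual `atol + rtol * abs(expected)` rule; the tolerances actually used by the test suite are not shown in this thread, so the defaults here are assumptions.

```python
def abs_rel_error(obtained: float, expected: float):
    """Return (absolute error, relative error) for a single element."""
    abs_err = abs(obtained - expected)
    rel_err = abs_err / abs(expected)
    return abs_err, rel_err


def is_close(obtained: float, expected: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Combined tolerance check: |obtained - expected| <= atol + rtol * |expected|."""
    return abs(obtained - expected) <= atol + rtol * abs(expected)


# Illustrative values only: an expected value near zero, an obtained value off by ~5e-5.
expected = 1e-9
obtained = 5e-5

abs_err, rel_err = abs_rel_error(obtained, expected)
print(f"abs error: {abs_err:.3e}, rel error: {rel_err:.3e}")
print("close with default tolerances:", is_close(obtained, expected))
print("close with atol=1e-4:", is_close(obtained, expected, atol=1e-4))
```

Under these assumptions, the maximum relative error alone says little about severity; the share of differing elements (46.5% and 38.1% here) and the absolute error are the more informative figures.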

@Emrys-Merlin
Contributor Author

Thanks a lot for testing this @jandom! I really appreciate it :-)

I think I count it as a win that the tests ran at all :-D

I agree that some of the numerical differences warrant deeper inspection. I'm open to support here, but I am a bit handicapped without access to AMD GPUs. If it is easy for you to share limited access with me to debug this, that could speed up things a bit. I will continue looking for an internal solution.

I will be on vacation next week. So, I won't be very responsive. If we don't find a solution until Barcelona, I'm happy to chat there :-)

@jandom
Collaborator

jandom commented Apr 20, 2026

No worries, I've shared this ticket with Gagan already – he might be able to come in and help

@jandom jandom deleted the branch aqlaboratory:main April 23, 2026 07:27
@jandom jandom closed this Apr 23, 2026
@jandom jandom reopened this Apr 24, 2026
…icative update

Floating point arithmetic is not associative: different hardware
parallelizes reductions (e.g. matrix multiplications, attention
softmax) in different orders, accumulating rounding errors differently.
CUDA and ROCm therefore produce results that diverge by up to ~2e-6
even on identical inputs. Snapshot comparisons are now routed to
nvidia/ or rocm/ subdirectories based on torch.version.hip, so each
platform validates consistency with itself across code changes.
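
The non-associativity mentioned in this commit message can be seen in plain Python, without any GPU. This is a minimal illustration with double-precision floats; it does not reproduce the actual CUDA/ROCm divergence, only the underlying effect of changing the order of additions.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left == right)      # False
print(abs(left - right))  # a difference on the order of 1e-16
```

A parallel reduction over many elements performs additions in a hardware-dependent order, so tiny per-step differences like this one can accumulate into the drift described above.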
@singagan
Contributor

Hi @Emrys-Merlin, @jandom, @sdvillal, thank you for adding ROCm support. I tested this on AMD hardware. The environment resolves and installs correctly. I ran into snapshot regression failures in test_triangular_attention and test_triangular_multiplicative_update: the snapshots were generated on NVIDIA (added in 9879cd8e, stored under openfold3/tests/test_data/snapshots/), and since floating point arithmetic is not associative, we observe numerical differences. I added per-platform snapshot support (nvidia/ and rocm/ subdirectories) with ROCm-generated snapshots; all tests now pass. Branch with the changes: https://github.com/singagan/openfold-3/tree/rocm-pixi-env. Please feel free to pull it in if it looks good to you, or I can make a separate PR if that works better.
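
As a rough sketch of how per-platform snapshot routing could look (not necessarily how the branch implements it), the helper below picks a subdirectory from the HIP version string. The function name platform_snapshot_dir is hypothetical. The HIP version is passed in as an argument so the sketch runs without PyTorch; in a test suite one would pass torch.version.hip, which is None on builds without ROCm.

```python
from pathlib import Path
from typing import Optional


def platform_snapshot_dir(snapshot_root: Path, hip_version: Optional[str]) -> Path:
    """Return the platform-specific snapshot directory.

    Assumption: any non-empty HIP version means ROCm; everything else
    (including CPU-only builds) falls back to the nvidia/ snapshots.
    """
    platform = "rocm" if hip_version else "nvidia"
    return snapshot_root / platform


root = Path("openfold3/tests/test_data/snapshots")
print(platform_snapshot_dir(root, "7.2.26015"))  # ROCm build
print(platform_snapshot_dir(root, None))         # build without HIP
```

With this layout each platform is compared only against snapshots generated on the same platform, so a test failure signals a code change rather than a hardware difference.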

@Emrys-Merlin Emrys-Merlin changed the base branch from pixi-beta to main April 28, 2026 09:20
@Emrys-Merlin Emrys-Merlin marked this pull request as ready for review April 28, 2026 09:21
@Emrys-Merlin Emrys-Merlin marked this pull request as draft April 28, 2026 09:22
@Emrys-Merlin Emrys-Merlin marked this pull request as ready for review April 28, 2026 09:36
@Emrys-Merlin
Contributor Author

Hi @singagan,

Thanks a lot for your help! Your changes make sense to me, so I added them to this PR.

From my side this PR is now ready for testing @jandom.

After everything is done, I think we should squash the commits in this PR. As I started working on this feature before the pixi PR was merged, there are a couple of commits in the history that were squashed when the pixi PR was merged.

@jandom jandom added the safe-to-test label Apr 28, 2026
@jandom
Collaborator

jandom commented Apr 28, 2026

Kicked-off all the tests and let's merge as soon as green

@jandom
Collaborator

jandom commented Apr 28, 2026

@Emrys-Merlin @singagan thanks for contributing here – updated snapshots make me happy!

@jnwei
Contributor

jnwei commented Apr 28, 2026

Could we quickly add some documentation for installing with AMD? I think we could add it to this file (and rename the title accordingly)

https://github.com/aqlaboratory/openfold-3/blob/main/docs/source/kernels.md

@jandom
Collaborator

jandom commented Apr 28, 2026

Don't we have the install instructions already here?

https://github.com/aqlaboratory/openfold-3/blob/main/docs/source/Installation.md

Why would we add those ROCm instructions to the kernels page? Sorry maybe I'm confused

@jnwei
Contributor

jnwei commented Apr 28, 2026

Oh if we have previous instructions on the installation.md that's great, but looking at this version, it doesn't seem like it includes the new openfold3-rocm7 environment

@Emrys-Merlin
Contributor Author

I added a line to Installation.md and updated the pixi figure.

@jandom
Collaborator

jandom commented Apr 28, 2026

Looks good to me! If there are any outstanding issues, let's do a follow-up PR

@jandom jandom merged commit 7b8068f into aqlaboratory:main Apr 28, 2026
2 checks passed
@jandom jandom deleted the rocm-pixi-env branch April 28, 2026 14:55
Labels

safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing.

5 participants