From e0e73d290efaee7dfe216eaaf023cee65a8df1f3 Mon Sep 17 00:00:00 2001 From: mesutoezdil Date: Mon, 11 May 2026 14:23:04 +0200 Subject: [PATCH] fix: make docs/ the default version and sync v2.8.0 snapshot (98 files) The website defaulted to the v2.8.0 versioned snapshot. All content improvements merged to docs/ since v2.8.0 were invisible on the live site. Contributors edited versioned_docs directly, causing ongoing divergence. - docusaurus.config.js: set lastVersion:'current' so docs/ is shown at /docs/ (labeled 'latest'); v2.8.0 moves to /docs/v2.8.0/ - versioned_docs/version-v2.8.0/: sync all 98 files with current docs/ After this merge, any PR to docs/ immediately appears on the live site. Signed-off-by: mesutoezdil --- docusaurus.config.js | 6 + versioned_docs/version-v2.8.0/README.md | 2 +- .../version-v2.8.0/contributor/adopters.md | 18 +- .../contributor/cherry-picks.md | 133 ++---- .../contributor/contribute-docs.md | 352 +++++++++------- .../contributor/contributing.md | 397 +++++++++++++++--- .../contributor/contributors.md | 18 +- .../contributor/github-workflow.md | 321 +++++--------- .../version-v2.8.0/contributor/governance.md | 45 +- .../version-v2.8.0/contributor/ladder.md | 230 +++++----- .../version-v2.8.0/contributor/lifted.md | 24 +- .../core-concepts/architecture.md | 4 +- .../core-concepts/gpu-virtualization.md | 6 +- .../core-concepts/introduction.md | 1 + .../version-v2.8.0/developers/build.md | 30 +- .../version-v2.8.0/developers/dynamic-mig.md | 40 +- .../developers/hami-core-design.md | 2 +- .../hami-webui-development-guide.md | 1 - .../developers/kunlunxin-topology.md | 30 +- .../version-v2.8.0/developers/protocol.md | 14 +- .../version-v2.8.0/developers/scheduling.md | 86 ++-- versioned_docs/version-v2.8.0/faq/faq.md | 10 +- .../get-started/deploy-with-helm.md | 98 ++--- .../version-v2.8.0/get-started/verify-hami.md | 15 +- .../installation/aws-installation.md | 14 +- .../installation/how-to-use-hami-dra.md | 9 +- .../installation/how-to-use-volcano-ascend.md | 15 +- .../installation/how-to-use-volcano-vgpu.md | 4 +- .../installation/offline-installation.md | 4 +- .../installation/online-installation.md | 4 +- .../installation/prerequisites.md | 9 +- .../version-v2.8.0/installation/uninstall.md | 124 +++++- .../version-v2.8.0/installation/upgrade.md | 297 ++++++++++++- .../installation/webui-installation.md | 16 +- .../key-features/device-sharing.md | 5 +- versioned_docs/version-v2.8.0/releases.md | 47 +-- .../troubleshooting/troubleshooting.md | 6 +- .../ascend-device/device-template.md | 3 + .../ascend-device/enable-ascend-sharing.md | 224 +++++++--- .../ascend-device/examples/allocate-310p.md | 18 +- .../ascend-device/examples/allocate-910b.md | 13 +- .../examples/allocate-exclusive.md | 5 +- .../enable-awsneuron-managing.md | 16 +- .../examples/allocate-neuron-core.md | 4 +- .../examples/allocate-neuron-device.md | 6 +- .../enable-cambricon-mlu-sharing.md | 6 +- .../examples/allocate-core-and-memory.md | 2 +- .../examples/allocate-exclusive.md | 2 +- .../specify-device-memory-usage.md | 2 +- .../specify-device-type-to-use.md | 6 +- .../version-v2.8.0/userguide/configure.md | 16 +- .../userguide/device-supported.md | 7 +- .../enable-enflame-gcu-sharing.md | 21 +- .../userguide/hami-webui-user-guide.md | 5 +- .../hygon-device/enable-hygon-dcu-sharing.md | 20 +- .../examples/allocate-core-and-memory.md | 2 +- .../examples/specify-certain-cards.md | 2 +- .../hygon-device/specify-device-core-usage.md | 4 +- .../specify-device-memory-usage.md | 4 +- 
.../examples/allocate-bi-v150.md | 2 +- .../examples/allocate-exclusive-bi-v150.md | 5 +- .../examples/allocate-exclusive-mr-v100.md | 9 +- .../examples/allocate-mr-v100.md | 4 +- .../enable-kunlunxin-schedule.md | 10 +- .../kunlunxin-device/enable-kunlunxin-vxpu.md | 2 +- .../examples/allocate-whole-xpu.md | 2 +- .../metax-gpu/enable-metax-gpu-schedule.md | 11 +- .../metax-gpu/examples/allocate-binpack.md | 4 +- .../metax-gpu/examples/allocate-spread.md | 4 +- .../metax-gpu/examples/default-use.md | 2 +- .../metax-gpu/specify-binpack-task.md | 2 +- .../metax-gpu/specify-spread-task.md | 2 +- .../metax-sgpu/enable-metax-gpu-sharing.md | 14 +- .../examples/allocate-qos-policy.md | 8 +- .../metax-sgpu/examples/default-use.md | 4 +- .../userguide/monitoring/device-allocation.md | 18 +- .../monitoring/real-time-device-usage.md | 6 +- .../userguide/monitoring/real-time-usage.md | 114 ++++- .../enable-mthreads-gpu-sharing.md | 18 +- .../examples/allocate-core-and-memory.md | 4 +- .../examples/allocate-exclusive.md | 4 +- .../specify-device-core-usage.md | 4 +- .../specify-device-memory-usage.md | 4 +- .../nvidia-device/dynamic-mig-support.md | 21 +- .../examples/allocate-device-core.md | 2 +- .../examples/allocate-device-memory.md | 2 +- .../examples/allocate-device-memory2.md | 2 +- .../examples/dynamic-mig-example.md | 4 +- .../examples/specify-card-type-to-use.md | 6 +- .../examples/specify-certain-card.md | 2 +- .../examples/use-exclusive-card.md | 4 +- .../specify-device-core-usage.md | 4 +- .../specify-device-memory-usage.md | 8 +- .../specify-device-type-to-use.md | 4 +- .../specify-device-uuid-to-use.md | 4 +- .../nvidia-gpu/examples/default-use.md | 6 +- .../nvidia-gpu/examples/use-exclusive-gpu.md | 2 +- .../nvidia-gpu/how-to-use-volcano-vgpu.md | 8 +- .../volcano-vgpu/nvidia-gpu/monitor.md | 14 +- 99 files changed, 1972 insertions(+), 1203 deletions(-) diff --git a/docusaurus.config.js b/docusaurus.config.js index 43b5e648..21c61992 100644 --- a/docusaurus.config.js +++ b/docusaurus.config.js @@ -187,6 +187,12 @@ module.exports = { showLastUpdateAuthor: false, showLastUpdateTime: true, includeCurrentVersion: true, + lastVersion: 'current', + versions: { + current: { + label: 'latest', + }, + }, // Performance optimization: Disable number prefix parser numberPrefixParser: false, // Performance optimization: Disable breadcrumbs for performance diff --git a/versioned_docs/version-v2.8.0/README.md b/versioned_docs/version-v2.8.0/README.md index 604d9f9d..f091a4ad 100644 --- a/versioned_docs/version-v2.8.0/README.md +++ b/versioned_docs/version-v2.8.0/README.md @@ -1,4 +1,4 @@ --- title: readme slug: /readme ---- \ No newline at end of file +--- diff --git a/versioned_docs/version-v2.8.0/contributor/adopters.md b/versioned_docs/version-v2.8.0/contributor/adopters.md index 0d87e5eb..86c7eaa2 100644 --- a/versioned_docs/version-v2.8.0/contributor/adopters.md +++ b/versioned_docs/version-v2.8.0/contributor/adopters.md @@ -1,16 +1,20 @@ +--- +title: HAMi Adopters +--- + # HAMi Adopters -HAMi is used in production by the organisations listed below. +So you and your organisation are using HAMi? That's great. Reach out and let the community know. ## Adding yourself -[Here](https://github.com/Project-HAMi/website/blob/master/src/pages/adopters.mdx) lists the organisations who adopted the HAMi project in production. +[See the list of HAMi adopters](https://github.com/Project-HAMi/website/blob/master/src/pages/adopters.mdx) for organisations who have adopted the HAMi project in production. 
-Add an entry for your company - it will be added to the website once the PR merges.
+Add an entry for your company and upon merging it will automatically be added to the website.

 To add your organisation follow these steps:

-1. Fork the [HAMi-io/website](https://github.com/Project-HAMi/website) repository.
+1. Fork the [Project-HAMi/website](https://github.com/Project-HAMi/website) repository.
 2. Clone it locally with `git clone https://github.com/<your-username>/website.git`.
 3. (Optional) Add the logo of your organisation to `static/img/supporters`.
    Good practice is for the logo to be called e.g. `<your-organisation>.png`.
    This will not be used for commercial purposes.
@@ -19,10 +23,10 @@ To add your organisation follow these steps:

    | Organization | Contact | Environment | Description of Use |
    | ------------ | --------------------------------- | ----------- | --------------------------------------------- |
-   | My Company | [email](mailto:email@company.com) | Production | We use HAMi to Manage our GPU infrastructure. |
+   | My Company | [email](mailto:email@company.com) | Production | We use HAMi to manage our GPU infrastructure. |

 5. Save the file, then do `git add -A` and commit using `git commit -s -m "Add MY-ORG to adopters"`.
 6. Push the commit with `git push origin main`.
-7. Open a Pull Request to [HAMi-io/website](https://github.com/Project-HAMi/website) and a preview build will turn up.
+7. Open a Pull Request to [Project-HAMi/website](https://github.com/Project-HAMi/website) and a preview build will turn up.

-Thanks to all adopters for being part of the community!
+Thanks for being part of the community!
diff --git a/versioned_docs/version-v2.8.0/contributor/cherry-picks.md b/versioned_docs/version-v2.8.0/contributor/cherry-picks.md
index a7a96275..2058ffad 100644
--- a/versioned_docs/version-v2.8.0/contributor/cherry-picks.md
+++ b/versioned_docs/version-v2.8.0/contributor/cherry-picks.md
@@ -1,121 +1,70 @@
 ---
 title: How to cherry-pick PRs
+sidebar_label: Cherry Picks
 ---

-This document explains how cherry picks are managed on release branches within
-the `Project-HAMi/HAMi` repository.
-A common use case for this task is backporting PRs from master to release
-branches.
-
-> This doc is lifted from [Kubernetes cherry-pick](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md).
-
-- [Prerequisites](#prerequisites)
-- [What Kind of PRs are Good for Cherry Picks](#what-kind-of-prs-are-good-for-cherry-picks)
-- [Initiate a Cherry Pick](#initiate-a-cherry-pick)
-- [Cherry Pick Review](#cherry-pick-review)
-- [Troubleshooting Cherry Picks](#troubleshooting-cherry-picks)
-- [Cherry Picks for Unsupported Releases](#cherry-picks-for-unsupported-releases)
+This document explains how cherry picks are managed on release branches in the `Project-HAMi/HAMi` repository. The typical use case is backporting a bug fix from master to an active release branch.

 ## Prerequisites

-- A pull request merged against the `master` branch.
-- The release branch exists (example: [`release-2.4`](https://github.com/Project-HAMi/HAMi/releases))
-- The normal git and GitHub configured shell environment for pushing to your
-  HAMi `origin` fork on GitHub and making a pull request against a
-  configured remote `upstream` that tracks
-  `https://github.com/Project-HAMi/HAMi`, including `GITHUB_USER`.
-- Have GitHub CLI (`gh`) installed following [installation instructions](https://github.com/cli/cli#installation).
-- A github personal access token which has permissions "repo" and "read:org".
- Permissions are required for [gh auth login](https://cli.github.com/manual/gh_auth_login) - and not used for anything unrelated to cherry-pick creation process - (creating a branch and initiating PR). - -## What Kind of PRs are Good for Cherry Picks - -Compared to the normal master branch's merge volume across time, -the release branches see one or two orders of magnitude less PRs. -This is because there is an order or two of magnitude higher scrutiny. -Again, the emphasis is on critical bug fixes, e.g., - -- Loss of data -- Memory corruption -- Panic, crash, hang -- Security - -A bugfix for a functional issue (not a data loss or security issue) that only -affects an alpha feature does not qualify as a critical bug fix. - -If you are proposing a cherry pick and it is not a clear and obvious critical -bug fix, please reconsider. If upon reflection you wish to continue, bolster -your case by supplementing your PR with e.g., +- A pull request already merged into master. +- The target release branch exists (for example, `release-2.4`). +- Git and GitHub configured with your fork as `origin` and the upstream as `upstream`. +- [GitHub CLI (`gh`)](https://github.com/cli/cli#installation) installed. +- A GitHub personal access token with `repo` and `read:org` scopes, used for `gh auth login`. -- A GitHub issue detailing the problem +## What qualifies for a cherry pick -- Scope of the change +Release branches receive far fewer merges than master because the bar is higher. Cherry picks are reserved for critical fixes only: -- Risks of adding a change - -- Risks of associated regression - -- Testing performed, test cases added - -- Key stakeholder reviewers/approvers attesting to their confidence in the - change being a required backport +- Data loss +- Memory corruption +- Panic, crash, or hang +- Security vulnerabilities -It is critical that the full community is actively engaged on enhancements in -the project. If a released feature was not enabled on a particular provider's -platform, this is a community miss that needs to be resolved in the `master` -branch for subsequent releases. Such enabling will not be backported to the -patch release branches. +A bug that affects only an alpha feature does not qualify, even if it is a real bug. -## Initiate a Cherry Pick +If the fix is not clearly critical, support the case with: -- Run the [cherry pick script][cherry-pick-script] +- A GitHub issue describing the problem +- Scope of the change +- Risks of adding the change +- Risks of regression +- Tests added or updated +- Sign-off from key stakeholders or reviewers - This example applies a master branch PR #1206 to the remote branch - `upstream/release-1.0`: +Features that were not enabled on a specific vendor's platform belong in master for the next release. They will not be backported. - ```shell - hack/cherry_pick_pull.sh upstream/release-1.0 1206 - ``` +## Initiate a cherry pick - - Be aware the cherry pick script assumes you have a git remote called - `upstream` that points at the HAMi github org. +Run the [cherry pick script][cherry-pick-script]. This example backports PR #1206 to `upstream/release-1.0`: - - You will need to run the cherry pick script separately for each patch - release you want to cherry pick to. Cherry picks should be applied to all - active release branches where the fix is applicable. 
+```bash +hack/cherry_pick_pull.sh upstream/release-1.0 1206 +``` - - If `GITHUB_TOKEN` is not set you will be asked for your github password: - provide the github [personal access token](https://github.com/settings/tokens) rather than your actual github - password. If you can securely set the environment variable `GITHUB_TOKEN` - to your personal access token then you can avoid an interactive prompt. - Refer [https://github.com/github/hub/issues/2655#issuecomment-735836048](https://github.com/github/hub/issues/2655#issuecomment-735836048) +Notes: -## Cherry Pick Review +- The script expects a remote named `upstream` pointing to `https://github.com/Project-HAMi/HAMi`. +- Run the script separately for each release branch that needs the fix. +- If `GITHUB_TOKEN` is not set, the script will prompt for a token. Use a [personal access token](https://github.com/settings/tokens) rather than your account password. -As with any other PR, code OWNERS review (`/lgtm`) and approve (`/approve`) on -cherry pick PRs as they deem appropriate. +## Cherry pick review -The same release note requirements apply as normal pull requests, except the -release note stanza will auto-populate from the master branch pull request from -which the cherry pick originated. +Cherry pick PRs follow the same review process as normal PRs. Code owners review (`/lgtm`) and approve (`/approve`) as they see fit. -## Troubleshooting Cherry Picks +Release notes auto-populate from the original master PR. -Contributors may encounter some of the following difficulties when initiating a -cherry pick. +## Troubleshooting -- A cherry pick PR does not apply cleanly against an old release branch. In - that case, you will need to manually fix conflicts. +**The cherry pick does not apply cleanly.** +The patch conflicts with changes already in the release branch. Fetch the auto-generated branch from your fork, resolve the conflicts manually, and force-push. -- The cherry pick PR includes code that does not pass CI tests. In such a case - you will have to fetch the auto-generated branch from your fork, amend the - problematic commit and force push to the auto-generated branch. - Alternatively, you can create a new PR, which is noisier. +**CI fails on the cherry pick branch.** +Fetch the auto-generated branch, amend the failing commit, and force-push. Alternatively, open a new PR manually - it is noisier but sometimes cleaner. -## Cherry Picks for Unsupported Releases +## Unsupported releases -The community supports & patches releases need to be discussed. +Fixes for end-of-life release branches are not accepted without prior discussion. Open an issue to start that conversation before submitting a cherry pick against an unsupported branch. [cherry-pick-script]: https://github.com/Project-HAMi/HAMi/blob/master/hack/cherry_pick_pull.sh diff --git a/versioned_docs/version-v2.8.0/contributor/contribute-docs.md b/versioned_docs/version-v2.8.0/contributor/contribute-docs.md index c41ee7a2..725513b6 100644 --- a/versioned_docs/version-v2.8.0/contributor/contribute-docs.md +++ b/versioned_docs/version-v2.8.0/contributor/contribute-docs.md @@ -1,207 +1,257 @@ --- -title: How to contribute docs +title: How to Contribute Docs +sidebar_label: Contribute Docs --- -Starting from version 1.3, the community documentation will be available on the HAMi website. -This document explains how to contribute docs to -the `Project-HAMi/website` repository. 
+This guide covers everything needed to contribute to the HAMi documentation website - from setting up the local environment to writing, previewing, and submitting changes.
+
+The documentation site is built with [Docusaurus 3](https://docusaurus.io/) and supports English (primary) and Simplified Chinese. English is the source language for all content.

 ## Prerequisites

-- Docs, like codes, are also categorized and stored by version.
-  1.3 is the first version is the first archived.
-- Docs need to be translated into multiple languages for readers from different regions.
-  The community now supports both Chinese and English.
-  English is the official language of documentation.
-- The docs use markdown. If you are unfamiliar with Markdown,
-  please see [https://guides.github.com/features/mastering-markdown/](https://guides.github.com/features/mastering-markdown/) or
-  [https://www.markdownguide.org/](https://www.markdownguide.org/) if you are looking for something more substantial.
-- The site uses [Docusaurus 2](https://docusaurus.io/), a model static website generator.
+- Node.js v20 (required - other versions are not supported)
+- npm
+- Git with a GitHub account
+
+Verify your Node version:
+
+```bash
+node -v  # should print v20.x.x
+```

 ## Setup

-You can set up your local environment by cloning the website repository.
+Fork the [Project-HAMi/website](https://github.com/Project-HAMi/website) repository on GitHub, then clone your fork:

-```shell
-git clone https://github.com/Project-HAMi/website.git
+```bash
+git clone https://github.com/<your-username>/website.git
 cd website
+git remote add upstream https://github.com/Project-HAMi/website.git
+npm install
 ```

-Our website is organized like below:
+## Local Development

-```text
-website
-├── sidebars.json # sidebar for the current docs version
-├── docs # docs directory for the current docs version
-│ ├── foo
-│ │ └── bar.md # https://mysite.com/docs/next/foo/bar
-│ └── hello.md # https://mysite.com/docs/next/hello
-├── versions.json # file to indicate what versions are available
-├── versioned_docs
-│ ├── version-1.1.0
-│ │ ├── foo
-│ │ │ └── bar.md # https://mysite.com/docs/foo/bar
-│ │ └── hello.md
-│ └── version-1.0.0
-│ ├── foo
-│ │ └── bar.md # https://mysite.com/docs/1.0.0/foo/bar
-│ └── hello.md
-├── versioned_sidebars
-│ ├── version-1.1.0-sidebars.json
-│ └── version-1.0.0-sidebars.json
-├── docusaurus.config.js
-└── package.json
+```bash
+npm run start          # dev server at http://localhost:3000 with hot-reload
+npm run start:network  # same, but accessible on your local network
+npm run build:fast     # English-only production build, ~45 seconds (use for validation)
+npm run build          # full build including Chinese and all versions, ~80 seconds (mirrors CI)
+npm run clear          # clear Docusaurus cache (use when you see stale build errors)
 ```

-The `versions.json` file is a list of versions, from the latest to earliest.
-The table below explains how a versioned file maps to its version and the generated URL.
+Use `npm run start` while writing. Use `npm run build:fast` before opening a PR. CI runs the full `npm run build` on every PR to `master`.
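+
+Taken together, these commands form the typical loop for a single docs change. A sketch of one full pass, assuming a small typo fix (the branch name and commit message are illustrative):
+
+```bash
+git checkout -b docs/fix-ascend-typo   # illustrative branch name
+npm run start                          # live preview while editing
+npm run build:fast                     # fast validation before pushing
+git add -A
+git commit -s -m "docs: fix typo in Ascend guide"
+git push origin docs/fix-ascend-typo
+```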
-| Path | Version | URL | -| --------------------------------------- | -------------- | ----------------- | -| `versioned_docs/version-1.0.0/hello.md` | 1.0.0 | /docs/1.0.0/hello | -| `versioned_docs/version-1.1.0/hello.md` | 1.1.0 (latest) | /docs/hello | -| `docs/hello.md` | current | /docs/next/hello | +## Repository Structure -:::tip +``` +website/ +├── docs/ # English source docs (authoritative) +├── i18n/zh/ +│ └── docusaurus-plugin-content-docs/ +│ └── current/ # Chinese translations +├── versioned_docs/version-vX.Y.Z/ # archived doc snapshots +├── blog/ # blog posts +├── sidebars.js # navigation structure +├── docusaurus.config.js # site configuration +└── versions.json # available versioned snapshots +``` -The files in the `docs` directory belong to the `current` docs version. +Contributors primarily work in `docs/` (English source) and `i18n/zh/` (Chinese translations). -The `current` docs version is labeled as `Next` and hosted under `/docs/next/*`. +## Adding a New Document -Contributors mainly contribute documentation to the current version. -::: +### 1. Create the file -## Writing docs +Place the file in the appropriate subdirectory under `docs/`: -### Starting a title at the top +``` +docs/userguide/nvidia-device/new-feature.md +docs/get-started/new-guide.md +docs/contributor/new-policy.md +``` -It's important for your article to specify metadata concerning an article at the top of the Markdown file, in a section called **Front Matter**. +### 2. Add frontmatter -Here is a quick example that explains the most relevant entries in **Front Matter**: +Every document must start with frontmatter: -```markdown +```yaml --- -title: A doc with tags +title: Full Page Title +sidebar_label: Short Nav Label --- +``` + +- `title` - used as the page `
<h1>
` heading and in metadata +- `sidebar_label` - shorter version shown in the sidebar; omit if the same as `title` -## secondary title +### 3. Register in sidebars.js + +Every new document must be added to `sidebars.js` to appear in navigation. Find the right category and add the doc ID (path relative to `docs/`, without `.md`): + +```js +{ + type: "category", + label: "Get Started", + items: [ + "get-started/deploy-with-helm", + "get-started/verify-hami", + "get-started/your-new-doc" // add here + ] +} ``` -The top section between two lines of --- is the Front Matter section. -These entries tell Docusaurus how to handle the article: +If you are unsure which category fits, mention it in the PR and a maintainer will help. -- Title is the equivalent of the `
<h1>
` in a HTML document or `# ` in a Markdown article. -- Each document has a unique ID. By default, a document ID is the name of the document - (without the extension) related to the root docs directory. +### 4. Add a Chinese translation (optional) -### Linking to other docs +Mirror the file path under `i18n/zh/docusaurus-plugin-content-docs/current/`. The structure is identical to `docs/`. Keep the frontmatter the same; translate only the content. -You can easily route to other places by adding any of the following links: +If the translation is not ready, a placeholder body is acceptable: -- Absolute URLs to external sites like `https://github.com` or `https://k8s.io` - - you can use any of the Markdown notations for this, so - - `<https://github.com>` or - - `[kubernetes](https://k8s.io)` will work. -- Link to markdown files or the resulting path. - You can use relative paths to index the corresponding files. -- Link to pictures or other resources. -If your article contains images, prefer storing them in `/static/img/docs/` and linking - with absolute paths. Language-aware folders are used: - - `/static/img/docs/common/` for shared images - - `/static/img/docs/en/` for English-only images - - `/static/img/docs/zh/` for Chinese-only images -Example: - - `![WebUI Overview](/img/docs/en/userguide/webui-overview.png)` - - `![WebUI 集群概览](/img/docs/zh/userguide/webui-overview.png)` - - `![Architecture](/img/docs/common/architecture/hami-arch.png)` +```md +--- +title: Full Page Title +sidebar_label: Short Nav Label +--- -### Directory organization +(Translation in progress) +``` -Docusaurus 2 uses a sidebar to manage documents. +## Linking -Creating a sidebar is useful to: +**Internal links** - link to other docs using relative paths to the `.md` file: -- Group multiple related documents -- Display a sidebar on each of those documents -- Provide paginated navigation, with next/previous button +```md +[GitHub Workflow](github-workflow.md) +[Installation](../get-started/deploy-with-helm.md) +``` -The document organization can be found from -[https://github.com/Project-HAMi/website/blob/main/sidebars.js](https://github.com/Project-HAMi/website/blob/main/sidebars.js). +Docusaurus resolves these to correct URLs automatically, including version and locale. -```js -module.exports = { - docs: [ - { - type: "category", - label: "Core Concepts", - collapsed: false, - items: [ - "core-concepts/introduction", - "core-concepts/concepts", - "core-concepts/architecture", - ], - }, - { - type: "doc", - id: "key-features/features", - }, - { - type: "category", - label: "Get Started", - items: [ - "get-started/deploy-with-helm" - ], - }, -.... -``` - -The order of documents in a directory is strictly in the order of items. +**External links** - use full URLs: -```yaml -type: "category", -label: "Core Concepts", -collapsed: false, -items: [ - "core-concepts/introduction", - "core-concepts/concepts", - "core-concepts/architecture", -], +```md +[Kubernetes](https://kubernetes.io) +``` + +**Broken links** - the full build (`npm run build`) reports broken internal links. Fix them before opening a PR. 
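+
+Pulled together, a minimal page skeleton using the frontmatter and link conventions above looks like this (the title and target paths are illustrative; image paths are covered in the next section):
+
+```md
+---
+title: Scheduling Walkthrough
+sidebar_label: Scheduling
+---
+
+Install HAMi first - see [Deploy with Helm](../get-started/deploy-with-helm.md).
+
+![Architecture diagram](/img/docs/common/architecture/hami-arch.png)
+```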
+ +## Images + +Store images under `/static/img/docs/` using language-aware subdirectories: + +| Path | Use for | +| --- | --- | +| `/static/img/docs/common/` | Images shared across languages | +| `/static/img/docs/en/` | English-only images | +| `/static/img/docs/zh/` | Chinese-only images | + +Reference images with absolute paths from the site root: + +```md +![Architecture diagram](/img/docs/common/architecture/hami-arch.png) +![WebUI Overview](/img/docs/en/userguide/webui-overview.png) ``` -If you add a document, you must add it to `sidebars.js` to make it display properly. -If you're not sure where your docs are located, you can ask community members in the PR. +Use descriptive alt text. Do not link to external images - host them in the repository. -### About Chinese docs +## Writing Style -There are two situations about the Chinese version of the document: +The following rules apply to all documentation on this site. -- You want to translate the existing English docs to Chinese. In this case, - you need to modify the corresponding file content from - [https://github.com/Project-HAMi/website/tree/main/i18n/zh/docusaurus-plugin-content-docs/current](https://github.com/Project-HAMi/website/tree/main/i18n/zh/docusaurus-plugin-content-docs/current). - The organization of this directory is exactly the same as the outer layer. - `current.json` holds translations for the documentation directory. - You can edit it if you want to translate the name of directory. -- You want to contribute Chinese docs without English version. - Any articles of any kind are welcomed. In this case, you can add - articles and titles to the main directory first. Article content can be TBD first, like this. - Then add the corresponding Chinese content to the Chinese directory. +**Language and tone:** +- Short, direct sentences +- Active voice +- Casual but professional - write like a developer explaining something to another developer +- No filler words: "simply", "just", "Note that", "It's worth noting", "Please note" +- No first-person: avoid "I", "we", "our", "let's" +- Exception: direct quotes or official project announcements where "we" refers to the HAMi project team -## Debugging docs +**Formatting:** +- Use `-` for unordered lists, never `*` or `•` +- Use regular hyphens (`-`), never em-dashes (`—`) +- Headings: use `##` and `###` hierarchy; do not skip levels +- Code: always specify the language in fenced code blocks (` ```bash`, ` ```yaml`, ` ```go`) +- No emoji in documentation content -Now you have already completed docs. After you start a PR to `Project-HAMi/website`, -if you have passed CI, you can get a preview of your document on the website. +**Avoid marketing language:** +- Do not use: "innovative", "seamless", "robust", "powerful", "cutting-edge", "state-of-the-art" +- Do not use: "streamline", "leverage", "intuitive", "comprehensive" +- Do not use: "In conclusion,", "In summary,", "To summarize," -Click **Details** marked in red, and you will enter the preview view of the website. +## Versioning -Click **Next** and you can see the corresponding changes. If you have changes -related to the Chinese version, click the language drop-down box next to it to switch to Chinese. +HAMi docs are versioned alongside each release: -If the previewed page is not what you expected, please check your docs again. 
+| Location | Version | URL | +| --- | --- | --- | +| `docs/` | current (unreleased) | `/docs/next/*` | +| `versioned_docs/version-v2.8.0/` | v2.8.0 (latest stable) | `/docs/*` | +| `versioned_docs/version-v2.7.0/` | v2.7.0 | `/docs/v2.7.0/*` | + +**Contribute to `docs/`** for changes that apply to the next release. These are the files most contributors should edit. + +Fixes to existing versioned docs are handled by maintainers through cherry-picks. If you find an error in a versioned doc, open an issue or submit a fix to `docs/` - a maintainer will backport if needed. + +## Chinese Translation Workflow + +There are two cases: + +**Translating an existing English doc:** + +1. Find the corresponding file path under `i18n/zh/docusaurus-plugin-content-docs/current/`. +2. The directory structure mirrors `docs/` exactly. +3. Translate the content; keep frontmatter fields identical to the English source. +4. To translate a sidebar category label, edit `i18n/zh/docusaurus-plugin-content-docs/current.json`. + +**Adding a Chinese doc without an English version:** + +This is not recommended. English is the source language. If you want to contribute in Chinese, write the English version first (even a draft), then add the Chinese translation. + +## Previewing Changes + +The dev server shows English only: + +```bash +npm run start +``` + +To preview Chinese translations locally: + +```bash +npm run start -- --locale zh +``` + +To preview both languages, run the full build and serve it: + +```bash +npm run build +npm run serve +``` + +## CI and PR Preview + +When you open a PR against `master`, CI runs `npm run build` (full build). If the build fails, the PR cannot be merged. + +PRs also receive a preview deployment link automatically. Click it to see your changes rendered on the live site before requesting review. Use this to verify links, images, and formatting. + +## Changelog + +The changelog is auto-generated from `CHANGELOG.md` at the repo root by a custom Docusaurus plugin. Do not edit files under `changelog/source/` directly - they are overwritten on every build. + +To update the changelog, edit `CHANGELOG.md` directly. ## FAQ -### Versioning +**The build fails with a broken link error.** +Run `npm run build` locally to see the exact file and line. Fix the link and rebuild. + +**My new page does not appear in the sidebar.** +Check that the doc ID in `sidebars.js` matches the file path exactly (relative to `docs/`, no `.md` extension). + +**The dev server shows a cached version of my changes.** +Stop the server and run `npm run clear`, then restart. -For the newly supplemented documents of each version, they are synchronized to the latest version -on the release date of each version, and the documents of the old version will not be modified. -For errata found in the documentation, fixes are applied with every release. +**How do I document a new feature for an upcoming release?** +Add the documentation to `docs/` (not `versioned_docs/`). It will be snapshotted into `versioned_docs/` when the release is cut. diff --git a/versioned_docs/version-v2.8.0/contributor/contributing.md b/versioned_docs/version-v2.8.0/contributor/contributing.md index 0c88de85..dc3b862a 100644 --- a/versioned_docs/version-v2.8.0/contributor/contributing.md +++ b/versioned_docs/version-v2.8.0/contributor/contributing.md @@ -1,88 +1,381 @@ --- -title: Contributing +title: Contributing to HAMi +sidebar_label: Contributing --- -Welcome to HAMi! 
+HAMi is a CNCF Sandbox project that brings GPU and AI accelerator virtualization to Kubernetes. The scheduler, device plugins, documentation, and tooling are all built and maintained by community contributors. + +This guide is the starting point for any contribution, whether you are fixing a typo or implementing a new hardware backend. + +## The Critical Rule + +**You must understand every change you submit.** + +Using tools to help write code or documentation is fine. Submitting a change you cannot explain is not. If a reviewer asks why a piece of code works the way it does and you cannot answer, the PR will not be merged. This applies to every contribution, regardless of how it was produced. + +This matters more in HAMi than in most projects. Code that manages GPU memory, device scheduling, or accelerator lifecycle can cause data corruption, hardware faults, or silent misallocation if it is wrong. "It looked right" is not enough. ## Code of Conduct -Please make sure to read and observe the [Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md) +All community members must follow the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md). Report violations to the CNCF CoC committee via cncf-coc@lists.cncf.io. + +## Ways to Contribute + +Writing code is not the only way to contribute. + +| Contribution type | What it involves | +| --- | --- | +| Bug reports | Open a detailed issue with reproduction steps and environment info | +| Bug fixes | Submit a PR with a test case that covers the fix | +| New features | Open an issue first to align on approach, then submit a PR | +| Documentation | Fix errors, fill gaps, add examples, improve clarity | +| Blog posts | Write about HAMi use cases, integrations, or release highlights | +| Translations | Translate English docs into Chinese or help maintain the existing Chinese translations | +| Code review | Read open PRs and share technical feedback | +| Issue triage | Reproduce bugs, ask for missing info, close stale issues | +| Community support | Answer questions in Slack or GitHub Discussions | + +## Community + +| Channel | Purpose | +| --- | --- | +| [GitHub Issues](https://github.com/Project-HAMi/HAMi/issues) | Bug reports and feature requests | +| [GitHub Discussions](https://github.com/Project-HAMi/HAMi/discussions) | Questions, ideas, design proposals | +| [CNCF Slack #hami](https://cloud-native.slack.com/archives/C03E57Q30FY) | Real-time chat | +| [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md) | Current maintainer list | +| [Community Meetings](https://github.com/Project-HAMi/community) | Bi-weekly video meetings | + +Before opening an issue or PR, search existing issues and discussions for related work. + +**New to HAMi?** Join [CNCF Slack](https://cloud-native.slack.com/archives/C03E57Q30FY) and introduce yourself in `#hami`. Maintainers and existing contributors are happy to help you find a good first issue, review a draft, or answer questions before you open a PR. -## Community Expectations +## Prerequisites -HAMi is a community project driven by its community which strives to promote a healthy, friendly and productive environment. 
+**For all contributions:** +- Git with a GitHub account +- You must be able to certify contributions under the [Developer Certificate of Origin](https://developercertificate.org/) -## Getting started +**For HAMi core (Go):** +- Go 1.21+ +- `kubectl` and access to a Kubernetes cluster with a supported GPU or accelerator -- Fork the repository on GitHub. -- Make your changes on your fork repository. -- Submit a PR. +**For documentation (website):** +- Node.js v20 +- npm -## Your First Contribution +## Setup -Help is available for contributing in areas like filing issues, developing features, fixing critical bugs and -getting your work reviewed and merged. +### Fork and Clone -If you have questions about the development process, -feel free to [file an issue](https://github.com/Project-HAMi/HAMi/issues/new/choose). +Fork the target repository on GitHub, then clone your fork: -## Find something to work on +```bash +export user="your-github-username" -Help is always welcome - fixing documentation, reporting bugs, writing code. -Look at places where you feel best coding practices aren't followed, code refactoring is needed or tests are missing. -Here is how you get started. +# For core HAMi +git clone https://github.com/$user/HAMi.git +cd HAMi +git remote add upstream https://github.com/Project-HAMi/HAMi.git +git remote set-url --push upstream no_push # prevent accidental upstream push -### Find a good first topic +# For the docs website +git clone https://github.com/$user/website.git +cd website +git remote add upstream https://github.com/Project-HAMi/website.git +npm install +``` -There are [multiple repositories](https://github.com/Project-HAMi/) within the HAMi organization. -Each repository has beginner-friendly issues that provide a good first issue. -For example, [Project-HAMi/HAMi](https://github.com/Project-HAMi/HAMi) has -[help wanted](https://github.com/Project-HAMi/HAMi/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22) and -[good first issue](https://github.com/Project-HAMi/HAMi/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) -labels for issues that should not need deep knowledge of the system. -Maintainers can help new contributors who wish to work on such issues. +### Stay in Sync -Another good way to contribute is to find a documentation improvement, such as a missing/broken link. -Please see [Contributor Workflow](#contributor-workflow) below for the workflow. +Keep your local master branch current with upstream before starting new work: -#### Work on an issue +```bash +git fetch upstream +git checkout master +git rebase upstream/master +``` -When you are willing to take on an issue, reply on the issue. The maintainer will assign it to you. +Use `rebase`, not `merge`, to keep a clean commit history. -### File an Issue +For a detailed walkthrough of the full Git workflow, see the [GitHub Workflow guide](github-workflow.md). -Code contributions are welcome, and bug reports are equally appreciated. -Issues should be filed under the appropriate HAMi sub-repository. +## Finding Work -*Example:* a HAMi issue should be opened to [Project-HAMi/HAMi](https://github.com/Project-HAMi/HAMi/issues). +Good starting points: -Please follow the prompted submission guidelines while opening an issue. 
+- [`good first issue`](https://github.com/Project-HAMi/HAMi/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) - scoped, well-documented, friendly to new contributors +- [`help wanted`](https://github.com/Project-HAMi/HAMi/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22) - open for contributions, may need domain knowledge +- Documentation gaps and broken links in the [website repository](https://github.com/Project-HAMi/website/issues) + +When you decide to work on an issue, comment on it. A maintainer will assign it to you to prevent duplicate effort. ## Contributor Workflow -Please do not ever hesitate to ask a question or send a pull request. +### Branch Naming + +Use short, descriptive branch names that reflect the change: + +```bash +git checkout -b fix/gpu-memory-calculation +git checkout -b feat/kunlunxin-multi-card +git checkout -b docs/update-ascend-guide +``` + +### Small vs. Large Changes + +**Any PR that adds or changes more than 100 lines of code or documentation requires a GitHub issue or discussion first.** Open the issue, describe what you want to do and why, and wait for maintainer feedback before writing the code. Only then open the PR. + +**Small changes** (bug fixes, typo corrections, docs improvements under 100 lines): open a PR directly, no issue required. + +**Large changes** (new features, API changes, new hardware backends, refactors spanning multiple packages, docs additions over 100 lines): + +1. Open a GitHub issue describing the problem and your proposed approach. +2. Get alignment from maintainers before investing significant time. +3. Open a draft PR early once you have direction - share progress before the implementation is final. + +PRs opened without a prior issue for large changes will be asked to go back and open one first. + +### Validate Before Pushing + +For Go code: + +```bash +make verify +make test +``` + +For documentation: + +```bash +npm run build:fast # English-only, ~45 seconds - use during development +npm run build # Full build with all locales, ~80 seconds - mirrors CI +``` + +## Code Style + +### Go + +- Format all code with `gofmt` before committing. Unformatted code will fail CI. +- Follow [Go Code Review Comments](https://github.com/golang/go/wiki/CodeReviewComments) for style decisions. +- Write table-driven tests where it makes sense. Test names should describe the scenario, not the function. +- Keep functions small and focused. If a function needs a long comment to explain what it does, consider splitting it. +- Error messages should be lowercase and not end with punctuation (Go convention). + +### Documentation + +See the [Writing Style](contribute-docs.md#writing-style) section in the docs contribution guide. 
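+
+To make the table-driven convention from the Go notes above concrete, here is a minimal sketch. The `splitMemory` helper and its cases are hypothetical and not part of the HAMi codebase:
+
+```go
+package scheduler
+
+import "testing"
+
+// splitMemory is a hypothetical helper used only for this example:
+// it divides a device memory budget evenly across replicas.
+func splitMemory(totalMiB, replicas int) int {
+    return totalMiB / replicas
+}
+
+func TestSplitMemory(t *testing.T) {
+    cases := []struct {
+        name     string
+        totalMiB int
+        replicas int
+        want     int
+    }{
+        {name: "even split", totalMiB: 8192, replicas: 2, want: 4096},
+        {name: "single replica", totalMiB: 8192, replicas: 1, want: 8192},
+    }
+    for _, tc := range cases {
+        t.Run(tc.name, func(t *testing.T) {
+            if got := splitMemory(tc.totalMiB, tc.replicas); got != tc.want {
+                t.Errorf("splitMemory(%d, %d) = %d, want %d", tc.totalMiB, tc.replicas, got, tc.want)
+            }
+        })
+    }
+}
+```
+
+Each case name describes the scenario it covers, so a failure reads as `TestSplitMemory/even_split` rather than a bare line number.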
+ +## Commit Standards + +### Format + +HAMi uses [Conventional Commits](https://www.conventionalcommits.org/): + +``` +<type>(<optional scope>): <description> + +[optional body] + +[optional footer(s)] +``` + +**Types:** + +| Type | Use for | +| --- | --- | +| `feat` | New functionality | +| `fix` | Bug fix | +| `docs` | Documentation changes only | +| `chore` | Maintenance: deps, build config, CI | +| `refactor` | Code restructure with no behavior change | +| `test` | Adding or updating tests | +| `perf` | Performance improvement | + +**Good examples:** + +``` +feat(scheduler): add memory oversell ratio config option +fix(deviceplugin): handle graceful shutdown on SIGTERM +docs: correct vGPU memory limit example in Ascend guide +chore: bump Go to 1.22 +test: add unit tests for MLU device discovery +``` + +**Rules:** +- Subject line: 72 characters or fewer +- Imperative mood: "add", "fix", "update" - not "added", "fixed", "updates" +- Body: explain *what* and *why*, not *how* +- No period at the end of the subject line + +### DCO Sign-off (Required) + +Every commit must include a `Signed-off-by` line. CI blocks PRs that are missing it. + +```bash +git commit -s -m "fix: correct memory calculation for MLU" +``` + +The `-s` flag appends: -This is a rough outline of what a contributor's workflow looks like: +``` +Signed-off-by: Your Name <your@email.com> +``` -- Create a topic branch from where to base the contribution. This is usually master. -- Make commits of logical units. -- Push changes in a topic branch to a personal fork of the repository. -- Submit a pull request to [Project-HAMi/HAMi](https://github.com/Project-HAMi/HAMi). +This certifies you have the right to submit the work under the project's license. See the [Developer Certificate of Origin](https://developercertificate.org/) for the full text. -## Creating Pull Requests +**Forgot to sign off?** Fix it before pushing: -Pull requests are often called PRs. -HAMi generally follows the standard [github pull request](https://help.github.com/articles/about-pull-requests/) process. -To submit a proposed change, please develop the code/fix and add new test cases. -After that, run these local verifications before submitting pull request to predict the pass or -fail of continuous integration. +```bash +# Single commit +git commit --amend -s --no-edit -- Run and pass `make verify` +# Multiple commits on the branch +git rebase HEAD~<n> --signoff +``` + +## Pull Requests + +### Before Opening + +- [ ] Every commit has a `Signed-off-by` line +- [ ] Commit messages follow Conventional Commits format +- [ ] If this PR changes more than 100 lines, a GitHub issue was opened first and is linked below +- [ ] Local checks pass (`make verify` for Go, `npm run build:fast` for docs) +- [ ] Tests added or updated for code changes +- [ ] Docs updated if user-facing behavior changed +- [ ] Related issue linked in the PR description + +### PR Description + +Keep it short and factual: + +- What does this change do? +- Why is it needed? +- How was it tested? +- Reference related issues: `Fixes #123` or `Relates to #456` + +**Formatting rules for PR descriptions, issue bodies, and commit messages:** +- No em-dashes (`—`) - use a regular hyphen (`-`) instead +- No emojis +- No filler phrases ("This PR aims to...", "In this PR, we...") +- No marketing language ("seamless", "robust", "powerful", "innovative") +- Write in your own words - short, direct sentences + +### Keeping PRs Focused + +One logical change per PR. 
Large, unfocused PRs take longer to review and are harder to revert if something breaks. If you are fixing multiple independent issues, open separate PRs. + +### Squashing Commits + +Before merge, clean up your commit history. Squash fixup commits, review-feedback commits, and typo corrections into the relevant logical commit. Each remaining commit should represent a meaningful unit of work that compiles and passes tests independently. + +For step-by-step squash instructions, see the [GitHub Workflow guide](github-workflow.md#squash-commits). + +### Review Process + +1. After the PR is opened, a maintainer or reviewer is assigned. +2. Address all review comments. If you disagree with feedback, explain why in the thread. +3. Update the PR by pushing to the same branch - do not close and reopen. +4. CI must pass before merge. +5. Once an approver marks the PR approved, it will be merged. ## Code Review -To make it easier for your PR to receive reviews, consider the reviewers will need you to: +When reviewing others' PRs: + +- Reference the exact line and explain the concern - do not just flag something as wrong. +- Distinguish blocking issues from optional suggestions. Use `nit:` for minor style notes that should not block merge. +- Acknowledge what works well. A review that only lists problems is harder to act on. +- If you spot something minor (typo, formatting), use a GitHub suggestion so the author can apply it in one click. +- Reviews are collaborative. Assume good intent. + +## AI Usage + +AI tools may be used to assist with writing code, documentation, or commit messages. There is one hard rule: + +**Do not submit AI-generated text directly as your PR description, issue body, or commit message.** + +Maintainers need to communicate with the person behind the contribution - not with a language model. Write in your own words, even if AI helped you draft a starting point. Text that is clearly AI-generated (verbose summaries, excessive lists, filler phrases, "In conclusion") will be flagged and the author asked to rewrite. + +What is acceptable: +- Using AI to help understand unfamiliar code +- Using AI to draft a commit message that you then edit and own +- Using AI to check grammar or clarity +- Using AI to generate code that you review, test, and understand + +What is not acceptable: +- Pasting an AI-generated PR description without reading and rewriting it +- Submitting code you cannot explain if asked during review +- Using AI-generated text as issue comments or discussion posts + +If AI played a significant role beyond autocomplete, mention it briefly in the PR description. This helps reviewers calibrate their review depth. + +## Documentation Contributions + +Documentation lives in the [Project-HAMi/website](https://github.com/Project-HAMi/website) repository and is built with Docusaurus 3. + +For a complete guide covering frontmatter, sidebar registration, image paths, local preview, and Chinese translation workflow, see [How to Contribute Docs](contribute-docs.md). + +**Quick rules:** +- English is the source language. All new docs go to `docs/` first. +- Chinese translations go under `i18n/zh/docusaurus-plugin-content-docs/current/`. +- Every new doc must be added to `sidebars.js`. +- Run `npm run build:fast` to validate before opening a PR. + +## Hardware Vendor Contributions + +HAMi supports multiple GPU and accelerator vendors. 
If you are adding support for a new device or fixing vendor-specific behavior: + +- Follow the existing structure in `pkg/device/` (one directory per vendor). +- Test on real hardware where possible. Simulated tests are acceptable for CI, but hardware validation is expected for new backends before merge. +- Follow the documentation pattern under `docs/userguide/<vendor>-device/`. +- Include working YAML examples under `docs/userguide/<vendor>-device/examples/`. + +Supported vendors: NVIDIA, Cambricon (MLU), Hygon (DCU), Mthreads, Iluvatar, Enflame (GCU), AWS Neuron, Kunlunxin (XPU), Metax, Ascend. + +## Translations + +The HAMi website supports two languages: **English** (primary) and **Simplified Chinese**. No other languages are currently supported. + +English is the authoritative source. All new documentation is written in English first. Chinese translations live in `i18n/zh/`. + +To add or update a translation: + +1. Find the corresponding file under `i18n/zh/docusaurus-plugin-content-docs/current/`. +2. Translate the content, keeping frontmatter and document structure identical to the English source. +3. Submit a PR with only translation changes - do not mix translation and content edits in the same PR. + +If an English page has no Chinese counterpart yet, you can create one. A placeholder (`TBD`) in the body is acceptable if you want to register the page before completing the translation. + +## Contributor Roles + +HAMi uses a defined contributor ladder with progressively more responsibility and access: + +- **Community Participant** - follows the CoC, participates in discussions +- **Contributor** - submits PRs and issues, helps other users +- **Organization Member** - established contributor with at least 5 accepted PRs, enabled 2FA, and two sponsors +- **Reviewer** - responsible for reviewing PRs in a specific area, at least 10 reviews on record +- **Maintainer** - responsible for the project as a whole, approves and merges PRs + +For full role requirements, the promotion path, and how to nominate yourself or others, see the [Contributor Ladder](ladder.md). + +## Filing Issues + +Use the [HAMi issue tracker](https://github.com/Project-HAMi/HAMi/issues) for bugs and feature requests. Before filing: + +- Search for existing issues that cover the same problem. +- For security vulnerabilities, follow the [security policy](https://github.com/Project-HAMi/HAMi/blob/master/SECURITY.md) instead of opening a public issue. + +For website issues, use the [website issue tracker](https://github.com/Project-HAMi/website/issues). + +When filing a bug, include: +- HAMi version and installation method +- Kubernetes version and cluster setup +- GPU or accelerator type and driver version +- Steps to reproduce +- Actual vs. expected behavior +- Relevant logs or error output + +## License -- follow [good coding guidelines](https://github.com/golang/go/wiki/CodeReviewComments). -- write [good commit messages](https://chris.beams.io/posts/git-commit/). -- break large changes into a logical series of smaller patches which individually make easily understandable changes, and in aggregate solve a broader issue. +By contributing to HAMi, you agree that your contributions will be licensed under the [Apache License 2.0](https://github.com/Project-HAMi/HAMi/blob/master/LICENSE). 
diff --git a/versioned_docs/version-v2.8.0/contributor/contributors.md b/versioned_docs/version-v2.8.0/contributor/contributors.md index c940d91c..5966bb10 100644 --- a/versioned_docs/version-v2.8.0/contributor/contributors.md +++ b/versioned_docs/version-v2.8.0/contributor/contributors.md @@ -8,17 +8,17 @@ title: Contributors The following people, in alphabetical order, have either authored or signed off on commits in the HAMi repository: | Contributor | Email | -|-----------------|-----------| -| [archlitchi](https://github.com/archlitchi) | archlitchi@gmail.com| +| --- | --- | +| [archlitchi](https://github.com/archlitchi) | [archlitchi@gmail.com](mailto:archlitchi@gmail.com) | | [atttx123](https://github.com/atttx123) | - | -| [chaunceyjiang](https://github.com/chaunceyjiang) | chaunceyjiang@gmail.com| +| [chaunceyjiang](https://github.com/chaunceyjiang) | [chaunceyjiang@gmail.com](mailto:chaunceyjiang@gmail.com) | | [CoderTH](https://github.com/CoderTH) | - | | [gsakun](https://github.com/gsakun) | - | | [lengrongfu](https://github.com/lengrongfu) | - | -| [ouyangluwei](https://github.com/ouyangluwei163) | ouyangluwei@riseunion.io | -| peizhaoyou | peizhaoyou@4paradigm.com | -| [wawa0210](https://github.com/wawa0210) | xiaozhang0210@hotmail.com | +| [ouyangluwei](https://github.com/ouyangluwei163) | [ouyangluwei@riseunion.io](mailto:ouyangluwei@riseunion.io) | +| peizhaoyou | [peizhaoyou@4paradigm.com](mailto:peizhaoyou@4paradigm.com) | +| [wawa0210](https://github.com/wawa0210) | [xiaozhang0210@hotmail.com](mailto:xiaozhang0210@hotmail.com) | | [whybeyoung](https://github.com/whybeyoung) | - | -| [yinyu](https://github.com/Nimbus318) | nimbus-nimo@proton.me | -| [yangshiqi](https://github.com/yangshiqi) | yangshiqi@riseunion.io | -| zhengbingxian | - | \ No newline at end of file +| [yinyu](https://github.com/Nimbus318) | [nimbus-nimo@proton.me](mailto:nimbus-nimo@proton.me) | +| [yangshiqi](https://github.com/yangshiqi) | [yangshiqi@riseunion.io](mailto:yangshiqi@riseunion.io) | +| zhengbingxian | - | diff --git a/versioned_docs/version-v2.8.0/contributor/github-workflow.md b/versioned_docs/version-v2.8.0/contributor/github-workflow.md index a362a3b5..168eb795 100644 --- a/versioned_docs/version-v2.8.0/contributor/github-workflow.md +++ b/versioned_docs/version-v2.8.0/contributor/github-workflow.md @@ -1,282 +1,189 @@ --- -title: "GitHub Workflow" -description: An overview of the GitHub workflow used by the HAMi project. It includes some tips and suggestions on things such as keeping your local environment in sync with upstream and commit hygiene. +title: GitHub Workflow +sidebar_label: GitHub Workflow --- -> This doc is lifted from [Kubernetes github-workflow](https://github.com/kubernetes/community/blob/master/contributors/guide/github-workflow.md). +This guide covers the end-to-end Git and GitHub workflow for contributing to HAMi. It applies to both the [HAMi core repository](https://github.com/Project-HAMi/HAMi) and the [documentation website](https://github.com/Project-HAMi/website). ![Git workflow](/img/docs/common/contributor/github-workflow/git-workflow.png) -## Fork in the cloud +## Fork and clone -1. Visit [https://github.com/Project-HAMi/HAMi](https://github.com/Project-HAMi/HAMi) -2. Click `Fork` button (top right) to establish a cloud-based fork. 
+Fork the target repository on GitHub, then clone your fork locally: -## Clone fork to local storage +```bash +export user="your-github-username" -Per Go's [workspace instructions][go-workspace], place HAMi' code on your -`GOPATH` using the following cloning procedure. - -[go-workspace]: https://golang.org/doc/code.html#Workspaces - -Define a local working directory: +# For HAMi core +git clone https://github.com/$user/HAMi.git +cd HAMi +git remote add upstream https://github.com/Project-HAMi/HAMi.git +git remote set-url --push upstream no_push -```sh -# If your GOPATH has multiple paths, pick -# just one and use it instead of $GOPATH here. -# You must follow exactly this pattern, -# neither `$GOPATH/src/github.com/${your github profile name/` -# nor any other pattern will work. -export working_dir="$(go env GOPATH)/src/github.com/Project-HAMi" +# For the docs website +git clone https://github.com/$user/website.git +cd website +git remote add upstream https://github.com/Project-HAMi/website.git +git remote set-url --push upstream no_push ``` -Set `user` to match your github profile name: +Verify your remotes: -```sh -export user={your github profile name} +```bash +git remote -v ``` -Both `$working_dir` and `$user` are mentioned in the figure above. - -Create your clone: +Expected output: -```sh -mkdir -p $working_dir -cd $working_dir -git clone https://github.com/$user/HAMi.git -# or: git clone git@github.com:$user/HAMi.git - -cd $working_dir/HAMi -git remote add upstream https://github.com/Project-HAMi/HAMi -# or: git remote add upstream git@github.com:Project-HAMi/HAMi.git - -# Never push to upstream master -git remote set-url --push upstream no_push - -# Confirm that your remotes make sense: -git remote -v +``` +origin https://github.com/<your-username>/HAMi.git (fetch) +origin https://github.com/<your-username>/HAMi.git (push) +upstream https://github.com/Project-HAMi/HAMi.git (fetch) +upstream no_push (push) ``` -## Branch +The `no_push` setting prevents accidental pushes to the upstream repository. -Get your local master up to date: +## Keep master in sync -```sh -# Depending on which repository you are working from, -# the default branch may be called 'main' instead of 'master'. +Before starting any new work, sync your local master with upstream: -cd $working_dir/HAMi +```bash git fetch upstream git checkout master git rebase upstream/master ``` -Branch from it: +Use `rebase`, not `merge`. Merge commits clutter the history and make it harder to cherry-pick fixes. -```sh -git checkout -b myfeature +## Create a branch + +Branch off master with a short, descriptive name: + +```bash +git checkout -b fix/gpu-memory-calculation +git checkout -b feat/kunlunxin-multi-card +git checkout -b docs/update-ascend-guide ``` -Then edit code on the `myfeature` branch. +Work entirely on this branch. Do not commit directly to master. ## Keep your branch in sync -```sh -# Depending on which repository you are working from, -# the default branch may be called 'main' instead of 'master'. +If upstream master has moved while you are working: -# While on your myfeature branch +```bash git fetch upstream git rebase upstream/master ``` -Please don't use `git pull` instead of the above `fetch` / `rebase`. `git pull` -does a merge, which leaves merge commits. These make the commit history messy -and violate the principle that commits ought to be individually understandable -and useful (see below). 
You can also consider changing your `.git/config` file via -`git config branch.autoSetupRebase always` to change the behavior of `git pull`, -or another non-merge option such as `git pull --rebase`. - -## Commit +Resolve any conflicts, then continue: -Commit your changes. - -```sh -git commit --signoff +```bash +git rebase --continue ``` -Likely you go back and edit/build/test some more then `commit --amend` -in a few cycles. - -## Push +## Commit -When ready to review (or to establish an offsite backup of your work), -push your branch to your fork on `github.com`: +HAMi uses [Conventional Commits](https://www.conventionalcommits.org/) and requires a DCO sign-off on every commit. -```sh -git push -f ${your_remote_name} myfeature +```bash +git commit -s -m "fix: correct memory calculation for MLU devices" ``` -## Create a pull request - -1. Visit your fork at `https://github.com/$user/HAMi` -2. Click the `Compare & Pull Request` button next to your `myfeature` branch. - -_If you have upstream write access_, please refrain from using the GitHub UI for -creating PRs, because GitHub will create the PR branch inside the main -repository rather than inside your fork. - -### Get a code review - -Once your pull request has been opened it will be assigned to one or more -reviewers. Those reviewers will do a thorough code review, looking for -correctness, bugs, opportunities for improvement, documentation and comments, -and style. - -Commit changes made in response to review comments to the same branch on your -fork. +The `-s` flag adds the required `Signed-off-by` line. Without it, CI will block the PR. -Very small PRs are easy to review. Very large PRs are very difficult to review. +See the [contributing guide](contributing.md) for commit type conventions and message rules. -### Squash commits - -After a review, prepare your PR for merging by squashing your commits. - -All commits left on your branch after a review should represent meaningful milestones -or units of work. Use commits to add clarity to the development and review process. - -Before merging a PR, squash the following kinds of commits: - -- Fixes/review feedback -- Typos -- Merges and rebases -- Work in progress - -Aim to have every commit in a PR compile and pass tests independently if you can, -but it's not a requirement. In particular, `merge` commits must be removed, as they will not pass tests. - -To squash your commits, perform an [interactive -rebase](https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History): - -1. Check your git branch: - - ```bash - git status - ``` - - Output is similar to: - - ```text - On branch your-contribution - Your branch is up to date with 'origin/your-contribution'. - ``` - -2. Start an interactive rebase using a specific commit hash, or count backwards from - your last commit using `HEAD~<n>`, where `<n>` represents the number of commits to include in the rebase. - - ```bash - git rebase -i HEAD~3 - ``` - - Output is similar to: +## Push - ```text - pick 2ebe926 Original commit - pick 31f33e9 Address feedback - pick b0315fe Second unit of work +Push your branch to your fork: - # Rebase 7c34fc9..b0315ff onto 7c34fc9 (3 commands) - # - # Commands: - # p, pick <commit> = use commit - # r, reword <commit> = use commit, but edit the commit message - # e, edit <commit> = use commit, but stop for amending - # s, squash <commit> = use commit, but meld into previous commit - # f, fixup <commit> = like "squash", but discard this commit's log message - ... 
+```bash +git push origin fix/gpu-memory-calculation +``` - ``` +If you have rebased after a previous push, use `--force-with-lease` rather than `--force`: -3. Use a command line text editor to change the word `pick` to `squash` for - the commits you want to squash, then save your changes and continue the rebase: +```bash +git push origin fix/gpu-memory-calculation --force-with-lease +``` - ```text - pick 2ebe926 Original commit - squash 31f33e9 Address feedback - pick b0315fe Second unit of work - ... - ``` +`--force-with-lease` refuses the push if someone else has pushed to the same branch since your last fetch, preventing accidental overwrites. - Output (after saving changes) is similar to: +## Open a pull request - ```text - [detached HEAD 61fdded] Second unit of work - Date: Thu Mar 5 19:01:32 2020 +0100 - 2 files changed, 15 insertions(+), 1 deletion(-) - ... +1. Go to your fork on GitHub: `https://github.com/<your-username>/HAMi` +2. Click **Compare & Pull Request** next to your branch. +3. Set the base repository to `Project-HAMi/HAMi` and the base branch to `master`. +4. Fill in the PR description: what the change does, why it is needed, and how it was tested. +5. Reference any related issue: `Fixes #123` or `Relates to #456`. - Successfully rebased and updated refs/heads/master. - ``` +Keep the PR focused on one logical change. See the [contributing guide](contributing.md) for guidance on PR scope. -4. Force push your changes to your remote branch: +## Squash commits - ```bash - git push --force - ``` +Before a PR is merged, clean up the commit history. Squash fixup commits, review-feedback commits, and typo corrections into the logical commit they belong to. Each remaining commit should represent a meaningful unit of work. -For mass automated fixups (e.g. automated doc formatting), use one or more -commits for the changes to tooling and a final commit to apply the fixup en -masse. This makes reviews easier. +To squash interactively: -## Merging a commit +```bash +# Replace 3 with the number of commits to rebase +git rebase -i HEAD~3 +``` -Once you've received review and approval, your commits are squashed, your PR is ready for merging. +The editor opens with a list of commits: -Merging happens automatically after both a Reviewer and Approver have approved the PR. -If you haven't squashed your commits, they may ask you to do so before approving a PR. +``` +pick abc1234 fix: correct memory calculation for MLU devices +pick def5678 address review feedback +pick ghi9012 fix typo +``` -## Reverting a commit +Change `pick` to `squash` (or `s`) for commits to fold into the one above: -In case you wish to revert a commit, use the following instructions. +``` +pick abc1234 fix: correct memory calculation for MLU devices +squash def5678 address review feedback +squash ghi9012 fix typo +``` -_If you have upstream write access_, please refrain from using the -`Revert` button in the GitHub UI for creating the PR, because GitHub -will create the PR branch inside the main repository rather than inside your fork. +Save and close the editor. Git opens another editor to combine the commit messages - write a single clean message and save. -- Create a branch and sync it with upstream. +Force-push the result: - ```sh - # Depending on which repository you are working from, - # the default branch may be called 'main' instead of 'master'. 
+```bash +git push origin fix/gpu-memory-calculation --force-with-lease +``` - # create a branch - git checkout -b myrevert +## Address review feedback - # sync the branch with upstream - git fetch upstream - git rebase upstream/master - ``` +Push additional commits to the same branch as you address feedback. Do not close and reopen the PR. -- If the commit you wish to revert is a: +```bash +# Make changes, then: +git add <files> +git commit -s -m "fix: address review feedback on memory limit check" +git push origin fix/gpu-memory-calculation +``` - - **merge commit:** +Squash these into the relevant commits before the PR is merged. - ```sh - # SHA is the hash of the merge commit you wish to revert - git revert -m 1 SHA - ``` +## Revert a commit - - **single commit:** +To revert a merged commit, create a new branch off master and use `git revert`: - ```sh - # SHA is the hash of the single commit you wish to revert - git revert SHA - ``` +```bash +git fetch upstream +git checkout master +git rebase upstream/master +git checkout -b revert/fix-gpu-memory-calculation -- This will create a new commit reverting the changes. Push this new commit to your remote. +# For a single commit +git revert <SHA> - ```sh - git push ${your_remote_name} myrevert - ``` +# For a merge commit +git revert -m 1 <SHA> +``` -- [Create a Pull Request](#create-a-pull-request) using this branch. +Push the branch and open a PR as normal. Do not use the GitHub UI **Revert** button - it creates the branch inside the upstream repository instead of your fork. diff --git a/versioned_docs/version-v2.8.0/contributor/governance.md b/versioned_docs/version-v2.8.0/contributor/governance.md index 95984570..4a94b599 100644 --- a/versioned_docs/version-v2.8.0/contributor/governance.md +++ b/versioned_docs/version-v2.8.0/contributor/governance.md @@ -2,55 +2,36 @@ title: Governance --- -Heterogeneous AI Computing Virtualization Middleware (HAMi), formerly known as k8s-vGPU-scheduler, is an "all-in-one" tools designed to manage Heterogeneous AI Computing Devices in a k8s cluster +Heterogeneous AI Computing Virtualization Middleware (HAMi), formerly known as k8s-vGPU-scheduler, is a tool for managing heterogeneous AI computing devices in a Kubernetes cluster. ## Values -The HAMi and its leadership embrace the following values: +The HAMi project and its leadership embrace the following values: -* Openness: Communication and decision-making happens in the open and is discoverable for future - reference. As much as possible, all discussions and work take place in public - forums and open repositories. +- **Openness**: Communication and decision-making happens in the open and is discoverable for future reference. Discussions and work take place in public forums and open repositories wherever possible. -* Fairness: All stakeholders have the opportunity to provide feedback and submit - contributions, which will be considered on their merits. +- **Fairness**: All stakeholders have the opportunity to provide feedback and submit contributions, which are considered on their merits. -* Community over Product or Company: Sustaining and growing the community takes - priority over shipping code or sponsors' organizational goals. Each - contributor participates in the project as an individual. +- **Community over product or company**: Sustaining and growing the community takes priority over shipping code or sponsors' organizational goals. Each contributor participates in the project as an individual. 
-* Inclusivity: Innovation comes from different perspectives and skill sets, and this - can only be accomplished in a welcoming and respectful environment. +- **Inclusivity**: Different perspectives and skill sets drive better outcomes. This is only possible in a welcoming and respectful environment. -* Participation: Responsibilities within the project are earned through - participation, and there is a clear path up the contributor ladder into leadership - positions. +- **Participation**: Responsibilities within the project are earned through participation, and there is a clear path up the contributor ladder into leadership positions. ## Membership -Currently, the maintainers are the governing body for the project. This may -change as the community grows, such as by adopting an elected steering committee. +The maintainers are the current governing body for the project. This may change as the community grows, such as by adopting an elected steering committee. ## Meetings -Time zones permitting, Maintainers are expected to participate in the public -developer meeting, which occurs -[Google Docs](https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit). +Maintainers are expected to participate in the public developer meeting when time zones permit. Details are in this [Google Docs](https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit) document. -Maintainers will also have closed meetings in order to discuss security reports -or Code of Conduct violations. Such meetings should be scheduled by any -Maintainer on receipt of a security issue or CoC report. All current Maintainers -must be invited to such closed meetings, except for any Maintainer who is -accused of a CoC violation. +Maintainers also hold closed meetings to discuss security reports or Code of Conduct violations. Any Maintainer who receives a security issue or CoC report should schedule such a meeting. All current Maintainers must be invited, except any Maintainer who is the subject of a CoC complaint. ## Code of Conduct -[Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md) -violations by community members will be referred to the CNCF Code of Conduct -Committee. Should the CNCF CoC Committee need to work with the project on resolution, the -Maintainers will appoint a non-involved contributor to work with them. +[Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md) violations by community members are referred to the CNCF Code of Conduct Committee. If the CNCF CoC Committee needs to work with the project on resolution, the Maintainers will appoint a non-involved contributor to assist. -## Modifying this Charter +## Modifying this charter -Changes to this Governance and its supporting documents may be approved by -a 2/3 vote of the Maintainers. +Changes to this governance document and its supporting documents require a 2/3 vote of the Maintainers. diff --git a/versioned_docs/version-v2.8.0/contributor/ladder.md b/versioned_docs/version-v2.8.0/contributor/ladder.md index 4c323d0e..7d428889 100644 --- a/versioned_docs/version-v2.8.0/contributor/ladder.md +++ b/versioned_docs/version-v2.8.0/contributor/ladder.md @@ -2,165 +2,149 @@ title: Contributor Ladder --- -This docs different ways to get involved and level up within the project. You can see different roles within the project in the contributor roles. 
+This document describes the contributor roles within the HAMi project, along with the responsibilities and privileges that come with each role. Contributors generally start at the first level and advance as their involvement grows. -This contributor ladder outlines the different contributor roles within the project, along with the responsibilities and privileges that come with them. +## Overview -Each of the contributor roles below is organized into lists of three types of things. "Responsibilities" are things that a contributor is expected to do. "Requirements" are qualifications a person needs to meet to be in that role, and "Privileges" are things contributors on that level are entitled to. +Each role below lists three things: **Responsibilities** (what is expected), **Requirements** (what qualifies someone for the role), and **Privileges** (what the role grants). ### Community Participant -Description: A Community Participant engages with the project and its community, contributing their time, thoughts, etc. Community participants are usually users who have stopped being anonymous and started being active in project discussions. +A Community Participant engages with the project and its community. These are typically users who have moved from anonymous usage to active participation in discussions. -* Responsibilities: - * Must follow the [CNCF CoC](https://github.com/cncf/foundation/blob/main/code-of-conduct.md) -* How users can get involved with the community: - * Participating in community discussions - * Helping other users - * Submitting bug reports - * Commenting on issues - * Trying out new releases - * Attending community events +- Responsibilities: + - Follow the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md) +- Ways to get involved: + - Participate in community discussions + - Help other users + - Submit bug reports + - Comment on issues + - Try out new releases + - Attend community events ### Contributor -Description: A Contributor contributes directly to the project and adds value to it. Contributions need not be code. People at the Contributor level may be new contributors, or they may only contribute occasionally. - -* Responsibilities include: - * Follow the CNCF CoC - * Follow the project contributing guide -* Requirements (one or several of the below): - * Report and sometimes resolve issues - * Occasionally submit PRs - * Contribute to the documentation - * Show up at meetings, takes notes - * Answer questions from other community members - * Submit feedback on issues and PRs - * Test releases and patches and submit reviews - * Run or helps run events - * Promote the project in public - * Help run the project infrastructure - * [TODO: other small contributions] -* Privileges: - * Invitations to contributor events - * Eligible to become an Organization Member - -A very special thanks to the [long list of people](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md) who have contributed to and helped maintain the project. Thanks to everyone who contributed and helped maintain the project. - -As long as you contribute to HAMi, your name will be added [here](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md). If you don't find your name, please contact us to add it. +A Contributor contributes directly to the project. Contributions do not have to be code. Contributors may be new or may contribute only occasionally. 
+ +- Responsibilities: + - Follow the CNCF Code of Conduct + - Follow the project contributing guide +- Requirements (one or more of the following): + - Report and sometimes resolve issues + - Submit PRs occasionally + - Contribute to documentation + - Attend meetings and take notes + - Answer questions from other community members + - Submit feedback on issues and PRs + - Test releases and patches and submit reviews + - Run or help run events + - Promote the project publicly + - Help maintain project infrastructure +- Privileges: + - Invitations to contributor events + - Eligible to become an Organization Member + +Contributors are listed in the [AUTHORS.md file](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md). If a name is missing, open an issue on the HAMi repository to have it added. ### Organization Member -Description: An Organization Member is an established contributor who regularly participates in the project. Organization Members have privileges in both project repositories and elections, and as such are expected to act in the interests of the whole project. - -An Organization Member must meet the responsibilities and has the requirements of a Contributor, plus: - -* Responsibilities include: - * Continues to contribute regularly, as demonstrated by having at least 50 GitHub contributions per year -* Requirements: - * Enabled [two-factor authentication] on their GitHub account - * Must have successful contributions to the project or community, including at least one of the following: - * 5 accepted PRs, - * Reviewed 5 PRs, - * Resolved and closed 3 Issues, - * Become responsible for a key project management area, - * Or some equivalent combination or contribution - * Must have been contributing for at least 1 months - * Must be actively contributing to at least one project area - * Must have two sponsors who are also Organization Members, at least one of whom does not work for the same employer - * **[Open an issue][membership request] against the Project-HAMi/HAMi repo** - * Ensure your sponsors are @mentioned on the issue - * Complete every item on the issue checklist - * Make sure that the list of contributions included is representative of your work on the project. - * Have your sponsoring reviewers reply confirmation of sponsorship: `+1` - * Once your sponsors have responded, your request will be handled by the `HAMi GitHub Admin team`. - -* Privileges: - * May be assigned Issues and Reviews - * May give commands to CI/CD automation - * Can be added to [TODO: Repo Host] teams - * Can recommend other contributors to become Org Members - -The process for a Contributor to become an Organization Member is as follows: - -1. Contact Maintainers and get at least two maintainers to agree -2. Submit an Issue application to become a Member +An Organization Member is an established contributor who participates regularly. Organization Members have privileges in both project repositories and elections, and are expected to act in the interests of the whole project. 
+ +An Organization Member must meet all Contributor responsibilities and requirements, plus: + +- Responsibilities: + - Continue contributing regularly, as demonstrated by at least 50 GitHub contributions per year +- Requirements: + - [Two-factor authentication][two-factor authentication] enabled on their GitHub account + - Successful contributions to the project or community, including at least one of the following: + - 5 accepted PRs + - 5 PRs reviewed + - 3 issues resolved and closed + - Responsibility for a key project management area + - An equivalent combination of contributions + - Contributing for at least 1 month + - Actively contributing to at least one project area + - Two sponsors who are also Organization Members, at least one of whom works for a different employer + - [Open a membership request issue][membership request] against the Project-HAMi/HAMi repo: + - Mention sponsors in the issue + - Complete every item on the issue checklist + - Ensure the contributions listed are representative of the work done + - Sponsors reply with confirmation: `+1` + - After sponsor confirmation, the request is handled by the HAMi GitHub Admin team +- Privileges: + - May be assigned issues and reviews + - May issue commands to CI/CD automation + - Can be added to HAMi project teams + - Can recommend other contributors for Org Member status ### Reviewer -Description: A Reviewer has responsibility for specific code, documentation, test, or other project areas. They are collectively responsible, with other Reviewers, for reviewing all changes to those areas and indicating whether those changes are ready to merge. They have a track record of contribution and review in the project. - -Reviewers are responsible for a "specific area." This can be a specific code directory, driver, chapter of the docs, test job, event, or other clearly-defined project component that is smaller than an entire repository or subproject. Most often it is one or a set of directories in one or more Git repositories. The "specific area" below refers to this area of responsibility. +A Reviewer is responsible for a specific area of the project: a code directory, a section of the docs, a test suite, or another clearly-defined component. Reviewers are collectively responsible for reviewing all changes to their area and indicating whether changes are ready to merge. Reviewers have all the rights and responsibilities of an Organization Member, plus: -* Responsibilities include: - * Following the reviewing guide - * Reviewing most Pull Requests against their specific areas of responsibility - * Reviewing at least 20 PRs per year - * Helping other contributors become reviewers -* Requirements: - * Experience as a Contributor for at least 3 months - * Is an Organization Member - * Has reviewed, or helped review, at least 10 Pull Requests - * Has analyzed and resolved test failures in their specific area - * Has demonstrated an in-depth knowledge of the specific area - * Commits to being responsible for that specific area - * Is supportive of new and occasional contributors and helps get useful PRs in shape to commit -* Additional privileges: - * Has GitHub or CI/CD rights to approve pull requests in specific directories - * Can recommend and review other contributors to become Reviewers - -The process of becoming a Reviewer is: - -1. The contributor is nominated by opening a PR against the appropriate repository, which adds their GitHub username to the OWNERS file for one or more directories. -2. 
At least two members of the team that owns that repository or main directory, who are already Approvers, approve the PR. +- Responsibilities: + - Follow the reviewing guide + - Review most pull requests against their specific area of responsibility + - Review at least 20 PRs per year + - Help other contributors become reviewers +- Requirements: + - At least 3 months of experience as a Contributor + - Organization Member status + - At least 10 pull requests reviewed or co-reviewed + - Demonstrated ability to analyze and resolve test failures in their area + - In-depth knowledge of the specific area + - Commitment to ongoing responsibility for that area + - Supportive of new and occasional contributors +- Privileges: + - GitHub or CI/CD rights to approve pull requests in specific directories + - Can recommend and review other contributors for Reviewer status + +To become a Reviewer: + +1. A contributor opens a PR against the appropriate repository, adding their GitHub username to the OWNERS file for one or more directories. +2. At least two existing Approvers for that repository or directory approve the PR. ### Maintainer -Description: Maintainers are very established contributors who are responsible for the entire project. As such, they have the ability to approve PRs against any area of the project, and are expected to participate in making decisions about the strategy and priorities of the project. - -A Maintainer must meet the responsibilities and requirements of a Reviewer, plus: - -The current list of maintainers can be found in the [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md). +A Maintainer is a highly established contributor responsible for the entire project. Maintainers can approve PRs against any area of the project and are expected to participate in decisions about project strategy and priorities. -### An active maintainer should +A Maintainer must meet all Reviewer responsibilities and requirements. -* Actively participate in reviewing pull requests and incoming issues. There are no hard rules on what is “active enough” and this is left up to the judgement of the current group of maintainers. +The current list of maintainers is in [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md). -* Actively participate in discussions about design and the future of the project. +**An active maintainer:** -* Take responsibility for backports to appropriate branches for PRs they approve and merge. +- Actively reviews pull requests and incoming issues. There are no hard rules on what counts as active - this is left to the judgement of the current maintainer group. +- Participates in discussions about design and the future of the project. +- Takes responsibility for backports to appropriate branches for PRs they approve and merge. +- Follows code, testing, and design conventions as determined by consensus among active maintainers. +- Steps down gracefully when no longer planning to actively participate. -* Do their best to follow all code, testing, and design conventions as determined by consensus among active maintainers. +**Becoming a maintainer:** -* Gracefully step down from their maintainership role when they are no longer planning to actively participate in the project. +New maintainers are added by consensus among the current maintainer group, via Slack or email discussion. A majority must support the addition, and no single maintainer should object. 
-### How to be a maintainer +When adding a new maintainer, open a PR to [HAMi](https://github.com/Project-HAMi/HAMi) and update [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md). Once merged, the person becomes a maintainer. -New maintainers are added by consensus among the current group of maintainers. This can be done via a private discussion via Slack or email. A majority of maintainers should support the addition of the new person, and no single maintainer should object to adding the new maintainer. +**Removing maintainers:** -When adding a new maintainer, file a PR to [HAMi](https://github.com/Project-HAMi/HAMi) and update [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md). Once this PR is merged, you will become a maintainer of HAMi. - -### Removing Maintainers - -It is normal for maintainers to come and go based on their other responsibilities. Inactive maintainers may be removed if there is no expectation of ongoing participation. If a former maintainer resumes participation, they should be given quick consideration for re-adding to the team. +Maintainers may step back as their other responsibilities change. Inactive maintainers may be removed if there is no expectation of renewed participation. Former maintainers who return should be given prompt consideration for reinstatement. ## Inactivity -It is important for contributors to be and stay active to set an example and show commitment to the project. Inactivity is harmful to the project as it may lead to unexpected delays, contributor attrition, and a lost of trust in the project. +Active participation is important for project health. Inactivity is measured by: + +- No contributions for more than 3 months +- No communication for more than 3 months -* Inactivity is measured by: - * Periods of no contributions for longer than 3 months - * Periods of no communication for longer than 3 months -* Consequences of being inactive include: - * Involuntary removal or demotion - * Being asked to move to Emeritus status +Consequences may include involuntary removal, demotion, or a move to Emeritus status. -## Involuntary Removal or Demotion +## Involuntary removal or demotion -Involuntary removal/demotion of a contributor happens when responsibilities and requirements aren't being met. This may include repeated patterns of inactivity, extended period of inactivity, a period of failing to meet the requirements of your role, and/or a violation of the Code of Conduct. This process is important because it protects the community and its deliverables while also opens up opportunities for new contributors to step in. +Involuntary removal or demotion happens when a contributor's responsibilities and requirements are no longer being met. This may result from repeated or extended inactivity, failure to meet role requirements, or a Code of Conduct violation. -Involuntary removal or demotion is handled through a vote by a majority of the current Maintainers. +Involuntary removal or demotion is decided by a majority vote of the current Maintainers. 
[two-factor authentication]: https://help.github.com/articles/about-two-factor-authentication
+[membership request]: https://github.com/Project-HAMi/HAMi/issues/new
diff --git a/versioned_docs/version-v2.8.0/contributor/lifted.md b/versioned_docs/version-v2.8.0/contributor/lifted.md
index d5ba7db6..bc276558 100644
--- a/versioned_docs/version-v2.8.0/contributor/lifted.md
+++ b/versioned_docs/version-v2.8.0/contributor/lifted.md
@@ -3,11 +3,7 @@ title: How to manage lifted codes
 ---

 This document explains how lifted code is managed.
-A common user case for this task is developer lifting code from other repositories to `pkg/util/lifted` directory.
-
-- [Steps of lifting code](#steps-of-lifting-code)
-- [How to write lifted comments](#how-to-write-lifted-comments)
-- [Examples](#examples)
+A common use case for this task is a developer lifting code from other repositories into the `pkg/util/lifted` directory.

 ## Steps of lifting code

@@ -21,9 +17,9 @@ A common user case for this task is developer lifting code from other repositori

 Lifted comments shall be placed just before the lifted code (could be a func, type, var or const). Only empty lines and comments are allowed between lifted comments and lifted code.

-Lifted comments are composed by one or multi comment lines, each in the format of `+lifted:KEY[=VALUE]`. Value is optional for some keys.
+Lifted comments are composed of one or more comment lines, each in the format of `+lifted:KEY[=VALUE]`. Value is optional for some keys.

-Valid keys are as follow:
+Valid keys are as follows:

 - source:

@@ -47,7 +43,7 @@ Lift function `IsQuotaHugePageResourceName` to `corehelpers.go`:
 // IsQuotaHugePageResourceName returns true if the resource name has the quota
 // related huge page resource prefix.
 func IsQuotaHugePageResourceName(name corev1.ResourceName) bool {
-	return strings.HasPrefix(string(name), corev1.ResourceHugePagesPrefix) || strings.HasPrefix(string(name), corev1.ResourceRequestsHugePagesPrefix)
+	return strings.HasPrefix(string(name), corev1.ResourceHugePagesPrefix) || strings.HasPrefix(string(name), corev1.ResourceRequestsHugePagesPrefix)
 }
 ```

@@ -70,11 +66,11 @@ Lift and change function `GetNewReplicaSet` to `deployment.go`

 // GetNewReplicaSet returns a replica set that matches the intent of the given deployment; get ReplicaSetList from client interface.
 // Returns nil if the new replica set doesn't exist yet.
func GetNewReplicaSet(deployment *appsv1.Deployment, f ReplicaSetListFunc) (*appsv1.ReplicaSet, error) { - rsList, err := ListReplicaSetsByDeployment(deployment, f) - if err != nil { - return nil, err - } - return FindNewReplicaSet(deployment, rsList), nil + rsList, err := ListReplicaSetsByDeployment(deployment, f) + if err != nil { + return nil, err + } + return FindNewReplicaSet(deployment, rsList), nil } ``` @@ -88,7 +84,7 @@ Added in `doc.go`: ### Lifting const -Lift const `isNegativeErrorMsg` to `corevalidation.go `: +Lift const `isNegativeErrorMsg` to `corevalidation.go`: ```go // +lifted:source=https://github.com/kubernetes/kubernetes/blob/release-1.22/pkg/apis/core/validation/validation.go#L59 diff --git a/versioned_docs/version-v2.8.0/core-concepts/architecture.md b/versioned_docs/version-v2.8.0/core-concepts/architecture.md index 4558de34..800d534c 100644 --- a/versioned_docs/version-v2.8.0/core-concepts/architecture.md +++ b/versioned_docs/version-v2.8.0/core-concepts/architecture.md @@ -6,7 +6,7 @@ The overall architecture of HAMi is shown as below: ![Architecture](/img/docs/common/core-concepts/architect.jpg) -The HAMi consists of the following components: +HAMi consists of the following components: - HAMi MutatingWebhook - HAMi scheduler-extender @@ -17,7 +17,7 @@ The HAMi consists of the following components: HAMi MutatingWebhook checks if this task can be handled by HAMi, It scans the resource field of each pod submitted, -If each resource these pod requires is either 'CPU', 'Memory' or a HAMi-resource, +If each resource the pod requires is either 'CPU', 'Memory' or a HAMi-resource, Then it will set the schedulerName field of this pod to 'HAMi-scheduler'. ## HAMi scheduler {#hami-scheduler} diff --git a/versioned_docs/version-v2.8.0/core-concepts/gpu-virtualization.md b/versioned_docs/version-v2.8.0/core-concepts/gpu-virtualization.md index d9f0cdd1..89c2f530 100644 --- a/versioned_docs/version-v2.8.0/core-concepts/gpu-virtualization.md +++ b/versioned_docs/version-v2.8.0/core-concepts/gpu-virtualization.md @@ -53,7 +53,7 @@ HAMi currently remains based on Device Plugin, but the official team has launche ## HAMi Virtual GPU Scheduling Principles -HAMi leverages three Kubernetes extension mechanisms simultaneously (MutatingWebhook, Scheduler Extender, and Device Plugin), each with its own responsibility: +HAMi uses three Kubernetes extension mechanisms simultaneously (MutatingWebhook, Scheduler Extender, and Device Plugin), each with its own responsibility: - **Fine-grained resource declaration**: Users can declare `nvidia.com/gpumem` (VRAM in MiB) and `nvidia.com/gpucores` (compute %) - **Resource-aware scheduling**: The Scheduler-Extender reads GPU specifications from Node Annotations, performing Filter and Bind based on remaining VRAM/compute capacity @@ -246,5 +246,5 @@ The two dimensions are orthogonal. 
Common combinations: Here are some recommended next steps: -- Learn about HAMi's [architecture](./architecture) -- [Install HAMi](../installation/prerequisites) in your Kubernetes cluster +- Learn about HAMi's [architecture](./architecture.md) +- [Install HAMi](../installation/prerequisites.md) in your Kubernetes cluster diff --git a/versioned_docs/version-v2.8.0/core-concepts/introduction.md b/versioned_docs/version-v2.8.0/core-concepts/introduction.md index d7f4f15f..6b55018a 100644 --- a/versioned_docs/version-v2.8.0/core-concepts/introduction.md +++ b/versioned_docs/version-v2.8.0/core-concepts/introduction.md @@ -39,5 +39,6 @@ HAMi is a [Cloud Native Computing Foundation](https://cncf.io/) [Sandbox project Here are some recommended next steps: +- Understand HAMi's [GPU virtualization principles](./gpu-virtualization.md) - Learn about HAMi's [architecture](./architecture.md) - [Install HAMi](../installation/prerequisites.md) in your Kubernetes cluster diff --git a/versioned_docs/version-v2.8.0/developers/build.md b/versioned_docs/version-v2.8.0/developers/build.md index 1a260c80..c646b638 100644 --- a/versioned_docs/version-v2.8.0/developers/build.md +++ b/versioned_docs/version-v2.8.0/developers/build.md @@ -11,13 +11,13 @@ The following tools are required: - go v1.20+ - make -### build +### Build ```bash make ``` -If everything are successfully built, the following output are printed +If everything is successfully built, the following output is printed ```bash go build -ldflags '-s -w -X github.com/Project-HAMi/HAMi/pkg/version.version=v0.0.1' -o bin/scheduler ./cmd/scheduler @@ -34,13 +34,13 @@ The following tools are required: - docker - make -### build +### Build ```bash make docker ``` -If everything are successfully built, the following output are printed +If everything is successfully built, the following output is printed ```bash go build -ldflags '-s -w -X github.com/Project-HAMi/HAMi/pkg/version.version=v0.0.1' -o bin/scheduler ./cmd/scheduler @@ -73,21 +73,21 @@ FINISHED => [stage-3 3/6] COPY --from=GOBUILD /k8s-vgpu/bin /k8s-vgpu/bin 0.5s => [stage-3 4/6] COPY ./docker/entrypoint.sh /k8s-vgpu/bin/entrypoint.sh 0.2s => [stage-3 5/6] COPY ./lib /k8s-vgpu/lib 0.2s - => [nvbuild 6/9] RUN tar -xf cmake-3.19.8-Linux-x86_64.tar.gz 2.1s - => [nvbuild 7/9] RUN cp /libvgpu/cmake-3.19.8-Linux-x86_64/bin/cmake /libvgpu/cmake-3.19.8-Linux-x86_64/bin/cmake3 1.3s - => [nvbuild 8/9] RUN apt-get -y install openssl libssl-dev 7.7s - => [nvbuild 9/9] RUN bash ./build.sh 4.0s - => [stage-3 6/6] COPY --from=NVBUILD /libvgpu/build/libvgpu.so /k8s-vgpu/lib/nvidia/ 0.3s - => exporting to image 1.8s - => => exporting layers 1.8s - => => writing image sha256:fc0ce42b41f9a177921c9bfd239babfa06fc77cf9e4087e8f2d959d749e8039f 0.0s - => => naming to docker.io/projecthami/hami:master-103b2b677e018a40af6322a56c2e9d5d5c62cccf 0.0s -The push refers to repository [docker.io/projecthami/hami] + => [nvbuild 6/9] RUN tar -xf cmake-3.19.8-Linux-x86_64.tar.gz 2.1s + => [nvbuild 7/9] RUN cp /libvgpu/cmake-3.19.8-Linux-x86_64/bin/cmake /libvgpu/cmake-3.19.8-Linux-x86_64/bin/cmake3 1.3s + => [nvbuild 8/9] RUN apt-get -y install openssl libssl-dev 7.7s + => [nvbuild 9/9] RUN bash ./build.sh 4.0s + => [stage-3 6/6] COPY --from=NVBUILD /libvgpu/build/libvgpu.so /k8s-vgpu/lib/nvidia/ 0.3s + => exporting to image 1.8s + => => exporting layers 1.8s + => => writing image sha256:fc0ce42b41f9a177921c9bfd239babfa06fc77cf9e4087e8f2d959d749e8039f 0.0s + => => naming to 
docker.io/projecthami/hami:master-103b2b677e018a40af6322a56c2e9d5d5c62cccf 0.0s
+The push refers to repository [docker.io/projecthami/hami]
 ```

 ## Make HAMi-Core

-HAMi-Core is recommended to be built in a nvidia/cuda image:
+Build HAMi-Core inside an nvidia/cuda image:

 ```bash
 git clone https://github.com/Project-HAMi/HAMi-core.git
diff --git a/versioned_docs/version-v2.8.0/developers/dynamic-mig.md b/versioned_docs/version-v2.8.0/developers/dynamic-mig.md
index 0d257953..ce4051b4 100644
--- a/versioned_docs/version-v2.8.0/developers/dynamic-mig.md
+++ b/versioned_docs/version-v2.8.0/developers/dynamic-mig.md
@@ -9,14 +9,14 @@ This feature will not be implemented without the help of @sailorvii.

 ## Introduction

-The NVIDIA GPU build-in sharing method includes: time-slice, MPS and MIG. The context switch for time slice sharing would waste some time, MPS and MIG are preferred. The GPU MIG profile is variable, the user could acquire the MIG device in the profile definition, but current implementation only defines the dedicated profile before the user requirement. That limits the usage of MIG. The goal is an automatic slice plugin that creates slices on demand.
-For the scheduling method, node-level binpack and spread will be supported. Referring to the binpack plugin, the scheduler considers CPU, memory, GPU memory, and other user-defined resources.
+NVIDIA GPUs offer three built-in sharing methods: time-slicing, MPS, and MIG. Time-slice sharing loses time to context switches, so MPS and MIG are preferred. MIG profiles are flexible, and a user can request a MIG device by profile, but the current implementation requires the profiles to be defined before user requests arrive, which limits the usefulness of MIG. The goal is to develop an automatic slice plugin that creates slices on demand, as the user requires them.
+For the scheduling method, node-level binpack and spread will be supported. Referring to the binpack plugin, the scheduler considers CPU, memory, GPU memory, and other user-defined resources.

 HAMi is done by using [hami-core](https://github.com/Project-HAMi/HAMi-core), which is a cuda-hacking library. But mig is also widely used across the world. A unified API for dynamic-mig and hami-core is needed.

 ## Targets

 - CPU, Mem, and GPU combined schedule
-- GPU dynamic slice: Hami-core and MIG
+- GPU dynamic slice: HAMi-core and MIG
 - Support node-level binpack and spread by GPU memory, CPU and Mem
 - A unified vGPU Pool different virtualization techniques
 - Tasks can choose to use MIG, use HAMi-core, or use both.
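+
+One of the targets above is a unified API over the two virtualization techniques. As a rough illustration of what such an abstraction could look like, here is a minimal Go sketch; the type and method names are hypothetical, not HAMi's actual API:
+
+```go
+package vgpu
+
+// OperatingMode selects the virtualization technique used on a node,
+// mirroring the per-node `operatingmode` field in the config below.
+type OperatingMode string
+
+const (
+	ModeHAMiCore OperatingMode = "hami-core"
+	ModeMIG      OperatingMode = "mig"
+)
+
+// SliceRequest is the share of one device asked for by a container.
+type SliceRequest struct {
+	MemoryMiB   int64 // e.g. from nvidia.com/gpumem
+	CorePercent int64 // e.g. from nvidia.com/gpucores
+}
+
+// Backend is the unified surface both techniques would implement.
+type Backend interface {
+	Mode() OperatingMode
+	// Fit reports whether a device can still host the request.
+	Fit(req SliceRequest) bool
+	// Allocate reserves a slice and returns an identifier for it,
+	// such as a MIG instance UUID or a HAMi-core virtual device ID.
+	Allocate(req SliceRequest) (string, error)
+}
+```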
@@ -37,61 +37,61 @@ data: knownMigGeometries: - models: [ "A30" ] allowedGeometries: - - + - - name: 1g.6gb memory: 6144 count: 4 - - + - - name: 2g.12gb memory: 12288 count: 2 - - + - - name: 4g.24gb memory: 24576 count: 1 - models: [ "A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB", "A100-SXM4-40GB" ] allowedGeometries: - - + - - name: 1g.5gb memory: 5120 count: 7 - - + - - name: 2g.10gb memory: 10240 count: 3 - name: 1g.5gb memory: 5120 count: 1 - - + - - name: 3g.20gb memory: 20480 count: 2 - - + - - name: 7g.40gb memory: 40960 count: 1 - models: [ "A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB"] allowedGeometries: - - + - - name: 1g.10gb memory: 10240 count: 7 - - + - - name: 2g.20gb memory: 20480 count: 3 - name: 1g.10gb memory: 10240 count: 1 - - + - - name: 3g.40gb memory: 40960 count: 2 - - + - - name: 7g.79gb memory: 80896 count: 1 - nodeconfig: + nodeconfig: - name: nodeA operatingmode: hami-core - name: nodeB @@ -105,7 +105,7 @@ data: ## Examples Dynamic mig is compatible with hami tasks, as the example below: -Just Setting `nvidia.com/gpu` and `nvidia.com/gpumem`. +Set `nvidia.com/gpu` and `nvidia.com/gpumem`. ```yaml apiVersion: v1 @@ -115,7 +115,7 @@ metadata: spec: containers: - name: ubuntu-container1 - image: ubuntu:20.04 + image: ubuntu:22.04 command: ["bash", "-c", "sleep 86400"] resources: limits: @@ -135,12 +135,12 @@ metadata: spec: containers: - name: ubuntu-container1 - image: ubuntu:20.04 + image: ubuntu:22.04 command: ["bash", "-c", "sleep 86400"] resources: limits: nvidia.com/gpu: 2 # requesting 2 vGPUs - nvidia.com/gpumem: 8000 # Each vGPU contains 8000m device memory (Optional,Integer + nvidia.com/gpumem: 8000 # Each vGPU contains 8000m device memory (Optional,Integer) ``` ## Procedures @@ -149,7 +149,7 @@ The Procedure of a vGPU task which uses dynamic-mig is shown below: <img src="https://github.com/Project-HAMi/HAMi/blob/master/docs/develop/imgs/hami-dynamic-mig-procedure.png?raw=true" width="800" alt="HAMi dynamic MIG procedure flowchart showing task scheduling process" /> -After a task is submitted, deviceshare plugin will iterate over templates defined in configMap `hami-scheduler-device`, and find the first available template to fit. You can always change the content of that configMap, and restart vc-scheduler to customize. +After submitting a task, the deviceshare plugin iterates over templates defined in configMap `hami-scheduler-device` and finds the first available template to fit. You can always change the content of that configMap, and restart vc-scheduler to customize. If you submit the example on an empty A100-PCIE-40GB node, then it will select a GPU and choose MIG template below: diff --git a/versioned_docs/version-v2.8.0/developers/hami-core-design.md b/versioned_docs/version-v2.8.0/developers/hami-core-design.md index 2764f3f9..907dbe7a 100644 --- a/versioned_docs/version-v2.8.0/developers/hami-core-design.md +++ b/versioned_docs/version-v2.8.0/developers/hami-core-design.md @@ -16,7 +16,7 @@ HAMi-core offers the following key features: ![nvidia-smi output showing virtualized GPU memory with HAMi-core](/img/docs/common/developers/hami-core-design/sample-nvidia-smi.png) -2. Limit the device utilization +2. Limit the device utilization Implements a custom time-slicing mechanism to control GPU usage. 
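+
+As a rough sketch of that time-slicing idea only: utilization can be capped by metering kernel launches against a periodically refilled budget. HAMi-core itself implements its limiter in C inside `libvgpu.so` by intercepting CUDA calls; the Go sketch below is purely illustrative, and every name in it is hypothetical:
+
+```go
+package throttle
+
+import (
+	"sync"
+	"time"
+)
+
+// Limiter caps GPU utilization with a token bucket: each kernel launch
+// spends budget, and the budget refills in proportion to the allowed
+// share of the device (e.g. 30 for a 30% core limit).
+type Limiter struct {
+	mu     sync.Mutex
+	tokens int64
+	burst  int64
+}
+
+func NewLimiter(corePercent int64) *Limiter {
+	l := &Limiter{burst: corePercent * 10}
+	go func() {
+		for range time.Tick(10 * time.Millisecond) {
+			l.mu.Lock()
+			l.tokens += corePercent
+			if l.tokens > l.burst {
+				l.tokens = l.burst
+			}
+			l.mu.Unlock()
+		}
+	}()
+	return l
+}
+
+// BeforeLaunch blocks until enough budget is available, throttling the
+// caller the way an interposed kernel-launch hook would.
+func (l *Limiter) BeforeLaunch(cost int64) {
+	for {
+		l.mu.Lock()
+		if l.tokens >= cost {
+			l.tokens -= cost
+			l.mu.Unlock()
+			return
+		}
+		l.mu.Unlock()
+		time.Sleep(time.Millisecond)
+	}
+}
+```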
diff --git a/versioned_docs/version-v2.8.0/developers/hami-webui-development-guide.md b/versioned_docs/version-v2.8.0/developers/hami-webui-development-guide.md index 81f47427..2bc121cd 100644 --- a/versioned_docs/version-v2.8.0/developers/hami-webui-development-guide.md +++ b/versioned_docs/version-v2.8.0/developers/hami-webui-development-guide.md @@ -151,4 +151,3 @@ Prefer extracting reusable fetch/filter/pagination logic into hooks: - Quickly understand how the project runs (BFF + Vite + API proxy) - Clarify where to place code and how to connect routes/menus/i18n/API - Reduce PR rework caused by inconsistent conventions - diff --git a/versioned_docs/version-v2.8.0/developers/kunlunxin-topology.md b/versioned_docs/version-v2.8.0/developers/kunlunxin-topology.md index c48e25a8..3a8079c2 100644 --- a/versioned_docs/version-v2.8.0/developers/kunlunxin-topology.md +++ b/versioned_docs/version-v2.8.0/developers/kunlunxin-topology.md @@ -30,34 +30,34 @@ The selection process is shown below: ## Score In the scoring phase, all filtered nodes are evaluated and scored to select the optimal one -for scheduling. The metric used is **MTF** (Minimized Tasks to Fill), +for scheduling. The scoring uses a metric called **MTF** (Minimized Tasks to Fill), which quantifies how well a node can accommodate future tasks after allocation. The table below shows examples of XPU occupation and proper MTF values: -| XPU Occupation | MTF | Description | -|----------------|-----|-------------| -| 11111111 | 0 | Fully occupied; no more tasks can be scheduled | -| 00000000 | 1 | A task requiring 8 XPUs can fully utilize it | -| 00000011 | 2 | A 4-XPU task and a 2-XPU task can be scheduled | -| 00000001 | 3 | A 4-XPU, 2-XPU, and 1-XPU task can fill it | -| 00010001 | 4 | Two 2-XPU tasks and two 1-XPU tasks can fill it | +| XPU Occupation | MTF | Description | +|----------------|-----|--------------------------------------------------| +| 11111111 | 0 | Fully occupied; no more tasks can be scheduled | +| 00000000 | 1 | A task requiring 8 XPUs can fully utilize it | +| 00000011 | 2 | A 4-XPU task and a 2-XPU task can be scheduled | +| 00000001 | 3 | A 4-XPU, 2-XPU, and 1-XPU task can fill it | +| 00010001 | 4 | Two 2-XPU tasks and two 1-XPU tasks can fill it | The node score is derived from the **delta(MTF)** - the change in MTF value after allocation. A smaller delta(MTF) indicates a better fit and results in a higher score. The scoring logic is shown below: -| delta(MTF) | Score | Example | -|------------|-------|---------| -| -1 | 2000 | 00000111->00001111 | -| 0 | 1000 | 00000111->00110111 | -| 1 | 0 | 00001111->00011111 | -| 2 | -1000 | 00000000->00000001 | +| delta(MTF) | Score | Example | +|:----------:|:------:|:-----------------------| +| -1 | 2000 | 00000111->00001111 | +| 0 | 1000 | 00000111->00110111 | +| 1 | 0 | 00001111->00011111 | +| 2 | -1000 | 00000000->00000001 | ## Bind In the bind phase, the allocation result is patched into the pod annotations. 
For example:

-```yaml
+```text
 BAIDU_COM_DEVICE_IDX=0,1,2,3
 ```

diff --git a/versioned_docs/version-v2.8.0/developers/protocol.md b/versioned_docs/version-v2.8.0/developers/protocol.md
index 12ba67eb..f575904e 100644
--- a/versioned_docs/version-v2.8.0/developers/protocol.md
+++ b/versioned_docs/version-v2.8.0/developers/protocol.md
@@ -4,24 +4,24 @@ title: Protocol design

 ### Device Registration

-<img src="/img/docs/common/developers/protocol/protocol-register.png" width="600px" alt="HAMi project diagram" />
+<img src="/img/docs/common/developers/protocol/protocol-register.png" width="600px" alt="HAMi device registration protocol diagram showing node annotation process" />

-HAMi needs to know the spec of each AI devices in the cluster in order to schedule properly. During device registration, device-plugin needs to keep patching the spec of each device into node annotations every 30 seconds, in the format of the following:
+HAMi needs to know the spec of each AI device in the cluster to schedule properly. During device registration, the device-plugin keeps patching the spec of each device into node annotations every 30 seconds, in the following format:

-```yaml
+```text
 hami.io/node-handshake-\{device-type\}: Reported_\{device_node_current_timestamp\}
 hami.io/node-\{device-type\}-register: \{Device 1\}:\{Device2\}:...:\{Device N\}
 ```

 The definition of each device is in the following format:

-```yaml
+```text
 \{Device UUID\},\{device split count\},\{device memory limit\},\{device core limit\},\{device type\},\{device numa\},\{healthy\}
 ```

 An example is shown below:

-```yaml
+```text
 hami.io/node-handshake-nvidia: Reported 2024-01-23 04:30:04.434037031 +0000 UTC m=+1104711.777756895
 hami.io/node-handshake-mlu: Requesting_2024.01.10 04:06:57
 hami.io/node-mlu-register: MLU-45013011-2257-0000-0000-000000000000,10,23308,0,MLU-MLU370-X4,0,false:MLU-54043011-2257-0000-0000-000000000000,10,23308,0,
@@ -31,10 +31,10 @@ hami.io/node-nvidia-register: GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec,10,32768,

 In this example, this node has two different AI devices, 2 Nvidia-V100 GPUs, and 2 Cambricon 370-X4 MLUs

-A device node may become unavailable due to hardware or network failure, if a node hasn't registered in last 5 minutes, scheduler will mark that node as 'unavailable'.
+A device node may become unavailable due to hardware or network failure. If a node hasn't registered in the last 5 minutes, the scheduler marks it as 'unavailable'.

 Since system clock on scheduler node and 'device' node may not align properly, scheduler node will patch the following device node annotations every 30s

-```yaml
+```text
 hami.io/node-handshake-\{device-type\}: Requesting_{scheduler_node_current_timestamp}
 ```
diff --git a/versioned_docs/version-v2.8.0/developers/scheduling.md b/versioned_docs/version-v2.8.0/developers/scheduling.md
index db9d5662..7e4e571f 100644
--- a/versioned_docs/version-v2.8.0/developers/scheduling.md
+++ b/versioned_docs/version-v2.8.0/developers/scheduling.md
@@ -1,83 +1,83 @@
 ---
-title: Scheduler Policy
+title: Scheduler Policy
 ---

 ## Summary

-Current in a cluster with many GPU nodes, nodes are not `binpack` or `spread` when making scheduling decisions, nor are GPU cards `binpack` or `spread` when using vGPU.
+Currently, in a cluster with many GPU nodes, nodes are not `binpack` or `spread` when making scheduling decisions, nor are GPU cards `binpack` or `spread` when using vGPU.
## Proposal

-The scheduler adds a `node-scheduler-policy` and `gpu-scheduler-policy` to config, then scheduler to use this policy can impl node `binpack` or `spread` or GPU `binpack` or `spread`. and
-use can set Pod annotation to change this default policy, use `hami.io/node-scheduler-policy` and `hami.io/gpu-scheduler-policy` to overlay scheduler config.
+A `node-scheduler-policy` and `gpu-scheduler-policy` can be set in config. The scheduler uses this policy to implement node `binpack` or `spread` or GPU `binpack` or `spread`.
+Pod annotations `hami.io/node-scheduler-policy` and `hami.io/gpu-scheduler-policy` can override the default scheduler config.

 ### User Stories

-This is a GPU cluster, having two node, the following story takes this cluster as a prerequisite.
+This is a GPU cluster with two nodes; the following stories take this cluster as a prerequisite.

 ![HAMi scheduler policy story diagram, showing node and GPU resource distribution](/img/docs/common/developers/scheduling/scheduler-policy-story.png)

 #### Story 1

-node binpack, use one node’s GPU card whenever possible, egs:
+node binpack, use one node’s GPU card whenever possible, e.g.:

 - cluster resources:
   - node1: GPU having 4 GPU device
   - node2: GPU having 4 GPU device
 - request:
-  - pod1: User 1 GPU
-  - pod2: User 1 GPU
+  - pod1: Use 1 GPU
+  - pod2: Use 1 GPU
 - scheduler result:
-  - pod1: scheduler to node1
-  - pod2: scheduler to node1
+  - pod1: scheduled to node1
+  - pod2: scheduled to node1

 #### Story 2

-node spread, use GPU cards from different nodes as much as possible, egs:
+node spread, use GPU cards from different nodes as much as possible, e.g.:

 - cluster resources:
   - node1: GPU having 4 GPU device
   - node2: GPU having 4 GPU device
 - request:
-  - pod1: User 1 GPU
-  - pod2: User 1 GPU
+  - pod1: Use 1 GPU
+  - pod2: Use 1 GPU
 - scheduler result:
-  - pod1: scheduler to node1
-  - pod2: scheduler to node2
+  - pod1: scheduled to node1
+  - pod2: scheduled to node2

 #### Story 3

-GPU binpack, use the same GPU card as much as possible, egs:
+GPU binpack, use the same GPU card as much as possible, e.g.:

 - cluster resources:
   - node1: GPU having 4 GPU device, they are GPU1,GPU2,GPU3,GPU4
 - request:
-  - pod1: User 1 GPU, gpucore is 20%, gpumem-percentage is 20%
-  - pod2: User 1 GPU, gpucore is 20%, gpumem-percentage is 20%
+  - pod1: Use 1 GPU, gpucore is 20%, gpumem-percentage is 20%
+  - pod2: Use 1 GPU, gpucore is 20%, gpumem-percentage is 20%
 - scheduler result:
-  - pod1: scheduler to node1, select GPU1 this device
-  - pod2: scheduler to node1, select GPU1 this device
+  - pod1: scheduled to node1, select GPU1
+  - pod2: scheduled to node1, select GPU1

 #### Story 4

-GPU spread, use different GPU cards when possible, egs:
+GPU spread, use different GPU cards when possible, e.g.:

 - cluster resources:
   - node1: GPU having 4 GPU device, they are GPU1,GPU2,GPU3,GPU4
 - request:
-  - pod1: User 1 GPU, gpucore is 20%, gpumem-percentage is 20%
-  - pod2: User 1 GPU, gpucore is 20%, gpumem-percentage is 20%
+  - pod1: Use 1 GPU, gpucore is 20%, gpumem-percentage is 20%
+  - pod2: Use 1 GPU, gpucore is 20%, gpumem-percentage is 20%
 - scheduler result:
-  - pod1: scheduler to node1, select GPU1 this device
-  - pod2: scheduler to node1, select GPU2 this device
+  - pod1: scheduled to node1, select GPU1
+  - pod2: scheduled to node1, select GPU2

 ## Design Details

@@ -89,45 +89,45 @@ GPU spread, use different GPU cards when possible, egs:

 Binpack mainly considers node resource usage. The more full the usage, the higher the score.
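+
+As an illustration, the score formula shown below can be written as a small Go helper (hypothetical code, not HAMi's implementation; the spread policy computes the same quantity, but the node with the smaller value is preferred):
+
+```go
+// nodeScore implements score = ((request + used) / allocatable) * 10.
+// With the worked example below: nodeScore(1, 3, 4) == 10 for Node1.
+func nodeScore(request, used, allocatable float64) float64 {
+	return (request + used) / allocatable * 10
+}
+```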
-``` -score: ((request + used) / allocatable) * 10 +```text +score: ((request + used) / allocatable) * 10 ``` 1. Binpack scoring information for Node 1 is as follows -``` +```text Node1 score: ((1+3)/4) * 10= 10 ``` 1. Binpack scoring information for Node 2 is as follows -``` +```text Node2 score: ((1+2)/4) * 10= 7.5 ``` -So, in `Binpack` policy, the selected node is `Node1`. +In `Binpack` policy, `Node1` is selected. #### Spread Spread mainly considers node resource usage. The less it is used, the higher the score. -``` -score: ((request + used) / allocatable) * 10 +```text +score: ((request + used) / allocatable) * 10 ``` 1. Spread scoring information for Node 1 is as follows -``` +```text Node1 score: ((1+3)/4) * 10= 10 ``` 1. Spread scoring information for Node 2 is as follows -``` +```text Node2 score: ((1+2)/4) * 10= 7.5 ``` -So, in `Spread` policy, the selected node is `Node2`. +In `Spread` policy, `Node2` is selected. ### GPU-scheduler-policy @@ -137,42 +137,42 @@ So, in `Spread` policy, the selected node is `Node2`. Binpack mainly focuses on the computing power and video memory usage of each card. The more it is used, the higher the score. -``` +```text score: ((request.core + used.core) / allocatable.core + (request.mem + used.mem) / allocatable.mem)) * 10 ``` 1. Binpack scoring information for GPU 1 is as follows -``` +```text GPU1 Score: ((20+10)/100 + (1000+2000)/8000)) * 10 = 6.75 ``` 1. Binpack scoring information for GPU 2 is as follows -``` +```text GPU2 Score: ((20+70)/100 + (1000+6000)/8000)) * 10 = 17.75 ``` -So, in `Binpack` policy, the selected node is `GPU2`. +In `Binpack` policy, `GPU2` is selected. #### Spread Spread mainly focuses on the computing power and video memory usage of each card. The less it is used, the higher the score. -``` +```text score: ((request.core + used.core) / allocatable.core + (request.mem + used.mem) / allocatable.mem)) * 10 ``` 1. Spread scoring information for GPU 1 is as follows -``` +```text GPU1 Score: ((20+10)/100 + (1000+2000)/8000)) * 10 = 6.75 ``` 1. Spread scoring information for GPU 2 is as follows -``` +```text GPU2 Score: ((20+70)/100 + (1000+6000)/8000)) * 10 = 17.75 ``` -So, in `Spread` policy, the selected node is `GPU1`. +In `Spread` policy, `GPU1` is selected. diff --git a/versioned_docs/version-v2.8.0/faq/faq.md b/versioned_docs/version-v2.8.0/faq/faq.md index f661e687..e4bef061 100644 --- a/versioned_docs/version-v2.8.0/faq/faq.md +++ b/versioned_docs/version-v2.8.0/faq/faq.md @@ -69,7 +69,7 @@ While HAMi's own priority serves a different, device-specific purpose (runtime p **Currently Supported**: - **Volcano**: Can be integrated with Volcano by using the [`volcano-vgpu-device-plugin`](https://github.com/Project-HAMi/volcano-vgpu-device-plugin) under the HAMi project for GPU resource scheduling and management. -- **Koordinator**: HAMi can also be integrated with Koordinator to provide end-to-end GPU sharing solutions. By deploying HAMi-core on nodes and configuring the appropriate labels and resource requests in Pods, Koordinator can use HAMi’s GPU isolation capabilities, allowing multiple Pods to share the same GPU and significantly improve GPU resource utilization. +- **Koordinator**: HAMi can also be integrated with Koordinator to provide end-to-end GPU sharing solutions. By deploying HAMi-core on nodes and configuring the appropriate labels and resource requests in Pods, Koordinator uses HAMi’s GPU isolation capabilities, allowing multiple Pods to share the same GPU and improve GPU resource utilization. 
For detailed configuration and usage instructions, refer to the Koordinator documentation: [Device Scheduling - GPU Share With HAMi](https://koordinator.sh/docs/user-manuals/device-scheduling-gpu-share-with-hami/) @@ -78,7 +78,7 @@ While HAMi's own priority serves a different, device-specific purpose (runtime p - **KubeVirt & Kata Containers**: Incompatible due to their reliance on virtualization for resource isolation, whereas HAMi’s GPU Device Plugin depends on direct GPU mounting into containers. Supporting these would require adapting the device allocation logic, balancing performance overhead and implementation complexity. HAMi prioritizes high-performance scenarios with direct GPU mounting and thus does not currently support these virtualization solutions. -## Why are there [HAMI-core Warn(...)] logs in my Pod's output? Can I disable them? +## Why are there [HAMi-core Warn(...)] logs in my Pod's output? Can I disable them? This is normal and can be ignored. If needed, disable the logs by setting the environment variable `LIBCUDA_LOG_LEVEL=0` in the container. @@ -150,7 +150,7 @@ Device Plugins can only report a single resource type. GPU memory and compute in - HAMi stores detailed GPU resource information (e.g., compute power, memory, model) as **node annotations** for use by the scheduler. - Example annotation: - ``` + ```yaml hami.io/node-nvidia-register: GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:GPU-b97db201-0442-8531-56d4-367e0c7d6edd,10,49140,100,... ``` @@ -158,8 +158,8 @@ Device Plugins can only report a single resource type. GPU memory and compute in **Why does the Node Capacity show `volcano.sh/vgpu-number` and `volcano.sh/vgpu-memory` when using `volcano-vgpu-device-plugin`?** -- volcano-vgpu-device-plugin creates **[three independent Device Plugin instances](https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/2bf6dfe37f7b716f05d0d3210f89898087c06d99/pkg/plugin/vgpu/mig-strategy.go#L65-L85)** , each registering with kubelet for volcano.sh/vgpu-number, volcano.sh/vgpu-memory, and volcano.sh/vgpu-cores resources respectively. After kubelet receives the registration, it automatically writes the resources into Capacity and Allocatable. -- **Note** : volcano.sh/vgpu-memory resource is subject to Kubernetes extended resources quantity limit (**maximum 32,767** ). For GPUs with large memory (e.g., A100 80GB), configure the `--gpu-memory-factor` parameter to avoid exceeding the limit. +- volcano-vgpu-device-plugin creates **[three independent Device Plugin instances](https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/2bf6dfe37f7b716f05d0d3210f89898087c06d99/pkg/plugin/vgpu/mig-strategy.go#L65-L85)**, each registering with kubelet for volcano.sh/vgpu-number, volcano.sh/vgpu-memory, and volcano.sh/vgpu-cores resources respectively. After kubelet receives the registration, it automatically writes the resources into Capacity and Allocatable. +- **Note:** volcano.sh/vgpu-memory resource is subject to Kubernetes extended resources quantity limit (**maximum 32,767**). For GPUs with large memory (e.g., A100 80GB), configure the `--gpu-memory-factor` parameter to avoid exceeding the limit. ## Why don’t some domestic vendors require a runtime for installation? 
diff --git a/versioned_docs/version-v2.8.0/get-started/deploy-with-helm.md b/versioned_docs/version-v2.8.0/get-started/deploy-with-helm.md index 12eacc19..36d6cdc2 100644 --- a/versioned_docs/version-v2.8.0/get-started/deploy-with-helm.md +++ b/versioned_docs/version-v2.8.0/get-started/deploy-with-helm.md @@ -2,52 +2,46 @@ title: Deploy HAMi using Helm --- -This guide will cover: +This guide covers: -- Configure nvidia container runtime in each GPU nodes -- Install HAMi using helm -- Launch a vGPU task -- Check if the corresponding device resources are limited inside container +- Configuring NVIDIA container runtime on each GPU node +- Deploying HAMi using Helm +- Launching a vGPU task +- Verifying container resource limits ## Prerequisites {#prerequisites} -- [Helm](https://helm.sh/zh/docs/) version v3+ -- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) version v1.16+ -- [CUDA](https://developer.nvidia.com/cuda-toolkit) version v10.2+ +- [Helm](https://helm.sh/zh/docs/) v3+ +- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) v1.16+ +- [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+ - [NVIDIA Driver](https://www.nvidia.cn/drivers/unix/) v440+ ## Installation {#installation} -### Configure nvidia-container-toolkit {#configure-nvidia-container-toolkit} +### 1. Configure nvidia-container-toolkit {#configure-nvidia-container-toolkit} -<summary> Configure nvidia-container-toolkit </summary> +Perform the following steps on all GPU nodes. -Execute the following steps on all your GPU nodes. +This guide assumes that NVIDIA drivers and the `nvidia-container-toolkit` are already installed, and that `nvidia-container-runtime` is set as the default low-level runtime. -This README assumes pre-installation of NVIDIA drivers and the -`nvidia-container-toolkit`. Additionally, it assumes configuration of the -`nvidia-container-runtime` as the default low-level runtime. +See [nvidia-container-toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). -Please see: [nvidia-container-toolkit install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) +The following example applies to Debian-based systems using Docker or containerd: -#### Example for debian-based systems with `Docker` and `containerd` {#example-for-debian-based-systems-with-docker-and-containerd} - -##### Install the `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit} +#### Install the `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit} ```bash -distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID) -curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - -curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \ - sudo tee /etc/apt/sources.list.d/libnvidia-container.list +curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ + && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ + sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ + sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit ``` -##### Configure `Docker` {#configure-docker} +#### Configure Docker {#configure-docker} -When running `Kubernetes` with `Docker`, edit the configuration file, -typically located at `/etc/docker/daemon.json`, to set up -`nvidia-container-runtime` as the default low-level runtime: +When running Kubernetes with Docker, edit the configuration file (usually `/etc/docker/daemon.json`) to set `nvidia-container-runtime` as the default runtime: ```json { @@ -61,17 +55,15 @@ typically located at `/etc/docker/daemon.json`, to set up } ``` -And then restart `Docker`: +Restart Docker: ```bash -sudo systemctl daemon-reload && systemctl restart docker +sudo systemctl daemon-reload && sudo systemctl restart docker ``` -##### Configure `containerd` {#configure-containerd} +#### Configure containerd {#configure-containerd} -When running `Kubernetes` with `containerd`, modify the configuration file -typically located at `/etc/containerd/config.toml`, to set up -`nvidia-container-runtime` as the default low-level runtime: +When using Kubernetes with containerd, modify the configuration file (usually `/etc/containerd/config.toml`) to set `nvidia-container-runtime` as the default runtime: ```toml version = 2 @@ -90,53 +82,49 @@ version = 2 BinaryName = "/usr/bin/nvidia-container-runtime" ``` -And then restart `containerd`: +Restart containerd: ```bash -sudo systemctl daemon-reload && systemctl restart containerd +sudo systemctl daemon-reload && sudo systemctl restart containerd ``` -#### 2. Label your nodes {#label-your-nodes} +### 2. Label your nodes {#label-your-nodes} -Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". -Without this label, the nodes cannot be managed by the HAMi scheduler. +Label your GPU nodes for HAMi scheduling with `gpu=on`. Nodes without this label cannot be managed by the scheduler. ```bash kubectl label nodes {nodeid} gpu=on ``` -#### 3. Deploy HAMi using Helm {#deploy-hami-using-helm} +### 3. Deploy HAMi using Helm {#deploy-hami-using-helm} -First, you need to check your Kubernetes version by using the following command: +Check your Kubernetes version: ```bash kubectl version ``` -Then, add the HAMi repo in helm +Add the Helm repository: ```bash helm repo add hami-charts https://project-hami.github.io/HAMi/ ``` -During installation, set the Kubernetes scheduler image version to match your -Kubernetes server version. For instance, if your cluster server version is -1.16.8, use the following command for deployment: +During installation, set the Kubernetes scheduler image to match your cluster version. 
For example, if your cluster version is 1.29.0:

```bash
helm install hami hami-charts/hami \
-  --set scheduler.kubeScheduler.imageTag=v1.16.8 \
+  --set scheduler.kubeScheduler.imageTag=v1.29.0 \
  -n kube-system
```

-If everything goes well, you will see both vgpu-device-plugin and vgpu-scheduler pods are in the Running state
+If successful, both `hami-device-plugin` and `hami-scheduler` pods should be in the `Running` state.

-### Demo {#demo}
+## Demo {#demo}

-#### 1. Submit demo task {#submit-demo-task}
+### 1. Submit demo task {#submit-demo-task}

-Containers can now request NVIDIA vGPUs using the `nvidia.com/gpu` resource
-type.
+Containers can now request NVIDIA vGPUs using the `nvidia.com/gpu` resource type.

```yaml
apiVersion: v1
@@ -146,23 +134,23 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
-          nvidia.com/gpu: 1 # requesting 1 vGPUs
-          nvidia.com/gpumem: 10240 # Each vGPU contains 10240m device memory (Optional,Integer)
+          nvidia.com/gpu: 1 # Request 1 vGPU
+          nvidia.com/gpumem: 10240 # Each vGPU provides 10240 MiB device memory (optional)
```

-#### 2. Verify in container resource control {#verify-in-container-resource-control}
+### 2. Verify container resource limits {#verify-in-container-resource-control}

-Execute the following query command:
+Run the following command:

```bash
kubectl exec -it gpu-pod -- nvidia-smi
```

-The result should be:
+Expected output:

```text
[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing.....

diff --git a/versioned_docs/version-v2.8.0/get-started/verify-hami.md b/versioned_docs/version-v2.8.0/get-started/verify-hami.md
index 98c038cb..00c76cc8 100644
--- a/versioned_docs/version-v2.8.0/get-started/verify-hami.md
+++ b/versioned_docs/version-v2.8.0/get-started/verify-hami.md
@@ -20,10 +20,11 @@ HAMi requires the `nvidia-container-toolkit` to be installed and set as the defa

### 1. Install nvidia-container-toolkit (Debian/Ubuntu example)

```bash
-distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
-curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
-  | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
-curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
+  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
+    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
+    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

@@ -107,11 +108,11 @@ helm install hami hami-charts/hami -n kube-system

kubectl get pods -n kube-system | grep hami
```

-Expected: Both `hami-scheduler` and `vgpu-device-plugin` pods should be in the `Running` state.
+Expected: Both `hami-scheduler` and `hami-device-plugin` pods should be in the `Running` state.

## Step 3: Launch and Verify a vGPU Task

-HAMi enforces fractional resource limits (vGPU):
+This step verifies that HAMi enforces fractional resource limits (vGPU).

### 1. 
Submit a vGPU demo task @@ -124,7 +125,7 @@ metadata: spec: containers: - name: ubuntu-container - image: ubuntu:18.04 + image: ubuntu:22.04 command: ["bash", "-c", "sleep 86400"] resources: limits: diff --git a/versioned_docs/version-v2.8.0/installation/aws-installation.md b/versioned_docs/version-v2.8.0/installation/aws-installation.md index 5714c127..4a55a823 100644 --- a/versioned_docs/version-v2.8.0/installation/aws-installation.md +++ b/versioned_docs/version-v2.8.0/installation/aws-installation.md @@ -30,8 +30,8 @@ You can customize the installation by adjusting the [configuration](../userguide ## Install with AWS Add-on -Before installing HAMi using the AWS add-on, you need to install **cert-manager**. -You can find it in the AWS Marketplace add-ons section and install it through the AWS Console. +Before installing HAMi using the AWS add-on, you need to install **cert-manager**. +You can find it in the AWS Marketplace add-ons section and install it through the AWS Console. You may also refer to the [AWS User Guide](https://docs.aws.amazon.com/eks/latest/userguide/lbc-manifest.html#lbc-cert) for installation instructions. Once cert-manager is installed, you can install the HAMi add-on from the AWS Marketplace. @@ -40,7 +40,7 @@ Once cert-manager is installed, you can install the HAMi add-on from the AWS Mar You can verify your installation with the following command: -``` +```bash kubectl get pods -n kube-system ``` @@ -50,7 +50,7 @@ If both the **hami-device-plugin** and **hami-scheduler** pods are in the `Runni ### NVIDIA Devices -[Use Exclusive GPU](https://project-hami.io/docs/userguide/nvidia-device/examples/use-exclusive-card) -[Allocate Specific Device Memory to a Container](https://project-hami.io/docs/userguide/nvidia-device/examples/allocate-device-memory) -[Allocate Device Core Resources to a Container](https://project-hami.io/docs/userguide/nvidia-device/examples/allocate-device-core) -[Assign Tasks to MIG Instances](https://project-hami.io/docs/userguide/nvidia-device/examples/dynamic-mig-example) +- [Use Exclusive GPU](https://project-hami.io/docs/userguide/nvidia-device/examples/use-exclusive-card) +- [Allocate Specific Device Memory to a Container](https://project-hami.io/docs/userguide/nvidia-device/examples/allocate-device-memory) +- [Allocate Device Core Resources to a Container](https://project-hami.io/docs/userguide/nvidia-device/examples/allocate-device-core) +- [Assign Tasks to MIG Instances](https://project-hami.io/docs/userguide/nvidia-device/examples/dynamic-mig-example) diff --git a/versioned_docs/version-v2.8.0/installation/how-to-use-hami-dra.md b/versioned_docs/version-v2.8.0/installation/how-to-use-hami-dra.md index 89769583..95f64063 100644 --- a/versioned_docs/version-v2.8.0/installation/how-to-use-hami-dra.md +++ b/versioned_docs/version-v2.8.0/installation/how-to-use-hami-dra.md @@ -1,10 +1,9 @@ --- -title: HAMi DRA +linktitle: HAMi DRA +title: HAMi DRA for Kubernetes translated: true --- -# HAMi DRA for Kubernetes - ## Introduction HAMi has provided support for K8s [DRA](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) (Dynamic Resource Allocation). @@ -29,7 +28,7 @@ Then install with the following command: helm install hami hami-charts/hami --set dra.enable=true -n hami-system ``` -> **Note:** *DRA mode is not compatible with traditional mode. Do not enable both at the same time.* +> **NOTICE:** *DRA mode is not compatible with traditional mode. 
Do not enable both at the same time.*

## Supported Devices

@@ -43,6 +42,6 @@ Please refer to the corresponding page to install the device driver.

HAMi DRA provides the same monitoring capabilities as the traditional model. When installing HAMi DRA, the monitoring service will be enabled by default. You can expose the monitoring service to the local environment via NodePort or add Prometheus collection to access monitoring metrics.

-You can view the monitoring metrics provided by HAMi DRA [here](../userguide/monitoring/device-allocation).
+You can view the monitoring metrics provided by HAMi DRA on the [Device Allocation Monitoring page](../userguide/monitoring/device-allocation).

For more information, please refer to [HAMi DRA monitor](https://github.com/Project-HAMi/HAMi-DRA/blob/main/docs/MONITOR.md).
diff --git a/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-ascend.md b/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-ascend.md
index 4f19b126..0e643d87 100644
--- a/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-ascend.md
+++ b/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-ascend.md
@@ -1,17 +1,16 @@
---
-title: Volcano Ascend vNPU
+linktitle: Volcano Ascend vNPU
+title: User Guide for Ascend Devices in Volcano
---

-# User Guide for Ascend Devices in Volcano
-
## Introduction

- Volcano supports vNPU feature for both Ascend 310 and Ascend 910 using the `ascend-device-plugin`. It also supports managing heterogeneous Ascend cluster(Cluster with multiple Ascend types, i.e. 910A,910B2,910B3,310p)
+ Volcano supports the vNPU feature for both Ascend 310 and Ascend 910 using the `ascend-device-plugin`. It also supports managing heterogeneous Ascend clusters (clusters with multiple Ascend types, i.e., 910A, 910B2, 910B3, 310P).

**Use case**:

-- NPU and vNPU cluster for Ascend 910 series
-- NPU and vNPU cluster for Ascend 310 series
+- NPU and vNPU cluster for Ascend 910 series
+- NPU and vNPU cluster for Ascend 310 series
- Heterogeneous Ascend cluster

This feature is only available in Volcano >= 1.14.

@@ -29,7 +28,7 @@

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
```

-Additional installation methods can be found [here](https://github.com/volcano-sh/volcano?tab=readme-ov-file#quick-start-guide).
+Additional installation methods can be found in [the Volcano Quick Start Guide](https://github.com/volcano-sh/volcano?tab=readme-ov-file#quick-start-guide).
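+Before labeling nodes, it can be worth confirming that the Volcano components came up; a quick check (using the `volcano-system` namespace from the commands above) is:
+
+```bash
+# The Volcano control-plane pods should all reach the Running state
+kubectl get pods -n volcano-system
+```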
### Label the Node with ascend=on

@@ -109,7 +108,7 @@ spec:

The supported Ascend chips and their `ResourceNames` are shown in the following table:

| ChipName | ResourceName | ResourceMemoryName |
-|-------|-------|-------|
+| ------- | ------- | ------- |
| 910A | huawei.com/Ascend910A | huawei.com/Ascend910A-memory |
| 910B2 | huawei.com/Ascend910B2 | huawei.com/Ascend910B2-memory |
| 910B3 | huawei.com/Ascend910B3 | huawei.com/Ascend910B3-memory |
diff --git a/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-vgpu.md b/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-vgpu.md
index a221b16f..b318ed92 100644
--- a/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-vgpu.md
+++ b/versioned_docs/version-v2.8.0/installation/how-to-use-volcano-vgpu.md
@@ -5,7 +5,7 @@ linktitle: Use Volcano vGPU

:::note

-You *DON'T* need to install HAMi when using volcano-vgpu, only use
+You *DON'T* need to install HAMi when using volcano-vgpu; using the
[Volcano vGPU device-plugin](https://github.com/Project-HAMi/volcano-vgpu-device-plugin) alone is enough.
It provides a device-sharing mechanism for NVIDIA devices managed by Volcano.
It is based on the [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) and uses [HAMi-core](https://github.com/Project-HAMi/HAMi-core) to support hard isolation of GPU cards.

@@ -95,7 +95,7 @@ status:

### Running vGPU Jobs

-vGPU can be requested by both set "volcano.sh/vgpu-number", "volcano.sh/vgpu-cores" and "volcano.sh/vgpu-memory" in resources.limits.
+vGPU can be requested by setting `volcano.sh/vgpu-number`, `volcano.sh/vgpu-cores` and `volcano.sh/vgpu-memory` in `resources.limits`.

```shell
cat <<EOF | kubectl apply -f -
diff --git a/versioned_docs/version-v2.8.0/installation/offline-installation.md b/versioned_docs/version-v2.8.0/installation/offline-installation.md
index 9c70a3d8..6f02bbcb 100644
--- a/versioned_docs/version-v2.8.0/installation/offline-installation.md
+++ b/versioned_docs/version-v2.8.0/installation/offline-installation.md
@@ -9,7 +9,7 @@ If your cluster can’t directly access the internet, you can install HAMi offli

You need to save the following images into a tarball file and copy it into the cluster.

```text
-projecthami/hami:{HAMi version}
+projecthami/hami:{HAMi version}
docker.io/jettech/kube-webhook-certgen:v1.5.2
liangjw/kube-webhook-certgen:v1.1.1
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:{your kubernetes version}

@@ -61,4 +61,4 @@ Run the following command:

kubectl get pods -n kube-system
```

-If you can see both the 'device-plugin' and 'scheduler' running, then HAMi is installed successfully,
+If you can see both the 'device-plugin' and 'scheduler' running, then HAMi is installed successfully.
diff --git a/versioned_docs/version-v2.8.0/installation/online-installation.md b/versioned_docs/version-v2.8.0/installation/online-installation.md
index f6129ed0..9f9563a7 100644
--- a/versioned_docs/version-v2.8.0/installation/online-installation.md
+++ b/versioned_docs/version-v2.8.0/installation/online-installation.md
@@ -24,10 +24,10 @@ kubectl version --short

## Installation

Ensure the `scheduler.kubeScheduler.imageTag` matches your Kubernetes server version. 
-For instance, if your cluster server is v1.16.8, use the following command to deploy: +For instance, if your cluster server is v1.29.0, use the following command to deploy: ```bash -helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system +helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.29.0 -n kube-system ``` Customize your installation by editing the [configurations](../userguide/configure.md). diff --git a/versioned_docs/version-v2.8.0/installation/prerequisites.md b/versioned_docs/version-v2.8.0/installation/prerequisites.md index 9f1d4507..bce55c7f 100644 --- a/versioned_docs/version-v2.8.0/installation/prerequisites.md +++ b/versioned_docs/version-v2.8.0/installation/prerequisites.md @@ -25,9 +25,10 @@ For details see [Installing the NVIDIA Container Toolkit](https://docs.nvidia.co #### Install the `nvidia-container-toolkit` ```bash -distribution=$(. /etc/os-release;echo $ID$VERSION_ID) -curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - -curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list +curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ + && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ + sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ + sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit ``` @@ -57,7 +58,7 @@ sudo nvidia-ctk runtime configure --runtime=containerd And then restart containerd: ```bash -sudo systemctl daemon-reload && systemctl restart containerd +sudo systemctl daemon-reload && sudo systemctl restart containerd ``` ### Label your nodes diff --git a/versioned_docs/version-v2.8.0/installation/uninstall.md b/versioned_docs/version-v2.8.0/installation/uninstall.md index 64bd069b..d322f0d0 100644 --- a/versioned_docs/version-v2.8.0/installation/uninstall.md +++ b/versioned_docs/version-v2.8.0/installation/uninstall.md @@ -2,14 +2,134 @@ title: Uninstall --- -The step to uninstall HAMi is simple: +## Prerequisites + +Before uninstalling HAMi, make sure you have: + +- Access to your Kubernetes cluster with `kubectl` +- Helm 3.x installed +- Administrative privileges to uninstall resources from `kube-system` namespace + +## Uninstalling HAMi + +### Basic Uninstallation + +The simplest way to uninstall HAMi is using Helm: ```bash helm uninstall hami -n kube-system ``` +This command will remove all HAMi resources, including: +- Scheduler pods +- Device plugin DaemonSets +- ConfigMaps and Secrets +- RBAC roles and bindings + :::note -Uninstallation won't kill running tasks. +**Important:** Uninstalling HAMi won't automatically stop or kill running GPU tasks. Container processes will continue to use GPU resources even after HAMi components are removed. ::: + +### Complete Cleanup (Optional) + +If you want to perform a complete cleanup and remove all HAMi-related resources, follow these steps: + +#### 1. Stop or reschedule running tasks + +Before uninstalling, consider gracefully stopping your GPU workloads: + +```bash +kubectl delete pods -l gpu-workload=true --all-namespaces --grace-period=30 +``` + +Or reschedule them to nodes without GPU requirements. + +#### 2. 
Verify no HAMi pods are running
+
+After uninstallation, verify that HAMi components are removed:
+
+```bash
+kubectl get pods -n kube-system | grep -i hami
+```
+
+This command should return no results.
+
+#### 3. Clean up HAMi ConfigMaps (if custom configuration was used)
+
+```bash
+kubectl delete configmap hami-scheduler-device -n kube-system --ignore-not-found
+```
+
+#### 4. Remove any PersistentVolumes created by HAMi (if applicable)
+
+```bash
+kubectl get pv | grep hami
+kubectl delete pv <pv-name> # if any HAMi-related PVs exist
+```
+
+## Verification
+
+To verify that HAMi has been completely removed from your cluster:
+
+```bash
+# Check if HAMi helm release is gone
+helm list -n kube-system | grep hami
+
+# Verify no HAMi pods exist
+kubectl get pods --all-namespaces | grep -i hami
+
+# Check for remaining HAMi resources
+kubectl get all -n kube-system -o wide | grep -i hami
+```
+
+All commands should return no results if uninstallation was successful.
+
+## Reinstalling HAMi
+
+To reinstall HAMi, follow the [installation guide](./online-installation.md).
+
+## Troubleshooting
+
+### HAMi pods stuck in terminating state
+
+If HAMi pods are stuck in the "Terminating" state, you can force delete them:
+
+```bash
+kubectl delete pods -n kube-system -l app=hami --grace-period=0 --force
+```
+
+Then try the uninstall command again.
+
+### Helm release not found error
+
+If you get an error that the helm release is not found, HAMi is already uninstalled:
+
+```bash
+Error: release named "hami" not found
+```
+
+You can verify this by checking the pods as shown in the verification section above.
+
+### GPU resources still in use after uninstallation
+
+If GPU resources are still allocated to pods after uninstalling HAMi, it means those pods are still running. You'll need to:
+
+1. Stop the pods that are using GPUs
+2. Check node status to see GPU availability
+3. Wait for pods to be rescheduled by Kubernetes
+
+```bash
+# Check which nodes have GPU resources
+kubectl describe nodes | grep -A 5 "Allocated resources"
+
+# Restart a node if necessary (use with caution)
+kubectl drain <node-name> --ignore-daemonsets
+kubectl uncordon <node-name>
+```
+
+## See Also
+
+- [Installation Guide](./online-installation.md)
+- [HAMi Documentation](../core-concepts/introduction.md)
diff --git a/versioned_docs/version-v2.8.0/installation/upgrade.md b/versioned_docs/version-v2.8.0/installation/upgrade.md
index 389bb458..038d5242 100644
--- a/versioned_docs/version-v2.8.0/installation/upgrade.md
+++ b/versioned_docs/version-v2.8.0/installation/upgrade.md
@@ -2,12 +2,305 @@
 title: Upgrade
 ---

-Upgrading HAMi to the latest version is a simple process, update the repository and restart the chart:
+## Overview
+
+Upgrading HAMi to a new version should be done carefully to avoid disrupting GPU workloads. This guide covers the upgrade process, compatibility considerations, and best practices.
+
+## Before You Upgrade
+
+### 1. Check Compatibility
+
+Verify that your target HAMi version is compatible with your current Kubernetes version and NVIDIA driver:
+
+```bash
+# Current HAMi version
+helm list -n kube-system | grep hami
+
+# Kubernetes version
+kubectl version
+
+# NVIDIA driver version (on GPU nodes)
+nvidia-smi | grep "Driver Version"
+```
+
+### 2. 
Backup Current Configuration + +Save your current HAMi configuration in case you need to rollback: + +```bash +# Backup current values +helm get values hami -n kube-system > hami-backup-values.yaml + +# Backup ConfigMaps +kubectl get configmap hami-scheduler-device -n kube-system -o yaml > hami-configmap-backup.yaml + +# Check current state +kubectl get all -n kube-system -l app=hami -o yaml > hami-state-backup.yaml +``` + +### 3. Clear Running Workloads + +**CRITICAL:** Before upgrading, stop or reschedule all GPU workloads. Upgrading with running tasks can cause segmentation faults and unpredictable behavior. + +**Gracefully drain GPU workloads:** + +```bash +# Find pods using GPU +kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.spec.containers[]?.resources.limits | select(. != null) | select(has("nvidia.com/gpu") or has("enflame.com/vgcu"))) | "\(.metadata.namespace) \(.metadata.name)"' + +# Delete or reschedule these pods +kubectl delete pods <pod-name> -n <namespace> --grace-period=30 +``` + +Or reschedule to non-GPU nodes if available: ```bash +# Add node selector to force scheduling away from GPU nodes +kubectl patch deployment <deployment-name> -n <namespace> -p '{"spec":{"template":{"spec":{"nodeSelector":{"gpu":"false"}}}}}' +``` + +### 4. Verify HAMi Components Are Running + +Before proceeding, ensure all HAMi components are healthy: + +```bash +# Check pod status +kubectl get pods -n kube-system -l app=hami + +# Check for errors +kubectl logs -n kube-system -l app=hami-scheduler --tail=50 +kubectl logs -n kube-system -l app=hami-device-plugin --tail=50 +``` + +## Upgrade Process + +### Standard Upgrade (Recommended) + +For most cases, use the standard upgrade process: + +```bash +# Update Helm repository +helm repo update hami-charts + +# Check available versions +helm search repo hami-charts/hami --versions + +# Get current values (preserve custom configuration) +helm get values hami -n kube-system > current-values.yaml + +# Perform upgrade +helm upgrade hami hami-charts/hami -n kube-system -f current-values.yaml +``` + +### In-Place Upgrade (If Using Existing Installation) + +If you don't have a custom values file, you can upgrade directly: + +```bash +helm repo update hami-charts +helm upgrade hami hami-charts/hami -n kube-system +``` + +### Uninstall and Reinstall (For Major Version Changes) + +For major version upgrades with breaking changes, uninstall first: + +```bash +# Uninstall current version helm uninstall hami -n kube-system + +# Update repository helm repo update + +# Reinstall with new version helm install hami hami-charts/hami -n kube-system ``` -> **WARNING:** *If you upgrade HAMi without clearing your submitted tasks, it may result in segmentation fault.* +## Post-Upgrade Verification + +After the upgrade completes, verify that HAMi is functioning correctly: + +### 1. Check Pod Status + +```bash +kubectl get pods -n kube-system -l app=hami +``` + +All pods should be in `Running` state. + +### 2. Verify Component Health + +```bash +# Check scheduler logs for errors +kubectl logs -n kube-system -l app=hami-scheduler | grep -i "error\|warning" | head -20 + +# Check device plugin logs +kubectl logs -n kube-system -l app=hami-device-plugin | grep -i "error" | head -20 +``` + +### 3. 
Test GPU Allocation
+
+Deploy a test pod to verify GPU resources are properly allocated:
+
+```bash
+kubectl apply -f - <<EOF
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-test-pod
+  namespace: default
+spec:
+  containers:
+  - name: test
+    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
+    command: ["nvidia-smi"]
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+  restartPolicy: Never
+EOF
+
+# Check if pod runs successfully
+kubectl logs gpu-test-pod
+
+# Clean up
+kubectl delete pod gpu-test-pod
+```
+
+### 4. Verify Node GPU Status
+
+```bash
+# Check GPU allocatable resources on each node
+kubectl describe nodes | grep -A 5 "Allocated resources"
+
+# Check HAMi annotations on nodes
+kubectl get nodes -o yaml | grep -A 10 "hami.io"
+```
+
+## Troubleshooting
+
+### Pods Stuck in Pending State
+
+If pods remain in the `Pending` state after upgrade:
+
+```bash
+# Check pod events
+kubectl describe pod <pod-name>
+
+# Check scheduler logs
+kubectl logs -n kube-system -l app=hami-scheduler | grep -i "pending\|error"
+
+# Verify GPU availability
+kubectl describe nodes | grep -i "gpu"
+```
+
+**Solution:** Restart the HAMi device plugin:
+
+```bash
+kubectl rollout restart daemonset/hami-device-plugin -n kube-system
+```
+
+### GPU Not Recognized After Upgrade
+
+If GPUs are not being detected:
+
+```bash
+# Verify NVIDIA driver is still loaded on nodes
+kubectl debug node/<node-name> -it --image=ubuntu
+
+# Inside the debug container, switch to the host root filesystem first
+chroot /host
+lspci | grep -i gpu
+nvidia-smi
+exit
+
+# Restart device plugin on affected node
+kubectl delete pods -n kube-system -l app=hami-device-plugin --field-selector spec.nodeName=<node-name>
+```
+
+### Segmentation Fault During Upgrade
+
+If you see segmentation faults:
+
+1. **Root Cause:** Running workloads during upgrade (as warned above)
+2. **Immediate Action:** Restart affected pods:
+   ```bash
+   kubectl delete pods <affected-pod-name> -n <namespace>
+   ```
+3. **Prevention:** Always clear workloads before upgrading
+
+### Helm Chart Configuration Changed
+
+If your custom values are no longer compatible:
+
+```bash
+# Compare old and new values
+helm show values hami-charts/hami > new-defaults.yaml
+diff current-values.yaml new-defaults.yaml
+
+# Update your values file with deprecated keys removed
+# Then retry the upgrade
+helm upgrade hami hami-charts/hami -n kube-system -f current-values.yaml
+```
+
+## Rollback Procedures
+
+If something goes wrong during upgrade, you can rollback to the previous version:
+
+### Rollback Using Helm
+
+```bash
+# View revision history
+helm history hami -n kube-system
+
+# Rollback to previous release
+helm rollback hami -n kube-system
+
+# Or rollback to specific revision
+helm rollback hami <revision-number> -n kube-system
+```
+
+### Manual Rollback
+
+If helm rollback doesn't work:
+
+```bash
+# Get previous HAMi version from backup
+helm install hami hami-charts/hami -n kube-system --version <previous-version> -f hami-backup-values.yaml
+
+# Or restore from kubectl backup
+kubectl apply -f hami-state-backup.yaml
+```
+
+## Version Compatibility Matrix
+
+| HAMi Version | Min Kubernetes | Max Kubernetes | NVIDIA Driver | Notes |
+|---|---|---|---|---|
+| v2.8.x | 1.23 | 1.28 | ≥450.x | Latest stable |
+| v2.7.x | 1.21 | 1.27 | ≥450.x | |
+| v2.6.x | 1.20 | 1.26 | ≥450.x | |
+
+For earlier versions, refer to the [releases page](https://github.com/Project-HAMi/HAMi/releases).
+
+## Best Practices
+
+1. **Test in Staging First** - Always test upgrades in a non-production environment first
+2. 
**Maintain Backups** - Keep ConfigMap and state backups before upgrading +3. **Schedule Maintenance Windows** - Upgrade during low-usage periods +4. **Monitor After Upgrade** - Watch logs and metrics for 30 minutes post-upgrade +5. **Document Changes** - Keep notes of what was upgraded and when +6. **Have Rollback Plan** - Always know how to quickly rollback if needed + +## Getting Help + +If you encounter issues during upgrade: + +1. Check the [troubleshooting guide](../troubleshooting/troubleshooting.md) +2. Review HAMi scheduler and device plugin logs +3. Check [GitHub issues](https://github.com/Project-HAMi/HAMi/issues) +4. Ask in the [community discussions](https://github.com/Project-HAMi/HAMi/discussions) + +## See Also + +- [Installation Guide](./online-installation.md) +- [Uninstallation Guide](./uninstall.md) +- [HAMi Documentation](../core-concepts/introduction.md) diff --git a/versioned_docs/version-v2.8.0/installation/webui-installation.md b/versioned_docs/version-v2.8.0/installation/webui-installation.md index 1ba18399..47ca9218 100644 --- a/versioned_docs/version-v2.8.0/installation/webui-installation.md +++ b/versioned_docs/version-v2.8.0/installation/webui-installation.md @@ -81,8 +81,8 @@ When troubleshooting, check the HAMi WebUI component logs. Run: ```bash -kubectl logs --namespace=hami deploy/my-hami-webui -c hami-webui-fe-oss -kubectl logs --namespace=hami deploy/my-hami-webui -c hami-webui-be-oss +kubectl logs --namespace=kube-system deploy/my-hami-webui -c hami-webui-fe-oss +kubectl logs --namespace=kube-system deploy/my-hami-webui -c hami-webui-be-oss ``` For more information, see [Pods](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#interacting-with-running-pods) and [Deployments](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#interacting-with-deployments-and-services). @@ -91,18 +91,8 @@ For more information, see [Pods](https://kubernetes.io/docs/reference/kubectl/ch To remove the Helm release, use: -`helm uninstall <RELEASE-NAME> <NAMESPACE-NAME>` - -```bash -helm uninstall my-hami-webui -n hami -``` - -This removes the resources associated with that release in the `hami` namespace. - -To delete the `hami` namespace (if you no longer need it): - ```bash -kubectl delete namespace hami +helm uninstall my-hami-webui -n kube-system ``` ## Related documentation diff --git a/versioned_docs/version-v2.8.0/key-features/device-sharing.md b/versioned_docs/version-v2.8.0/key-features/device-sharing.md index 19797b9e..f9c48a0e 100644 --- a/versioned_docs/version-v2.8.0/key-features/device-sharing.md +++ b/versioned_docs/version-v2.8.0/key-features/device-sharing.md @@ -2,7 +2,7 @@ title: Device sharing --- -HAMi provides device-sharing capabilities, enabling multiple tasks to share the same GPU, MLU, or NPU device, +HAMi lets multiple tasks share the same GPU, MLU, or NPU device, maximizing the utilization of heterogeneous AI computing resources. ## Device Sharing {#device-sharing} @@ -20,5 +20,4 @@ HAMi's device sharing enables: ## Benefits {#benefits} -By leveraging these features, HAMi enhances resource efficiency and security in shared-device environments. -Organizations can optimize their AI infrastructure for greater flexibility and performance while meeting diverse computational demands. +These features improve resource efficiency and isolation in shared-device environments across diverse device types and workloads. 
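+As a minimal illustrative sketch (not an exhaustive feature demo), a shared-GPU request on NVIDIA hardware can reuse the `nvidia.com/gpu` and `nvidia.com/gpumem` resource names shown in the deployment guide; other vendors expose their own resource names:
+
+```bash
+# Two pods like this can be scheduled onto the same physical GPU,
+# each capped at its requested device-memory slice.
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: shared-gpu-demo
+spec:
+  containers:
+  - name: worker
+    image: ubuntu:22.04
+    command: ["bash", "-c", "sleep 86400"]
+    resources:
+      limits:
+        nvidia.com/gpu: 1       # one vGPU slice
+        nvidia.com/gpumem: 4096 # 4096 MiB device memory for this slice
+EOF
+```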
diff --git a/versioned_docs/version-v2.8.0/releases.md b/versioned_docs/version-v2.8.0/releases.md index 43b122aa..e92b3ed4 100644 --- a/versioned_docs/version-v2.8.0/releases.md +++ b/versioned_docs/version-v2.8.0/releases.md @@ -4,7 +4,7 @@ title: Releases ## Release Notes and Assets -Release notes are available on GitHub at https://github.com/Project-HAMi/HAMi/releases +Release notes are available on GitHub at [https://github.com/Project-HAMi/HAMi/releases](https://github.com/Project-HAMi/HAMi/releases) ## Release Management @@ -37,7 +37,7 @@ Typically only critical fixes are selected for patch releases. Usually there wil ### Versioning -HAMi uses GitHub tags to manage versions. New releases and release candidates are published using the wildcard tag`v<major>.<minor>.<patch>`. +HAMi uses GitHub tags to manage versions. New releases and release candidates are published using the wildcard tag `v<major>.<minor>.<patch>`. Whenever a PR is merged into the master branch, CI will pull the latest code, generate an image and upload it to the mirror repository. The latest image of HAMi components can usually be downloaded online using the latest tag. @@ -60,7 +60,7 @@ Release branches and PRs are managed as follows: * For critical fixes that need to be included in a patch release, PRs should always be first merged to master and then cherry-picked to the release branch. PRs need to be guaranteed to have a release note written and these descriptions will be reflected in the next patch release. - The cherry-pick process of PRs is executed through the script. See usage [here](https://project-hami.io/docs/contributor/cherry-picks). + The cherry-pick process of PRs is executed through the script. See [cherry-pick usage](https://project-hami.io/docs/contributor/cherry-picks). * For complex changes, specially critical bugfixes, separate PRs may be required for master and release branches. * The milestone mark (for example v1.4) will be added to PRs which means changes in PRs are one of the contents of the corresponding release. * During PR review, the Assignee selection is used to indicate the reviewer. @@ -70,46 +70,9 @@ Release branches and PRs are managed as follows: A minor release will contain a mix of features, enhancements, and bug fixes. Major features follow the HAMi Design Proposal process. You can refer to -[here](https://github.com/Project-HAMi/HAMi/tree/master/docs/proposals/resource-interpreter-webhook) as a proposal example. +[this proposal example](https://github.com/Project-HAMi/HAMi/tree/master/docs/proposals/resource-interpreter-webhook). During the start of a release, there may be many issues assigned to the release milestone. The priorities for the release are discussed in the bi-weekly community meetings. As the release progresses several issues may be moved to the next milestone. -Hence, if an issue is important it is important to advocate its priority early in the release cycle. - -<!-- ### Release Artifacts - -The HAMi container images are available at `dockerHub`. -You can visit `https://hub.docker.com/r/karmada/<component_name>` to see the details of images. -For example, [here](https://hub.docker.com/r/karmada/karmada-controller-manager) for karmada-controller-manager. 
- -Since v1.2.0, the following artifacts are uploaded: - -* crds.tar.gz -* karmada-chart-v\<version_number\>.tgz -* karmadactl-darwin-amd64.tgz -* karmadactl-darwin-amd64.tgz.sha256 -* karmadactl-darwin-arm64.tgz -* karmadactl-darwin-arm64.tgz.sha256 -* karmadactl-linux-amd64.tgz -* karmadactl-linux-amd64.tgz.sha256 -* karmadactl-linux-arm64.tgz -* karmadactl-linux-arm64.tgz.sha256 -* kubectl-karmada-darwin-amd64.tgz -* kubectl-karmada-darwin-amd64.tgz.sha256 -* kubectl-karmada-darwin-arm64.tgz -* kubectl-karmada-darwin-arm64.tgz.sha256 -* kubectl-karmada-linux-amd64.tgz -* kubectl-karmada-linux-amd64.tgz.sha256 -* kubectl-karmada-linux-arm64.tgz -* kubectl-karmada-linux-arm64.tgz.sha256 -* Source code(zip) -* Source code(tar.gz) - -You can visit `https://github.com/Project-HAMi/HAMi/releases/download/v<version_number>/<artifact_name>` to download the artifacts above. - -For example: - -```shell -wget https://github.com/Project-HAMi/HAMi/releases/download/v1.3.0/karmadactl-darwin-amd64.tgz -``` --> \ No newline at end of file +If an issue is a priority, advocate for it early in the release cycle. diff --git a/versioned_docs/version-v2.8.0/troubleshooting/troubleshooting.md b/versioned_docs/version-v2.8.0/troubleshooting/troubleshooting.md index 29f6aa78..ffa4ff59 100644 --- a/versioned_docs/version-v2.8.0/troubleshooting/troubleshooting.md +++ b/versioned_docs/version-v2.8.0/troubleshooting/troubleshooting.md @@ -6,16 +6,16 @@ title: Troubleshooting - Currently, A100 MIG can be supported in only "none" and "mixed" modes. - Tasks with the "nodeName" field cannot be scheduled at the moment; please use "nodeSelector" instead. - Only computing tasks are currently supported; video codec processing is not supported. -- Since v2.3.10, HAMi has changed the `device-plugin` environment variable name from `NodeName` to `NODE_NAME`. +- Since v2.3.10, HAMi has changed the `device-plugin` environment variable name from `NodeName` to `NODE_NAME`. If you're using an image version earlier than v2.3.10, the `device-plugin` may fail to start. To resolve this issue, you have two options: - Manually edit the DaemonSet using `kubectl edit daemonset` and update the environment variable from `NodeName` to `NODE_NAME`. - - Upgrade the `device-plugin` image to the latest version using Helm: + - Upgrade the `device-plugin` image to the latest version using Helm: ```bash helm upgrade hami hami/hami -n kube-system - ``` + ``` This will apply the fix automatically. 
diff --git a/versioned_docs/version-v2.8.0/userguide/ascend-device/device-template.md b/versioned_docs/version-v2.8.0/userguide/ascend-device/device-template.md
index 9fc849bf..14022d30 100644
--- a/versioned_docs/version-v2.8.0/userguide/ascend-device/device-template.md
+++ b/versioned_docs/version-v2.8.0/userguide/ascend-device/device-template.md
@@ -14,6 +14,7 @@ vnpus:
    resourceMemoryName: huawei.com/Ascend910A-memory
    memoryAllocatable: 32768
    memoryCapacity: 32768
+    memoryFactor: 1
    aiCore: 30
    templates:
      - name: vir02
@@ -34,6 +35,7 @@ vnpus:
    resourceMemoryName: huawei.com/Ascend910B3-memory
    memoryAllocatable: 65536
    memoryCapacity: 65536
+    memoryFactor: 1
    aiCore: 20
    aiCPU: 7
    templates:
@@ -51,6 +53,7 @@ vnpus:
    resourceMemoryName: huawei.com/Ascend310P-memory
    memoryAllocatable: 21527
    memoryCapacity: 24576
+    memoryFactor: 1
    aiCore: 8
    aiCPU: 7
    templates:
diff --git a/versioned_docs/version-v2.8.0/userguide/ascend-device/enable-ascend-sharing.md b/versioned_docs/version-v2.8.0/userguide/ascend-device/enable-ascend-sharing.md
index 392b1183..2975435c 100644
--- a/versioned_docs/version-v2.8.0/userguide/ascend-device/enable-ascend-sharing.md
+++ b/versioned_docs/version-v2.8.0/userguide/ascend-device/enable-ascend-sharing.md
@@ -2,64 +2,124 @@
 title: Enable Ascend sharing
 ---

-Memory slicing is supported based on virtualization template, lease available template is automatically used. For detailed information, check [device-template](./device-template.md).
+The Ascend device plugin supports NPU-slicing for HAMi. It supports two modes:
+
+### 1. Template-based Hard Slicing (vNPU)
+
+Memory slicing is based on virtualization templates; the smallest available template that satisfies the request is selected automatically. For detailed information, check [device-template](./device-template.md).
+
+### 2. Soft Slicing with Runtime Interception (hami-vnpu-core)
+
+This mode implements a soft slicing mechanism based on `libvnpu.so` interception and `limiter` token scheduling, enabling fine-grained resource sharing.
+
+:::note
+- `hami-vnpu-core` currently only supports ARM platforms.
+- `hami-vnpu-core` currently only supports the HAMi scheduler.
+:::

## Prerequisites

-* Ascend device type: 910B, 910A, 310P
-* driver version >= 24.1.rc1
-* Ascend docker runtime
+- Ascend device type: 910B, 910A, 310P
+- [Ascend docker runtime](https://gitcode.com/Ascend/mind-cluster/tree/master/component/ascend-docker-runtime)
+
+**Additional requirements for Soft Slicing (hami-vnpu-core):**
+
+- **Ascend Driver Version**: ≥ 25.5
+- **Chip Mode**: enable `device-share` mode on Ascend chips for virtualization
+
+To enable `device-share` mode, run:
+
+```bash
+npu-smi set -t device-share -i <id> -d <value>
+```
+
+| Parameter | Description |
+| --------- | ----------- |
+| `id` | Device ID. The NPU ID found by running `npu-smi info -l`. |
+| `value` | Container enable status: `0` (Disabled, default) or `1` (Enabled). |

## Enabling Ascend-sharing support

-* Due to dependencies with HAMi, you need to set the arguments in the process of installing HAMi:
+Due to dependencies with HAMi, you need to set the following arguments when installing HAMi:

-  ```
-  devices.ascend.enabled=true
-  ```
+```text
+devices.ascend.enabled=true
+```

-  For more details, see 'devices' section in values.yaml. 
+For more details, see the `devices` section in `values.yaml`: - ```yaml - devices: - ascend: - enabled: true - image: "ascend-device-plugin:master" - imagePullPolicy: IfNotPresent - extraArgs: [] - nodeSelector: - ascend: "on" - tolerations: [] - resources: - - huawei.com/Ascend910A - - huawei.com/Ascend910A-memory - - huawei.com/Ascend910B - - huawei.com/Ascend910B-memory - - huawei.com/Ascend310P - - huawei.com/Ascend310P-memory - ``` +```yaml +devices: + ascend: + enabled: true + image: "ascend-device-plugin:master" + imagePullPolicy: IfNotPresent + extraArgs: [] + nodeSelector: + ascend: "on" + tolerations: [] + resources: + - huawei.com/Ascend910A + - huawei.com/Ascend910A-memory + - huawei.com/Ascend910B + - huawei.com/Ascend910B-memory + - huawei.com/Ascend310P + - huawei.com/Ascend310P-memory +``` + +If you require HAMi to automatically add the `runtimeClassName` configuration to Pods requesting Ascend resources (this is disabled by default), set `devices.ascend.runtimeClassName` to a non-empty string in HAMi's `values.yaml`, ensuring it matches the name of the `RuntimeClass` resource: + +```yaml +devices: + ascend: + runtimeClassName: ascend +``` + +## Deployment + +### 1. Label the Node + +```bash +kubectl label node {ascend-node} ascend=on +``` + +### 2. Deploy RuntimeClass + +```bash +kubectl apply -f https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/main/ascend-runtimeclass.yaml +``` + +### 3. Deploy ConfigMap + +This ConfigMap is used for global configurations such as resourceName, mode, and templates. By setting `hamiVnpuCore: true` at the top level, all nodes will enable soft-partitioning based on `hami-vnpu-core`. + +```bash +kubectl apply -f https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/main/ascend-device-configmap.yaml +``` + +:::note +You can skip this step if the ConfigMap already exists. +::: -* Tag Ascend node with the following command +#### (Optional) Node Custom Configuration - ```bash - kubectl label node {ascend-node} ascend=on - ``` +The `hami-device-node-config` ConfigMap allows you to enable or override `hami-vnpu-core` for specific nodes. Node-level settings take higher priority than the global `hamiVnpuCore` switch. -* Install [Ascend docker runtime](https://gitee.com/ascend/ascend-docker-runtime) +```bash +kubectl apply -f https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/main/ascend-device-node-configmap.yaml +``` -* [Download YAML for Ascend-vgpu-device-plugin](https://github.com/Project-HAMi/ascend-device-plugin/blob/master/build/ascendplugin-hami.yaml) from HAMi Project, and run the following commands to deploy +### 4. Deploy ascend-device-plugin - ```bash - wge https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/refs/heads/main/ascend-device-plugin.yaml - kubectl apply -f ascend-device-plugin.yaml - ``` +```bash +kubectl apply -f https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/main/ascend-device-plugin.yaml +``` -## Running Ascend jobs +## Running Ascend Jobs -### Ascend 910B +To exclusively use an entire card or request multiple cards, you only need to set the corresponding resourceName. If multiple tasks need to share the same NPU, set the resource request to `1` and configure the appropriate `ResourceMemoryName`. 
-Ascend 910Bs can now be requested by a container -using the `huawei.com/ascend910B` and `huawei.com/ascend910B-memory` resource type: +### Ascend 910B (Hard Slicing) ```yaml apiVersion: v1 @@ -73,14 +133,12 @@ spec: command: ["bash", "-c", "sleep 86400"] resources: limits: - huawei.com/Ascend910B: 1 # requesting 1 Ascend - huawei.com/Ascend910B-memory: 2000 # requesting 2000m device memory + huawei.com/Ascend910B: "1" + # if you don't specify Ascend910B-memory, it will use a whole NPU. + huawei.com/Ascend910B-memory: "4096" ``` -### Ascend 310P - -Ascend 310Ps can now be requested by a container -using the `huawei.com/ascend310P` and `huawei.com/ascend310P-memory` resource type: +### Ascend 310P (Hard Slicing) ```yaml apiVersion: v1 @@ -94,15 +152,77 @@ spec: command: ["bash", "-c", "sleep 86400"] resources: limits: - huawei.com/Ascend310P: 1 # requesting 1 Ascend - huawei.com/Ascend310P-memory: 1024 # requesting 1024m device memory + huawei.com/Ascend310P: "1" + huawei.com/Ascend310P-memory: "1024" +``` + +### Soft Slicing (hami-vnpu-core) + +Add the annotation `huawei.com/vnpu-mode: 'hami-core'` to enable soft slicing for a Pod. You can also request a percentage of compute cores using the `-core` resource: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: ascend-soft-slice-pod + annotations: + huawei.com/vnpu-mode: 'hami-core' +spec: + containers: + - name: npu_pod + image: ascendhub.huawei.com/public-ascendhub/ascend-mindspore:23.0.RC3-centos7 + command: ["bash", "-c", "sleep 86400"] + resources: + limits: + huawei.com/Ascend910B3: "1" + huawei.com/Ascend910B3-memory: "28672" + huawei.com/Ascend910B3-core: "40" # Request 40% of compute cores +``` + +### Multi-card Parallel Inference (Soft Slicing) + +The soft partitioning mechanism supports requesting multiple virtual devices within the same Pod. When performing multi-card parallel inference (e.g., using vLLM), the value of `--gpu-memory-utilization` must not exceed the ratio of the container's total memory limit to the sum of physical memory of the selected cards. + +**Example: 2-Card Tensor Parallelism (TP=2) with vLLM** + +Assume each physical card has 64Gi of memory, and you plan to use 32Gi on each of the 2 cards (totaling 64Gi): + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: vllm-npu-2card + annotations: + huawei.com/vnpu-mode: 'hami-core' +spec: + containers: + - name: vllm-container + image: vllm-ascend:latest + command: ["/bin/sh", "-c"] + args: + - | + vllm serve /model/Qwen3-0.6B \ + --host 0.0.0.0 \ + --port 8002 \ + --enforce-eager \ + --tensor-parallel-size 2 \ + --gpu-memory-utilization 0.5 + resources: + limits: + huawei.com/Ascend910B3: "2" + huawei.com/Ascend910B3-memory: "65536" + huawei.com/Ascend910B3-core: "50" ``` -### Notes +:::note +`--gpu-memory-utilization 0.5` = Total requested memory (64Gi) / Total physical memory (128Gi across 2 cards). +::: + +## Notes -1. Currently, the Ascend 910b supports only two sharding policies, which are 1/4 and 1/2. Ascend 310p supports 3 sharding policies: 1/7, 2/7, 4/7. The memory request of the job will automatically align with the most close sharding policy. In this example, the task will allocate 16384M device memory. +1. For hard slicing, Ascend 910B supports only two sharding policies: 1/4 and 1/2. Ascend 310P supports three sharding policies: 1/7, 2/7, 4/7. The memory request will automatically align with the closest sharding policy. -1. Ascend-sharing in init container is not supported. +1. Ascend-sharing in init containers is not supported. 1. 
`huawei.com/Ascend910B-memory` only works when `huawei.com/Ascend910B=1`. `huawei.com/Ascend310P-memory` only works when `huawei.com/Ascend310P=1`. diff --git a/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-310p.md b/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-310p.md index 20473f4e..74d39655 100644 --- a/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-310p.md +++ b/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-310p.md @@ -1,30 +1,27 @@ --- -title: Allocate 310p slice +title: Allocate 310P slice --- -To allocate a certain size of GPU device memory, you need only to assign `huawei.com/ascend310P-memory` besides `huawei.com/ascend310P`. +To allocate a certain size of device memory, assign `huawei.com/Ascend310P-memory` alongside `huawei.com/Ascend310P`. ```yaml apiVersion: v1 kind: Pod metadata: - name: ascend310p-pod + name: ascend310p-job spec: - tolerations: - - key: aaa - operator: Exists - effect: NoSchedule + runtimeClassName: ascend containers: - name: ubuntu-container image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04 command: ["bash", "-c", "sleep 86400"] resources: limits: - huawei.com/Ascend310P: 1 - huawei.com/Ascend310P-memory: 1024 + huawei.com/Ascend310P: 1 # requesting 1 NPU + huawei.com/Ascend310P-memory: 2000 # requesting 2000m device memory ``` -> **NOTICE:** *compute resource of Ascend310P is also limited with `huawei.com/Ascend310P-memory`, equals to the percentage of device memory allocated.* +> **NOTICE:** *Compute resource of Ascend310P is also limited with `huawei.com/Ascend310P-memory`, equal to the percentage of device memory allocated.* ## Select Device by UUID @@ -43,4 +40,3 @@ metadata: spec: # ... rest of pod spec ``` - diff --git a/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-910b.md b/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-910b.md index 1a2cd8a5..b504633a 100644 --- a/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-910b.md +++ b/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-910b.md @@ -1,26 +1,27 @@ --- -title: Allocate 910b slice +title: Allocate 910B slice --- -To allocate a certain size of GPU device memory, you need only to assign `huawei.com/ascend910-memory` besides `huawei.com/ascend910`. +To allocate a certain size of device memory, assign `huawei.com/Ascend910B-memory` alongside `huawei.com/Ascend910B`. 
 ```yaml
 apiVersion: v1
 kind: Pod
 metadata:
-  name: gpu-pod
+  name: ascend910b-job
 spec:
+  runtimeClassName: ascend
   containers:
     - name: ubuntu-container
      image: ascendhub.huawei.com/public-ascendhub/ascend-mindspore:23.0.RC3-centos7
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
-          huawei.com/Ascend910: 1 # requesting 1 NPU
-          huawei.com/Ascend910-memory: 2000 # requesting 2000m device memory
+          huawei.com/Ascend910B: 1 # requesting 1 NPU
+          huawei.com/Ascend910B-memory: 2000 # requesting 2000m device memory
 ```
 
-> **NOTICE:** *compute resource of Ascend910B is also limited with `huawei.com/Ascend910-memory`, equals to the percentage of device memory allocated.*
+> **NOTICE:** *Compute resources of the Ascend910B are also limited by `huawei.com/Ascend910B-memory`, in proportion to the percentage of device memory allocated.*
 
 ## Select Device by UUID
 
diff --git a/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-exclusive.md b/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-exclusive.md
index 6919918f..1501da3c 100644
--- a/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-exclusive.md
+++ b/versioned_docs/version-v2.8.0/userguide/ascend-device/examples/allocate-exclusive.md
@@ -2,7 +2,7 @@
 title: Allocate exclusive device
 ---
 
-To allocate a whole Ascend device, you need to only assign `huawei.com/ascend910` or `huawei.com/310p` without other fields.
+To allocate a whole Ascend device, set the corresponding resource name (e.g., `huawei.com/Ascend910B` or `huawei.com/Ascend310P`) without specifying a memory resource.
 
 ```yaml
 apiVersion: v1
@@ -10,11 +10,12 @@ kind: Pod
 metadata:
   name: gpu-pod1
 spec:
+  runtimeClassName: ascend
   containers:
     - name: ubuntu-container
      image: ascendhub.huawei.com/public-ascendhub/ascend-mindspore:23.0.RC3-centos7
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
-        huawei.com/Ascend910B: 2 # requesting 2 whole Ascend 910b devices
+        huawei.com/Ascend910B: 2 # requesting 2 whole Ascend 910B devices
 ```
diff --git a/versioned_docs/version-v2.8.0/userguide/awsneuron-device/enable-awsneuron-managing.md b/versioned_docs/version-v2.8.0/userguide/awsneuron-device/enable-awsneuron-managing.md
index 034786a0..84515e10 100644
--- a/versioned_docs/version-v2.8.0/userguide/awsneuron-device/enable-awsneuron-managing.md
+++ b/versioned_docs/version-v2.8.0/userguide/awsneuron-device/enable-awsneuron-managing.md
@@ -8,18 +8,18 @@ AWS Neuron devices are specialized hardware accelerators designed by AWS to opti
 
 HAMi now integrates with [my-scheduler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#deploy-neuron-scheduler-extension), providing the following capabilities:
 
-* **Neuron sharing**: HAMi now supports sharing on aws.amazon.com/neuron by allocating device cores(aws.amazon.com/neuroncore), each Neuron core equals to 1/2 neuron device.
+* **Neuron sharing**: HAMi now supports sharing on aws.amazon.com/neuron by allocating device cores (aws.amazon.com/neuroncore); each Neuron core equals 1/2 of a neuron device.
For details about how these devices are connected, refer to [Container Device Allocation On Different Instance Types](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#container-device-allocation-on-different-instance-types).
+* **Topology awareness**: When allocating multiple aws-neuron devices in a container, HAMi ensures these devices are connected to minimize the communication cost between neuron devices. For details about how these devices are connected, refer to [Container Device Allocation On Different Instance Types](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#container-device-allocation-on-different-instance-types).
 
 ## Prerequisites
 
 * Neuron-device-plugin
 * EC2 instance of type `Inf` or `Trn`
 
-## Enabling GCU-sharing Support
+## Enabling Neuron-sharing Support
 
-* Deploy neuron-device-plugin on EC2 neuron nodes according to document the AWS document: [Neuro Device Plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#neuron-device-plugin)
+* Deploy neuron-device-plugin on EC2 neuron nodes according to the AWS document: [Neuron Device Plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#neuron-device-plugin)
 
 * Deploy HAMi
 
@@ -29,13 +29,13 @@ helm install hami hami-charts/hami -n kube-system
 
 ## Device Granularity
 
-HAMi divides each AWS Neuron device into 2 units for resource allocation. You could allocate half of neuron device.
+HAMi divides each AWS Neuron device into 2 units for resource allocation. You can allocate half of a neuron device.
 
 ### Neuron Allocation
 
 * Each unit of `aws.amazon.com/neuroncore` represents 1/2 of neuron device
 * Don't assign `aws.amazon.com/neuron` like other devices, only assigning `aws.amazon.com/neuroncore` is enough
-* When the number of `aws.amazon.com/neuroncore`>=2, it equals to setting `awa.amazon.com/neuron=1/2 * neuronCoreNumber`
+* When the number of `aws.amazon.com/neuroncore`>=2, it is equivalent to setting `aws.amazon.com/neuron=1/2 * neuronCoreNumber`
 * The topology awareness scheduling is automatically enabled when tasks require multiple neuron devices.
 
 ## Running Neuron jobs
@@ -109,7 +109,7 @@ spec:
   # ... rest of pod spec
 ```
 
-> **NOTE:** The device ID format is `{node-name}-AWSNeuron-{index}`. You can find the available device IDs in the node annotations.
+> **NOTICE:** The device ID format is `{node-name}-AWSNeuron-{index}`. You can find the available device IDs in the node annotations.
 
 ### Finding Device UUIDs
 
@@ -129,6 +129,6 @@ Look for annotations containing device information in the node status.
 
 ## Notes
 
-1. AWS Neuron sharing takes effect only for containers that apply for one AWS Neuron device(i.e `aws.amazon.com/neuroncore`=1 ).
+1. AWS Neuron sharing takes effect only for containers that apply for one AWS Neuron device (i.e., `aws.amazon.com/neuroncore`=1).
 
 2. `neuron-ls` inside container shows the total device memory, which is NOT a bug. Device memory will be properly limited when tasks are running.
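+
+As a convenience for the "Finding Device UUIDs" step above, you can dump the node annotations and filter for Neuron entries. This is a hedged sketch: the exact annotation keys vary by HAMi version, so treat the `grep` pattern as an assumption and adjust it to what your nodes actually expose:
+
+```bash
+# List node annotations and keep Neuron-related entries
+# (annotation key names may differ across HAMi versions)
+kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | tr ',' '\n' | grep -i neuron
+```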
diff --git a/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-core.md b/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-core.md index 51bd8d02..65aed7aa 100644 --- a/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-core.md +++ b/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-core.md @@ -2,7 +2,7 @@ title: Allocate AWS Neuron core --- -To allocate 1/2 neuron device, you could allocate a neuroncore, like the example below: +To allocate 1/2 of a neuron device, use `aws.amazon.com/neuroncore`, as shown below: ```yaml apiVersion: v1 @@ -23,4 +23,4 @@ spec: requests: cpu: "1" memory: 1Gi -``` \ No newline at end of file +``` diff --git a/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-device.md b/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-device.md index 753685eb..e4203710 100644 --- a/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-device.md +++ b/versioned_docs/version-v2.8.0/userguide/awsneuron-device/examples/allocate-neuron-device.md @@ -1,8 +1,8 @@ --- -title: Allocate AWS Neuron core +title: Allocate AWS Neuron device --- -To allocate one or more aws neuron devices exclusively, you could allocate using `aws.amazon.com/neuron` +To allocate one or more AWS Neuron devices exclusively, use `aws.amazon.com/neuron` ```yaml apiVersion: v1 @@ -23,4 +23,4 @@ spec: requests: cpu: "1" memory: 1Gi -``` \ No newline at end of file +``` diff --git a/versioned_docs/version-v2.8.0/userguide/cambricon-device/enable-cambricon-mlu-sharing.md b/versioned_docs/version-v2.8.0/userguide/cambricon-device/enable-cambricon-mlu-sharing.md index 874ec41e..28e7ea1e 100644 --- a/versioned_docs/version-v2.8.0/userguide/cambricon-device/enable-cambricon-mlu-sharing.md +++ b/versioned_docs/version-v2.8.0/userguide/cambricon-device/enable-cambricon-mlu-sharing.md @@ -59,7 +59,7 @@ To request shared MLU resources in a container, use the following resource types * `cambricon.com/mlu.smlu.vmemory` * `cambricon.com/mlu.smlu.vcore` -Here is an YAML example: +Here is a YAML example: ```yaml apiVersion: apps/v1 @@ -80,7 +80,7 @@ spec: spec: containers: - name: c-1 - image: ubuntu:18.04 + image: ubuntu:22.04 command: ["sleep"] args: ["100000"] resources: @@ -97,7 +97,7 @@ spec: Pods with the `cambricon.com/mlumem` resource specified in an init container will not be scheduled. 2. **Resource constraints only apply to shared mode (`vmlu=1`).** - + The `cambricon.com/mlu.smlu.vmemory` and `cambricon.com/mlu.smlu.vcore` resources are only effective when `cambricon.com/vmlu` is set to `1`. If `vmlu > 1`, a full MLU device will be allocated regardless of `vmemory` and `vcore` values. 
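+
+Before submitting shared-MLU workloads, it can help to confirm that the sMLU resources are actually registered on the node. A minimal sketch, assuming `jq` is installed on your workstation; the resource names follow the defaults listed above:
+
+```bash
+# Show the cambricon resources the node advertises
+# (expect cambricon.com/vmlu, cambricon.com/mlu.smlu.vmemory, cambricon.com/mlu.smlu.vcore)
+kubectl get node <node-name> -o json | jq '.status.allocatable' | grep cambricon
+```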
diff --git a/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-core-and-memory.md b/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-core-and-memory.md index 29a225cd..921fd673 100644 --- a/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-core-and-memory.md +++ b/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-core-and-memory.md @@ -24,7 +24,7 @@ spec: spec: containers: - name: c-1 - image: ubuntu:18.04 + image: ubuntu:22.04 command: ["sleep"] args: ["100000"] resources: diff --git a/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-exclusive.md b/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-exclusive.md index f188e283..73976eb4 100644 --- a/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-exclusive.md +++ b/versioned_docs/version-v2.8.0/userguide/cambricon-device/examples/allocate-exclusive.md @@ -23,7 +23,7 @@ spec: spec: containers: - name: c-1 - image: ubuntu:18.04 + image: ubuntu:22.04 command: ["sleep"] args: ["100000"] resources: diff --git a/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-memory-usage.md b/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-memory-usage.md index 819203b9..c3259856 100644 --- a/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-memory-usage.md +++ b/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-memory-usage.md @@ -10,7 +10,7 @@ This field is optional. Each unit of `cambricon.com/mlu.smlu.vmemory` represents resources: limits: cambricon.com/vmlu: 1 # requesting 1 MLU - cambricon.com/mlu.smlu.vmemory: "20" # Each GPU contains 20% device memory + cambricon.com/mlu.smlu.vmemory: "20" # Each MLU contains 20% device memory ``` :::note diff --git a/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-type-to-use.md b/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-type-to-use.md index 3590d151..57c1cb37 100644 --- a/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-type-to-use.md +++ b/versioned_docs/version-v2.8.0/userguide/cambricon-device/specify-device-type-to-use.md @@ -6,7 +6,7 @@ To enable device type specification, you need to add the `--enable-device-type` When this option is enabled, different MLU types will expose distinct resource names. For example: -- `cambricon.com/mlu370.smlu.vcore` +- `cambricon.com/mlu370.smlu.vcore` - `cambricon.com/mlu370.smlu.vmemory` This allows fine-grained control over resource allocation based on MLU model types. 
@@ -16,6 +16,6 @@ You can specify these resources in your container specification like this:
 resources:
   limits:
     cambricon.com/vmlu: 1 # requesting 1 MLU
-    cambricon.com/mlu370.smlu.vmemory: "20" # Each GPU contains 20% device memory
-    cambricon.com/mlu370.smlu.vcore: "10" # Each GPU contains 10% compute cores
+    cambricon.com/mlu370.smlu.vmemory: "20" # Each MLU contains 20% device memory
+    cambricon.com/mlu370.smlu.vcore: "10" # Each MLU contains 10% compute cores
 ```
diff --git a/versioned_docs/version-v2.8.0/userguide/configure.md b/versioned_docs/version-v2.8.0/userguide/configure.md
index 7cbcf6e4..9b45fd4d 100644
--- a/versioned_docs/version-v2.8.0/userguide/configure.md
+++ b/versioned_docs/version-v2.8.0/userguide/configure.md
@@ -12,12 +12,12 @@ All the configurations listed below are managed within the hami-scheduler-device
 You can update these configurations using one of the following methods:
 
 1. Directly edit the ConfigMap: If HAMi has already been successfully installed, you can manually update
-   the hami-scheduler-device ConfigMap using the kubectl edit command to manually update the hami-scheduler-device ConfigMap.
+   the hami-scheduler-device ConfigMap using the kubectl edit command.
 
    ```bash
   kubectl edit configmap hami-scheduler-device -n <namespace>
   ```
-
+
   After making changes, restart the related HAMi components to apply the updated configurations.
 
 2. Modify Helm Chart: Update the corresponding values in the
@@ -33,6 +33,7 @@ You can update these configurations using one of the following methods:
   | `nvidia.defaultMem` | Integer | The default device memory of the current job, in MB. '0' means using 100% of the device memory. | `0` |
   | `nvidia.defaultCores` | Integer | Percentage of GPU cores reserved for the current job. `0` allows any GPU with enough memory; `100` reserves the entire GPU exclusively. | `0` |
   | `nvidia.defaultGPUNum` | Integer | Default number of GPUs. If set to `0`, it will be filtered out. If `nvidia.com/gpu` is not set in the pod resource, the webhook checks `nvidia.com/gpumem`, `resource-mem-percentage`, and `nvidia.com/gpucores`, adding `nvidia.com/gpu` with this default value if any of them are set. | `1` |
+  | `nvidia.memoryFactor` | Integer | During resource requests, the requested value of `nvidia.com/gpumem` is multiplied by this factor. If `mock-device-plugin` is deployed, the `nvidia.com/gpumem` value in `node.status.capacity` is scaled by the same factor. | `1` |
   | `nvidia.resourceCountName` | String | vGPU number resource name. | `"nvidia.com/gpu"` |
   | `nvidia.resourceMemoryName` | String | vGPU memory size resource name. | `"nvidia.com/gpumem"` |
   | `nvidia.resourceMemoryPercentageName` | String | vGPU memory fraction resource name. | `"nvidia.com/gpumem-percentage"` |
@@ -41,12 +42,11 @@ You can update these configurations using one of the following methods:
 
 ## Node Configs: ConfigMap
 
-HAMi allows configuring per-node behavior for device plugin. Edit
+HAMi allows configuring per-node behavior for the device plugin. Edit the ConfigMap:
 
 ```sh
 kubectl -n <namespace> edit cm hami-device-plugin
 ```
-
 * `name`: Name of the node.
 * `operatingmode`: Operating mode of the node, can be "hami-core" or "mig", default: "hami-core".
 * `devicememoryscaling`: Overcommit ratio of device memory.
@@ -55,14 +55,14 @@ kubectl -n <namespace> edit cm hami-device-plugin
 * `filterdevices`: Devices that are not registered to HAMi.
   * `uuid`: UUIDs of devices to ignore
   * `index`: Indexes of devices to ignore.
- * A device is ignored by HAMi if it's in `uuid` or `index` list.
+ * A device is ignored by HAMi if it is in the `uuid` or `index` list.
 
 ## Chart Configs: arguments
 
 You can customize your vGPU support by setting the following arguments using `--set`, for example
 
 ```bash
-helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=5 ...
+helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=5 -n kube-system
 ```
 
 | Argument | Type | Description | Default |
|----------|------|-------------|---------|
@@ -71,7 +71,7 @@ helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=5 ...
 | `scheduler.defaultSchedulerPolicy.nodeSchedulerPolicy` | String | GPU node scheduling policy: `"binpack"` allocates jobs to the same GPU node as much as possible. `"spread"` allocates jobs to different GPU nodes as much as possible. | `"binpack"` |
 | `scheduler.defaultSchedulerPolicy.gpuSchedulerPolicy` | String | GPU scheduling policy: `"binpack"` allocates jobs to the same GPU as much as possible. `"spread"` allocates jobs to different GPUs as much as possible. | `"spread"` |
 
-## Pod configs: annotations
+## Pod Configs: Annotations
 
 | Argument | Type | Description | Example |
 |----------|------|-------------|---------|
@@ -83,7 +83,7 @@ helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=5 ...
 | `hami.io/gpu-scheduler-policy` | String | GPU scheduling policy: `"binpack"` allocates the pod to the same GPU card for execution. `"spread"` allocates the pod to different GPU cards for execution. | `"binpack"` or `"spread"` |
 | `nvidia.com/vgpu-mode` | String | The type of vGPU instance this pod wishes to use. | `"hami-core"` or `"mig"` |
 
-## Container configs: env
+## Container Configs: Env
 
 | Argument | Type | Description | Default |
 |----------|------|-------------|---------|
diff --git a/versioned_docs/version-v2.8.0/userguide/device-supported.md b/versioned_docs/version-v2.8.0/userguide/device-supported.md
index 446a8e39..c80c4fdb 100644
--- a/versioned_docs/version-v2.8.0/userguide/device-supported.md
+++ b/versioned_docs/version-v2.8.0/userguide/device-supported.md
@@ -4,8 +4,8 @@ title: Device supported by HAMi
 
 The table below lists the devices supported by HAMi:
 
-| Type | Manufactor | Models | MemoryIsolation | CoreIsolation | MultiCard Support |
-|------|------------|------|-----------------|---------------|-------------------|
+| Type | Manufacturer | Models | MemoryIsolation | CoreIsolation | MultiCard Support |
+| ---- | ---------- | ------ | --------------- | ------------- | ----------------- |
 | GPU | NVIDIA | All | Yes | Yes | Yes |
 | MLU | Cambricon | 370, 590 | Yes | Yes | No |
 | DCU | Hygon | Z100, Z100L | Yes | Yes | No |
@@ -14,5 +14,6 @@ The table below lists the devices supported by HAMi:
 | GPU | Mthreads | MTT S4000 | Yes | Yes | No |
 | GPU | Metax | MXC500 | Yes | Yes | No |
 | GCU | Enflame | S60 | Yes | Yes | No |
-| XPU | Kunlunxin | P800 | Yes | Yes | No |
+| XPU | Kunlunxin | P800 | Yes | Yes | No |
+| GPU | Vastai | VA16 | Yes | Yes | No |
 | DPU | Teco | Checking | In progress | In progress | No |
diff --git a/versioned_docs/version-v2.8.0/userguide/enflame-device/enable-enflame-gcu-sharing.md b/versioned_docs/version-v2.8.0/userguide/enflame-device/enable-enflame-gcu-sharing.md
index eeee1ea2..995a9741 100644
--- a/versioned_docs/version-v2.8.0/userguide/enflame-device/enable-enflame-gcu-sharing.md
+++ b/versioned_docs/version-v2.8.0/userguide/enflame-device/enable-enflame-gcu-sharing.md
@@ -5,19 +5,19 @@ title: Enable Enflame GPU Sharing
 
 ## Introduction
 
-**HAMi
 now supports sharing on enflame.com/gcu(i.e S60) by implementing most device-sharing features as NVIDIA GPUs**, including:
+**HAMi now supports sharing on enflame.com/gcu (i.e., S60) by implementing most of the same device-sharing features as NVIDIA GPUs**, including:
 
 **GCU sharing**: Each task can allocate a portion of GCU instead of a whole GCU card, thus GCU can be shared among multiple tasks.
 
-**Device Memory and Core Control**: GCUs can be allocated with certain percentage of device memory and core, HAMi ensures it does not exceed the boundary.
+**Device Memory and Core Control**: GCUs can be allocated with a certain percentage of device memory and core, with hard limits enforced to prevent exceeding the allocation.
 
 **Device UUID Selection**: You can specify which GCU devices to use or exclude using annotations.
 
-**Very Easy to use**: You don't need to modify your task yaml to use the HAMi scheduler. All your GPU jobs will be automatically supported after installation.
+**No task YAML changes required**: All GCU jobs are automatically supported after installation.
 
 ## Prerequisites
 
-* Enflame gcushare-device-plugin >= 2.1.6 (please consult your device provider, gcushare has two components: gcushare-scheduler-plugin and gcushare-device-plugin, only gcushare-device-plugin is needed here )
+* Enflame gcushare-device-plugin >= 2.1.6 (please consult your device provider, gcushare has two components: gcushare-scheduler-plugin and gcushare-device-plugin; only gcushare-device-plugin is needed here)
 * driver version >= 1.2.3.14
 * kubernetes >= 1.24
 * enflame-container-toolkit >=2.0.50
@@ -27,8 +27,7 @@ title: Enable Enflame GPU Sharing
 * Deploy gcushare-device-plugin on enflame nodes (Please consult your device provider to acquire its package and document)
 
> **NOTICE:** *Install only gcushare-device-plugin, don't install the gcushare-scheduler-plugin package.*
-
-> **NOTE:** The default resource names are:
+> **NOTICE:** The default resource names are:
 >
 > * `enflame.com/vgcu` for GCU count, only support 1 now.
 > * `enflame.com/vgcu-percentage` for the percentage of memory and cores in a gcu slice.
@@ -55,7 +54,7 @@ HAMi divides each Enflame GCU into 100 units for resource allocation. When you r
 ## Running Enflame jobs
 
 Enflame GCUs can now be requested by a container
-using the `enflame.com/vgcu` and `enflame.com/vgcu-percentage` resource type: 
+using the `enflame.com/vgcu` and `enflame.com/vgcu-percentage` resource types:
 
 ```yaml
 apiVersion: v1
@@ -67,7 +66,7 @@ spec:
   terminationGracePeriodSeconds: 0
   containers:
     - name: pod-gcu-example1
-      image: ubuntu:18.04
+      image: ubuntu:22.04
       imagePullPolicy: IfNotPresent
       command:
       - sleep
@@ -79,7 +78,7 @@ spec:
         enflame.com/vgcu: 1
         enflame.com/vgcu-percentage: 22
 ```
 
-> **NOTICE:** *You can find more examples in [examples/enflame folder](https://github.com/Project-HAMi/HAMi/tree/release-v2.6/examples/enflame/)*
+> **NOTICE:** *You can find more examples in [examples/enflame folder](https://github.com/Project-HAMi/HAMi/tree/master/examples/enflame/)*
 
 ## Device UUID Selection
 
@@ -99,7 +98,7 @@ spec:
   # ... rest of pod spec
 ```
 
-> **NOTE:** The device ID format is `{node-name}-enflame-{index}`. You can find the available device IDs in the node status.
+> **NOTICE:** The device ID format is `{node-name}-enflame-{index}`. You can find the available device IDs in the node status.
 
 ### Finding Device UUIDs
 
@@ -119,7 +118,7 @@ Look for annotations containing device information in the node status.
 
 ## Notes
 
-1. GCUshare takes effect only for containers that apply for one GCU(i.e enflame.com/vgcu=1 ).
+1.
 GCUshare takes effect only for containers that apply for one GCU (i.e., enflame.com/vgcu=1).
 
 2. Multiple GCU allocation in one container is not supported yet
 
diff --git a/versioned_docs/version-v2.8.0/userguide/hami-webui-user-guide.md b/versioned_docs/version-v2.8.0/userguide/hami-webui-user-guide.md
index 1acf4763..ff862e3b 100644
--- a/versioned_docs/version-v2.8.0/userguide/hami-webui-user-guide.md
+++ b/versioned_docs/version-v2.8.0/userguide/hami-webui-user-guide.md
@@ -4,7 +4,7 @@ linktitle: HAMi WebUI
 ---
 
 [HAMi WebUI](https://github.com/Project-HAMi/HAMi-WebUI) is the visual interface provided by HAMi for unified monitoring and analysis of GPU resources and workloads.
-With a unified visualization, users can intuitively view the cluster running status, including GPU usage, node information, and workload status, so that they can more efficiently understand resource distribution and the overall system state.
+It provides a unified view of cluster GPU usage, node information, and workload status.
 
 ## Core capabilities
 
@@ -77,5 +77,4 @@ HAMi WebUI provides a visualization solution centered on GPU resources, enabling
 - Analyze resource usage
 - Locate issues and optimize resource usage
 
-In production environments, WebUI can significantly reduce the complexity of GPU operations and management.
-
+In production environments, WebUI reduces the overhead of GPU operations and management.
diff --git a/versioned_docs/version-v2.8.0/userguide/hygon-device/enable-hygon-dcu-sharing.md b/versioned_docs/version-v2.8.0/userguide/hygon-device/enable-hygon-dcu-sharing.md
index ff4d9cfe..76ec2935 100644
--- a/versioned_docs/version-v2.8.0/userguide/hygon-device/enable-hygon-dcu-sharing.md
+++ b/versioned_docs/version-v2.8.0/userguide/hygon-device/enable-hygon-dcu-sharing.md
@@ -8,9 +8,9 @@ title: Enable Hygon DCU sharing
 
 **DCU sharing**: Each task can allocate a portion of DCU instead of a whole DCU card, thus DCU can be shared among multiple tasks.
 
-**Device Memory Control**: DCUs can be allocated with certain device memory size on certain type(i.e Z100) and have made it that it does not exceed the boundary.
+**Device Memory Control**: DCUs can be allocated with a specific device memory size on certain types (e.g., Z100), with hard limits enforced to prevent exceeding the allocation.
 
-**Device compute core limitation**: DCUs can be allocated with certain percentage of device core(i.e hygon.com/dcucores:60 indicate this container uses 60% compute cores of this device)
+**Device compute core limitation**: DCUs can be allocated with a certain percentage of device cores (i.e., hygon.com/dcucores:60 indicates this container uses 60% of the compute cores of this device)
 
 **DCU Type Specification**: You can specify which type of DCU to use or to avoid for a certain task, by setting "hygon.com/use-dcutype" or "hygon.com/nouse-dcutype" annotations.
 
@@ -21,12 +21,12 @@ title: Enable Hygon DCU sharing
 
 ## Enabling DCU-sharing Support
 
-* Deploy the dcu-vgpu-device-plugin [here](https://github.com/Project-HAMi/dcu-vgpu-device-plugin)
+* Deploy the [dcu-vgpu-device-plugin](https://github.com/Project-HAMi/dcu-vgpu-device-plugin)
 
 ## Running DCU jobs
 
 Hygon DCUs can now be requested by a container
-using the `hygon.com/dcunum` , `hygon.com/dcumem` and `hygon.com/dcucores` resource type:
+using the `hygon.com/dcunum`, `hygon.com/dcumem` and `hygon.com/dcucores` resource types:
 
 ```yaml
 apiVersion: v1
@@ -51,21 +51,21 @@ spec:
 
 ## Enable vDCU inside container
 
-You need to enable vDCU inside container in order to use it. 
+You need to enable vDCU inside the container to use it.
 
-```yaml
+```bash
 source /opt/hygondriver/env.sh
 ```
 
-check if you have successfully enabled vDCU by using following command
+Check if you have successfully enabled vDCU by using the following command:
 
-```yaml
-hy-virtual -show-device-info
+```bash
+hy-smi virtual -show-device-info
 ```
 
 If you have an output like this, then you have successfully enabled vDCU inside the container.
 
-```yaml
+```text
 Device 0:
 	Actual Device: 0
 	Compute units: 60
diff --git a/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/allocate-core-and-memory.md b/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/allocate-core-and-memory.md
index b789ec50..4681d012 100644
--- a/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/allocate-core-and-memory.md
+++ b/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/allocate-core-and-memory.md
@@ -2,7 +2,7 @@
 title: Allocate device core and memory resource
 ---
 
-To allocate a certain part of device core resource, you need only to assign the `hygon.com/dcucores` and `hygon.com/dcumem` along with the number of Hygon DCUs you requested in the container using `hygon.com/dcunum`.
+To allocate a certain part of device core resource, you need only to assign the `hygon.com/dcucores` and `hygon.com/dcumem` along with the number of Hygon DCUs you requested in the container using `hygon.com/dcunum`.
 
 ```yaml
 apiVersion: v1
diff --git a/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/specify-certain-cards.md b/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/specify-certain-cards.md
index c29c1d81..364046d4 100644
--- a/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/specify-certain-cards.md
+++ b/versioned_docs/version-v2.8.0/userguide/hygon-device/examples/specify-certain-cards.md
@@ -14,7 +14,7 @@ metadata:
 spec:
   containers:
     - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
       command: ["bash", "-c", "sleep 86400"]
       resources:
         limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-core-usage.md b/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-core-usage.md
index 7e03433a..b0f9f0e5 100644
--- a/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-core-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-core-usage.md
@@ -3,8 +3,8 @@ title: Allocate device core to container
 linktitle: Allocate device core usage
 ---
 
-Allocate a percentage of device core resources by specify resource `hygon.com/dcucores`.
-Optional, each unit of `hygon.com/dcucores` equals to 1% device cores.
+Allocate a percentage of device core resources by specifying the `hygon.com/dcucores` resource.
+This field is optional; each unit of `hygon.com/dcucores` equals 1% of device cores.
 
 ```yaml
 resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-memory-usage.md b/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-memory-usage.md
index 51e7ebc0..fee66fea 100644
--- a/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-memory-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/hygon-device/specify-device-memory-usage.md
@@ -2,8 +2,8 @@
 title: Allocate device memory
 ---
 
-Allocate a percentage size of device memory by specify resources such as `hygon.com/dcumem`.
-Optional, Each unit of `hygon.com/dcumem` equals to 1M device memory. 
+Allocate a specific amount of device memory by specifying the `hygon.com/dcumem` resource.
+This field is optional; each unit of `hygon.com/dcumem` equals 1 MiB of device memory.
 
 ```yaml
 resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-bi-v150.md b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-bi-v150.md
index 51cc7593..6344f138 100644
--- a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-bi-v150.md
+++ b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-bi-v150.md
@@ -36,4 +36,4 @@ spec:
         iluvatar.ai/BI-V150.vMem: 64
 ```
 
-> **NOTE:** *Each `iluvatar.ai/<card-type>.vCore` unit represents 1% of an available compute core, and each `iluvatar.ai/<card-type>.vMem` unit represents 256MB of device memory*
\ No newline at end of file
+> **NOTICE:** *Each `iluvatar.ai/<card-type>.vCore` unit represents 1% of an available compute core, and each `iluvatar.ai/<card-type>.vMem` unit represents 256MB of device memory*
diff --git a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-bi-v150.md b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-bi-v150.md
index d09e9e89..809b7181 100644
--- a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-bi-v150.md
+++ b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-bi-v150.md
@@ -14,7 +14,7 @@ spec:
   containers:
   - name: BI-V150-poddemo
     image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
-    command: 
+    command:
     - bash
     args:
     - -c
@@ -31,4 +31,5 @@ spec:
       limits:
         iluvatar.ai/BI-V150-vgpu: 2
 ```
-> **Note:** *When applying for exclusive use of a GPU, `iluvatar.ai/<card-type>-vgpu=1`, you need to set the values of `iluvatar.ai/<card-type>.vCore` and `iluvatar.ai/<card-type>.vMem` to the maximum number of GPU resources. `iluvatar.ai/<card-type>-vgpu>1` no longer supports the vGPU function, so you don't need to fill in the core and memory values*
\ No newline at end of file
+
+> **NOTICE:** *When applying for exclusive use of a GPU with `iluvatar.ai/<card-type>-vgpu=1`, you need to set `iluvatar.ai/<card-type>.vCore` and `iluvatar.ai/<card-type>.vMem` to the maximum GPU resource values. With `iluvatar.ai/<card-type>-vgpu>1`, the vGPU function is no longer supported, so you don't need to fill in the core and memory values.*
diff --git a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-mr-v100.md b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-mr-v100.md
index 632a229c..345db925 100644
--- a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-mr-v100.md
+++ b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-exclusive-mr-v100.md
@@ -1,8 +1,8 @@
 ---
-title: Allocate exclusive BI-V100 device
+title: Allocate exclusive MR-V100 device
 ---
 
-To allocate multiple BI-V100 devices, you only need to assign `iluvatar.ai/BI-V150-vgpu` with no other fields required.
+To allocate multiple MR-V100 devices, you only need to assign `iluvatar.ai/MR-V100-vgpu` with no other fields required.
 ```yaml
 apiVersion: v1
@@ -14,7 +14,7 @@ spec:
   containers:
   - name: MR-V100-poddemo
     image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
-    command: 
+    command:
     - bash
     args:
     - -c
@@ -31,4 +31,5 @@ spec:
       limits:
         iluvatar.ai/MR-V100-vgpu: 2
 ```
-> **Note:** *When applying for exclusive use of a GPU, `iluvatar.ai/<card-type>-vgpu=1`, you need to set the values of `iluvatar.ai/<card-type>.vCore` and `iluvatar.ai/<card-type>.vMem` to the maximum number of GPU resources. `iluvatar.ai/<card-type>-vgpu>1` no longer supports the vGPU function, so you don't need to fill in the core and memory values*
\ No newline at end of file
+
+> **NOTICE:** *When applying for exclusive use of a GPU with `iluvatar.ai/<card-type>-vgpu=1`, you need to set `iluvatar.ai/<card-type>.vCore` and `iluvatar.ai/<card-type>.vMem` to the maximum GPU resource values. With `iluvatar.ai/<card-type>-vgpu>1`, the vGPU function is no longer supported, so you don't need to fill in the core and memory values.*
diff --git a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-mr-v100.md b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-mr-v100.md
index 985d4b65..df16ea90 100644
--- a/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-mr-v100.md
+++ b/versioned_docs/version-v2.8.0/userguide/iluvatar-device/examples/allocate-mr-v100.md
@@ -15,7 +15,7 @@ spec:
   containers:
   - name: MR-V100-poddemo
     image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
-    command: 
+    command:
     - bash
     args:
     - -c
@@ -37,4 +37,4 @@ spec:
         iluvatar.ai/MR-V100.vMem: 64
 ```
 
-> **NOTE:** *Each `iluvatar.ai/<card-type>.vCore` unit represents 1% of an available compute core, and each `iluvatar.ai/<card-type>.vMem` unit represents 256MB of device memory*
\ No newline at end of file
+> **NOTICE:** *Each `iluvatar.ai/<card-type>.vCore` unit represents 1% of an available compute core, and each `iluvatar.ai/<card-type>.vMem` unit represents 256MB of device memory*
diff --git a/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-schedule.md b/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-schedule.md
index e6132a1a..46842546 100644
--- a/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-schedule.md
+++ b/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-schedule.md
@@ -15,13 +15,13 @@ Kubernetes schedules the pods onto appropriate nodes with the goal of minimizing
 and maximizing performance. The `xpu-device` then performs fine-grained allocation
 of the requested resources on the selected node, following these rules:
 
-1. Only 1, 2, 4, or 8-card allocations are allowed. 
-2. Allocations of 1, 2, or 4 XPUs must not span across NUMA nodes. 
+1. Only 1, 2, 4, or 8-card allocations are allowed.
+2. Allocations of 1, 2, or 4 XPUs must not span across NUMA nodes.
 3. Fragmentation should be minimized after allocation.
 
 ## Important Notes
 
-1. Device sharing is **not** supported at this time. 
+1. Device sharing is **not** supported at this time.
 2. These features have been tested on Kunlunxin P800 hardware.
 
 ## Prerequisites
@@ -33,7 +33,7 @@ of the requested resources on the selected node, following these rules:
 ## Enabling Topology-Aware Scheduling
 
 * Deploy the Kunlunxin device plugin on P800 nodes.
-  (Please contact your device vendor to obtain the appropriate package and documentation.) 
+  (Please contact your device vendor to obtain the appropriate package and documentation.)
 * Deploy HAMi according to the instructions in `README.md`.
 
 ## Running Kunlunxin Jobs
@@ -49,7 +49,7 @@ metadata:
 spec:
   containers:
     - name: ubuntu-container
-      image: docker.io/library/ubuntu:latest
+      image: ubuntu:22.04
       imagePullPolicy: IfNotPresent
       command: ["sleep", "infinity"]
       resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-vxpu.md b/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-vxpu.md
index 33f78867..10d1be9e 100644
--- a/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-vxpu.md
+++ b/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/enable-kunlunxin-vxpu.md
@@ -176,7 +176,7 @@ spec:
   # ... rest of Pod configuration
 ```
 
-> **Note:** Device ID format is `{BusID}`. You can find available device IDs in the node status.
+> **NOTICE:** Device ID format is `{BusID}`. You can find available device IDs in the node status.
 
 ### Finding Device UUIDs
 
diff --git a/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/examples/allocate-whole-xpu.md b/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/examples/allocate-whole-xpu.md
index ab101943..a0d004cb 100644
--- a/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/examples/allocate-whole-xpu.md
+++ b/versioned_docs/version-v2.8.0/userguide/kunlunxin-device/examples/allocate-whole-xpu.md
@@ -18,4 +18,4 @@ spec:
     resources:
       limits:
         kunlunxin.com/xpu: 1 # requesting 1 XPU
-```
\ No newline at end of file
+```
diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/enable-metax-gpu-schedule.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/enable-metax-gpu-schedule.md
index 8ccde26e..2530debb 100644
--- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/enable-metax-gpu-schedule.md
+++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/enable-metax-gpu-schedule.md
@@ -19,11 +19,11 @@ the GPU device plugin (gpu-device) handles fine-grained allocation based on the
    - A connection is considered a MetaXLink connection when there is a MetaXLink connection and a PCIe Switch connection between the two cards.
    - When both MetaXLink and the PCIe Switch can meet the job request, MetaXLink-interconnected resources are preferred.
 
-2. When using `node-scheduler-policy=spread`, allocate Metax resources to be under the same Metaxlink or Paiswich as much as possible, as shown below:
+2. When using `node-scheduler-policy=spread`, allocate Metax resources under the same MetaXLink or PCIe Switch as much as possible, as shown below:
 
 ![Metax spread scheduling policy diagram showing resource allocation](/img/docs/common/userguide/metax-device/metax-gpu/metax-spread.jpg)
 
-3. When using `node-scheduler-policy=binpack`, assign GPU resources, so minimize the damage to MetaxXLink topology, as shown below:
+3. When using `node-scheduler-policy=binpack`, assign GPU resources to minimize the damage to MetaXLink topology, as shown below:
 
 ![Metax binpack scheduling policy diagram showing topology-aware allocation](/img/docs/common/userguide/metax-device/metax-gpu/metax-binpack.jpg)
 
@@ -54,11 +54,12 @@ apiVersion: v1
 kind: Pod
 metadata:
   name: gpu-pod1
-  annotations: hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
+ annotations: + hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task. spec: containers: - name: ubuntu-container - image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 + image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 imagePullPolicy: IfNotPresent command: ["sleep","infinity"] resources: @@ -66,4 +67,4 @@ spec: metax-tech.com/gpu: 1 # requesting 1 GPU ``` -> **NOTICE:** *You can find more examples in [examples/metax folder](https://github.com/Project-HAMi/HAMi/tree/release-v2.6/examples/metax/gpu)* +> **NOTICE:** *You can find more examples in [examples/metax folder](https://github.com/Project-HAMi/HAMi/tree/master/examples/metax/gpu)* diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-binpack.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-binpack.md index 24b4964f..16b04099 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-binpack.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-binpack.md @@ -9,12 +9,12 @@ apiVersion: v1 kind: Pod metadata: name: gpu-pod1 - annotations: + annotations: hami.io/node-scheduler-policy: "binpack" # when this parameter is set to binpack, the scheduler will try to minimize the topology loss. spec: containers: - name: ubuntu-container - image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 + image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 imagePullPolicy: IfNotPresent command: ["sleep","infinity"] resources: diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-spread.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-spread.md index 94c5a0f4..0b2b47e1 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-spread.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/allocate-spread.md @@ -9,12 +9,12 @@ apiVersion: v1 kind: Pod metadata: name: gpu-pod1 - annotations: + annotations: hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task. 
spec: containers: - name: ubuntu-container - image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 + image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 imagePullPolicy: IfNotPresent command: ["sleep","infinity"] resources: diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/default-use.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/default-use.md index 35006b07..2750f77a 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/default-use.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/examples/default-use.md @@ -12,7 +12,7 @@ metadata: spec: containers: - name: ubuntu-container - image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 + image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 imagePullPolicy: IfNotPresent command: ["sleep","infinity"] resources: diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-binpack-task.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-binpack-task.md index a5baa165..c957497c 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-binpack-task.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-binpack-task.md @@ -6,6 +6,6 @@ To allocate metax device with minimum damage to topology, you need to only assig ```yaml metadata: - annotations: + annotations: hami.io/node-scheduler-policy: "binpack" # when this parameter is set to binpack, the scheduler will try to minimize the topology loss. ``` diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-spread-task.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-spread-task.md index cc3d6c4b..15bf5e08 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-spread-task.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-gpu/specify-spread-task.md @@ -6,6 +6,6 @@ To allocate metax device with best performance, you need to only assign `metax-t ```yaml metadata: - annotations: + annotations: hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task. 
``` diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/enable-metax-gpu-sharing.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/enable-metax-gpu-sharing.md index a3bd23b7..c4de98ab 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/enable-metax-gpu-sharing.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/enable-metax-gpu-sharing.md @@ -13,15 +13,15 @@ translated: true ## Prerequisites -* Metax Driver >= 2.32.0 -* Metax GPU Operator >= 0.10.2 -* Kubernetes >= 1.23 +- Metax Driver >= 2.32.0 +- Metax GPU Operator >= 0.10.2 +- Kubernetes >= 1.23 ## Enabling GPU-sharing support -* Deploy Metax GPU Operator on metax nodes (Please consult your device provider to obtain the installation package and documentation) +- Deploy Metax GPU Operator on metax nodes (Please consult your device provider to obtain the installation package and documentation) -* Deploy HAMi according to README.md +- Deploy HAMi according to README.md ## Running Metax jobs @@ -41,9 +41,9 @@ spec: command: ["sleep","infinity"] resources: limits: - metax-tech.com/sgpu: 1 # requesting 1 GPU + metax-tech.com/sgpu: 1 # requesting 1 GPU metax-tech.com/vcore: 60 # each GPU use 60% of total compute cores metax-tech.com/vmemory: 4 # each GPU require 4 GiB device memory ``` -> **NOTICE:** *You can find more examples in [examples/metax folder](https://github.com/Project-HAMi/HAMi/tree/release-v2.6/examples/metax/sgpu)* +> **NOTICE:** *You can find more examples in [examples/metax folder](https://github.com/Project-HAMi/HAMi/tree/master/examples/metax/sgpu)* diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/allocate-qos-policy.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/allocate-qos-policy.md index b6c55939..c9f174f2 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/allocate-qos-policy.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/allocate-qos-policy.md @@ -5,10 +5,10 @@ translated: true Users can configure the QoS policy for tasks using the `metax-tech.com/sgpu-qos-policy` annotation to specify the scheduling policy used by the shared GPU (sGPU). The available sGPU scheduling policies are described in the table below: -| Scheduling Policy | Description | -|-------------------|-------------| -| `best-effort` | The sGPU has no restriction on compute usage. | -| `fixed-share` | The sGPU is assigned a fixed compute quota and cannot exceed this limit. | +| Scheduling Policy | Description | +|-------------------|------------------------------------------------------------------------------------------------------------------| +| `best-effort` | The sGPU has no restriction on compute usage. | +| `fixed-share` | The sGPU is assigned a fixed compute quota and cannot exceed this limit. | | `burst-share` | The sGPU is assigned a fixed compute quota, but may utilize additional GPU compute resources when they are idle. 
| ```yaml diff --git a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/default-use.md b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/default-use.md index bcd61109..e9ecd093 100644 --- a/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/default-use.md +++ b/versioned_docs/version-v2.8.0/userguide/metax-device/metax-sgpu/examples/default-use.md @@ -13,12 +13,12 @@ metadata: spec: containers: - name: ubuntu-container - image: ubuntu:22.04 + image: ubuntu:22.04 imagePullPolicy: IfNotPresent command: ["sleep","infinity"] resources: limits: - metax-tech.com/sgpu: 1 # requesting 1 GPU + metax-tech.com/sgpu: 1 # requesting 1 GPU metax-tech.com/vcore: 60 # each GPU use 60% of total compute cores metax-tech.com/vmemory: 4 # each GPU require 4 GiB device memory ``` diff --git a/versioned_docs/version-v2.8.0/userguide/monitoring/device-allocation.md b/versioned_docs/version-v2.8.0/userguide/monitoring/device-allocation.md index 3dee7c8b..80009229 100644 --- a/versioned_docs/version-v2.8.0/userguide/monitoring/device-allocation.md +++ b/versioned_docs/version-v2.8.0/userguide/monitoring/device-allocation.md @@ -11,8 +11,8 @@ curl {scheduler node ip}:31993/metrics It contains the following metrics: -| Metrics | Description | Example | -|----------|-------------|---------| +| Metrics | Description | Example | +| -------- | ----------- | ------- | | GPUDeviceCoreLimit | GPUDeviceCoreLimit Device memory core limit for a certain GPU | `{deviceidx="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",nodeid="aio-node67",zone="vGPU"}` 100 | | GPUDeviceMemoryLimit | GPUDeviceMemoryLimit Device memory limit for a certain GPU | `{deviceidx="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",nodeid="aio-node67",zone="vGPU"}` 3.4359738368e+10 | | GPUDeviceCoreAllocated | Device core allocated for a certain GPU | `{deviceidx="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",nodeid="aio-node67",zone="vGPU"}` 45 | @@ -21,6 +21,16 @@ It contains the following metrics: | vGPUCoreAllocated | vGPU core allocated from a container | `{containeridx="Ascend310P",deviceuuid="aio-node74-arm-Ascend310P-0",nodename="aio-node74-arm",podname="ascend310p-pod",podnamespace="default",zone="vGPU"}` 50 | | vGPUMemoryAllocated | vGPU memory allocated from a container | `{containeridx="Ascend310P",deviceuuid="aio-node74-arm-Ascend310P-0",nodename="aio-node74-arm",podname="ascend310p-pod",podnamespace="default",zone="vGPU"}` 3.221225472e+09 | | QuotaUsed | resourcequota usage for a certain device | `{quotaName="nvidia.com/gpucores", quotanamespace="default",limit="200",zone="vGPU"}` 100 | -| vGPUPodsDeviceAllocated | vGPU Allocated from pods (This metric will be deprecated in v2.8.0, use vGPUMemoryAllocated and vGPUCoreAllocated instead.)| `{containeridx="Ascend310P",deviceusedcore="0",deviceuuid="aio-node74-arm-Ascend310P-0",nodename="aio-node74-arm",podname="ascend310p-pod",podnamespace="default",zone="vGPU"}` 3.221225472e+09 | -> **Note** This is the overview about device allocation, it is NOT device real-time usage metrics. For that part, see real-time device usage. 
+If you are using [HAMi DRA](../../installation/how-to-use-hami-dra), the metrics will be:
+
+| Metrics | Description | Example |
+| -------- | ----------- | ------- |
+| GPUDeviceCoreLimit | Device core limit for a certain GPU | `{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"}` 100 |
+| GPUDeviceMemoryLimit | Device memory limit for a certain GPU | `{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"}` 8192 |
+| GPUDeviceCoreAllocated | Device core allocated for a certain GPU | `{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"}` 0 |
+| GPUDeviceMemoryAllocated | Device memory allocated for a certain GPU | `{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"}` 0 |
+| vGPUDeviceCoreAllocated | vGPU core allocated from a container | `{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-0",deviceproductname="Tesla P4",deviceuuid="GPU-82be-83fe-3068",nodeid="k8s-node01",podname="pod-0",podnamespace="default"}` 100 |
+| vGPUDeviceMemoryAllocated | vGPU memory allocated from a container | `{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-0",deviceproductname="Tesla P4",deviceuuid="GPU-82be-83fe-3068",nodeid="k8s-node01",podname="pod-0",podnamespace="default"}` 4000 |
+
+> **NOTICE:** This is an overview of device allocation; it is NOT real-time device usage metrics. For that, see real-time device usage.
diff --git a/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-device-usage.md b/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-device-usage.md
index 0f8d75f9..65f72202 100644
--- a/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-device-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-device-usage.md
@@ -3,7 +3,7 @@ title: Real-time device usage endpoint
 linktitle: Real-time device usage
 ---
 
-You can get the real-time device memory and core utilization by visiting `{GPU node node ip}:31992/metrics`, or add it to a prometheus endpoint, as the command below:
+You can get the real-time device memory and core utilization by visiting `{GPU node ip}:31992/metrics`, or add it as a Prometheus endpoint, as in the command below:
 
 ```bash
 curl {GPU node ip}:31992/metrics
@@ -11,8 +11,8 @@ curl {GPU node ip}:31992/metrics
 
 It contains the following metrics:
 
-| Metrics | Description | Example |
-|----------|-------------|---------|
+| Metrics | Description | Example |
+| ---------- | ----------- | ------- |
 | Device_memory_desc_of_container | Container device memory real-time usage | `{context="0",ctrname="2-1-3-pod-1",data="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",module="0",offset="0",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
 | Device_utilization_desc_of_container | Container device real-time utilization | `{ctrname="2-1-3-pod-1",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
 | HostCoreUtilization | GPU real-time utilization on host | `{deviceidx="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",zone="vGPU"}` 0 |
diff --git 
a/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-usage.md b/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-usage.md index c02a1d3d..cc87fd94 100644 --- a/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-usage.md +++ b/versioned_docs/version-v2.8.0/userguide/monitoring/real-time-usage.md @@ -1,5 +1,115 @@ --- -title: Real-time usage +title: Real-time GPU Usage +sidebar_label: Real-time Usage --- -To be improved. +Real-time monitoring allows you to track GPU utilization, memory usage, and resource allocation across your Kubernetes cluster as workloads run. HAMi provides tools to observe GPU behavior dynamically. + +## Monitoring with kubectl + +### Check Node GPU Resources + +View current GPU capacity and allocatable resources on a node: + +```bash +kubectl get node <node-name> -o json | jq '.status.allocatable' | grep -i gpu +``` + +### Inspect Pod GPU Allocation + +See which GPUs are allocated to a specific pod: + +```bash +kubectl get pod <pod-name> -o json | jq '.metadata.annotations' | grep -i gpu +``` + +Or view all GPU-related information: + +```bash +kubectl describe pod <pod-name> +``` + +## Monitoring Inside Containers + +### Check Allocated GPU Inside Pod + +Inside a running container, you can check which GPUs are visible: + +```bash +kubectl exec -it <pod-name> -- nvidia-smi +``` + +This shows the virtual GPU configuration as seen by the container, including allocated memory and cores. + +### Real-time GPU Usage + +To monitor GPU usage while a workload runs: + +```bash +kubectl exec -it <pod-name> -- watch -n 1 nvidia-smi +``` + +This updates the GPU metrics every second, showing: +- GPU utilization percentage +- Memory usage and limits +- Running processes + +## Node-Level Monitoring + +### Monitor All GPUs on a Node + +SSH into the node and run: + +```bash +nvidia-smi +``` + +For continuous monitoring: + +```bash +watch -n 1 nvidia-smi +``` + +### Check HAMi Device Plugin Status + +Verify the HAMi device plugin is running and reporting resources: + +```bash +kubectl get pods -n kube-system | grep hami +kubectl logs -n kube-system -l app=hami-scheduler -f +``` + +## Resource Annotation Tracking + +HAMi stores GPU information in node annotations. View them with: + +```bash +kubectl get node <node-name> -o yaml | grep -A 10 "hami.io/node" +``` + +This shows detailed GPU information including: +- GPU UUIDs +- Memory capacity +- Compute core count +- Device models + +## Integration with Monitoring Tools + +For production environments, integrate HAMi with tools like: + +- **Prometheus**: Scrape kubelet metrics for GPU resource data +- **Grafana**: Visualize GPU utilization trends over time +- **Kubernetes Dashboard**: View GPU resources in the web UI + +Refer to the Kubernetes documentation for setting up monitoring with these tools. + +## Troubleshooting + +If you notice GPU allocation inconsistencies: + +1. Check pod resource requests/limits match HAMi annotations +2. Verify the HAMi scheduler is running +3. Check device plugin logs for errors +4. Ensure nodes have the required GPU labels + +For more details, see the [troubleshooting guide](../../troubleshooting/troubleshooting.md). 
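+
+For the Prometheus integration mentioned above, a minimal static scrape configuration might look like the sketch below. It assumes the default HAMi metrics ports documented in the monitoring section (31993 on the scheduler node for the allocation overview, 31992 on each GPU node for real-time usage); substitute your own node IPs, and prefer Kubernetes service discovery in larger clusters:
+
+```yaml
+# prometheus.yml fragment: static targets for the HAMi metrics endpoints
+scrape_configs:
+  - job_name: 'hami-device-allocation'
+    static_configs:
+      - targets: ['<scheduler-node-ip>:31993']  # allocation overview metrics
+  - job_name: 'hami-realtime-usage'
+    static_configs:
+      - targets: ['<gpu-node-ip>:31992']        # real-time device usage metrics
+```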
diff --git a/versioned_docs/version-v2.8.0/userguide/mthreads-device/enable-mthreads-gpu-sharing.md b/versioned_docs/version-v2.8.0/userguide/mthreads-device/enable-mthreads-gpu-sharing.md
index 792a519e..0c787bcd 100644
--- a/versioned_docs/version-v2.8.0/userguide/mthreads-device/enable-mthreads-gpu-sharing.md
+++ b/versioned_docs/version-v2.8.0/userguide/mthreads-device/enable-mthreads-gpu-sharing.md
@@ -8,17 +8,17 @@ title: Enable Mthreads GPU sharing

**GPU sharing**: Each task can allocate a portion of GPU instead of a whole GPU card, thus GPU can be shared among multiple tasks.

-**Device Memory Control**: GPUs can be allocated with certain device memory size on certain type(i.e MTT S4000) and have made it that it does not exceed the boundary.
+**Device Memory Control**: GPUs can be allocated with a specific device memory size on certain types (e.g., MTT S4000), with hard limits enforced to prevent exceeding the allocation.

-**Device Core Control**: GPUs can be allocated with limited compute cores on certain type(i.e MTT S4000) and have made it that it does not exceed the boundary.
+**Device Core Control**: GPUs can be allocated with limited compute cores on certain types (e.g., MTT S4000), with hard limits enforced to prevent exceeding the allocation.

## Important Notes

1. Device sharing for multi-cards is not supported.

-2. Only one mthreads device can be shared in a pod(even there are multiple containers).
+2. Only one Mthreads device can be shared in a pod (even if there are multiple containers).

-3. Support allocating exclusive mthreads GPU by specifying mthreads.com/vgpu only.
+3. Support allocating exclusive Mthreads GPU by specifying mthreads.com/vgpu only.

4. These features are tested on MTT S4000

@@ -29,20 +29,20 @@ title: Enable Mthreads GPU sharing

## Enabling GPU-sharing Support

-* Deploy MT-CloudNative Toolkit on mthreads nodes (Please consult your device provider to acquire its package and document)
+* Deploy MT-CloudNative Toolkit on Mthreads nodes (please consult your device provider to acquire its package and documentation)

> **NOTICE:** *You can remove mt-mutating-webhook and mt-gpu-scheduler after installation(optional).*

* set the 'devices.mthreads.enabled = true' when installing hami

```bash
-helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set device.mthreads.enabled=true -n kube-system
+helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.mthreads.enabled=true -n kube-system
```

## Running Mthreads jobs

Mthreads GPUs can now be requested by a container
-using the `mthreads.com/vgpu`, `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` resource type:
+using the `mthreads.com/vgpu`, `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` resource types:

```yaml
apiVersion: v1
@@ -64,5 +64,5 @@ spec:
    mthreads.com/sgpu-core: 8
```

-> **NOTICE1:** *Each unit of sgpu-memory indicates 512M device memory*
-> **NOTICE2:** *You can find more examples in [examples/mthreads folder](https://github.com/Project-HAMi/HAMi/tree/release-v2.6/examples/mthreads/)*
+> **NOTICE:** *Each unit of sgpu-memory indicates 512 MiB of device memory*
+> **NOTICE:** *You can find more examples in the [examples/mthreads folder](https://github.com/Project-HAMi/HAMi/tree/master/examples/mthreads/)*
diff --git a/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-core-and-memory.md b/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-core-and-memory.md
index cd6fc65c..7dafbd52 100644
--- a/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-core-and-memory.md
+++ b/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-core-and-memory.md
@@ -2,7 +2,7 @@
title: Allocate device core and memory resource
---

-To allocate a certain part of device core resource, you need only to assign the `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` along with the number of cambricon MLUs you requested in the container using `mthreads.com/vgpu`
+To allocate a certain part of device core and memory resources, you only need to assign `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` along with the number of Mthreads GPUs you requested in the container using `mthreads.com/vgpu`.

```yaml
apiVersion: v1
@@ -12,7 +12,7 @@ metadata:
spec:
  restartPolicy: OnFailure
  containers:
-    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
+    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
      imagePullPolicy: IfNotPresent
      name: gpushare-pod-1
      command: ["sleep"]
diff --git a/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-exclusive.md b/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-exclusive.md
index 140221fe..efc53016 100644
--- a/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-exclusive.md
+++ b/versioned_docs/version-v2.8.0/userguide/mthreads-device/examples/allocate-exclusive.md
@@ -2,7 +2,7 @@
title: Allocate exclusive device
---

-To allocate a whole cambricon device, you need to only assign `mthreads.com/vgpu` without other fields. You can allocate multiple GPUs for a container.
+To allocate a whole Mthreads device, you only need to assign `mthreads.com/vgpu` without other fields. You can allocate multiple GPUs for a container.

```yaml
apiVersion: v1
@@ -12,7 +12,7 @@ metadata:
spec:
  restartPolicy: OnFailure
  containers:
-    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
+    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
      imagePullPolicy: IfNotPresent
      name: gpushare-pod-1
      command: ["sleep"]
diff --git a/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-core-usage.md b/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-core-usage.md
index 197ec80c..5df0b62d 100644
--- a/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-core-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-core-usage.md
@@ -3,8 +3,8 @@ title: Allocate device core to container
linktitle: Allocate device core usage
---

-Allocate a part of device core resources by specify resource `mthreads.com/sgpu-core`.
-Optional, each unit of `mthreads.com/sgpu-core` equals to 1/16 device cores.
+Allocate a part of device core resources by specifying the resource `mthreads.com/sgpu-core`.
+This is optional; each unit of `mthreads.com/sgpu-core` equals 1/16 of the device cores.
```yaml
resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-memory-usage.md b/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-memory-usage.md
index fe2629b1..94b752e9 100644
--- a/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-memory-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/mthreads-device/specify-device-memory-usage.md
@@ -3,8 +3,8 @@ title: Allocate device memory to container
linktitle: Allocate device memory
---

-Allocate a percentage size of device memory by specify resources such as `mthreads.com/sgpu-memory`.
-Optional, Each unit of `mthreads.com/sgpu-memory` equals to 512M of device memory.
+Allocate a certain size of device memory by specifying resources such as `mthreads.com/sgpu-memory`.
+This is optional; each unit of `mthreads.com/sgpu-memory` equals 512 MiB of device memory.

```yaml
resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/dynamic-mig-support.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/dynamic-mig-support.md
index 954348ee..ada09970 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/dynamic-mig-support.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/dynamic-mig-support.md
@@ -64,18 +64,19 @@ You can customize the MIG configuration by following the steps below:

```yaml
nvidia:
-  resourceCountName: { { .Values.resourceName } }
-  resourceMemoryName: { { .Values.resourceMem } }
-  resourceMemoryPercentageName: { { .Values.resourceMemPercentage } }
-  resourceCoreName: { { .Values.resourceCores } }
-  resourcePriorityName: { { .Values.resourcePriority } }
+  resourceCountName: {{ .Values.resourceName }}
+  resourceMemoryName: {{ .Values.resourceMem }}
+  resourceMemoryPercentageName: {{ .Values.resourceMemPercentage }}
+  resourceCoreName: {{ .Values.resourceCores }}
+  resourcePriorityName: {{ .Values.resourcePriority }}
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
-  deviceSplitCount: { { .Values.devicePlugin.deviceSplitCount } }
-  deviceMemoryScaling: { { .Values.devicePlugin.deviceMemoryScaling } }
-  deviceCoreScaling: { { .Values.devicePlugin.deviceCoreScaling } }
+  memoryFactor: 1
+  deviceSplitCount: {{ .Values.devicePlugin.deviceSplitCount }}
+  deviceMemoryScaling: {{ .Values.devicePlugin.deviceMemoryScaling }}
+  deviceCoreScaling: {{ .Values.devicePlugin.deviceCoreScaling }}
  knownMigGeometries:
    - models: ["A30"]
      allowedGeometries:
@@ -130,7 +131,7 @@ nvidia:
:::note
Helm installations and updates will follow the configuration specified in this file, overriding the default Helm settings.

-HAMi identifies and use the first MIG template that matches the job, in the order defined in this configMap.
+HAMi uses the first MIG template that matches the job, in the order defined in this configMap.
:::

## Running MIG jobs
@@ -148,7 +149,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-core.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-core.md
index 442f0b3f..8df0002f 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-core.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-core.md
@@ -13,7 +13,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory.md
index 37a30eb4..966493a0 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory.md
@@ -13,7 +13,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory2.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory2.md
index 5a318774..68d40410 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory2.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/allocate-device-memory2.md
@@ -13,7 +13,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/dynamic-mig-example.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/dynamic-mig-example.md
index bfd23b32..92f612cd 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/dynamic-mig-example.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/dynamic-mig-example.md
@@ -2,7 +2,7 @@
title: Assign task to mig instance
---

-This example will allocate `2g.10gb x 2` for A100-40GB-PCIE device or `1g.10gb x 2` for A100-80GB-SXM device.
+This example will allocate `2g.10gb * 2` for an A100-40GB-PCIE device or `1g.10gb * 2` for an A100-80GB-SXM device.
```yaml
apiVersion: v1
@@ -15,7 +15,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-card-type-to-use.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-card-type-to-use.md
index 0cda86a7..14c124d8 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-card-type-to-use.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-card-type-to-use.md
@@ -11,15 +11,15 @@ metadata:
  name: gpu-pod
  annotations:
    nvidia.com/use-gputype: "A100,V100"
-    #In this example, we want to run this job on A100 or V100
+    #In this example, the job runs on A100 or V100
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
```

-> **NOTICE:** *You can assign this task to multiple GPU types, use comma to separate,In this example, the job targets A100 or V100*
+> **NOTICE:** *You can assign this task to multiple GPU types, separated by commas. In this example, the job runs on A100 or V100.*
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-certain-card.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-certain-card.md
index cf623330..7ab6d180 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-certain-card.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/specify-certain-card.md
@@ -14,7 +14,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/use-exclusive-card.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/use-exclusive-card.md
index 255b26e3..892a3140 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/use-exclusive-card.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/examples/use-exclusive-card.md
@@ -1,5 +1,5 @@
---
-title: Allocate device core to container
+title: Use Exclusive GPU
linktitle: Use exclusive GPU
---
@@ -13,7 +13,7 @@ metadata:
spec:
  containers:
    - name: ubuntu-container
-      image: ubuntu:18.04
+      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-core-usage.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-core-usage.md
index 4ce9e11e..80ce42ca 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-core-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-core-usage.md
@@ -3,8 +3,8 @@ title: Allocate device core to container
linktitle: Allocate device core usage
---

-Allocate a percentage of device core resources by specify resource `nvidia.com/gpucores`.
-Optional, each unit of `nvidia.com/gpucores` equals to 1% device cores.
+Allocate a percentage of device core resources by specifying the resource `nvidia.com/gpucores`.
+This is optional; each unit of `nvidia.com/gpucores` equals 1% of the device cores.
```yaml
resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-memory-usage.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-memory-usage.md
index e24b5c93..05f89120 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-memory-usage.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-memory-usage.md
@@ -3,8 +3,8 @@ title: Allocate device memory to container
linktitle: Allocate device memory
---

-Allocate a certain size of device memory by specify resources such as `nvidia.com/gpumem`.
-Optional, Each unit of `nvidia.com/gpumem` equals to 1M.
+Allocate a certain size of device memory by specifying resources such as `nvidia.com/gpumem`.
+This is optional; each unit of `nvidia.com/gpumem` equals 1 MiB of device memory.

```yaml
resources:
@@ -13,8 +13,8 @@ Optional, Each unit of `nvidia.com/gpumem` equals to 1M.
    nvidia.com/gpumem: 3000 # Each GPU contains 3000m device memory
```

-Allocate a percentage of device memory by specify resource `nvidia.com/gpumem-percentage`.
-Optional, each unit of `nvidia.com/gpumem-percentage` equals to 1% percentage of device memory.
+Allocate a percentage of device memory by specifying the resource `nvidia.com/gpumem-percentage`.
+This is optional; each unit of `nvidia.com/gpumem-percentage` equals 1% of device memory.

```yaml
resources:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-type-to-use.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-type-to-use.md
index b69736ba..f148ca80 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-type-to-use.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-type-to-use.md
@@ -2,6 +2,8 @@
title: Assign to certain device type
---

+## Overview
+
Sometimes a task may wish to run on a certain type of GPU, it can fill the `nvidia.com/use-gputype` field in pod annotation. HAMi scheduler will check if the device type returned from `nvidia-smi -L` contains the content of annotation. For example, a task with the following annotation will be assigned to A100 or V100 GPU

@@ -12,7 +14,7 @@ metadata:
    nvidia.com/use-gputype: "A100,V100" # Specify the card type for this job, use comma to separate, will not launch job on non-specified card
```

-A task may use `nvidia.com/nouse-gputype` to evade certain type of GPU. In this following example, that job won't be assigned to 1080(include 1080Ti) or 2080(include 2080Ti) type of card.
+A task may use `nvidia.com/nouse-gputype` to avoid certain GPU types. In the following example, the job will not be assigned to a 1080 (including 1080Ti) or 2080 (including 2080Ti) card.
```yaml
metadata:
diff --git a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-uuid-to-use.md b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-uuid-to-use.md
index 0247c504..ca63ec90 100644
--- a/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-uuid-to-use.md
+++ b/versioned_docs/version-v2.8.0/userguide/nvidia-device/specify-device-uuid-to-use.md
@@ -1,5 +1,5 @@
---
-title: Assign to certain device type
+title: Assign to Certain Device UUID
linktitle: Assign to certain device
---

@@ -13,4 +13,4 @@ metadata:
    nvidia.com/use-gpuuuid: "GPU-123456"
```

-> **NOTICE:** *Each GPU UUID is unique in a cluster, so assign a certain UUID means assigning this task to certain node with that GPU* \ No newline at end of file
+> **NOTICE:** *Each GPU UUID is unique in a cluster, so assigning a certain UUID means pinning this task to the node that has that GPU.*
diff --git a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/default-use.md b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/default-use.md
index 386b5607..0740a16d 100644
--- a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/default-use.md
+++ b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/default-use.md
@@ -1,8 +1,8 @@
---
-title: Default vGPU job
+title: Default vGPU Job
---

-vGPU can be requested by both set "volcano.sh/vgpu-number", "volcano.sh/vgpu-cores" and "volcano.sh/vgpu-memory" in resources.limits
+vGPU can be requested by setting `volcano.sh/vgpu-number`, `volcano.sh/vgpu-cores` and `volcano.sh/vgpu-memory` in `resources.limits`.

```yaml
apiVersion: v1
@@ -13,7 +13,7 @@ spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
-    - image: ubuntu:20.04
+    - image: ubuntu:22.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
diff --git a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/use-exclusive-gpu.md b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/use-exclusive-gpu.md
index f47185f4..b90058ca 100644
--- a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/use-exclusive-gpu.md
+++ b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/examples/use-exclusive-gpu.md
@@ -13,7 +13,7 @@ spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
-    - image: ubuntu:20.04
+    - image: ubuntu:22.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
diff --git a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/how-to-use-volcano-vgpu.md b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/how-to-use-volcano-vgpu.md
index 6933f54a..b318ed92 100644
--- a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/how-to-use-volcano-vgpu.md
+++ b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/how-to-use-volcano-vgpu.md
@@ -5,10 +5,10 @@ linktitle: Use Volcano vGPU

:::note

-You *DON'T* need to install HAMi when using volcano-vgpu, only use
+You *DON'T* need to install HAMi when using volcano-vgpu; using
[Volcano vGPU device-plugin](https://github.com/Project-HAMi/volcano-vgpu-device-plugin) is good enough. It can provide device-sharing mechanism for NVIDIA devices managed by Volcano.

-This is based on [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin), it uses [HAMi-core](https://github.com/Project-HAMi/HAMi-core) to support hard isolation of GPU card.
+This is based on [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) and uses [HAMi-core](https://github.com/Project-HAMi/HAMi-core) to support hard isolation of GPU cards.

Volcano vGPU is only available in Volcano > v1.9.

@@ -95,7 +95,7 @@ status:

### Running vGPU Jobs

-vGPU can be requested by both set "volcano.sh/vgpu-number" , "volcano.sh/vgpu-cores" and "volcano.sh/vgpu-memory" in resource.limit
+vGPU can be requested by setting `volcano.sh/vgpu-number`, `volcano.sh/vgpu-cores` and `volcano.sh/vgpu-memory` in `resources.limits`.

```shell
cat <<EOF | kubectl apply -f -
@@ -113,7 +113,7 @@ spec:
    limits:
      volcano.sh/vgpu-number: 2 # requesting 2 gpu cards
      volcano.sh/vgpu-memory: 3000 # (optional)each vGPU uses 3G device memory
-      volcano.sh/vgpu-cores: 50 # (optional)each vGPU uses 50% core
+      volcano.sh/vgpu-cores: 50 # (optional) each vGPU uses 50% of the cores
EOF
```

diff --git a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/monitor.md b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/monitor.md
index 90f0309d..3f7a8b73 100644
--- a/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/monitor.md
+++ b/versioned_docs/version-v2.8.0/userguide/volcano-vgpu/nvidia-gpu/monitor.md
@@ -10,11 +10,11 @@
curl {volcano scheduler cluster ip}:8080/metrics
```

It contains the following metrics:

-| Metrics | Description | Example |
-|----------|-------------|---------|
-| volcano_vgpu_device_allocated_cores | The percentage of gpu compute cores allocated in this card | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"}` 0 |
-| volcano_vgpu_device_allocated_memory | Vgpu memory allocated in this card | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"}` 32768|
-| volcano_vgpu_device_core_allocation_for_a_vertain_pod| The vgpu device core allocated for a certain pod | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podName="resnet101-deployment-7b487d974d-jjc8p"}` 0|
-| volcano_vgpu_device_memory_allocation_for_a_certain_pod | The vgpu device memory allocated for a certain pod | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podName="resnet101-deployment-7b487d974d-jjc8p"}` 16384 |
+| Metrics | Description | Example |
+| ------- | ----------- | ------- |
+| volcano_vgpu_device_allocated_cores | The percentage of GPU compute cores allocated in this card | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"}` 0 |
+| volcano_vgpu_device_allocated_memory | vGPU memory allocated in this card | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"}` 32768 |
+| volcano_vgpu_device_core_allocation_for_a_certain_pod | The vGPU device core allocated for a certain pod | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podName="resnet101-deployment-7b487d974d-jjc8p"}` 0 |
+| volcano_vgpu_device_memory_allocation_for_a_certain_pod | The vGPU device memory allocated for a certain pod | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podName="resnet101-deployment-7b487d974d-jjc8p"}` 16384 |
| volcano_vgpu_device_memory_limit | The number of total device memory in this card | `{NodeName="m5-cloudinfra-online01",devID="GPU-a88b5d0e-eb85-924b-b3cd-c6cad732f745"}` 32768 |
-| volcano_vgpu_device_shared_number | The number of vgpu tasks sharing this card | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"}` 2| \ No newline at end of file
+| volcano_vgpu_device_shared_number | The number of vGPU tasks sharing this card | `{NodeName="aio-node67",devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"}` 2 |
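+
+As a usage sketch, the per-card memory allocation ratio can be derived from the two memory metrics above. The recording rule below is a hypothetical fragment (the group name and file layout are placeholders), assuming these metrics are already scraped into Prometheus:
+
+```yaml
+# recording-rules.yml (fragment) -- hypothetical rule built from the metrics above
+groups:
+  - name: volcano-vgpu                  # placeholder group name
+    rules:
+      - record: volcano_vgpu_device_memory_allocation_ratio
+        expr: volcano_vgpu_device_allocated_memory / volcano_vgpu_device_memory_limit
+```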