[Issue]: Does NVSHMEM/DeepEP support SR-IOV Virtual Function devices? #89

### How is this issue impacting you?

Application hang

### Share Your Debug Logs

```
Successfully initialized the transport: ibrc
Successfully initialized the transport: IBGDA
NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
Skipping NVLINK SHARP resource initialized for team ID: 1
Skipping NVLINK SHARP resource initialized for team ID: 2
Skipping NVLINK SHARP resource initialized for team ID: 4
# HANGS HERE - no further output
```

### Steps to Reproduce the Issue

## Pod YAML

```yaml
spec:
  hostIPC: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
```

## Launch Command

```bash
export NVSHMEM_HCA_LIST=$(ls /sys/class/infiniband/ | tr '\n' ',' | sed 's/,$//' | sed 's/,/:1,/g' | sed 's/$/:1/')
export NVSHMEM_IBGDA_SUPPORT=0
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_IB_ADDR_RANGE=100.0.0.0/8
export NVSHMEM_IB_ADDR_FAMILY=AF_INET


python3 -m sglang.launch_server \
    --model-path $MODELPATH \
    --host 0.0.0.0 \
    --port 30000 \
    --disaggregation-mode prefill \
    --disaggregation-bootstrap-port 8998 \
    --disaggregation-transfer-backend mooncake \
    --moe-dense-tp-size 1 \
    --moe-a2a-backend deepep \
    --deepep-mode normal \
    --page-size 64 \
    --dist-init-addr $HEADPODIP:6379 \
    --enable-dp-lm-head \
    --enable-dp-attention \
    --dp-size 32 \
    --tp-size 32 \
    --mem-fraction-static 0.75 \
    --max-total-tokens 10000 \
    --nnodes 4 \
    --node-rank $LOCAL_RANK \
    --watchdog-timeout 100000 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-eplb \
    --ep-dispatch-algorithm dynamic \
    --chunked-prefill-size 32768 \
    --served-model-name GLM-5.1-FP8
```
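
The `NVSHMEM_HCA_LIST` pipeline above relies on `ls` output and three `sed` passes. A minimal sketch of an equivalent helper follows, assuming every device should use port 1 as in the original command. `build_hca_list` is a hypothetical name for illustration, not part of NVSHMEM.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not an NVSHMEM tool): build a "dev:port,dev:port"
# list from an InfiniBand sysfs directory.
# Assumption: the same port number (default 1) applies to every device,
# matching the original launch command.
build_hca_list() {
    local root="${1:-/sys/class/infiniband}"
    local port="${2:-1}"
    local out="" dev
    for dev in "$root"/*; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -e "$dev" ] || continue
        # Prepend a comma only when the list already has an entry.
        out+="${out:+,}${dev##*/}:${port}"
    done
    printf '%s\n' "$out"
}
```

It would be used as `export NVSHMEM_HCA_LIST="$(build_hca_list)"`. An empty result means no RDMA devices are visible under that path in the pod.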

## Symptom

NVSHMEM initializes successfully (ibrc + IBGDA), creates teams, and logs the NVLINK SHARP skip messages, but then **hangs indefinitely** and never reaches `cuIpcOpenMemHandle`. All 4 nodes hang at the same point.

Key log lines:
```
Successfully initialized the transport: ibrc
Successfully initialized the transport: IBGDA
NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
Skipping NVLINK SHARP resource initialized for team ID: 1
Skipping NVLINK SHARP resource initialized for team ID: 2
Skipping NVLINK SHARP resource initialized for team ID: 4
# HANGS HERE - no further output
```

With PF devices on bare metal, the log continues with `cuIpcOpenMemHandle` and DeepGEMM compilation.

### NVSHMEM Version

nvidia-nvshmem-cu12 3.3.20

### Your platform details

- 4 nodes, each with 8x Hopper GPUs and 8x ConnectX-6 HCAs (SR-IOV enabled, 8 VFs per HCA)
- Kubernetes pods with SR-IOV VF devices allocated
- sglang with DeepEP: `--moe-a2a-backend deepep --deepep-mode normal`
- Same config works on bare metal with PF devices
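
To confirm which kind of function each pod actually sees, the RDMA devices can be classified from sysfs: on Linux, a PCI Virtual Function exposes a `physfn` symlink in its device directory. The sketch below assumes sysfs is readable inside the pod. `classify_hca` is a hypothetical helper name, and "not-VF" covers both PFs and devices without SR-IOV.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<device> VF" or "<device> not-VF" for each
# entry under an InfiniBand sysfs directory.
# Assumption: a VF is identified by the "physfn" link in its PCI device dir.
classify_hca() {
    local root="${1:-/sys/class/infiniband}"
    local dev
    for dev in "$root"/*; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -e "$dev" ] || continue
        if [ -e "$dev/device/physfn" ] || [ -L "$dev/device/physfn" ]; then
            printf '%s VF\n' "${dev##*/}"
        else
            printf '%s not-VF\n' "${dev##*/}"
        fi
    done
}
```

Running `classify_hca` in the pod and on the bare-metal host gives a quick side-by-side view of the device sets used in the failing and working setups.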

### Error Message & Behavior

Does NVSHMEM (used by DeepEP) work with SR-IOV Virtual Function (VF) devices in Kubernetes? We're running sglang with `--moe-a2a-backend deepep` on 4 nodes with SR-IOV VF devices, and NVSHMEM hangs after initialization.
