How is this issue impacting you?
Application hang
Share Your Debug Logs
Successfully initialized the transport: ibrc
Successfully initialized the transport: IBGDA
NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
Skipping NVLINK SHARP resource initialized for team ID: 1
Skipping NVLINK SHARP resource initialized for team ID: 2
Skipping NVLINK SHARP resource initialized for team ID: 4
# HANGS HERE - no further output
Steps to Reproduce the Issue
Pod YAML
spec:
hostIPC: true
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
Launch Command
export NVSHMEM_HCA_LIST=$(ls /sys/class/infiniband/ | tr '\n' ',' | sed 's/,$//' | sed 's/,/:1,/g' | sed 's/$/:1/')
export NVSHMEM_IBGDA_SUPPORT=0
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_IB_ADDR_RANGE=100.0.0.0/8
export NVSHMEM_IB_ADDR_FAMILY=AF_INET
python3 -m sglang.launch_server \
--model-path $MODELPATH \
--host 0.0.0.0 \
--port 30000 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998 \
--disaggregation-transfer-backend mooncake \
--moe-dense-tp-size 1 \
--moe-a2a-backend deepep \
--deepep-mode normal \
--page-size 64 \
--dist-init-addr $HEADPODIP:6379 \
--enable-dp-lm-head \
--enable-dp-attention \
--dp-size 32 \
--tp-size 32 \
--mem-fraction-static 0.75 \
--max-total-tokens 10000 \
--nnodes 4 \
--node-rank $LOCAL_RANK \
--watchdog-timeout 100000 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-eplb \
--ep-dispatch-algorithm dynamic \
--chunked-prefill-size 32768 \
--served-model-name GLM-5.1-FP8
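For readability, the `NVSHMEM_HCA_LIST` pipeline in the launch command can be written as a small function, which also makes it easy to print the value inside the pod before launching. This is only a sketch: `build_hca_list` is a hypothetical helper name, and it assumes port 1 on every device, as the original `sed` pipeline does.

```shell
# Hypothetical helper: build "dev0:1,dev1:1,..." from an InfiniBand sysfs directory.
# Assumes port 1 on every device, matching the original sed pipeline.
build_hca_list() {
    ib_dir="${1:-/sys/class/infiniband}"
    list=""
    for dev in "$ib_dir"/*; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -e "$dev" ] || continue
        name=$(basename "$dev")
        # Append "name:1", inserting a comma only between entries.
        list="${list:+$list,}${name}:1"
    done
    printf '%s\n' "$list"
}
```

Usage: `export NVSHMEM_HCA_LIST=$(build_hca_list); echo "$NVSHMEM_HCA_LIST"` shows exactly which devices the pod exposes to NVSHMEM.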
Symptom
NVSHMEM initializes successfully (ibrc + IBGDA), creates teams, completes SHARP cleanup, but then hangs indefinitely without proceeding to cuIpcOpenMemHandle. All 4 nodes hang at the same point.
Key log lines:
Successfully initialized the transport: ibrc
Successfully initialized the transport: IBGDA
NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
Skipping NVLINK SHARP resource initialized for team ID: 1
Skipping NVLINK SHARP resource initialized for team ID: 2
Skipping NVLINK SHARP resource initialized for team ID: 4
# HANGS HERE - no further output
With PF devices on bare metal, the log continues with cuIpcOpenMemHandle and DeepGEMM compilation.
NVSHMEM Version
nvidia-nvshmem-cu12 3.3.20
Your platform details
- 4 nodes, each with 8x Hopper + 8x ConnectX-6 HCA (SR-IOV enabled, 8 VFs per HCA)
- Kubernetes pods with SR-IOV VF devices allocated
- sglang with DeepEP:
--moe-a2a-backend deepep --deepep-mode normal
- Same config works on bare metal with PF devices
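To confirm which devices in the pod are VFs rather than PFs, something like the following could be run on each node. It is a sketch that assumes the standard Linux PCI sysfs layout, where a virtual function exposes a `physfn` link under its device directory; `list_ib_functions` is a hypothetical helper name, and whether `physfn` is visible inside a given container depends on how sysfs is mounted.

```shell
# Hypothetical helper: print each InfiniBand device with "VF" or "not-VF".
# Assumption: a PCI virtual function has a "physfn" entry under <dev>/device.
list_ib_functions() {
    ib_dir="${1:-/sys/class/infiniband}"
    for dev in "$ib_dir"/*; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -e "$dev" ] || continue
        if [ -e "$dev/device/physfn" ]; then
            kind="VF"
        else
            kind="not-VF"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$kind"
    done
}
```

Running `list_ib_functions` in the failing pod and on the working bare-metal host gives a side-by-side view of the device types each environment hands to NVSHMEM.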
Error Message & Behavior
Does NVSHMEM (used by DeepEP) work with SR-IOV Virtual Function (VF) devices in Kubernetes? We're running sglang with --moe-a2a-backend deepep on 4 nodes with SR-IOV VF devices, and NVSHMEM hangs after initialization.