Glm5 cp benchmark data readme #1287

ltdo111 wants to merge 74 commits into jd-opensource:release/v0.9.0 from …
Conversation
- Signed-off-by: pengtao <pengtao.156@jd.com>
- …use recursive fetching. (jd-opensource#1144) Co-authored-by: Zhang Minchao <zhminchao@163.com>
- …ensor crash. (jd-opensource#1154) Signed-off-by: pengtao <pengtao.156@jd.com>
- Co-authored-by: chenxb002 <chenxb002@whu.edu.cn>
- …urce#1146) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- …jd-opensource#1170) Signed-off-by: Super User <panxuanyu1@huawei.com> Signed-off-by: fems14 <panxuanyu1@huawei.com> Co-authored-by: Super User <panxuanyu1@huawei.com> Co-authored-by: ext.wangqingyu17 <ext.wangqingyu17@jd.com>
Code Review
This pull request introduces significant changes to the codebase, including the addition of TileLang Ascend kernel development support, infrastructure for generative recommendation (REC) inference, and various refactorings to improve code quality and maintainability. Key updates include the introduction of ChatJsonParser for robust JSON handling, a new ResidualCache policy for DiT models, and the refactoring of ServiceImplFactory to support multiple serving modes. I have reviewed the changes and provided feedback on critical issues such as exception safety in destructors, code duplication in output target refreshing, and missing validation in communication logic. Please address these issues to ensure system stability and maintainability.
```cpp
// zero_page_ is not owned, don't delete it

unmap_and_release_virtual_mem(vaddr_, size_, page_size_);
cleanup_pages_and_vmem(vaddr_, size_, page_size_, mapping_);
```
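The review summary above flags exception safety in destructors. As a minimal sketch only (the class name, member types, and the locally defined placeholder for `cleanup_pages_and_vmem` are assumptions, not the PR's actual code), destructor-side cleanup can be guarded so a failure is reported instead of escaping:

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include <vector>

// Placeholder with the same shape as the PR's helper; illustrative only.
void cleanup_pages_and_vmem(void* /*vaddr*/, size_t /*size*/,
                            size_t /*page_size*/,
                            const std::vector<void*>& /*mapping*/) {}

class PhysicalPageMapperSketch {
 public:
  ~PhysicalPageMapperSketch() noexcept {
    // A destructor must not let exceptions escape; report and swallow instead.
    try {
      cleanup_pages_and_vmem(vaddr_, size_, page_size_, mapping_);
    } catch (const std::exception& e) {
      std::cerr << "cleanup failed in destructor: " << e.what() << '\n';
    } catch (...) {
      std::cerr << "cleanup failed in destructor with unknown error\n";
    }
    // zero_page_ is not owned here, so it is intentionally left alone.
  }

 private:
  void* vaddr_ = nullptr;
  size_t size_ = 0;
  size_t page_size_ = 0;
  std::vector<void*> mapping_;
  void* zero_page_ = nullptr;  // not owned
};
```

Marking the destructor `noexcept` makes the no-throw guarantee explicit; an exception escaping a destructor would otherwise terminate the process.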
```cpp
if (rec_type == RecType::kOneRec) {
  if (!sequence_groups_.empty()) {
    // OneRec REC batches are tracked via sequence_groups_, while output
    // target generation still walks sequences_. Refresh the flattened
    // sequence view on every step so token writeback stays aligned after
    // beam search expands or replaces the group-owned Sequence instances.
    refresh_sequences_from_groups();
  }
  if (FLAGS_enable_rec_prefill_only) {
    refresh_onerec_prefill_output_targets();
  } else {
    refresh_output_targets();
  }
}
```
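The summary also mentions duplication in output-target refreshing. A hedged sketch of pulling the dispatch above into a single helper (the class and member names here are illustrative stand-ins, not the PR's actual types):

```cpp
#include <vector>

// Illustrative only: the engine class, member names, and stub bodies are
// assumptions; the real methods live elsewhere in the PR.
class RecEngineSketch {
 public:
  void refresh_rec_output_targets(bool prefill_only) {
    // Keep the flattened sequence view in sync before regenerating targets.
    if (!sequence_groups_.empty()) {
      refresh_sequences_from_groups();
    }
    if (prefill_only) {
      refresh_onerec_prefill_output_targets();
    } else {
      refresh_output_targets();
    }
  }

 private:
  std::vector<int> sequence_groups_;  // stand-in for the real group type
  void refresh_sequences_from_groups() {}
  void refresh_onerec_prefill_output_targets() {}
  void refresh_output_targets() {}
};
```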
```cpp
CHECK(pg_ != nullptr) << "Process group is not initialized.";
CHECK(output.defined())
    << "Output of all_to_all_single function is not defined";
CHECK(input.defined())
    << "Input of all_to_all_single function is not defined";
if (input.is_complex()) {
  input = torch::view_as_real(input);
}
if (output.is_complex()) {
  output = torch::view_as_real(output);
}

auto opts = c10d::AllToAllOptions();
auto work = pg_->alltoall_base(
    output, input, output_split_sizes, input_split_sizes, opts);
if (async_op) {
  *async_work = work;
} else {
  work->wait();
}
```
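Since the summary calls out missing validation in the communication logic, here is a hedged sketch of a split-size check that could run before `alltoall_base`; the helper name and its placement are assumptions, not part of the PR:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

#include <glog/logging.h>
#include <torch/torch.h>

// Sketch only: verify that non-empty split-size vectors cover the first
// dimension of the corresponding tensors before launching the collective.
void check_all_to_all_splits(const torch::Tensor& input,
                             const torch::Tensor& output,
                             const std::vector<int64_t>& input_split_sizes,
                             const std::vector<int64_t>& output_split_sizes) {
  if (!input_split_sizes.empty()) {
    const int64_t in_sum = std::accumulate(
        input_split_sizes.begin(), input_split_sizes.end(), int64_t{0});
    CHECK_EQ(in_sum, input.size(0))
        << "input_split_sizes must sum to input.size(0)";
  }
  if (!output_split_sizes.empty()) {
    const int64_t out_sum = std::accumulate(
        output_split_sizes.begin(), output_split_sizes.end(), int64_t{0});
    CHECK_EQ(out_sum, output.size(0))
        << "output_split_sizes must sum to output.size(0)";
  }
}
```

Empty split vectors are skipped because the backend then splits evenly across ranks.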
```cpp
return std::make_unique<ProcessGroupImpl>(global_rank,
                                          local_rank,
                                          group_ranks,
                                          world_size,
                                          rank_size,
                                          port,
                                          host,
                                          group_name,
                                          device);
}
```
```cpp
if (FLAGS_enable_convert_tokens_to_item &&
    output.token_ids.size() == rec_token_size) {
  std::vector<int64_t> item_ids;
  const bool ok = tokenizer.decode(
      Slice<int32_t>{output.token_ids.data(), output.token_ids.size()},
      sequence_params_.skip_special_tokens,
      &item_ids);
  if (ok && !item_ids.empty()) {
    output.item_ids_list = normalize_rec_item_ids(item_ids, index_);
    if (!output.item_ids_list.empty()) {
      output.item_ids = output.item_ids_list.front();
    }
  }
}
```
The token-to-item conversion logic uses tokenizer.decode inside a loop, which can be inefficient if the tokenizer is not optimized for single-item decoding. Additionally, the normalize_rec_item_ids function performs shuffling and resizing, which might be expensive if called frequently. Consider optimizing the token-to-item mapping and normalization process.
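One possible direction, sketched below under the assumption that identical token sequences recur across the loop: memoize decode results so each distinct sequence is decoded only once. The `DecodeFn` signature, type aliases, and cache layout are illustrative, not the project's tokenizer API.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

using TokenIds = std::vector<int32_t>;
using ItemIds = std::vector<int64_t>;
using DecodeFn = std::function<bool(const TokenIds&, bool skip_special, ItemIds*)>;

// Sketch: cache decoded item ids keyed by the exact token sequence.
class DecodeCache {
 public:
  explicit DecodeCache(DecodeFn decode) : decode_(std::move(decode)) {}

  // Returns cached item ids when the same token sequence was decoded before,
  // otherwise decodes once and stores the result; nullptr on failure.
  const ItemIds* lookup_or_decode(const TokenIds& tokens, bool skip_special) {
    auto it = cache_.find(tokens);
    if (it != cache_.end()) {
      return &it->second;
    }
    ItemIds items;
    if (!decode_(tokens, skip_special, &items) || items.empty()) {
      return nullptr;
    }
    return &cache_.emplace(tokens, std::move(items)).first->second;
  }

 private:
  DecodeFn decode_;
  std::map<TokenIds, ItemIds> cache_;
};
```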
```cpp
if (!(params.use_beam_search && params.all_random_sample && params.logprobs &&
      params.max_top_logprobs > 0)) {
  return sampler_.forward(logits, params, filter_mask);
}

if (params.frequency_penalties.defined()) {
  apply_frequency_presence_penalties(logits,
                                     params.unique_token_ids,
                                     params.unique_token_counts,
                                     params.frequency_penalties,
                                     params.presence_penalties);
}

if (params.repetition_penalties.defined()) {
  apply_repetition_penalties(
      logits, params.unique_token_ids, params.repetition_penalties);
}

torch::Tensor sample_logits = logits;
torch::Tensor sample_temperatures = params.temperatures;
torch::Tensor sample_top_k = params.top_k;
torch::Tensor sample_top_p = params.top_p;
const bool use_sample_indices =
    params.selected_token_idxes.numel() != params.sample_idxes.numel();
if (use_sample_indices) {
  sample_logits = logits.index_select(/*dim=*/0, params.sample_idxes);
  if (params.temperatures.defined()) {
    sample_temperatures =
        params.temperatures.index_select(/*dim=*/0, params.sample_idxes);
  }
  if (params.top_k.defined()) {
    sample_top_k = params.top_k.index_select(/*dim=*/0, params.sample_idxes);
  }
  if (params.top_p.defined()) {
    sample_top_p = params.top_p.index_select(/*dim=*/0, params.sample_idxes);
  }
}

if (filter_mask.defined()) {
  CHECK_EQ(filter_mask.dim(), 2)
      << "filter_mask must be 2-D, dim=" << filter_mask.dim();
  CHECK_EQ(filter_mask.size(0), sample_logits.size(0))
      << "filter_mask batch mismatch, filter_mask.size(0)="
      << filter_mask.size(0)
      << ", sample_logits.size(0)=" << sample_logits.size(0);
  CHECK_EQ(filter_mask.size(1), sample_logits.size(1))
      << "filter_mask vocab mismatch, filter_mask.size(1)="
      << filter_mask.size(1)
      << ", sample_logits.size(1)=" << sample_logits.size(1);
  sample_logits = sample_logits + filter_mask;
}

apply_top_k_top_p(
    sample_logits, sample_temperatures, sample_top_k, sample_top_p);
if (use_sample_indices) {
  logits.index_copy_(/*dim=*/0, params.sample_idxes, sample_logits);
}

CHECK(params.do_sample.defined()) << "params.do_sample must be defined";
CHECK_EQ(params.do_sample.dim(), 1)
    << "params.do_sample must be 1D [num_seqs], got "
    << params.do_sample.sizes();
CHECK_EQ(sample_logits.size(0), params.do_sample.size(0));

SampleOutput output;
auto probs =
    torch::softmax(sample_logits, /*dim=*/-1, /*dtype=*/torch::kFloat32);
output.probs = probs.to(logits.dtype());
auto logprobs =
    torch::log_softmax(sample_logits, /*dim=*/-1, /*dtype=*/torch::kFloat32);

const int64_t vocab_size = probs.size(-1);
const int64_t top_count = std::min<int64_t>(params.max_top_logprobs,
                                            static_cast<int64_t>(vocab_size));
sample_top_candidates(
    probs, logprobs, top_count, &output.top_tokens, &output.top_logprobs);
output.next_tokens =
    output.top_tokens.select(/*dim=*/1, /*index=*/0).to(torch::kLong);
output.logprobs =
    output.top_logprobs.select(/*dim=*/1, /*index=*/0).contiguous();
return output;
}
```
The OneRecConstrainedSamplingStrategy::forward method performs extensive tensor operations, including index_select, index_copy_, and softmax on the device. These operations are performed for every step of the sampling process, which can significantly impact latency. Please optimize these operations, for example by minimizing tensor copies or using fused kernels where possible.
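As one concrete, hedged example of trimming work in this path (illustrative only, not the PR's code): the `softmax` and `log_softmax` over the full vocabulary can share a single normalization pass, since the probabilities are just the exponential of the log-probabilities.

```cpp
#include <utility>

#include <torch/torch.h>

// Sketch: compute log_softmax once and derive probs from it, instead of
// running softmax and log_softmax separately over the full vocabulary.
std::pair<torch::Tensor, torch::Tensor> probs_and_logprobs(
    const torch::Tensor& sample_logits) {
  auto logprobs = torch::log_softmax(sample_logits, /*dim=*/-1,
                                     /*dtype=*/torch::kFloat32);
  auto probs = logprobs.exp();  // reuse the normalized values
  return {probs, logprobs};
}
```

Whether this is worthwhile depends on the vocabulary size and how hot this path is per decoding step.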
```cpp
constexpr bool kHasVecType = std::is_same_v<scalar_t, float> ||
                             std::is_same_v<scalar_t, c10::Half> ||
                             std::is_same_v<scalar_t, c10::BFloat16>;

if constexpr (kHasVecType) {
  constexpr int32_t kVecWidth = VecType<scalar_t>::vec_width;
  if (numel_per_block % kVecWidth == 0) {
    const int64_t tiles_per_block =
        ceil_div<int64_t>(numel_per_block / kVecWidth, kThreadsPerBlock);
    const dim3 grid(num_layers, num_dst_blocks, tiles_per_block);
    block_copy_kernel<scalar_t, true>
        <<<grid, kThreadsPerBlock, 0, stream>>>(
            key_cache_ptrs.data_ptr<int64_t>(),
            value_cache_ptrs.data_ptr<int64_t>(),
            src_block_indices.data_ptr<int32_t>(),
            dst_block_indices.data_ptr<int32_t>(),
            cum_sum.data_ptr<int32_t>(),
            num_groups,
            numel_per_block);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return;
  }
}

const int64_t tiles_per_block =
    ceil_div<int64_t>(numel_per_block, kThreadsPerBlock);
const dim3 grid(num_layers, num_dst_blocks, tiles_per_block);
block_copy_kernel<scalar_t, false><<<grid, kThreadsPerBlock, 0, stream>>>(
    key_cache_ptrs.data_ptr<int64_t>(),
    value_cache_ptrs.data_ptr<int64_t>(),
    src_block_indices.data_ptr<int32_t>(),
    dst_block_indices.data_ptr<int32_t>(),
    cum_sum.data_ptr<int32_t>(),
    num_groups,
    numel_per_block);
C10_CUDA_KERNEL_LAUNCH_CHECK();
});
```
The block_copy launch goes through DISPATCH_FLOATING_TYPES, but the vectorized path only engages when numel_per_block is an exact multiple of VecType<scalar_t>::vec_width; otherwise the launch silently falls back to the scalar kernel, including for float32 tensors. This might lead to suboptimal performance. Please ensure the vectorized path is actually exercised for all supported types.
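A small, hedged visibility aid (the helper name and its call site are assumptions, not part of the PR) that would make an unexpected scalar fallback observable for float32 and the other vectorized dtypes:

```cpp
#include <cstdint>

#include <glog/logging.h>

// Sketch: returns whether the vectorized copy path applies and warns when a
// dtype that has a VecType still falls back to the scalar kernel because
// numel_per_block is not a multiple of the vector width.
inline bool can_use_vectorized_copy(int64_t numel_per_block, int vec_width) {
  const bool vectorized = (numel_per_block % vec_width == 0);
  LOG_IF(WARNING, !vectorized)
      << "block_copy falling back to scalar path: numel_per_block="
      << numel_per_block << " is not a multiple of vec_width=" << vec_width;
  return vectorized;
}
```

Calling something like this inside the `if constexpr (kHasVecType)` branch would surface misaligned block sizes during bring-up without changing the launch behavior.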
There seem to be many unrelated commits in this PR. Please rebase onto the latest base branch.