[TEST](Counter) counter test#62475

Open
jacktengg wants to merge 12 commits into apache:master from jacktengg:wt-counter-test

Conversation

@jacktengg
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 14, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@jacktengg
Contributor Author

run buildall

@jacktengg
Contributor Author

run buildall

@jacktengg
Contributor Author

run buildall

yiguolei and others added 10 commits April 15, 2026 11:38
…U timer

Problem Summary: Scanner::_cpu_watch (ThreadCpuStopWatch using CLOCK_THREAD_CPUTIME_ID)
was started via resume() on a scanner worker thread but read via pause() on the pipeline
task thread. Since CLOCK_THREAD_CPUTIME_ID is a per-thread CPU clock, reading it on a
different thread produces garbage/negative values, triggering the DCHECK:

  Check failed: _value.load() > -1L (-39943795 vs. -1) delta: -252570258

In _scanner_scan(), update_scanner_profile() (which calls pause()) was called only on
the EOS path. The non-EOS path left _cpu_watch running, and ScannerScheduler::submit()
later called pause() from the pipeline task thread.

Fix:
1. Always call update_scanner_profile() before push_back_scan_task() in _scanner_scan(),
   ensuring pause() runs on the scanner worker thread for both EOS and non-EOS paths.
2. Reinitialize _cpu_watch after reading in _update_scan_cpu_timer() so that any
   subsequent cross-thread pause() call in submit() safely reads 0.
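
The same-thread requirement described above can be illustrated with a minimal sketch. `ThreadCpuWatchSketch` and `run_scan_slice` are hypothetical names used only for illustration, not the Doris classes, and the sketch assumes a POSIX system where `clock_gettime` supports `CLOCK_THREAD_CPUTIME_ID`. It pairs resume() and pause() inside one function, so both clock reads are guaranteed to come from the calling thread.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical stand-in for a per-thread CPU stopwatch (not the Doris class).
class ThreadCpuWatchSketch {
public:
    void resume() {
        _start_ns = now_ns();
        _running = true;
    }

    // Must run on the same thread that called resume(): the clock below
    // reports CPU time consumed by the *calling* thread only.
    int64_t pause() {
        if (_running) {
            _total_ns += now_ns() - _start_ns;
            _running = false;
        }
        return _total_ns;
    }

private:
    static int64_t now_ns() {
        timespec ts {};
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        return static_cast<int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
    }

    int64_t _start_ns = 0;
    int64_t _total_ns = 0;
    bool _running = false;
};

// Hypothetical scan slice: the watch is always paused before the function
// returns, so no other thread ever needs to read the running clock.
int64_t run_scan_slice(ThreadCpuWatchSketch& watch) {
    watch.resume();
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < 100000; ++i) {
        sink = sink + i;  // placeholder CPU work
    }
    return watch.pause();
}
```

Because each slice pauses the watch before handing control back, the accumulated total can only grow; a later pause() from another thread finds the watch already stopped and performs no clock read.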

None

- Test: Manual test - verified the logic by code analysis
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem Summary: MonotonicStopWatch::elapsed_time() can return a small
negative value (e.g. -203 ns) due to rare CLOCK_MONOTONIC rollbacks.
When stop() accumulates this into _total_time and _fresh_profile_counter()
sets it on a RuntimeProfile::Counter, the DCHECK asserting value > -1
fires and crashes the process.

Stack trace:
  RuntimeProfile::Counter::set() at runtime_profile.h:222
  PipelineTask::close() at pipeline_task.cpp:925
  close_task() at task_scheduler.cpp:86

The fix clamps the running-case delta in elapsed_time() and
elapsed_time_seconds() to max(0, delta). Since stop() calls
elapsed_time() while _running is still true, _total_time can never
accumulate a negative value, preventing the crash for all downstream
callers.
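
A minimal sketch of the clamp, assuming timestamps are passed in explicitly so a rollback can be simulated. `ClampedStopWatchSketch` is a hypothetical class for illustration and does not reproduce the actual MonotonicStopWatch code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stopwatch that takes timestamps as arguments (illustration only).
class ClampedStopWatchSketch {
public:
    void start(int64_t now_ns) {
        _start_ns = now_ns;
        _running = true;
    }

    // Running case: clamp the delta so a clock rollback yields 0, not a
    // negative value. Stopped case: return the accumulated total.
    int64_t elapsed_time(int64_t now_ns) const {
        if (!_running) {
            return _total_ns;
        }
        return std::max<int64_t>(0, now_ns - _start_ns);
    }

    // stop() reads elapsed_time() while _running is still true, so the
    // accumulated total never receives a negative delta.
    void stop(int64_t now_ns) {
        if (_running) {
            _total_ns += elapsed_time(now_ns);
            _running = false;
        }
    }

    int64_t total() const { return _total_ns; }

private:
    int64_t _start_ns = 0;
    int64_t _total_ns = 0;
    bool _running = false;
};
```

With a simulated rollback of 203 ns (start at 1000, stop at 797), the total stays at 0 instead of becoming -203.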

None

- Test: No need to test (clock rollback is non-deterministic hardware behavior; the fix is a trivial arithmetic clamp)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Problem Summary: ParquetReader::_total_groups is declared as `size_t _total_groups;`
without a default member initializer. When init_reader() fails before reaching
the assignment `_total_groups = _t_metadata->row_groups.size()` (e.g., because
_open_file() fails), the field remains uninitialized. ASAN fills freshly
allocated memory with 0xBE, so _total_groups becomes 0xBEBEBEBEBEBEBEBE.

When _collect_profile() later reads _total_groups via COUNTER_UPDATE, this
garbage value is cast to int64_t (-4702111234474983746) and triggers:
  Check failed: _value.load() > -1L (-4702111234474983746 vs. -1)
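
The arithmetic and the fix can be checked with a short sketch. `ReaderSketch` and `as_signed` are hypothetical names for illustration; only the fill pattern and the resulting signed value come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reinterpret a raw 64-bit pattern as the signed value a counter would see.
inline int64_t as_signed(uint64_t raw) {
    return static_cast<int64_t>(raw);
}

// Hypothetical reader (illustration only, not ParquetReader).
struct ReaderSketch {
    // Default member initializer: the field is well defined even if
    // init() fails before reaching the assignment below.
    size_t total_groups = 0;

    bool init(bool open_ok, size_t groups_in_file) {
        if (!open_ok) {
            return false;  // early failure, assignment never runs
        }
        total_groups = groups_in_file;
        return true;
    }
};
```

Here `as_signed(0xBEBEBEBEBEBEBEBEULL)` evaluates to -4702111234474983746, the value in the DCHECK message, while a `ReaderSketch` whose init() fails still reports `total_groups == 0`.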

### Release note

None

### Check List (For Author)

- Test: No need to test - trivial default member initializer addition
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation

### What problem does this PR solve?

Problem Summary: MemoryReclamation::revoke_tasks_memory() updates a
freed_memory_counter with current_memory_bytes() from task memory trackers.
Due to concurrent batched memory tracking, current_memory_bytes() can return
small negative values (e.g., -96 bytes). This negative delta triggers the
DCHECK in Counter::update(): Check failed: _value.load() > -1L (-96 vs. -1).

The fix clamps current_memory_bytes() to std::max(int64_t(0), ...) at both
COUNTER_UPDATE sites in revoke_tasks_memory(), since freed_memory is a
logically non-negative quantity and slightly negative tracker consumption
indicates no reclaimable memory.
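
As a rough sketch of the clamp at an update site: `CounterSketch` and `add_freed_memory` below are hypothetical stand-ins, not the Doris counter or the COUNTER_UPDATE macro, and the tracker reading is passed in as a plain integer.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical profiling counter (illustration only).
struct CounterSketch {
    std::atomic<int64_t> value {0};

    void update(int64_t delta) { value.fetch_add(delta, std::memory_order_relaxed); }
};

// Clamp the tracker reading before updating the counter: a slightly
// negative reading is treated as "nothing reclaimable" rather than
// being subtracted from the freed-memory total.
inline int64_t add_freed_memory(CounterSketch& counter, int64_t tracker_bytes) {
    counter.update(std::max(int64_t(0), tracker_bytes));
    return counter.value.load(std::memory_order_relaxed);
}
```

A reading of -96 bytes leaves the counter unchanged, while a reading of 4096 bytes adds 4096.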

### Release note

None

### Check List (For Author)

- Test: No need to test - memory tracker going slightly negative is a known
  transient condition from concurrent batched tracking; the fix is a trivial
  clamp on a profiling counter
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Problem Summary: OrcReader has three members without default initializers:
- _decimal_scale_params_index (size_t): used as an index into
  _decimal_scale_params vector during decimal column decoding. Although
  currently reset to 0 at the start of each batch read, leaving it
  uninitialized is a latent bug if code paths change.
- _orc_once_max_read_bytes (int64_t): used in _create_file_reader() to
  configure ORCFileInputStream read buffer size.
- _orc_max_merge_distance_bytes (int64_t): used in _create_file_reader()
  to configure ORCFileInputStream merge distance.

All three follow the same pattern as the ParquetReader _total_groups bug:
members that are assigned during init_reader() but could be read while
still uninitialized if an error occurs before the assignment.

### Release note

None

### Check List (For Author)

- Test: No need to test - trivial default member initializer additions
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jacktengg
Contributor Author

run buildall

@jacktengg
Contributor Author

/review

Contributor

@github-actions github-actions bot left a comment

Findings:

  1. be/src/common/config.cpp: changing enable_task_executor_in_external_table default to false is an unrelated behavior change that disables the task-executor path for all external table scans.
  2. be/src/common/daemon.cpp: the new counter-check suppression is thread-local and only applied on memory_maintenance_thread(). /profile refreshes the same process-memory counters on HTTP threads, so debug BEs can still hit the new DCHECK there.

Critical checkpoint conclusions:

  • Goal / correctness: Partially achieved. The scanner stopwatch and uninitialized-field fixes make sense, but the process-profile path is still incomplete and the config default change introduces a regression.
  • Minimal / focused: No. The config default flip is unrelated to the stated bugfixes.
  • Concurrency: The scanner CPU timer change keeps pause() on the worker thread and I did not find a new lock-order issue in the touched scan path.
  • Lifecycle / static init: No cross-TU static-init issue found; enable_profile_counter_check being thread_local is the problem because the workaround only affects one thread.
  • Config changes: Yes. One config default changes user-visible behavior without justification or coverage.
  • Parallel code paths: Incomplete. /profile -> ProcessProfile::refresh_profile() still reaches the new counter checks outside the daemon thread.
  • Special conditions / invariants: The new generic non-negative assertion is not valid for every sampled process-memory counter path.
  • Tests: No regression coverage was added for the remaining profile-refresh path or for the external-scan scheduler behavior change.
  • Test results: The PR only claims manual analysis / no-need-to-test, so there is no automated proof that all crash paths are closed.
  • Observability: Sufficient for the touched code.
  • Transaction / persistence: Not applicable.
  • Data write / atomicity: Not applicable.
  • FE/BE variable passing: Not applicable.
  • Performance: No blocking issue in the scanner fix itself, but the config flip has broader runtime impact.
  • Other issues: None beyond the blockers above.

Comment thread be/src/common/config.cpp
  DEFINE_Bool(enable_task_executor_in_internal_table, "true");
  // Enable task executor in external table scan.
- DEFINE_Bool(enable_task_executor_in_external_table, "true");
+ DEFINE_Bool(enable_task_executor_in_external_table, "false");
Contributor

This flips the default external-table scan scheduler from TaskExecutorSimplifiedScanScheduler to ThreadPoolSimplifiedScanScheduler for every workload group (be/src/runtime/workload_group/workload_group.cpp, lines 580-588), but none of the scanner/counter fixes in this PR depend on that behavior anymore. Merging this silently disables the task-executor path for all external scans without any justification or coverage in the PR.

Suggested change
- DEFINE_Bool(enable_task_executor_in_external_table, "false");
+ DEFINE_Bool(enable_task_executor_in_external_table, "true");

Comment thread be/src/common/daemon.cpp
  }

  void Daemon::memory_maintenance_thread() {
+     doris::enable_profile_counter_check = 0;
Contributor

Disabling the check only on memory_maintenance_thread() does not cover the other path that refreshes the same process-memory counters: /profile calls ProcessProfile::refresh_profile() from an HTTP worker thread (be/src/service/http/default_path_handlers.cpp, line 192), which reaches MemoryProfile::refresh_memory_overview_profile(). That code sets UntrackedMemory = VmRSS - all_tracked_mem_sum, and there is no invariant that sampled RSS is always greater than or equal to tracked bytes. In debug builds the new HighWaterMarkCounter::set() DCHECK will still fire on that path, so this line only hides the crash on one thread instead of fixing the generic issue.
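
The thread-local scoping issue can be shown with a small sketch. `profile_counter_check_enabled` and `flag_seen_by_other_thread` are hypothetical names standing in for the flag and the two code paths; the sketch only assumes, as the review states, that the flag is `thread_local`.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for a thread_local check flag (illustration only).
// Every thread gets its own copy, initialized to 1 (check enabled).
thread_local int profile_counter_check_enabled = 1;

// Disable the check on the calling thread, then report what a different
// thread (e.g. an HTTP worker refreshing the same counters) observes.
int flag_seen_by_other_thread() {
    profile_counter_check_enabled = 0;  // affects the calling thread only

    int seen = -1;
    std::thread other([&seen] { seen = profile_counter_check_enabled; });
    other.join();
    return seen;
}
```

The other thread still sees the flag enabled, so an assertion guarded by it would continue to fire there; suppressing the check per thread does not address a path reachable from several threads.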

@jacktengg
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 73.91% (68/92) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.11% (20180/37996) |
| Line Coverage | 36.68% (190122/518301) |
| Region Coverage | 32.94% (147660/448251) |
| Branch Coverage | 34.06% (64603/189688) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 85.12% (103/121) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.66% (27405/37207) |
| Line Coverage | 57.30% (296099/516736) |
| Region Coverage | 54.57% (246945/452502) |
| Branch Coverage | 56.19% (106916/190284) |
