Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
9007a2e to
67273e9
Compare
|
run buildall |
1 similar comment
|
run buildall |
…U timer Problem Summary: Scanner::_cpu_watch (ThreadCpuStopWatch using CLOCK_THREAD_CPUTIME_ID) was started via resume() on a scanner worker thread but read via pause() on the pipeline task thread. Since CLOCK_THREAD_CPUTIME_ID is a per-thread CPU clock, reading it on a different thread produces garbage/negative values, triggering the DCHECK: Check failed: _value.load() > -1L (-39943795 vs. -1) delta: -252570258 In the non-EOS path of _scanner_scan(), update_scanner_profile() (which calls pause()) was only called for the EOS path. The non-EOS path left _cpu_watch running and later ScannerScheduler::submit() called pause() from the pipeline task thread. Fix: 1. Always call update_scanner_profile() before push_back_scan_task() in _scanner_scan(), ensuring pause() runs on the scanner worker thread for both EOS and non-EOS paths. 2. Reinitialize _cpu_watch after reading in _update_scan_cpu_timer() so that any subsequent cross-thread pause() call in submit() safely reads 0. None - Test: Manual test - verified the logic by code analysis - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem Summary: MonotonicStopWatch::elapsed_time() can return a small negative value (e.g. -203 ns) due to rare CLOCK_MONOTONIC rollbacks. When stop() accumulates this into _total_time and _fresh_profile_counter() sets it on a RuntimeProfile::Counter, the DCHECK asserting value > -1 fires and crashes the process. Stack trace: RuntimeProfile::Counter::set() at runtime_profile.h:222 PipelineTask::close() at pipeline_task.cpp:925 close_task() at task_scheduler.cpp:86 The fix clamps the running-case delta in elapsed_time() and elapsed_time_seconds() to max(0, delta). Since stop() calls elapsed_time() while _running is still true, _total_time can never accumulate a negative value, preventing the crash for all downstream callers. None - Test: No need to test (clock rollback is non-deterministic hardware behavior; the fix is a trivial arithmetic clamp) - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve? Problem Summary: ParquetReader::_total_groups is declared as `size_t _total_groups;` without a default member initializer. When init_reader() fails before reaching the assignment `_total_groups = _t_metadata->row_groups.size()` (e.g., because _open_file() fails), the field remains uninitialized. ASAN fills freshly allocated memory with 0xBE, so _total_groups becomes 0xBEBEBEBEBEBEBEBE. When _collect_profile() later reads _total_groups via COUNTER_UPDATE, this garbage value is cast to int64_t (-4702111234474983746) and triggers: Check failed: _value.load() > -1L (-4702111234474983746 vs. -1) ### Release note None ### Check List (For Author) - Test: No need to test - trivial default member initializer addition - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation ### What problem does this PR solve? Problem Summary: MemoryReclamation::revoke_tasks_memory() updates a freed_memory_counter with current_memory_bytes() from task memory trackers. Due to concurrent batched memory tracking, current_memory_bytes() can return small negative values (e.g., -96 bytes). This negative delta triggers the DCHECK in Counter::update(): Check failed: _value.load() > -1L (-96 vs. -1). The fix clamps current_memory_bytes() to std::max(int64_t(0), ...) at both COUNTER_UPDATE sites in revoke_tasks_memory(), since freed_memory is a logically non-negative quantity and slightly negative tracker consumption indicates no reclaimable memory. ### Release note None ### Check List (For Author) - Test: No need to test - memory tracker going slightly negative is a known transient condition from concurrent batched tracking; the fix is a trivial clamp on a profiling counter - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve? Problem Summary: OrcReader has three members without default initializers: - _decimal_scale_params_index (size_t): used as an index into _decimal_scale_params vector during decimal column decoding. Although currently reset to 0 at the start of each batch read, leaving it uninitialized is a latent bug if code paths change. - _orc_once_max_read_bytes (int64_t): used in _create_file_reader() to configure ORCFileInputStream read buffer size. - _orc_max_merge_distance_bytes (int64_t): used in _create_file_reader() to configure ORCFileInputStream merge distance. All three follow the same pattern as the ParquetReader _total_groups bug: members that are assigned during init_reader() but could be read while still uninitialized if an error occurs before the assignment. ### Release note None ### Check List (For Author) - Test: No need to test - trivial default member initializer additions - Behavior changed: No - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
d479d3a to
8add74a
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Findings:
be/src/common/config.cpp: changingenable_task_executor_in_external_tabledefault tofalseis an unrelated behavior change that disables the task-executor path for all external table scans.be/src/common/daemon.cpp: the new counter-check suppression is thread-local and only applied onmemory_maintenance_thread()./profilerefreshes the same process-memory counters on HTTP threads, so debug BEs can still hit the new DCHECK there.
Critical checkpoint conclusions:
- Goal / correctness: Partially achieved. The scanner stopwatch and uninitialized-field fixes make sense, but the process-profile path is still incomplete and the config default change introduces a regression.
- Minimal / focused: No. The config default flip is unrelated to the stated bugfixes.
- Concurrency: The scanner CPU timer change keeps
pause()on the worker thread and I did not find a new lock-order issue in the touched scan path. - Lifecycle / static init: No cross-TU static-init issue found;
enable_profile_counter_checkbeingthread_localis the problem because the workaround only affects one thread. - Config changes: Yes. One config default changes user-visible behavior without justification or coverage.
- Parallel code paths: Incomplete.
/profile->ProcessProfile::refresh_profile()still reaches the new counter checks outside the daemon thread. - Special conditions / invariants: The new generic non-negative assertion is not valid for every sampled process-memory counter path.
- Tests: No regression coverage was added for the remaining profile-refresh path or for the external-scan scheduler behavior change.
- Test results: The PR only claims manual analysis / no-need-to-test, so there is no automated proof that all crash paths are closed.
- Observability: Sufficient for the touched code.
- Transaction / persistence: Not applicable.
- Data write / atomicity: Not applicable.
- FE/BE variable passing: Not applicable.
- Performance: No blocking issue in the scanner fix itself, but the config flip has broader runtime impact.
- Other issues: None beyond the blockers above.
| DEFINE_Bool(enable_task_executor_in_internal_table, "true"); | ||
| // Enable task executor in external table scan. | ||
| DEFINE_Bool(enable_task_executor_in_external_table, "true"); | ||
| DEFINE_Bool(enable_task_executor_in_external_table, "false"); |
There was a problem hiding this comment.
This flips the default external-table scan scheduler from TaskExecutorSimplifiedScanScheduler to ThreadPoolSimplifiedScanScheduler for every workload group (be/src/runtime/workload_group/workload_group.cpp, lines 580-588), but none of the scanner/counter fixes in this PR depend on that behavior anymore. Merging this silently disables the task-executor path for all external scans without any justification or coverage in the PR.
| DEFINE_Bool(enable_task_executor_in_external_table, "false"); | |
| DEFINE_Bool(enable_task_executor_in_external_table, "true"); |
| } | ||
|
|
||
| void Daemon::memory_maintenance_thread() { | ||
| doris::enable_profile_counter_check = 0; |
There was a problem hiding this comment.
Disabling the check only on memory_maintenance_thread() does not cover the other path that refreshes the same process-memory counters: /profile calls ProcessProfile::refresh_profile() from an HTTP worker thread (be/src/service/http/default_path_handlers.cpp, line 192), which reaches MemoryProfile::refresh_memory_overview_profile(). That code sets UntrackedMemory = VmRSS - all_tracked_mem_sum, and there is no invariant that sampled RSS is always greater than or equal to tracked bytes. In debug builds the new HighWaterMarkCounter::set() DCHECK will still fire on that path, so this line only hides the crash on one thread instead of fixing the generic issue.
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)