[CONTP-1758] Improve DatadogGenericResource reconciliation at scale #3143
tbavelier wants to merge 12 commits
Codecov Report
@@ Coverage Diff @@
## main #3143 +/- ##
==========================================
+ Coverage 43.79% 44.73% +0.93%
==========================================
Files 375 377 +2
Lines 30575 31508 +933
==========================================
+ Hits 13390 14094 +704
- Misses 16276 16489 +213
- Partials 909 925 +16
drichards-87 left a comment:
Left some suggestions from Docs and approved the PR.
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
What does this PR do?
Improves `DatadogGenericResource` reconciliation behavior at high resource counts:
- `DatadogGenericResource` controller concurrency with `--datadogGenericResourceMaxConcurrentReconciles`
- `--datadogGenericResourceRequeuePeriod` and `DD_GENERIC_RESOURCE_REQUEUE_PERIOD`
Motivation
At high CR counts, periodic status polling can dominate the controller queue and delay user-facing create/update/delete operations. This keeps DDGR status polling tunable and lower priority while leaving normal reconciliation and backend error retries at regular priority.
Should help in situations like #2816
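The prioritization idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the operator's or controller-runtime's implementation: the `queue` type, the `priorityNormal` and `priorityLow` constants, and the item names are all hypothetical, and `-100` is chosen only to mirror the `priority="-100"` label quoted in this PR. It shows that a single normal-priority request is served before a backlog of low-priority status polls.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Hypothetical priority levels for this sketch only.
const (
	priorityNormal = 0
	priorityLow    = -100
)

// item is one queued reconcile request.
type item struct {
	name     string
	priority int
	seq      int // insertion order, keeps FIFO order within one priority
}

// itemHeap implements heap.Interface: higher priority first, then FIFO.
type itemHeap []item

func (h itemHeap) Len() int { return len(h) }
func (h itemHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h itemHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *itemHeap) Push(x any)   { *h = append(*h, x.(item)) }
func (h *itemHeap) Pop() any {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// queue wraps the heap and assigns sequence numbers on insertion.
type queue struct {
	items itemHeap
	next  int
}

// Add enqueues a request at the given priority.
func (q *queue) Add(name string, priority int) {
	heap.Push(&q.items, item{name: name, priority: priority, seq: q.next})
	q.next++
}

// Get returns the next request to process, or false if the queue is empty.
func (q *queue) Get() (string, bool) {
	if q.items.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.items).(item).name, true
}

func main() {
	q := &queue{}
	// Simulate a backlog of periodic status polls.
	for i := 0; i < 1000; i++ {
		q.Add(fmt.Sprintf("status-poll-%d", i), priorityLow)
	}
	// A user-facing create arrives after the backlog has built up.
	q.Add("new-ddgr", priorityNormal)

	first, _ := q.Get()
	second, _ := q.Get()
	fmt.Println(first)  // new-ddgr
	fmt.Println(second) // status-poll-0
}
```

With a plain FIFO queue, the same `new-ddgr` request would wait behind all 1000 polls, which is the delay the Motivation describes.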
Additional Notes
The low-priority path is limited to DDGR `refreshState` reconciles, which currently applies to resource types that fetch backend state, such as monitors and SLOs. It intentionally does not lower the priority of create/update/delete retries.
Validated manually that `workqueue_depth{controller="DatadogGenericResource", priority="-100"}` exposes the low-priority queue depth for status polling requeues. Created 1k DDGRs with 10 concurrent reconciles, then changed to 1 concurrent reconcile to accumulate a backlog in the low-priority queue for status updates. Created an additional DDGR and verified that it was handled ahead of the backlog (synced with backend).
Minimum Agent Versions
No minimum Agent or Cluster Agent version changes.
Describe your test plan
Two things to test "separately":
- `--datadogGenericResourceMaxConcurrentReconciles=10`
- `datadogGenericResourceMaxConcurrentReconciles` back to 1 (can remove it since 1 is the default) and wait for a minute or two while the new operator pod acquires the lease (or delete the lease directly and restart the pod afterwards)
- `k apply -f examples/datadoggenericresource/dashboard-sample.yaml`
- `k delete ddgr ddgr-dashboard-sample`: once again, ensure it is indeed deleted
- `--datadogGenericResourceMaxConcurrentReconciles=10` and clean up all the resources (again with a script/something)
- `DD_GENERIC_RESOURCE_REQUEUE_PERIOD` env var set to `5m`
- `foo`)
{"level":"ERROR","ts":"2026-06-16T09:59:18.652Z","logger":"controllers.DatadogGenericResource","msg":"Invalid value for generic resource requeue period. Defaulting to 60 seconds.","error":"time: invalid duration \"foo\"","stacktrace":"github.com/DataDog/datadog-operator/internal/controller/datadoggenericresource.requeuePeriodFromEnv\n\t/workspace/internal/controller/datadoggenericresource/controller.go:91\ngithub.com/DataDog/datadog-operator/internal/controller/datadoggenericresource.requeuePeriod\n\t/workspace/internal/controller/datadoggenericresource/controller.go:82\ngithub.com/DataDog/datadog-operator/internal/controller/datadoggenericresource.NewReconciler\n\t/workspace/internal/controller/datadoggenericresource/controller.go:69\ngithub.com/DataDog/datadog-operator/internal/controller.(*DatadogGenericResourceReconciler).SetupWithManager\n\t/workspace/internal/controller/datadoggenericresource_controller.go:52\ngithub.com/DataDog/datadog-operator/internal/controller.startDatadogGenericResource\n\t/workspace/internal/controller/setup.go:234\ngithub.com/DataDog/datadog-operator/internal/controller.SetupControllers\n\t/workspace/internal/controller/setup.go:100\nmain.run\n\t/workspace/cmd/main.go:413\nmain.main\n\t/workspace/cmd/main.go:223\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:285"}
Checklist
- `bug`, `enhancement`, `refactoring`, `documentation`, `tooling`, and/or `dependencies`
- `qa/skip-qa` label