This task, as written, is not very well defined. How do we define "too many"? By what pybal says? I am against tying fundamental conftool actions to the logic of a specific piece of software; we can, however, add warnings to the pool/depool scripts if we really want to.
There are also a few more complications to consider:
- Do we want the depool to still proceed or fail?
- Do we want to just consider the depool_threshold of pybal, or some form of "cluster should be larger than X", which we've never codified anywhere?
In any case, I would consider this quite low-priority, given that most services are now on k8s, where this is mostly irrelevant.
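To make the two open questions concrete, here is a minimal sketch of what such a check could look like. Everything in it is hypothetical: `check_depool`, `CheckResult`, `min_pooled` and `enforce` are illustrative names, not part of conftool or pybal. It also assumes that depool_threshold means "the fraction of the pool that must stay pooled", which should be verified against pybal before relying on it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    allowed: bool   # whether the depool may proceed
    message: str    # "ok", or a description of what looks wrong


def check_depool(
    pooled: int,
    total: int,
    depool_threshold: Optional[float] = None,  # assumed: fraction that must stay pooled
    min_pooled: Optional[int] = None,          # hypothetical "cluster larger than X" rule
    enforce: bool = False,                     # False: warn only; True: refuse the depool
) -> CheckResult:
    """Evaluate depooling one more server from a pool."""
    remaining = pooled - 1
    problems = []
    if depool_threshold is not None and total > 0:
        if remaining / total < depool_threshold:
            problems.append(
                f"{remaining}/{total} pooled would be below threshold {depool_threshold}"
            )
    if min_pooled is not None and remaining < min_pooled:
        problems.append(f"{remaining} pooled would be below minimum {min_pooled}")
    if not problems:
        return CheckResult(True, "ok")
    # In warn-only mode the action is still allowed, but the message is kept.
    return CheckResult(not enforce, "; ".join(problems))
```

With `enforce=False` the caller would print the message as a warning and carry on; with `enforce=True` it would abort. That keeps the "proceed or fail" decision a policy switch rather than something baked into the check itself.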
On further thought:
- confctl is considered a low-level Swiss Army knife that should be allowed to modify everything, regardless of operational constraints. I don't think it should refuse a write action by itself.
- For new, more complex stuff, we added extensions to conftool. Maybe we can create an lbctl extension that does everything the pool/depool/safe_restart scripts in puppet do, only better. That extension should refuse to act if something looks iffy.
- lbctl should also allow us to check the status of the pool and export it to a Prometheus-format file, so that we can solve T245058 as well.
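As an illustration of the export idea, the sketch below renders pool status in the Prometheus text exposition format and writes it to a file atomically, so that a scraper never reads a half-written file. The metric names (`lbctl_pool_servers_pooled`, `lbctl_pool_servers_total`), the function names and the shape of the input are all assumptions made for the example; lbctl does not exist yet.

```python
import os
import tempfile
from typing import Dict, Tuple


def render_pool_metrics(pools: Dict[str, Tuple[int, int]]) -> str:
    """Render {pool_name: (pooled, total)} as Prometheus text format.

    Pool names are assumed to need no label escaping (no quotes,
    backslashes or newlines).
    """
    lines = [
        "# HELP lbctl_pool_servers_pooled Servers currently pooled (hypothetical metric).",
        "# TYPE lbctl_pool_servers_pooled gauge",
    ]
    for name in sorted(pools):
        lines.append(f'lbctl_pool_servers_pooled{{pool="{name}"}} {pools[name][0]}')
    lines += [
        "# HELP lbctl_pool_servers_total Servers configured in the pool (hypothetical metric).",
        "# TYPE lbctl_pool_servers_total gauge",
    ]
    for name in sorted(pools):
        lines.append(f'lbctl_pool_servers_total{{pool="{name}"}} {pools[name][1]}')
    return "\n".join(lines) + "\n"


def write_metrics(path: str, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
```

A caller would do something like `write_metrics(path, render_pool_metrics({"appserver": (38, 40)}))`, where the pool name and numbers are made up, and `path` points to wherever the metrics file is collected from.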
The idea of creating something called lbctl also came up while discussing the implementation of the anti-stampede mechanism for Liberica (T332027), in relation to the very same issue that this task aims to fix.
@Joe: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!
Removing the Sustainability (Incident Followup) tag.
If this is still a solution to a possible incident root cause, please add the tag back and consider raising the task priority.