
depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled
Open, Low, Public

Event Timeline

jbond triaged this task as Medium priority. Feb 13 2020, 11:46 AM
akosiaris subscribed.

Triaging to serviceops because of conftool

Joe lowered the priority of this task from Medium to Low. Mar 21 2023, 8:34 AM
Joe subscribed.

This task as written is not very well defined. How do we define "too many"? Via what pybal says? I am against tying fundamental conftool actions to the logic of one specific piece of software; we can, however, add warnings to the pool/depool scripts if we really want to.

There are a few more complications to consider:

  • Do we want the depool to still proceed or fail?
  • Do we want to just consider pybal's depool_threshold, or some form of "cluster should be larger than X" rule, which we've never codified anywhere? (A sketch of such a check follows this list.)
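
A minimal sketch of what such a pre-flight check could look like in a pool/depool wrapper script, assuming a simple mapping of node name to pooled state and a pybal-style depool threshold. The function name, the data structure, and the threshold value are illustrative assumptions, not the real conftool or pybal API:

```python
# Hypothetical pre-flight check a depool wrapper could run before acting.
# The pool-state dict and depool_threshold are assumptions for illustration.

def check_depool_safety(nodes, to_depool, depool_threshold=0.5):
    """Refuse (or warn about) a depool that would leave fewer than
    depool_threshold of the cluster's nodes pooled."""
    pooled = [n for n, state in nodes.items() if state == "yes"]
    remaining = len(pooled) - len([n for n in to_depool if nodes.get(n) == "yes"])
    min_pooled = int(len(nodes) * depool_threshold)
    if remaining < min_pooled:
        raise RuntimeError(
            f"refusing depool: only {remaining}/{len(nodes)} nodes would stay "
            f"pooled, below the threshold of {min_pooled}"
        )
    if remaining == min_pooled:
        print(f"warning: depool leaves the cluster at its threshold ({remaining} pooled)")


# Example: a 4-node service where one node is already depooled.
pool_state = {"mw1001": "yes", "mw1002": "yes", "mw1003": "no", "mw1004": "yes"}
check_depool_safety(pool_state, to_depool=["mw1002"])  # proceeds, but warns at the threshold
```

Whether the check should hard-fail or only warn is exactly the open question above; the sketch does both, raising below the threshold and warning at it.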

I would anyway consider this quite low priority, given that most services are now on k8s, where this is mostly irrelevant.

On further thought:

  • confctl is considered a low-level Swiss army knife that should be allowed to modify everything, regardless of operational constraints. I don't think it should refuse a write action by itself.
  • For newer, more complex functionality we added extensions to conftool. Maybe we can create an lbctl extension that does everything the pool/depool/safe_restart scripts in puppet do, but better. That extension should refuse to act if something looks iffy.
  • lbctl should also allow us to check the status of the pool and export it to a Prometheus-format file, so that we can solve T245058 as well (see the export sketch after this list).
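
For the export half, a hedged sketch of dumping pool state in the Prometheus textfile-collector format, which node_exporter can pick up. The metric name, labels, output path, and sample pool data are illustrative assumptions, not part of conftool or any existing lbctl:

```python
# Hypothetical export of pool state as a Prometheus textfile-collector file.
# Metric name, labels, and the default path are assumptions for illustration.

def write_pool_metrics(pool_state, service, path="pool_status.prom"):
    lines = [
        "# HELP service_node_pooled Whether a backend node is pooled (1) or depooled (0).",
        "# TYPE service_node_pooled gauge",
    ]
    for node, state in sorted(pool_state.items()):
        value = 1 if state == "yes" else 0
        lines.append(f'service_node_pooled{{service="{service}",node="{node}"}} {value}')
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


# Example with sample data; a real deployment would point `path` at the
# node_exporter textfile collector directory.
write_pool_metrics({"mw1001": "yes", "mw1002": "yes", "mw1003": "no"}, service="appserver")
```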

The idea of creating something called lbctl also came up while discussing the anti-stampede mechanism implementation for Liberica (T332027), in relation to the very same issue this task aims to fix.