Add --min-uptime to cookbooks
Open, Low, Public

Description

In 1211089 we introduced the --min-uptime flag, which allows filtering hosts based on the uptime of a given service. This functionality should be useful when we want to resume rebooting/restarting operations in large host clusters, effectively skipping hosts that have already been restarted.
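A minimal sketch of the filtering idea, not the code from 1211089: it assumes uptimes have already been collected as seconds per host, and that the flag accepts short durations like the `1d` used later in this thread. The names `parse_duration` and `filter_by_min_uptime`, the unit suffixes, and the hostnames are hypothetical.

```python
import re
from datetime import timedelta
from typing import Dict, List

# Hypothetical unit suffixes; the real flag may accept a different format.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Turn a string such as '1d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def filter_by_min_uptime(uptimes: Dict[str, float], min_uptime: timedelta) -> List[str]:
    """Keep hosts whose uptime (in seconds) is at least min_uptime."""
    threshold = min_uptime.total_seconds()
    return sorted(host for host, seconds in uptimes.items() if seconds >= threshold)


# Hypothetical hosts: one restarted minutes ago, one up for more than a day.
uptimes = {"host1001": 600.0, "host1002": 90000.0}
print(filter_by_min_uptime(uptimes, parse_duration("1d")))
```

With a one-day threshold only the host that has been up for more than a day is kept, so the recently restarted one is skipped when the operation resumes.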

It would be great if we generalised this functionality and made it available to all cookbooks.

Additional features would be:

  • find and act on X servers with the longest uptime
  • print which servers we are acting on before rebooting

Event Timeline

jijiki renamed this task from Add --min-uptime forcookbooks to Add --min-uptime too cookbooks. Mar 13 2026, 11:25 AM
Aklapper renamed this task from Add --min-uptime too cookbooks to Add --min-uptime to cookbooks. Mar 14 2026, 3:10 PM
Ajuanca subscribed.

I'd love to take a swing at implementing this!

Just to confirm the approach before I start: the plan is basically to extract the --min-uptime argument, time-parsing, and filtering logic introduced in 1211089 and move it up into SREBatchBase and SREBatchRunnerBase.

To make it truly generic, I plan to add an optional argument (e.g., --uptime-service) so the framework can either check the uptime of a specific systemd service, or just fall back to standard OS uptime if no service is specified. Then I'll clean up the memcached cookbook so it inherits this new base logic.

Does this approach sound good? If so, I can assign the task to myself and get started.

Okay, but is the task what I described, i.e. moving the logic of that commit into the two classes SREBatchBase and SREBatchRunnerBase? Maybe the --uptime-service proposal can be done in a separate task? If this sounds good to you, I could happily resolve it.

The tips you linked specifically state that I can assign it to myself without asking, but I want to make sure it is well defined. My main doubt is about the --uptime-service suggestion (whether I can do it in this task or need to create another one).

I was also wondering about a resumable rolling reboot feature for cookbooks and found this task, and of course I'm +1! The way I currently understand the feature is the following:

  • rolling reboot starts at t0
  • user runs reboot-hosts --min-uptime 1d at t0, some reboots are done
  • a few days pass
  • not sure what the next invocation of --min-uptime should be?

What I thought about is having the user pass in a timestamp of when the rolling reboots started: any host with a boot time before that timestamp still needs a reboot, and any host that booted after it has already been done. What do you think?

I think usability-wise it might be more helpful to have an argument that takes the date and time after which a reboot is expected. So something like --not-rebooted-since "2026-03-13 10:47:00Z" for {T419960} (maybe bonus points for allowing a Phab task ID and using its creation date?).
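To make the absolute-cutoff idea concrete, here is a rough sketch, assuming each host's last boot time is already available as a timezone-aware datetime. The function names and hostnames are hypothetical, and treating a timestamp without an offset as UTC is an assumption of this sketch, not something settled in this task.

```python
from datetime import datetime, timezone
from typing import Dict, List


def parse_cutoff(text: str) -> datetime:
    """Parse a timestamp such as '2026-03-13 10:47:00Z' into an aware datetime."""
    # Older Python versions reject a trailing 'Z', so spell the offset out.
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are meant as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def not_rebooted_since(boot_times: Dict[str, datetime], cutoff: datetime) -> List[str]:
    """Return hosts whose last boot happened before the cutoff."""
    return sorted(host for host, booted in boot_times.items() if booted < cutoff)


cutoff = parse_cutoff("2026-03-13 10:47:00Z")
# Hypothetical boot times: one host booted before the cutoff, one after it.
boot_times = {
    "host1001": datetime(2026, 3, 1, 8, 0, tzinfo=timezone.utc),
    "host1002": datetime(2026, 3, 14, 9, 30, tzinfo=timezone.utc),
}
print(not_rebooted_since(boot_times, cutoff))
```

Because the cutoff is fixed, rerunning the same command days later selects the same remaining hosts, which a relative uptime threshold cannot guarantee.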

What's task T419960 about? I don't have enough privileges to access it. Yes, I think a parameter with an explicit reboot time is more robust than a relative behaviour.

Also, about my first question: should I use system uptime (/proc/uptime) or create an additional argument like --uptime-service to check for a specific service?

What's task T419960 about? I don't have enough privileges to access it. Yes, I think a parameter with an explicit reboot time is more robust than a relative behaviour.

That task is related to rebooting some servers for maintenance. Just meant as a reference, but there is no additional information there.

Also, about my first question: should I use system uptime (/proc/uptime) or create an additional argument like --uptime-service to check for a specific service?

I would go with just server uptime for the first iteration tbh. Feels like that's the most commonly used case. But maybe @jijiki had something else in mind?
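For reference, a small sketch of how server uptime could be derived from the text of /proc/uptime, whose first field is the number of seconds since boot. How a cookbook would fetch that text from remote hosts is left out here, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_proc_uptime(content: str) -> float:
    """Return the uptime in seconds from the text of /proc/uptime."""
    # The file holds two numbers: seconds since boot, then cumulative idle seconds.
    return float(content.split()[0])


def boot_time(uptime_seconds: float, now: datetime) -> datetime:
    """Derive the boot time by subtracting the uptime from the current time."""
    return now - timedelta(seconds=uptime_seconds)


# Sample file content; on a Linux host it could be read from /proc/uptime.
sample = "3600.00 7000.12\n"
now = datetime(2026, 3, 13, 12, 0, tzinfo=timezone.utc)
print(boot_time(parse_proc_uptime(sample), now))
```

Converting the uptime to a boot time this way would let the same value feed either a relative threshold or an absolute cutoff.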

The start-datetime flag of 367592 acts exactly like the one we're discussing. IMHO, --not-rebooted-since is a better name. We should probably ditch --min-uptime if we're opting for the non-relative behavior.

Restarting the X servers with the longest uptime is also relative and could cause accidental reboots (i.e. what happens if two servers have the exact same uptime?). Anyway, we could call it --reboot-longest-uptime.
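One possible way to make the selection deterministic when uptimes tie is to sort by uptime first and hostname second, and to print the selection before acting on it. This is only a sketch under that assumption; `longest_uptime`, `describe_selection`, and the hostnames are hypothetical.

```python
from typing import Dict, List


def longest_uptime(uptimes: Dict[str, float], count: int) -> List[str]:
    """Pick up to `count` hosts with the longest uptime; ties break by hostname."""
    ranked = sorted(uptimes.items(), key=lambda item: (-item[1], item[0]))
    return [host for host, _ in ranked[:count]]


def describe_selection(hosts: List[str]) -> str:
    """Build the message shown to the operator before any host is touched."""
    return f"Would act on {len(hosts)} host(s): {', '.join(hosts)}"


# Hypothetical uptimes in seconds; two hosts share the exact same value.
uptimes = {"host1002": 5000.0, "host1001": 5000.0, "host1003": 100.0}
print(describe_selection(longest_uptime(uptimes, 1)))
```

With the hostname as a secondary key, repeated runs over the same data always pick the same hosts, so a tie does not change which server is acted on.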

What's the simplest cookbook I can run to check the changes? I have tried sre.maps.roll-restart-reboot, but I get an error about a missing /etc/cumin/config.yaml.