hadoop rolling reboot cookbook: add start-datetime flag
Open, High, Public

Description

Two QoL improvements for this cookbook:

  • Add a start-datetime flag. Any host that has already been rebooted since the start-datetime should be skipped. This allows resuming work after the cookbook fails or is paused by the operator, without unnecessarily rebooting hosts that were already rebooted.
  • Add a log line telling the operator when the cookbook can safely be killed (i.e. between batches).

We can use the sre.elasticsearch.rolling-operation cookbook as a model for these two changes.
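A minimal sketch of the two requested behaviours, to make the intent concrete. It is not taken from either cookbook: the function names, the `boot_times` mapping, the `reboot_batch` callable and the host names are all hypothetical, and a real implementation would get boot times from the hosts themselves.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)


def hosts_needing_reboot(boot_times, start_datetime):
    """Return, sorted, the hosts whose last boot predates start_datetime."""
    return sorted(
        host for host, booted_at in boot_times.items() if booted_at < start_datetime
    )


def rolling_reboot(boot_times, start_datetime, batch_size, reboot_batch):
    """Reboot pending hosts batch by batch; return the batches processed."""
    pending = hosts_needing_reboot(boot_times, start_datetime)
    done = []
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        reboot_batch(batch)  # hypothetical callable that does the real work
        done.append(batch)
        # The operator-facing hint requested in the second bullet.
        logger.info("Batch %d done; it is safe to stop the cookbook now.", len(done))
    return done


# Example with made-up host names and boot times.
start = datetime(2024, 6, 21, 12, 0, tzinfo=timezone.utc)
boot_times = {
    "worker-a": start - timedelta(days=3),
    "worker-b": start + timedelta(minutes=5),  # rebooted after start: skipped
    "worker-c": start - timedelta(hours=1),
}
rebooted = []
batches = rolling_reboot(boot_times, start, batch_size=1, reboot_batch=rebooted.extend)
```

Re-running with the same start-datetime after an interruption would only touch hosts that have not booted since that time, which is what makes the run resumable.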


Also tack on the following:

  • Fix what I believe is an erroneous use of math.floor instead of math.ceil, which results in the batch size not being respected in some circumstances.
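The following sketch shows one way a floor/ceil mix-up can break a batch-size limit. It assumes the number of batches is derived from the host count and the hosts are then spread across that many groups; the cookbook's actual code is not shown in this task, so `split_into_groups` and the host names are purely illustrative.

```python
import math


def split_into_groups(hosts, num_groups):
    """Distribute hosts as evenly as possible across num_groups groups."""
    base, extra = divmod(len(hosts), num_groups)
    groups = []
    index = 0
    for g in range(num_groups):
        # The first `extra` groups take one additional host each.
        size = base + (1 if g < extra else 0)
        groups.append(hosts[index:index + size])
        index += size
    return groups


hosts = [f"host{i}" for i in range(7)]  # made-up host names
batch_size = 3

# floor(7 / 3) == 2 groups, so one group ends up with 4 hosts.
floor_groups = split_into_groups(hosts, math.floor(len(hosts) / batch_size))
# ceil(7 / 3) == 3 groups, so no group exceeds 3 hosts.
ceil_groups = split_into_groups(hosts, math.ceil(len(hosts) / batch_size))
```

With floor, whenever the host count is not a multiple of the batch size, the leftover hosts have to be absorbed by existing groups, which pushes at least one group over the limit.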

Details

Event Timeline

Change #1046780 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.hadoop.reboot-workers: use ceil not floor

https://gerrit.wikimedia.org/r/1046780

Gehel triaged this task as High priority. Jun 21 2024, 1:33 PM

Change #1046780 merged by Ryan Kemper:

[operations/cookbooks@master] sre.hadoop.reboot-workers: use ceil not floor

https://gerrit.wikimedia.org/r/1046780

Moving to backlog for now. The start-datetime flag might become a native spicerack feature in the medium term.

Gehel edited projects, added Data-Platform-SRE; removed Data-Platform.
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.