Task to track reboots for an*, kafka*, etc
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| hadoop.reboot-workers: make host override smarter | operations/cookbooks | master | +26 -11 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Restricted Task | | | |
| Open | | RKemper | T411568 October 2025 Bullseye reboots: Data Platform Engineering-owned hosts |
Event Timeline
Change #1214664 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/cookbooks@master] hadoop.reboot-workers: make host override smarter
an-worker* is partially done. I made https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214664 to let us reboot a subset of a cluster's hosts while still properly handling the need to restart one journal node at a time. The patch needs a bit of fixup.
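The ordering constraint described above can be sketched roughly as follows. This is not the code from the Gerrit change, only a minimal illustration under assumptions: the function name `plan_reboot_batches`, the `batch_size` parameter, and the hostnames in the usage note are all hypothetical, and the real cookbook may order or batch hosts differently.

```python
from typing import Iterable, List, Set


def plan_reboot_batches(
    targets: Iterable[str],
    journal_nodes: Set[str],
    batch_size: int = 5,
) -> List[List[str]]:
    """Return ordered reboot batches for a subset of a cluster's hosts.

    Hypothetical helper: ordinary workers are grouped into batches of up to
    ``batch_size`` hosts, while any targeted journal node gets a batch of its
    own so that at most one journal node is down at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # De-duplicate and sort so the plan is deterministic.
    ordered = sorted(set(targets))

    # Split the requested subset into plain workers and journal nodes.
    plain = [host for host in ordered if host not in journal_nodes]
    journals = [host for host in ordered if host in journal_nodes]

    # Plain workers can be rebooted several at a time.
    batches = [plain[i:i + batch_size] for i in range(0, len(plain), batch_size)]

    # Journal nodes are always rebooted one by one.
    batches.extend([host] for host in journals)
    return batches
```

For example, with made-up hostnames, `plan_reboot_batches(["an-worker1001", "an-worker1002", "an-worker1003"], {"an-worker1002"}, batch_size=2)` plans the two plain workers together and the journal node alone afterwards. Journal nodes that are not in the requested subset are simply ignored.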
The stat hosts will be restarted tomorrow.
Oh, with respect to the patch, we should also get https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976163/1/cookbooks/sre/hadoop/reboot-workers.py reviewed and merged at the same time, since it's directly relevant to this.
Mentioned in SAL (#wikimedia-operations) [2025-12-04T22:11:08Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat[1008-1011].eqiad.wmnet with reason: T411568
Mentioned in SAL (#wikimedia-operations) [2025-12-04T22:20:44Z] <ryankemper> T411568 Rebooting stat*
Stat host reboots completed.
Shifting gears to rebooting an-test*. Note that there are still lots of an-worker* hosts that need to be rebooted, but that will be easier once I finish the patch I mentioned above.