October 2025 Bullseye reboots: Data Platform Engineering-owned hosts
Open, Needs Triage, Public

Description

Task to track reboots for an*, kafka*, etc.

Event Timeline

an-worker* reboots ongoing now

Change #1214664 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] hadoop.reboot-workers: make host override smarter

https://gerrit.wikimedia.org/r/1214664

an-worker* partially done. I made https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214664 so that we can reboot a subset of a cluster's hosts while still properly restarting the journal nodes one at a time. The patch needs a bit of fixup.
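For illustration only, here is a minimal sketch of the batching idea in the comment above. It is not the code from the Gerrit change: the function name `plan_reboot_batches`, the `batch_size` parameter, and the host names in the usage note are all hypothetical, and it assumes the set of journal nodes is already known to the caller.

```python
from typing import Iterable, List, Set


def plan_reboot_batches(
    selected_hosts: Iterable[str],
    journal_nodes: Set[str],
    batch_size: int = 2,
) -> List[List[str]]:
    """Plan reboot batches for a subset of a cluster's hosts.

    Hypothetical helper, not part of the real cookbook. Journal nodes
    each get a batch of their own so that at most one is down at a time;
    the remaining workers are grouped batch_size at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # De-duplicate and sort so the plan is deterministic.
    selected = sorted(set(selected_hosts))
    regular = [h for h in selected if h not in journal_nodes]
    journals = [h for h in selected if h in journal_nodes]

    # Regular workers are rebooted in groups of batch_size.
    batches = [
        regular[i:i + batch_size] for i in range(0, len(regular), batch_size)
    ]
    # Journal nodes are rebooted strictly one per batch.
    batches.extend([h] for h in journals)
    return batches
```

With the made-up hosts `an-worker1001` to `an-worker1004` and `an-worker1002` assumed to be a journal node, `plan_reboot_batches(hosts, {"an-worker1002"})` puts the three regular workers into batches of at most two and leaves the journal node in a final batch by itself.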

stat hosts will be restarted tomorrow

Oh, with respect to the patch, we should also get https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976163/1/cookbooks/sre/hadoop/reboot-workers.py reviewed and merged at the same time, since it's directly relevant to this.

Mentioned in SAL (#wikimedia-operations) [2025-12-04T22:11:08Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat[1008-1011].eqiad.wmnet with reason: T411568

Stat host reboots completed.

Shifting gears to rebooting an-test*. Note that there are still lots of an-worker* hosts that need to be rebooted, but that will be easier once I finish the patch I mentioned above.