
Upgrade firmware on db1136
Closed, Resolved (Public)

Description

While working on the reboots for the stretch kernel upgrade, we ran into a known issue with some firmware versions: T216240.
This affected db2073 (T276909), and upgrading its firmware brought it back.

We need to switch over the current s7 primary master to a candidate master, db1136 (T274336), but before doing so we'd like to upgrade the kernel on that host, which requires a reboot.
@wiki_willy can we get some time scheduled with eqiad dcops so the host can get upgraded?

We'd like this to have high priority, as we need to replace the s7 primary master soon (scheduled for 23rd March): it is a host that might run into BBU issues (T258386) and could crash at any time.

Let us know which day and time would work so we can have MySQL stopped and the host powered off.

Thanks!

Event Timeline

@Marostegui can we schedule this for Monday next week, in the 15:00/16:00 UTC timeframe? Thanks

Sounds good @Cmjohnson - I will leave the host off beforehand so you can proceed as you wish. Once you are done, just power it back on and I will take it from there.
Thank you

Change 672337 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1136: Disable notifications

https://gerrit.wikimedia.org/r/672337

Change 672337 merged by Marostegui:
[operations/puppet@production] db1136: Disable notifications

https://gerrit.wikimedia.org/r/672337

Mentioned in SAL (#wikimedia-operations) [2021-03-15T08:54:10Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1136 T277007', diff saved to https://phabricator.wikimedia.org/P14829 and previous config saved to /var/cache/conftool/dbconfig/20210315-085409-marostegui.json

@Cmjohnson db1136 is now off; you can proceed as needed.

@Marostegui I have updated the BIOS firmware.

This host booted from PXE and attempted to reimage itself.
Luckily, the partman recipe we have didn't delete its data. Did the BIOS upgrade change the default boot method?

@Cmjohnson can you take a look to see if that was the case?

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1136.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103160539_marostegui_13173.log.

Completed auto-reimage of hosts:

['db1136.eqiad.wmnet']

and were ALL successful.

Change 672606 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1136: Enable notifications

https://gerrit.wikimedia.org/r/672606

Change 672606 merged by Marostegui:
[operations/puppet@production] db1136: Enable notifications

https://gerrit.wikimedia.org/r/672606

Host is being repooled.
Thanks!