Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092
Open, Low, Public

Description

Some of these may have problems rebooting, getting stuck at "Loading ramdisk..." (see T214840).

eqiad

  • db1096
  • db1097
  • db1098
  • db1099
  • db1100
  • db1101
  • db1102
  • db1103
  • db1104
  • db1105
  • db1106

codfw

  • db2071
  • db2072
  • db2073
  • db2074
  • db2075
  • db2076
  • db2077
  • db2078
  • db2079
  • db2080
  • db2081
  • db2082
  • db2083
  • db2084
  • db2085
  • db2086
  • db2087
  • db2088
  • db2089
  • db2090
  • db2091
  • db2092

For now this is still research, to decide which steps to take.

Event Timeline

jcrespo moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2019-02-15T12:30:51Z] <jynus> stop db2089 mysql instances for reboot testing T216240

Rebooting db2089:

PowerEdge R630
BIOS Version: 2.4.3
1st reboot: OK
2nd reboot: FAIL
3rd reboot: FAIL
4th reboot: OK
5th reboot: OK (with debug)
6th reboot: OK
7th reboot: OK (with debug)
8th reboot: OK (with debug)
9th reboot: OK (with debug)
10th reboot: OK (with debug)
11th reboot: OK
12th reboot: FAIL
13th reboot: FAIL
14th reboot: OK
jcrespo added a project: ops-codfw.
jcrespo added a subscriber: Papaul.

@Papaul from next week, please also upgrade firmware/BIOS of db2089 (only that one for now). I will put it back to production for now.

I have rebooted db2085 without the debug option on the kernel as part of T216273, and I have taken the opportunity to upgrade its kernel too.

Mentioned in SAL (#wikimedia-operations) [2019-02-19T07:46:32Z] <marostegui> Reboot db1106 for kernel upgrade (and remove debug from kernel) T216240 T216273

db1106 has been rebooted (and kernel was upgraded)

Mentioned in SAL (#wikimedia-operations) [2019-02-19T14:16:04Z] <jynus> stop db2090 for reboot testing T216240

Can db2089 be depooled, please, if it is not already? Thanks

Rebooting db2090:

PowerEdge R630
BIOS Version: 2.4.3
1st reboot: OK
2nd reboot: FAIL
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: 
7th reboot: 
8th reboot: 
9th reboot: 
10th reboot: 
11th reboot: 
12th reboot: 
13th reboot: 
14th reboot:

Preparing db2089 for you, @Papaul give me 5 minutes.

Mentioned in SAL (#wikimedia-operations) [2019-02-19T14:53:11Z] <jynus> stopping db2089 for hw maintenance T216240

Rebooting db2090:

PowerEdge R630
BIOS Version: 2.4.3
1st reboot: OK
2nd reboot: FAIL
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK
11th reboot: OK
12th reboot: OK
13th reboot: OK
14th reboot: OK

db2089 upgrade complete
Upgrade
BIOS from 2.4.3 to 2.9.1
IDRAC from 2.40. to 2.61

Thanks, I will ping you when/if I test for more issues on that and other servers.

Rebooting db2089:

PowerEdge R630
BIOS Version: 2.9.1
1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK
11th reboot: OK
jcrespo changed the task status from Open to Stalled. Feb 19 2019, 5:47 PM
jcrespo triaged this task as Low priority.
jcrespo moved this task from In progress to Meta/Epic on the DBA board.

So I believe this is still an ongoing issue, but the remaining hosts may have a lower probability of failing (less than 1 out of 10), so I will stall this and do it only when other hw or sw maintenance is due.
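For reference, the failure rates behind that estimate can be recomputed from the reboot logs posted above. This is only a minimal sketch: the outcomes are transcribed by hand from the comments in this task, the "(with debug)" annotations are dropped, and `runs` and `failure_rate` are illustrative names, not part of any existing tooling.

```python
# Reboot outcomes transcribed from the logs in this task.
# Assumption: "(with debug)" reboots are counted the same as plain ones.
runs = {
    "db2089 (BIOS 2.4.3)": "OK FAIL FAIL OK OK OK OK OK OK OK OK FAIL FAIL OK",
    "db2090 (BIOS 2.4.3)": "OK FAIL OK OK OK OK OK OK OK OK OK OK OK OK",
    "db2089 (BIOS 2.9.1)": "OK OK OK OK OK OK OK OK OK OK OK",
}


def failure_rate(outcomes: str) -> tuple[int, int, float]:
    """Return (failures, total reboots, failure fraction) for one log."""
    results = outcomes.split()
    failures = results.count("FAIL")
    return failures, len(results), failures / len(results)


for label, outcomes in runs.items():
    failures, total, rate = failure_rate(outcomes)
    print(f"{label}: {failures}/{total} reboots failed ({rate:.0%})")
```

With so few reboots per host these numbers are rough, but they are consistent with db2090 failing less than 1 time out of 10 on the old BIOS, and with no failures seen on db2089 after the upgrade.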

@Papaul can we upgrade the firmware and BIOS on db2080? I was bitten by this today.

Mentioned in SAL (#wikimedia-operations) [2019-04-24T13:37:44Z] <marostegui> Poweroff db2080 for onsite maintenance - T216240

Before:
BIOS Version: 2.4.3
Firmware Version: 2.40.40.40
IP Address(es): 10.193.1.75
iDRAC MAC Address: 84:7B:EB:F6:99:B2
DNS Domain Name:
Lifecycle Controller Firmware: 2.40.40.40

After:
Service Tag: JCBSDH2
Express Service Code: 42104258102
BIOS Version: 2.9.1
Firmware Version: 2.61.60.60
IP Address(es): 10.193.1.75
iDRAC MAC Address: 84:7B:EB:F6:99:B2
DNS Domain Name:
Lifecycle Controller Firmware: 2.61.60.60

Complete.

Thanks @Papaul I am rebooting the server a few times to confirm it is indeed solved!

Change 506349 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2080

https://gerrit.wikimedia.org/r/506349

Change 506349 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2080

https://gerrit.wikimedia.org/r/506349

Mentioned in SAL (#wikimedia-operations) [2019-04-25T06:14:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2080 after onsite maintenance to upgrade BIOS and firmware - T216240 (duration: 00m 54s)

db2072 got stuck on Loading initial ramdisk ...

When doing a reboot for T273280, I just ran into this issue with db2073.
db2072 (which has newer firmware) rebooted just fine.

Marostegui changed the task status from Stalled to Open. Mar 10 2021, 5:35 PM
Marostegui moved this task from Meta/Epic to Ready on the DBA board.