
Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092
Open, Stalled, Low · Public

Description

Some of these may fail to reboot, getting stuck at "Loading ramdisk..." (see T214840).

eqiad

  • db1096
  • db1097
  • db1098
  • db1099
  • db1100
  • db1101
  • db1102
  • db1103
  • db1104
  • db1105
  • db1106

codfw

  • db2071
  • db2072
  • db2073
  • db2074
  • db2075
  • db2076
  • db2077
  • db2078
  • db2079
  • db2080
  • db2081
  • db2082
  • db2083
  • db2084
  • db2085
  • db2086
  • db2087
  • db2088
  • db2089
  • db2090
  • db2091
  • db2092

For now this is still research, to decide which steps to take.

Details

Related Gerrit Patches:
operations/mediawiki-config : master · db-codfw.php: Repool db2080

Event Timeline

jcrespo created this task. Feb 15 2019, 12:06 PM
Restricted Application added a subscriber: Aklapper. Feb 15 2019, 12:06 PM
jcrespo claimed this task. Feb 15 2019, 12:06 PM
jcrespo moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2019-02-15T12:30:51Z] <jynus> stop db2089 mysql instances for reboot testing T216240

jcrespo added a comment. Edited Feb 15 2019, 12:49 PM

Rebooting db2089:

PowerEdge R630
BIOS Version: 2.4.3
1st reboot: OK
2nd reboot: FAIL
3rd reboot: FAIL
4th reboot: OK
5th reboot: OK (with debug)
6th reboot: OK
7th reboot: OK (with debug)
8th reboot: OK (with debug)
9th reboot: OK (with debug)
10th reboot: OK (with debug)
11th reboot: OK
12th reboot: FAIL
13th reboot: FAIL
14th reboot: OK
jcrespo reassigned this task from jcrespo to Papaul. Feb 15 2019, 2:40 PM
jcrespo added a project: ops-codfw.
jcrespo added a subscriber: Papaul.

@Papaul from next week, please also upgrade firmware/BIOS of db2089 (only that one for now). I will put it back to production for now.

Restricted Application added a project: Operations. Feb 15 2019, 2:40 PM
Marostegui updated the task description. Feb 18 2019, 6:58 AM

I have rebooted db2085 without the debug option on the kernel as part of T216273, and I took the opportunity to upgrade its kernel too.

Mentioned in SAL (#wikimedia-operations) [2019-02-19T07:46:32Z] <marostegui> Reboot db1106 for kernel upgrade (and remove debug from kernel) T216240 T216273

Marostegui updated the task description. Feb 19 2019, 7:56 AM

db1106 has been rebooted (and kernel was upgraded)

Mentioned in SAL (#wikimedia-operations) [2019-02-19T14:16:04Z] <jynus> stop db2090 for reboot testing T216240

Can db2089 be depooled, please, if it is not already? Thanks

Rebooting db2090:

PowerEdge R630
BIOS Version: 2.4.3
1st reboot: OK
2nd reboot: FAIL
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: 
7th reboot: 
8th reboot: 
9th reboot: 
10th reboot: 
11th reboot: 
12th reboot: 
13th reboot: 
14th reboot:

Preparing db2089 for you, @Papaul give me 5 minutes.

Mentioned in SAL (#wikimedia-operations) [2019-02-19T14:53:11Z] <jynus> stopping db2089 for hw maintenance T216240

jcrespo added a comment. Edited Feb 19 2019, 3:04 PM

Rebooting db2090:

PowerEdge R630
BIOS Version: 2.4.3
1st reboot: OK
2nd reboot: FAIL
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK
11th reboot: OK
12th reboot: OK
13th reboot: OK
14th reboot: OK

db2089 upgrade complete
Upgrade
BIOS from 2.4.3 to 2.9.1
IDRAC from 2.40. to 2.61

jcrespo claimed this task. Feb 19 2019, 3:39 PM

Thanks, I will ping you when/if I find more issues on that and other servers.

jcrespo updated the task description. Feb 19 2019, 3:40 PM

Rebooting db2089:

PowerEdge R630
BIOS Version: 2.9.1
1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK
11th reboot: OK
jcrespo changed the task status from Open to Stalled. Feb 19 2019, 5:47 PM
jcrespo triaged this task as Low priority.
jcrespo moved this task from In progress to Meta/Epic on the DBA board.

So I believe this is still an ongoing issue, but the remaining hosts may have a lower probability of failing (less than 1 out of 10), so I will stall this and do it only when other hw or sw maintenance is due.
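
As a rough illustration of the numbers behind that estimate, the reboot logs posted in this task can be tallied as below. This is only a sketch: the outcomes are transcribed by hand from the comments above, the helper names (`failure_rate`, `prob_all_ok`) are hypothetical, and `prob_all_ok` assumes each reboot fails independently with a fixed probability, which the logs do not establish (the db2089 failures came in consecutive pairs).

```python
from fractions import Fraction

# Outcomes transcribed from the reboot logs in this task (True = OK, False = FAIL).
REBOOT_LOGS = {
    "db2089 (BIOS 2.4.3)": [True, False, False, True] + [True] * 7 + [False, False, True],
    "db2090 (BIOS 2.4.3)": [True, False] + [True] * 12,
    "db2089 (BIOS 2.9.1)": [True] * 11,
}


def failure_rate(outcomes):
    """Return the observed fraction of failed reboots."""
    return Fraction(outcomes.count(False), len(outcomes))


def prob_all_ok(per_reboot_failure, reboots):
    """Chance of `reboots` clean reboots in a row.

    Assumption: reboots are independent with a fixed failure probability.
    """
    return float((1 - per_reboot_failure) ** reboots)


for label, outcomes in REBOOT_LOGS.items():
    rate = failure_rate(outcomes)
    print(f"{label}: {outcomes.count(False)}/{len(outcomes)} failed ({float(rate):.0%})")

# How likely would 11 clean reboots be if the pre-upgrade db2089 rate still applied?
old_rate = failure_rate(REBOOT_LOGS["db2089 (BIOS 2.4.3)"])
print(f"P(11 clean reboots at old db2089 rate) = {prob_all_ok(old_rate, 11):.3f}")
```

Under that independence assumption, 11 clean reboots would be unlikely at the old db2089 rate (4 failures in 14), but the same test could easily pass on a host failing about 1 time in 14 like db2090, so a short run of clean reboots says little about the lower-probability hosts.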

@Papaul can we upgrade firmware and BIOS on db2080? I was bitten by this today.

Mentioned in SAL (#wikimedia-operations) [2019-04-24T13:37:44Z] <marostegui> Poweroff db2080 for onsite maintenance - T216240

Before:
BIOS Version
2.4.3
Firmware Version
2.40.40.40
IP Address(es)
10.193.1.75
iDRAC MAC Address
84:7B:EB:F6:99:B2
DNS Domain Name
Lifecycle Controller Firmware
2.40.40.40

After:
Service Tag
JCBSDH2
Express Service Code
42104258102
BIOS Version
2.9.1
Firmware Version
2.61.60.60
IP Address(es)
10.193.1.75
iDRAC MAC Address
84:7B:EB:F6:99:B2
DNS Domain Name
Lifecycle Controller Firmware
2.61.60.60

Complete.

Thanks @Papaul, I am rebooting the server a few times to confirm it is indeed solved!

Marostegui updated the task description. Apr 24 2019, 4:02 PM

Change 506349 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2080

https://gerrit.wikimedia.org/r/506349

Change 506349 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2080

https://gerrit.wikimedia.org/r/506349

Mentioned in SAL (#wikimedia-operations) [2019-04-25T06:14:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2080 after onsite maintenance to upgrade BIOS and firmware - T216240 (duration: 00m 54s)

Marostegui updated the task description. May 28 2019, 7:00 PM
jcrespo removed jcrespo as the assignee of this task. Jul 19 2019, 5:30 PM
Marostegui updated the task description.
Marostegui updated the task description. Aug 6 2019, 2:38 PM
Marostegui updated the task description. Oct 22 2019, 12:22 PM

db2072 got stuck on Loading initial ramdisk ...

Marostegui updated the task description. Wed, Nov 13, 3:50 PM