
db2125 crashed - mgmt iface also not available
Closed, Resolved, Public

Description

db2125 has been reported as down:

[10:01:51]  <+icinga-wm>	PROBLEM - Host db2125 is DOWN: PING CRITICAL - Packet loss = 100%

The mgmt interface is also not available, so I cannot check what's going on.
@Papaul can you check its status and the mgmt interface?

Related Objects

Status    Assigned
Resolved  Papaul
Resolved  Kormat

Event Timeline


@Papaul I guess we should do that...and then wait again for the next crash and start the whole loop again.
@Papaul let me know when you want me to stop mysql so you can proceed with this new upgrade

@Papaul doing it now, thanks - will ping you once it is ready for you

Mentioned in SAL (#wikimedia-operations) [2020-09-17T13:17:15Z] <marostegui> Stop MySQL on db2125 for on-site maintenance T260670

Upgraded BIOS from version 2.8.1 to 2.8.2.
Changed profile settings from <Performance Per Watt OS> to Performance.

@Marostegui all yours

Thank you @Papaul - I will start mysql and start slowly repooling the host in production.

Change 628091 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/628091

Change 628091 merged by Marostegui:
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/628091

Mentioned in SAL (#wikimedia-operations) [2020-09-17T14:00:15Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12629 and previous config saved to /var/cache/conftool/dbconfig/20200917-140014-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-17T14:18:26Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12630 and previous config saved to /var/cache/conftool/dbconfig/20200917-141825-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-17T14:39:15Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12631 and previous config saved to /var/cache/conftool/dbconfig/20200917-143914-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-17T15:02:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12633 and previous config saved to /var/cache/conftool/dbconfig/20200917-150234-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-17T15:13:48Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12634 and previous config saved to /var/cache/conftool/dbconfig/20200917-151347-marostegui.json

This host is fully back in production. I still believe we should get the CPUs/mainboard replaced to be fully sure.
Going to leave this open for a bit longer

Closing per the internal email thread. If this happens again we'll reopen and contact Dell again.

This host crashed again:

--------------------------------------------------------------------------------
SeqNumber       = 890
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2020-09-21 11:31:52
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 889
Message ID      = SYS1000
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2020-09-21 11:31:34
Message         = System is turning on.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 888
Message ID      = SYS1001
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2020-09-21 11:31:24
Message         = System is turning off.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 887
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2020-09-21 11:31:24
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 886
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2020-09-21 11:31:07
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 885
Message ID      = CPU0000
Category        = System
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2020-09-21 11:30:58
Message         = Internal error has occurred check for additional logs.
--------------------------------------------------------------------------------
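For triage, the iDRAC dump above can be checked mechanically for the recurring pattern: a CPU0000 internal error followed directly by a RAC0703 hard reset. A minimal Python sketch, assuming the `Key = Value` record layout shown above. The function names are illustrative and are not part of any Dell or WMF tooling:

```python
import re

# Records in the dump are delimited by lines made only of dashes.
SEPARATOR = re.compile(r"^-{10,}\s*$")

# Two records copied from the dump above (newest first, as printed).
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber       = 886
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2020-09-21 11:31:07
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 885
Message ID      = CPU0000
Category        = System
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2020-09-21 11:30:58
Message         = Internal error has occurred check for additional logs.
--------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split a dump into one dict per record, keyed by field name."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first "=" only; values may contain further text.
        key, sep, value = line.partition("=")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def crash_pairs(records):
    """Return (error_time, reset_time) for each CPU0000 record whose
    next record by SeqNumber is a RAC0703 hard reset."""
    ordered = sorted(records, key=lambda r: int(r["SeqNumber"]))
    pairs = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.get("Message ID") == "CPU0000" and nxt.get("Message ID") == "RAC0703":
            pairs.append((prev["Timestamp"], nxt["Timestamp"]))
    return pairs


print(crash_pairs(parse_records(SAMPLE)))
```

Sorting by `SeqNumber` rather than relying on print order keeps the check independent of whether the log is dumped newest-first or oldest-first.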

Change 628777 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/628777

Change 628777 merged by Marostegui:
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/628777

Dear Papaul Tshibamba,

This e-mail is to update you on the status of your Dell Service Request.

Current Status:

The Dell replacement part(s) for your POWEREDGE R440,ICE PE has been shipped by FEDX on tracking number XXXXXXXXXXXXX.

Looks like it arrived \o/:

Delivered Wednesday 9/23/2020 at 9:57 am

@Papaul can you coordinate with @Kormat for this? I will be off from this evening until Monday, so if you need something from us regarding this host today or tomorrow, please talk to her.
Thank you!

I will be on site today. The only thing I need for now is for the server to be depooled and powered down, if that has not been done yet.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2020-09-24T12:27:12Z] <kormat> powering off db2125 for maintenance T260670

@Papaul : server is depooled and powered down now. Cheers :)

Main board replaced, and BIOS and iDRAC upgraded on the new board.
@Kormat you can repool the server and resolve this task for now when done.

Thanks.

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db2125.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009250824_kormat_28716.log.

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db2125.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009250845_kormat_17901.log.

Change 630088 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] installer_server: Update MAC for db2125

https://gerrit.wikimedia.org/r/630088

Change 630088 merged by Kormat:
[operations/puppet@production] installer_server: Update MAC for db2125

https://gerrit.wikimedia.org/r/630088

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db2125.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009250904_kormat_3140.log.

Completed auto-reimage of hosts:

['db2125.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2020-09-28T07:43:14Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12809 and previous config saved to /var/cache/conftool/dbconfig/20200928-074313-kormat.json

Change 630535 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/630535

Change 630535 merged by Marostegui:
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/630535

Mentioned in SAL (#wikimedia-operations) [2020-09-28T07:58:18Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12810 and previous config saved to /var/cache/conftool/dbconfig/20200928-075817-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:13:22Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12811 and previous config saved to /var/cache/conftool/dbconfig/20200928-081321-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:28:26Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12813 and previous config saved to /var/cache/conftool/dbconfig/20200928-082825-kormat.json
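The staged repool logged above (25%, 50%, 75%, 100%, roughly 15 minutes apart per the SAL timestamps) can be sketched as a simple schedule. This is only an illustration: `repool_schedule` is a hypothetical helper, not part of dbctl, and the 15-minute interval is inferred from the log entries rather than taken from any documented policy.

```python
from datetime import datetime, timedelta


def repool_schedule(start, percentages, interval=timedelta(minutes=15)):
    """Return (when, percentage) pairs, one step per interval.

    Hypothetical helper: it only plans the steps and does not talk to
    dbctl or any other tool.
    """
    steps = list(percentages)
    # Traffic should only ever go up, and the last step must be full weight.
    if not steps or steps != sorted(steps) or steps[-1] != 100:
        raise ValueError("percentages must be non-decreasing and end at 100")
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


# Start time taken from the first SAL entry above.
for when, pct in repool_schedule(datetime(2020, 9, 28, 7, 43, 14), [25, 50, 75, 100]):
    print(f"{when:%Y-%m-%d %H:%M:%S}  pool at {pct}%")
```

Spacing the steps out gives time to watch the host under increasing load before committing full traffic, which matters for a host that has been crashing under load.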

Alright, the host is fully back in service now, so resolving this again :)

@Kormat tell @Marostegui to not break the host again :)

hahah - reminder: you are the one that walks next to the host often...........

CDanis added a subscriber: CDanis.

crashed again

from HW logs

--------------------------------------------------------------------------------
SeqNumber       = 286
Message ID      = PWR2270
Category        = System
AgentID         = iDRAC
Severity        = Warning
Timestamp       = 2020-09-29 20:47:51
Message         = The Intel Management Engine has encountered a Health Event.
Message Arg   1 = Event data1: 0xa0, data2: 0x0d, data3: 0x03
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
2020-09-29 20:47:59 	287 	SYS1003 	System CPU Resetting.
2020-09-29 20:47:51 	286 	PWR2270 	The Intel Management Engine has encountered a Health Event.
2020-09-29 20:47:47 	285 	SYS1000 	System is turning on.
2020-09-29 20:47:38 	284 	SYS1001 	System is turning off.
2020-09-29 20:47:38 	283 	SYS1003 	System CPU Resetting.
2020-09-29 20:47:21 	282 	RAC0703 	Requested system hardreset.
2020-09-29 20:47:12 	281 	CPU0000 	Internal error has occurred check for additional logs.

Change 630984 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/630984

Change 630984 merged by Marostegui:
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/630984

Just for the record, those CPU reset/error events have been happening since the first crash.

There is one thing I have noticed: the temperature of this host, according to Grafana, is a lot higher than that of another host in the same section (db2126):

db2125

[Screenshot: Grafana temperature graph for db2125, 2020-09-30]

db2126

[Screenshot: Grafana temperature graph for db2126, 2020-09-30]

Though, if the host were switching off due to temperature, I would expect to see an alert in the iDRAC logs.

After escalating to the technical account rep, replacement CPUs are being shipped by Dell; the replacement can wait until Papaul is back from vacation.

The host crashed again yesterday while loading a backup, with the same CPU error as always:

--------------------------------------------------------------------------------
SeqNumber       = 343
Message ID      = SYS1001
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2020-10-06 16:27:19
Message         = System is turning off.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 342
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2020-10-06 16:27:18
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 341
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2020-10-06 16:27:02
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

Mentioned in SAL (#wikimedia-operations) [2020-10-08T07:55:19Z] <marostegui> Rebuild db2125 from snapshots - T260670

Both CPUs replaced; server is back up.

Thank you Papaul, I will start repooling the host tomorrow and see how it goes under load.

Change 633863 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/633863

Change 633863 merged by Marostegui:
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/633863

Mentioned in SAL (#wikimedia-operations) [2020-10-14T05:44:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 10%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12982 and previous config saved to /var/cache/conftool/dbconfig/20201014-054420-root.json

Mentioned in SAL (#wikimedia-operations) [2020-10-14T05:59:24Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 30%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12983 and previous config saved to /var/cache/conftool/dbconfig/20201014-055923-root.json

Mentioned in SAL (#wikimedia-operations) [2020-10-14T06:14:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 20%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12984 and previous config saved to /var/cache/conftool/dbconfig/20201014-061426-root.json

Mentioned in SAL (#wikimedia-operations) [2020-10-14T06:29:30Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 40%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12985 and previous config saved to /var/cache/conftool/dbconfig/20201014-062930-root.json

Mentioned in SAL (#wikimedia-operations) [2020-10-14T06:44:34Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12986 and previous config saved to /var/cache/conftool/dbconfig/20201014-064433-root.json

Mentioned in SAL (#wikimedia-operations) [2020-10-14T06:59:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12987 and previous config saved to /var/cache/conftool/dbconfig/20201014-065936-root.json

Mentioned in SAL (#wikimedia-operations) [2020-10-14T07:14:40Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: Slowly repool db2125 after on-site maintenance T260670 ', diff saved to https://phabricator.wikimedia.org/P12988 and previous config saved to /var/cache/conftool/dbconfig/20201014-071440-root.json

This host has been fully pooled and notifications were enabled too.
Going to close this as fixed for now, we'll see if it crashes again.