db1135 has crashed
Closed, Resolved (Public)

Description

Can't ssh into it.

getsel:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   06/07/2023 17:29:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   06/07/2023 17:32:54
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   06/07/2023 17:32:54
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B6.
-------------------------------------------------------------------------------
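Most of the records above are informational; only records 2 and 12 are Critical. As a small sketch for picking those out of a saved log, the helper below (the name `sel_critical` is hypothetical, and it assumes the exact "Record / Date/Time / Severity / Description" layout shown above) prints one tab-separated line per Critical record:

```shell
# Print record number, timestamp and description for each Critical entry.
# Reads the log text from stdin; assumes the field layout shown above.
sel_critical() {
  awk '
    /^Record:/      { rec = $2 }
    /^Date\/Time:/  { ts = $2 " " $3 }
    /^Severity:/    { sev = $2 }
    /^Description:/ {
      sub(/^Description: */, "")          # keep only the message text
      if (sev == "Critical") printf "%s\t%s\t%s\n", rec, ts, $0
    }
  '
}
```

Usage, assuming the log was saved to a file named `sel.txt` (an illustrative name): `sel_critical < sel.txt`.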

Event Timeline

iDRAC is reporting the same error message. The host is paused waiting for user input, and the OS is down.

image.png (771×889 px, 64 KB)

Change 928127 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1135: Disable notification

https://gerrit.wikimedia.org/r/928127

Change 928127 merged by Ladsgroup:

[operations/puppet@production] db1135: Disable notification

https://gerrit.wikimedia.org/r/928127

Mentioned in SAL (#wikimedia-operations) [2023-06-07T18:27:04Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1135.eqiad.wmnet with reason: T338354

Mentioned in SAL (#wikimedia-operations) [2023-06-07T18:27:17Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1135.eqiad.wmnet with reason: T338354

This server is out of warranty. I can pull a DIMM from a decommissioned server if needed.

I think this is the only thing we can do, so please go ahead. Notifications for the server are disabled (although I think it is stuck in POST).

@Ladsgroup @Marostegui As you will be back before I am, here are the remaining steps (if you want to do them; otherwise you can wait for me):

  • Set up the data again (can be done from the latest snapshot, as documented)
  • Remove the manually disabled notifications (this had to be done by hand because puppet hadn't run there)
  • Remove the puppet-disabled notifications
  • Remove the downtime or let it expire
  • Start replication, etc.
  • Repool

Thanks! I'm setting up MySQL and making sure it's replicating.

Mentioned in SAL (#wikimedia-operations) [2023-06-16T10:38:34Z] <Amir1> root@cumin1001:/home/ladsgroup/software2/dbtools# cat s1.dblist | grep -v "#" | while read db; do cat tables_to_check.txt | while read table index; do echo "$db.$table"; db-compare $db $table $index db1135.eqiad.wmnet:3306 db1118 db1139:3311 || break 2; done ; done (T338354)
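For readability, the one-liner logged above can be unrolled roughly as follows. This is only a sketch: the function name `compare_all` is hypothetical, the compare command is passed in as an argument rather than hard-coded, and (unlike the logged command) empty lines in the database list are skipped and the compare command's stdin is redirected so it cannot consume the loop input.

```shell
# Run a compare command for every (database, table, index) combination.
# $1: file with one database per line (lines containing "#" are skipped)
# $2: file with "table index" pairs, one per line
# $3: compare command; remaining arguments are passed through as hosts
compare_all() {
  local dblist=$1 tables=$2 compare_cmd=$3
  shift 3
  local db table index
  while read -r db; do
    case $db in *'#'*|'') continue ;; esac
    while read -r table index; do
      echo "$db.$table"
      # Stop everything on the first failure, like "|| break 2" in the log.
      "$compare_cmd" "$db" "$table" "$index" "$@" </dev/null || return 1
    done < "$tables"
  done < "$dblist"
}
```

With the file names and hosts from the log entry, the call would look like `compare_all s1.dblist tables_to_check.txt db-compare db1135.eqiad.wmnet:3306 db1118 db1139:3311`.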

Mentioned in SAL (#wikimedia-operations) [2023-06-19T10:16:53Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49446 and previous config saved to /var/cache/conftool/dbconfig/20230619-101653-ladsgroup.json

The data check didn't find any differences. Repooling.

Mentioned in SAL (#wikimedia-operations) [2023-06-19T10:31:58Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49447 and previous config saved to /var/cache/conftool/dbconfig/20230619-103157-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-06-19T10:47:03Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49448 and previous config saved to /var/cache/conftool/dbconfig/20230619-104702-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-06-19T11:02:07Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Maint over (T338354)', diff saved to https://phabricator.wikimedia.org/P49449 and previous config saved to /var/cache/conftool/dbconfig/20230619-110207-ladsgroup.json
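The log entries above show the host going back into rotation in four steps (10%, 25%, 75%, 100%), roughly 15 minutes apart. A minimal sketch of that staged pattern is below; it is not the tooling that produced these entries. The function name `staged_repool` is hypothetical, and the command that actually sets the pooling percentage is left as a caller-supplied argument.

```shell
# Step a host through 10/25/75/100 percent, pausing between steps.
# $1: host name
# $2: command invoked as "<cmd> <host> <percent>" (supplied by the caller)
# $3: seconds to wait between steps
staged_repool() {
  local host=$1 pool_cmd=$2 wait_seconds=$3
  local pct
  for pct in 10 25 75 100; do
    "$pool_cmd" "$host" "$pct" || return 1
    # No need to wait after the final step.
    [ "$pct" -eq 100 ] || sleep "$wait_seconds"
  done
}
```

A wait of 900 seconds between steps would approximate the 15-minute spacing seen in the timestamps.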