db1112 (s3 contribs/rc replica) is down
Closed, ResolvedPublic

Description

Which means no cloud replication on s3

Event Timeline

Restricted Application added a subscriber: Aklapper.
Legoktm renamed this task from db1112 (s3 sanitarium master) is down to db1112 (s3 contribs/rc replica) is down. Oct 25 2021, 7:00 PM
Legoktm raised the priority of this task from High to Unbreak Now!.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:04:36Z] <legoktm@cumin1001> dbctl commit (dc=all): 'Depool db1112 (T294295)', diff saved to https://phabricator.wikimedia.org/P17596 and previous config saved to /var/cache/conftool/dbconfig/20211025-190436-legoktm.json

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:07:17Z] <kormat@cumin1001> dbctl commit (dc=all): 'Temporarily move mw groups to db1123 T294295', diff saved to https://phabricator.wikimedia.org/P17597 and previous config saved to /var/cache/conftool/dbconfig/20211025-190717-kormat.json

Nothing obvious in syslog, last entry before reboot was:

Oct 25 18:23:03 db1112 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.

Nothing obvious in ipmi-sel either:

$ sudo ipmi-sel 
ID  | Date        | Time     | Name             | Type                     | Event
1   | Nov-04-2017 | 08:31:17 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
...
21  | Oct-10-2019 | 12:02:56 | PS Redundancy    | Power Supply             | Fully Redundant
22  | Oct-23-2021 | 10:54:23 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h

That last entry looks a bit suspicious, but it was 2 days ago, and it was only a transition from OK to non-critical.
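For anyone wanting to pull just the memory events out of ipmi-sel output like the above, here is a rough sketch. It assumes the pipe-delimited layout shown in this paste; the function names `parse_sel` and `memory_events` are made up for illustration and are not part of any existing tooling.

```python
from datetime import datetime

# Rows copied from the ipmi-sel paste above (elided rows left out).
SEL_OUTPUT = """\
ID  | Date        | Time     | Name             | Type                     | Event
1   | Nov-04-2017 | 08:31:17 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
21  | Oct-10-2019 | 12:02:56 | PS Redundancy    | Power Supply             | Fully Redundant
22  | Oct-23-2021 | 10:54:23 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h
"""

# Explicit month table so parsing does not depend on the system locale.
MONTHS = {name: i + 1 for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}


def parse_sel(text):
    """Turn pipe-delimited SEL rows into dicts keyed by the header names."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [col.strip() for col in lines[0].split("|")]
    entries = []
    for line in lines[1:]:
        cols = [col.strip() for col in line.split("|", len(header) - 1)]
        if len(cols) != len(header):
            continue  # skip rows such as "..." that do not match the layout
        entry = dict(zip(header, cols))
        month, day, year = entry["Date"].split("-")
        hour, minute, second = (int(x) for x in entry["Time"].split(":"))
        entry["when"] = datetime(int(year), MONTHS[month], int(day),
                                 hour, minute, second)
        entries.append(entry)
    return entries


def memory_events(entries):
    """Keep only the entries whose Type column is Memory."""
    return [e for e in entries if e["Type"] == "Memory"]


if __name__ == "__main__":
    for event in memory_events(parse_sel(SEL_OUTPUT)):
        print(event["ID"], event["when"].isoformat(), event["Event"])
```

The `when` field makes it easy to compare an event's time against the time of the crash.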

The server went down unexpectedly by itself. I connected to mgmt, powercycled it, and watched it boot up. Besides a normal fsck due to the unexpected power down, there was nothing obvious.

It just booted as normal from there.

racadm getsel also shows this entry from 2 days ago:

-------------------------------------------------------------------------------
Record:      22
Date/Time:   10/23/2021 10:54:23
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Kormat lowered the priority of this task from Unbreak Now! to High. Oct 25 2021, 7:22 PM

Dropping priority. The host is depooled, and the only real impact right now is wikireplicas for s3, which is non-critical. I'll look into what happened in more detail tomorrow.

I'm bringing mariadb up on the host (without replication running), and am going to let mysqlcheck --all-databases run on it overnight.

We've seen hosts getting rebooted after having memory issues. This host is obviously out of warranty, but maybe we can get a spare DIMM somewhere and replace B1?

Server is currently back up and sitting at login. Forwarding the DIMM replacement question to ops-eqiad and adding the tag.

Dzahn renamed this task from db1112 (s3 contribs/rc replica) is down to db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down). Oct 25 2021, 8:01 PM
Dzahn added projects: ops-eqiad, DC-Ops.

Once the table check is completed, let's do a data check on some of the tables of the biggest wikis.

Change 734448 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1112: Disable notifications

https://gerrit.wikimedia.org/r/734448

Change 734448 merged by Marostegui:

[operations/puppet@production] db1112: Disable notifications

https://gerrit.wikimedia.org/r/734448

So the logs show lots of errors from previous days:

Oct 19 06:45:09 db1112 kernel: [13398165.820227] mce: [Hardware Error]: Machine check events logged
Oct 19 07:01:51 db1112 kernel: [13399167.594963] {16}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:01:51 db1112 kernel: [13399167.594966] {16}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 19 07:01:51 db1112 kernel: [13399167.594967] {16}[Hardware Error]: event severity: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594969] {16}[Hardware Error]:  Error 0, type: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594970] {16}[Hardware Error]:  fru_text: B1
Oct 19 07:01:51 db1112 kernel: [13399167.594971] {16}[Hardware Error]:   section_type: memory error
Oct 19 07:01:51 db1112 kernel: [13399167.594972] {16}[Hardware Error]:   error_status: 0x0000000000000400
Oct 19 07:01:51 db1112 kernel: [13399167.594973] {16}[Hardware Error]:   physical_address: 0x000000691a6c48c0
Oct 19 07:01:51 db1112 kernel: [13399167.594975] {16}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 3 row: 18851 column: 288
Oct 19 07:01:51 db1112 kernel: [13399167.594977] {16}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 19 07:01:51 db1112 kernel: [13399167.595008] mce: [Hardware Error]: Machine check events logged
Oct 19 07:02:01 db1112 kernel: [13399177.972838] {17}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:02:01 db1112 kernel: [13399177.972841] {17}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 19 07:02:01 db1112 kernel: [13399177.972842] {17}[Hardware Error]: event severity: corrected
Oct 19 07:02:01 db1112 kernel: [13399177.972844] {17}[Hardware Error]:  Error 0, type: corrected
Oct 19 07:02:01 db1112 kernel: [13399177.972845] {17}[Hardware Error]:  fru_text: B1
Oct 19 07:02:01 db1112 kernel: [13399177.972846] {17}[Hardware Error]:   section_type: memory error
Oct 19 07:02:01 db1112 kernel: [13399177.972847] {17}[Hardware Error]:   error_status: 0x0000000000000400
Oct 19 07:02:01 db1112 kernel: [13399177.972848] {17}[Hardware Error]:   physical_address: 0x0000004240902c80
Oct 19 07:02:01 db1112 kernel: [13399177.972851] {17}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 2 row: 3092 column: 176
Oct 19 07:02:01 db1112 kernel: [13399177.972852] {17}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 19 07:02:01 db1112 kernel: [13399177.972897] mce: [Hardware Error]: Machine check events logged

We should really get that B1 DIMM replaced.
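To see at a glance which DIMM the corrected errors point at, something like the following could tally them from the kernel log. This is only a sketch: it assumes every APEI record carries exactly one `fru_text:` line, as in the paste above, and the name `count_errors_by_dimm` is hypothetical.

```python
import re
from collections import Counter

# A few lines copied from the kernel log paste above.
SAMPLE = """\
Oct 19 07:01:51 db1112 kernel: [13399167.594963] {16}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:01:51 db1112 kernel: [13399167.594967] {16}[Hardware Error]: event severity: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594970] {16}[Hardware Error]:  fru_text: B1
Oct 19 07:01:51 db1112 kernel: [13399167.594977] {16}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 19 07:02:01 db1112 kernel: [13399177.972838] {17}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:02:01 db1112 kernel: [13399177.972845] {17}[Hardware Error]:  fru_text: B1
Oct 19 07:02:01 db1112 kernel: [13399177.972852] {17}[Hardware Error]:   error_type: 2, single-bit ECC
"""

# Matches the per-record line naming the affected module, e.g. "fru_text: B1".
FRU_RE = re.compile(r"\[Hardware Error\]:\s+fru_text:\s*(\S+)")


def count_errors_by_dimm(lines):
    """Count hardware error records per DIMM label (one fru_text line each)."""
    counts = Counter()
    for line in lines:
        match = FRU_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for dimm, count in count_errors_by_dimm(SAMPLE.splitlines()).most_common():
        print(f"{dimm}: {count} error record(s)")
```

Counting `fru_text` lines rather than the `{N}` sequence numbers avoids double counting after a reboot, when the sequence numbers start over.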

mysqlcheck --all-databases completed successfully. Started replication again. Will run db-compare against it once it has finished catching up.

Kormat renamed this task from db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down) to db1112 (s3 contribs/rc replica) is down. Oct 26 2021, 1:20 PM
Kormat removed projects: DC-Ops, ops-eqiad.

Moved the dc-ops request to a subtask T294345: db1112 - DIMM replacement to simplify tracking for them.

Kormat changed the task status from Open to Stalled. Oct 26 2021, 2:33 PM

I ran db-compare against the 3 largest wikis by space (ruwikinews, arzwiki and cewiki); no diffs were found.

I'm going to leave it running as a sanitarium master for the moment, but leave it depooled on the MW side. Hopefully dcops can dig up a spare DIMM for it, but if not, it would be good to see that it runs for a few days without crashing again before considering repooling it.

+1. I would even leave it running till Monday, and then restart mariadb on Monday after issuing this first:

stop slave; SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;

Then start it again and monitor the error log, just in case.

More recent errors from the B1 module:

Oct 26 03:19:30 db1112 kernel: [29473.766843] mce: [Hardware Error]: Machine check events logged
Oct 26 03:35:38 db1112 kernel: [30441.033563] perf: interrupt took too long (3932 > 3930), lowering kernel.perf_event_max_sample_rate to 50750
Oct 26 03:37:50 db1112 kernel: [30573.322266] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 26 03:37:50 db1112 kernel: [30573.322268] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 26 03:37:50 db1112 kernel: [30573.322270] {9}[Hardware Error]: event severity: corrected
Oct 26 03:37:50 db1112 kernel: [30573.322271] {9}[Hardware Error]:  Error 0, type: corrected
Oct 26 03:37:50 db1112 kernel: [30573.322271] {9}[Hardware Error]:  fru_text: B1
Oct 26 03:37:50 db1112 kernel: [30573.322272] {9}[Hardware Error]:   section_type: memory error
Oct 26 03:37:50 db1112 kernel: [30573.322273] {9}[Hardware Error]:   error_status: 0x0000000000000400
Oct 26 03:37:50 db1112 kernel: [30573.322273] {9}[Hardware Error]:   physical_address: 0x00000052c3a204c0
Oct 26 03:37:50 db1112 kernel: [30573.322275] {9}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 2 row: 37928 column: 528
Oct 26 03:37:50 db1112 kernel: [30573.322276] {9}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 26 03:37:50 db1112 kernel: [30573.322297] mce: [Hardware Error]: Machine check events logged
Oct 26 03:43:00 db1112 kernel: [30883.430205] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 26 03:43:00 db1112 kernel: [30883.430208] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 26 03:43:00 db1112 kernel: [30883.430209] {10}[Hardware Error]: event severity: corrected
Oct 26 03:43:00 db1112 kernel: [30883.430211] {10}[Hardware Error]:  Error 0, type: corrected
Oct 26 03:43:00 db1112 kernel: [30883.430212] {10}[Hardware Error]:  fru_text: B1
Oct 26 03:43:00 db1112 kernel: [30883.430214] {10}[Hardware Error]:   section_type: memory error
Oct 26 03:43:00 db1112 kernel: [30883.430216] {10}[Hardware Error]:   error_status: 0x0000000000000400
Oct 26 03:43:00 db1112 kernel: [30883.430217] {10}[Hardware Error]:   physical_address: 0x000000586e3a2a00
Oct 26 03:43:00 db1112 kernel: [30883.430221] {10}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 3 row: 48886 column: 680
Oct 26 03:43:00 db1112 kernel: [30883.430222] {10}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 26 03:43:00 db1112 kernel: [30883.430254] mce: [Hardware Error]: Machine check events logged

I would say we shouldn't repool this host until the DIMM is changed, given the above errors happening again.

@Marostegui The DIMM arrived. Let me know if I can take this server down anytime or if it needs to be scheduled.

@Cmjohnson yeah, it would need to be scheduled, as we need to stop mysql first. Let us know which day would work for you!
Thank you

Marostegui changed the task status from Stalled to Open. Wed, Nov 10, 6:18 AM

@Marostegui 15 Nov, 1000 local (1500 GMT)?

Marostegui is out today, but I can handle that, so yep, let's go for it.

@Cmjohnson: db1112 powered off now. Let me know when it's ready to be put back in service. Cheers.

Cmjohnson claimed this task.

DIMM replaced, cleared the log, all yours

(Reopening for us)

db1112 is back up and in service. Let's leave it a day or two before we repool it though.

Kormat lowered the priority of this task from High to Medium.

Change 739126 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1112: Re-enable notifications

https://gerrit.wikimedia.org/r/739126

Change 739126 merged by Kormat:

[operations/puppet@production] db1112: Re-enable notifications

https://gerrit.wikimedia.org/r/739126

I have started to slowly repool this host.

Host fully repooled!