Description
Which means no cloud replication on s3.
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +0 -1 | db1112: Re-enable notifications
operations/puppet | production | +1 -0 | db1112: Disable notifications
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Marostegui | T294295 db1112 (s3 contribs/rc replica) is down
Resolved | | Cmjohnson | T294345 db1112 - DIMM replacement
| | | Unknown Object (Task)
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:04:36Z] <legoktm@cumin1001> dbctl commit (dc=all): 'Depool db1112 (T294295)', diff saved to https://phabricator.wikimedia.org/P17596 and previous config saved to /var/cache/conftool/dbconfig/20211025-190436-legoktm.json
Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:07:17Z] <kormat@cumin1001> dbctl commit (dc=all): 'Temporarily move mw groups to db1123 T294295', diff saved to https://phabricator.wikimedia.org/P17597 and previous config saved to /var/cache/conftool/dbconfig/20211025-190717-kormat.json
Nothing obvious in syslog, last entry before reboot was:
Oct 25 18:23:03 db1112 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Nothing obvious in ipmi-sel either:
$ sudo ipmi-sel
ID | Date        | Time     | Name            | Type                   | Event
1  | Nov-04-2017 | 08:31:17 | SEL             | Event Logging Disabled | Log Area Reset/Cleared
...
21 | Oct-10-2019 | 12:02:56 | PS Redundancy   | Power Supply           | Fully Redundant
22 | Oct-23-2021 | 10:54:23 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h
That last entry looks a bit suspicious, but it was 2 days ago, and it was only a correctable-memory warning (a transition from OK to non-critical).
The server went down unexpectedly by itself. I connected to mgmt, power-cycled it, and watched it boot up; besides a normal fsck due to the unexpected power-down, there was nothing obvious.
It just booted as normal from there.
Then, in racadm getsel, there is this entry from 2 days ago:
-------------------------------------------------------------------------------
Record:      22
Date/Time:   10/23/2021 10:54:23
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Dropping priority. The host is depooled, and the only real impact right now is wikireplicas for s3, which is non-critical. I'll look into what happened in more detail tomorrow.
I'm bringing mariadb up on the host (without replication running), and am going to let mysqlcheck --all-databases run on it overnight.
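For reference, a minimal sketch of those two steps, assuming a systemd-managed mariadb unit (the unit name and exact invocation are assumptions, not taken from this task):

systemctl start mariadb        # bring the daemon up
mysql -e "STOP SLAVE;"         # keep the replication threads stopped while we check
mysqlcheck --all-databases     # full table check across every database; left running overnight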
We've seen hosts getting rebooted after having memory issues. This host is obviously out of warranty, but maybe we can get a spare DIMM somewhere and replace B1?
Server is currently back up and sitting at login. Forwarding the DIMM replacement question to ops-eqiad. Adding tag.
Once the table check is completed, let's do a data check on some of the tables of the biggest wikis.
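As an illustration of what such a data check boils down to (the actual tooling here is db-compare; the wiki and table below are only examples), one can compare per-table checksums between the replica and its master and diff the output:

# Run the same statement on db1112 and on its master, then compare the results
# (ruwikinews.revision chosen purely as an example):
mysql -e "CHECKSUM TABLE ruwikinews.revision EXTENDED;"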
Change 734448 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1112: Disable notifications
Change 734448 merged by Marostegui:
[operations/puppet@production] db1112: Disable notifications
So the logs show lots of errors from previous days:
Oct 19 06:45:09 db1112 kernel: [13398165.820227] mce: [Hardware Error]: Machine check events logged
Oct 19 07:01:51 db1112 kernel: [13399167.594963] {16}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:01:51 db1112 kernel: [13399167.594966] {16}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 19 07:01:51 db1112 kernel: [13399167.594967] {16}[Hardware Error]: event severity: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594969] {16}[Hardware Error]: Error 0, type: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594970] {16}[Hardware Error]: fru_text: B1
Oct 19 07:01:51 db1112 kernel: [13399167.594971] {16}[Hardware Error]: section_type: memory error
Oct 19 07:01:51 db1112 kernel: [13399167.594972] {16}[Hardware Error]: error_status: 0x0000000000000400
Oct 19 07:01:51 db1112 kernel: [13399167.594973] {16}[Hardware Error]: physical_address: 0x000000691a6c48c0
Oct 19 07:01:51 db1112 kernel: [13399167.594975] {16}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 3 row: 18851 column: 288
Oct 19 07:01:51 db1112 kernel: [13399167.594977] {16}[Hardware Error]: error_type: 2, single-bit ECC
Oct 19 07:01:51 db1112 kernel: [13399167.595008] mce: [Hardware Error]: Machine check events logged
Oct 19 07:02:01 db1112 kernel: [13399177.972838] {17}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:02:01 db1112 kernel: [13399177.972841] {17}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 19 07:02:01 db1112 kernel: [13399177.972842] {17}[Hardware Error]: event severity: corrected
Oct 19 07:02:01 db1112 kernel: [13399177.972844] {17}[Hardware Error]: Error 0, type: corrected
Oct 19 07:02:01 db1112 kernel: [13399177.972845] {17}[Hardware Error]: fru_text: B1
Oct 19 07:02:01 db1112 kernel: [13399177.972846] {17}[Hardware Error]: section_type: memory error
Oct 19 07:02:01 db1112 kernel: [13399177.972847] {17}[Hardware Error]: error_status: 0x0000000000000400
Oct 19 07:02:01 db1112 kernel: [13399177.972848] {17}[Hardware Error]: physical_address: 0x0000004240902c80
Oct 19 07:02:01 db1112 kernel: [13399177.972851] {17}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 2 row: 3092 column: 176
Oct 19 07:02:01 db1112 kernel: [13399177.972852] {17}[Hardware Error]: error_type: 2, single-bit ECC
Oct 19 07:02:01 db1112 kernel: [13399177.972897] mce: [Hardware Error]: Machine check events logged
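To gauge how frequent these corrected events are, a quick tally per DIMM is enough (the log path is an assumption; on this host the messages may only be in the journal):

# Count corrected-ECC events attributed to DIMM B1 in the kernel log:
grep -c 'fru_text: B1' /var/log/kern.log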
We should really get that B1 DIMM replaced.
mysqlcheck --all-databases completed successfully. Started replication again. Will run db-compare against it once it has finished catching up.
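For context, restarting replication and watching it catch up is just the standard MariaDB sequence (not quoted from this task):

mysql -e "START SLAVE;"
# Seconds_Behind_Master trending to 0 means it has caught up:
mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master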
Moved the dc-ops request to a subtask T294345: db1112 - DIMM replacement to simplify tracking for them.
I ran db-compare against the 3 largest wikis by space (ruwikinews, arzwiki and cewiki); no diffs were found.
I'm going to leave it running as a sanitarium master for the moment, but leave it depooled on the MW side. Hopefully dcops can dig up a spare DIMM for it, but if not, it would be good to see it run for a few days without crashing again before considering repooling it.
+1. I would even leave it running until Monday and then issue a mariadb restart on Monday, running this first:
STOP SLAVE; SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;
And then start it again and monitor the error log, just in case.
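A sketch of that whole sequence, assuming a systemd-managed mariadb unit (unit and log names are assumptions):

mysql -e "STOP SLAVE; SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;"
systemctl restart mariadb    # skipping the buffer pool dump keeps the shutdown quick
journalctl -u mariadb -f     # follow the service log; the mariadb error log file can be tailed instead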
More recent errors from the B1 module:
Oct 26 03:19:30 db1112 kernel: [29473.766843] mce: [Hardware Error]: Machine check events logged
Oct 26 03:35:38 db1112 kernel: [30441.033563] perf: interrupt took too long (3932 > 3930), lowering kernel.perf_event_max_sample_rate to 50750
Oct 26 03:37:50 db1112 kernel: [30573.322266] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 26 03:37:50 db1112 kernel: [30573.322268] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 26 03:37:50 db1112 kernel: [30573.322270] {9}[Hardware Error]: event severity: corrected
Oct 26 03:37:50 db1112 kernel: [30573.322271] {9}[Hardware Error]: Error 0, type: corrected
Oct 26 03:37:50 db1112 kernel: [30573.322271] {9}[Hardware Error]: fru_text: B1
Oct 26 03:37:50 db1112 kernel: [30573.322272] {9}[Hardware Error]: section_type: memory error
Oct 26 03:37:50 db1112 kernel: [30573.322273] {9}[Hardware Error]: error_status: 0x0000000000000400
Oct 26 03:37:50 db1112 kernel: [30573.322273] {9}[Hardware Error]: physical_address: 0x00000052c3a204c0
Oct 26 03:37:50 db1112 kernel: [30573.322275] {9}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 2 row: 37928 column: 528
Oct 26 03:37:50 db1112 kernel: [30573.322276] {9}[Hardware Error]: error_type: 2, single-bit ECC
Oct 26 03:37:50 db1112 kernel: [30573.322297] mce: [Hardware Error]: Machine check events logged
Oct 26 03:43:00 db1112 kernel: [30883.430205] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 26 03:43:00 db1112 kernel: [30883.430208] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 26 03:43:00 db1112 kernel: [30883.430209] {10}[Hardware Error]: event severity: corrected
Oct 26 03:43:00 db1112 kernel: [30883.430211] {10}[Hardware Error]: Error 0, type: corrected
Oct 26 03:43:00 db1112 kernel: [30883.430212] {10}[Hardware Error]: fru_text: B1
Oct 26 03:43:00 db1112 kernel: [30883.430214] {10}[Hardware Error]: section_type: memory error
Oct 26 03:43:00 db1112 kernel: [30883.430216] {10}[Hardware Error]: error_status: 0x0000000000000400
Oct 26 03:43:00 db1112 kernel: [30883.430217] {10}[Hardware Error]: physical_address: 0x000000586e3a2a00
Oct 26 03:43:00 db1112 kernel: [30883.430221] {10}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 3 row: 48886 column: 680
Oct 26 03:43:00 db1112 kernel: [30883.430222] {10}[Hardware Error]: error_type: 2, single-bit ECC
Oct 26 03:43:00 db1112 kernel: [30883.430254] mce: [Hardware Error]: Machine check events logged
I would say we shouldn't repool this host until the DIMM is changed, given the above errors happening again.
@Marostegui The DIMM arrived, let me know if I can take this server down anytime or if it needs to be scheduled
@Cmjohnson yeah, it would need to be scheduled, as we need to stop mysql first. Let us know which day would work for you!
Thank you
@Cmjohnson: db1112 powered off now. Let me know when it's ready to be put back in service. Cheers.
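(For reference, the power-off preparation is roughly the following; commands are generic, not quoted from this task:)

mysql -e "STOP SLAVE;"    # stop replication cleanly
systemctl stop mariadb    # shut mysql down before cutting power
poweroff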
(Reopening for us)
db1112 is back up and in service. Let's leave it a day or two before we repool it though.
Change 739126 had a related patch set uploaded (by Kormat; author: Kormat):
[operations/puppet@production] db1112: Re-enable notifications
Change 739126 merged by Kormat:
[operations/puppet@production] db1112: Re-enable notifications