db1112 (s3 contribs/rc replica) is down
Closed, Resolved · Public

Assigned To

Authored By

	RhinosF1
	Oct 25 2021, 6:37 PM

Description

Which means no cloud replication on s3

Details

	Subject	Repo	Branch	Lines +/-
	db1112: Re-enable notifications	operations/puppet	production	+0 -1
	db1112: Disable notifications	operations/puppet	production	+1 -0


Related Objects

Status	Assigned	Task
Resolved	• Marostegui	T294295 db1112 (s3 contribs/rc replica) is down
Resolved	• Cmjohnson	T294345 db1112 - DIMM replacement
		Unknown Object (Task)

Event Timeline

RhinosF1 created this task. Oct 25 2021, 6:37 PM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Oct 25 2021, 6:37 PM

Restricted Application added a subscriber: Aklapper.

RhinosF1 triaged this task as High priority. Oct 25 2021, 6:37 PM

Legoktm renamed this task from "db1112 (s3 sanitarium master) is down" to "db1112 (s3 contribs/rc replica) is down". Oct 25 2021, 7:00 PM

Legoktm raised the priority of this task from High to Unbreak Now!.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:04:36Z] <legoktm@cumin1001> dbctl commit (dc=all): 'Depool db1112 (T294295)', diff saved to https://phabricator.wikimedia.org/P17596 and previous config saved to /var/cache/conftool/dbconfig/20211025-190436-legoktm.json

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:07:17Z] <kormat@cumin1001> dbctl commit (dc=all): 'Temporarily move mw groups to db1123 T294295', diff saved to https://phabricator.wikimedia.org/P17597 and previous config saved to /var/cache/conftool/dbconfig/20211025-190717-kormat.json

Nothing obvious in syslog, last entry before reboot was:

Oct 25 18:23:03 db1112 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.

Nothing obvious in ipmi-sel either:

$ sudo ipmi-sel 
ID  | Date        | Time     | Name             | Type                     | Event
1   | Nov-04-2017 | 08:31:17 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
...
21  | Oct-10-2019 | 12:02:56 | PS Redundancy    | Power Supply             | Fully Redundant
22  | Oct-23-2021 | 10:54:23 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h

That last entry looks a bit suspicious, but it was 2 days ago, and it was only a transition from OK to non-critical.
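For anyone wanting to filter a long SEL rather than eyeball it, here is a minimal sketch that parses the pipe-delimited `ipmi-sel` output shown above and keeps only the memory events. The function name `parse_sel` and the `SAMPLE` string are illustrative assumptions; the rows are copied from the output above with the elided entries left out.

```python
def parse_sel(text):
    """Parse pipe-delimited `ipmi-sel` output into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip shell prompts, "..." elisions and blank lines
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Rows copied from the ipmi-sel output above (elided entries omitted).
SAMPLE = """\
ID  | Date        | Time     | Name             | Type                     | Event
1   | Nov-04-2017 | 08:31:17 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
21  | Oct-10-2019 | 12:02:56 | PS Redundancy    | Power Supply             | Fully Redundant
22  | Oct-23-2021 | 10:54:23 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h
"""

memory_events = [row for row in parse_sel(SAMPLE) if row["Type"] == "Memory"]
for event in memory_events:
    print(event["Date"], event["Time"], event["Name"], "-", event["Event"])
```

On the sample this leaves only record 22, the ECC warning from Oct 23.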

The server went down unexpectedly by itself. I connected to mgmt, power-cycled it, and watched it boot up; besides a normal fsck due to the unexpected power-down, there was nothing obvious.

It just booted as normal from there.

Then in racadm getsel there is this from 2 days ago.

-------------------------------------------------------------------------------
Record:      22
Date/Time:   10/23/2021 10:54:23
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------

Dropping priority. The host is depooled, and the only real impact right now is wikireplicas for s3, which is non-critical. I'll look into what happened in more detail tomorrow.

Zabe subscribed. Oct 25 2021, 7:22 PM

I'm bringing mariadb up on the host (without replication running), and am going to let mysqlcheck --all-databases run on it overnight.

Urbanecm subscribed. Oct 25 2021, 7:43 PM

LSobanski moved this task from Triage to In progress on the DBA board. Oct 25 2021, 7:45 PM

LSobanski removed a project: Data-Persistence.

We've seen hosts getting rebooted after having memory issues. This host is obviously out of warranty, but maybe we can get a spare DIMM somewhere and replace B1?

Server is currently back up and sitting at login. Forwarding the DIMM replacement question to ops-eqiad and adding the tag.

Dzahn renamed this task from "db1112 (s3 contribs/rc replica) is down" to "db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down)". Oct 25 2021, 8:01 PM

Dzahn added projects: ops-eqiad, DC-Ops.

Maintenance_bot added a project: SRE. Oct 25 2021, 8:45 PM

Once the table check is completed, let's do a data check for some of the tables of the biggest wikis

Change 734448 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1112: Disable notifications

https://gerrit.wikimedia.org/r/734448

Change 734448 merged by Marostegui:

[operations/puppet@production] db1112: Disable notifications

https://gerrit.wikimedia.org/r/734448

So the logs show lots of errors from previous days:

Oct 19 06:45:09 db1112 kernel: [13398165.820227] mce: [Hardware Error]: Machine check events logged
Oct 19 07:01:51 db1112 kernel: [13399167.594963] {16}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:01:51 db1112 kernel: [13399167.594966] {16}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 19 07:01:51 db1112 kernel: [13399167.594967] {16}[Hardware Error]: event severity: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594969] {16}[Hardware Error]:  Error 0, type: corrected
Oct 19 07:01:51 db1112 kernel: [13399167.594970] {16}[Hardware Error]:  fru_text: B1
Oct 19 07:01:51 db1112 kernel: [13399167.594971] {16}[Hardware Error]:   section_type: memory error
Oct 19 07:01:51 db1112 kernel: [13399167.594972] {16}[Hardware Error]:   error_status: 0x0000000000000400
Oct 19 07:01:51 db1112 kernel: [13399167.594973] {16}[Hardware Error]:   physical_address: 0x000000691a6c48c0
Oct 19 07:01:51 db1112 kernel: [13399167.594975] {16}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 3 row: 18851 column: 288
Oct 19 07:01:51 db1112 kernel: [13399167.594977] {16}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 19 07:01:51 db1112 kernel: [13399167.595008] mce: [Hardware Error]: Machine check events logged
Oct 19 07:02:01 db1112 kernel: [13399177.972838] {17}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 19 07:02:01 db1112 kernel: [13399177.972841] {17}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 19 07:02:01 db1112 kernel: [13399177.972842] {17}[Hardware Error]: event severity: corrected
Oct 19 07:02:01 db1112 kernel: [13399177.972844] {17}[Hardware Error]:  Error 0, type: corrected
Oct 19 07:02:01 db1112 kernel: [13399177.972845] {17}[Hardware Error]:  fru_text: B1
Oct 19 07:02:01 db1112 kernel: [13399177.972846] {17}[Hardware Error]:   section_type: memory error
Oct 19 07:02:01 db1112 kernel: [13399177.972847] {17}[Hardware Error]:   error_status: 0x0000000000000400
Oct 19 07:02:01 db1112 kernel: [13399177.972848] {17}[Hardware Error]:   physical_address: 0x0000004240902c80
Oct 19 07:02:01 db1112 kernel: [13399177.972851] {17}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 2 row: 3092 column: 176
Oct 19 07:02:01 db1112 kernel: [13399177.972852] {17}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 19 07:02:01 db1112 kernel: [13399177.972897] mce: [Hardware Error]: Machine check events logged

We should really get that B1 DIMM replaced.
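To quantify how often each DIMM shows up, a rough sketch along these lines could tally the APEI records by `fru_text` and error type. The regexes are written against the log format pasted above only, and the names `count_errors_by_dimm` and `SAMPLE` are illustrative assumptions, not an existing tool.

```python
import re
from collections import Counter

# Written against the kernel log lines pasted above, e.g.
# "... {16}[Hardware Error]:  fru_text: B1"
FRU_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s+fru_text:\s*(\S+)")
# "... {16}[Hardware Error]:   error_type: 2, single-bit ECC"
ECC_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s+error_type:\s*\d+,\s*(.+)$")


def count_errors_by_dimm(log_text):
    """Return a Counter mapping (dimm, error kind) to number of records."""
    fru_by_record = {}  # record number in braces -> DIMM label
    counts = Counter()
    for line in log_text.splitlines():
        match = FRU_RE.search(line)
        if match:
            fru_by_record[match.group(1)] = match.group(2)
            continue
        match = ECC_RE.search(line)
        if match:
            dimm = fru_by_record.get(match.group(1), "unknown")
            counts[(dimm, match.group(2).strip())] += 1
    return counts


# Four lines copied from the Oct 19 log excerpt above.
SAMPLE = """\
Oct 19 07:01:51 db1112 kernel: [13399167.594970] {16}[Hardware Error]:  fru_text: B1
Oct 19 07:01:51 db1112 kernel: [13399167.594977] {16}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 19 07:02:01 db1112 kernel: [13399177.972845] {17}[Hardware Error]:  fru_text: B1
Oct 19 07:02:01 db1112 kernel: [13399177.972852] {17}[Hardware Error]:   error_type: 2, single-bit ECC
"""

counts = count_errors_by_dimm(SAMPLE)
for (dimm, kind), n in counts.most_common():
    print(f"{dimm}: {n} x {kind}")
```

In practice the input would be the full kernel log rather than a pasted sample; on the excerpt above both records point at B1.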

Maintenance_bot removed a project: Patch-For-Review. Oct 26 2021, 5:10 AM

LSobanski subscribed. Oct 26 2021, 8:26 AM

mysqlcheck --all-databases completed successfully. Started replication again. Will run db-compare against it once it has finished catching up.

Moved the dc-ops request to a subtask T294345: db1112 - DIMM replacement to simplify tracking for them.

In T294295#7457805, @Kormat wrote:

mysqlcheck --all-databases completed successfully. Started replication again. Will run db-compare against it once it has finished catching up.

I ran db-compare against the 3 largest wikis by space (ruwikinews, arzwiki and cewiki); no diffs were found.

I'm going to leave it running as a sanitarium master for the moment, but leave it depooled on the MW side. Hopefully dcops can dig up a spare dimm for it, but if not it would be good to see that it runs for a few days without crashing again before considering repooling it.

+1. I would even leave it running until Monday, then restart mariadb on Monday, running this first:

stop slave; SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;

And then start it again and monitor the error log just in case

More recent errors from the B1 module:

Oct 26 03:19:30 db1112 kernel: [29473.766843] mce: [Hardware Error]: Machine check events logged
Oct 26 03:35:38 db1112 kernel: [30441.033563] perf: interrupt took too long (3932 > 3930), lowering kernel.perf_event_max_sample_rate to 50750
Oct 26 03:37:50 db1112 kernel: [30573.322266] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 26 03:37:50 db1112 kernel: [30573.322268] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 26 03:37:50 db1112 kernel: [30573.322270] {9}[Hardware Error]: event severity: corrected
Oct 26 03:37:50 db1112 kernel: [30573.322271] {9}[Hardware Error]:  Error 0, type: corrected
Oct 26 03:37:50 db1112 kernel: [30573.322271] {9}[Hardware Error]:  fru_text: B1
Oct 26 03:37:50 db1112 kernel: [30573.322272] {9}[Hardware Error]:   section_type: memory error
Oct 26 03:37:50 db1112 kernel: [30573.322273] {9}[Hardware Error]:   error_status: 0x0000000000000400
Oct 26 03:37:50 db1112 kernel: [30573.322273] {9}[Hardware Error]:   physical_address: 0x00000052c3a204c0
Oct 26 03:37:50 db1112 kernel: [30573.322275] {9}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 2 row: 37928 column: 528
Oct 26 03:37:50 db1112 kernel: [30573.322276] {9}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 26 03:37:50 db1112 kernel: [30573.322297] mce: [Hardware Error]: Machine check events logged
Oct 26 03:43:00 db1112 kernel: [30883.430205] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Oct 26 03:43:00 db1112 kernel: [30883.430208] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 26 03:43:00 db1112 kernel: [30883.430209] {10}[Hardware Error]: event severity: corrected
Oct 26 03:43:00 db1112 kernel: [30883.430211] {10}[Hardware Error]:  Error 0, type: corrected
Oct 26 03:43:00 db1112 kernel: [30883.430212] {10}[Hardware Error]:  fru_text: B1
Oct 26 03:43:00 db1112 kernel: [30883.430214] {10}[Hardware Error]:   section_type: memory error
Oct 26 03:43:00 db1112 kernel: [30883.430216] {10}[Hardware Error]:   error_status: 0x0000000000000400
Oct 26 03:43:00 db1112 kernel: [30883.430217] {10}[Hardware Error]:   physical_address: 0x000000586e3a2a00
Oct 26 03:43:00 db1112 kernel: [30883.430221] {10}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 0 bank: 3 row: 48886 column: 680
Oct 26 03:43:00 db1112 kernel: [30883.430222] {10}[Hardware Error]:   error_type: 2, single-bit ECC
Oct 26 03:43:00 db1112 kernel: [30883.430254] mce: [Hardware Error]: Machine check events logged
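The bracketed kernel timestamps count seconds since boot, so the drop from about 13.4 million seconds in the Oct 19 excerpt to about 29 thousand here is consistent with the reboot on Oct 25. A small sketch of that arithmetic follows; `estimate_boot_time` is a hypothetical helper, the year has to be supplied because syslog lines omit it, and `%b` parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Syslog prefix followed by the kernel's seconds-since-boot stamp, e.g.
# "Oct 26 03:19:30 db1112 kernel: [29473.766843] ..."
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: \[\s*(\d+\.\d+)\]"
)


def estimate_boot_time(line, year):
    """Subtract the kernel uptime stamp from the syslog wall-clock time."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("not a kernel syslog line")
    wall = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return wall - timedelta(seconds=float(match.group(2)))


# First line of each excerpt quoted in this task.
before = ("Oct 19 06:45:09 db1112 kernel: [13398165.820227] "
          "mce: [Hardware Error]: Machine check events logged")
after = ("Oct 26 03:19:30 db1112 kernel: [29473.766843] "
         "mce: [Hardware Error]: Machine check events logged")

boot_before = estimate_boot_time(before, 2021)
boot_after = estimate_boot_time(after, 2021)
print("boot implied by Oct 19 line:", boot_before)
print("boot implied by Oct 26 line:", boot_after)
```

For the Oct 26 line this works out to about 19:08:16 on Oct 25 in the log's own time zone, which, if the syslog timestamps are UTC, is a few minutes after the depool logged at 19:04:36Z. Kernel timestamps can drift from wall-clock time over long uptimes, so the Oct 19 estimate is only approximate.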

I would say we shouldn't repool this host until the DIMM is changed, given the above errors happening again.

@Marostegui The DIMM arrived, let me know if I can take this server down anytime or if it needs to be scheduled

@Cmjohnson yeah, it would need to be scheduled, as we need to stop mysql first. Let us know which day would work for you!
Thank you

• Marostegui changed the task status from Stalled to Open. Nov 10 2021, 6:18 AM

@Marostegui 15 Nov 1000 Local 1500GMT ?

In T294295#7500678, @Cmjohnson wrote:

@Marostegui 15 Nov 1000 Local 1500GMT ?

Marostegui is out today, but I can handle that, so yep, let's go for it.

@Cmjohnson: db1112 powered off now. Let me know when it's ready to be put back in service. Cheers.

DIMM replaced, cleared the log, all yours

• Cmjohnson closed subtask T294345: db1112 - DIMM replacement as Resolved. Nov 15 2021, 3:30 PM

(Reopening for us)

db1112 is back up and in service. Let's leave it a day or two before we repool it though.

Kormat removed • Cmjohnson as the assignee of this task. Nov 15 2021, 3:45 PM

Kormat lowered the priority of this task from High to Medium.

Change 739126 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1112: Re-enable notifications

https://gerrit.wikimedia.org/r/739126

Change 739126 merged by Kormat:

[operations/puppet@production] db1112: Re-enable notifications

https://gerrit.wikimedia.org/r/739126

Maintenance_bot removed a project: Patch-For-Review. Nov 16 2021, 11:10 AM

I will take care of this

I have started to slowly repool this host.

Host fully repooled!
