pc1013 crashed silently: neither Icinga nor Prometheus raised an alert, despite notifications being enabled for the host. P69388 contains the notes taken while going through the server events and logs.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Duplicate | | None | T375382 Post pc1013 crash
Resolved | | ABran-WMF | T375395 Parsercache primary master databases should monitor replication
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2024-09-23T12:12:56Z] <jynus> restarting replication on pc1013 after crash T375382
It appears a hardware fault in memory caused an uncorrectable memory error, which in turn led to mysql being killed:
AFAIK pc1015 should be the candidate host if we want to fail it over, from dbctl:
"note": "Hot spare for pc4 and cold spare for pc3",
Good catch. Let's start by moving pc1015's replication from pc4 to pc3 (pc1013 -> pc1015), from the earliest binlog possible, for warmup (this should be a no-op). Later we can patch/run dbctl if everybody agrees that's the right approach.
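As a rough illustration of the "earliest binlog possible" step, here is a minimal Python sketch that picks the oldest binlog from a list (as `SHOW BINARY LOGS` on the source would return) and formats a MariaDB `CHANGE MASTER TO` statement. The binlog file names, the port, and the helper names are assumptions for illustration, not what was actually run; position 4 is assumed to be the first event offset in a binlog file.

```python
from typing import Sequence


def earliest_binlog(names: Sequence[str]) -> str:
    """Return the binlog with the lowest numeric suffix, e.g. 'x-bin.000009'."""
    if not names:
        raise ValueError("no binlogs available on the source")
    return min(names, key=lambda name: int(name.rsplit(".", 1)[1]))


def change_master_statement(host: str, binlogs: Sequence[str], port: int = 3306) -> str:
    """Build a CHANGE MASTER TO statement starting at the oldest binlog.

    Port 3306 and MASTER_LOG_POS=4 are assumptions, not values from the task.
    """
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', MASTER_PORT={port}, "
        f"MASTER_LOG_FILE='{earliest_binlog(binlogs)}', MASTER_LOG_POS=4;"
    )


# Hypothetical binlog names; the real ones would come from SHOW BINARY LOGS.
print(change_master_statement(
    "pc1013.eqiad.wmnet",
    ["pc1013-bin.000012", "pc1013-bin.000009", "pc1013-bin.000010"],
))
```

The statement would be run on pc1015 after `STOP SLAVE`, followed by `START SLAVE`; credentials and TLS options are deliberately left out of the sketch.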
Change #1075024 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] mariadb: Move pc1015 configuration to master of pc3 section
Mentioned in SAL (#wikimedia-operations) [2024-09-23T14:30:15Z] <jynus> restarting and moving replication source of pc1015 T375382
Change #1075024 merged by Jcrespo:
[operations/puppet@production] mariadb: Move pc1015 configuration to master of pc3 section
@MoritzMuehlenhoff mentioned that we might have spare parts available for this server from decommissioned, but not yet recycled, servers. @wiki_willy, I'm not sure who would be the best person to @, so I apologize for this one! Could you please help us route our message?
Change #1075036 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] mariadb: Disable pc1013 notifications
Change #1075036 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable pc1013 notifications
I've created T375395 to reflect that, despite being promoted from a replica to a master, and from passive to active, the host now has less monitoring than before. I think parsercache should have alerting (without paging) similar to that of active-active x2.
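To make the idea concrete, here is a minimal Python sketch of a non-paging replication check that classifies one row of MariaDB `SHOW SLAVE STATUS` output. The column names (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`) are the ones MariaDB reports; the thresholds and the function name are assumptions for illustration, not the check that was eventually deployed.

```python
from typing import Mapping, Optional

# Hypothetical thresholds in seconds; not taken from the task.
WARN_LAG_SECONDS = 60
CRIT_LAG_SECONDS = 300


def replication_state(status: Optional[Mapping[str, object]]) -> str:
    """Classify one SHOW SLAVE STATUS row as OK, WARNING or CRITICAL."""
    if not status:
        # An empty result means replication is not configured at all.
        return "CRITICAL"
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    if not (io_ok and sql_ok):
        return "CRITICAL"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # NULL lag means the replica cannot tell how far behind it is.
        return "CRITICAL"
    lag_seconds = int(lag)
    if lag_seconds >= CRIT_LAG_SECONDS:
        return "CRITICAL"
    if lag_seconds >= WARN_LAG_SECONDS:
        return "WARNING"
    return "OK"


# A stopped SQL thread, as one would expect after an unclean crash.
print(replication_state({
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "No",
    "Seconds_Behind_Master": None,
}))
```

Returning a state string rather than paging directly keeps the decision about notification routing (IRC, email, page) outside the check itself.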
++ @Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers
Hi! We do have a spare DIMM (32 GB, 2666 MT/s) that we can swap in at any time for this unit. Please let us know when the best time to proceed would be. Thanks!
Mentioned in SAL (#wikimedia-operations) [2024-10-01T07:54:39Z] <jynus@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T375382
Mentioned in SAL (#wikimedia-operations) [2024-10-01T07:58:23Z] <jynus@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T375382