labsdb1011 mariadb crashed
Closed, Resolved (Public)

Description

We were paged at Tue Sep 24 22:52:48 UTC 2019 for a MariaDB crash on labsdb1011. It left a number of diagnostics in the logs, and some tables were apparently marked corrupt (though these appear to be tables related to wmf-pt-kill and MySQL events rather than to the wikis).

It appears to have recovered for most intents and purposes, but I think it needs checking. This isn't a normal condition, after all.

Related Objects

Status     Subtype   Assigned
Resolved             Marostegui
Resolved             Marostegui
Open                 Bstorm
Resolved             Bstorm
Resolved             Marostegui
Resolved             Marostegui
Resolved             Marostegui
Resolved             Bstorm
Resolved             Bstorm
Resolved             MoritzMuehlenhoff
Resolved             Marostegui
Resolved             Marostegui
Resolved             Cmjohnson
Resolved             dcaro
Resolved             Marostegui
Resolved   Request   wiki_willy
Resolved   Request   Cmjohnson
Resolved   Request   Cmjohnson
Resolved   Request   Cmjohnson
Resolved   Request   Cmjohnson
Declined             None
Resolved             Kormat
Resolved             ArielGlenn
Open                 Bstorm
Declined             Bstorm
Open                 Bstorm
Resolved             Bstorm
Open                 Jhernandez
Resolved             razzi
Resolved             Marostegui
Resolved             Milimetric
Resolved             Bstorm
Resolved             Bstorm
Resolved             Bstorm
Open                 None
Resolved             Bstorm
Resolved             Andrew
Resolved             Bstorm
Open                 None
Resolved             Jhernandez
Resolved             Marostegui
Resolved             Ragesoss
Resolved             Bstorm
Resolved             Bstorm
Open                 Bstorm

Event Timeline

Bstorm triaged this task as Medium priority. Sep 24 2019, 11:10 PM
Bstorm created this task.
Bstorm moved this task from Backlog to Wiki replicas on the Data-Services board.
Bstorm added subscribers: jcrespo, Marostegui.

These are the logs from around the crash: P9170

Change 538987 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool lasbdb1011 just in case of issues

https://gerrit.wikimedia.org/r/538987

Change 538987 merged by Bstorm:
[operations/puppet@production] wiki replicas: depool lasbdb1011 just in case of issues

https://gerrit.wikimedia.org/r/538987

labsdb1011 is now depooled. @Marostegui if that doesn't seem useful, please repool it. I'm just hoping to prevent any possible harm in case it isn't working right.

From what I can see:

  • No HW errors.
  • Nothing relevant on the graphs that could indicate what caused the issue.
  • Those warnings about the event scheduler are "normal".
  • No queries were killed by the query killer right before the crash.

Apart from the logs you pasted:

Sep 24 22:48:28 labsdb1011 kernel: [10074305.111470] mysqld[5779]: segfault at 18 ip 0000560ae346d099 sp 00007f8d8a561b10 error 4 in mysqld[560ae301c000+f9e000]
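The kernel line can be partly decoded: `error 4` is a user-mode read of a not-present page, and the faulting address `0x18` suggests dereferencing a field at offset 0x18 of a NULL pointer. Subtracting the mapping base from the instruction pointer gives a file offset that addr2line could resolve into a function name, given a mysqld build with matching debug symbols (the binary path below is hypothetical):

```shell
# Offset of the faulting instruction within the mysqld mapping,
# from the logged ip and base address:
printf '0x%x\n' $(( 0x560ae346d099 - 0x560ae301c000 ))   # -> 0x451099

# Then, on a host with the matching debug symbols (hypothetical path):
# addr2line -f -e /usr/sbin/mysqld 0x451099
```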

I have started replication on all threads - let's see if there is any corruption there. As we are running ROW-based replication, any data drift will break replication.
Once it has caught up, I will run a data check against another host anyway, just in case.
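The between-replicas comparison can be sketched roughly as follows. This is a hedged illustration only: the actual check presumably used WMF tooling or something like pt-table-checksum against the live MariaDB hosts, whereas this sketch uses in-memory sqlite3 stand-ins, and the table name, columns, and chunk size are assumptions for demonstration. The idea is the same: checksum primary-key chunks on each replica and flag chunks whose checksums disagree.

```python
# Sketch of a chunked data comparison between two replicas.
# sqlite3 databases stand in for the two MariaDB hosts; the
# "revision" table and chunk size of 2 are illustrative only.
import sqlite3
import zlib

def load(rows):
    """Build a stand-in replica with a tiny 'revision' table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_sha1 TEXT)")
    db.executemany("INSERT INTO revision VALUES (?, ?)", rows)
    return db

def chunk_checksums(db, chunk=2):
    """CRC32 over each primary-key chunk, keyed by the chunk's base rev_id."""
    sums = {}
    for rev_id, sha1 in db.execute(
        "SELECT rev_id, rev_sha1 FROM revision ORDER BY rev_id"
    ):
        key = rev_id - (rev_id % chunk)
        sums[key] = zlib.crc32(f"{rev_id}:{sha1}".encode(), sums.get(key, 0))
    return sums

replica_a = load([(1, "abc"), (2, "def"), (3, "ghi")])
replica_b = load([(1, "abc"), (2, "XXX"), (3, "ghi")])  # one drifted row

a, b = chunk_checksums(replica_a), chunk_checksums(replica_b)
mismatched = sorted(k for k in a if a[k] != b.get(k))
print(mismatched)  # chunks that need a closer row-by-row look
```

A real check would chunk on large primary-key ranges and compare aggregate checksums server-side, so only mismatching chunks need row-level inspection.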

My bet is on a long heavy query that made the server run out of resources (although I would have expected an OOM there...)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T05:06:42Z] <marostegui> Run a data check on labsdb1011 - T233766

s4 (commonswiki) data comparison came back clean.
Ongoing:
s1 enwiki
s2 multiple wikis (https://raw.githubusercontent.com/wikimedia/operations-mediawiki-config/master/dblists/s2.dblist)
s8 wikidata

s8 wikidata clean

As replication has also been working fine for almost 8 hours (and any data drift would break replication) I am going to repool this host.

Marostegui claimed this task.

I am going to close this as resolved; there is not much else we can do now. If it crashes again, or if we see replication breaking, it could mean there is indeed data corruption as a result of the crash, and we might need to explore other approaches (such as recloning the host).

Out of curiosity, did you run the checks against their master, between replicas or something else?

Between replicas

Marostegui closed subtask Restricted Task as Resolved.Dec 17 2019, 8:34 AM