
labsdb1011 mariadb crashed
Closed, ResolvedPublic

Description

We were paged at Tue Sep 24 22:52:48 UTC 2019 for a mariadb crash on labsdb1011. It left a bunch of diagnostics in the logs, and some tables were apparently marked corrupt (though they appear to be tables related to wmf-pt-kill and mysql events rather than wikis).

It appears to have recovered for most intents and purposes, but I think it needs checking. This isn't a normal condition, after all.

Related Objects

Status      Assigned        Task
Resolved    Marostegui
Resolved    Marostegui

Event Timeline

Bstorm triaged this task as Normal priority. Tue, Sep 24, 11:10 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Tue, Sep 24, 11:10 PM
Bstorm moved this task from Backlog to Wiki replicas on the Data-Services board.
Bstorm added subscribers: jcrespo, Marostegui.

These are the logs on and around the crash: P9170

Change 538987 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1011 just in case of issues

https://gerrit.wikimedia.org/r/538987

Change 538987 merged by Bstorm:
[operations/puppet@production] wiki replicas: depool labsdb1011 just in case of issues

https://gerrit.wikimedia.org/r/538987

labsdb1011 is now depooled. @Marostegui if that doesn't seem useful, please repool it. I'm just hoping to prevent any possible harm in case it isn't working right.

Bstorm updated the task description. Tue, Sep 24, 11:38 PM

From what I can see:

  • No HW errors.
  • Nothing relevant on the graphs that could indicate what caused the issue.
  • Those warnings about the event scheduler are "normal".
  • No queries being killed by the query killer right before the crash (see the sketch after this list).
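For transparency, that last point came from sweeping the kill log around the crash time. A minimal Python sketch of that kind of sweep, with a hypothetical log path and line format (the real wmf-pt-kill log location and layout are not shown in this task):

    from datetime import datetime, timedelta

    CRASH = datetime(2019, 9, 24, 22, 48, 28)      # from the kernel segfault line below
    WINDOW = timedelta(minutes=10)                  # how far back before the crash to look
    LOG = "/var/log/wmf-pt-kill/wmf-pt-kill.log"    # hypothetical path

    with open(LOG) as fh:
        for line in fh:
            # Hypothetical line format: "2019-09-24T22:40:01 KILL <thread id> <user> <query>"
            try:
                ts = datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S")
            except (ValueError, IndexError):
                continue
            if CRASH - WINDOW <= ts <= CRASH:
                print(line.rstrip())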

Apart from the logs you pasted:

Sep 24 22:48:28 labsdb1011 kernel: [10074305.111470] mysqld[5779]: segfault at 18 ip 0000560ae346d099 sp 00007f8d8a561b10 error 4 in mysqld[560ae301c000+f9e000]
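As a side note, "segfault at 18" usually means a read through a (near-)NULL pointer, and the kernel line already contains what is needed to map the crash to a function: the instruction pointer minus the mapping base gives the offset inside the mysqld binary. A quick sketch of that arithmetic, assuming the on-disk /usr/sbin/mysqld still matches the running build and debug symbols are available:

    # Values copied from the kernel segfault line above.
    ip = 0x0000560ae346d099    # faulting instruction pointer
    base = 0x560ae301c000      # start of the mysqld mapping ("mysqld[560ae301c000+f9e000]")

    offset = ip - base
    print(hex(offset))         # 0x451099

    # The offset can then be resolved against a symbol-bearing binary, e.g.:
    print(f"addr2line -f -e /usr/sbin/mysqld {offset:#x}")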

I have started replication on all threads - let's see if there is any corruption there. As we are running ROW-based replication, if there is any data drift, replication will break.
Once it has caught up I will run a data check against another host anyway, just in case.
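This is not the tooling actually used here, but to illustrate the idea behind a cross-host data check: compare per-chunk checksums of the same table on two replicas and only dig into chunks that differ. A rough Python sketch (hosts, table, key, chunk size and credentials are purely illustrative, and the pymysql client library is assumed):

    import pymysql  # assumed MySQL/MariaDB client library

    TABLE, KEY, CHUNK = "revision", "rev_id", 100000   # illustrative table, primary key, chunk size

    def chunk_checksum(host, lo, hi):
        conn = pymysql.connect(host=host, db="enwiki", read_default_file="/root/.my.cnf")
        with conn.cursor() as cur:
            # Order-insensitive aggregate checksum over one primary-key range.
            cur.execute(
                f"SELECT COUNT(*), BIT_XOR(CRC32(CONCAT_WS('#', {KEY}, rev_sha1))) "
                f"FROM {TABLE} WHERE {KEY} BETWEEN %s AND %s", (lo, hi))
            return cur.fetchone()

    for lo in range(1, 2_000_000, CHUNK):
        hi = lo + CHUNK - 1
        if chunk_checksum("labsdb1011", lo, hi) != chunk_checksum("labsdb1010", lo, hi):
            print(f"chunk {lo}-{hi} differs, needs a row-level diff")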

My bet is on a long heavy query that made the server run out of resources (although I would have expected an OOM there...)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T05:06:42Z] <marostegui> Run a data check on labsdb1011 - T233766

Marostegui added a comment. Edited Wed, Sep 25, 7:31 AM

s4 (commonswiki) data comparison came back clean.
Ongoing:
  • s1 enwiki
  • s2 multiple wikis (https://raw.githubusercontent.com/wikimedia/operations-mediawiki-config/master/dblists/s2.dblist)
  • s8 wikidata

s1 (enwiki) clean

s8 wikidata clean

As replication has also been working fine for almost 8 hours (and any data drift would break replication), I am going to repool this host.
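For reference, "working fine" here means every replication connection on the multi-source setup (s1..s8) is running with no errors, which can be checked in one pass. A small Python sketch, assuming pymysql and credentials in /root/.my.cnf (both illustrative):

    import pymysql  # assumed client library; connection details are illustrative

    conn = pymysql.connect(host="labsdb1011.eqiad.wmnet", read_default_file="/root/.my.cnf")
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # MariaDB multi-source replication: one status row per named connection (s1..s8).
        cur.execute("SHOW ALL SLAVES STATUS")
        for row in cur.fetchall():
            ok = row["Slave_IO_Running"] == "Yes" and row["Slave_SQL_Running"] == "Yes"
            error = row["Last_SQL_Error"] or row["Last_IO_Error"]
            print(row["Connection_name"], "OK" if ok else f"BROKEN: {error}")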

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:44:34Z] <marostegui> Repool labsdb1011 T233766

Marostegui closed this task as Resolved. Wed, Sep 25, 12:46 PM
Marostegui claimed this task.

I am going to close this as resolved; there is not much else we can do now. If it crashes again, or we see replication breaking, it could mean there is indeed data corruption as a result of the crash and we might need to explore other approaches (recloning the host).

Out of curiosity, did you run the checks against their master, between replicas or something else?

Between replicas