db2135 crashed
Closed, Resolved, Public

Description

Mar 24 22:26:53 db2135 mysqld[3981]: 210324 22:26:53 [ERROR] mysqld got signal 11 ;
Mar 24 22:26:53 db2135 mysqld[3981]: This could be because you hit a bug. It is also possible that this binary
Mar 24 22:26:53 db2135 mysqld[3981]: or one of the libraries it was linked against is corrupt, improperly built,
Mar 24 22:26:53 db2135 mysqld[3981]: or misconfigured. This error can also be caused by malfunctioning hardware.
Mar 24 22:26:53 db2135 mysqld[3981]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Mar 24 22:26:53 db2135 mysqld[3981]: We will try our best to scrape up some info that will hopefully help
Mar 24 22:26:53 db2135 mysqld[3981]: diagnose the problem, but since we have already crashed,
Mar 24 22:26:53 db2135 mysqld[3981]: something is definitely wrong and this may fail.
Mar 24 22:26:53 db2135 mysqld[3981]: Server version: 10.4.13-MariaDB-log

Suspiciously, this seems to have happened seconds after https://gerrit.wikimedia.org/r/c/operations/puppet/+/674724 @Legoktm

Event Timeline

Oh crap, it is probably my fault. I had to delete and recreate some tables that had the wrong charset (T277286#6944044). I wasn't aware that this could crash MySQL at all. I should also have noticed the m5 alerts in -operations and connected the dots. Please let me know if there's anything I can do to help fix the situation.

It crashed after "CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)" at 2021-03-24 22:26:53.

I restarted the host to check for hw errors.

After upgrade and restart, I ran into:

Error 'Duplicate key name 'ix_mailinglist_list_id'' on query. Default database: 'testmailman3'. Query: 'CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)

The index existed on all s5 servers on codfw, so I dropped it using replication and then restarted replication.

LSobanski triaged this task as Medium priority. Thu, Mar 25, 1:35 PM
LSobanski moved this task from Triage to In progress on the DBA board.

This looks like https://jira.mariadb.org/browse/MDEV-23019, which was fixed in 10.4.14.

The server was running 10.4.13 when the crash occurred. The server is now running 10.4.18.
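
For anyone who wants to check other hosts against the fixed release, here is a minimal sketch in Python. It assumes version strings shaped like the one in the crash log ("10.4.13-MariaDB-log") and takes 10.4.14 as the fix release, as stated in the comment above. The function names are hypothetical, and the check only covers the 10.4 series because that is the only one discussed in this task; a False result for any other series means "not assessed", not "safe".

```python
import re

# Fix release for MDEV-23019 as stated in this task (assumption: 10.4 series only).
FIXED_IN = (10, 4, 14)


def parse_version(version_string):
    """Extract (major, minor, patch) from a string such as '10.4.13-MariaDB-log'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version_string, fixed_in=FIXED_IN):
    """Return True if the version is in the same series as the fix and older than it.

    Versions from other series return False, which here means "not assessed".
    """
    version = parse_version(version_string)
    return version[:2] == fixed_in[:2] and version < fixed_in


if __name__ == "__main__":
    # Version at crash time (from the log) and the version after the upgrade.
    for reported in ("10.4.13-MariaDB-log", "10.4.18"):
        print(reported, "affected" if is_affected(reported) else "not flagged")
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "10.4.9" sorts after "10.4.13" lexically.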

Wonder if this could have also been the reason for T272614.

We still have 34 hosts running 10.4.13, should these be fast-tracked for an upgrade?

Probably we should, just to be on the safe side.

What else is pending on this task?

Nothing else that I'm aware of.

Thanks, will create a task to upgrade 10.4.13 hosts and close this.

Marostegui assigned this task to jcrespo.

Thanks everyone for responding to this.