db2135 crashed
Closed, Resolved · Public

Authored By

	jcrespo
	Mar 25 2021, 7:04 AM

Description

Mar 24 22:26:53 db2135 mysqld[3981]: 210324 22:26:53 [ERROR] mysqld got signal 11 ;
Mar 24 22:26:53 db2135 mysqld[3981]: This could be because you hit a bug. It is also possible that this binary
Mar 24 22:26:53 db2135 mysqld[3981]: or one of the libraries it was linked against is corrupt, improperly built,
Mar 24 22:26:53 db2135 mysqld[3981]: or misconfigured. This error can also be caused by malfunctioning hardware.
Mar 24 22:26:53 db2135 mysqld[3981]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Mar 24 22:26:53 db2135 mysqld[3981]: We will try our best to scrape up some info that will hopefully help
Mar 24 22:26:53 db2135 mysqld[3981]: diagnose the problem, but since we have already crashed,
Mar 24 22:26:53 db2135 mysqld[3981]: something is definitely wrong and this may fail.
Mar 24 22:26:53 db2135 mysqld[3981]: Server version: 10.4.13-MariaDB-log

Suspiciously, this seems to have happened seconds after https://gerrit.wikimedia.org/r/c/operations/puppet/+/674724 @Legoktm

Related Objects

Status	Subtype	Assigned	Task
Resolved		jcrespo	T278408 db2135 crashed
Resolved		Marostegui	T279281 Upgrade 10.4.13 hosts to a higher version
Resolved		Marostegui	T279625 Upgrade mysql on db1132 (phabricator db master)
Resolved		Marostegui	T276448 Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC
Resolved	Request	• Cmjohnson	T280121 decommission db1080.eqiad.mnet
Resolved		Marostegui	T279657 Upgrade mysql on db1128 (m5 db master)
Resolved		Marostegui	T280251 Upgrade mysql on db1107 (m2 db master)
Resolved		Marostegui	T281212 Restart x1 database master (db1103)
Resolved		Trizek-WMF	T281375 Read only time for extension 1 (x1) primary database on 2021-05-05

Event Timeline

jcrespo created this task. Mar 25 2021, 7:04 AM

Restricted Application added a subscriber: Aklapper. Mar 25 2021, 7:04 AM

jcrespo added a subscriber: LSobanski. Mar 25 2021, 7:04 AM

jcrespo added a subscriber: • Kormat. Mar 25 2021, 7:07 AM

RhinosF1 subscribed. Mar 25 2021, 7:08 AM

taavi subscribed. Mar 25 2021, 7:15 AM

Oh crap, it probably is my fault. I had to delete and recreate some tables that had the wrong charset (T277286#6944044). I wasn't aware that doing so would, or even could, crash mysql. I should also have noticed the m5 alerts in -operations and connected the dots. Please let me know if there's anything I can do to help fix the situation.

It crashed after "CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)" at 2021-03-24 22:26:53.

Ladsgroup subscribed. Mar 25 2021, 7:22 AM

Mentioned in SAL (#wikimedia-operations) [2021-03-25T07:35:40Z] <jynus> restart db2135 T278408 T273281

Maintenance_bot added a project: SRE. Mar 25 2021, 7:45 AM

I restarted the host to check for hardware errors.

After upgrade and restart, I ran into:

Error 'Duplicate key name 'ix_mailinglist_list_id'' on query. Default database: 'testmailman3'. Query: 'CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)'

The index existed on all s5 servers on codfw, so I dropped it using replication and then restarted replication.

LSobanski triaged this task as Medium priority. Mar 25 2021, 1:35 PM

LSobanski moved this task from Triage to In progress on the DBA board.

This looks like https://jira.mariadb.org/browse/MDEV-23019, which was fixed in 10.4.14.

The server was running 10.4.13 when the crash occurred. The server is now running 10.4.18.
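
For anyone auditing other hosts against the fix version mentioned above, here is a minimal sketch of the comparison. It assumes version strings shaped like the `10.4.13-MariaDB-log` line in the crash log. The function names are hypothetical, and the check only tells you whether a 10.4 release predates 10.4.14; it does not prove a host can hit the bug.

```python
import re

# Per the comment above, MDEV-23019 is reported as fixed in 10.4.14.
FIXED_IN = (10, 4, 14)


def parse_version(version_string):
    """Extract (major, minor, patch) from a string such as '10.4.13-MariaDB-log'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def predates_fix(version_string):
    """Return True for a 10.4 release older than FIXED_IN.

    Assumption: only the 10.4 series is considered; other series are
    reported as False because this thread says nothing about them.
    """
    version = parse_version(version_string)
    return version[:2] == FIXED_IN[:2] and version < FIXED_IN


if __name__ == "__main__":
    # The two versions mentioned in this task: before and after the upgrade.
    for reported in ("10.4.13-MariaDB-log", "10.4.18-MariaDB-log"):
        print(reported, "predates fix:", predates_fix(reported))
```

Comparing integer tuples rather than raw strings avoids the usual trap where "10.4.9" sorts after "10.4.14" lexically.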

Legoktm moved this task from Backlog to Mailman v3 on the Wikimedia-Mailing-lists board. Mar 25 2021, 9:50 PM

Legoktm mentioned this in T278499: Improve workflow for mailman database bootstrapping and updates. Mar 26 2021, 8:26 AM

Wonder if this could have also been the reason for T272614.

We still have 34 hosts running 10.4.13, should these be fast-tracked for an upgrade?

In T278408#6954076, @LSobanski wrote:

Wonder if this could have also been the reason for T272614.

We still have 34 hosts running 10.4.13, should these be fast-tracked for an upgrade?

Probably we should, just to be on the safe side.

What else is pending on this task?

Nothing else that I'm aware of.

Thanks, will create a task to upgrade 10.4.13 hosts and close this.

Thanks everyone for responding to this.

Follow up task: T279281

Marostegui closed subtask T279281: Upgrade 10.4.13 hosts to a higher version as Resolved. May 5 2021, 6:02 AM
