
labsdb1009 crashed while doing an alter table on templatelinks
Closed, ResolvedPublic

Description

Creating this for the record.
labsdb1009 crashed while attempting to do the massive alters for: T166204

InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2017-07-13 22:21:07 7eefc3bfd700  InnoDB: Assertion failure in thread 139568246413056 in file srv0srv.cc line 2418
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
170713 22:21:07 [ERROR] mysqld got signal 6 ;
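The watchdog that fired here is InnoDB's long-semaphore-wait check: if a thread waits on an internal latch for more than 600 seconds, the server deliberately aborts itself on the assumption that it is hung. As a sketch (not what was actually run on labsdb1009), the wait state can be inspected, and on MariaDB the threshold tuned, with something like:

```sql
-- The SEMAPHORES section of this output shows which latches threads are
-- stuck on; long waits during a big table-rebuild ALTER are the usual
-- trigger for this assertion.
SHOW ENGINE INNODB STATUS\G

-- MariaDB exposes the abort threshold as a variable (default 600 seconds);
-- raising it can let a slow-but-progressing ALTER survive the watchdog.
-- Availability depends on the MariaDB version, so treat this as an
-- assumption to verify against the server in question.
SET GLOBAL innodb_fatal_semaphore_wait_threshold = 1200;
```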

It crashed while altering templatelinks (61G).
labsdb1011 finished the alters with no issues a couple of days ago, so I have given labsdb1009 another go to see how it goes this time.
Replication did not break upon restart.
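For the record, the post-restart replication check amounts to something like the following sketch (the classic SHOW SLAVE STATUS fields):

```sql
-- Verify replication recovered after the crash/restart: both threads
-- should report "Yes", and Seconds_Behind_Master should be a number
-- (catching up) rather than NULL (broken).
SHOW SLAVE STATUS\G
-- Key fields to eyeball:
--   Slave_IO_Running:  Yes
--   Slave_SQL_Running: Yes
--   Last_SQL_Error:    (empty)
```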

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 14 2017, 5:42 AM
Marostegui triaged this task as Medium priority. · Jul 14 2017, 5:42 AM
Marostegui moved this task from Triage to In progress on the DBA board.

I will leave this open until the alters are done.

Should we copy 1009 from 1010?

I would say let's wait and see whether the alters finish fine this time.
The server recovered fine after the crash, and replication had no issues.

templatelinks went through fine; pagelinks (121G) is now being altered.
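An ALTER on a 121G table can be watched while it runs. A minimal sketch, assuming an information_schema PROCESSLIST is available on this server:

```sql
-- See how long the ALTER has been running and what state it is in
-- (e.g. "copy to tmp table" for a table-rebuild ALTER).
SELECT id, time, state, info
FROM information_schema.PROCESSLIST
WHERE info LIKE 'ALTER TABLE%';
```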

Mentioned in SAL (#wikimedia-operations) [2017-07-17T05:09:29Z] <marostegui> Restart MySQL on labsdb1009 for maintenance - T170657

Marostegui closed this task as Resolved. · Jul 17 2017, 5:15 AM
Marostegui claimed this task.

The alters finished correctly. I have also stopped and started the server without any issues, so I am going to consider this resolved as a one-off.