
db1047 has been restarted - needs another restart
Closed, Resolved · Public

Description

Due to pt-table-checksum, db1047 was totally stuck (still trying to replicate events from last Friday). I tried to set a filter to ignore wmf_checksum, but it was impossible to stop the 's1' slave. There were also lots of Nagios processes hanging there that were impossible to kill; they never died.
The only solution was to restart MySQL. That didn't work either, so I had to kill it. It restarted automatically, and again it was impossible to enable the replication filter.
I will try to stop it again on Monday and start it with the replication threads stopped, doing a hard kill if needed. For now I have left it running again (it is up, although delayed).
I don't want to keep messing with it on a Sunday.
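
For reference, the filter I was trying to set looks roughly like this (a sketch only, assuming MariaDB multi-source replication with a channel named 's1', and that replicate_wild_ignore_table is a dynamic global variable on this host; the exact per-channel syntax may differ):

STOP SLAVE 's1';
-- ignore the checksum table that pt-table-checksum writes
-- (table name taken from the filter shown later in this task)
SET GLOBAL replicate_wild_ignore_table = 'enwiki.__wmf_checksums';
START SLAVE 's1';

The problem was that the STOP SLAVE step never completed, so the filter could never be applied.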

Event Timeline

Restricted Application added a project: Analytics. · View Herald Transcript · May 28 2017, 6:21 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Marostegui moved this task from Triage to In progress on the DBA board. · May 28 2017, 6:21 AM

This is the reason why it is not able to cope with pt-table-checksum on s1:

On the revision table:

PRIMARY KEY (`rev_page`,`rev_id`),

On Monday I will enable the filter and alter the revision table to get the normal PK.
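
For reference, the intended change is roughly the following (a sketch only, assuming the standard MediaWiki schema where revision has PRIMARY KEY (rev_id); the secondary index name below is illustrative, not the exact production definition):

ALTER TABLE enwiki.revision
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (rev_id),
  -- keep (rev_page, rev_id) available as a secondary index for page-ordered scans
  ADD KEY rev_page_id (rev_page, rev_id);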

Mentioned in SAL (#wikimedia-operations) [2017-05-29T05:54:54Z] <marostegui> Restart MySQL on db1047 - T166452

I have restarted it and was able to set the replication filter on the s1 channel:

Replicate_Wild_Ignore_Table: enwiki.__wmf_checksums

I am going to let it catch up a bit before starting the ALTER TABLE on the revision table (as that will delay it further).
Right now it is at (s1 channel):

Seconds_Behind_Master: 247171
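
For reference, the per-channel state quoted above comes from something like the following (assuming MariaDB multi-source syntax and the channel name 's1'):

SHOW SLAVE 's1' STATUS\G
-- look at Replicate_Wild_Ignore_Table and Seconds_Behind_Master in the output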
Nuria added a subscriber: Nuria. · May 29 2017, 3:18 PM

This is a slave machine that is not used.

Nuria moved this task from Incoming to Radar on the Analytics board. · May 29 2017, 3:18 PM

Hey Nuria!

But we still need to maintain it, right? As in, it is not going to be decommissioned soon, but kept as a backup server just in case?

Thanks!

Mentioned in SAL (#wikimedia-operations) [2017-05-30T07:09:06Z] <marostegui> Deploy alter table on enwiki.revision on db1047 - T166452

Mentioned in SAL (#wikimedia-operations) [2017-06-02T07:47:56Z] <marostegui> Resume alter table on db1047 enwiki.revision - T166452

I have started the alter table again, as the server was shut down yesterday for maintenance.

Mentioned in SAL (#wikimedia-operations) [2017-06-05T12:54:26Z] <marostegui> Stop MySQL db1047 - T166452

elukey added a subscriber: elukey. · Jun 5 2017, 12:57 PM

The server went nuts again and got stuck.
I am going to let it catch up (it is 3 days behind) and will attempt the ALTER once more. If it doesn't work, I will just ignore db1047 for pt-table-checksum, as we need to keep checksumming that shard and we should not be blocked on db1047 anymore.
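
For what it's worth, if the checksum run discovers replicas through pt-table-checksum's DSN-table recursion method (an assumption; this task does not say which recursion method is used), skipping db1047 would just be a matter of removing its row from that table, e.g.:

-- illustrative table name/layout from the pt-table-checksum documentation (percona.dsns with a dsn column)
DELETE FROM percona.dsns WHERE dsn LIKE '%h=db1047%';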

jcrespo added a subscriber: jcrespo. · Jun 5 2017, 1:05 PM
Marostegui closed this task as Resolved. · Jun 6 2017, 10:27 AM

The scope of this ticket is done. Pending is the ALTER TABLE to unify the revision table so we can run pt-table-checksum for enwiki on this host, but that can be handled in the main ticket: T162807.

Mentioned in SAL (#wikimedia-operations) [2017-06-20T07:27:22Z] <marostegui> kill alter table on enwiki.revision db1047 after running for 13 days - T166452