labsdb1009:s2, replication broken
Closed, Resolved, Public

Description

Replication on s2 broke with:

Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows_v1 event on table itwiki.recentchanges; Can't find record in 'recentchanges', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1125-bin.003389, end_log_pos 1044718762

This could be one of the consequences of the last crash it had (T276980).
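For reference, the offending event can usually be inspected on the host that wrote the binlog (db1125 here), using the file name and end position from the error above. This is only a sketch of the kind of check involved, not necessarily how it was diagnosed; with decode-rows and -vv the tail shows the pseudo-SQL of the DELETEs that failed to apply:

mysqlbinlog --base64-output=decode-rows -vv --stop-position=1044718762 db1125-bin.003389 | tail -n 100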

For now, s2 is delayed and the following wikis do not have up-to-date data:

bgwiki
bgwiktionary
cswiki
enwikiquote
enwiktionary
eowiki
fiwiki
idwiki
itwiki
nlwiki
nowiki
plwiki
ptwiki
svwiki
thwiki
trwiki
zhwiki

I will try to get replication working again on Tuesday 13th; we'll see how many drifts it has.
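As a rough way to gauge the drifts, the same aggregate can be run on the sanitarium master and on labsdb1009 and the results compared. The itwiki query below is just an illustrative spot check (counts will legitimately differ while the replica is delayed, so this is only indicative):

# run on both sides and compare the output
mysql.py -hdb1125 -e "SELECT COUNT(*), MAX(rc_id) FROM itwiki.recentchanges"
mysql.py -hlabsdb1009 -e "SELECT COUNT(*), MAX(rc_id) FROM itwiki.recentchanges"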

Reminder: this host is scheduled to be moved under the new sanitarium hosts after 15th April, at which point replication might stop at any time.

Event Timeline

Marostegui triaged this task as Medium priority. Apr 11 2021, 8:53 AM

@nskaggs @Bstorm If restoring replication turns out to be non-trivial, is it OK to wait until after the failover (in case it turns out it doesn't make sense to try and fix it after all)? In practical terms, the wikis above would be out of sync for over a week.

If it is not easy to fix replication, we can always try depooling this host entirely and serving the traffic from labsdb1010 and labsdb1011. Pros: we don't have to worry about labsdb1009. Cons: in the past we've seen labsdb1010 or labsdb1011 struggle (or even crash) when taking on more load.

Another option, if we cannot fix the data drift, would be to put the slave into idempotent mode and accept that its data, at least for itwiki on that host, contains drifts from production.
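For reference, a sketch of what switching to idempotent mode would look like on the replica, using MariaDB multi-source syntax and the s2 connection name from this task (note that slave_exec_mode is a global setting, so it would affect all replication connections on the instance):

mysql.py -hlabsdb1009 -e "STOP SLAVE 's2'"
mysql.py -hlabsdb1009 -e "SET GLOBAL slave_exec_mode = 'IDEMPOTENT'"
mysql.py -hlabsdb1009 -e "START SLAVE 's2'"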

Marostegui moved this task from Refine to In progress on the DBA board.

It's been a bit of a pain to fix these drifts.
The failing event was part of a huge transaction involving recentchanges and ores_classification:
5 rows were missing from recentchanges and 27 from ores_classification.

Once that was fixed, another 2 rows were missing from another big transaction involving ores_classification.
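For the record, the general shape of this kind of manual fix is to copy the missing rows from the sanitarium master and replay them on the replica before resuming replication. The commands below are a hypothetical sketch with made-up row IDs, and credentials/wrapper details are omitted; they are not the exact commands used here:

# dump only the missing rows from the sanitarium master (the oresc_rev values are illustrative)
mysqldump -hdb1125 --no-create-info --skip-triggers \
    --where="oresc_rev IN (123456,123457)" itwiki ores_classification > missing_rows.sql
# replay them on the broken replica, then resume the s2 connection
mysql -hlabsdb1009 itwiki < missing_rows.sql
mysql -hlabsdb1009 -e "START SLAVE 's2'"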

Replication has now been flowing for around 10 minutes without breaking; I am monitoring it to see if it breaks again.

labsdb1009:s2 caught up:

# mysql.py -hlabsdb1009 -e "show slave 's2' status\G" | grep Seconds
        Seconds_Behind_Master: 3

Closing this for now

@Marostegui Thanks for jumping on this quickly. Hopefully this means this collection of hosts will be in a healthy state for the migration. To answer your earlier question, I would say yes. If there is another non-trivial breakage, let's wait until after switching sanitarium hosts and then consider depooling or any other option, depending on the outcome of the switchover.