Page MenuHomePhabricator

Include an option to stop/start slave
Closed, ResolvedPublic

Description

For some schema changes, when the altered table is very busy, it is common for the schema change to fail with Duplicate Entry.
This has happened on T300402 and the work around was to change the ALTER table to include stop replication before the alter and start it after it has run:
stop slave; ALTER TABLE xxxx; start slave

This unfortunately means a stop slave and start slave for all the wikis, which on s3 could be issue 900 times.

Ideally we should have a flag on the code to do this nicely, something like:
stop_replication = True

Which would run a stop slave before the alter on the first wiki and start slave after it the last one is done

Event Timeline

Marostegui triaged this task as Medium priority.Feb 2 2022, 7:05 AM
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description. (Show Details)

@Ladsgroup This might be very useful to have before the *links schema change.

Yeah. I'm slightly hesitant to add another option to the code though. I can go on for hours about why it will harm in the long-term. Do you think if there is a way to automatically detect the need? Or try first and if it fails, try with replication disabled again and if that fails, panic?

But that second option would still require code changes (and capturing the error as well), so why not making it optional? For some tables it is probably a must (ie: revision, *link tables etc)

I know, it's just adding more configuration can lead to complexity and I feel slowly we might end up with something like exim4 config files and swear at ourselves when we want to create a schema change patch. This is a good explanation on why configs are evil: https://www.youtube.com/watch?v=NcT8-IoImXE

I have three ideas:

  • Just make it stop replication on everything, we downtime, we wait for replication to catch up, etc. I don't see a reason why not (except on master of active dc),
  • it makes a query to stats (performance_schema?) to see ratio of writes to the table and "detect it" preemptively\
  • simply inside the running the alter table, parse the result and if it's a "duplicate entry" error (is the the only type of error that can happen because of replication while running the alter table?) and retry if with replication stopped if that's the case.
    • This can also report back and every ensuing schema change on next replicas would stop replication after first case of seeing this issue to avoid retrying on each host.

The first and the last are pretty easy to implement (and I can do it in course of an hour at most), the middle one is hard (unless there is some way of detecting that I'm missing).

Stopping slave for every operation would work for me, but I would like it to stop replication at the stop before the first wiki, and start it after the last wiki is done. Rather than stopping it for every single ALTER (like the current approach of including stop slave; ALTER; start slave does.
The second option is too complex, and the third one might work, but the ALTER might fail after 3h or after 10h or after 2 minutes, so it could be the case that the alter might take a lot simply cause we need it to fail first.

The first idea sounds good and I can do it ^^ later today you will have the patch.

Change 759645 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/software@master] auto_schema: Stop replication before running schema changes

https://gerrit.wikimedia.org/r/759645

Change 759645 merged by jenkins-bot:

[operations/software@master] auto_schema: Stop replication before running schema changes

https://gerrit.wikimedia.org/r/759645

Ladsgroup claimed this task.
Ladsgroup moved this task from Ready to Done on the DBA board.

\o/