Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default)
Open, Needs Triage, Public

Description

We were wondering whether the failover script could include a check that verifies codfw -> eqiad replication is working and, if it is not, stops.

Since eqiad is always the active DC (in the active-passive model we have now) and codfw is the passive one, codfw -> eqiad replication is normally disconnected.
This is usually fine, but it must not be the case when we fail over to codfw: replication needs to be enabled again so that eqiad receives the new keys (and purges the old ones). Otherwise, once we switch back, we run into incidents such as:

T206841
T206740
https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover
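
As a rough illustration of the check requested above, here is a minimal Python sketch that aborts unless both replication threads are running. It is not taken from the existing cookbooks. It assumes the caller has already fetched one row of `SHOW SLAVE STATUS` from the eqiad parser cache master as a dict. The function name, the exception class, and the host label are hypothetical.

```python
from typing import Mapping


class ReplicationNotRunning(Exception):
    """Raised when codfw -> eqiad replication is not fully running."""


def check_replication_running(status: Mapping[str, str], host: str) -> None:
    """Abort unless both replication threads report 'Yes'.

    `status` is assumed to be one row of SHOW SLAVE STATUS as a dict.
    An empty mapping means the host has no replication configured.
    """
    if not status:
        raise ReplicationNotRunning(f"{host}: replication is not configured")

    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    if io_running != "Yes" or sql_running != "Yes":
        raise ReplicationNotRunning(
            f"{host}: Slave_IO_Running={io_running}, "
            f"Slave_SQL_Running={sql_running}"
        )


if __name__ == "__main__":
    # Hypothetical host label and status row, for illustration only.
    row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes"}
    check_replication_running(row, "pc-eqiad-example")
    print("replication threads are running")
```

Raising an exception rather than returning a boolean means the failover would stop by default if a caller forgets to inspect the result.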

Event Timeline

There would be two parts: a "preflight check", mostly checking T207273#4677372 in advance, and a "replication is working and up to date based on heartbeat" check, similar to the one run at the same time as the other sections go read-only (but without touching read-only on the pc (parser cache) hosts).
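
The heartbeat part could look roughly like the sketch below. It assumes the heartbeat row written on the codfw master carries an ISO 8601 timestamp string, as pt-heartbeat-style tools usually write, and that the row has already been read from the eqiad host. It also assumes both the timestamp and `now` are naive UTC values. The function names and the one-second threshold are hypothetical, not what the cookbooks use.

```python
from datetime import datetime, timedelta


def heartbeat_lag(heartbeat_ts: str, now: datetime) -> timedelta:
    """Return how far behind `now` the last replicated heartbeat is.

    Both values are assumed to be naive UTC timestamps.
    """
    return now - datetime.fromisoformat(heartbeat_ts)


def is_up_to_date(
    heartbeat_ts: str,
    now: datetime,
    max_lag: timedelta = timedelta(seconds=1),  # hypothetical threshold
) -> bool:
    """Return True if the replicated heartbeat is recent enough."""
    return heartbeat_lag(heartbeat_ts, now) <= max_lag


if __name__ == "__main__":
    now = datetime(2018, 10, 16, 12, 0, 5)
    print(is_up_to_date("2018-10-16T12:00:04.500000", now))
    print(is_up_to_date("2018-10-16T12:00:00.000000", now))
```

A stale heartbeat catches the case where the replication threads report as running but no events are actually arriving from codfw.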

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you outline precisely what we need to check in which phase, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

I commented on the other ticket, but I will have a look at the code and give a more concrete proposal. I may even expand the preflight checks into a general "is the topology right?" check for all hosts.

Off the top of my head, I think it should go alongside the check that all the masters are up to date.
Can you give me a link to the switch wiki page?

Thank you!

Mhhh, but that is done during the read-only period, while it seems to me that this one should be done beforehand, unless I'm missing something.
See https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/ for the current cookbooks.

Ah sorry, I thought that was done BEFORE the read only. You are correct, it should be done before.

Thank you!

So I guess it should go in Phase 0 before #4, somewhere between #3 and #4?
Or maybe even before #2: if replication is not connected, it may be better to abort before wasting time on the following steps.