Maniphest T207385

Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default)
Open, Needs Triage, Public

Assigned To

None

Authored By

	Marostegui
	Oct 18 2018, 12:45 PM

Description

We were wondering if the failover script can have a check that makes sure codfw -> eqiad replication is working and if not, stop.

Considering that eqiad is always the active DC (in an active-passive model as we have now) and codfw is the passive, replication codfw -> eqiad is normally disconnected.
This is usually fine, but it must not be the case when we fail over to codfw: replication needs to be enabled again so that eqiad receives the new keys (and purges the old ones), and we don't run into incidents like these once we switch back:

T206841
T206740
https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		RLazarus	T243314 FY2020-2021 Q1 DC switchover and switchback
		Open		None	T207385 Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default)

Event Timeline

Marostegui created this task. Oct 18 2018, 12:45 PM

Restricted Application added a subscriber: Aklapper. Oct 18 2018, 12:45 PM

Marostegui mentioned this in T206992: Create replication icinga check for the Parsercache hosts. Oct 18 2018, 12:45 PM

ArielGlenn subscribed. Oct 18 2018, 12:47 PM

There would be two parts: a "preflight check", mostly checking T207273#4677372 in advance, and a "replication is working and up to date based on heartbeat" check, similar to the existing one and run at the same time as for the other sections that go read-only (but without touching read-only on the pc hosts).
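A minimal sketch of what those two parts could look like in Python, to make the proposal concrete. It is not the cookbook implementation: the function names, the `CheckResult` type, the host name and the lag threshold are all hypothetical. It assumes the caller has already fetched one row of `SHOW SLAVE STATUS` output (as a mapping) and the latest heartbeat timestamp from the eqiad host; how those are fetched is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one check (hypothetical type, for illustration only)."""
    ok: bool
    reason: str


def preflight_topology(
    replica_status: Optional[Mapping[str, str]], expected_master: str
) -> CheckResult:
    """Part 1: is the eqiad host replicating from the expected codfw master?

    `replica_status` is one row of SHOW SLAVE STATUS as a mapping,
    or None/empty if replication is not configured at all.
    """
    if not replica_status:
        return CheckResult(False, "replication is not configured")
    master = replica_status.get("Master_Host")
    if master != expected_master:
        return CheckResult(False, f"replicating from {master!r}, expected {expected_master!r}")
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        state = replica_status.get(thread)
        if state != "Yes":
            return CheckResult(False, f"{thread} is {state!r}")
    return CheckResult(True, "topology looks right")


def heartbeat_fresh(
    last_heartbeat: datetime, now: datetime, max_lag: timedelta
) -> CheckResult:
    """Part 2: is replication up to date, judged by the age of the last heartbeat?"""
    lag = now - last_heartbeat
    if lag > max_lag:
        return CheckResult(False, f"heartbeat is {lag.total_seconds():.1f}s old")
    return CheckResult(True, f"heartbeat lag is {lag.total_seconds():.1f}s")


if __name__ == "__main__":
    # Hypothetical values; a real check would query the eqiad pc host.
    status = {
        "Master_Host": "pc-codfw.example",
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
    }
    now = datetime.now(timezone.utc)
    print(preflight_topology(status, "pc-codfw.example"))
    print(heartbeat_fresh(now - timedelta(seconds=2), now, timedelta(seconds=5)))
```

Keeping the two parts as separate functions means the topology part can run well in advance, while the heartbeat part can run later without touching read-only.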

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you outline precisely what we need to check in which phase, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

I commented on the other ticket, but I will have a look at the code and give a more concrete proposal. I may even expand the preflight checks to "Is the topology right" in general to all hosts.

Volans mentioned this in T207273: Parser cache hit ratio alerting. Oct 18 2018, 1:09 PM

In T207385#4677395, @Volans wrote:

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you outline precisely what we need to check in which phase, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

Off the top of my head, I think it should go where we check that all the masters are up to date.
Can you give me a link to the switch wiki page?

Thank you!

In T207385#4677416, @Marostegui wrote:

In T207385#4677395, @Volans wrote:

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you outline precisely what we need to check in which phase, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

Off the top of my head, I think it should go where we check that all the masters are up to date.
Can you give me a link to the switch wiki page?

Thank you!

Mhhh, but that is done during the read-only period, while it seems to me that this one should be done beforehand, unless I'm missing something.
See https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/ for the current cookbooks.

In T207385#4677455, @Volans wrote:

In T207385#4677416, @Marostegui wrote:

In T207385#4677395, @Volans wrote:

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you outline precisely what we need to check in which phase, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

Off the top of my head, I think it should go where we check that all the masters are up to date.
Can you give me a link to the switch wiki page?

Thank you!

Mhhh, but that is done during the read-only period, while it seems to me that this one should be done beforehand, unless I'm missing something.

Ah sorry, I thought that was done BEFORE the read only. You are correct, it should be done before.

See https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/ for the current cookbooks.

Thank you!

So I guess it should go in Phase 0 before #4, somewhere between #3 and #4?
Or maybe even before #2: if it is not connected, it may be better to abort before wasting time on the next steps?
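To illustrate the "abort early" ordering, here is a small sketch of a phase runner that stops at the first failing step, so that putting the replication check first avoids running anything after it. This is not how the existing cookbooks are structured: `run_phase`, `PreflightError` and the step names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


class PreflightError(RuntimeError):
    """Raised when a step fails and the phase must be aborted (hypothetical)."""


# A step is a (name, check) pair; the check returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_phase(steps: Sequence[Step]) -> List[str]:
    """Run steps in order and abort at the first failure.

    Returns the names of the completed steps when all of them pass.
    """
    completed: List[str] = []
    for name, check in steps:
        if not check():
            raise PreflightError(
                f"step {name!r} failed; aborting after {len(completed)} completed step(s)"
            )
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Hypothetical Phase 0 with the replication check placed first.
    phase0: List[Step] = [
        ("pc-replication-connected", lambda: False),  # pretend replication is down
        ("later-step", lambda: True),
    ]
    try:
        run_phase(phase0)
    except PreflightError as exc:
        print(exc)
```

With the replication check first, a disconnected codfw -> eqiad replication stops the run before any of the later Phase 0 steps are attempted.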

jcrespo added a parent task: T243314: FY2020-2021 Q1 DC switchover and switchback. Apr 29 2020, 6:54 AM

Aklapper added a project: Infrastructure-Foundations. Jun 21 2021, 8:59 PM

Legoktm moved this task from Backlog to Automation improvements on the Datacenter-Switchover board. Jun 26 2021, 1:34 AM

joanna_borun removed projects: Infrastructure-Foundations, SRE-tools. Dec 4 2023, 4:25 PM
