28 February 2023 Service Switchover checklist
Closed, Resolved (Public)

Description

One hour before (13:00 UTC)

  • Add a scap lock on deploy1002.eqiad.wmnet: echo "Deployment lock for service switchover - T330651" > /var/lock/scap-global-lock
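
The lock step above boils down to creating a file whose content explains why deployments are blocked. A minimal Python sketch of that idea follows. It assumes, as the checklist implies, that the presence of the file is what blocks scap. The helper names `add_lock` and `remove_lock` are hypothetical, and unlike the `echo ... >` redirect, the sketch refuses to overwrite an existing lock. The demo writes to a temporary directory rather than /var/lock.

```python
import tempfile
from pathlib import Path


def add_lock(lock_path: Path, reason: str) -> None:
    """Create the lock file containing the reason for the lock."""
    # Mode "x" raises FileExistsError rather than overwriting an existing
    # lock (a choice made for this sketch; `echo ... >` would overwrite).
    with open(lock_path, "x", encoding="utf-8") as fh:
        fh.write(reason + "\n")


def remove_lock(lock_path: Path) -> str:
    """Delete the lock file and return the reason it held."""
    reason = lock_path.read_text(encoding="utf-8").strip()
    lock_path.unlink()
    return reason


if __name__ == "__main__":
    # Demo against a temporary directory, not the real /var/lock path.
    with tempfile.TemporaryDirectory() as tmp:
        lock = Path(tmp) / "scap-global-lock"
        add_lock(lock, "Deployment lock for service switchover - T330651")
        print(lock.read_text(encoding="utf-8"), end="")
        print("removed:", remove_lock(lock))
```

Returning the stored reason on removal gives the operator one last chance to confirm that the lock being cleared is the one set for this switchover.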

All services

  • Run sudo cookbook sre.discovery.datacenter depool eqiad --all --reason "Datacenter Switchover" --task-id T330651

Deployment server

  • Log to SAL: !log Switch deployment server - T330651
  • Run sudo cumin 'R:class = role::deployment_server' 'disable-puppet "Switchover of the deployment server"'
  • Merge https://gerrit.wikimedia.org/r/c/operations/dns/+/892372
  • Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/892373
  • Run puppet on deploy2002.codfw.wmnet: sudo cumin deploy2002.codfw.wmnet 'run-puppet-agent --enable "Switchover of the deployment server"'
  • Run puppet on all other deployment servers: sudo cumin 'R:class = role::deployment_server' 'run-puppet-agent --enable "Switchover of the deployment server"'
  • Run puppet on alert*
  • Cronjob check: sudo cumin deploy2002.codfw.wmnet 'systemctl list-units | grep -A1 sync_deployment_dir'
  • Remove lock: sudo cumin deploy2002.codfw.wmnet 'rm -v /var/lock/scap-global-lock'
  • Test scap deployment: cd /srv/mediawiki-staging; scap sync-file README "check the deployment server after switchover"
  • Test scap3 deployments work (restbase?)
  • Test helmfile deployments
  • Email ops@ about the switch
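
The deployment server steps above are order-sensitive, so it can help to write them down as data and preview them before running anything. The sketch below does that in Python for a subset of the steps. The `Step` type and the `render_plan` helper are hypothetical, the commands are copied from the checklist, and manual steps (such as merging the Gerrit changes) carry no command. Nothing is executed; the plan is only printed.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional

REASON = "Switchover of the deployment server"
QUERY = "R:class = role::deployment_server"


@dataclass
class Step:
    description: str
    argv: Optional[List[str]] = None  # None marks a manual step


# A subset of the checklist, in checklist order.
STEPS = [
    Step("Disable puppet on all deployment servers",
         ["sudo", "cumin", QUERY, f'disable-puppet "{REASON}"']),
    Step("Merge the DNS and puppet changes in Gerrit"),
    Step("Run puppet on deploy2002.codfw.wmnet",
         ["sudo", "cumin", "deploy2002.codfw.wmnet",
          f'run-puppet-agent --enable "{REASON}"']),
    Step("Run puppet on all other deployment servers",
         ["sudo", "cumin", QUERY, f'run-puppet-agent --enable "{REASON}"']),
    Step("Remove the scap lock",
         ["sudo", "cumin", "deploy2002.codfw.wmnet",
          "rm -v /var/lock/scap-global-lock"]),
]


def render_plan(steps: List[Step]) -> List[str]:
    """Return one numbered line per step; commands are shell-quoted."""
    lines = []
    for number, step in enumerate(steps, start=1):
        action = shlex.join(step.argv) if step.argv else "(manual)"
        lines.append(f"{number}. {step.description}: {action}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_plan(STEPS)))
```

Keeping each command as an argument list and quoting it only for display avoids mistakes with the nested quotes that cumin commands need.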

restbase-async

A week later (08 March 2023), restore restbase to its normal state

  • Run sudo cookbook sre.discovery.service-route --reason T330651 pool --wipe-cache eqiad restbase-async
  • Run sudo cookbook sre.discovery.service-route --reason T330651 depool --wipe-cache codfw restbase-async
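
The two commands above run in a fixed order: pool eqiad first, then depool codfw, which is also the order recorded in the SAL entries for 8 March. One plausible reason, stated here as an assumption rather than something the task says, is that the service then stays pooled in at least one datacenter throughout. A small Python sketch that builds the pair in that order follows. The function name `switchback_commands` is hypothetical; the cookbook arguments are copied from the checklist.

```python
import shlex
from typing import List


def switchback_commands(service: str, restore_dc: str, drain_dc: str,
                        task: str) -> List[List[str]]:
    """Build the pool command first and the depool command second."""
    base = ["sudo", "cookbook", "sre.discovery.service-route",
            "--reason", task]
    return [
        base + ["pool", "--wipe-cache", restore_dc, service],
        base + ["depool", "--wipe-cache", drain_dc, service],
    ]


if __name__ == "__main__":
    # Only prints the commands; nothing is executed.
    for argv in switchback_commands("restbase-async", "eqiad", "codfw",
                                    "T330651"):
        print(shlex.join(argv))
```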

Event Timeline

Change 892372 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] wmnet: Switch deployment CNAMEs to codfw

https://gerrit.wikimedia.org/r/892372

Change 892373 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Switch deployment server to deploy2002.codfw.wmnet

https://gerrit.wikimedia.org/r/892373

Clement_Goubert renamed this task from March 2023 Service Switchover checklist to 28 February 2023 Service Switchover checklist. Feb 27 2023, 12:11 PM

We should probably test that both scap works and a scap3 deployment also works (e.g. docker-pkg) when we've migrated the deployment server.

It's also probably wise to email ops@ when we've completed the switch.

[nit] the enable-puppet + run-puppet-agent can be simplified with run-puppet-agent --enable "reason".

For scap-mediawiki, on deploy2002:

$ cd /srv/mediawiki-staging
$ scap sync-file README "check the deployment server after switchover"

I *suspect* we might be missing the releases git repository for mw on k8s, for which we set up no copy to the secondary server at (/etc/helmfile-defaults/mediawiki/release) - this shouldn't be a huge problem right now, but we need to open a task to keep the two repos in sync.

For scap3, I suggest you pair up with @hnowlan so that we test a restbase deployment, given we need to perform one the next day.

I'm checking right now, but I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/892373 + a puppet run would be all that's needed.

For scap3, I suggest you pair up with @hnowlan so that we test a restbase deployment, given we need to perform one the next day.

ack

Mentioned in SAL (#wikimedia-operations) [2023-02-28T13:04:38Z] <claime> Locking scap deployments for service switchover - T330651

Change 892960 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] service::catalog: Remove discovery stanza for apt

https://gerrit.wikimedia.org/r/892960

Change 892960 merged by Clément Goubert:

[operations/puppet@production] service::catalog: Remove discovery stanza for apt

https://gerrit.wikimedia.org/r/892960

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:21:39Z] <claime> switching services over to codfw - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 started.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:21:53Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 failed.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:42:03Z] <cgoubert@cumin1001> END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) depool all services in eqiad: Datacenter Switchover - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 started.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:42:08Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 completed.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:44:07Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover - T330651
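
The cookbook entries above follow a regular shape, which makes it easy to pick out failed runs such as the first depool attempt (exit code 93) that preceded the successful retry. A minimal Python sketch follows, assuming the `END (STATUS) - Cookbook NAME (exit_code=N) description` layout seen in this timeline holds in general. The regular expression and the `parse_end` helper are illustrative and not part of any Wikimedia tooling.

```python
import re
from typing import NamedTuple, Optional

# Matches the END entries as they appear in this task's timeline.
END_RE = re.compile(
    r"END \((?P<status>[A-Z]+)\) - Cookbook (?P<cookbook>\S+) "
    r"\(exit_code=(?P<exit_code>\d+)\) (?P<description>.+)$"
)


class CookbookEnd(NamedTuple):
    status: str
    cookbook: str
    exit_code: int
    description: str


def parse_end(line: str) -> Optional[CookbookEnd]:
    """Return the parsed END entry, or None for any other line."""
    match = END_RE.search(line)
    if match is None:
        return None
    return CookbookEnd(match["status"], match["cookbook"],
                       int(match["exit_code"]), match["description"])


if __name__ == "__main__":
    sample = ("<cgoubert@cumin1001> END (ERROR) - Cookbook "
              "sre.discovery.datacenter (exit_code=93) depool all services "
              "in eqiad: Datacenter Switchover - T330651")
    print(parse_end(sample))
```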

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:51:08Z] <claime> Switch deployment server to deploy2002.codfw.wmnet - T330651

Change 892372 merged by Clément Goubert:

[operations/dns@master] wmnet: Switch deployment CNAMEs to codfw

https://gerrit.wikimedia.org/r/892372

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:10:36Z] <claime> Running authdns-update for deployment server switch - T330651

Change 892373 merged by Clément Goubert:

[operations/puppet@production] Switch deployment server to deploy2002.codfw.wmnet

https://gerrit.wikimedia.org/r/892373

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:16:37Z] <claime> Running puppet on all deployment servers - T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:18:28Z] <claime> Running puppet on fleet-wide - T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:20:25Z] <claime> Disregard running puppet on fleet-wide - T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:26:13Z] <claime> Testing scap deployment from deploy2002.codfw.wmnet - T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:47:56Z] <cgoubert@deploy2002> Synchronized README: check the deployment server after switchover - T330651 (duration: 20m 56s)

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:09:24Z] <claime> Switching netbox back to eqiad - T330651

Netbox was also switched to codfw as part of the discovery services switch and appears to be quite slow.
This setup has not been tested properly (and the DB was not switched with it), so to be on the safe side the I/F team would like to revert it to eqiad for now, so that it doesn't affect the rest of the SREs and all the tooling that uses Netbox.
We'll investigate the issues further and come up with a solution.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:10:50Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route pool netbox in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:15:51Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) pool netbox in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:15:58Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool netbox in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:20:54Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool netbox in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:38:04Z] <claime> stale discovery files wiped for netbox - T330651

Marking as Resolved for now, will reopen in a week (or whenever restbase-async wants us to switch it back).

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:45:01Z] <claime> Traffic and Service switchovers to codfw finished - T330651 - T330650

T331285: March 2023 Traffic Repool checklist done, switching restbase-async back to its standard state.

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:48:46Z] <claime> Starting restbase-async switchback - T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:49:21Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route pool restbase-async in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:54:25Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool restbase-async in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:54:59Z] <claime> restbase-async pooled in eqiad, depooling in codfw- T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:55:12Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool restbase-async in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T12:00:15Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T12:01:46Z] <claime> restbase-async back in standard state - T330651

Clement_Goubert updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:00:15Z] <claime> Locking scap deployment for service switchover - T330651