
Sept 2023 Switchover Checklist: Services & Traffic
Closed, Resolved (Public)

Description

Run everything in a tmux named switchover
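
A minimal sketch of opening that session, assuming a tmux recent enough to support `new-session -A` (attach if the session already exists, create it otherwise). The `open_switchover_session` helper and the `DRY_RUN` variable are hypothetical names used only for illustration; by default the snippet just prints the command.

```shell
SESSION="switchover"

# Print the tmux command when DRY_RUN=1 (the default here, so the snippet is
# safe to paste); set DRY_RUN=0 to actually create or attach to the session.
open_switchover_session() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "tmux new-session -A -s ${SESSION}"
    else
        tmux new-session -A -s "${SESSION}"
    fi
}

open_switchover_session
```

Running everything inside one named session means a dropped SSH connection does not interrupt a long-running step, and someone else can attach to follow along.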

Services

  • scap lock --all "Datacenter Switchover: Services & Traffic - T346330" on deploy1002
  • sudo cookbook sre.discovery.datacenter depool eqiad --all --reason "Datacenter Switchover: Services" --task-id T346330 on cumin1001
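
The two steps above run on different hosts, so a small guard can help avoid pasting a command into the wrong terminal. The following is only a sketch: `require_host`, `lock_deployments`, `depool_services` and the `CURRENT_HOST` override are hypothetical helpers, not part of scap or the cookbooks; the commands themselves are copied from the checklist. The block only defines functions and runs nothing until one of them is called.

```shell
TASK="T346330"

# Refuse to continue unless we are on the expected host.
# CURRENT_HOST may be set to override detection (useful for testing);
# otherwise the short hostname is used.
require_host() {
    local expected="$1"
    local current="${CURRENT_HOST:-$(hostname -s)}"
    if [ "$current" != "$expected" ]; then
        echo "refusing: on ${current}, expected ${expected}" >&2
        return 1
    fi
}

# Step 1, on deploy1002: lock all deployments.
lock_deployments() {
    require_host deploy1002 || return 1
    scap lock --all "Datacenter Switchover: Services & Traffic - ${TASK}"
}

# Step 2, on cumin1001: depool all services in eqiad.
depool_services() {
    require_host cumin1001 || return 1
    sudo cookbook sre.discovery.datacenter depool eqiad --all \
        --reason "Datacenter Switchover: Services" --task-id "${TASK}"
}
```

Calling `lock_deployments` on cumin1001, for example, prints a refusal and returns non-zero instead of invoking scap.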

Traffic

deployment server

  • log SAL: !log Switch deployment server - T346330
  • sudo cumin 'R:class = role::deployment_server' 'disable-puppet "Switchover of the deployment server"'
  • merge https://gerrit.wikimedia.org/r/c/operations/dns/+/957734
  • merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/957736
  • Run puppet on deploy2002.codfw.wmnet: sudo cumin deploy2002.codfw.wmnet 'run-puppet-agent --enable "Switchover of the deployment server"'
  • Run puppet on all other deployment servers: sudo cumin 'R:class = role::deployment_server' 'run-puppet-agent --enable "Switchover of the deployment server"'
  • Run puppet on alert*: sudo cumin 'A:icinga' 'run-puppet-agent -q'
  • Cronjob check: sudo cumin deploy2002.codfw.wmnet 'systemctl list-units | grep -A1 sync_deployment_dir' (docs are out of date, TODO: fix)
  • remove scap lock
  • Test scap deployment: cd /srv/mediawiki-staging; scap sync-world "check the deployment server after switchover"
  • Test that scap3 deployments work (restbase?) => no, due to T346354
  • email ops@ and wikitech-l about the switch
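
The puppet-related steps in the list above depend on their order (disable puppet, merge, then re-enable on the new server first). As a rough sketch, they could be previewed with a dry-run wrapper like the one below. The `run`, `pause` and `switch_deployment_server` helpers and the `DRY_RUN` variable are hypothetical; the cumin commands are taken from the checklist. The cronjob check and the scap tests are left out. In dry-run mode the preview drops the outer shell quoting, so it is for reading, not for copy-pasting.

```shell
NEW_DEPLOY="deploy2002.codfw.wmnet"
REASON="Switchover of the deployment server"

# Print the command when DRY_RUN=1 (default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Announce a manual step; outside dry-run, wait for the operator to confirm.
pause() {
    echo "MANUAL: $*"
    if [ "${DRY_RUN:-1}" != "1" ]; then
        printf 'Press Enter once done: '
        read -r _
    fi
}

switch_deployment_server() {
    # 1. Stop puppet on all deployment servers before the change lands.
    run sudo cumin 'R:class = role::deployment_server' "disable-puppet \"${REASON}\""
    # 2. The DNS and puppet changes are merged by hand in Gerrit.
    pause "merge changes 957734 (dns) and 957736 (puppet)"
    # 3. Re-enable and run puppet on the new deployment server first...
    run sudo cumin "${NEW_DEPLOY}" "run-puppet-agent --enable \"${REASON}\""
    # 4. ...then on all other deployment servers...
    run sudo cumin 'R:class = role::deployment_server' "run-puppet-agent --enable \"${REASON}\""
    # 5. ...and finally on the alert hosts.
    run sudo cumin 'A:icinga' 'run-puppet-agent -q'
}

switch_deployment_server
```

With the default `DRY_RUN=1` this prints five lines, one per step, without touching any host.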

Event Timeline

Change 957734 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/dns@master] wmnet: switch deployment CNAMEs to codfw

https://gerrit.wikimedia.org/r/957734

Change 957736 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Switch deployment server to deploy2002.codfw.wmnet

https://gerrit.wikimedia.org/r/957736

kamila triaged this task as High priority. Sat, Sep 16, 12:45 PM

Change 958920 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/dns@master] traffic: Depool eqiad from user traffic for switchover

https://gerrit.wikimedia.org/r/958920

Mentioned in SAL (#wikimedia-operations) [2023-09-19T14:00:55Z] <kamila@deploy1002> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330

kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330 started.

Mentioned in SAL (#wikimedia-operations) [2023-09-19T14:01:19Z] <kamila@cumin1001> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330

Mentioned in SAL (#wikimedia-operations) [2023-09-19T14:20:23Z] <kamila@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 (duration: 19m 27s)

kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330 completed.

Mentioned in SAL (#wikimedia-operations) [2023-09-19T14:28:47Z] <kamila@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover: Services - T346330

Mentioned in SAL (#wikimedia-operations) [2023-09-19T14:30:32Z] <kamila@deploy1002> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330

Mentioned in SAL (#wikimedia-operations) [2023-09-19T14:32:03Z] <kamila_> Switch deployment server - T346330

Change 957734 merged by Kamila Součková:

[operations/dns@master] wmnet: switch deployment CNAMEs to codfw

https://gerrit.wikimedia.org/r/957734

Change 957736 merged by Kamila Součková:

[operations/puppet@production] Switch deployment server to deploy2002.codfw.wmnet

https://gerrit.wikimedia.org/r/957736

Mentioned in SAL (#wikimedia-operations) [2023-09-19T15:05:18Z] <kamila@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 (duration: 34m 46s)

Change 958955 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Lower traffic to 3%

https://gerrit.wikimedia.org/r/958955

Change 958955 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Lower traffic to 3%

https://gerrit.wikimedia.org/r/958955

Mentioned in SAL (#wikimedia-operations) [2023-09-19T15:25:50Z] <claime> reduce mw-on-k8s traffic to 3% waiting on new nodes - T346330

Mentioned in SAL (#wikimedia-operations) [2023-09-19T15:26:28Z] <claime> running puppet on 'A:cp-text and P{P:trafficserver::backend}' - T346330

Change 958920 merged by Kamila Součková:

[operations/dns@master] traffic: Depool eqiad from user traffic for switchover

https://gerrit.wikimedia.org/r/958920

While there are some outstanding issues due to lack of capacity in codfw, overall we're done here :-)