⚓ T330651 28 February 2023 Service Switchover checklist

Subject	Repo	Branch	Lines +/-
Switch deployment server to deploy2002.codfw.wmnet	operations/puppet	production	+4 -2
wmnet: Switch deployment CNAMEs to codfw	operations/dns	master	+2 -2
service::catalog: Remove discovery stanza for apt	operations/puppet	production	+0 -3

Status	Assigned	Task
Resolved	Clement_Goubert	T327920 March 2023 Datacenter Switchover
Resolved	Clement_Goubert	T328903 March 2023 Datacenter Switchover eqiad pooling schedule
Resolved	Clement_Goubert	T330651 28 February 2023 Service Switchover checklist

Clement_Goubert created this task. Feb 27 2023, 12:08 PM

Restricted Application added a subscriber: Aklapper. Feb 27 2023, 12:08 PM

Clement_Goubert triaged this task as High priority. Feb 27 2023, 12:08 PM

Clement_Goubert added a parent task: T327920: March 2023 Datacenter Switchover.

Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.

Clement_Goubert updated the task description.

Change 892372 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] wmnet: Switch deployment CNAMEs to codfw

https://gerrit.wikimedia.org/r/892372

Change 892373 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Switch deployment server to deploy2002.codfw.wmnet

https://gerrit.wikimedia.org/r/892373

Clement_Goubert renamed this task from March 2023 Service Switchover checklist to 28 February 2023 Service Switchover checklist. Feb 27 2023, 12:11 PM

We should probably test that both scap works and a scap3 deployment also works (e.g. docker-pkg) when we've migrated the deployment server.

It's also probably wise to email ops@ when we've completed the switch.

[nit] the enable-puppet + run-puppet-agent steps can be simplified to run-puppet-agent --enable "reason".

Clement_Goubert updated the task description. Feb 27 2023, 2:14 PM

Clement_Goubert updated the task description.

Clement_Goubert updated the task description. Feb 27 2023, 2:18 PM

Clement_Goubert updated the task description. Feb 27 2023, 3:14 PM

In T330651#8648604, @Joe wrote:

We should probably test that both scap works and a scap3 deployment also works (e.g. docker-pkg) when we've migrated the deployment server.

It's also probably wise to email ops@ when we've completed the switch.

For scap-mediawiki, on deploy2002:

$ cd /srv/mediawiki-staging
$ scap sync-file README "check the deployment server after switchover"

I *suspect* we might be missing the releases git repository for mw on k8s, for which we set up no copy to the secondary server (at /etc/helmfile-defaults/mediawiki/release) - this shouldn't be a huge problem right now, but we need to open a task to keep the two repos in sync.

For scap3, I suggest you pair up with @hnowlan so that we test a restbase deployment, given we need to perform one the next day.
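The scap-mediawiki check above could be wrapped in a small guard so that it only runs on the intended host. This is only a sketch, not the procedure that was actually followed: `check_host` and `smoke_test` are hypothetical helper names, while the host name, the staging path, and the `scap sync-file` invocation are copied from the comment above.

```shell
# Hypothetical helper: succeed only if the expected and actual FQDN match.
check_host() {  # $1 = expected FQDN, $2 = actual FQDN
  [ "$1" = "$2" ]
}

# Hypothetical wrapper around the manual check quoted in the task.
smoke_test() {
  expected="deploy2002.codfw.wmnet"
  if ! check_host "$expected" "$(hostname -f)"; then
    echo "refusing to run: this host is not $expected" >&2
    return 1
  fi
  cd /srv/mediawiki-staging || return 1
  scap sync-file README "check the deployment server after switchover"
}
```

Calling `smoke_test` on any other host prints the refusal message and returns non-zero without touching scap.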

In T330651#8651629, @Joe wrote:
In T330651#8648604, @Joe wrote:

We should probably test that both scap works and a scap3 deployment also works (e.g. docker-pkg) when we've migrated the deployment server.

It's also probably wise to email ops@ when we've completed the switch.

For scap-mediawiki, on deploy2002:
$ cd /srv/mediawiki-staging
$ scap sync-file README "check the deployment server after switchover"
I *suspect* we might be missing the releases git repository for mw on k8s, for which we set up no copy to the secondary server (at /etc/helmfile-defaults/mediawiki/release) - this shouldn't be a huge problem right now, but we need to open a task to keep the two repos in sync.

I'm checking right now, but I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/892373 + a puppet run would be all that's needed.

For scap3, I suggest you pair up with @hnowlan so that we test a restbase deployment, given we need to perform one the next day.

ack

Clement_Goubert updated the task description. Feb 28 2023, 10:15 AM

Clement_Goubert updated the task description.

Clement_Goubert updated the task description. Feb 28 2023, 10:54 AM

Clement_Goubert updated the task description. Feb 28 2023, 11:00 AM

Clement_Goubert updated the task description. Feb 28 2023, 1:03 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-28T13:04:38Z] <claime> Locking scap deployments for service switchover - T330651

Clement_Goubert updated the task description. Feb 28 2023, 1:06 PM

Change 892960 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] service::catalog: Remove discovery stanza for apt

https://gerrit.wikimedia.org/r/892960

Change 892960 merged by Clément Goubert:

[operations/puppet@production] service::catalog: Remove discovery stanza for apt

https://gerrit.wikimedia.org/r/892960

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:21:39Z] <claime> switching services over to codfw - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 started.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:21:53Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651

Zabe subscribed. Feb 28 2023, 2:36 PM

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 failed.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:42:03Z] <cgoubert@cumin1001> END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) depool all services in eqiad: Datacenter Switchover - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 started.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:42:08Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T330651 completed.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:44:07Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover - T330651
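The SAL timestamps above are enough to work out how long each cookbook run took. A minimal sketch, assuming GNU `date` (for `-d` parsing of ISO 8601 timestamps); `elapsed` is a hypothetical helper name, and the timestamps are the START/END entries logged above.

```shell
# Print the number of seconds between two ISO 8601 UTC timestamps.
# Assumes GNU date; BSD/macOS date does not accept -d in this form.
elapsed() {  # $1 = start timestamp, $2 = end timestamp
  start_s=$(date -u -d "$1" +%s) || return 1
  end_s=$(date -u -d "$2" +%s) || return 1
  echo $(( end_s - start_s ))
}

# First run, which ended with exit_code=93
elapsed 2023-02-28T14:21:53Z 2023-02-28T14:42:03Z   # 1210 (20m 10s)
# Retry, which ended with exit_code=0
elapsed 2023-02-28T14:42:08Z 2023-02-28T14:44:07Z   # 119 (1m 59s)
```

So the failed attempt ran for about twenty minutes before erroring out, and the retry completed in about two.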

Clement_Goubert updated the task description. Feb 28 2023, 2:45 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:51:08Z] <claime> Switch deployment server to deploy2002.codfw.wmnet - T330651

Change 892372 merged by Clément Goubert:

[operations/dns@master] wmnet: Switch deployment CNAMEs to codfw

https://gerrit.wikimedia.org/r/892372

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:10:36Z] <claime> Running authdns-update for deployment server switch - T330651

Clement_Goubert updated the task description. Feb 28 2023, 3:13 PM

Change 892373 merged by Clément Goubert:

[operations/puppet@production] Switch deployment server to deploy2002.codfw.wmnet

https://gerrit.wikimedia.org/r/892373

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:16:37Z] <claime> Running puppet on all deployment servers - T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:18:28Z] <claime> Running puppet on fleet-wide - T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:20:25Z] <claime> Disregard running puppet on fleet-wide - T330651

Clement_Goubert updated the task description. Feb 28 2023, 3:20 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:26:13Z] <claime> Testing scap deployment from deploy2002.codfw.wmnet - T330651

Clement_Goubert updated the task description. Feb 28 2023, 3:29 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 28 2023, 3:30 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-28T15:47:56Z] <cgoubert@deploy2002> Synchronized README: check the deployment server after switchover - T330651 (duration: 20m 56s)

Clement_Goubert updated the task description. Feb 28 2023, 3:55 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:09:24Z] <claime> Switching netbox back to eqiad - T330651

Netbox was also switched to codfw as part of the discovery services switch and appears to be quite slow.
This setup has not been properly tested (and the DB was not switched with it), so to be on the safe side the I/F team would like to revert it to eqiad for now, so that it doesn't affect the rest of the SREs and all the tooling that uses Netbox.
We'll investigate the issues further and come up with a solution.

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:10:50Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route pool netbox in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:15:51Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) pool netbox in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:15:58Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool netbox in codfw: T330651

Clement_Goubert updated the task description. Feb 28 2023, 4:20 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:20:54Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool netbox in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:38:04Z] <claime> stale discovery files wiped for netbox - T330651

Marking as Resolved for now, will reopen in a week (or whenever restbase-async wants us to switch it back).

Mentioned in SAL (#wikimedia-operations) [2023-02-28T16:45:01Z] <claime> Traffic and Service switchovers to codfw finished - T330651 - T330650

Stashbot mentioned this in T330650: 28 February 2023 Traffic Switchover checklist. Feb 28 2023, 4:45 PM

Clement_Goubert updated the task description. Mar 6 2023, 11:56 AM

hashar mentioned this in T331378: Deployment server permissions are broken preventing MediaWiki deployment. Mar 7 2023, 6:16 AM

T331285: March 2023 Traffic Repool checklist done, switching restbase-async back to its standard state.

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:48:46Z] <claime> Starting restbase-async switchback - T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:49:21Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route pool restbase-async in eqiad: T330651

Clement_Goubert updated the task description. Mar 8 2023, 11:51 AM

Clement_Goubert added a parent task: T328903: March 2023 Datacenter Switchover eqiad pooling schedule. Mar 8 2023, 11:53 AM

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:54:25Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool restbase-async in eqiad: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:54:59Z] <claime> restbase-async pooled in eqiad, depooling in codfw- T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T11:55:12Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool restbase-async in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T12:00:15Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: T330651

Mentioned in SAL (#wikimedia-operations) [2023-03-08T12:01:46Z] <claime> restbase-async back in standard state - T330651
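The switchback logged above ran in two steps: pool restbase-async in eqiad first, then depool it in codfw, presumably so that the service is never depooled in both datacenters at once. The sketch below only prints that order as a plan and does not invoke the cookbook; `switchback_plan` is a hypothetical helper, and the real `sre.discovery.service-route` command-line arguments are not shown in this task.

```shell
# Print the pool/depool steps in the order used for the switchback.
# Dry-run only: nothing here calls the real cookbook.
switchback_plan() {  # $1 = service, $2 = DC to pool, $3 = DC to depool
  printf 'pool %s in %s\n' "$1" "$2"
  printf 'depool %s in %s\n' "$1" "$3"
}

switchback_plan restbase-async eqiad codfw
```

The two printed lines match the cookbook descriptions in the SAL entries ("pool restbase-async in eqiad", "depool restbase-async in codfw").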

Clement_Goubert closed this task as Resolved. Mar 8 2023, 12:02 PM

Clement_Goubert updated the task description.

Clement_Goubert mentioned this in T331541: 14 March 2023 eqiad Service repooling. Mar 8 2023, 3:29 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:00:15Z] <claime> Locking scap deployment for service switchover - T330651

28 February 2023 Service Switchover checklist
Closed, Resolved (Public)

Description

One hour before (13:00UTC)

All services

Deployment server

restbase-async
