March 2023 Datacenter Switchover Excluded services
Closed, Resolved, Public

Description

List of services to exclude from March 2023 switchover

List of services currently skipped by the sre.discovery.datacenter cookbook

| Service | Reason | Explicitly Excluded |
| --- | --- | --- |
| aqs | it doesn't have a discovery record | |
| aux-k8s-ctrl | it doesn't have a discovery record | |
| blubberoid | blubberoid needs to follow swift replica for the docker registry | Y |
| cloudelastic-chi-https | it doesn't have a discovery record | |
| cloudelastic-chi-https-public | it doesn't have a discovery record | |
| cloudelastic-omega-https | it doesn't have a discovery record | |
| cloudelastic-omega-https-public | it doesn't have a discovery record | |
| cloudelastic-psi-https | it doesn't have a discovery record | |
| cloudelastic-psi-https-public | it doesn't have a discovery record | |
| datahubsearch | it doesn't have a discovery record | |
| docker-registry | swift replica goes codfw => eqiad and needs manual switching | Y |
| druid-public-broker | it doesn't have a discovery record | |
| kartotherian | it doesn't have a discovery record | |
| kibana7 | it doesn't have a discovery record | |
| kubemaster | it doesn't have a discovery record | |
| labweb | it doesn't have a discovery record | |
| labweb-ssl | it doesn't have a discovery record | |
| ldap-ro | it doesn't have a discovery record | |
| ldap-ro-ssl | it doesn't have a discovery record | |
| miscweb | it doesn't have a discovery record | |
| datahub-frontend | it doesn't have a discovery record | |
| datahub-gms | it doesn't have a discovery record | |
| dse-k8s-ctrl | it doesn't have a discovery record | |
| ml-ctrl | it doesn't have a discovery record | |
| ml-staging-ctrl | it doesn't have a discovery record | |
| ncredir | it doesn't have a discovery record | |
| ncredir-https | it doesn't have a discovery record | |
| prometheus | it doesn't have a discovery record | |
| restbase-backend | it doesn't have a discovery record | |
| search | it doesn't have a discovery record | |
| search-omega-https | it doesn't have a discovery record | |
| search-psi-https | it doesn't have a discovery record | |
| swift | it doesn't have a discovery record | |
| text | it doesn't have a discovery record | |
| text-https | it doesn't have a discovery record | |
| thumbor | it doesn't have a discovery record | |
| toolhub | T288685: needs to match m5 database cluster replication | Y |
| upload | it doesn't have a discovery record | |
| upload-https | it doesn't have a discovery record | |
| wdqs | it doesn't have a discovery record | |
| wdqs-heavy-queries | it doesn't have a discovery record | |
| helm-charts | not a 'service', strictly speaking, thus excluded | Y |
| releases | not a 'service', strictly speaking, thus excluded | Y |
| wikireplicas-a-s1 | it doesn't have a discovery record | |
| wikireplicas-a-s2 | it doesn't have a discovery record | |
| wikireplicas-a-s3 | it doesn't have a discovery record | |
| wikireplicas-a-s4 | it doesn't have a discovery record | |
| wikireplicas-a-s5 | it doesn't have a discovery record | |
| wikireplicas-a-s6 | it doesn't have a discovery record | |
| wikireplicas-a-s7 | it doesn't have a discovery record | |
| wikireplicas-a-s8 | it doesn't have a discovery record | |
| wikireplicas-b-s1 | it doesn't have a discovery record | |
| wikireplicas-b-s2 | it doesn't have a discovery record | |
| wikireplicas-b-s3 | it doesn't have a discovery record | |
| wikireplicas-b-s4 | it doesn't have a discovery record | |
| wikireplicas-b-s5 | it doesn't have a discovery record | |
| wikireplicas-b-s6 | it doesn't have a discovery record | |
| wikireplicas-b-s7 | it doesn't have a discovery record | |
| wikireplicas-b-s8 | it doesn't have a discovery record | |
| puppetdb-api | not a 'service', strictly speaking, thus excluded | Y |
| alertmanager | it doesn't have a discovery record | |
| graphite | it doesn't have a discovery record | |
| grafana | it doesn't have a discovery record | |
| librenms | it doesn't have a discovery record | |
| inference-staging | it doesn't have a discovery record | |
| image-suggestion | it doesn't have a discovery record | |
| developer-portal | it doesn't have a discovery record | |
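
The table above distinguishes two kinds of skip: services excluded explicitly for a stated reason, and services skipped only because they have no discovery record. As a rough illustration of that distinction, here is a minimal sketch, not the actual cookbook code. The `Service` class, the `has_discovery_record` field and the `partition_services` helper are hypothetical names invented for this example, and the `EXCLUDED_SERVICES` mapping is just an excerpt of the explicit exclusions listed above; its real shape in the cookbooks may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical stand-in for a service catalog entry."""
    name: str
    has_discovery_record: bool


# Assumption: explicit exclusions kept as a name -> reason mapping.
# The entries are an excerpt copied from the table above.
EXCLUDED_SERVICES = {
    "blubberoid": "needs to follow swift replica for the docker registry",
    "docker-registry": "swift replica goes codfw => eqiad and needs manual switching",
    "toolhub": "T288685: needs to match m5 database cluster replication",
}


def partition_services(services):
    """Split services into those to switch and those skipped (name -> reason)."""
    to_switch = []
    skipped = {}
    for svc in services:
        if svc.name in EXCLUDED_SERVICES:
            # Explicit exclusions win, even if a discovery record exists.
            skipped[svc.name] = EXCLUDED_SERVICES[svc.name]
        elif not svc.has_discovery_record:
            skipped[svc.name] = "it doesn't have a discovery record"
        else:
            to_switch.append(svc.name)
    return to_switch, skipped
```

Checking the explicit list first keeps the documented reason attached to a service even when it also lacks a discovery record, which matches how the table reports a single reason per service.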

Event Timeline

Clement_Goubert created this task.

Toolhub does not have a working Kubernetes deployment outside of eqiad (T288685: Establish active/active multi-dc support for Toolhub). Who should I work with to try and prevent this from causing problems for either Toolhub or SREs?

Sorry for the late response @bd808
Toolhub is part of the EXCLUDED_SERVICES coded in the sre.switchdc.services cookbook, and m5 is not part of the switched over databases.

However, this means that since @JMeybohm will be upgrading the eqiad cluster to Kubernetes 1.23 (T307943: Update Kubernetes clusters to v1.23), there will probably be an interruption of service at that moment.

Do you see other possible issues?

I made T329319: What should happen to Toolhub during the 2023 DC switch? today to try and figure out what is needed, as we are now only a couple of weeks from the service cutover. Manuel generally confirmed in T329319#8602822 that m5 should keep working, apart from potential primary switches for the various maintenance actions that are likely in eqiad. The other major external service dependency is the search-chi-eqiad cluster. I've pinged Guillaume about that in T329319#8602419. If I don't hear from him by early next week I will poke him via other channels.

A brief outage for a k8s upgrade is not a big deal for this service. Alternatively, if the aux cluster is ready before the eqiad cluster is upgraded, we could deploy there and swing traffic to it. That's pretty much what I am asking you and Janis about in T329319#8602389.

Change 888208 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.switchdc.services: Exclude wdqs and wdqs-ssl

https://gerrit.wikimedia.org/r/888208

Change 888213 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.switchdc.services: import sre.discovery.datacenter excludes

https://gerrit.wikimedia.org/r/888213

Change 888208 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.services: Exclude wdqs and wdqs-ssl

https://gerrit.wikimedia.org/r/888208

Change 888213 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.services: import service exclusions

https://gerrit.wikimedia.org/r/888213

Mentioned in SAL (#wikimedia-operations) [2023-02-28T14:44:36Z] <claime> Services switched over to codfw - T329193