
Cloud Services: reallocate workload from rack B5-eqiad
Closed, Resolved · Public

Description

We need to reallocate the workload (or prepare a contingency plan in case of sudden power loss) for the following servers:

  • labweb1001
  • cloudvirt1014 (running 14 VMs)
  • cloudvirt1028 (running 42 VMs)

The operation window is Thu 16 13:00 UTC.

labweb1001 has a counterpart, labweb1002, which can probably hold all the load during the window.

In the case of VMs, we could probably reallocate them ahead of the window so we can gracefully shut down the cloudvirts?
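
For illustration, a minimal sketch of that pre-window reallocation using the plain nova CLI (WMCS likely uses its own cold-migrate wrapper instead, and <vm-uuid> is a placeholder):

    # Live-migrate a VM to another hypervisor without a shutdown
    # (requires shared storage or block migration support):
    nova live-migration <vm-uuid>

    # Cold migration works everywhere but shuts the VM down briefly,
    # which matches the caveat in the user email further down this task:
    nova migrate <vm-uuid>
    nova resize-confirm <vm-uuid>   # confirm once the instance reaches VERIFY_RESIZE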

Event Timeline

cloudvirt1028.eqiad.wmnet:
    af-puppetdb01.automation-framework.eqiad.wmflabs
    bastion-eqiad1-02.bastion.eqiad.wmflabs
    fridolin.catgraph.eqiad.wmflabs
    cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs
    cloudstore-dev-01.cloudstore.eqiad.wmflabs
    commtech-nsfw.commtech.eqiad.wmflabs
    clm-test-01.community-labs-monitoring.eqiad.wmflabs
    cyberbot-exec-iabot-01.cyberbot.eqiad.wmflabs
    deployment-db05.deployment-prep.eqiad.wmflabs
    deployment-memc05.deployment-prep.eqiad.wmflabs
    deployment-sca01.deployment-prep.eqiad.wmflabs
    deployment-pdfrender02.deployment-prep.eqiad.wmflabs
    ign.ign2commons.eqiad.wmflabs
    integration-slave-docker-1050.integration.eqiad.wmflabs
    integration-castor03.integration.eqiad.wmflabs
    api.openocr.eqiad.wmflabs
    osmit-umap.osmit.eqiad.wmflabs
    builder-envoy.packaging.eqiad.wmflabs
    jmm-buster.puppet.eqiad.wmflabs
    a11y.reading-web-staging.eqiad.wmflabs
    adhoc-utils01.security-tools.eqiad.wmflabs
    util-abogott-stretch.testlabs.eqiad.wmflabs
    canary1028-01.testlabs.eqiad.wmflabs
    stretch.thumbor.eqiad.wmflabs
    tools-worker-1023.tools.eqiad.wmflabs
    tools-proxy-04.tools.eqiad.wmflabs
    tools-docker-builder-06.tools.eqiad.wmflabs
    tools-sgewebgrid-generic-0904.tools.eqiad.wmflabs
    tools-sgeexec-0942.tools.eqiad.wmflabs
    tools-sgeexec-0941.tools.eqiad.wmflabs
    tools-sgeexec-0940.tools.eqiad.wmflabs
    tools-sgeexec-0939.tools.eqiad.wmflabs
    tools-sgeexec-0937.tools.eqiad.wmflabs
    tools-sgeexec-0929.tools.eqiad.wmflabs
    tools-sgeexec-0921.tools.eqiad.wmflabs
    tools-sgeexec-0920.tools.eqiad.wmflabs
    tools-sgeexec-0911.tools.eqiad.wmflabs
    tools-sgeexec-0909.tools.eqiad.wmflabs
    toolsbeta-proxy-01.toolsbeta.eqiad.wmflabs
    vconverter-instance.videowiki.eqiad.wmflabs
    perfbot.webperf.eqiad.wmflabs
    wdhqs-1.wikidata-history-query-service.eqiad.wmflabs

cloudvirt1014.eqiad.wmnet:
    commonsarchive-prod.commonsarchive.eqiad.wmflabs
    deployment-imagescaler03.deployment-prep.eqiad.wmflabs
    dumps-5.dumps.eqiad.wmflabs
    dumps-4.dumps.eqiad.wmflabs
    incubator-mw.incubator.eqiad.wmflabs
    webperformance.integration.eqiad.wmflabs
    saucelabs-01.integration.eqiad.wmflabs
    integration-puppetmaster01.integration.eqiad.wmflabs
    maps-puppetmaster.maps.eqiad.wmflabs
    maps-wma.maps.eqiad.wmflabs
    mwoffliner3.mwoffliner.eqiad.wmflabs
    mwoffliner1.mwoffliner.eqiad.wmflabs
    phlogiston-5.phlogiston.eqiad.wmflabs
    discovery-testing-01.shiny-r.eqiad.wmflabs
    snuggle-enwiki-01.snuggle.eqiad.wmflabs
    canary-1014-01.testlabs.eqiad.wmflabs
    tools-sgeexec-0901.tools.eqiad.wmflabs
    wdqs-test.wikidata-query.eqiad.wmflabs

cloudvirt1014 is already depooled and marked for rebuild since it runs Jessie, so this would be a good opportunity to drain it. I guess the other one should be depooled too if its VMs are to be moved off during the maintenance.
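
For reference, a minimal sketch of what depooling a hypervisor typically means in OpenStack terms, i.e. disabling its nova-compute service so the scheduler stops placing new VMs on it (the Gerrit change below does this via puppet instead, so treat this as illustrative only):

    # Stop scheduling new VMs onto the host:
    openstack compute service set --disable \
        --disable-reason "rack B5 PDU maintenance T223148" \
        cloudvirt1028.eqiad.wmnet nova-compute

    # Re-enable it after the maintenance window:
    openstack compute service set --enable cloudvirt1028.eqiad.wmnet nova-compute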

Change 509903 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Depool cloudvirt1028

https://gerrit.wikimedia.org/r/509903

I've asked for clarification about what kind of power outage is feared here. Since emptying 1028 will cause downtime anyway, I want to know whether the expected downtime from the PDU move is more or less than the downtime associated with evacuation.

I think we should risk the slight chance of a multi-hour outage. Three days isn't enough time to give proper notice of an evacuation, and if things go well the work will have been in vain anyway. So, I propose:

  1. We move a few exec nodes off of 1028 so that the grid will be less affected by possible power loss (see the sketch after this list).
  2. We announce the upcoming maintenance window, and
     2a. give users the option of having their VMs evacuated if they prefer that.
  3. We don't evacuate anything else.
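
Draining an exec node before moving it looks roughly like this with plain gridengine commands (Toolforge has its own wrapper tooling for this, so this is only a sketch; the host name is one of the nodes listed above):

    # Disable all queue instances on the node so no new jobs land there:
    qmod -d '*@tools-sgeexec-0942.tools.eqiad.wmflabs'

    # Wait for running jobs to finish (or reschedule them), migrate the VM,
    # then re-enable the queues once the node is back up:
    qmod -e '*@tools-sgeexec-0942.tools.eqiad.wmflabs'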

Change 509903 abandoned by Alex Monk:
Depool cloudvirt1028

Reason:
per ticket and irc, probably not worth the effort

https://gerrit.wikimedia.org/r/509903

Email sent with content:

Hi!

On 2019-05-16 13:00 UTC there will be a maintenance operation in one of the
Wikimedia Foundation datacenter racks that affects 2 of our servers running
virtual machines [0]. There is a risk that this maintenance operation results
in power loss on those servers, affecting the virtual machines running on them.
However, there is no way to know for sure whether there will be any outage at all.

If you are an admin of any of the VMs in the list and you want the VM to be
reallocated to another server prior to the operation, please get in touch
with us as soon as possible. Remember that, right now, reallocating a VM to
another server means shutting down the VM briefly.

Here is a list of affected virtual machines:

cloudvirt1028.eqiad.wmnet:
    af-puppetdb01.automation-framework.eqiad.wmflabs
    bastion-eqiad1-02.bastion.eqiad.wmflabs
    fridolin.catgraph.eqiad.wmflabs
    cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs
    cloudstore-dev-01.cloudstore.eqiad.wmflabs
    commtech-nsfw.commtech.eqiad.wmflabs
    clm-test-01.community-labs-monitoring.eqiad.wmflabs
    cyberbot-exec-iabot-01.cyberbot.eqiad.wmflabs
    deployment-db05.deployment-prep.eqiad.wmflabs
    deployment-memc05.deployment-prep.eqiad.wmflabs
    deployment-sca01.deployment-prep.eqiad.wmflabs
    deployment-pdfrender02.deployment-prep.eqiad.wmflabs
    ign.ign2commons.eqiad.wmflabs
    integration-slave-docker-1050.integration.eqiad.wmflabs
    integration-castor03.integration.eqiad.wmflabs
    api.openocr.eqiad.wmflabs
    osmit-umap.osmit.eqiad.wmflabs
    builder-envoy.packaging.eqiad.wmflabs
    jmm-buster.puppet.eqiad.wmflabs
    a11y.reading-web-staging.eqiad.wmflabs
    adhoc-utils01.security-tools.eqiad.wmflabs
    util-abogott-stretch.testlabs.eqiad.wmflabs
    canary1028-01.testlabs.eqiad.wmflabs
    stretch.thumbor.eqiad.wmflabs
    tools-worker-1023.tools.eqiad.wmflabs
    tools-proxy-04.tools.eqiad.wmflabs
    tools-docker-builder-06.tools.eqiad.wmflabs
    tools-sgewebgrid-generic-0904.tools.eqiad.wmflabs
    tools-sgeexec-0942.tools.eqiad.wmflabs
    tools-sgeexec-0941.tools.eqiad.wmflabs
    tools-sgeexec-0940.tools.eqiad.wmflabs
    tools-sgeexec-0939.tools.eqiad.wmflabs
    tools-sgeexec-0937.tools.eqiad.wmflabs
    tools-sgeexec-0929.tools.eqiad.wmflabs
    tools-sgeexec-0921.tools.eqiad.wmflabs
    tools-sgeexec-0920.tools.eqiad.wmflabs
    tools-sgeexec-0911.tools.eqiad.wmflabs
    tools-sgeexec-0909.tools.eqiad.wmflabs
    toolsbeta-proxy-01.toolsbeta.eqiad.wmflabs
    vconverter-instance.videowiki.eqiad.wmflabs
    perfbot.webperf.eqiad.wmflabs
    wdhqs-1.wikidata-history-query-service.eqiad.wmflabs

cloudvirt1014.eqiad.wmnet:
    commonsarchive-prod.commonsarchive.eqiad.wmflabs
    deployment-imagescaler03.deployment-prep.eqiad.wmflabs
    dumps-5.dumps.eqiad.wmflabs
    dumps-4.dumps.eqiad.wmflabs
    incubator-mw.incubator.eqiad.wmflabs
    webperformance.integration.eqiad.wmflabs
    saucelabs-01.integration.eqiad.wmflabs
    integration-puppetmaster01.integration.eqiad.wmflabs
    maps-puppetmaster.maps.eqiad.wmflabs
    maps-wma.maps.eqiad.wmflabs
    mwoffliner3.mwoffliner.eqiad.wmflabs
    mwoffliner1.mwoffliner.eqiad.wmflabs
    phlogiston-5.phlogiston.eqiad.wmflabs
    discovery-testing-01.shiny-r.eqiad.wmflabs
    snuggle-enwiki-01.snuggle.eqiad.wmflabs
    canary-1014-01.testlabs.eqiad.wmflabs
    tools-sgeexec-0901.tools.eqiad.wmflabs
    wdqs-test.wikidata-query.eqiad.wmflabs


Toolforge won't be affected by this operation.
You can read more details about the datacenter operation itself in phabricator [1].

Sorry for the short notice,

regards.

[0] Cloud Services: reallocate workload from rack B5-eqiad
https://phabricator.wikimedia.org/T223148
[1] Install new PDUs into b5-eqiad https://phabricator.wikimedia.org/T223126

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T12:43:46Z] <arturo> T223148 depool tools-sgewebgrid-generic-0904

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T12:49:33Z] <arturo> T223148 reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T12:50:36Z] <arturo> T223148 depool tools-sgeexec-0942

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T12:52:34Z] <arturo> T223148 reallocating tools-sgeexec-0942 to cloudvirt1001

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T12:58:43Z] <arturo> T223148 reallocating tools-worker-1023 to cloudvirt1001

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T13:03:20Z] <arturo> T223148 repool tools-sgewebgrid-generic-0904

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T13:16:43Z] <arturo> T223148 repool tools-sgeexec-0942

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T15:13:11Z] <arturo> T223148 repool tools-worker-1023

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T15:23:03Z] <arturo> T223148 depool tools-worker-1009

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T15:24:02Z] <arturo> T223148 last SAL entry is bogus, please ignore (depool tools-worker-1009)

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T15:24:24Z] <arturo> T223148 depool tools-sgeexec-0909 and reallocate to cloudvirt1002

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T15:52:35Z] <arturo> T223148 repool tools-sgeexec-0909

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T15:56:49Z] <arturo> T223148 depool tools-sgeexec-0911 and reallocate to cloudvirt1003

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T16:36:47Z] <arturo> T223148 repool tools-sgeexec-0911

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T16:37:29Z] <arturo> T223148 depool tools-sgeexec-0920 and reallocate to cloudvirt1003

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T17:12:16Z] <arturo> T223148 repool tools-sgeexec-0920

Following the maintenance email, for the integration project I could use some reallocations / a sync-up before the maintenance.

DONE: integration-puppetmaster01.integration.eqiad.wmflabs (on cloudvirt1014.eqiad.wmnet)

It can be shut down / reallocated at any time, but I would rather have it up while the maintenance is ongoing, or the whole fleet will alarm due to the lack of a puppetmaster.

DONE: integration-castor03.integration.eqiad.wmflabs (on cloudvirt1028.eqiad.wmnet)

This is a CI SPOF; when it is down it eventually brings down the whole service. It is preferable to reallocate it when CI has low traffic, typically during the European morning.

webperformance.integration.eqiad.wmflabs (on cloudvirt1014.eqiad.wmnet)

It runs long-standing jobs that are time sensitive. We would need to depool it from Jenkins, reallocate it, then pool it back.

DONE: saucelabs-01.integration.eqiad.wmflabs (on cloudvirt1014.eqiad.wmnet)
integration-slave-docker-1050.integration.eqiad.wmflabs (on cloudvirt1028.eqiad.wmnet)

These do not need to be reallocated; we just have to put them offline in Jenkins at least an hour before the maintenance window. Or we can reallocate them as well, which requires depooling them from Jenkins, migrating them, and pooling them back (see the sketch below).
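
A minimal sketch of taking such an agent offline and back online with the Jenkins CLI (the Jenkins URL here is assumed to be the integration Jenkins; the real workflow may go through the web UI or other tooling instead):

    # Mark the agent offline so no new builds are scheduled on it:
    java -jar jenkins-cli.jar -s https://integration.wikimedia.org/ci/ \
        offline-node integration-slave-docker-1050 -m "T223148 rack B5 PDU maintenance"

    # Bring it back once the maintenance (or the migration) is over:
    java -jar jenkins-cli.jar -s https://integration.wikimedia.org/ci/ \
        online-node integration-slave-docker-1050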

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T08:56:57Z] <arturo> T223148 reallocating integration-puppetmaster01 to cloudvirt1001

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T09:00:20Z] <arturo> T223148 depool tools-sgeexec-0901 and reallocate to cloudvirt1004

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T09:13:34Z] <arturo> T223148 reallocating integration-castor03 to cloudvirt1002

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T09:20:43Z] <arturo> T223148 reallocating saucelabs-01 to cloudvirt1007

Change 510450 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Disable castor saving

https://gerrit.wikimedia.org/r/510450

Mentioned in SAL (#wikimedia-operations) [2019-05-15T09:33:35Z] <hashar> Disable CI castor cache system since the instance is being migrated. Some / most CI jobs might have failed for the last 20 minutes or so T223148

Change 510450 merged by jenkins-bot:
[integration/config@master] Disable castor saving

https://gerrit.wikimedia.org/r/510450

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T09:44:08Z] <arturo> T223148 repool tools-sgeexec-0901

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T10:36:39Z] <arturo> T223148 reallocation of integration-castor03 is now done

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T10:46:14Z] <arturo> T223148 depool tools-sgeexec-0941 and move to cloudvirt1005

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T11:11:07Z] <arturo> T223148 repool tools-sgeexec-0941

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T11:20:50Z] <arturo> T223148 depool tools-sgeexec-0940 and move to cloudvirt1006

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T11:34:33Z] <arturo> T223148 repool tools-sgeexec-0940

integration is covered.

webperformance.integration.eqiad.wmflabs and integration-slave-docker-1050 would have to be put offline in Jenkins before the migration, but it is not a big issue if they go down unexpectedly.

Thank you @arturo for the list of instances affected and for the reallocations!

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T12:07:43Z] <arturo> T223148 move bastion-eqiad1-02 to cloudvirt1001

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T12:13:06Z] <arturo> T223148 depool tools-sgeexec-0939 and move to cloudvirt1007

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T12:13:18Z] <arturo> T223148 depool tools-sgeexec-0937 and move to cloudvirt1008

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T12:29:54Z] <arturo> T223148 repool both tools-sgeexec-09[37,39]

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T15:32:28Z] <arturo> T223148 depool tools-sgeexec-0920 and move to cloudvirt1014

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T15:32:35Z] <arturo> T223148 depool tools-sgeexec-0921 and move to cloudvirt1014

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T16:20:02Z] <arturo> T223148 repool both tools-sgeexec-0921 and -0929

We are ready for this.

I will now set downtime for the servers:

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "possible power outage" -t T223148 --hours 8 cloudvirt1014.eqiad.wmnet
aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "possible power outage" -t T223148 --hours 8 cloudvirt1028.eqiad.wmnet
aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "possible power outage" -t T223148 --hours 8 labweb1001.wikimedia.org

In case of problems, I will depool labweb1001:

aborrero@puppetmaster1001:~ $ sudo -i confctl select dc=eqiad,cluster=labweb get
{"labweb1001.wikimedia.org": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=labweb,service=labweb"}
{"labweb1002.wikimedia.org": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=labweb,service=labweb"}

aborrero@puppetmaster1001:~ $ sudo -i confctl depool --hostname labweb1001.wikimedia.org --service labweb
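
And, assuming confctl pool mirrors the depool call above, the matching repool after the window would be:

    # repool labweb1001 once the rack work is confirmed done
    sudo -i confctl pool --hostname labweb1001.wikimedia.org --service labweb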

Icinga downtime for 8:00:00 set by aborrero@cumin1001 on 1 host(s) and their services with reason: possible power outage

cloudvirt1014.eqiad.wmnet

Icinga downtime for 8:00:00 set by aborrero@cumin1001 on 1 host(s) and their services with reason: possible power outage

cloudvirt1028.eqiad.wmnet

Icinga downtime for 8:00:00 set by aborrero@cumin1001 on 1 host(s) and their services with reason: possible power outage

labweb1001.wikimedia.org

Mentioned in SAL (#wikimedia-operations) [2019-05-16T11:00:43Z] <arturo> T223148 downtime cloudvirt[1014,1028].eqiad.wmnet and labweb1001.wikimedia.org for 8 hours