Migrate the Integration cloud project to eqiad1-r
Closed, Resolved, Public

Description

ID                                    Name                              Status  Networks
65a1f5b6-a15d-4bac-bba8-492bbaf1688a  integration-slave-docker-1043     ACTIVE  public=10.68.22.120
a85357b9-8b28-4af3-abe5-123ee9d841c4  integration-slave-docker-1041     ACTIVE  public=10.68.19.23
abfb4d28-22f2-41fb-b379-db72e903d04e  integration-slave-docker-1040     ACTIVE  public=10.68.19.167
6b72d7ee-b5b9-46b3-b301-b4e0814f23d1  integration-slave-docker-1038     ACTIVE  public=10.68.17.241
b5e22c89-a815-4187-aaa4-f8693fbfa28c  integration-slave-docker-1037     ACTIVE  public=10.68.21.155
7c4a6cef-b3ee-493a-b8e6-df7e9ab6960e  integration-slave-docker-1034     ACTIVE  public=10.68.23.35
d64194fa-b42c-497c-a1a7-472edfaf940d  integration-slave-docker-1021     ACTIVE  public=10.68.22.8
0490ca9c-22c9-4451-8d5a-a91ecbc09652  integration-cumin                 ACTIVE  public=10.68.18.238
3d33c860-f462-480d-a4a0-29354f6d022b  webperformance                    ACTIVE  public=10.68.20.166
58ae7fe9-c832-4bf3-b51e-6bce06ed8b30  castor02                          ACTIVE  public=10.68.20.186
6042250a-3af0-40b6-bf11-71616492982a  integration-r-lang-01             ACTIVE  public=10.68.20.232
6f261093-b0be-48f0-adc6-c0f2e7957dfe  jenkinstest                       ACTIVE  public=10.68.22.228
a4b5cdc8-8692-4814-84d5-0804e39d810b  saucelabs-03                      ACTIVE  public=10.68.23.100
b4064edb-10c7-410a-b11c-4df8f358526f  saucelabs-02                      ACTIVE  public=10.68.22.247
b83af49a-4eaf-469a-818f-e5e984f5fc65  saucelabs-01                      ACTIVE  public=10.68.21.186
6d6449bd-87ee-4108-b444-43b88ec61113  integration-publishing            ACTIVE  public=10.68.23.254
6da0fbd1-3b97-4a2f-9669-81507a04b3f1  integration-puppetmaster01        ACTIVE  public=10.68.22.41
a64e731e-c987-4b84-85b9-cf450275d26d  integration-slave-jessie-android  ACTIVE  public=10.68.19.239
641c9a74-fcb7-49ea-b550-5f6cd5dcc893  integration-slave-jessie-1002     ACTIVE  public=10.68.16.199
4a7b7e39-8dee-4388-befd-96a763f5d0fa  integration-slave-jessie-1001     ACTIVE  public=10.68.16.72

I expect that it will be disruptive to CI to have these offline, but I also suspect that e.g. the docker cluster can function across regions just fine. So I propose that for a start:

  1. Someone in releng adjusts security groups and firewall rules to allow ingress to 172.16.0.0/21 for docker things
  2. Someone at releng depools integration-slave-docker-1043 and pings me
  3. I move integration-slave-docker-1043 to the new region
  4. Someone at releng repools integration-slave-docker-1043 and confirms that it works.
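When staging a move like this, it can help to check quickly which side of the migration an address is on. The following is a minimal sketch, assuming that 172.16.0.0/21 (the range named in step 1) is what instances in the new region use; the helper name `in_new_region` is hypothetical and not part of any existing tooling.

```python
import ipaddress

# Range named in step 1 of the plan; assumed here to cover the new region.
NEW_REGION = ipaddress.ip_network("172.16.0.0/21")


def in_new_region(address: str) -> bool:
    """Return True if the address falls inside the assumed new-region range."""
    return ipaddress.ip_address(address) in NEW_REGION


if __name__ == "__main__":
    # Addresses taken from the instance listing in the description.
    instances = [
        ("integration-slave-docker-1043", "10.68.22.120"),
        ("integration-cumin", "10.68.18.238"),
    ]
    for name, address in instances:
        region = "new region" if in_new_region(address) else "old region"
        print(f"{name} {address}: {region}")
```

Anything still reporting a 10.68.x.x address has not been moved yet under that assumption.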

If that all goes well then we can move the other docker nodes over in pairs without disruption.

I have no idea if the other things in that project can be similarly staged or if we should just plan a sprint during a low-traffic time. I'm open to suggestions.

Event Timeline

Andrew triaged this task as Medium priority. Nov 5 2018, 9:39 PM
Andrew created this task.

I just moved integration-slave-docker-1038 and integration-slave-docker-1034 to eqiad1-r and repooled them. They look OK to me, so far.

I've moved several docker-XXXX and jessie-XXXX nodes and that seems to work fine. So I'll finish up with those on my own.

The next VMs of interest to me are integration-cumin and integration-publishing which might be more disruptive.

Change 474279 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Explain why integration-publishing IP is hardcoded

https://gerrit.wikimedia.org/r/474279

So for integration-publishing, it's used from contint1001, which is a production machine. That server thus cannot resolve the DNS entry integration-publishing.integration.eqiad.wmflabs. and we have hardcoded the instance IP address in the Jenkins jobs.

I think the best way to migrate is to create a fresh instance in the new region. Once we have its IP address, we migrate the Jenkins jobs to use the new instance, and then we can discard the old one.
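To illustrate the failure mode behind the hardcoded address, here is a minimal sketch of a lookup that falls back to a fixed IP when the name does not resolve. It is not how the Jenkins jobs are actually configured; `resolve_or_fallback` is a hypothetical helper, and the resolver is injectable only so the behaviour can be exercised without network access.

```python
import socket
from typing import Callable


def resolve_or_fallback(
    hostname: str,
    fallback_ip: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve hostname, or return fallback_ip if resolution fails."""
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError, so lookup failures land here.
        return fallback_ip


if __name__ == "__main__":
    # Hostname and IP are the ones quoted in this task for the old instance.
    print(
        resolve_or_fallback(
            "integration-publishing.integration.eqiad.wmflabs",
            "10.68.23.254",
        )
    )
```

The drawback is the same one this task runs into: the fallback address goes stale as soon as the instance is recreated or moved.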

For castor02.integration.eqiad.wmflabs, I will have to double check, but I think the jobs always use the DNS entry and always run from labs instances, so it should be fine to migrate. However, castor02 is a SPOF for most of the CI jobs. We would probably want to create a new instance, rsync the caches there and switch the jobs.

As for the integration-slave hosts, some have been migrated this week with @thcipriani and they all seem fine. At least, no issues related to the migration have been reported to me. I am fine with having the other integration-slave hosts migrated. Kudos!

Change 474279 merged by Andrew Bogott:
[integration/config@master] Explain why integration-publishing IP is hardcoded

https://gerrit.wikimedia.org/r/474279

Mentioned in SAL (#wikimedia-releng) [2018-11-19T13:21:10Z] <hashar> Created integration-publishing02 172.16.4.5 for WMCS region migration # T208803

Change 474689 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Switch to integration-publishing02

https://gerrit.wikimedia.org/r/474689

Mentioned in SAL (#wikimedia-releng) [2018-11-19T13:47:40Z] <hashar> Shutdown integration-publishing , replaced by integration-publishing02 # T208803

Change 474689 merged by jenkins-bot:
[integration/config@master] Switch to integration-publishing02

https://gerrit.wikimedia.org/r/474689

I have switched the CI jobs from integration-publishing to integration-publishing02 in the new region. I have shut down the old instance and will delete it in a few days.

For castor02, we always use the DNS entry, the only exception being in the Jenkins configuration for the node. So we can migrate like other instances. That will cause a slight downtime of CI while it is being done.

Mentioned in SAL (#wikimedia-releng) [2018-11-20T14:59:24Z] <hashar> created integration-castor03.integration.eqiad.wmflabs intended as a replacement for castor02 | T208803

I will prepare the migration to integration-castor03 later tonight and attempt to do the migration tomorrow or on Thursday (easier for me)

In the end, I felt that migrating to castor03 just before Thanksgiving would be a terrible idea. Postponed to next week.

Change 476213 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Migrate Castor to integration-castor03

https://gerrit.wikimedia.org/r/476213

Change 476213 merged by jenkins-bot:
[integration/config@master] Migrate Castor to integration-castor03

https://gerrit.wikimedia.org/r/476213

Mentioned in SAL (#wikimedia-releng) [2018-11-28T08:45:37Z] <hashar> Switching Jenkins job cache to integration-castor03 with AN EMPTY CACHE | T208803

CI jobs have been migrated to use integration-castor03. I am keeping castor02 around just in case.

Last rsync entry for castor02:

Nov 28 08:38:17 castor02 rsyncd[19693]: rsync on caches/pywikibot-core/master/pywikibot-core-tox-nose34-docker/ from integration-slave-docker-1037.integration.eqiad.wmflabs (10.68.21.155)
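To confirm that nothing is still talking to the old cache host, one could scan the rsyncd log for lines like the one above. This is a minimal sketch, assuming the log format matches that single sample line; the function name and regular expression are illustrative only.

```python
import re
from typing import Optional

# Pattern derived from the single sample line quoted above.
RSYNC_LINE = re.compile(
    r"rsync on (?P<path>\S+) from (?P<host>\S+) \((?P<ip>[0-9.]+)\)"
)


def parse_rsync_line(line: str) -> Optional[dict]:
    """Extract the requested path, client host and client IP from a log line."""
    match = RSYNC_LINE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Nov 28 08:38:17 castor02 rsyncd[19693]: rsync on "
        "caches/pywikibot-core/master/pywikibot-core-tox-nose34-docker/ from "
        "integration-slave-docker-1037.integration.eqiad.wmflabs (10.68.21.155)"
    )
    print(parse_rsync_line(sample))
```

If no matching lines appear after the switch, the jobs are no longer using the old host.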

integration-castor03 cache is filling up properly.

doc.wikimedia.org has received updates, so the new publishing instance is working.

Deleting castor02 and integration-publishing

Mentioned in SAL (#wikimedia-releng) [2018-11-29T08:35:37Z] <hashar> Deleted instances castor02 and integration-publishing , replaced by new instances in the new WMCS region | T208803

I would like to move integration-puppetmaster01 and webperformance sometime soon. Any objection?

webperformance should be easy to migrate. It runs some jobs for the performance team, but apart from that it is a regular Jenkins slave. We just have to remember to update its IP address in the Jenkins slave configuration.

integration-puppetmaster01 could potentially end up being messy. But I think the instances have puppet.conf pointing to the hostname, so as long as DNS is adjusted, it should just work. I am not too worried :-)

Note: releng is having an offsite this week, so unfortunately we have little availability.
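The assumption that agents point at a hostname rather than an IP can be checked before the move. Below is a minimal sketch, assuming puppet.conf is simple enough to be read as an INI file and that the master is set through the `server` option in `[agent]` or `[main]`; the sample contents and the master's fully qualified name are illustrative guesses, not copied from a real instance.

```python
import configparser
import ipaddress
from typing import Optional

# Illustrative puppet.conf contents; the hostname is an assumption.
SAMPLE_CONF = """
[main]
logdir = /var/log/puppet

[agent]
server = integration-puppetmaster01.integration.eqiad.wmflabs
"""


def puppet_server(conf_text: str) -> Optional[str]:
    """Return the configured puppet master, preferring [agent] over [main]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    for section in ("agent", "main"):
        if parser.has_option(section, "server"):
            return parser.get(section, "server")
    return None


def is_ip_literal(value: str) -> bool:
    """Return True if value is an IP address rather than a hostname."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    server = puppet_server(SAMPLE_CONF)
    print(server, "(IP literal)" if is_ip_literal(server) else "(hostname)")
```

An agent whose `server` value is an IP literal would need a manual update after the puppetmaster changes address.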

I've now moved everything in this project except for jenkinstest.integration.eqiad.wmflabs. I can't tell what that is -- should I move it blindly, or delete it, or does it need special care?

I have a vague memory of this and I don't think there is any special care needed. I am reasonably certain that nothing depends on this instance and it can be moved blindly at will.

Thank you for completing the migration!

As for jenkinstest, that is most probably a one-off dev instance I created. There is evidence I used it for T183569: npm 1.4.21 can't use a http proxy. There are no jobs running on it, so it can be deleted.

Success! \o/

Mentioned in SAL (#wikimedia-releng) [2018-12-11T10:03:44Z] <hashar> Deleting unused instance jenkinstest.integration.eqiad.wmflabs | T208803