Maniphest T208803

Migrate the Integration cloud project to eqiad1-r
Closed, Resolved, Public

Assigned To: hashar

Authored By: Andrew, Nov 5 2018, 9:39 PM

Description

65a1f5b6-a15d-4bac-bba8-492bbaf1688a	integration-slave-docker-1043	ACTIVE	public=10.68.22.120
a85357b9-8b28-4af3-abe5-123ee9d841c4	integration-slave-docker-1041	ACTIVE	public=10.68.19.23
abfb4d28-22f2-41fb-b379-db72e903d04e	integration-slave-docker-1040	ACTIVE	public=10.68.19.167
6b72d7ee-b5b9-46b3-b301-b4e0814f23d1	integration-slave-docker-1038	ACTIVE	public=10.68.17.241
b5e22c89-a815-4187-aaa4-f8693fbfa28c	integration-slave-docker-1037	ACTIVE	public=10.68.21.155
7c4a6cef-b3ee-493a-b8e6-df7e9ab6960e	integration-slave-docker-1034	ACTIVE	public=10.68.23.35
d64194fa-b42c-497c-a1a7-472edfaf940d	integration-slave-docker-1021	ACTIVE	public=10.68.22.8
0490ca9c-22c9-4451-8d5a-a91ecbc09652	integration-cumin	ACTIVE	public=10.68.18.238
3d33c860-f462-480d-a4a0-29354f6d022b	webperformance	ACTIVE	public=10.68.20.166
58ae7fe9-c832-4bf3-b51e-6bce06ed8b30	castor02	ACTIVE	public=10.68.20.186
6042250a-3af0-40b6-bf11-71616492982a	integration-r-lang-01	ACTIVE	public=10.68.20.232
6f261093-b0be-48f0-adc6-c0f2e7957dfe	jenkinstest	ACTIVE	public=10.68.22.228
a4b5cdc8-8692-4814-84d5-0804e39d810b	saucelabs-03	ACTIVE	public=10.68.23.100
b4064edb-10c7-410a-b11c-4df8f358526f	saucelabs-02	ACTIVE	public=10.68.22.247
b83af49a-4eaf-469a-818f-e5e984f5fc65	saucelabs-01	ACTIVE	public=10.68.21.186
6d6449bd-87ee-4108-b444-43b88ec61113	integration-publishing	ACTIVE	public=10.68.23.254
6da0fbd1-3b97-4a2f-9669-81507a04b3f1	integration-puppetmaster01	ACTIVE	public=10.68.22.41
a64e731e-c987-4b84-85b9-cf450275d26d	integration-slave-jessie-android	ACTIVE	public=10.68.19.239
641c9a74-fcb7-49ea-b550-5f6cd5dcc893	integration-slave-jessie-1002	ACTIVE	public=10.68.16.199
4a7b7e39-8dee-4388-befd-96a763f5d0fa	integration-slave-jessie-1001	ACTIVE	public=10.68.16.72

I expect that it will be disruptive to CI to have these offline, but I also suspect that e.g. the docker cluster can function across regions just fine. So I propose that for a start:

Someone in releng adjusts security groups and firewall rules to allow ingress to 172.16.0.0/21 for docker things
Someone in releng depools integration-slave-docker-1043 and pings me
I move integration-slave-docker-1043 to the new region
Someone in releng repools integration-slave-docker-1043 and confirms that it works

If that all goes well then we can move the other docker nodes over in pairs without disruption.

I have no idea if the other things in that project can be similarly staged or if we should just plan a sprint during a low-traffic time. I'm open to suggestions.
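To keep track of which instances still sit outside the new region, a listing like the one above can be checked against the 172.16.0.0/21 range mentioned in the first step. This is only a minimal sketch: the function names are made up for illustration, and it assumes each row has the four whitespace-separated fields shown above (ID, name, status, `public=<address>`).

```python
import ipaddress
from typing import Tuple

# Range quoted in the task description for the new region.
NEW_REGION_NET = ipaddress.ip_network("172.16.0.0/21")


def in_new_region(address: str) -> bool:
    """Return True if the address lies inside 172.16.0.0/21."""
    return ipaddress.ip_address(address) in NEW_REGION_NET


def parse_listing_row(row: str) -> Tuple[str, str]:
    """Split one listing row into (instance name, IP address)."""
    _instance_id, name, _status, network = row.split()
    # The last field looks like "public=10.68.22.120".
    return name, network.split("=", 1)[1]


def still_to_migrate(rows):
    """Return the names of instances whose address is outside the new range."""
    pending = []
    for row in rows:
        name, address = parse_listing_row(row)
        if not in_new_region(address):
            pending.append(name)
    return pending
```

Feeding it the rows from the description would list every instance, since all of them start out on 10.68.x.x addresses.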

Details

Subject	Repo	Branch	Lines +/-
Migrate Castor to integration-castor03	integration/config	master	+12 -6
Switch to integration-publishing02	integration/config	master	+11 -11
Explain why integration-publishing IP is hardcoded	integration/config	master	+5 -0


Related Objects

Status	Assigned	Task
Open	None	T53494 Use Beta cluster as a true canary for code deployments (epic)
Open	None	T87220 Minimize infrastructure differences between Beta Cluster and production
Open	None	T196662 Set up LVS in beta like prod
Resolved	bd808	T166396 Program 1 Outcome 4: VPS hosting
Resolved	None	T167293 Nova-network to Neutron migration
Resolved	hashar	T208803 Migrate the Integration cloud project to eqiad1-r

Event Timeline

Andrew triaged this task as Medium priority. Nov 5 2018, 9:39 PM

Andrew created this task.

Restricted Application removed a project: Patch-For-Review. Nov 5 2018, 9:39 PM

I just moved integration-slave-docker-1038 and integration-slave-docker-1034 to eqiad1-r and repooled them. They look OK to me, so far.

I've moved several docker-XXXX and jessie-XXXX nodes and that seems to work fine. So I'll finish up with those on my own.

The next VMs of interest to me are integration-cumin and integration-publishing which might be more disruptive.

Change 474279 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Explain why integration-publishing IP is hardcoded

https://gerrit.wikimedia.org/r/474279

gerritbot added a project: Patch-For-Review. Nov 16 2018, 3:42 PM

So for integration-publishing, it's used from contint1001, which is a production machine. That server thus cannot resolve the DNS entry integration-publishing.integration.eqiad.wmflabs. and we have hardcoded the instance IP address in the Jenkins jobs.

I think the best way to migrate is to create a fresh instance in the new region. Once we have its IP address, we migrate the Jenkins jobs to use that new instance, then we can discard the old one.
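To illustrate the resolution problem: from a host that cannot see the wmflabs DNS zone, the lookup simply fails, which is why a literal IP is needed. The sketch below is a hypothetical helper, not what the Jenkins jobs do (they just hardcode the address); it tries the name first and falls back to a known IP. The `resolver` parameter is an assumption added so the behaviour can be exercised without network access.

```python
import socket
from typing import Callable


def resolve_or_fallback(
    hostname: str,
    fallback_ip: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve hostname, returning fallback_ip when the lookup fails."""
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror (name resolution failure) is a subclass of OSError.
        return fallback_ip
```

The downside is the same as with a hardcoded IP: the fallback address has to be updated by hand whenever the instance is recreated.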

For castor02.integration.eqiad.wmflabs, I will have to double check, but I think the jobs always use the DNS entry and always run from labs instances. So it should be fine to migrate. But castor02 is a SPOF for most of the CI jobs. We would probably want to create a new instance, rsync the caches there and switch the jobs.

For the integration-slave hosts, some have been migrated this week with @thcipriani, and they all seem fine. Or at least, no issue related to the migration has been reported to me. I am fine with the other integration-slave ones being migrated. Kudos!

Change 474279 merged by Andrew Bogott:
[integration/config@master] Explain why integration-publishing IP is hardcoded

https://gerrit.wikimedia.org/r/474279

Mentioned in SAL (#wikimedia-releng) [2018-11-19T13:21:10Z] <hashar> Created integration-publishing02 172.16.4.5 for WMCS region migration # T208803

Change 474689 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Switch to integration-publishing02

https://gerrit.wikimedia.org/r/474689

Mentioned in SAL (#wikimedia-releng) [2018-11-19T13:47:40Z] <hashar> Shutdown integration-publishing , replaced by integration-publishing02 # T208803

Change 474689 merged by jenkins-bot:
[integration/config@master] Switch to integration-publishing02

https://gerrit.wikimedia.org/r/474689

I have switched the CI jobs from integration-publishing to integration-publishing02 in the new region. I have shut down the old instance and will delete it in a few days.

For castor02, we always use the DNS entry, the only exception being in the Jenkins configuration for the node. So we can migrate like other instances. That will cause a slight downtime of CI while it is being done.

greg moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board. Nov 19 2018, 11:28 PM

Mentioned in SAL (#wikimedia-releng) [2018-11-20T14:59:24Z] <hashar> created integration-castor03.integration.eqiad.wmflabs intended as a replacement for castor02 | T208803

I will prepare the migration to integration-castor03 later tonight and attempt to do the migration tomorrow or on Thursday (easier for me)

In the end I felt that migrating to castor03 just before Thanksgiving would be a terrible idea. Postponed to next week.

ayounsi unsubscribed. Nov 24 2018, 1:28 PM

greg assigned this task to hashar. Nov 27 2018, 9:09 PM

Change 476213 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Migrate Castor to integration-castor03

https://gerrit.wikimedia.org/r/476213

Change 476213 merged by jenkins-bot:
[integration/config@master] Migrate Castor to integration-castor03

https://gerrit.wikimedia.org/r/476213

Mentioned in SAL (#wikimedia-releng) [2018-11-28T08:45:37Z] <hashar> Switching Jenkins job cache to integration-castor03 with AN EMPTY CACHE | T208803

CI jobs have been migrated to use integration-castor03. I am keeping castor02 around just in case.

Last rsync entry for castor02:

Nov 28 08:38:17 castor02 rsyncd[19693]: rsync on caches/pywikibot-core/master/pywikibot-core-tox-nose34-docker/ from integration-slave-docker-1037.integration.eqiad.wmflabs (10.68.21.155)
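To confirm that no node is still talking to the old cache host, log lines of this shape can be parsed to pull out the path, client hostname and client IP. A rough sketch, assuming the rsyncd lines always follow the format of the entry above; the function name is made up for illustration.

```python
import re
from typing import Dict, Optional

# Matches: "... rsyncd[PID]: rsync on <path> from <host> (<ip>)"
RSYNCD_LINE = re.compile(
    r"rsyncd\[\d+\]: rsync on (?P<path>\S+) from (?P<host>\S+) \((?P<ip>[^)]+)\)"
)


def parse_rsyncd_line(line: str) -> Optional[Dict[str, str]]:
    """Return path, host and ip from an rsyncd log line, or None if it does not match."""
    match = RSYNCD_LINE.search(line)
    return match.groupdict() if match else None
```

Any entry dated after the switch would point at a job or node that still uses castor02.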

integration-castor03 cache is filling up properly.

doc.wikimedia.org has received updates, so the new publishing instance is working.

Deleting castor02 and integration-publishing

Mentioned in SAL (#wikimedia-releng) [2018-11-29T08:35:37Z] <hashar> Deleted instances castor02 and integration-publishing , replaced by new instances in the new WMCS region | T208803

Thanks!

I would like to move integration-puppetmaster01 and webperformance sometime soon. Any objection?

webperformance should be easy to migrate, it runs some jobs for the performance team but apart from that it is a regular Jenkins slave. Just have to remember to update its IP address in the Jenkins slave configuration.

integration-puppetmaster01 can potentially end up being messy. But I think the instances have puppet.conf pointing to the hostname, so as long as DNS is adjusted, it should just work. I am not too worried :-)
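One way to double check that assumption before the move is to look at the `server` setting in each instance's puppet.conf and see whether it holds a hostname or a literal IP. This is a minimal sketch using Python's configparser; it assumes a plain, unindented INI-style puppet.conf with `server` in the `[agent]` or `[main]` section, and the function name is made up for illustration.

```python
import configparser
import ipaddress


def puppet_server_is_hostname(conf_text: str) -> bool:
    """Return True if puppet.conf's server setting is a name, False if it is an IP."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    for section in ("agent", "main"):
        if parser.has_option(section, "server"):
            server = parser.get(section, "server")
            try:
                ipaddress.ip_address(server)
            except ValueError:
                # Not parseable as an IP address, so treat it as a hostname.
                return True
            return False
    raise KeyError("no server setting found in [agent] or [main]")
```

If it returns False for any instance, that instance would need its puppet.conf updated after the puppetmaster changes address.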

Note: releng is having an offsite this week, so unfortunately we have little availability.

I've now moved everything in this project except for jenkinstest.integration.eqiad.wmflabs. I can't tell what that is -- should I move it blindly, or delete it, or does it need special care?

In T208803#4806941, @Andrew wrote:

I've now moved everything in this project except for jenkinstest.integration.eqiad.wmflabs. I can't tell what that is -- should I move it blindly, or delete it, or does it need special care?

I have a vague memory of this and I don't think there is any special care needed. I am reasonably certain that nothing depends on this instance and it can be moved blindly at will.

Andrew closed this task as Resolved. Dec 10 2018, 6:47 PM

Thank you for completing the migration!

As for jenkinstest, that is most probably a one-off dev instance I created. There is evidence I used it for T183569: npm 1.4.21 can't use a http proxy. There are no jobs running on it, so it can be deleted.

Success! \o/

Mentioned in SAL (#wikimedia-releng) [2018-12-11T10:03:44Z] <hashar> Deleting unused instance jenkinstest.integration.eqiad.wmflabs | T208803
