Migrate CI services from gallium to contint1001
Closed, Declined (Public)

Description

gallium is in bad shape and we have contint1001 (Jessie) available to migrate services to. Once firewall ports are open (T137323), we would want to sync Jenkins data and change all the puppet bits referencing the gallium IP address.

Event Timeline

Change 293283 had a related patch set uploaded (by Hashar):
contint: cleanup gallium / use contint1001

https://gerrit.wikimedia.org/r/293283

Change 293284 had a related patch set uploaded (by Hashar):
cache_misc: change doc/integration.wm.o backend

https://gerrit.wikimedia.org/r/293284

https://gerrit.wikimedia.org/r/#/c/293283/ against puppet.git is a beast: it basically changes all occurrences of the gallium IP address or FQDN in puppet.

I am pondering between:

  • split it into more manageable chunks and switch service after service (safe, but more preparation work)
  • stop CI again, merge it in one go and deal with issues as they come up (evil)
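
For the first option, a rough way to see how the change could be chunked is to list where the old hostname appears, grouped by top-level directory of the checkout. This is only a minimal sketch, not what the patch itself does: the function name `find_references` is hypothetical, and because the gallium IP address is not given in this task, only the bare hostname is matched.

```python
import os
import re
from collections import defaultdict

# Assumption: match the bare hostname only. The IP address is not stated
# in this task, so it is deliberately left out of the pattern.
PATTERN = re.compile(r"\bgallium\b")


def find_references(root):
    """Return {top-level entry: sorted [(relative path, line number), ...]}."""
    hits = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata so only tracked content is scanned.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            group = rel.split(os.sep)[0]
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, 1):
                        if PATTERN.search(line):
                            hits[group].append((rel, lineno))
            except OSError:
                # Unreadable files are ignored in this sketch.
                continue
    return {group: sorted(found) for group, found in hits.items()}
```

Each key of the returned dictionary is then a candidate for its own smaller patch, for example `find_references("/path/to/puppet")` on a local clone (the path is a placeholder).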

Change 293300 had a related patch set uploaded (by Paladox):
gallium is replaced by contint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/293300

Following T137323: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium), @mark stated that there should be no traffic between the private network and labs instances, which makes sense. Moreover, gallium being in production is legacy.

So either:

A) we move contint1001 to the labs support host next to scandium/labnodepool. It will then be able to communicate with labs instances.

B) we reuse scandium, which currently hosts only zuul-merger

C) we migrate the whole CI infra to labs

From the talk we had: contint1001 was set up in an emergency since gallium could have been unrecoverable. It turns out contint1001 can't reach labs instances by design, so there is not much to do with it at this point.

Depending on the outcome of T133300, we might want to decommission it.

Change 293284 abandoned by Hashar:
cache_misc: change doc/integration.wm.o backend

Reason:
I prepared this patch in case we had to switch the CI infra to contint1001 if gallium proved to be lost.

That is no longer urgent, and we are considering a better long-term plan via T133300.

https://gerrit.wikimedia.org/r/293284

Change 293283 abandoned by Hashar:
contint: cleanup gallium / use contint1001

Reason:
This was done in a rush last week to switch to contint1001. It turns out the machine is in a private LAN, which would not let us set up the service.

More discussion is happening on T133300, which would eventually lead to a similar change, but split into smaller chunks.

https://gerrit.wikimedia.org/r/293283