Phase out gallium.wikimedia.org
Closed, ResolvedPublic

Description

gallium.wikimedia.org is years old and running Precise. We are migrating the CI services it hosts to contint1001.wikimedia.org.

Contacts

  • Antoine “hashar” Musso
  • Tyler “thcipriani” Cipriani

TL;DR:

  • merge patches for puppet cleanup
  • install zuul/jenkins on contint1001 in stopped state
  • restore contint1001 from gallium backup
  • on migration window: stop CI on gallium, start on contint1001
  • switch backend in misc varnish
  • refine/tune
  • done
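
The TL;DR steps above can be read as an ordered runbook. The following is a minimal sketch only: the step labels are copied or paraphrased from the list, `check_order` and `dry_run` are hypothetical helpers, and nothing here calls real tooling. It just encodes the one ordering constraint the list states, namely that CI stops on gallium before it starts on contint1001.

```python
# Ordered steps, mirroring the TL;DR list (labels are illustrative only).
RUNBOOK = [
    "merge patches for puppet cleanup",
    "install zuul/jenkins on contint1001 in stopped state",
    "restore contint1001 from gallium backup",
    "stop CI on gallium",
    "start CI on contint1001",
    "switch backend in misc varnish",
    "refine/tune",
]


def check_order(steps, first, second):
    """Return True if `first` appears before `second` in `steps`."""
    return steps.index(first) < steps.index(second)


def dry_run(steps):
    """Return the numbered plan as text lines without executing anything."""
    return [f"{i}. {step}" for i, step in enumerate(steps, start=1)]


if __name__ == "__main__":
    # Both CI masters must never run at the same time.
    assert check_order(RUNBOOK, "stop CI on gallium", "start CI on contint1001")
    print("\n".join(dry_run(RUNBOOK)))
```

Keeping the plan as plain data makes it easy to review the order before the window, without touching either host.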

Proposed window

  • Tuesday Oct 25, 9am Mountain / 15:00 UTC / 8am PST

The migration plan is in a Google Doc: https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# (requires a WMF Google account).

Event Timeline

hashar raised the priority of this task to Needs Triage.
hashar updated the task description.
hashar subscribed.
Krinkle triaged this task as Medium priority. Apr 14 2015, 1:39 AM
Krinkle moved this task from Untriaged to Backlog on the Continuous-Integration-Infrastructure board.
Krinkle set Security to None.
hashar changed the task status from Open to Stalled. Oct 28 2015, 2:15 PM

So this needs to happen. Precise is definitely legacy and we need to migrate straight to Jessie. There are a number of challenges, though: part of what runs on gallium is unpuppetized (Jenkins), and the host mixes web publishing with the Zuul scheduler.

@hashar (me) would need to write down what is running on the machine and file a bunch of subtasks.

Paladox raised the priority of this task from Medium to Needs Triage. Jun 8 2016, 1:30 PM

I don't know what priority to set, so I am changing it to Needs Triage for someone to choose one. T137265 is the task for migrating to the new server.

Paladox changed the task status from Stalled to Open. Jun 8 2016, 1:33 PM
Paladox triaged this task as Medium priority.
greg raised the priority of this task from Medium to High. Jun 21 2016, 9:26 PM
greg added a project: releng-201617-q1.
hashar updated the task description.

@thcipriani and I have overhauled this task. The task details highlight the migration overview.

https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# has the whole detailed migration plan, especially the prerequisites that should be done beforehand.

The migration to contint1001 is scheduled for Thursday November 3rd at 9:00am PST / 16:00 UTC / 17:00 CET.
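
As a sanity check on the three times quoted above, here is a minimal sketch that converts the window between zones. It rests on assumptions: the year 2016 is inferred from the SAL timestamps later in this task, and the fixed offsets are typed in by hand rather than looked up (on that date the US West Coast was still on daylight time, UTC-7, so "9:00am PST" is treated here as 9:00 Pacific local time, and central Europe was on CET, UTC+1). `local_times` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for 3 November 2016 (hand-entered, not looked up).
UTC = timezone.utc
PACIFIC = timezone(timedelta(hours=-7), "PDT")
CET = timezone(timedelta(hours=1), "CET")

# The announced window, expressed in UTC.
WINDOW = datetime(2016, 11, 3, 16, 0, tzinfo=UTC)


def local_times(moment):
    """Map each zone name to the wall-clock time of `moment` in that zone."""
    return {
        tz.tzname(None): moment.astimezone(tz).strftime("%H:%M")
        for tz in (PACIFIC, UTC, CET)
    }


if __name__ == "__main__":
    for zone, clock in local_times(WINDOW).items():
        print(f"{zone}: {clock}")
```

Under those offsets the three announced times refer to the same instant.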

https://wikitech.wikimedia.org/w/index.php?title=Deployments&oldid=930567#Thursday.2C.C2.A0November.C2.A003

Change 318216 had a related patch set uploaded (by Dzahn):
remove gallium from site.pp, installserver

https://gerrit.wikimedia.org/r/318216

Change 318217 had a related patch set uploaded (by Dzahn):
contint: remove gallium conditional from contint::master_dir

https://gerrit.wikimedia.org/r/318217

Change 318218 had a related patch set uploaded (by Dzahn):
nodepool: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318218

Change 318218 abandoned by Dzahn:
nodepool: switch gallium to contint1001

Reason:
duplicate of https://gerrit.wikimedia.org/r/#/c/313599/1

https://gerrit.wikimedia.org/r/318218

Change 313599 had a related patch set uploaded (by Hashar):
nodepool: point to Jenkins on contint1001

https://gerrit.wikimedia.org/r/313599

Change 318245 had a related patch set uploaded (by Dzahn):
contint: remove gallium from firewall::labs

https://gerrit.wikimedia.org/r/318245

Change 318246 had a related patch set uploaded (by Dzahn):
cache::misc: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318246

Change 318247 had a related patch set uploaded (by Dzahn):
contint: rm gallium from ferm rules in zuul::merger

https://gerrit.wikimedia.org/r/318247

Change 318248 had a related patch set uploaded (by Dzahn):
deployment-prep/integration: stop downgrading sshd MAC and KEX

https://gerrit.wikimedia.org/r/318248

Change 318249 had a related patch set uploaded (by Dzahn):
switch zuul CNAME from gallium to contint1001

https://gerrit.wikimedia.org/r/318249

Change 318250 had a related patch set uploaded (by Dzahn):
remove gallium.wikimedia.org, keep gallium.mgmt

https://gerrit.wikimedia.org/r/318250

Change 318252 had a related patch set uploaded (by Dzahn):
zuul::merger: switch gearman server to contint1001

https://gerrit.wikimedia.org/r/318252

Change 318248 abandoned by Hashar:
deployment-prep/integration: stop downgrading sshd MAC and KEX

Reason:
The issue is in Jenkins itself, not Java :D T100509

https://gerrit.wikimedia.org/r/318248

IPv6 for contint1001 while at it:

https://gerrit.wikimedia.org/r/#/c/316040/
https://gerrit.wikimedia.org/r/#/c/319258/

contint1001:~] $ host contint1001.wikimedia.org
contint1001.wikimedia.org has address 208.80.154.17
contint1001.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:17
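
In the output above, the last four groups of the IPv6 address repeat the IPv4 octets. A minimal sketch of that mapping, assuming the convention holds in general and that the prefix is `2620:0:861:1` (read off the output, not taken from any configuration); `embed_ipv4` is a hypothetical helper, not part of any existing tooling:

```python
import ipaddress


def embed_ipv4(prefix: str, ipv4: str) -> str:
    """Append the decimal IPv4 octets as the last four IPv6 groups.

    Assumption: `prefix` holds the first four groups, e.g. "2620:0:861:1".
    """
    # Parsing first rejects malformed IPv4 input.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    candidate = f"{prefix}:{':'.join(octets)}"
    # Round-trip through ipaddress to confirm the result is a valid address.
    return str(ipaddress.IPv6Address(candidate))


if __name__ == "__main__":
    print(embed_ipv4("2620:0:861:1", "208.80.154.17"))
    # prints 2620:0:861:1:208:80:154:17
```

The octets are written as decimal digits but interpreted as hexadecimal groups, so the result is readable by eye rather than numerically equal to the IPv4 address.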

Change 319557 had a related patch set uploaded (by Hashar):
Announce CI maintenance

https://gerrit.wikimedia.org/r/319557

Change 319557 merged by jenkins-bot:
Announce CI maintenance

https://gerrit.wikimedia.org/r/319557

Change 319584 had a related patch set uploaded (by Hashar):
contint: enable zuul::server on contint1001

https://gerrit.wikimedia.org/r/319584

Change 319584 merged by Dzahn:
contint: enable zuul::server on contint1001

https://gerrit.wikimedia.org/r/319584

Change 318252 merged by Dzahn:
zuul::merger: switch gearman server to contint1001

https://gerrit.wikimedia.org/r/318252

Change 313599 merged by Dzahn:
nodepool: point to Jenkins on contint1001

https://gerrit.wikimedia.org/r/313599

Change 318246 merged by Dzahn:
cache::misc: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318246

Change 318247 merged by Dzahn:
contint: rm gallium from ferm rules in zuul::merger

https://gerrit.wikimedia.org/r/318247

Change 318249 merged by Dzahn:
switch zuul CNAME from gallium to contint1001

https://gerrit.wikimedia.org/r/318249

Change 319619 had a related patch set uploaded (by Dzahn):
integration.wm: update Apache config to 2.4

https://gerrit.wikimedia.org/r/319619

Change 319619 merged by Dzahn:
integration.wm: update Apache config to 2.4

https://gerrit.wikimedia.org/r/319619

Change 319627 had a related patch set uploaded (by Dzahn):
contint: fix Apache config of doc.wikimedia.org

https://gerrit.wikimedia.org/r/319627

Change 319627 merged by Dzahn:
contint: fix Apache config of doc.wikimedia.org

https://gerrit.wikimedia.org/r/319627

The services were successfully migrated to contint1001 a couple of hours ago.

Change 318245 merged by Dzahn:
contint: remove gallium from firewall::labs

https://gerrit.wikimedia.org/r/318245

Change 318217 merged by Dzahn:
contint: remove gallium conditional from contint::master_dir

https://gerrit.wikimedia.org/r/318217

Change 318216 merged by Dzahn:
remove gallium from site.pp, installserver

https://gerrit.wikimedia.org/r/318216

Mentioned in SAL (#wikimedia-operations) [2016-11-08T21:40:37Z] <mutante> gallium, ex-CI server, shutdown -h now (the contents of your home dir have been copied to contint1001 in /home/gallium-home/) (T95757)

Thank you @Dzahn :] It is all done from my point of view.

Change 318250 merged by Dzahn:
remove gallium.wikimedia.org, keep gallium.mgmt

https://gerrit.wikimedia.org/r/318250

Mentioned in SAL (#wikimedia-operations) [2016-11-09T03:44:11Z] <mutante> gallium.wikimedia.org removed from DNS (T95757)

All done on my side as well now :) Created a subtask for datacenter ops to take care of the rest (wipe disks, physical decom, racktables, recycling...).

Mentioned in SAL (#wikimedia-releng) [2016-12-16T13:45:38Z] <hashar> integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757