Phase out gallium.wikimedia.org
Closed, Resolved · Public

Description

gallium.wikimedia.org is years old and running Precise. We are migrating the CI services it hosts to contint1001.wikimedia.org.

Contacts

  • Antoine “hashar” Musso
  • Tyler “thcipriani” Cipriani

TL;DR:

  • merge patches for puppet cleanup
  • install zuul/jenkins on contint1001 in stopped state
  • restore contint1001 from gallium backup
  • during the migration window: stop CI on gallium, start it on contint1001
  • switch backend in misc varnish
  • refine/tune
  • done
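
The cutover part of the list above can be read as an ordered sequence in which each step only runs once the previous one has succeeded. That way CI on contint1001 is never started while CI on gallium is still running. The following is a minimal sketch of that ordering, not the procedure actually used: `run_cutover`, `CUTOVER_STEPS` and the placeholder actions are hypothetical names introduced for illustration only.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_cutover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failure.

    Returns (labels of completed steps, label of the failed step or None).
    """
    completed: List[str] = []
    for label, action in steps:
        if not action():
            return completed, label
        completed.append(label)
    return completed, None


# Hypothetical placeholders: each lambda stands in for a real operator action.
CUTOVER_STEPS: List[Step] = [
    ("stop CI on gallium", lambda: True),
    ("start CI on contint1001", lambda: True),
    ("switch backend in misc varnish", lambda: True),
]
```

Calling `run_cutover(CUTOVER_STEPS)` returns the completed labels and `None` when every step succeeds. If a step fails, the later steps are skipped and the failed label is returned, so the operator knows where to resume or roll back.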

Proposed window

  • Tuesday Oct 25 9am mountain - 15:00 UTC - 8am PST

The migration plan is in a Google Doc: https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# (requires a WMF Google account).

hashar updated the task description.
hashar raised the priority of this task to Needs Triage.
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. Apr 10 2015, 8:24 PM
Krinkle triaged this task as Normal priority.
Krinkle set Security to None.
hashar changed the task status from Open to Stalled. Oct 28 2015, 2:15 PM

So this needs to happen. Precise is definitely legacy and we need to migrate straight to Jessie. There are a number of challenges, though: part of what runs on gallium is unpuppetized (Jenkins), and the host mixes web publishing with the Zuul scheduler.

@hashar (me) needs to write down what is running on the machine and file a set of subtasks.

We have created a subproject of Continuous-Integration-Infrastructure to track and organize the various tasks: https://phabricator.wikimedia.org/project/view/1966/

fgiunchedi changed the status of subtask T133150: Move gallium to an internal host? from Open to Stalled. Apr 28 2016, 9:55 AM
Paladox raised the priority of this task from Normal to Needs Triage.

I don't know what priority to set, so I am moving this to Needs Triage for someone else to choose one. Note that T137265 is the task for migrating to the new server.

Paladox triaged this task as Normal priority. Jun 8 2016, 1:33 PM
Paladox changed the task status from Stalled to Open.
jayvdb added a subscriber: jayvdb. Jun 8 2016, 1:35 PM
greg raised the priority of this task from Normal to High.
hashar updated the task description. Oct 4 2016, 3:26 PM
hashar updated the task description.

@thcipriani and I have overhauled this task. The task description now gives an overview of the migration.

https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# has the full detailed migration plan, especially the prerequisites that should be completed beforehand.

The migration to contint1001 is scheduled for Thursday November 3rd at 9:00am PST / 16:00 UTC / 17:00 CET.

https://wikitech.wikimedia.org/w/index.php?title=Deployments&oldid=930567#Thursday.2C.C2.A0November.C2.A003

Change 318216 had a related patch set uploaded (by Dzahn):
remove gallium from site.pp, installserver

https://gerrit.wikimedia.org/r/318216

Change 318217 had a related patch set uploaded (by Dzahn):
contint: remove gallium conditional from contint::master_dir

https://gerrit.wikimedia.org/r/318217

Change 318218 had a related patch set uploaded (by Dzahn):
nodepool: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318218

Change 318218 abandoned by Dzahn:
nodepool: switch gallium to contint1001

Reason:
duplicate of https://gerrit.wikimedia.org/r/#/c/313599/1

https://gerrit.wikimedia.org/r/318218

Change 313599 had a related patch set uploaded (by Hashar):
nodepool: point to Jenkins on contint1001

https://gerrit.wikimedia.org/r/313599

Change 318245 had a related patch set uploaded (by Dzahn):
contint: remove gallium from firewall::labs

https://gerrit.wikimedia.org/r/318245

Change 318246 had a related patch set uploaded (by Dzahn):
cache::misc: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318246

Change 318247 had a related patch set uploaded (by Dzahn):
contint: rm gallium from ferm rules in zuul::merger

https://gerrit.wikimedia.org/r/318247

Change 318248 had a related patch set uploaded (by Dzahn):
deployment-prep/integration: stop downgrading sshd MAC and KEX

https://gerrit.wikimedia.org/r/318248

Change 318249 had a related patch set uploaded (by Dzahn):
switch zuul CNAME from gallium to contint1001

https://gerrit.wikimedia.org/r/318249

Change 318250 had a related patch set uploaded (by Dzahn):
remove gallium.wikimedia.org, keep gallium.mgmt

https://gerrit.wikimedia.org/r/318250

Change 318252 had a related patch set uploaded (by Dzahn):
zuul::merger: switch gearman server to contint1001

https://gerrit.wikimedia.org/r/318252

Change 318248 abandoned by Hashar:
deployment-prep/integration: stop downgrading sshd MAC and KEX

Reason:
The issue is in Jenkins itself, not Java :D T100509

https://gerrit.wikimedia.org/r/318248

Dzahn added a comment. Nov 2 2016, 12:53 AM

IPv6 for contint1001 while at it:

https://gerrit.wikimedia.org/r/#/c/316040/
https://gerrit.wikimedia.org/r/#/c/319258/

[contint1001:~] $ host contint1001.wikimedia.org
contint1001.wikimedia.org has address 208.80.154.17
contint1001.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:17
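
The same check (does the host have both an IPv4 and an IPv6 address?) can also be scripted. The sketch below uses the standard library `socket.getaddrinfo`. The function name `resolve_by_family` and its `resolver` parameter are assumptions made for illustration, and the result depends on whatever the local resolver returns at the time it is run.

```python
import socket


def resolve_by_family(hostname, resolver=socket.getaddrinfo):
    """Group the addresses a hostname resolves to by record type.

    Returns a dict such as {"A": [...], "AAAA": [...]}.
    `resolver` can be swapped out so the function works without network access.
    """
    records = {"A": [], "AAAA": []}
    for family, _type, _proto, _canon, sockaddr in resolver(
        hostname, None, proto=socket.IPPROTO_TCP
    ):
        if family == socket.AF_INET:
            key = "A"
        elif family == socket.AF_INET6:
            key = "AAAA"
        else:
            continue  # ignore other address families
        address = sockaddr[0]
        if address not in records[key]:
            records[key].append(address)
    return records


# Example (needs working DNS, so it is left as a comment):
# resolve_by_family("contint1001.wikimedia.org")
```

An empty `"AAAA"` list would indicate that the IPv6 record is missing, or that the local resolver does not return it.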

Change 319557 had a related patch set uploaded (by Hashar):
Announce CI maintenance

https://gerrit.wikimedia.org/r/319557

Change 319557 merged by jenkins-bot:
Announce CI maintenance

https://gerrit.wikimedia.org/r/319557

Change 319584 had a related patch set uploaded (by Hashar):
contint: enable zuul::server on contint1001

https://gerrit.wikimedia.org/r/319584

Change 319584 merged by Dzahn:
contint: enable zuul::server on contint1001

https://gerrit.wikimedia.org/r/319584

Change 318252 merged by Dzahn:
zuul::merger: switch gearman server to contint1001

https://gerrit.wikimedia.org/r/318252

Change 313599 merged by Dzahn:
nodepool: point to Jenkins on contint1001

https://gerrit.wikimedia.org/r/313599

Change 318246 merged by Dzahn:
cache::misc: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318246

Change 318247 merged by Dzahn:
contint: rm gallium from ferm rules in zuul::merger

https://gerrit.wikimedia.org/r/318247

Change 318249 merged by Dzahn:
switch zuul CNAME from gallium to contint1001

https://gerrit.wikimedia.org/r/318249

Change 319619 had a related patch set uploaded (by Dzahn):
integration.wm: update Apache config to 2.4

https://gerrit.wikimedia.org/r/319619

Change 319619 merged by Dzahn:
integration.wm: update Apache config to 2.4

https://gerrit.wikimedia.org/r/319619

Change 319627 had a related patch set uploaded (by Dzahn):
contint: fix Apache config of doc.wikimedia.org

https://gerrit.wikimedia.org/r/319627

Change 319627 merged by Dzahn:
contint: fix Apache config of doc.wikimedia.org

https://gerrit.wikimedia.org/r/319627

Dzahn claimed this task. Nov 3 2016, 7:18 PM

The services were successfully migrated to contint1001 a couple of hours ago.

Change 318245 merged by Dzahn:
contint: remove gallium from firewall::labs

https://gerrit.wikimedia.org/r/318245

Change 318217 merged by Dzahn:
contint: remove gallium conditional from contint::master_dir

https://gerrit.wikimedia.org/r/318217

Change 318216 merged by Dzahn:
remove gallium from site.pp, installserver

https://gerrit.wikimedia.org/r/318216

Mentioned in SAL (#wikimedia-operations) [2016-11-08T21:40:37Z] <mutante> gallium, ex-CI server, shutdown -h now (the contents of your home dir have been copied to contint1001 in /home/gallium-home/) (T95757)

Thank you @Dzahn :] It is all done from my point of view.

Change 318250 merged by Dzahn:
remove gallium.wikimedia.org, keep gallium.mgmt

https://gerrit.wikimedia.org/r/318250

Mentioned in SAL (#wikimedia-operations) [2016-11-09T03:44:11Z] <mutante> gallium.wikimedia.org removed from DNS (T95757)

Dzahn closed this task as Resolved. Nov 9 2016, 3:53 AM

All done on my side as well now :) I created a subtask for datacenter ops to take care of the rest (wipe disks, physical decom, racktables, recycling...).

Mentioned in SAL (#wikimedia-releng) [2016-12-16T13:45:38Z] <hashar> integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757