Phase out gallium.wikimedia.org
Closed, Resolved, Public

Description

gallium.wikimedia.org is years old and still runs Ubuntu Precise. We are migrating the CI services it hosts to contint1001.wikimedia.org.

Contacts

  • Antoine “hashar” Musso
  • Tyler “thcipriani” Cipriani

TL;DR:

  • merge patches for puppet cleanup
  • install zuul/jenkins on contint1001 in stopped state
  • restore contint1001 from gallium backup
  • on migration window: stop CI on gallium, start on contint1001
  • switch backend in misc varnish
  • refine/tune
  • done
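The ordering constraint in the cutover step (stop CI on gallium before starting it on contint1001) can be sketched as a dry-run plan. This is only an illustration: the hostnames come from this task, but the service names, the `Step` type and the `cutover_plan` helper are hypothetical and are not taken from the real runbook.

```python
from dataclasses import dataclass

# Hostnames come from the task description. The service names below are
# assumptions for illustration, not the actual unit names on the hosts.
OLD_HOST = "gallium.wikimedia.org"
NEW_HOST = "contint1001.wikimedia.org"
SERVICES = ["zuul", "jenkins"]


@dataclass(frozen=True)
class Step:
    host: str
    action: str  # "stop" or "start"
    service: str

    def describe(self) -> str:
        # Dry-run only: nothing is executed, the step is just described.
        return f"[dry-run] {self.host}: {self.action} {self.service}"


def cutover_plan(old_host, new_host, services):
    """Stop every service on the old host before starting any on the new one."""
    stops = [Step(old_host, "stop", name) for name in services]
    starts = [Step(new_host, "start", name) for name in services]
    return stops + starts


plan = cutover_plan(OLD_HOST, NEW_HOST, SERVICES)
for step in plan:
    print(step.describe())
```

Keeping all stops ahead of all starts avoids having two schedulers active at the same time during the window.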

Proposed window

  • Tuesday Oct 25, 9am Mountain / 15:00 UTC / 8am Pacific

The migration plan is in a Google Doc: https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# (requires a WMF Google account).

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added a subscriber: hashar.
Krinkle triaged this task as Medium priority. Apr 14 2015, 1:39 AM
Krinkle moved this task from Untriaged to Backlog on the Continuous-Integration-Infrastructure board.
Krinkle set Security to None.
hashar changed the task status from Open to Stalled. Oct 28 2015, 2:15 PM

So this needs to happen. Precise is definitely legacy and we need to migrate straight to Jessie. There are a number of challenges, though: part of what runs on gallium is unpuppetized (Jenkins), and the host mixes web publishing with the Zuul scheduler.

Would need @hashar (me) to write down what is running on the machine and file a bunch of subtasks.

Paladox raised the priority of this task from Medium to Needs Triage. Jun 8 2016, 1:30 PM

I don't know what priority to set, so I am changing it to Needs Triage for someone to choose one. T137265 is the task for migrating to the new server.

Paladox changed the task status from Stalled to Open. Jun 8 2016, 1:33 PM
Paladox triaged this task as Medium priority.
greg raised the priority of this task from Medium to High. Jun 21 2016, 9:26 PM
greg added a project: releng-201617-q1.
hashar updated the task description.

@thcipriani and I have overhauled this task. The task details highlight the migration overview.

https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# has the full detailed migration plan, especially the prerequisites that should be done beforehand.

The migration to contint1001 is scheduled for Thursday November 3rd at 9:00am PST / 16:00 UTC / 17:00 CET.
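As a quick sanity check of the announced window, the three wall-clock times can be derived from the UTC time. This is a minimal sketch using fixed offsets, which are an assumption of the example: UTC-7 for the US Pacific coast (daylight saving time was still in effect there on November 3rd, 2016) and UTC+1 for CET. The `wall_clock` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are assumptions of this sketch, valid for 2016-11-03 only.
UTC = timezone.utc
US_PACIFIC = timezone(timedelta(hours=-7))
CET = timezone(timedelta(hours=1))

# The announced migration window, expressed in UTC.
window_utc = datetime(2016, 11, 3, 16, 0, tzinfo=UTC)


def wall_clock(moment, tz):
    """Return the local HH:MM for an aware datetime in the given zone."""
    return moment.astimezone(tz).strftime("%H:%M")


pacific = wall_clock(window_utc, US_PACIFIC)
cet = wall_clock(window_utc, CET)
print(f"US Pacific {pacific} / UTC {wall_clock(window_utc, UTC)} / CET {cet}")
```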

https://wikitech.wikimedia.org/w/index.php?title=Deployments&oldid=930567#Thursday.2C.C2.A0November.C2.A003

Change 318216 had a related patch set uploaded (by Dzahn):
remove gallium from site.pp, installserver

https://gerrit.wikimedia.org/r/318216

Change 318217 had a related patch set uploaded (by Dzahn):
contint: remove gallium conditional from contint::master_dir

https://gerrit.wikimedia.org/r/318217

Change 318218 had a related patch set uploaded (by Dzahn):
nodepool: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318218

Change 318218 abandoned by Dzahn:
nodepool: switch gallium to contint1001

Reason:
duplicate of https://gerrit.wikimedia.org/r/#/c/313599/1

https://gerrit.wikimedia.org/r/318218

Change 313599 had a related patch set uploaded (by Hashar):
nodepool: point to Jenkins on contint1001

https://gerrit.wikimedia.org/r/313599

Change 318245 had a related patch set uploaded (by Dzahn):
contint: remove gallium from firewall::labs

https://gerrit.wikimedia.org/r/318245

Change 318246 had a related patch set uploaded (by Dzahn):
cache::misc: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318246

Change 318247 had a related patch set uploaded (by Dzahn):
contint: rm gallium from ferm rules in zuul::merger

https://gerrit.wikimedia.org/r/318247

Change 318248 had a related patch set uploaded (by Dzahn):
deployment-prep/integration: stop downgrading sshd MAC and KEX

https://gerrit.wikimedia.org/r/318248

Change 318249 had a related patch set uploaded (by Dzahn):
switch zuul CNAME from gallium to contint1001

https://gerrit.wikimedia.org/r/318249

Change 318250 had a related patch set uploaded (by Dzahn):
remove gallium.wikimedia.org, keep gallium.mgmt

https://gerrit.wikimedia.org/r/318250

Change 318252 had a related patch set uploaded (by Dzahn):
zuul::merger: switch gearman server to contint1001

https://gerrit.wikimedia.org/r/318252

Change 318248 abandoned by Hashar:
deployment-prep/integration: stop downgrading sshd MAC and KEX

Reason:
The issue is in Jenkins itself, not Java :D T100509

https://gerrit.wikimedia.org/r/318248

IPv6 for contint1001 while at it:

https://gerrit.wikimedia.org/r/#/c/316040/
https://gerrit.wikimedia.org/r/#/c/319258/

contint1001:~] $ host contint1001.wikimedia.org
contint1001.wikimedia.org has address 208.80.154.17
contint1001.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:17
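The two records above suggest that the IPv6 address reuses the decimal IPv4 octets as its last four groups. A minimal sketch of that mapping follows. The /64 prefix length is an assumption (the task does not state it), and `embed_ipv4` is a hypothetical helper, not part of any Wikimedia tooling.

```python
import ipaddress


def embed_ipv4(prefix, ipv4):
    """Build an IPv6 address whose last four groups spell the IPv4 octets.

    Each decimal octet is reused verbatim as a hex group, so 208 becomes
    the group "208" (0x208), not its hexadecimal value 0xd0.
    """
    network = ipaddress.IPv6Network(prefix)
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    suffix = 0
    for octet in octets:
        # Read the decimal digits as if they were hex digits.
        suffix = (suffix << 16) | int(octet, 16)
    return ipaddress.IPv6Address(int(network.network_address) | suffix)


# Prefix length /64 is assumed for illustration.
address = embed_ipv4("2620:0:861:1::/64", "208.80.154.17")
print(address)
```

The mapping keeps the IPv6 address readable next to its IPv4 counterpart, at the cost of not being a numeric conversion of the IPv4 address.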

Change 319557 had a related patch set uploaded (by Hashar):
Announce CI maintenance

https://gerrit.wikimedia.org/r/319557

Change 319557 merged by jenkins-bot:
Announce CI maintenance

https://gerrit.wikimedia.org/r/319557

Change 319584 had a related patch set uploaded (by Hashar):
contint: enable zuul::server on contint1001

https://gerrit.wikimedia.org/r/319584

Change 319584 merged by Dzahn:
contint: enable zuul::server on contint1001

https://gerrit.wikimedia.org/r/319584

Change 318252 merged by Dzahn:
zuul::merger: switch gearman server to contint1001

https://gerrit.wikimedia.org/r/318252

Change 313599 merged by Dzahn:
nodepool: point to Jenkins on contint1001

https://gerrit.wikimedia.org/r/313599

Change 318246 merged by Dzahn:
cache::misc: switch gallium to contint1001

https://gerrit.wikimedia.org/r/318246

Change 318247 merged by Dzahn:
contint: rm gallium from ferm rules in zuul::merger

https://gerrit.wikimedia.org/r/318247

Change 318249 merged by Dzahn:
switch zuul CNAME from gallium to contint1001

https://gerrit.wikimedia.org/r/318249

Change 319619 had a related patch set uploaded (by Dzahn):
integration.wm: update Apache config to 2.4

https://gerrit.wikimedia.org/r/319619

Change 319619 merged by Dzahn:
integration.wm: update Apache config to 2.4

https://gerrit.wikimedia.org/r/319619

Change 319627 had a related patch set uploaded (by Dzahn):
contint: fix Apache config of doc.wikimedia.org

https://gerrit.wikimedia.org/r/319627

Change 319627 merged by Dzahn:
contint: fix Apache config of doc.wikimedia.org

https://gerrit.wikimedia.org/r/319627

The services were successfully migrated to contint1001 a couple of hours ago.

Change 318245 merged by Dzahn:
contint: remove gallium from firewall::labs

https://gerrit.wikimedia.org/r/318245

Change 318217 merged by Dzahn:
contint: remove gallium conditional from contint::master_dir

https://gerrit.wikimedia.org/r/318217

Change 318216 merged by Dzahn:
remove gallium from site.pp, installserver

https://gerrit.wikimedia.org/r/318216

Mentioned in SAL (#wikimedia-operations) [2016-11-08T21:40:37Z] <mutante> gallium, ex-CI server, shutdown -h now (the contents of your home dir have been copied to contint1001 in /home/gallium-home/) (T95757)

Thank you @Dzahn :] It is all done from my point of view.

Change 318250 merged by Dzahn:
remove gallium.wikimedia.org, keep gallium.mgmt

https://gerrit.wikimedia.org/r/318250

Mentioned in SAL (#wikimedia-operations) [2016-11-09T03:44:11Z] <mutante> gallium.wikimedia.org removed from DNS (T95757)

All done on my side as well now :) Created a subtask for datacenter ops to take care of the rest (wipe disks, physical decom, racktables, recycling...)

Mentioned in SAL (#wikimedia-releng) [2016-12-16T13:45:38Z] <hashar> integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757