Phase out scandium.eqiad.wmnet
Closed, Resolved, Public

Description

scandium.eqiad.wmnet is in the labs network and is only used to run a single zuul-merger instance. We placed it in labs because it is an internal service that does not need to be reachable from the internet.

contint1001 (the Jenkins/Zuul server) has a public IP address, so labs instances will be able to reach it. Hence we should move the zuul-merger there, which will let us phase out scandium entirely.

Stretch goal: refactor the puppet zuul:merger class so we can have several instances on a single server (will address T140297).

Decommission steps

  • all system services confirmed offline from production use
  • set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove site.pp (replace with role::spare if system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • disable puppet on host
  • remove all remaining puppet references (including role::spare)
  • power down host
  • disable switch port
  • remove production dns entries
  • puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • system disks wiped by onsite
  • system unracked and decommissioned (by onsite), update racktables with result
  • switch port configuration removed from asw-a-eqiad for scandium when it is unracked
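
For reference, the scriptable part of the non-interruptible steps above could look roughly like the sketch below. It is a dry-run illustration, not the procedure actually used on this task: the helper names (run, decom_on_host, decom_on_masters) are hypothetical, and it assumes that puppet node clean / puppet node deactivate are run on the puppet master and salt-key on the salt master, which may be different machines. Switch port and DNS changes are not covered.

```shell
#!/bin/sh
# Dry-run sketch of the scriptable non-interruptible decommission steps.
# Helper names are hypothetical. By default commands are only printed;
# set DRY_RUN=0 to actually execute them.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps to run on the host being decommissioned.
decom_on_host() {
    run puppet agent --disable "decommissioning (T150936)"
    run shutdown -h now
}

# Steps to run on the puppet/salt masters once the host is powered down.
# $1 is the fully qualified name of the decommissioned host.
decom_on_masters() {
    fqdn="$1"
    run puppet node clean "$fqdn"
    run puppet node deactivate "$fqdn"
    run salt-key -y -d "$fqdn"
}
```

With the default DRY_RUN=1, calling decom_on_masters scandium.eqiad.wmnet only prints the three commands, each prefixed with "+", so the sequence can be reviewed before anything is run for real.
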
hashar created this task. Nov 17 2016, 9:45 AM
Restricted Application added a subscriber: Aklapper. Nov 17 2016, 9:45 AM
hashar triaged this task as "Normal" priority.

Change 336807 had a related patch set uploaded (by Hashar):
Add zuul-merger on contint1001 and contint2001

https://gerrit.wikimedia.org/r/336807

Change 336807 merged by Dzahn:
Add zuul-merger on contint1001 and contint2001

https://gerrit.wikimedia.org/r/336807

Dzahn added a subscriber: Dzahn. Feb 10 2017, 3:24 AM

after the merge above, puppet run on scandium is unchanged. no-op

Dzahn added a comment. Feb 10 2017, 3:54 AM

now there is just this to check

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=git_daemon

git_daemon check is added and CRIT on contint1001/2001

Change 336961 had a related patch set uploaded (by Dzahn):
zuul: add contint1001/2001 to zuul merger hosts for ferm

https://gerrit.wikimedia.org/r/336961

Change 336961 merged by Dzahn:
zuul: add contint1001/2001 to zuul merger hosts for ferm

https://gerrit.wikimedia.org/r/336961

Mentioned in SAL (#wikimedia-operations) [2017-02-10T09:51:53Z] <hashar> Reenabling puppet and zuul-merger on contint1001 and contint2001. The git-daemon is running now T140297 T150936. The 'systemctl status git-daemon' thought that the service was running when it was not (filed T157785)

We now have a zuul-merger on each of contint1001 and contint2001. Assuming they are working properly we will be able to phase out scandium.eqiad.wmnet entirely.

Change 337023 had a related patch set uploaded (by Hashar):
Remove zuul-merger from scandium.eqiad.wmnet

https://gerrit.wikimedia.org/r/337023

hashar edited projects, added Operations; removed Patch-For-Review. Feb 10 2017, 2:38 PM
hashar added a subscriber: RobH.

@Dzahn @RobH we no longer need scandium.eqiad.wmnet. It was solely running the zuul-merger service, which is now running on contint1001 and contint2001.

We would first want to remove role::zuul::merger from the host (https://gerrit.wikimedia.org/r/337023) and, once that change is merged, make sure the daemon is stopped and the package purged:

sudo systemctl stop zuul-merger
sudo dpkg --purge zuul

You will then want to ACK the alarms in Icinga or force a refresh of its configuration.

Once done, can you move the server back to spares or decommission it? Thanks!

Change 337023 merged by Dzahn:
Remove zuul-merger from scandium.eqiad.wmnet

https://gerrit.wikimedia.org/r/337023

Mentioned in SAL (#wikimedia-operations) [2017-02-10T16:15:24Z] <mutante> scandium - stopping zuul-merger service (T150936)

Change 337041 had a related patch set uploaded (by Dzahn):
CI: decom scandium

https://gerrit.wikimedia.org/r/337041

Change 337041 merged by Dzahn:
CI: decom scandium

https://gerrit.wikimedia.org/r/337041

Change 337434 had a related patch set uploaded (by Dzahn):
remove scandium, keep mgmt

https://gerrit.wikimedia.org/r/337434

Dzahn added a comment. Feb 13 2017, 6:17 PM
  • 10:07 < mutante> !log scandium - ex-zuul merger - removing from puppet, revoking puppet cert, salt key..
  • removed from Icinga

Mentioned in SAL (#wikimedia-operations) [2017-02-13T18:18:05Z] <mutante> scandium - shutdown -h now (T150936)

RobH edited the task description. Feb 13 2017, 6:32 PM

Change 337434 merged by Dzahn:
remove scandium, keep mgmt

https://gerrit.wikimedia.org/r/337434

Dzahn edited the task description. Feb 13 2017, 6:38 PM
Cmjohnson closed this task as "Resolved". Mar 3 2017, 5:43 PM

This server has been decom'd and removed from rack.