reimage WMF6937/mw1298
Closed, Resolved (Public)

Description

This task will track the decommission of wmf6937 as phab1002 and then its reimage as mw1298.

Background: This system was taken from the mw pool for use as phab1002, but it turns out they need 64GB of RAM, not 32GB, so it is now being returned to its original service pool in the mw systems.

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as it will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update switch port to mw1298 description, ensure internal vlan, set disabled until reimage is ready
  • - switch port assignment noted on this task
  • - remove all remaining puppet references (include role::spare)
  • - update production dns entries to mw1298, update mgmt dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)
  • - change hostname in netbox to mw1298

End non-interrupt steps.
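
The debmonitor cleanup in the list above is normally handled by wmf-decommission-host. As a rough sketch of what the manual command expands to for a given host, the argument list could be built as below. The function name `debmonitor_delete_cmd` is hypothetical, the URL and certificate paths are copied from the step above, and the FQDN used in the usage note is an assumption (the old hostname in the eqiad domain).

```python
import shlex

# URL prefix and certificate paths are copied from the checklist step.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_cmd(host_fqdn):
    """Return the argv list for the manual debmonitor DELETE request.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    # Refuse empty or odd-looking names so the URL is never the bare /hosts/ path.
    if not host_fqdn or "/" in host_fqdn or " " in host_fqdn:
        raise ValueError(f"suspicious host FQDN: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_HOSTS_URL + host_fqdn,
        "--cert", CERT,
        "--key", KEY,
    ]


def debmonitor_delete_shell(host_fqdn):
    """Same command as a single shell-quoted string, for pasting or logging."""
    return shlex.join(debmonitor_delete_cmd(host_fqdn))
```

Calling `debmonitor_delete_shell("phab1002.eqiad.wmnet")` yields the one-line command from the checklist with `${HOST_FQDN}` filled in; building it as a list first keeps the quoting safe if it is later passed to `subprocess.run`.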

  • - update physical hostname label to mw1298
  • - system disks wiped (by onsite)
  • - proceed to re-image steps

Install Checklist

  • - check all dns entries for accuracy
  • - check/update operations/puppet repo for new hostname and role
  • - install/reimage
  • - handoff to appserver handlers for role assignment (imagescaler, mw, that kind of thing)
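
For the "check all dns entries for accuracy" step, one possible spot check is to confirm that the forward and reverse records agree on the new name. The sketch below is only an illustration, not the procedure DC-Ops actually uses; `check_dns` is a hypothetical helper, and the lookup functions are injectable so it can be exercised without touching real DNS.

```python
import socket


def check_dns(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found for fqdn; an empty list means consistent.

    forward(name) must return an IPv4 address string; reverse(ip) must return
    a (hostname, aliases, addresses) tuple, like the socket functions do.
    """
    try:
        ip = forward(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"{fqdn}: forward lookup failed ({exc})"]
    try:
        ptr_name = reverse(ip)[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        return [f"{fqdn}: reverse lookup of {ip} failed ({exc})"]
    # Compare case-insensitively and ignore a trailing root dot.
    if ptr_name.rstrip(".").lower() != fqdn.rstrip(".").lower():
        return [f"{fqdn}: {ip} points back to {ptr_name}"]
    return []
```

Run from a host that can resolve internal names, `check_dns("mw1298.eqiad.wmnet")` would flag, for example, a PTR record that still carries the old phab1002 name. The mgmt entry would need the same check under its own name, which is not spelled out in this task.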

Event Timeline

RobH triaged this task as Medium priority. Feb 5 2019, 7:30 PM
RobH created this task.
Dzahn added a comment. Feb 5 2019, 10:49 PM

@RobH Want me to take it for the first couple check boxes and then give it back?

Change 496116 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] set phab1002 as a spare::system

https://gerrit.wikimedia.org/r/496116

Dzahn added a comment. Mar 22 2019, 3:57 PM

T215335 is unblocked again. If we can do that and assign it to me as phab1003 then doing this decom task would be slightly easier. Then i wouldn't have to first remove phab1002 completely and then re-add phab1003 in phabricator config but could just replace it. Happy to take the first part of this decom task together with the new box.

RobH renamed this task from "decommission wmf6937 as phab1002, reimage as mw1298" to "reimage WMF6937/mw1298". Apr 18 2019, 4:27 PM
RobH reassigned this task from RobH to jijiki. Apr 18 2019, 6:37 PM

So this would be ideal to reimage as a thumbor server to replace thumbor1004 via T221132.

@effie: would this work as the thumbor replacement? (I think it would, but I don't want to make decisions for your team's server allocations/service clusters.)

Change 496116 abandoned by Dzahn:
set phab1002 as a spare::system

Reason:
duplicate, done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/504959

https://gerrit.wikimedia.org/r/496116

@jijiki @RobH Is this happening and i should skip mw1298 on T192457? Or is it going to be another host or none?

@Dzahn I need to talk with our team before I green light this, as also mentioned in T221132. Is it possible to revisit this in a week from now? Thank you!

@jijiki Yes, of course it can wait, i just realized again the holiday situation.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1298.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909181848_dzahn_252267_mw1298_eqiad_wmnet.log.

Change 537658 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: allocate mw1298 as a jobrunner, remove as spare

https://gerrit.wikimedia.org/r/537658

Completed auto-reimage of hosts:

['mw1298.eqiad.wmnet']

and were ALL successful.

Dzahn added a comment. Sep 18 2019, 8:58 PM

phab1002 has already been gone from the repo for a while. This host is called mw1298 and I just reinstalled it again just in case. What is there really left to do here?

Dzahn updated the task description. Sep 18 2019, 9:04 PM

Looks to me like no other steps are needed for decom, besides maybe checking the switch port description and physical label.

Next we can apply a mediawiki role (jobrunner) again and re-add it to confctl and finally pool it.

Change 537658 merged by Dzahn:
[operations/puppet@production] site: allocate mw1298 as a jobrunner, add to conftool

https://gerrit.wikimedia.org/r/537658

Dzahn reassigned this task from jijiki to RobH. Sep 24 2019, 10:35 PM
Dzahn added a subscriber: jijiki.

@RobH I think all this needs is a quick check if switch port label and physical label are mw1298 and if it is this can be closed. This is back to being mw1298 and in prod.

Dzahn updated the task description. Sep 24 2019, 10:35 PM
Dzahn added a comment. Edited Sep 24 2019, 10:38 PM

So to be clear. Despite what the checkboxes say we do NOT need to disable puppet on this host and take it down.

It is back in production as an mw appserver as it was in the past.

It is only about confirming this host, WMF6937, is called mw1298 everywhere and not phab1002. The rest is then resolved/invalid and can be closed.

There is also T221391 for decom of phab1002.

RobH reassigned this task from RobH to Cmjohnson. Sep 24 2019, 10:39 PM
RobH added subscribers: Jclark-ctr, Cmjohnson.

@RobH I think all this needs is a quick check if switch port label and physical label are mw1298 and if it is this can be closed. This is back to being mw1298 and in prod.

The switch port label has been updated. As there is no comment about the physical label on this or any sub-task, I would surmise it has NOT been done.

@Cmjohnson or @Jclark-ctr: Please check mw1298 https://netbox.wikimedia.org/dcim/devices/962/ and place a new hostname label on it (as it likely doesn't have the right name currently.)

Once done, you can resolve this task, as it seems the other steps were done (but no one checked the boxes, which I dislike.)

Thanks!

Cmjohnson closed this task as Resolved. Nov 13 2019, 5:12 PM

The label has been done

Change 552884 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool: un-comment mw1298, add back to pool

https://gerrit.wikimedia.org/r/552884

Change 552884 merged by Dzahn:
[operations/puppet@production] conftool: un-comment mw1298, add back to pool

https://gerrit.wikimedia.org/r/552884

Change 552888 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool: move mw1298 to the jobrunner section

https://gerrit.wikimedia.org/r/552888

Change 552888 merged by Dzahn:
[operations/puppet@production] conftool: move mw1298 to the jobrunner section

https://gerrit.wikimedia.org/r/552888

Dzahn added a comment. Nov 25 2019, 9:14 PM

mw1298 is now back here, with weight 10 as a jobrunner

https://config-master.wikimedia.org/pybal/eqiad/jobrunner
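
To confirm a host's weight and pooled state from a listing like the one linked above, a small parser along these lines might help. It rests on an assumption: that the listing holds one Python-style dict literal per line with `host`, `weight` and `enabled` keys, and that commented-out lines start with `#`. The sample text is illustrative only and was not copied from config-master; `parse_pool` is a hypothetical helper.

```python
import ast


def parse_pool(text):
    """Map host -> entry dict for each non-comment line of a pool listing.

    Assumes one dict literal per line; this format is an assumption,
    not something stated in the task.
    """
    pool = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        entry = ast.literal_eval(line)  # parses literals only, never runs code
        pool[entry["host"]] = entry
    return pool


# Illustrative sample mirroring the comment above (weight 10, jobrunner pool).
SAMPLE = """
# illustrative content only, not copied from config-master
{ 'host': 'mw1298.eqiad.wmnet', 'weight': 10, 'enabled': True }
"""

entry = parse_pool(SAMPLE)["mw1298.eqiad.wmnet"]
print(entry["weight"], entry["enabled"])
```

Using `ast.literal_eval` rather than `eval` means a malformed or unexpected line raises an error instead of executing anything.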