decom ms-be201[345]
Open, Normal, Public

Description

ms-be2013.codfw.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

ms-be2014.codfw.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

ms-be2015.codfw.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)
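
The debmonitor cleanup step in the checklists above takes one fully qualified hostname at a time, while the task title uses the `ms-be201[345]` shorthand. A minimal sketch of expanding that shorthand and building the per-host command line, assuming the shorthand contains a single bracket group of single-character alternatives and that all hosts live under `codfw.wmnet` as shown above. The helper names `expand_hosts` and `debmonitor_delete_cmd` are hypothetical and are not part of wmf-decommission-host; the sketch only builds the commands and does not run them.

```python
import re

# URL and certificate paths are copied from the checklist step above.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"

# Exactly one bracket group, e.g. 'ms-be201[345]' or 'ms-be201[3,5]'.
_GROUP = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)")


def expand_hosts(pattern, domain="codfw.wmnet"):
    """Expand a single-bracket shorthand into a list of FQDNs (hypothetical helper)."""
    if "[" not in pattern and "]" not in pattern:
        # No shorthand: treat the input as one short hostname.
        return [f"{pattern}.{domain}"]
    match = _GROUP.fullmatch(pattern)
    if match is None:
        # Ranges of groups such as 'ms-be[12]01[345]' are out of scope here.
        raise ValueError(f"unsupported host pattern: {pattern!r}")
    prefix, choices, suffix = match.groups()
    # Commas are treated as separators, every other character as one alternative.
    return [f"{prefix}{c}{suffix}.{domain}" for c in choices if c != ","]


def debmonitor_delete_cmd(fqdn):
    """Return the argv list for the DELETE request quoted in the checklist."""
    return ["sudo", "curl", "-X", "DELETE", DEBMONITOR_URL + fqdn,
            "--cert", CERT, "--key", KEY]


if __name__ == "__main__":
    for host in expand_hosts("ms-be201[345]"):
        print(" ".join(debmonitor_delete_cmd(host)))
```

Returning an argv list rather than a shell string keeps the hostname out of shell interpolation if the command is later passed to a process runner.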

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 16 2019, 10:45 AM
colewhite triaged this task as Normal priority. Apr 16 2019, 3:39 PM
CDanis added a subscriber: CDanis. Apr 23 2019, 5:22 PM

Change 505888 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] codfw decom: halve non-object weights and 2/3rds object weights

https://gerrit.wikimedia.org/r/505888

Change 505888 merged by CDanis:
[operations/software/swift-ring@master] codfw decom: halve non-object weights and 2/3rds object weights

https://gerrit.wikimedia.org/r/505888

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Apr 24 2019, 2:03 PM

Change 506469 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] codfw decom: halve weights again

https://gerrit.wikimedia.org/r/506469

Change 506469 merged by CDanis:
[operations/software/swift-ring@master] codfw decom: halve weights again

https://gerrit.wikimedia.org/r/506469

Change 507396 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] swift: codfw: bump replicate concurrency for decomm hosts

https://gerrit.wikimedia.org/r/507396

Change 507396 merged by CDanis:
[operations/puppet@production] swift: codfw: bump replicate concurrency for decomm hosts

https://gerrit.wikimedia.org/r/507396

Mentioned in SAL (#wikimedia-operations) [2019-04-30T18:24:08Z] <cdanis> running puppet on ms-be2014 to bump replication concurrency T221068

Mentioned in SAL (#wikimedia-operations) [2019-04-30T18:40:50Z] <cdanis> running puppet on ms-be201[3,5] to bump replication concurrency T221068

Change 508899 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] swift codfw-prod: decomm ms-be201{3,4,5}: 0 weight

https://gerrit.wikimedia.org/r/508899

Change 508899 merged by CDanis:
[operations/software/swift-ring@master] swift codfw-prod: decomm ms-be201{3,4,5}: 0 weight

https://gerrit.wikimedia.org/r/508899

Mentioned in SAL (#wikimedia-operations) [2019-05-08T19:21:49Z] <cdanis> swift codfw-prod: deploy I59c88aed T221068
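
The ring changes above lower the weights of the three hosts in stages instead of removing them in one step, so data drains off gradually. A minimal sketch of that schedule, assuming "2/3rds" means scaling object weights to two thirds of their current value and that "halve weights again" applies to both object and non-object rings. The starting weights are hypothetical placeholders, not the real ring values, and `drain_schedule` is an illustrative name rather than anything in swift-ring.

```python
from fractions import Fraction

# Scaling factor applied by each change, read from the commit subjects above
# (assumed interpretation): first reduction, second halving, then 0 weight.
OBJECT_STEPS = [Fraction(2, 3), Fraction(1, 2), Fraction(0)]
NON_OBJECT_STEPS = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]


def drain_schedule(initial_weight, steps):
    """Return the device weight after each successive scaling step."""
    weights = []
    current = Fraction(initial_weight)
    for factor in steps:
        current *= factor
        weights.append(current)
    return weights


if __name__ == "__main__":
    # 3000 and 100 are made-up starting weights for illustration only.
    print([float(w) for w in drain_schedule(3000, OBJECT_STEPS)])
    print([float(w) for w in drain_schedule(100, NON_OBJECT_STEPS)])
```

Using `Fraction` keeps the intermediate weights exact, which avoids rounding drift when several reductions are chained.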

Change 509838 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] swift codfw-prod: touch *.builder to finish decom

https://gerrit.wikimedia.org/r/509838

Change 509838 merged by CDanis:
[operations/software/swift-ring@master] swift codfw-prod: touch *.builder to finish decom

https://gerrit.wikimedia.org/r/509838

RobH added a subscriber: RobH.

Please note these show 'decommission' in netbox when they are still actively calling into puppet. So they should be active in netbox until they are added to the decommission queue and shifted to dc ops to decom them.

@fgiunchedi: I added in the decommission project so it's easier to find out why these are showing on the report listed here.

We should likely shift all those ms-be systems back to active in netbox.
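
The mismatch described above, a host marked for decommission in netbox while it still checks in to puppet, is something a report can flag mechanically. A minimal sketch, assuming the netbox statuses and the set of hosts still running puppet have already been fetched into plain Python structures. The function name, the status string, and the sample data are hypothetical and do not reflect how the actual report is implemented.

```python
def find_status_mismatches(netbox_status, puppet_hosts, flagged="decommissioning"):
    """Return hosts whose netbox status is `flagged` but which still run puppet."""
    return sorted(
        host
        for host, status in netbox_status.items()
        if status == flagged and host in puppet_hosts
    )


if __name__ == "__main__":
    # Illustrative data only; example1001 is a made-up control host.
    netbox_status = {
        "ms-be2013.codfw.wmnet": "decommissioning",
        "ms-be2014.codfw.wmnet": "decommissioning",
        "ms-be2015.codfw.wmnet": "decommissioning",
        "example1001.codfw.wmnet": "active",
    }
    puppet_hosts = set(netbox_status)  # all four still call into puppet
    for host in find_status_mismatches(netbox_status, puppet_hosts):
        print(host)
```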

I'm not sure if this is wrong, or if the report is. I would guess there is the need for a state that is set by the service owner and represents "depooled from production, but still runs puppet until DC Ops wipes it". @Volans?

@faidon given that the available list of states is fixed and we don't have one for this case, one option could be to put it back into staged in Netbox and allow a staged -> decommissioned transition.

That said, in my vision the whole decom process should eventually be covered by a single decom cookbook that takes care of all the actions apart from code changes. The current state is far from optimal and we have many inconsistencies-by-design in the current workflow. Finding the right small steps to go from the current state to the final state might be challenging, but I think we should start progressing in that direction.

The most obvious next step for me is to have the decom script make the host unbootable and shut it down. If there is agreement, I can make a patch for it.

RobH added a comment. May 16 2019, 10:12 PM

Please note I didn't actually change the state in puppet, since as @faidon pointed out, I'm not sure if we need to change the report, or the process, or what. I did add the decommission project so it is easy to look at the workboard for decommission and the report output and match hostnames to tasks.

In this case it was my mistake to set decommissioning in netbox for those hosts while they still run role swift. I'll put them back to active and try the decom process again!

Change 510819 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Set spares for ms-be[12]01[345]

https://gerrit.wikimedia.org/r/510819

Change 511665 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/software/swift-ring@master] codfw-prod: remove ms-be201[345]

https://gerrit.wikimedia.org/r/511665

Change 511665 merged by Filippo Giunchedi:
[operations/software/swift-ring@master] codfw-prod: remove ms-be201[345]

https://gerrit.wikimedia.org/r/511665

Mentioned in SAL (#wikimedia-operations) [2019-05-21T11:07:53Z] <godog> swift codfw-prod: remove ms-be201[345] - T221068

Change 510819 merged by Filippo Giunchedi:
[operations/puppet@production] Set spares for ms-be[12]01[345]

https://gerrit.wikimedia.org/r/510819

fgiunchedi updated the task description. May 23 2019, 9:51 AM
fgiunchedi assigned this task to RobH.

Task updated with the checklist; hosts are now marked as spare in puppet and I've set netbox status to decommissioning. Moving to @RobH.

I've put the state of those hosts in Netbox back to active, as they are currently "active" for the spare::system role. Decommissioning should be set once we run the decom script (which will do it automatically very soon) and the host is removed from puppet completely.
I've also updated the documentation to reduce confusion:
https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1827408&oldid=1827206