
decommission of restbase200[1-6] (lease return in December 2018)
Closed, Resolved · Public

Description

This task will track the decommission-hardware of servers restbase200[1-6].

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Please note these systems are being replaced via T209615. Once the new systems restbase201[3-8].codfw.wmnet are in service, this decommission should proceed immediately, as these hosts are due back for lease return.
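The bracket notation used for the host ranges in this task can be expanded mechanically. Below is a minimal sketch, assuming a single-digit range in the final position of the name and the codfw.wmnet domain seen elsewhere on this task; the function name `expand_hosts` is hypothetical and not part of any WMF tooling.

```python
from __future__ import annotations

import re


def expand_hosts(pattern: str, domain: str = "codfw.wmnet") -> list[str]:
    """Expand a pattern such as 'restbase200[1-6]' into fully qualified names.

    Assumption: the range is a single digit pair in brackets at the end of the name.
    """
    match = re.fullmatch(r"(?P<prefix>[a-z0-9-]+)\[(?P<lo>\d)-(?P<hi>\d)\]", pattern)
    if match is None:
        raise ValueError(f"unsupported host pattern: {pattern!r}")
    lo, hi = int(match["lo"]), int(match["hi"])
    return [f"{match['prefix']}{n}.{domain}" for n in range(lo, hi + 1)]


# Hosts being returned, and the replacements named in the description.
old_hosts = expand_hosts("restbase200[1-6]")
new_hosts = expand_hosts("restbase201[3-8]")
print(old_hosts)
print(new_hosts)
```

A list like `old_hosts` can then be fed to whatever per-host step is being worked through, one FQDN at a time.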

restbase2001:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
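
The DebMonitor cleanup command in the list above takes the host FQDN as its only variable. As a rough illustration, the sketch below assembles that same command line for a given host without running it. The URL and certificate paths are copied from the checklist. The helper name `debmonitor_delete_command` and its input guard are assumptions made for this example. Per the checklist, wmf-decommission-host normally handles this step.

```python
import shlex

# Values copied from the checklist item; nothing here contacts the network.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/{fqdn}"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_command(fqdn: str) -> list:
    """Return the argv for the DELETE request that removes one host (hypothetical helper)."""
    # Assumed guard: reject values that would change the URL path or split the argument.
    if not fqdn or "/" in fqdn or any(ch.isspace() for ch in fqdn):
        raise ValueError(f"suspicious FQDN: {fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL.format(fqdn=fqdn),
        "--cert", CERT,
        "--key", KEY,
    ]


cmd = debmonitor_delete_command("restbase2001.codfw.wmnet")
print(shlex.join(cmd))  # printable form, for review before anyone runs it
```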

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet for LEASE RETURNS
  • - mgmt dns entries removed.

restbase2002:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet for LEASE RETURNS
  • - mgmt dns entries removed.

restbase2003:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet for LEASE RETURNS
  • - mgmt dns entries removed.

restbase2004:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet for LEASE RETURNS
  • - mgmt dns entries removed.

restbase2005:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet for LEASE RETURNS
  • - mgmt dns entries removed.

restbase2006:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as doing so would leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update netbox with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet for LEASE RETURNS
  • - mgmt dns entries removed.
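
Each per-host checklist above marks a block of DC-Ops steps as non-interruptible. One way to reason about that block is to treat it as all-or-nothing: a host with some but not all of those steps done is in the "unfinished state" the checklist warns about. The sketch below is a hypothetical model of that idea. The step labels are shortened from the checklist, and the function names are invented for this example.

```python
# Shortened labels for the steps between "Start" and "End non-interrupt steps".
NON_INTERRUPT_STEPS = [
    "disable puppet on host",
    "power down host",
    "update netbox status",
    "disable switch port",
    "note switch port assignment on task",
    "remove remaining puppet references",
    "remove production dns entries",
    "puppet node clean / deactivate",
    "remove debmonitor entries",
]


def missing_steps(done: set) -> list:
    """Steps from the non-interrupt block not yet completed, in checklist order."""
    return [step for step in NON_INTERRUPT_STEPS if step not in done]


def non_interrupt_state(done: set) -> str:
    """Classify a host as 'not started', 'unfinished', or 'finished'."""
    remaining = missing_steps(done)
    if len(remaining) == len(NON_INTERRUPT_STEPS):
        return "not started"
    return "finished" if not remaining else "unfinished"


# Example: every step done except the production DNS removal.
done = set(NON_INTERRUPT_STEPS) - {"remove production dns entries"}
print(non_interrupt_state(done), missing_steps(done))
```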

Event Timeline

RobH triaged this task as High priority. · Dec 3 2018, 11:51 PM
RobH created this task.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task). · Dec 3 2018, 11:58 PM

Change 479401 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Remove restbase200[1-6] from restbase

https://gerrit.wikimedia.org/r/479401

Change 479402 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Remove restbase200[1-6] cassandra instances

https://gerrit.wikimedia.org/r/479402

Change 479403 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: spare::system for restbase200[1-6]

https://gerrit.wikimedia.org/r/479403

Change 479401 merged by Filippo Giunchedi:
[operations/puppet@production] Remove restbase200[1-6] from restbase

https://gerrit.wikimedia.org/r/479401

Change 479402 merged by Filippo Giunchedi:
[operations/puppet@production] Remove restbase200[1-6] cassandra instances

https://gerrit.wikimedia.org/r/479402

Change 479403 merged by Filippo Giunchedi:
[operations/puppet@production] site: spare::system for restbase200[1-6]

https://gerrit.wikimedia.org/r/479403

Change 479406 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Remove restbase200[1-6]

https://gerrit.wikimedia.org/r/479406

Change 479406 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Remove restbase200[1-6]

https://gerrit.wikimedia.org/r/479406

Mentioned in SAL (#wikimedia-operations) [2018-12-13T11:31:52Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871

Mentioned in SAL (#wikimedia-operations) [2018-12-13T11:39:00Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871 (duration: 07m 08s)

Mentioned in SAL (#wikimedia-operations) [2018-12-13T11:39:18Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871

Mentioned in SAL (#wikimedia-operations) [2018-12-13T11:45:26Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@29a0902]: Remove restbase200[1-6] and ensure body.tfa exists for feed responses - T211070 T211871 (duration: 06m 08s)

Mentioned in SAL (#wikimedia-operations) [2018-12-13T11:53:49Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@55fcd4b]: Remove restbase200[1-6], ensure body.tfa exists for feed responses and disable Citoid check - T211070 T211871 T211411

Mentioned in SAL (#wikimedia-operations) [2018-12-13T12:12:47Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@55fcd4b]: Remove restbase200[1-6], ensure body.tfa exists for feed responses and disable Citoid check - T211070 T211871 T211411 (duration: 18m 59s)

Mentioned in SAL (#wikimedia-operations) [2018-12-13T13:15:38Z] <godog> stop restbase and cassandra on restbase200[1-6] - T211070

fgiunchedi reassigned this task from Eevans to RobH. · Dec 13 2018, 1:21 PM
fgiunchedi updated the task description.
fgiunchedi added a subscriber: Eevans.

Ready for decom @RobH

wmf-decommission-host was executed by robh for restbase2001.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for restbase2002.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for restbase2003.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for restbase2004.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for restbase2005.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for restbase2006.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor
RobH updated the task description.
RobH added a comment. · Dec 13 2018, 10:10 PM

Network ports for later label removal:

restbase2001 = asw-b-codfw:ge-5/0/29
restbase2002 = asw-b-codfw:ge-8/0/8

restbase2003 = asw-c-codfw:ge-1/0/13
restbase2004 = asw-c-codfw:ge-5/0/13

restbase2005 = asw-d-codfw:ge-1/0/3
restbase2006 = asw-d-codfw:ge-5/0/3

All have been removed from the private1 vlan for their rows and added to the disabled group.
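
Since the port labels noted above have to be removed later, switch by switch, it can help to regroup the notes by switch. Here is a small sketch that parses the "host = switch:interface" lines exactly as written in this comment; the function names are hypothetical.

```python
from collections import defaultdict

# Port notes copied from the comment above.
PORT_NOTES = """\
restbase2001 = asw-b-codfw:ge-5/0/29
restbase2002 = asw-b-codfw:ge-8/0/8

restbase2003 = asw-c-codfw:ge-1/0/13
restbase2004 = asw-c-codfw:ge-5/0/13

restbase2005 = asw-d-codfw:ge-1/0/3
restbase2006 = asw-d-codfw:ge-5/0/3
"""


def parse_ports(text: str) -> dict:
    """Map each host to its (switch, interface) pair, skipping blank lines."""
    ports = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        host, _, location = line.partition(" = ")
        switch, _, interface = location.partition(":")
        ports[host] = (switch, interface)
    return ports


def by_switch(ports: dict) -> dict:
    """Group interfaces per switch, ordered by host name."""
    grouped = defaultdict(list)
    for _host, (switch, interface) in sorted(ports.items()):
        grouped[switch].append(interface)
    return dict(grouped)


ports_by_switch = by_switch(parse_ports(PORT_NOTES))
for switch, interfaces in ports_by_switch.items():
    print(switch, interfaces)
```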

Change 479559 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom restbase200[1-6].codfw.wmnet

https://gerrit.wikimedia.org/r/479559

Change 479560 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom restbase200[1-6] production dns entries

https://gerrit.wikimedia.org/r/479560

Change 479559 merged by RobH:
[operations/puppet@production] decom restbase200[1-6].codfw.wmnet

https://gerrit.wikimedia.org/r/479559

RobH updated the task description.
RobH reassigned this task from RobH to Papaul. · Dec 13 2018, 10:26 PM
RobH added a project: ops-codfw.
RobH updated the task description.
Restricted Application added a project: Operations. · Dec 13 2018, 10:26 PM
Papaul updated the task description. · Dec 20 2018, 3:24 PM
Papaul updated the task description. · Dec 20 2018, 5:22 PM
Papaul updated the task description. · Jan 3 2019, 6:53 PM

Change 482121 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS entries for restbase200[1-6]

https://gerrit.wikimedia.org/r/482121

Change 482121 merged by Dzahn:
[operations/dns@master] DNS: Remove mgmt DNS entries for restbase200[1-6]

https://gerrit.wikimedia.org/r/482121

Dzahn added a subscriber: Dzahn. · Edited · Jan 3 2019, 8:47 PM

@RobH @fgiunchedi I still see a bunch of production DNS records for these.. although most of the check boxes above are checked. They are unracked.

templates/wmnet:restbase2001    1H  IN A    10.192.16.152
templates/wmnet:restbase2001-a  1H  IN A    10.192.16.162 ; cassandra instance
templates/wmnet:restbase2001-b  1H  IN A    10.192.16.163 ; cassandra instance
templates/wmnet:restbase2001-c  1H  IN A    10.192.16.164 ; cassandra instance
templates/wmnet:restbase2002    1H  IN A    10.192.16.153
templates/wmnet:restbase2002-a  1H  IN A    10.192.16.165 ; cassandra instance
templates/wmnet:restbase2002-b  1H  IN A    10.192.16.166 ; cassandra instance
templates/wmnet:restbase2002-c  1H  IN A    10.192.16.167 ; cassandra instance
templates/wmnet:restbase2003    1H  IN A    10.192.32.124
templates/wmnet:restbase2003-a  1H  IN A    10.192.32.134 ; cassandra instance
templates/wmnet:restbase2003-b  1H  IN A    10.192.32.135 ; cassandra instance
templates/wmnet:restbase2003-c  1H  IN A    10.192.32.136 ; cassandra instance
templates/wmnet:restbase2004    1H  IN A    10.192.32.125
templates/wmnet:restbase2004-a  1H  IN A    10.192.32.137 ; cassandra instance
templates/wmnet:restbase2004-b  1H  IN A    10.192.32.138 ; cassandra instance
templates/wmnet:restbase2004-c  1H  IN A    10.192.32.139 ; cassandra instance
templates/wmnet:restbase2005    1H  IN A    10.192.48.37
templates/wmnet:restbase2005-a  1H  IN A    10.192.48.46 ; cassandra instance
templates/wmnet:restbase2005-b  1H  IN A    10.192.48.47 ; cassandra instance
templates/wmnet:restbase2005-c  1H  IN A    10.192.48.48 ; cassandra instance
templates/wmnet:restbase2006    1H  IN A    10.192.48.38
templates/wmnet:restbase2006-a  1H  IN A    10.192.48.49 ; cassandra instance
templates/wmnet:restbase2006-b  1H  IN A    10.192.48.50 ; cassandra instance
templates/wmnet:restbase2006-c  1H  IN A    10.192.48.51 ; cassandra instance
...
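
A listing like the one above can be summarised per host to see what is still left in the zone template. The sketch below assumes the "templates/wmnet:NAME 1H IN A ADDRESS" layout shown in this comment and uses a three-line excerpt of it as sample input. The function name `leftover_records` is hypothetical.

```python
import ipaddress
import re

# Matches the record layout shown in the listing above.
RECORD = re.compile(r"^templates/wmnet:(?P<name>\S+)\s+1H\s+IN\s+A\s+(?P<ip>\S+)")


def leftover_records(listing: str, hosts: list) -> dict:
    """Return {host: [(record name, IPv4Address), ...]} for records still present."""
    found = {host: [] for host in hosts}
    for line in listing.splitlines():
        match = RECORD.match(line)
        if match is None:
            continue  # not a record line, e.g. the trailing "..."
        name = match["name"]
        host = name.split("-")[0]  # restbase2001-a belongs to restbase2001
        if host in found:
            found[host].append((name, ipaddress.ip_address(match["ip"])))
    return found


sample = """\
templates/wmnet:restbase2001    1H  IN A    10.192.16.152
templates/wmnet:restbase2001-a  1H  IN A    10.192.16.162 ; cassandra instance
templates/wmnet:restbase2002    1H  IN A    10.192.16.153
...
"""

result = leftover_records(sample, ["restbase2001", "restbase2002", "restbase2003"])
for host, records in result.items():
    print(host, [name for name, _ip in records])
```

An empty list for a host means no production A records for it appear in the listing.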
Papaul updated the task description. · Jan 4 2019, 5:46 PM
RobH added a comment. · Jan 4 2019, 6:07 PM

@RobH @fgiunchedi I still see a bunch of production DNS records for these.. although most of the check boxes above are checked. They are unracked.

templates/wmnet:restbase2001    1H  IN A    10.192.16.152
templates/wmnet:restbase2001-a  1H  IN A    10.192.16.162 ; cassandra instance
templates/wmnet:restbase2001-b  1H  IN A    10.192.16.163 ; cassandra instance
templates/wmnet:restbase2001-c  1H  IN A    10.192.16.164 ; cassandra instance
templates/wmnet:restbase2002    1H  IN A    10.192.16.153
templates/wmnet:restbase2002-a  1H  IN A    10.192.16.165 ; cassandra instance
templates/wmnet:restbase2002-b  1H  IN A    10.192.16.166 ; cassandra instance
templates/wmnet:restbase2002-c  1H  IN A    10.192.16.167 ; cassandra instance
templates/wmnet:restbase2003    1H  IN A    10.192.32.124
templates/wmnet:restbase2003-a  1H  IN A    10.192.32.134 ; cassandra instance
templates/wmnet:restbase2003-b  1H  IN A    10.192.32.135 ; cassandra instance
templates/wmnet:restbase2003-c  1H  IN A    10.192.32.136 ; cassandra instance
templates/wmnet:restbase2004    1H  IN A    10.192.32.125
templates/wmnet:restbase2004-a  1H  IN A    10.192.32.137 ; cassandra instance
templates/wmnet:restbase2004-b  1H  IN A    10.192.32.138 ; cassandra instance
templates/wmnet:restbase2004-c  1H  IN A    10.192.32.139 ; cassandra instance
templates/wmnet:restbase2005    1H  IN A    10.192.48.37
templates/wmnet:restbase2005-a  1H  IN A    10.192.48.46 ; cassandra instance
templates/wmnet:restbase2005-b  1H  IN A    10.192.48.47 ; cassandra instance
templates/wmnet:restbase2005-c  1H  IN A    10.192.48.48 ; cassandra instance
templates/wmnet:restbase2006    1H  IN A    10.192.48.38
templates/wmnet:restbase2006-a  1H  IN A    10.192.48.49 ; cassandra instance
templates/wmnet:restbase2006-b  1H  IN A    10.192.48.50 ; cassandra instance
templates/wmnet:restbase2006-c  1H  IN A    10.192.48.51 ; cassandra instance
...

I neglected to merge my patchset https://gerrit.wikimedia.org/r/#/c/operations/dns/+/479560/

fixing now

Change 479560 merged by RobH:
[operations/dns@master] decom restbase200[1-6] production dns entries

https://gerrit.wikimedia.org/r/479560

Papaul updated the task description. · Jan 8 2019, 4:23 PM
Papaul reassigned this task from Papaul to RobH. · Jan 14 2019, 5:13 PM
Papaul added a subscriber: Papaul.

This is complete. All servers are ready to be shipped out.

RobH closed this task as Resolved. · Mar 28 2019, 9:13 PM