decommission lvs100[123456].wikimedia.org
Closed, Resolved / Public / Request

Description

This task will track the decommission-hardware of servers lvs100[123456].wikimedia.org

The first 5 steps should be completed by the service owner who is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).
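For reference, the bracket shorthand used for the hosts on this task can be expanded into individual FQDNs. This is a minimal sketch, not part of any WMF tooling: `expand_hosts` is a hypothetical helper, and it assumes a single bracket group holding either a set of digits (`lvs100[123456]`) or a numeric range (`lvs100[1-6]`, `lvs[1001-1006]`).

```python
import re


def expand_hosts(pattern: str) -> list:
    """Expand one bracket group in a host pattern into a list of hostnames.

    Hypothetical helper for illustration only; it handles a single group
    containing either a digit set ("[123456]") or a numeric range ("[1-6]").
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern already names a single host.
        return [pattern]
    prefix, body, suffix = match.groups()
    span = re.fullmatch(r"(\d+)-(\d+)", body)
    if span:
        first, last = span.group(1), span.group(2)
        # Preserve zero padding based on the width of the range start.
        items = [str(n).zfill(len(first)) for n in range(int(first), int(last) + 1)]
    elif body.isdigit():
        # A plain digit set: each character is one alternative.
        items = list(body)
    else:
        raise ValueError(f"unsupported bracket expression: {body!r}")
    return [f"{prefix}{item}{suffix}" for item in items]


hosts = expand_hosts("lvs100[123456].wikimedia.org")
for host in hosts:
    print(host)
```

All three spellings that appear on this task expand to the same six hostnames, lvs1001 through lvs1006.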

lvs1001.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps must not be interrupted, as doing so will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1002.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps must not be interrupted, as doing so will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1003.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps must not be interrupted, as doing so will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1004.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps must not be interrupted, as doing so will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite) - @RobH on 2019-07-26
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1005.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps must not be interrupted, as doing so will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
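
The DebMonitor cleanup step in the checklists above is normally handled by wmf-decommission-host. If it ever had to be run by hand, the per-host commands could be generated roughly as follows. This is a sketch only: `debmonitor_delete_cmd` is a hypothetical helper, the URL and certificate paths are copied from the checklist, and the commands are printed rather than executed.

```python
import shlex

# Base URL and certificate paths are taken from the checklist step above.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_cmd(host_fqdn: str) -> list:
    """Return the argv for the DELETE request shown in the checklist.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_HOSTS_URL + host_fqdn,
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


# Print one command per host that has a checklist in the description.
for n in range(1, 6):
    print(shlex.join(debmonitor_delete_cmd(f"lvs100{n}.wikimedia.org")))
```

Building the argv as a list and quoting it with `shlex.join` keeps the printed line safe to paste into a shell even if a hostname were to contain unexpected characters.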

Event Timeline

Change 512169 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1001-6: remove prod cfg for spare reimage

https://gerrit.wikimedia.org/r/512169

Mentioned in SAL (#wikimedia-operations) [2019-05-23T13:41:51Z] <bblack> stopped pybal on lvs1001-6 - T224223

Change 512169 merged by BBlack:
[operations/puppet@production] lvs1001-6: remove prod cfg for spare reimage

https://gerrit.wikimedia.org/r/512169

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1006.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201905231441_bblack_85352.log.

Completed auto-reimage of hosts:

['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']

Of which those FAILED:

['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']

Change 512189 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1001-6: fix partman recipe

https://gerrit.wikimedia.org/r/512189

Change 512190 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1001-6: remove jessie-installer settings

https://gerrit.wikimedia.org/r/512190

Change 512189 merged by BBlack:
[operations/puppet@production] lvs1001-6: fix partman recipe

https://gerrit.wikimedia.org/r/512189

Change 512190 merged by BBlack:
[operations/puppet@production] lvs1001-6: remove jessie-installer settings

https://gerrit.wikimedia.org/r/512190

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1006.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201905232134_bblack_168841.log.

Completed auto-reimage of hosts:

['lvs1004.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1001.wikimedia.org', 'lvs1006.wikimedia.org']

and were ALL successful.
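
As a quick cross-check of the two reimage reports above against the hosts named in the task title, the lists can be compared as sets. This is a small sketch using only the host lists quoted in this timeline; the variable names are illustrative.

```python
import ast

# Hosts covered by the task title, lvs100[123456].wikimedia.org.
task_hosts = {f"lvs100{n}.wikimedia.org" for n in range(1, 7)}

# Host list quoted in both wmf-auto-reimage launch messages.
launched = set(ast.literal_eval(
    "['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1004.wikimedia.org', "
    "'lvs1005.wikimedia.org', 'lvs1006.wikimedia.org']"
))

# First run: every launched host is listed under FAILED.
first_run_failed = set(ast.literal_eval(
    "['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', "
    "'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']"
))

# Second run: every launched host completed successfully.
second_run_completed = set(ast.literal_eval(
    "['lvs1004.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org', "
    "'lvs1001.wikimedia.org', 'lvs1006.wikimedia.org']"
))

not_reimaged = sorted(task_hosts - launched)
print("failed in first run:", sorted(first_run_failed))
print("task hosts absent from both reimage runs:", not_reimaged)
```

The comparison shows that the first run failed on all five launched hosts, the second run covered the same five, and lvs1003.wikimedia.org does not appear in either reimage list.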

These are reimaged to role(spare::system) now. Over to @ayounsi for getting rid of all the special cases related to these hosts in the eqiad routers and switches (BGP stuff, fw filters, the special public-vlan LVS-balancer port groups, etc), and then we can move this on to dcops-level decom stuff.

Mentioned in SAL (#wikimedia-operations) [2019-05-24T00:27:10Z] <XioNoX> remove term protect-old-lvs-servers from cr1/2-eqiad - T224223

cr1/2-eqiad
[edit firewall family inet filter border-in4]
-      /* workaround until lvs1001-lvs1007 are decom'ed */
-      term protect-old-lvs-servers {
-          from {
-              destination-address {
-                  208.80.154.55/32;
-                  208.80.154.56/32;
-                  208.80.154.57/32;
-                  208.80.154.137/32;
-                  208.80.154.138/32;
-                  208.80.154.139/32;
-              }
-              protocol tcp;
-              destination-port [ 22 179 9090 9100 ];
-          }
-          then {
-              discard;
-          }
-      }
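
To make the effect of the removed term easier to read, here is a minimal model of its match logic. It is a sketch under stated assumptions: it covers only this one term, ignores every other term in border-in4, and `would_discard` is a hypothetical name. In a Junos `from` clause the listed addresses are alternatives, and the address, protocol and port conditions must all match.

```python
import ipaddress

# Destination addresses and ports copied from the removed term.
PROTECTED_DESTINATIONS = [
    ipaddress.ip_network(prefix)
    for prefix in (
        "208.80.154.55/32",
        "208.80.154.56/32",
        "208.80.154.57/32",
        "208.80.154.137/32",
        "208.80.154.138/32",
        "208.80.154.139/32",
    )
]
PROTECTED_PORTS = {22, 179, 9090, 9100}


def would_discard(dst: str, protocol: str, dst_port: int) -> bool:
    """Return True if the removed term would have matched and discarded the packet."""
    address = ipaddress.ip_address(dst)
    return (
        protocol == "tcp"
        and dst_port in PROTECTED_PORTS
        and any(address in network for network in PROTECTED_DESTINATIONS)
    )


print(would_discard("208.80.154.55", "tcp", 22))   # SSH to a protected address
print(would_discard("208.80.154.55", "tcp", 443))  # port not in the term
```

With the term removed, packets like the first example are no longer dropped by this term and fall through to whatever the rest of the filter decides.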

Mentioned in SAL (#wikimedia-operations) [2019-05-24T00:31:59Z] <XioNoX> remove lvs1001-5 bgp sessions from cr1/2-eqiad - T224223

ayounsi updated the task description.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: lvs[1001-1006].wikimedia.org

  • lvs1004.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1003.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1006.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1005.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1001.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1002.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

Change 525644 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom lvs100[1-6] production dns

https://gerrit.wikimedia.org/r/525644

Change 525644 merged by RobH:
[operations/dns@master] decom lvs100[1-6] production dns

https://gerrit.wikimedia.org/r/525644

RobH removed a project: Patch-For-Review.
RobH added a subscriber: ayounsi.

Change 525805 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom lvs100[1-6]

https://gerrit.wikimedia.org/r/525805

Change 525805 merged by RobH:
[operations/puppet@production] decom lvs100[1-6]

https://gerrit.wikimedia.org/r/525805

Change 525822 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom mgmt dns for lvs100[1-6]

https://gerrit.wikimedia.org/r/525822

@ayounsi,

Per your request, we are assigning this to you for the switch configuration removal for lvs100[1-6]. All of the systems have been unracked. Please resolve this task once you remove all switch configuration for these hosts.

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:01:51Z] <XioNoX> remove lvs100[1-6] switch config from asw2-a-eqiad - T224223

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:08:40Z] <XioNoX> remove lvs100[1-6] switch config from asw2-b-eqiad - T224223

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:20:01Z] <XioNoX> remove lvs100[1-6] switch config from asw2-c-eqiad - T224223

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:43:13Z] <XioNoX> remove lvs100[1-6] switch config from asw2-d-eqiad - T224223

lvs100[1-6] removed from switches.
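
For the record, the time span of the switch cleanup can be read off the four SAL entries above. A small sketch, using only the timestamps quoted in this timeline; the dictionary and function names are illustrative.

```python
from datetime import datetime, timezone

# SAL timestamps for the switch config removal, keyed by switch stack.
SAL_ENTRIES = {
    "asw2-a-eqiad": "2019-07-26T18:01:51Z",
    "asw2-b-eqiad": "2019-07-26T18:08:40Z",
    "asw2-c-eqiad": "2019-07-26T18:20:01Z",
    "asw2-d-eqiad": "2019-07-26T18:43:13Z",
}


def parse_sal_timestamp(stamp: str) -> datetime:
    """Parse a SAL timestamp such as 2019-07-26T18:01:51Z as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


times = {switch: parse_sal_timestamp(stamp) for switch, stamp in SAL_ENTRIES.items()}
elapsed = max(times.values()) - min(times.values())
print(f"first to last log entry: {elapsed}")
```

The first and last entries are 41 minutes 22 seconds apart, which bounds the logged part of the cleanup across the four eqiad switch stacks.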

Change 525822 abandoned by RobH:
[operations/dns@master] decom mgmt dns for lvs100[1-6]

Reason:
old neglected patchset, no longer needed.

https://gerrit.wikimedia.org/r/525822