
decommission lvs100[123456].wikimedia.org
Closed, Resolved · Public · Request

Description

This task will track the decommission of servers lvs100[123456].wikimedia.org

The first 5 steps should be completed by the service owner who is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

lvs1001.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
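
The DebMonitor removal step in the checklist above is a single authenticated HTTP DELETE per host. As a rough illustration only, the sketch below builds that same curl command line for each host so it can be reviewed before anyone runs it. The URL, certificate path and key path are copied from the checklist; the function name debmonitor_delete_argv and the input check are assumptions made for this example. Nothing is executed, and the checklist notes that wmf-decommission-host already handles this step.

```python
import shlex
from typing import List

# Values copied from the checklist step above.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_argv(host_fqdn: str) -> List[str]:
    """Return the argv for the curl DELETE call (hypothetical helper).

    The command is only built, never run.
    """
    # Basic sanity check so a malformed name cannot alter the URL path.
    if not host_fqdn or any(ch in host_fqdn for ch in "/ ?#"):
        raise ValueError(f"suspicious host name: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        f"{DEBMONITOR_HOSTS_URL}/{host_fqdn}",
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print one reviewable command line per host in this task.
    for n in range(1, 7):
        print(shlex.join(debmonitor_delete_argv(f"lvs100{n}.wikimedia.org")))
```

Building an argv list instead of a shell string avoids quoting problems if the command is later passed to a process runner.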

lvs1002.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1003.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1004.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite) - @RobH on 2019-07-26
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

lvs1005.wikimedia.org:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - NONSTANDARD - Remove lvs1001-6 special support from eqiad router configs (eg. firewall filter) [Arzhel]
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
  • - Once DCops steps are done, please reassign to @ayounsi for the switch cleanup part

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - prod network cables unplugged in lieu of switch config change at end of day on Thursday 2019-07-25
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked. - assign to @ayounsi for this as LVS is more complex than other servers.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

BBlack created this task. May 23 2019, 1:30 PM
Restricted Application added a project: Operations. May 23 2019, 1:30 PM
BBlack updated the task description. May 23 2019, 1:31 PM
BBlack added a project: Traffic.
BBlack moved this task from Triage to LoadBalancer on the Traffic board. May 23 2019, 1:34 PM

Change 512169 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1001-6: remove prod cfg for spare reimage

https://gerrit.wikimedia.org/r/512169

BBlack updated the task description. May 23 2019, 1:39 PM

Mentioned in SAL (#wikimedia-operations) [2019-05-23T13:41:51Z] <bblack> stopped pybal on lvs1001-6 - T224223

Change 512169 merged by BBlack:
[operations/puppet@production] lvs1001-6: remove prod cfg for spare reimage

https://gerrit.wikimedia.org/r/512169

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1006.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201905231441_bblack_85352.log.

Completed auto-reimage of hosts:

['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']

Of which those FAILED:

['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']
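
The three host lists in the wmf-auto-reimage report above are easier to compare as sets. The sketch below is an illustration, not part of the reimage tooling: it takes the lists exactly as logged and derives which hosts succeeded, which failed, and which hosts named in the task title were never in the run. The function name summarize and the dictionary keys are made up for this example.

```python
from typing import Dict, Iterable, List

# The six hosts named in the task title, lvs100[123456].wikimedia.org.
TASK_HOSTS = [f"lvs100{n}.wikimedia.org" for n in range(1, 7)]

# Lists copied from the first reimage report above.
LAUNCHED = ['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1004.wikimedia.org',
            'lvs1005.wikimedia.org', 'lvs1006.wikimedia.org']
COMPLETED = ['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org',
             'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']
FAILED = ['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org',
          'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org']


def summarize(expected: Iterable[str], launched: Iterable[str],
              completed: Iterable[str], failed: Iterable[str]) -> Dict[str, List[str]]:
    """Group hosts by outcome using plain set differences (hypothetical helper)."""
    expected, launched = set(expected), set(launched)
    completed, failed = set(completed), set(failed)
    return {
        "not_launched": sorted(expected - launched),
        "not_completed": sorted(launched - completed),
        "succeeded": sorted(completed - failed),
        "failed": sorted(failed),
    }


if __name__ == "__main__":
    for outcome, hosts in summarize(TASK_HOSTS, LAUNCHED, COMPLETED, FAILED).items():
        print(f"{outcome}: {hosts}")
```

For this first run every launched host failed, and lvs1003 does not appear in the run at all. The task does not say why lvs1003 was left out.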

Change 512189 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1001-6: fix partman recipe

https://gerrit.wikimedia.org/r/512189

Change 512190 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs1001-6: remove jessie-installer settings

https://gerrit.wikimedia.org/r/512190

Change 512189 merged by BBlack:
[operations/puppet@production] lvs1001-6: fix partman recipe

https://gerrit.wikimedia.org/r/512189

Change 512190 merged by BBlack:
[operations/puppet@production] lvs1001-6: remove jessie-installer settings

https://gerrit.wikimedia.org/r/512190

ayounsi updated the task description. May 23 2019, 6:06 PM

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1006.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201905232134_bblack_168841.log.

Completed auto-reimage of hosts:

['lvs1004.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1001.wikimedia.org', 'lvs1006.wikimedia.org']

and were ALL successful.

BBlack updated the task description. May 23 2019, 10:33 PM
BBlack reassigned this task from BBlack to ayounsi. May 23 2019, 10:36 PM

These are reimaged to role(spare::system) now. Over to @ayounsi for getting rid of all the special cases related to these hosts in the eqiad routers and switches (BGP stuff, fw filters, the special public-vlan LVS-balancer port groups, etc), and then we can move this on to dcops-level decom stuff.

Mentioned in SAL (#wikimedia-operations) [2019-05-24T00:27:10Z] <XioNoX> remove term protect-old-lvs-servers from cr1/2-eqiad - T224223

cr1/2-eqiad
[edit firewall family inet filter border-in4]
-      /* workaround until lvs1001-lvs1007 are decom'ed */
-      term protect-old-lvs-servers {
-          from {
-              destination-address {
-                  208.80.154.55/32;
-                  208.80.154.56/32;
-                  208.80.154.57/32;
-                  208.80.154.137/32;
-                  208.80.154.138/32;
-                  208.80.154.139/32;
-              }
-              protocol tcp;
-              destination-port [ 22 179 9090 9100 ];
-          }
-          then {
-              discard;
-          }
-      }
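
To make clear what the removed term was doing, here is a simplified model of its match logic. It assumes the usual Junos semantics: every condition under "from" must match, and within one condition any listed value is enough. The addresses and ports are taken from the diff above; the function name term_discards is invented for this illustration, and the model ignores everything else in the border-in4 filter.

```python
import ipaddress

# Destination prefixes and ports copied from the removed term above.
PROTECTED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in (
        "208.80.154.55/32", "208.80.154.56/32", "208.80.154.57/32",
        "208.80.154.137/32", "208.80.154.138/32", "208.80.154.139/32",
    )
]
PROTECTED_PORTS = {22, 179, 9090, 9100}


def term_discards(dst_ip: str, protocol: str, dst_port: int) -> bool:
    """Return True if this simplified model of the term would discard the packet.

    All three conditions must hold (AND); each condition accepts any of
    its listed values (OR).
    """
    address = ipaddress.ip_address(dst_ip)
    address_matches = any(address in prefix for prefix in PROTECTED_PREFIXES)
    return (
        address_matches
        and protocol.lower() == "tcp"
        and dst_port in PROTECTED_PORTS
    )


if __name__ == "__main__":
    print(term_discards("208.80.154.55", "tcp", 22))    # protected address, TCP, SSH port
    print(term_discards("208.80.154.55", "tcp", 443))   # port not listed in the term
    print(term_discards("208.80.154.58", "tcp", 22))    # address not listed in the term
```

A packet that fails any one of the three conditions falls through to later terms, which this sketch does not model.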

Mentioned in SAL (#wikimedia-operations) [2019-05-24T00:31:59Z] <XioNoX> remove lvs1001-5 bgp sessions from cr1/2-eqiad - T224223

ayounsi reassigned this task from ayounsi to RobH. May 24 2019, 12:43 AM
ayounsi updated the task description.
Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. May 28 2019, 2:51 PM
RobH updated the task description. Thu, Jul 25, 8:09 PM

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: lvs[1001-1006].wikimedia.org

  • lvs1004.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1003.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1006.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1005.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1001.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • lvs1002.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
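
This task writes the same six hosts in several bracket notations: lvs100[123456], lvs100[1-6] and lvs[1001-1006]. The sketch below expands all three to explicit host names so they can be checked against each other. It is an illustration written for this task and not the parser used by the decommission cookbook; the function name expand_hosts is an assumption, and only digit lists and numeric ranges are handled.

```python
import re
from typing import List

_BRACKET = re.compile(r"\[([^\]]+)\]")
_RANGE = re.compile(r"(\d+)-(\d+)")


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracket expression at a time, left to right (hypothetical helper)."""
    match = _BRACKET.search(pattern)
    if match is None:
        return [pattern]

    body = match.group(1)
    numeric_range = _RANGE.fullmatch(body)
    if numeric_range:
        # Range form such as [1001-1006]; keep zero padding from the lower bound.
        low, high = numeric_range.group(1), numeric_range.group(2)
        items = [str(i).zfill(len(low)) for i in range(int(low), int(high) + 1)]
    elif body.isdigit():
        # Digit-list form such as [123456]: one host per digit.
        items = list(body)
    else:
        raise ValueError(f"unsupported bracket expression: [{body}]")

    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded: List[str] = []
    for item in items:
        expanded.extend(expand_hosts(head + item + tail))
    return expanded


if __name__ == "__main__":
    for notation in ("lvs100[123456].wikimedia.org",
                     "lvs100[1-6].wikimedia.org",
                     "lvs[1001-1006].wikimedia.org"):
        print(notation, "->", expand_hosts(notation))
```

All three notations expand to lvs1001 through lvs1006, which matches the six hosts listed in the cookbook output above.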

Change 525644 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom lvs100[1-6] production dns

https://gerrit.wikimedia.org/r/525644

Change 525644 merged by RobH:
[operations/dns@master] decom lvs100[1-6] production dns

https://gerrit.wikimedia.org/r/525644

ayounsi removed a subscriber: ayounsi. Thu, Jul 25, 8:25 PM
RobH updated the task description. Thu, Jul 25, 8:29 PM
RobH removed a project: Patch-For-Review.
RobH added a subscriber: ayounsi.
RobH removed RobH as the assignee of this task. Fri, Jul 26, 1:39 PM

Change 525805 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom lvs100[1-6]

https://gerrit.wikimedia.org/r/525805

Change 525805 merged by RobH:
[operations/puppet@production] decom lvs100[1-6]

https://gerrit.wikimedia.org/r/525805

RobH updated the task description. Fri, Jul 26, 2:04 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description. Fri, Jul 26, 2:43 PM

Change 525822 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom mgmt dns for lvs100[1-6]

https://gerrit.wikimedia.org/r/525822

RobH assigned this task to ayounsi. Fri, Jul 26, 2:45 PM

@ayounsi,

Per your request, we are assigning this to you for the switch configuration removal for lvs100[1-6]. All of the systems have been unracked. Please resolve this task once you remove all switch configuration for these hosts.

RobH moved this task from Decommission to Blocked on the ops-eqiad board. Fri, Jul 26, 2:46 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:01:51Z] <XioNoX> remove lvs100[1-6] switch config from asw2-a-eqiad - T224223

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:08:40Z] <XioNoX> remove lvs100[1-6] switch config from asw2-b-eqiad - T224223

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:20:01Z] <XioNoX> remove lvs100[1-6] switch config from asw2-c-eqiad - T224223

Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:43:13Z] <XioNoX> remove lvs100[1-6] switch config from asw2-d-eqiad - T224223
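
The four SAL entries above share one fixed layout: channel, UTC timestamp, nick, message, task ID. As a small illustration, the sketch below parses that layout with a regular expression and measures the time between the first and last switch cleanup. The pattern is inferred from the lines in this task only, and the names SAL_LINE and parse_sal are assumptions for this example.

```python
import re
from datetime import datetime, timezone
from typing import Dict

# Layout inferred from the SAL mentions in this task.
SAL_LINE = re.compile(
    r"^Mentioned in SAL \((?P<channel>#[^)]+)\) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"<(?P<nick>[^>]+)> "
    r"(?P<message>.*?)"
    r"(?: - (?P<task>T\d+))?$"
)

ENTRIES = [
    "Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:01:51Z] <XioNoX> remove lvs100[1-6] switch config from asw2-a-eqiad - T224223",
    "Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:08:40Z] <XioNoX> remove lvs100[1-6] switch config from asw2-b-eqiad - T224223",
    "Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:20:01Z] <XioNoX> remove lvs100[1-6] switch config from asw2-c-eqiad - T224223",
    "Mentioned in SAL (#wikimedia-operations) [2019-07-26T18:43:13Z] <XioNoX> remove lvs100[1-6] switch config from asw2-d-eqiad - T224223",
]


def parse_sal(line: str) -> Dict[str, object]:
    """Split one SAL mention into its fields (hypothetical helper)."""
    match = SAL_LINE.match(line)
    if match is None:
        raise ValueError(f"not a SAL mention: {line!r}")
    fields: Dict[str, object] = match.groupdict()
    # The trailing Z marks UTC, so attach the timezone explicitly.
    fields["timestamp"] = datetime.strptime(
        match.group("timestamp"), "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return fields


if __name__ == "__main__":
    parsed = [parse_sal(line) for line in ENTRIES]
    for entry in parsed:
        print(entry["timestamp"].isoformat(), entry["nick"], entry["message"])
    elapsed = parsed[-1]["timestamp"] - parsed[0]["timestamp"]
    print("first to last switch cleanup:", elapsed)
```

The timestamps show the four access switch stacks being cleaned up one after another, from asw2-a-eqiad to asw2-d-eqiad.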

ayounsi closed this task as Resolved. Fri, Jul 26, 6:45 PM

lvs100[1-6] removed from switches.