decommission: labtestservices2001.wikimedia.org
Closed, ResolvedPublic

Description

This task will track the decommission-hardware of server labtestservices2001.wikimedia.org.

The first 5 steps should be completed by the service owner who is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

labtestservices2001.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as stopping partway would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw-b-codfw:ge-8/0/12
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
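
As a rough illustration of the DebMonitor removal step above, the following sketch builds (but does not run) the curl command for a given host. The URL and certificate paths are copied from the step itself; the helper name `debmonitor_delete_command` and the basic FQDN sanity check are assumptions made for this example and are not part of wmf-decommission-host.

```python
import shlex
from typing import List

# Values copied from the checklist step; everything else here is hypothetical.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_command(host_fqdn: str) -> List[str]:
    """Build the argv for the DebMonitor DELETE request without executing it."""
    # Reject values that would produce a malformed URL (empty, spaces, slashes).
    if not host_fqdn or "/" in host_fqdn or any(c.isspace() for c in host_fqdn):
        raise ValueError(f"not a plausible FQDN: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL + host_fqdn,
        "--cert", CERT,
        "--key", KEY,
    ]


if __name__ == "__main__":
    cmd = debmonitor_delete_command("labtestservices2001.wikimedia.org")
    # Print a shell-quoted form so it can be reviewed before anyone runs it.
    print(shlex.join(cmd))
```

Returning an argument list rather than a formatted string keeps the host name out of shell interpolation; whether the command is then run by hand or by tooling is left open here.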

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update Netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - change netbox status to offline when unracked
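
The Netbox status changes named in the checklist can be summarised in a small sketch. The status labels are written as they appear in the steps above; the function name `netbox_status` and its two boolean parameters are hypothetical, and the sketch assumes that "offline" takes precedence once the system has been unracked.

```python
def netbox_status(decom: bool, unracked: bool = False) -> str:
    """Return the Netbox status the checklist asks for at a given point.

    decom:    True if the host is being decommissioned, False if reclaimed to spare.
    unracked: True once onsite has physically removed the system from the rack.
    """
    if unracked:
        # Final checklist step: status becomes offline when unracked.
        return "offline"
    # Non-interrupt step: Inventory for decom, Planned for spare.
    return "Inventory" if decom else "Planned"
```

For this task, a decommission, the expected sequence would be `netbox_status(decom=True)` during the non-interrupt steps and `netbox_status(decom=True, unracked=True)` after the onsite work.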

Event Timeline

right now labtestservices2001 is the only host for the labtest ldap db. So we should move that someplace before we decom, unless we want to start with a fresh db entirely.

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

Mentioned in SAL (#wikimedia-operations) [2019-03-18T13:03:58Z] <arturo> T218022 disable icinga checks for labtestservices2001.wikimedia.org

aborrero renamed this task from Hardware decommission: labtestservices2001.wikimedia.org to decommission: labtestservices2001.wikimedia.org. Mar 18 2019, 1:04 PM
aborrero updated the task description.
aborrero added a subscriber: RobH.
aborrero changed the task status from Open to Stalled. Mar 18 2019, 1:38 PM

I don't really know what that database is about. But perhaps we want to do it at the same time as T218569: Openstack codfw DBs: move to m5-master.eqiad.wmnet. Would you mind updating that ticket so we have all the DB-reallocating info in a single place?

I will block this task on that so we don't accidentally wipe the server :-)

It's not a mysql database. Labtest has its own testing ldap -- that ldap is stored on ldapservices1001 so we'd lose all that state unless we sync this to a different ldap server.

OK, then we should probably create an LDAP server in codfw if we want to keep both environments as close as possible? I'm also fine if we just copy the LDAP DB to another server.

Mentioned in SAL (#wikimedia-operations) [2019-03-25T07:58:21Z] <vgutierrez> disable puppet and downtime host in icinga for labtestservices2001 - T218022

Change 505629 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labtestservices2001: use spare role

https://gerrit.wikimedia.org/r/505629

Change 505629 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labtestservices2001: use spare role

https://gerrit.wikimedia.org/r/505629

aborrero changed the task status from Stalled to Open. Apr 22 2019, 11:35 AM
aborrero reassigned this task from aborrero to RobH.
aborrero triaged this task as Medium priority.
aborrero updated the task description.
aborrero unsubscribed.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labtestservices2001.wikimedia.org

  • labtestservices2001.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

Change 505810 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decommission labtestservices2001 production dns

https://gerrit.wikimedia.org/r/505810

Change 505812 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom labtestservices2001

https://gerrit.wikimedia.org/r/505812

Change 505810 merged by RobH:
[operations/dns@master] decommission labtestservices2001 production dns

https://gerrit.wikimedia.org/r/505810

Change 505812 merged by RobH:
[operations/puppet@production] decom labtestservices2001

https://gerrit.wikimedia.org/r/505812

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (codfw) on the decommission-hardware board.

Ready for the remainder of decom steps, then removal from racks, thanks!

Mentioned in SAL (#wikimedia-operations) [2019-04-26T09:48:56Z] <marostegui> Remove labtestservices2001 from tendril - T218022

@RobH this server is still showing up on the switch side

papaul@asw-b-codfw> show interfaces ge-8/0/12 descriptions 
Interface       Admin Link Description
ge-8/0/12       up    up   labtestservices2001-eth0

Done, port disabled, back to you.

Change 510567 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for labtestservices2001

https://gerrit.wikimedia.org/r/510567

Change 510567 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt DNS for labtestservices2001

https://gerrit.wikimedia.org/r/510567

Papaul updated the task description.

Complete