decommission: labtestservices2001.wikimedia.org
Closed, Resolved · Public

Description

This task will track the decommission of server labtestservices2001.wikimedia.org.

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

labtestservices2001.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove host's role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw-b-codfw:ge-8/0/12
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
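
For reference, the DebMonitor removal step above amounts to a single authenticated DELETE request per host. Below is a minimal Python sketch that only assembles the command from the checklist and does not run it. The function name and the FQDN sanity check are illustrative assumptions; in practice wmf-decommission-host handles this step.

```python
import shlex

# URL prefix and certificate paths are copied from the checklist step above.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_argv(host_fqdn):
    """Build (but do not execute) the curl command that removes one host.

    The sanity check is a hypothetical guard, not part of the original step.
    """
    if not host_fqdn or "/" in host_fqdn or " " in host_fqdn:
        raise ValueError("suspicious FQDN: %r" % (host_fqdn,))
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_HOSTS_URL + host_fqdn,
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(debmonitor_delete_argv("labtestservices2001.wikimedia.org")))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; whoever performs the decom can paste the result on the appropriate host.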

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update Netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - change netbox status to offline when unracked
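
The Netbox status transitions in the checklist above can be summarised as a small lookup. This is only a sketch: the function name and the boolean flags are hypothetical, and the returned strings are the labels used in this task, not necessarily the values the Netbox API expects.

```python
def netbox_status(decom, unracked=False):
    """Return the Netbox status label the checklist calls for.

    decom:    True if the host is being decommissioned, False if it is
              being reclaimed as a spare.
    unracked: True once the onsite engineer has removed it from the rack.
    """
    if unracked:
        # Final checklist step: status becomes offline once unracked.
        return "offline"
    # Non-interrupt step: Inventory for decom, Planned for spare.
    return "Inventory" if decom else "Planned"


if __name__ == "__main__":
    print(netbox_status(decom=True))                 # Inventory
    print(netbox_status(decom=False))                # Planned
    print(netbox_status(decom=True, unracked=True))  # offline
```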

Event Timeline

Restricted Application added a project: Operations. · Mar 11 2019, 12:17 PM
aborrero updated the task description. · Mar 11 2019, 1:07 PM

right now labtestservices2001 is the only host for the labtest ldap db. So we should move that someplace before we decom, unless we want to start with a fresh db entirely.

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

Mentioned in SAL (#wikimedia-operations) [2019-03-18T13:03:58Z] <arturo> T218022 disable icinga checks for labtestservices2001.wikimedia.org

aborrero renamed this task from Hardware decommission: labtestservices2001.wikimedia.org to decommission: labtestservices2001.wikimedia.org. · Mar 18 2019, 1:04 PM
aborrero updated the task description.
aborrero added a subscriber: RobH.
aborrero changed the task status from Open to Stalled. · Mar 18 2019, 1:38 PM

right now labtestservices2001 is the only host for the labtest ldap db. So we should move that someplace before we decom, unless we want to start with a fresh db entirely.

I don't really know what that database is about. But perhaps we want to do it at the same time as T218569: Openstack codfw DBs: move to m5-master.eqiad.wmnet. Would you mind updating that ticket so we have all the DB-reallocating info in a single place?

I will block this task on that so we don't accidentally wipe the server :-)

aborrero updated the task description. · Mar 18 2019, 1:39 PM

I don't really know what that database is about. But perhaps we want to do it at the same time as T218569: Openstack codfw DBs: move to m5-master.eqiad.wmnet. Would you mind updating that ticket so we have all the DB-reallocating info in a single place?

It's not a mysql database. Labtest has its own testing ldap -- that ldap is stored on ldapservices1001 so we'd lose all that state unless we sync this to a different ldap server.

OK, then we should probably create an LDAP server in codfw if we want to have both environments as close as possible? I'm also fine if we just copy&paste the LDAP DB to another server.

Mentioned in SAL (#wikimedia-operations) [2019-03-25T07:58:21Z] <vgutierrez> disable puppet and downtime host in icinga for labtestservices2001 - T218022

Dzahn moved this task from Backlog to Decommission on the ops-codfw board. · Apr 12 2019, 12:07 AM
aborrero updated the task description. · Apr 22 2019, 10:59 AM

Change 505629 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labtestservices2001: use spare role

https://gerrit.wikimedia.org/r/505629

Change 505629 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labtestservices2001: use spare role

https://gerrit.wikimedia.org/r/505629

aborrero changed the task status from Stalled to Open. · Apr 22 2019, 11:35 AM
aborrero reassigned this task from aborrero to RobH.
aborrero triaged this task as Normal priority.
aborrero updated the task description.
aborrero removed a subscriber: aborrero.
RobH updated the task description.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labtestservices2001.wikimedia.org

  • labtestservices2001.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
RobH updated the task description. · Apr 23 2019, 4:13 PM

Change 505810 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decommission labtestservices2001 production dns

https://gerrit.wikimedia.org/r/505810

Change 505812 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom labtestservices2001

https://gerrit.wikimedia.org/r/505812

Change 505810 merged by RobH:
[operations/dns@master] decommission labtestservices2001 production dns

https://gerrit.wikimedia.org/r/505810

Change 505812 merged by RobH:
[operations/puppet@production] decom labtestservices2001

https://gerrit.wikimedia.org/r/505812

RobH reassigned this task from RobH to Papaul. · Apr 23 2019, 4:21 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (codfw) on the decommission board.

Ready for the remainder of decom steps, then removal from racks, thanks!

RobH updated the task description. · Apr 25 2019, 11:53 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-26T09:48:56Z] <marostegui> Remove labtestservices2001 from tendril - T218022

Papaul added a comment. · May 2 2019, 4:51 PM

@RobH this server is still showing up on the switch side

papaul@asw-b-codfw> show interfaces ge-8/0/12 descriptions 
Interface       Admin Link Description
ge-8/0/12       up    up   labtestservices2001-eth0
Papaul reassigned this task from Papaul to RobH. · May 2 2019, 4:51 PM
Papaul added a subscriber: Papaul.
RobH reassigned this task from RobH to Papaul. · May 2 2019, 4:54 PM

Done, port disabled, back to you.
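
A check like the one above can be scripted. The sketch below parses output in the format Papaul pasted and lists ports whose description still names the host while the admin state is up. The helper names are hypothetical, and the parsing assumes only the four-column layout shown in this task.

```python
from typing import Dict, List, NamedTuple


class PortRow(NamedTuple):
    admin: str
    link: str
    description: str


def parse_descriptions(output: str) -> Dict[str, PortRow]:
    """Parse 'show interfaces ... descriptions' style output into rows."""
    rows = {}
    for line in output.splitlines():
        # Columns: interface, admin state, link state, free-text description.
        fields = line.split(None, 3)
        if len(fields) < 4 or fields[0] == "Interface":
            continue  # skip the header and blank or short lines
        name, admin, link, description = fields
        rows[name] = PortRow(admin, link, description.strip())
    return rows


def ports_still_up(output: str, host: str) -> List[str]:
    """Return interfaces described as belonging to host that are admin-up."""
    return [
        name
        for name, row in parse_descriptions(output).items()
        if row.description.startswith(host) and row.admin == "up"
    ]


# Sample copied from the comment above.
SAMPLE = """\
Interface       Admin Link Description
ge-8/0/12       up    up   labtestservices2001-eth0
"""

if __name__ == "__main__":
    print(ports_still_up(SAMPLE, "labtestservices2001"))  # ['ge-8/0/12']
```

An empty list would indicate the "disable switch port" step has taken effect for every port labelled with that host.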

Papaul updated the task description. · May 7 2019, 2:44 PM
Papaul updated the task description. · May 8 2019, 3:05 PM

Change 510567 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for labtestservices2001

https://gerrit.wikimedia.org/r/510567

Change 510567 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt DNS for labtestservices2001

https://gerrit.wikimedia.org/r/510567

Papaul closed this task as Resolved. · May 15 2019, 4:32 PM
Papaul updated the task description.

Complete