
Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet
Open, Normal, Public

Description

Following the outage tracked by T216208, tasks around evacuating and replacing labsdb1004 and labsdb1005 were expedited.

This task tracks the decommission of labsdb1004 and labsdb1005 once it is unblocked by T193264.

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, dependent on server configuration and age.)

labsdb1005:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as it will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-c-eqiad:ge-3/0/0
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
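
For reference, the DebMonitor removal in the last non-interrupt step can be sketched as a small helper that only builds the curl argument list for a given host. This is a minimal sketch under assumptions: the function name `debmonitor_delete_command` and the FQDN sanity check are hypothetical, the URL and certificate paths are copied from the step above, and nothing is executed against DebMonitor.

```python
import shlex
from typing import List

# Values copied from the checklist step; treat them as site-specific.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_command(host_fqdn: str) -> List[str]:
    """Return the curl argv that would delete one host from DebMonitor.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    # Very rough sanity check so a typo cannot change the URL path.
    if not host_fqdn or "/" in host_fqdn or " " in host_fqdn:
        raise ValueError(f"not a plausible FQDN: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL + host_fqdn,
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print the command for each host covered by this task.
    for host in ("labsdb1004.eqiad.wmnet", "labsdb1005.eqiad.wmnet"):
        print(shlex.join(debmonitor_delete_command(host)))
```

Building the command as a list rather than a single string keeps the host name from being reinterpreted by a shell if the list is later passed to a process runner.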

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
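
The decom-versus-spare branching in the checklist above can be sketched as data plus two small functions. This is an illustrative sketch only: the `Step` class, `ONSITE_STEPS`, `netbox_status`, and `applicable_steps` are hypothetical names, and the step texts are paraphrased from the list above rather than taken from any real tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    text: str
    decom_only: bool = False  # True for the items marked "IF DECOM"


# Steps that follow the non-interrupt block, paraphrased from the checklist.
ONSITE_STEPS = [
    Step("system disks wiped (by onsite)"),
    Step("system unracked and decommissioned, racktables updated", decom_only=True),
    Step("switch port configuration removed from switch", decom_only=True),
    Step("system added to decommission tracking sheet", decom_only=True),
    Step("mgmt dns entries removed", decom_only=True),
]


def netbox_status(decom: bool) -> str:
    """Netbox status named in the checklist: Inventory for decom, Planned for spare."""
    return "Inventory" if decom else "Planned"


def applicable_steps(decom: bool) -> List[Step]:
    """Keep every step for a decom; drop the IF DECOM items for a reclaim to spare."""
    return [step for step in ONSITE_STEPS if decom or not step.decom_only]


if __name__ == "__main__":
    for decom in (True, False):
        label = "decom" if decom else "spare"
        print(f"{label}: netbox status {netbox_status(decom)}")
        for step in applicable_steps(decom):
            print(f"  - {step.text}")
```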

labsdb1004:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as it will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw2-c-eqiad:ge-3/0/1
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - label disk #6 as broken so it doesn't get re-used
  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.


Event Timeline

Bstorm changed the task status from Open to Stalled. (Feb 21 2019, 6:07 PM)
Bstorm triaged this task as Normal priority.
Bstorm created this task.
bd808 moved this task from Backlog to ToolsDB on the Data-Services board. (Mar 5 2019, 4:18 PM)

This should only be stalled on T216441 at this point. @jcrespo and @Marostegui, if that is ready to close, then we can begin this ticket.

Bstorm added a subscriber: Halfak. (Mar 19 2019, 10:24 PM)
Bstorm changed the task status from Stalled to Open. (Mar 26 2019, 5:01 PM)
Bstorm claimed this task.
Bstorm moved this task from Important to Doing on the cloud-services-team (Kanban) board.

I believe this is ready to move forward now.

In the course of sorting out the disabling of things, I found out we monitor the wikilabels db via toolschecker. Figuring out how to shift that to the new server.

Change 499910 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB

https://gerrit.wikimedia.org/r/499910

Change 499910 merged by Bstorm:
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB

https://gerrit.wikimedia.org/r/499910

We now need to disable or update any monitoring of the mariadb databases here. @Marostegui and @jcrespo, I can try putting up patches and such. I presume that there is a tendril host that should have access to the instances (and likely doesn't yet). If you have that IP/IPs, let me know.

Mentioned in SAL (#wikimedia-operations) [2019-03-29T05:47:09Z] <marostegui> Remove labsdb1004 and labsdb1005 from tendril - T216749

Mentioned in SAL (#wikimedia-operations) [2019-03-29T05:49:26Z] <marostegui> Disable notifications on labsdb1004 and labsdb1005 - T216749

@Bstorm I have removed the hosts from Tendril (tendril doesn't page or send alerts, but it is nice to get them removed indeed).
I have also disabled notifications and downtimed for both in Icinga.
Once you set role::spare for them, they will get the checks removed too.

Marostegui updated the task description. (Mar 29 2019, 6:00 AM)

Mentioned in SAL (#wikimedia-operations) [2019-03-29T06:56:13Z] <marostegui> Remove tools section from tendril by doing: update shards set display='0' where name='tools'; T216749
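
The effect of that UPDATE can be illustrated on a toy table. This is a sketch under loud assumptions: it uses an in-memory SQLite database rather than Tendril's real MariaDB backend, the two-column `shards` schema and the `s1` row are invented for the demo, and the reading that `display = 0` hides a section is inferred from the log entry above. The helper names are hypothetical.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the shards table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shards (name TEXT PRIMARY KEY, display INTEGER)")
    conn.executemany(
        "INSERT INTO shards (name, display) VALUES (?, ?)",
        [("s1", 1), ("tools", 1)],  # demo rows, not real Tendril data
    )
    return conn


def hide_shard(conn: sqlite3.Connection, name: str) -> None:
    """Same statement as in the log entry, with the shard name parameterized."""
    conn.execute("UPDATE shards SET display = '0' WHERE name = ?", (name,))


def visible_shards(conn: sqlite3.Connection) -> List[str]:
    """Return the names of shards still flagged for display."""
    rows = conn.execute("SELECT name FROM shards WHERE display != 0 ORDER BY name")
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    hide_shard(conn, "tools")
    print(visible_shards(conn))
```

The row is kept and only flagged, so the change can be reverted by setting `display` back rather than re-inserting the shard.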

Change 500090 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] labsdb: remove old and likely unused cname for labsdb1004

https://gerrit.wikimedia.org/r/500090

Change 500117 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labsdb: decommissioning labsdb1004/5

https://gerrit.wikimedia.org/r/500117

Change 500117 merged by Bstorm:
[operations/puppet@production] labsdb: decommissioning labsdb1004/5

https://gerrit.wikimedia.org/r/500117

Database services (postgres and mariadb) are now shut off, and the spare role is applied.

Bstorm renamed this task from "Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready" to "Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready". (Mar 29 2019, 10:09 PM)
Bstorm updated the task description.
Bstorm renamed this task from "Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready" to "Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet". (Mar 29 2019, 10:40 PM)

@Bstorm mysql is still up at labsdb1004, is that expected?

Change 500373 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Remove labsdb1004,labsdb1005

https://gerrit.wikimedia.org/r/500373

Change 500373 merged by Marostegui:
[operations/puppet@production] mariadb: Remove labsdb1004,labsdb1005

https://gerrit.wikimedia.org/r/500373

Change 500090 merged by Bstorm:
[operations/dns@master] labsdb: remove old and likely unused cname for labsdb1004

https://gerrit.wikimedia.org/r/500090

Bstorm updated the task description. (Apr 1 2019, 3:20 PM)
Bstorm reassigned this task from Bstorm to RobH. (Apr 1 2019, 3:23 PM)
Bstorm removed a project: Patch-For-Review.
Bstorm updated the task description.

@Marostegui it was supposed to be down. It needed a kill -9.

Thanks :-)
So fully ready for @RobH to take over

Change 500657 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] site.pp: Clarify labsdb1004 and 1005 status

https://gerrit.wikimedia.org/r/500657

Change 500657 merged by Marostegui:
[operations/puppet@production] site.pp: Clarify labsdb1004 and 1005 status

https://gerrit.wikimedia.org/r/500657

RobH added a comment. (Apr 18 2019, 6:43 PM)

I'll be using the new cookbook documented at https://wikitech.wikimedia.org/wiki/Decom_script

This replaces use of the wmf-decommission-host script.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labsdb1004.eqiad.wmnet

  • labsdb1004.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labsdb1005.eqiad.wmnet

  • labsdb1005.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

RobH updated the task description. (Apr 18 2019, 6:48 PM)

Change 504941 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decommission labsdb100[45]

https://gerrit.wikimedia.org/r/504941

Change 504941 merged by RobH:
[operations/puppet@production] decommission labsdb100[45]

https://gerrit.wikimedia.org/r/504941

Change 504944 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decommission labsdb100[45] production dns entries

https://gerrit.wikimedia.org/r/504944

Change 504944 merged by RobH:
[operations/dns@master] decommission labsdb100[45] production dns entries

https://gerrit.wikimedia.org/r/504944

RobH reassigned this task from RobH to Cmjohnson. (Apr 18 2019, 7:03 PM)
RobH updated the task description.
RobH edited projects, added ops-eqiad; removed Patch-For-Review.

These are very old R510s, so they are slated for decommission and disposal, not reclaim. Please wipe disks and add to the next disposal batch of servers.

RobH moved this task from Backlog to Decommission on the ops-eqiad board. (Apr 18 2019, 7:04 PM)
RobH updated the task description.
Cmjohnson reassigned this task from Cmjohnson to Jclark-ctr. (Thu, Sep 19, 8:57 PM)
Cmjohnson added a subscriber: Cmjohnson.

John, please wipe the servers, remove them from the rack, and update netbox and the tracking sheet. Assign the task back to me once you finish so I can kill the switch ports.

Jclark-ctr updated the task description. (Fri, Oct 11, 10:47 PM)