
Decommission labsdb1001 and labsdb1003
Closed, Resolved (Public)

Description

labsdb1001 and labsdb1003 are ready to be wiped, unracked, and sent back to Cisco.

labsdb1001:

  • Ops steps
    • All system services confirmed offline from production use (MySQL is now stopped)
    • Set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
    • Remove system from all lvs/pybal active configuration
    • Any service group puppet/hiera/dsh config removed
    • Update site.pp with role::spare::system (https://gerrit.wikimedia.org/r/#/c/404323/)
  • DC ops
    • Disable puppet on host
    • Remove all remaining puppet references (including role::spare::system)
    • Power down host
    • Disable switch port
    • Switch port assignment noted on this task (for later removal)
    • Remove production dns entries
    • Puppet node clean, puppet node deactivate
  • Decommission
    • System disks wiped (by onsite)
    • System unracked and decommissioned (by onsite), update racktables with result
    • Switch port configuration removed from switch once system is unracked.
    • Mgmt dns entries removed.

labsdb1003:

  • Ops steps
    • All system services confirmed offline from production use (MySQL is now stopped)
    • Set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
    • Remove system from all lvs/pybal active configuration
    • Any service group puppet/hiera/dsh config removed
    • Update site.pp with role::spare::system (https://gerrit.wikimedia.org/r/#/c/404323/)
  • DC ops
    • Disable puppet on host
    • Remove all remaining puppet references (including role::spare::system)
    • Power down host
    • Disable switch port
    • Switch port assignment noted on this task (for later removal)
    • Remove production dns entries
    • Puppet node clean, puppet node deactivate (see the command sketch after these checklists)
  • Decommission
    • System disks wiped (by onsite)
    • System unracked and decommissioned (by onsite), update racktables with result
    • Switch port configuration removed from switch once system is unracked.
    • Mgmt dns entries removed.

See also: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission
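
For reference, the remote-capable DC ops steps above (disable puppet, then puppet node clean / puppet node deactivate) correspond roughly to the commands below. This is a minimal sketch assuming the stock Puppet CLI and the hosts' production FQDNs (labsdb1001.eqiad.wmnet and labsdb1003.eqiad.wmnet); WMF's actual decommission tooling may wrap these differently.

  # On each host, stop further puppet runs before it is powered down:
  puppet agent --disable "decommission, see T184832"

  # On the puppetmaster, remove the certificate and deactivate the node so
  # stored configs and exported resources get cleaned up:
  for host in labsdb1001.eqiad.wmnet labsdb1003.eqiad.wmnet; do
      puppet node clean "$host"
      puppet node deactivate "$host"
  done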

Event Timeline

chasemp added projects: SRE, DC-Ops.

Change 404323 had a related patch set uploaded (by BryanDavis; owner: Jcrespo):
[operations/puppet@production] mariadb: Set as spares labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/404323

Mentioned in SAL (#wikimedia-operations) [2018-01-17T06:40:21Z] <marostegui> Stop MySQL on labsdb1001 (already dead) and labsdb1003 - T184832

Change 404323 merged by Marostegui:
[operations/puppet@production] mariadb: Set as spares labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/404323

Mentioned in SAL (#wikimedia-operations) [2018-01-17T06:47:17Z] <marostegui> Remove labsdb1001 and labsdb1003 from tendril - T184832

Marostegui added a subscriber: Cmjohnson.

I believe this is now ready for @Cmjohnson to proceed.

Change 405275 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] mariadb: Remove references to labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/405275

Change 405275 merged by Jcrespo:
[operations/software@master] mariadb: Remove references to labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/405275

RobH subscribed.

I believe this is now ready for @Cmjohnson to proceed.

Ideally, these should come to me until they are ready for the on-site steps. I didn't notice this until now, so I'm stealing the task. (Chris can totally do these, but he typically has on-site-specific stuff in his queue, so I try to do the remote-accessible decom steps for all sites when possible.)

Once I finish all the remote capable steps, I'll assign it over to Chris.

So while these hosts appear not to be in use, the puppet repo was NOT cleared of references before escalation to DC ops, as the lifecycle steps state it should be.

modules/mariadb/files/check_mariadb.py: if host.startswith('labsdb1001') or host.startswith('labsdb1003'):
modules/role/files/mariadb/check_private_data_report:if [ "$HOSTNAME" == "labsdb1001" ] || [ "$HOSTNAME" == "labsdb1003" ]
modules/role/files/prometheus/mysql-labs_eqiad.yaml: - labsdb1001:9104
modules/toollabs/files/toolschecker.py:@check('/labsdb/labsdb1001')
modules/toollabs/files/toolschecker.py:def labsdb_check_labsdb1001():
modules/toollabs/files/toolschecker.py: return db_query_check('labsdb1001.eqiad.wmnet')
modules/toollabs/files/toolschecker.py:@check('/labsdb/labsdb1001rw')
modules/toollabs/files/toolschecker.py:def labsdb_check_labsdb1001rw():
modules/toollabs/files/toolschecker.py: return db_read_write_check('labsdb1001.eqiad.wmnet', 's52524__rwtest')
modules/toollabs/manifests/checker.pp: 'labsdb_labsdb1001' => {
modules/toollabs/manifests/checker.pp: path => '/labsdb/labsdb1001',
modules/toollabs/manifests/checker.pp: 'labsdb_labsdb1001rw' => {
modules/toollabs/manifests/checker.pp: path => '/labsdb/labsdb1001rw',

A quick grep of the repo shows the above for labsdb1001. I assume that since @Marostegui was decommissioning the host before escalation to dc-ops, he is likely the person knowledgeable about what these references need to change to? (Or perhaps @bd808.)

I'll continue with the decom, but that step was skipped and those items should be cleaned up so they don't just add to the cruft in the repo.
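
For anyone repeating this check, the list above can be reproduced with an ordinary recursive grep from the root of an operations/puppet checkout (a sketch; the exact directories searched here are an assumption):

  # Find any remaining references to the two hosts in the puppet repo:
  grep -r 'labsdb100[13]' modules/ manifests/ hieradata/

Re-running the same grep after the cleanup patches are merged is a quick way to confirm nothing was missed.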

Change 408446 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom of labsdb100[13]

https://gerrit.wikimedia.org/r/408446

Change 408448 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom of labsdb100[13] production dns

https://gerrit.wikimedia.org/r/408448

Change 408446 merged by RobH:
[operations/puppet@production] decom of labsdb100[13]

https://gerrit.wikimedia.org/r/408446

Change 408448 merged by RobH:
[operations/dns@master] decom of labsdb100[13] production dns

https://gerrit.wikimedia.org/r/408448

RobH updated the task description.
RobH removed a project: Patch-For-Review.

Ok, now ready for onsite wipe. It looks like labsdb1003 may also have a disk shelf; please ensure all disks are wiped.
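
For illustration only, wiping every attached disk (including a disk shelf) could look something like the sketch below. This is not the documented onsite procedure, just a hedged example using GNU shred from a rescue or wipe environment:

  # Enumerate whole-disk block devices (a disk shelf will simply add more of
  # them) and overwrite each once, finishing with a pass of zeros.
  for dev in $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" {print "/dev/" $1}'); do
      shred -v -n 1 -z "$dev"
  done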

Change 408469 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] mariadb: remove labsdb1001 & labsdb1003 special behavior

https://gerrit.wikimedia.org/r/408469

Change 408470 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolschecker: remove labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/408470

Change 408471 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] prometheus: remove labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/408471

Change 408471 merged by Madhuvishy:
[operations/puppet@production] prometheus: remove labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/408471

Change 408470 merged by Madhuvishy:
[operations/puppet@production] toolschecker: remove labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/408470

Thanks for taking care of those references, Rob.
I thought that was all done by the cloud-services-team. From the start it was a bit unclear who was responsible for decommissioning these hosts (or at least it was unclear to me).
DBAs did some of the stuff we normally do for decommissioning databases, but I assumed the rest was done by the cloud-services-team, so there was clearly a misunderstanding there.
Thanks again for taking care of all those pending things.

No worries, I just didn't want to remove all the old references directly, since I wasn't sure which needed removal and which needed updating to point at new hosts. BD went ahead and pulled them out though, so all good!

Change 408469 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] mariadb: remove labsdb1001 & labsdb1003 special behavior

https://gerrit.wikimedia.org/r/408469

BTW, I can still see a labsdb1002-array1 in racktables - not sure if that is a mistake in the application or the array really is still there, but it should be removed too (along with labsdb1001/3-array).

As per the steps completed above, it looks like labsdb1001 and labsdb1003 are down but not yet unracked.

Change 453152 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns for decom host labsdb1001-3

https://gerrit.wikimedia.org/r/453152

Change 453152 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns for decom host labsdb1001-3

https://gerrit.wikimedia.org/r/453152
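
Once that change is deployed, a quick sanity check that the old records are gone could look like this (hypothetical check; the <host>.mgmt.eqiad.wmnet naming for the management interfaces is an assumption):

  # Both lookups should now fail to resolve:
  host labsdb1001.eqiad.wmnet
  host labsdb1001.mgmt.eqiad.wmnet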

Cmjohnson updated the task description.