
decommmision: labtestweb2001.wikimedia.org
Closed, ResolvedPublic

Description

This task will track the decommissioning of server labtestweb2001.wikimedia.org

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, dependent on server configuration and age.)

This system is over 5 years old and will be decommissioned and unracked.

labtestweb2001.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as doing so will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
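
For reference, the DebMonitor removal in the non-interrupt steps above could be wrapped in a small helper. This is only a sketch: the function name debmonitor_delete and the DRY_RUN switch are hypothetical, and the URL and certificate paths are the ones quoted in the checklist. By default it only prints the command; in practice wmf-decommission-host handles this step.

```shell
# Sketch only: debmonitor_delete and DRY_RUN are hypothetical names,
# not part of any existing tooling.
debmonitor_delete() {
    host_fqdn="${1:-}"
    if [ -z "$host_fqdn" ]; then
        echo "usage: debmonitor_delete <host-fqdn>" >&2
        return 1
    fi
    url="https://debmonitor.discovery.wmnet/hosts/${host_fqdn}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print the command instead of running it.
        echo "sudo curl -X DELETE ${url} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key"
    else
        sudo curl -X DELETE "${url}" \
            --cert /etc/debmonitor/ssl/cert.pem \
            --key /etc/debmonitor/ssl/server.key
    fi
}
```

Calling debmonitor_delete labtestweb2001.wikimedia.org prints the request that would be sent; setting DRY_RUN=0 sends it.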

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - change netbox status to offline when unracked

Event Timeline

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

aborrero renamed this task from Hardware decommmision: labtestweb2001.wikimedia.org to decommmision: labtestweb2001.wikimedia.org. Mar 18 2019, 1:31 PM
aborrero updated the task description.
aborrero added subscribers: RobH, Papaul.

Mentioned in SAL (#wikimedia-operations) [2019-03-21T14:09:25Z] <arturo> T218024 disabled icinga checks for labtestweb2001

I just scanned for databases in this host:

aborrero@labtestweb2001:~ 16s $ sudo mysql -u root
[...]
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| labtestwiki        |
| mysql              |
| performance_schema |
| striker            |
| test               |
+--------------------+
6 rows in set (0.00 sec)
  • labtestwiki seems to be a mediawiki database, for a testing wikitech?
  • striker seems to be a database for the toolsadmin service, but not sure what is this doing in codfw, if its for testing purposes or what
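
The same scan can be done non-interactively so that only the non-system databases are listed. A minimal sketch, assuming root access over the local socket as in the session above; the helper name filter_user_dbs is hypothetical, and the list of system schemas it drops is an assumption:

```shell
# Sketch only: filter_user_dbs is a hypothetical helper.
# Reads one database name per line on stdin and drops the system schemas.
filter_user_dbs() {
    grep -Ev '^(information_schema|mysql|performance_schema|sys)$'
}

# Usage (not run here): -N skips the column header, -B gives plain batch output.
# sudo mysql -u root -N -B -e 'SHOW DATABASES' | filter_user_dbs
```

For the list above this would leave labtestwiki, striker and test.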

If we need to reallocate more databases, we may want to include them in the batch here: T218569: Openstack codfw DBs: move to m5-master.eqiad.wmnet

Please @bd808 and @Andrew confirm.

aborrero changed the task status from Open to Stalled. Mar 21 2019, 4:47 PM
aborrero triaged this task as Medium priority.
  • labtestwiki seems to be a mediawiki database, for a testing wikitech?

This is a MediaWiki database for https://labtestwikitech.wikimedia.org/, but I think that it is actually not currently used. The configured db for labtestwiki in codfw is currently db2037 (m5 codfw replica). That is something we have been talking about changing though as all the shared db servers in codfw are read-only which is pretty useless for testing MediaWiki changes.

  • striker seems to be a database for the toolsadmin service, but not sure what is this doing in codfw, if its for testing purposes or what

This is the database for https://labtesttoolsadmin.wikimedia.org/ (possibly only reachable via a ssh tunnel?) which is a testing deployment of Striker. It should be easy enough to archive the database contents with mysqldump and then put the db back up on whatever host we decide to use for mysql/mariadb in this staging cluster.
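
A rough sketch of that archive step, assuming root access over the local socket as in the earlier mysql session. The function name archive_db, the DRY_RUN switch and the output path are hypothetical; by default the command is only printed:

```shell
# Sketch only: archive_db and DRY_RUN are hypothetical names.
archive_db() {
    db="${1:-}"
    out_dir="${2:-.}"
    if [ -z "$db" ]; then
        echo "usage: archive_db <database> [out_dir]" >&2
        return 1
    fi
    dump_file="${out_dir}/${db}.sql.gz"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print the command instead of running it.
        echo "sudo mysqldump -u root --single-transaction --databases ${db} | gzip > ${dump_file}"
    else
        # --databases includes the CREATE DATABASE statement in the dump,
        # so the archive can be replayed as-is on the new host.
        sudo mysqldump -u root --single-transaction --databases "$db" | gzip > "$dump_file"
    fi
}
```

For example, archive_db striker /tmp would describe a dump to /tmp/striker.sql.gz, which could later be restored on the new host with something like zcat /tmp/striker.sql.gz | sudo mysql -u root.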

It might be easiest just to build a new

I seem to have not accounted for a cloudweb200x-dev host in the codfw1dev planning? I guess we will need to figure out how to fix that by re-purposing one of the hosts we have planned or buying/claiming a new misc spare system.

I propose we repurpose labtestmetal2001.codfw.wmnet for this, i.e. reimage + rename as cloudweb2001-dev.wikimedia.org. Please @bd808 ACK

That sounds like a good plan to me.

aborrero changed the task status from Stalled to Open. Apr 11 2019, 10:54 AM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

See https://phabricator.wikimedia.org/T220096#5103616, I just reallocated the striker database to clouddb2001-dev.codfw.wmnet. I will proceed with decommissioning this server.

Change 502966 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labtestweb2001: decommission

https://gerrit.wikimedia.org/r/502966

Change 502966 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labtestweb2001: decommission

https://gerrit.wikimedia.org/r/502966

aborrero updated the task description.

Please note that asw-b-codfw has the following interface:

ge-5/0/15 up up labcontrol2001

This is actually labtestweb2001: the port was up, it is around the right port number, and it went down when I sent the poweroff to labtestweb2001. So I've disabled that port, but @Papaul will need to trace/confirm this as well.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labtestweb2001.wikimedia.org

  • labtestweb2001.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

Change 505894 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decommission labtestweb2001 production dns entries

https://gerrit.wikimedia.org/r/505894

Change 505894 merged by RobH:
[operations/dns@master] decommission labtestweb2001 production dns entries

https://gerrit.wikimedia.org/r/505894

Change 505895 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom of labtestweb2001

https://gerrit.wikimedia.org/r/505895

Change 505895 merged by RobH:
[operations/puppet@production] decom of labtestweb2001

https://gerrit.wikimedia.org/r/505895

RobH removed a project: Patch-For-Review.
RobH updated the task description.

@RobH you mentioned that the port was disabled, but it looks like it is not:
https://librenms.wikimedia.org/alerts/

RobH reassigned this task from RobH to Papaul. Edited May 3 2019, 6:24 PM

@Papaul: Please trace and disable the port, as it is unclear on the stack which port it was.

Edit addition: actually you can trace and just remove it all since you are decommissioning the system! =]

papaul@asw-a-codfw# run show interfaces ge-5/0/15 descriptions 
Interface       Admin Link Description
ge-5/0/15       down  down DISABLED
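
A check like the one above can be scripted against saved switch output. This is a minimal sketch, assuming the four-column layout shown (Interface, Admin, Link, Description); the helper name port_is_disabled is hypothetical:

```shell
# Sketch only: port_is_disabled is a hypothetical helper.
# Reads "show interfaces descriptions" output on stdin and succeeds only if
# the given port is listed with both Admin and Link state "down".
port_is_disabled() {
    awk -v port="${1:-}" '
        $1 == port { found = 1; ok = ($2 == "down" && $3 == "down") }
        END { if (found && ok) exit 0; exit 1 }
    '
}
```

Fed the output above, port_is_disabled ge-5/0/15 succeeds; fed the earlier "up up labcontrol2001" line, it fails.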

Change 510536 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for labtestweb2001

https://gerrit.wikimedia.org/r/510536

Change 510536 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt DNS for labtestweb2001

https://gerrit.wikimedia.org/r/510536

Papaul updated the task description.

complete