db2064 crashed and totally broken - decommission it
Closed, Resolved, Public

Description

00:28 < icinga-wm> PROBLEM - Host db2064 is DOWN: PING CRITICAL - Packet loss = 100%

ILO logs:

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Critical
    date=05/21/2018
    time=00:26
    description=System Power Fault Detected (XR: 14 00 MID: FF 4D FC CE C0 FF FF 32 32 0C 0C 40 9C 00 00 01 0F 47 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
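For anyone who wants to pull fields such as severity or date out of a record like the one above, here is a minimal sketch. It assumes the record is available as plain text in the `key=value` layout shown; `parse_ilo_record` and `RECORD` are hypothetical names used for illustration and are not part of any iLO tooling.

```python
def parse_ilo_record(text):
    """Collect the key=value properties of one iLO log record into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Header lines such as "Targets" or "Properties" carry no "=" and are skipped.
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# The record from this task, copied as text.
RECORD = """\
/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Critical
    date=05/21/2018
    time=00:26
    description=System Power Fault Detected (XR: 14 00 MID: FF 4D FC CE C0 FF FF 32 32 0C 0C 40 9C 00 00 01 0F 47 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
"""

record = parse_ilo_record(RECORD)
print(record["severity"], record["date"], record["time"])
```

Values are kept as strings on purpose, since the sketch makes no assumption about how the date or the hex payload should be interpreted.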

UPDATE

  • This server is not coming back and should be decommissioned.

Decommission Checklist

  • - all system services confirmed offline from production use - should be done by DBA team: https://gerrit.wikimedia.org/r/#/c/434527/
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. https://gerrit.wikimedia.org/r/#/c/434297/
  • - remove system from all lvs/pybal active configuration - should be done by DBA team
  • - any service group puppet/hiera/dsh config removed - should be done by DBA team
  • - remove from site.pp (system cannot be powered on, so remove it directly from site.pp - no need to add role spare.) - should be done by DBA team

A few of these steps cannot be done as the server is not booting up.

START NON-INTERRUPTIBLE STEPS - please assign to @RobH for the non-interrupt steps

  • - disable puppet on host (cannot be done - system offline)
  • - power down host (already done, the system cannot be back online)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal): asw-d-codfw:ge-6/0/12
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

Event Timeline

Marostegui triaged this task as Medium priority. May 21 2018, 5:03 AM

Change 434295 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2064

https://gerrit.wikimedia.org/r/434295

Change 434295 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2064

https://gerrit.wikimedia.org/r/434295

Mentioned in SAL (#wikimedia-operations) [2018-05-21T05:11:39Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2064 - T195228 (duration: 01m 44s)

Marostegui moved this task from Triage to In progress on the DBA board.

Can you take a look at this server? Maybe power drain it?
I am not even able to power it on:

</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:31 2018



power: server power is currently: Off


</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:33 2018



Server powering on .......



</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:39 2018



power: server power is currently: Off


</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:41 2018



power: server power is currently: Off


</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:46 2018



Server powering on .......



</>hpiLO-> vsp

Virtual Serial Port Active: COM2
 The server is not powered on.  The Virtual Serial Port is not available.

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.


</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:20:16 2018



power: server power is currently: Off
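The pattern in the session above (a `power on` that completes with status 0, followed by status checks that still report Off) can be checked mechanically. The following is a rough sketch that assumes the session has been captured as plain text in the format shown; `power_on_failed` and `SAMPLE` are hypothetical names, not iLO features.

```python
import re

# Matches the CLI prompt and captures the command typed after it.
PROMPT = re.compile(r"hpiLO->\s*(.*)$")
# Matches the status line printed by the "power" command.
STATE = re.compile(r"server power is currently:\s*(\w+)")


def power_on_failed(transcript):
    """Return True if 'power on' was issued and every later status check reports Off."""
    requested = False
    states_after = []
    for line in transcript.splitlines():
        prompt = PROMPT.search(line)
        if prompt:
            if prompt.group(1).strip() == "power on":
                requested = True
            continue
        state = STATE.search(line)
        if state and requested:
            states_after.append(state.group(1))
    return requested and bool(states_after) and all(s == "Off" for s in states_after)


# A shortened excerpt of the session from this comment.
SAMPLE = """\
</>hpiLO-> power on
status=0
status_tag=COMMAND COMPLETED
Server powering on .......
</>hpiLO-> power
status=0
status_tag=COMMAND COMPLETED
power: server power is currently: Off
"""

print(power_on_failed(SAMPLE))
```

The check deliberately ignores `status=0`, because in this session the command was accepted each time even though the server never came up.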

Change 434297 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2064: Disable notifications

https://gerrit.wikimedia.org/r/434297

Change 434297 merged by Marostegui:
[operations/puppet@production] db2064: Disable notifications

https://gerrit.wikimedia.org/r/434297

Papaul subscribed.

@Marostegui using the power button on the server to power it on doesn't work. Draining the power from the server didn't help either.
The server is not coming on at all.
The server has been out of warranty since 2018-1-14.

Can we try to swap its PSU with one from the servers we've decommissioned? Are those compatible?

From my chat with @Papaul

  • We have no compatible PSUs from the servers that were decommissioned (they are different vendors)
  • Changing the power socket/cable didn't have any effect either

So, looks like this server is lost for good.
We have no similar decommissioned servers, so we cannot swap in spare parts.

DCOps suggests decommissioning it and getting it replaced, but I am not sure whether it is worth buying just a single server or grouping this with a future codfw expansion/refresh //cc @mark @jcrespo

Marostegui renamed this task from db2064 crashed to db2064 crashed and totally broken - decommission it. May 22 2018, 4:13 PM
Marostegui updated the task description.
Marostegui added a subscriber: RobH.

Change 434527 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2064

https://gerrit.wikimedia.org/r/434527

Change 434527 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2064

https://gerrit.wikimedia.org/r/434527

Mentioned in SAL (#wikimedia-operations) [2018-05-22T16:24:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove db2064 from config - T195228 (duration: 01m 19s)

Mentioned in SAL (#wikimedia-operations) [2018-05-22T16:25:59Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Remove db2064 from config - T195228 (duration: 01m 18s)

Change 434530 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Remove db2064

https://gerrit.wikimedia.org/r/434530

Change 434530 merged by jenkins-bot:
[operations/software@master] s2.hosts: Remove db2064

https://gerrit.wikimedia.org/r/434530

Change 434531 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Set db2064 to spare

https://gerrit.wikimedia.org/r/434531

Change 434531 merged by Marostegui:
[operations/puppet@production] mariadb: Set db2064 to spare

https://gerrit.wikimedia.org/r/434531

Marostegui updated the task description.

This system is now ready to be decommissioned :-(

I had a chat with @mark and for now we will not buy a replacement. If we have some more issues with other servers and/or we really feel we cannot make it without this host, we will revisit this.
@RobH you can proceed with the decommission then

Thanks!

Vvjjkkii renamed this task from db2064 crashed and totally broken - decommission it to bkcaaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii removed RobH as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot lowered the priority of this task from High to Medium. Jul 5 2018, 6:44 PM
CommunityTechBot updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-07-12T09:17:33Z] <moritzm> ran puppet node clean/deactivate on db2064, hardware is broken for good and caused ongoing connection failures in cumin/debdeploy (T195228)

Change 445428 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom db2064, remove prod dns

https://gerrit.wikimedia.org/r/445428

Change 445428 merged by RobH:
[operations/dns@master] decom db2064, remove prod dns

https://gerrit.wikimedia.org/r/445428

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (codfw) on the decommission-hardware board.

@RobH we cannot wipe the disks on this system. It can't boot, and we do not have an identical unused server to put the disks in for the wipe.

We don't need an identical system, just any system we can install the disks into. I advise using a spare system to do this. Make sense?

Or servers that we already decommissioned?

We have no spare system that can take 12 disks, so I will just use one of the decommissioned Dell servers.

@Papaul: I realize there may not be a 12 disk spare or decom, and you'll have to use one of the 4 or 8 disk spares or decoms and do them in batches. That is fine. Thanks!

@RobH we do have a 12-disk decom on site (db2013).

Moved all disks into one of the decom servers (db2013). Disk wipe in progress.

switch port information

ge-6/0/12

 show interfaces ge-6/0/12 
Physical interface: ge-6/0/12, Administratively down, Physical link is Down
  Interface index: 1212, SNMP ifIndex: 761
  Description: DISABLED
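To confirm the "disable switch port" checklist item from output like the above without reading it by eye, here is a small sketch. It assumes only the three lines pasted in this comment; other layouts of `show interfaces` output are not handled, and `parse_interface`, `is_disabled` and `OUTPUT` are hypothetical names for illustration.

```python
import re

# Header line: interface name, administrative state, physical link state.
HEADER = re.compile(
    r"Physical interface:\s*(?P<name>[^,]+),\s*(?P<admin>[^,]+),"
    r"\s*Physical link is\s*(?P<link>\w+)"
)
DESCRIPTION = re.compile(r"^\s*Description:\s*(?P<desc>.+)$", re.MULTILINE)


def parse_interface(output):
    """Extract name, admin state, link state and description from the pasted output."""
    header = HEADER.search(output)
    if header is None:
        raise ValueError("no 'Physical interface' line found")
    desc = DESCRIPTION.search(output)
    return {
        "name": header.group("name").strip(),
        "admin": header.group("admin").strip(),
        "link": header.group("link"),
        "description": desc.group("desc").strip() if desc else None,
    }


def is_disabled(info):
    """True when the port is administratively down and the link is down."""
    return info["admin"].lower() == "administratively down" and info["link"].lower() == "down"


# The output pasted in this comment.
OUTPUT = """\
Physical interface: ge-6/0/12, Administratively down, Physical link is Down
  Interface index: 1212, SNMP ifIndex: 761
  Description: DISABLED
"""

info = parse_interface(OUTPUT)
print(info["name"], is_disabled(info))
```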

Change 451362 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for db2064

https://gerrit.wikimedia.org/r/451362

Change 451362 merged by Marostegui:
[operations/dns@master] DNS: Remove mgmt DNS for db2064

https://gerrit.wikimedia.org/r/451362

Papaul updated the task description.

This is complete; resolving it.