T195228 db2064 crashed and totally broken

Subject	Repo	Branch	Lines +/-
DNS: Remove mgmt DNS for db2064	operations/dns	master	+1 -4
decom db2064, remove prod dns	operations/dns	master	+1 -2
mariadb: Set db2064 to spare	operations/puppet	production	+1 -9
s2.hosts: Remove db2064	operations/software	master	+0 -1
db-eqiad,db-codfw.php: Remove db2064	operations/mediawiki-config	master	+0 -3
db2064: Disable notifications	operations/puppet	production	+1 -0
db-codfw.php: Depool db2064	operations/mediawiki-config	master	+1 -1

Marostegui created this task. May 21 2018, 5:02 AM

Restricted Application added a subscriber: Aklapper. May 21 2018, 5:02 AM

Marostegui triaged this task as Medium priority. May 21 2018, 5:03 AM

Change 434295 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2064

https://gerrit.wikimedia.org/r/434295

gerritbot added a project: Patch-For-Review. May 21 2018, 5:05 AM

Change 434295 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2064

https://gerrit.wikimedia.org/r/434295

Marostegui added a project: ops-codfw. May 21 2018, 5:10 AM

Marostegui updated the task description.

Restricted Application added a project: SRE. May 21 2018, 5:10 AM

Mentioned in SAL (#wikimedia-operations) [2018-05-21T05:11:39Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2064 - T195228 (duration: 01m 44s)

Can you take a look at this server? Maybe power drain it?
I am not even able to power it on:

</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:31 2018



power: server power is currently: Off


</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:33 2018



Server powering on .......



</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:39 2018



power: server power is currently: Off


</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:41 2018



power: server power is currently: Off


</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:19:46 2018



Server powering on .......



</>hpiLO-> vsp

Virtual Serial Port Active: COM2
 The server is not powered on.  The Virtual Serial Port is not available.

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.


</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Mon May 21 05:20:16 2018



power: server power is currently: Off
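
The session above shows `power on` being accepted (`status=0`) while every later `power` check still reports `Off`. A minimal sketch of how a captured transcript like this could be checked automatically is below. The function names are hypothetical, the status-line format is taken from the output above, and the `On` value is an assumption about what iLO prints for a running server.

```python
import re

# Status line printed by the iLO `power` command in the transcript above,
# e.g. "power: server power is currently: Off".
# Assumption: a running server would report "On" in the same position.
POWER_LINE = re.compile(r"power: server power is currently: (\w+)")


def power_states(transcript: str) -> list[str]:
    """Return every power state reported in a captured iLO session, in order."""
    return POWER_LINE.findall(transcript)


def never_powered_on(transcript: str) -> bool:
    """True if `power on` was issued but no later status check reported On."""
    issued = False
    for line in transcript.splitlines():
        stripped = line.strip()
        if stripped.endswith("power on"):
            issued = True
        match = POWER_LINE.search(stripped)
        if issued and match and match.group(1) == "On":
            return False
    return issued


sample = """\
</>hpiLO-> power on
Server powering on .......
</>hpiLO-> power
power: server power is currently: Off
"""
print(power_states(sample))
print(never_powered_on(sample))
```

This only inspects text that has already been captured; it does not talk to the iLO itself.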

Change 434297 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2064: Disable notifications

https://gerrit.wikimedia.org/r/434297

Change 434297 merged by Marostegui:
[operations/puppet@production] db2064: Disable notifications

https://gerrit.wikimedia.org/r/434297

@Marostegui using the power button on the server to power it on doesn't work. Draining the power from the server didn't help either.
The server is not coming on at all.
Server out of warranty since 2018-1-14

Can we try to swap its PSU with one from the servers we've decommissioned? Are those compatible?

From my chat with @Papaul

We have no compatible PSUs from the servers that were decommissioned (they are different vendors)
Changing the power socket/cable didn't have any effect either

@Marostegui no, there are not.

So, it looks like this server is lost for good.
We have no other similar decommissioned servers, so we cannot swap in spare parts.

Our DCOps suggestion is basically to decommission it and get it replaced, but I am not sure whether it is worth buying just a single server or whether we should group this with a future codfw expansion/refresh //cc @mark @jcrespo

Marostegui renamed this task from db2064 crashed to db2064 crashed and totally broken - decommission it. May 22 2018, 4:13 PM

Marostegui added a project: decommission-hardware.

Marostegui updated the task description.

Marostegui added a subscriber: RobH.

Change 434527 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2064

https://gerrit.wikimedia.org/r/434527

Change 434527 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2064

https://gerrit.wikimedia.org/r/434527

Mentioned in SAL (#wikimedia-operations) [2018-05-22T16:24:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove db2064 from config - T195228 (duration: 01m 19s)

Marostegui updated the task description. May 22 2018, 4:24 PM

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-05-22T16:25:59Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Remove db2064 from config - T195228 (duration: 01m 18s)

Change 434530 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: Remove db2064

https://gerrit.wikimedia.org/r/434530

Change 434530 merged by jenkins-bot:
[operations/software@master] s2.hosts: Remove db2064

https://gerrit.wikimedia.org/r/434530

Change 434531 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Set db2064 to spare

https://gerrit.wikimedia.org/r/434531

Marostegui updated the task description. May 22 2018, 4:35 PM

Change 434531 merged by Marostegui:
[operations/puppet@production] mariadb: Set db2064 to spare

https://gerrit.wikimedia.org/r/434531

This system is now ready to be decommissioned :-(

Marostegui moved this task from In progress to Done on the DBA board. May 22 2018, 4:42 PM

I had a chat with @mark and for now we will not buy a replacement. If we have some more issues with other servers and/or we really feel we cannot make it without this host, we will revisit this.
@RobH you can proceed with the decommission then

Thanks!

• Vvjjkkii renamed this task from db2064 crashed and totally broken - decommission it to bkcaaaaaaa. Jul 1 2018, 1:08 AM

• Vvjjkkii removed RobH as the assignee of this task.

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed subscribers: gerritbot, Aklapper.

Marostegui renamed this task from bkcaaaaaaa to db2064 crashed and totally broken - decommission it. Jul 1 2018, 6:54 PM

Marostegui assigned this task to RobH.

Marostegui removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

Marostegui updated the task description.

CommunityTechBot lowered the priority of this task from High to Medium. Jul 5 2018, 6:44 PM

CommunityTechBot updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-07-12T09:17:33Z] <moritzm> ran puppet node clean/deactivate on db2064, hardware is broken for good and caused ongoing connection failures in cumin/debdeploy (T195228)

MoritzMuehlenhoff updated the task description. Jul 12 2018, 9:18 AM

RobH updated the task description. Jul 12 2018, 3:26 PM

Change 445428 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom db2064, remove prod dns

https://gerrit.wikimedia.org/r/445428

Change 445428 merged by RobH:
[operations/dns@master] decom db2064, remove prod dns

https://gerrit.wikimedia.org/r/445428

RobH reassigned this task from RobH to Papaul. Jul 12 2018, 3:32 PM

RobH removed a project: Patch-For-Review.

RobH updated the task description.

RobH moved this task from Backlog to pending onsite steps (codfw) on the decommission-hardware board.

@RobH we cannot wipe the disks on this system. The system can't boot, and we do not have an identical server that is not in use to put the disks in and do the wipe.

We don't need an identical system, just any system we can install the disks into. I advise using a spare system to do this. Make sense?

Or servers that we already decommissioned?

We have no spare system that can take 12 disks. I will just use one of the decommissioned Dell servers.

@Papaul: I realize there may not be a 12 disk spare or decom, and you'll have to use one of the 4 or 8 disk spares or decoms and do them in batches. That is fine. Thanks!

@RobH we do have a 12-disk decom on site (db2013).

Marostegui mentioned this in T201245: Degraded RAID on db2054. Aug 6 2018, 3:58 PM

Moved all disks into one of the decom servers (db2013). Disk wipe in progress.

switch port information

ge-6/0/12

Papaul updated the task description. Aug 8 2018, 1:39 PM

Papaul updated the task description. Aug 8 2018, 3:24 PM

 show interfaces ge-6/0/12 
Physical interface: ge-6/0/12, Administratively down, Physical link is Down
  Interface index: 1212, SNMP ifIndex: 761
  Description: DISABLED
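
As a decommission checklist aid, the `show interfaces` output above could be checked mechanically to confirm the port is shut. The following is a rough sketch only: the helper names are hypothetical, and the regular expression assumes the "Physical interface" line always has the three comma-separated fields seen in the output above.

```python
import re

# Assumed layout, copied from the output above:
# "Physical interface: <name>, <admin state>, Physical link is <Up|Down>"
IFACE_LINE = re.compile(
    r"Physical interface: (?P<name>[\w/-]+), (?P<admin>[^,]+), "
    r"Physical link is (?P<link>\w+)"
)


def parse_interface(output: str) -> dict:
    """Extract name, admin state and link state from `show interfaces` text."""
    match = IFACE_LINE.search(output)
    if match is None:
        raise ValueError("no 'Physical interface' line found")
    return match.groupdict()


def port_is_disabled(output: str) -> bool:
    """True if the port is administratively down and the link is down."""
    info = parse_interface(output)
    return info["admin"] == "Administratively down" and info["link"] == "Down"


captured = """\
Physical interface: ge-6/0/12, Administratively down, Physical link is Down
  Interface index: 1212, SNMP ifIndex: 761
  Description: DISABLED
"""
print(parse_interface(captured)["name"])
print(port_is_disabled(captured))
```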

Papaul updated the task description. Aug 8 2018, 3:47 PM

Change 451362 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for db2064

https://gerrit.wikimedia.org/r/451362

gerritbot added a project: Patch-For-Review. Aug 8 2018, 4:03 PM

Change 451362 merged by Marostegui:
[operations/dns@master] DNS: Remove mgmt DNS for db2064

https://gerrit.wikimedia.org/r/451362