db1057 does not react to powercycle/powerdown/powerup commands
Closed, Resolved, Public

Description

My bet is that the power supply or something else is fried, although integrated monitoring does not show anything strange.

https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1766

This server was being reimaged, so it can be powered up or down at any time.

Event Timeline

I am going to pool db1067 as a replacement; we cannot be without this server for long.

Mentioned in SAL (#wikimedia-operations) [2017-03-15T19:46:39Z] <jynus> shutting down db1067 for maintenance (as a db1057 replacement) T160435

Change 342905 had a related patch set uploaded (by Jcrespo):
[operations/puppet] Move db1067 from s2 to s1 as a db1057 replacement

https://gerrit.wikimedia.org/r/342905

Change 342906 had a related patch set uploaded (by Jcrespo):
[operations/mediawiki-config] Move db1067 from s2 to s1 as a db1057 replacement

https://gerrit.wikimedia.org/r/342906

Change 342905 merged by Jcrespo:
[operations/puppet] Move db1067 from s2 to s1 as a db1057 replacement

https://gerrit.wikimedia.org/r/342905

Change 342906 merged by jenkins-bot:
[operations/mediawiki-config] Move db1067 from s2 to s1 as a db1057 replacement

https://gerrit.wikimedia.org/r/342906

Change 342990 had a related patch set uploaded (by Marostegui):
[operations/software] s1,s2.hosts: Move db1067 from s2 to s1

https://gerrit.wikimedia.org/r/342990

Change 342990 merged by jenkins-bot:
[operations/software] s1,s2.hosts: Move db1067 from s2 to s1

https://gerrit.wikimedia.org/r/342990

Mentioned in SAL (#wikimedia-operations) [2017-03-16T13:15:53Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1026 - T160415, Repool db1067 - T160435 (duration: 00m 42s)

[16:49:04] <cmjohnson1> db1057 has 1 bad pdu 
[16:49:07] <cmjohnson1> so does db1055

It still doesn't come up; it does not even reach the BIOS checks. :-(

@Cmjohnson this is still blocked on you: db1057 does not start. Does it need a new PDU?

@jcrespo: I pulled out the one PDU that was working, and the server appears to turn on, but I am not getting any display and cannot establish a connection. I am able to log in to the iDRAC and see a CPU2 error and the power supply. The server is well out of warranty and should be replaced.

@Cmjohnson I might be asking something silly, but I thought I would ask just in case.
Would it be feasible to take spare parts from other servers that have been decommissioned and use them to replace the faulty ones in db1057, to see if we can bring it back to life?

I am not sure which one you think we can pull from? All the db's that are being decom'd are different server types and older. They're R510's from [...]. These are R720's from 2013. Typically this would be a main board replacement, but since it's out of warranty we are not able to do this.

Ah right. I wasn't sure about it, hence my question.
I guess it is clear now that we have lost that server for good :_(

We may be able to salvage the data. I can try moving the RAID card and disks to a decom'd R510.

Marostegui claimed this task.

Thankfully, we saved the data before reimaging/rebooting it, so this is more about the server itself (and the possibility of using it somewhere else) than the data it contained.
I would say we have to assume we have lost this chassis then :(

I will mark this as resolved, as there is not much else we can do.

Thanks for your help Chris

Change 345535 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Remove db1057 from s1

https://gerrit.wikimedia.org/r/345535

Change 345545 had a related patch set uploaded (by Marostegui):
[operations/puppet@production] site.pp: Remove db1057

https://gerrit.wikimedia.org/r/345545

Change 345535 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Remove db1057 from s1

https://gerrit.wikimedia.org/r/345535

Mentioned in SAL (#wikimedia-operations) [2017-03-30T14:50:58Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove db1057 entry from s1 shard - T160435 (duration: 00m 44s)

Change 345856 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Remove db1057

https://gerrit.wikimedia.org/r/345856

Change 345856 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Remove db1057

https://gerrit.wikimedia.org/r/345856

Mentioned in SAL (#wikimedia-operations) [2017-04-03T06:21:32Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Remove db1057 entry - T160435 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2017-04-03T06:25:01Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove db1057 entry - T160435 (duration: 00m 44s)

Change 345545 merged by Marostegui:
[operations/puppet@production] site.pp,linux-host-entries.ttyS1: Remove db1057

https://gerrit.wikimedia.org/r/345545