Decommission broken db1058
Closed, Resolved, Public

Description

/admin1-> racadm serveraction powerstatus
Server power status: OFF
/admin1-> racadm serveraction powerup
ERROR: Timeout while waiting for server to perform requested power action.
/admin1-> racadm serveraction hardreset
ERROR: Timeout while waiting for server to perform requested power action.
/admin1-> racadm serveraction powerstatus
Server power status: OFF
/admin1-> racadm serveraction powerup
ERROR: Timeout while waiting for server to perform requested power action.

Event Timeline

Restricted Application added subscribers: Zppix, Southparkfan, Aklapper.

Resetting the interface does not do anything. I am also trying to power it up from the web interface.

There is no console output after power-on.

db1058 is most likely cooked. The server was almost too hot to touch. One of the power supplies has failed. I attempted to drain flea power but the server will not power on. I am letting it cool down to see if that helps.

The server is out of warranty now. In the past a main board replacement was the fix.

Thank you. This should be one of the replaced ones from the new batch. Feel free to unrack it if you need the space.

I will keep this ticket open for decommission purposes.

Cmjohnson renamed this task from "db1058 does not come up after restart" to "Decommission broken db1058". May 5 2016, 8:47 PM
  • Confirm out of cluster/service group
  • Remove from puppet stored configuration files.
  • Remove from site.pp (puppet:///manifests/site.pp)
  • Remove from netboot.cfg
  • Remove from DHCPD lease file
  • Disable puppet
  • Remove from Icinga monitoring
  • Revoke keys from puppet/salt
  • Remove DNS entries for the production and management.
  • Remove from Rack
  • Update Racktables
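For anyone following along, here is a minimal sketch of tracking the checklist above in a script. The step names are copied from the list; the function name `remaining_steps` is made up for illustration and is not part of any Wikimedia tooling.

```python
# Steps copied from the decommission checklist in this task, in order.
DECOMMISSION_STEPS = [
    "Confirm out of cluster/service group",
    "Remove from puppet stored configuration files",
    "Remove from site.pp",
    "Remove from netboot.cfg",
    "Remove from DHCPD lease file",
    "Disable puppet",
    "Remove from Icinga monitoring",
    "Revoke keys from puppet/salt",
    "Remove DNS entries for production and management",
    "Remove from rack",
    "Update Racktables",
]


def remaining_steps(done):
    """Return the checklist steps not yet completed, in checklist order.

    Hypothetical helper: `done` is any iterable of step names.
    Raises ValueError on a name that is not in the checklist, so a typo
    cannot silently leave a step open.
    """
    done = set(done)
    unknown = done - set(DECOMMISSION_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [step for step in DECOMMISSION_STEPS if step not in done]


if __name__ == "__main__":
    for step in remaining_steps(["Confirm out of cluster/service group"]):
        print("TODO:", step)
```

Keeping the list ordered matters here, since the later steps (DNS, unracking) should only happen once the host is out of service and monitoring.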

Change 287145 had a related patch set uploaded (by Southparkfan):
Remove DNS entries of db1058

https://gerrit.wikimedia.org/r/287145

Just to learn how the process works, I've submitted a patch for the DNS adjustments. I noticed db1058 is referenced in the dhcpd and manifests/role/coredb.pp files in puppet but I have no idea how the latter one works, so I'll leave the puppet work to someone else.

Change 287183 had a related patch set uploaded (by Jcrespo):
Depool db1070 for maintenance

https://gerrit.wikimedia.org/r/287183

@Southparkfan We have a pretty strict way of removing servers. It is all documented here: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission

We do this so we do not break anything else in the process or cause unnecessary alerts.

Change 287183 merged by Jcrespo:
Depool db1070 for maintenance

https://gerrit.wikimedia.org/r/287183

@Cmjohnson yeah, perhaps I was a bit too fast in already doing the DNS part (even though that seems to be the only thing I can do) :-)

Anyway, ops know more than me, so they can do whatever is necessary.

Change 287224 had a related patch set uploaded (by Jcrespo):
Retire db1058 from the service group

https://gerrit.wikimedia.org/r/287224

Change 287224 merged by Jcrespo:
Retire db1058 from the service group

https://gerrit.wikimedia.org/r/287224

  • Confirm out of cluster/service group

Change 287591 had a related patch set uploaded (by Jcrespo):
Remove (almost) all references to db1058 on puppet

https://gerrit.wikimedia.org/r/287591

Change 287591 merged by Jcrespo:
Remove (almost) all references to db1058 on puppet

https://gerrit.wikimedia.org/r/287591

Change 287593 had a related patch set uploaded (by Jcrespo):
Remove db1058 entries

https://gerrit.wikimedia.org/r/287593

@Cmjohnson I have removed it from "mediawiki" and "puppet", dhcp, salt, puppet certs, neon. I have not removed it from netboot/preseed, since a range is used there and the name should not be reused, but feel free to disagree.

I've left DNS unmerged, in case you want to do something with the management interface still: https://gerrit.wikimedia.org/r/287593

DNS removed. @jcrespo I do still see some entries in puppet:

manifests/role/coredb.pp: 'hosts' => { 'eqiad' => [ 'db1021', 'db1026', 'db1037', 'db1045', 'db1049', 'db1058' ] },
manifests/role/coredb.pp: 'masters' => { 'eqiad' => 'db1058' },
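A quick way to find leftover references like these is to scan a local checkout for the hostname. This is only a sketch: the function name `find_host_references` and the checkout path `puppet` in the usage example are assumptions, not existing tooling.

```python
import re
from pathlib import Path


def find_host_references(root, hostname):
    """Return (relative_path, line_number, line) for each line mentioning hostname.

    Hypothetical helper: walks every readable text file under `root`.
    """
    # Match the hostname as a whole token so "db1058" does not match "db10580".
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(hostname) + r"(?![A-Za-z0-9])"
    )
    hits = []
    for path in sorted(Path(root).rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(relative), number, line.strip()))
    return hits


if __name__ == "__main__":
    # "puppet" is an assumed path to a local checkout of the repository.
    for name, number, line in find_host_references("puppet", "db1058"):
        print(f"{name}:{number}: {line}")
```

The whole-token match avoids false positives from longer hostnames that happen to start with the same digits.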

That is a deprecated script, and I am waiting for this week's failover to nuke it completely (coredb otherwise is not in use).

db1058 has been removed from rack

Change 287145 abandoned by Dzahn:
Remove DNS entries of db1058

Reason:
already done by chris in commit 2016979ded611256e5f4b321

https://gerrit.wikimedia.org/r/287145

Change 287593 abandoned by Dzahn:
Remove db1058 entries

Reason:
rebased to nothing

https://gerrit.wikimedia.org/r/287593

I have abandoned 2 pending changes in the DNS repo for this that were already duplicated by Chris' change. Just cleaning up.