snapshot1005 does not power back up
Closed, Resolved (Public)

Description

Saw an Icinga alert for snapshot1005 being down.
From the mgmt console, power was off. Issuing the power on command claimed to power it on, yet the power status remained off, even after multiple attempts.

Tried checking the logs: powersupply1 and powersupply2 both look ok. The only record in /system1/log1 older than my own mistyped command is from two months ago. Nothing useful in the /map1/log1 records either, just logins and attempts to power on the dang thing.

Does this need a manual unplug/replug? Is there something I missed?

Event Timeline

ArielGlenn added projects: SRE, ops-eqiad, DC-Ops.
ArielGlenn updated the task description.

If we can't get more info about this before tomorrow morning (UTC morning), I'll move the services on this box to snapshot1009 then; if it turns out it just needs the right kind of kick, I won't bother to do that of course.

Change 443828 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] switch en wiki dumps to run on snapshot1009 for now

https://gerrit.wikimedia.org/r/443828

Because our dc folks are off today, I will likely go ahead and make this switch this evening in time for the next cron cycle to pick up the jobs.

Change 443828 merged by ArielGlenn:
[operations/puppet@production] switch en wiki dumps to run on snapshot1009 for now

https://gerrit.wikimedia.org/r/443828

Changed my mind and merged it through. I'll let the en wp dumps run to completion over here this month, regardless.

I attempted to power off, unplug, and power the server back on. Unfortunately it does not want to power on; I just get a flashing green LED on the front. I am able to access the mgmt web portal and pulled the AHS log, as required by HP tech support. Worth noting that this is a leased server and the lease expires in February 2019. We should still fix the server, but is this a good opportunity to move the service?

The service was already moved to another host (which had been the spare), but we do still want a spare. February is not that far away though, all things considered. I guess snapshot1006 and snapshot1007 will have the same expiry dates.

Cmjohnson claimed this task.

The system board has been replaced and the server is accessible now.