
Phlogiston-1 unresponsive
Closed, Resolved · Public · 5 Estimate Story Points

Description

Phlogiston.wmflabs.org was unresponsive in the browser this morning (the last time I'm sure it was up was the middle of last week). It didn't respond to ssh. https://wikitech.wikimedia.org/wiki/Special:NovaInstance showed it as active, but "get console output" returned blank.

I clicked REBOOT, and now it is in state ERROR; web browsing returns a 502, and ssh fails with "no route to host". "Get console output" is still blank.

Event Timeline

Restricted Application added a project: User-bd808. Jun 13 2016, 6:12 PM
Restricted Application added subscribers: Zppix, Aklapper.
Andrew closed this task as Resolved. Jun 15 2016, 12:35 AM
Andrew added a subscriber: Andrew.

I migrated this VM to labvirt1009 and did a bit of puppet/apt cleanup, and it seems to be working fine. The original cause of the lockup is unknown, but that's a problem for me and not for you :)

bd808 moved this task from To Do to Archive on the User-bd808 board. Jun 21 2016, 7:40 PM
JAufrecht reopened this task as Open. Jun 27 2016, 10:43 PM

Same thing happened today. Do you want me to open a fresh ticket?

Thinking this is related to T137857, I tried to recover this VM, but so far no dice.

nova reset-state --active ab9a7a7b-709f-4bb6-9f33-c56942f30ab5

nova start ab9a7a7b-709f-4bb6-9f33-c56942f30ab5
Request to start server ab9a7a7b-709f-4bb6-9f33-c56942f30ab5 has been accepted.

nova list --all-tenants | grep phlog
| ab9a7a7b-709f-4bb6-9f33-c56942f30ab5 | phlogiston-1                         | phlogiston                  | SHUTOFF | -          | Shutdown    | public=10.68.18.67
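For anyone scripting a check after a recovery attempt like the one above, here is a minimal sketch that reads one row of the `nova list` output and reports whether the instance came back. The function names and the column labels (ID, name, tenant, status, task state, power state, networks) are my own assumptions inferred from the shape of the row pasted above, not something taken from the nova client itself.

```python
def parse_nova_row(line):
    """Split one '|'-delimited row of `nova list` output into stripped cells."""
    cells = [cell.strip() for cell in line.strip().split("|")]
    # A leading or trailing '|' produces an empty edge cell; drop those.
    if cells and cells[0] == "":
        cells = cells[1:]
    if cells and cells[-1] == "":
        cells = cells[:-1]
    return cells


def instance_status(line):
    """Return the status cell, assuming it is the fourth column of the row."""
    return parse_nova_row(line)[3]


def is_running(line):
    """True only if the row reports the instance as ACTIVE."""
    return instance_status(line) == "ACTIVE"


# The row pasted in this comment, used as sample input.
row = (
    "| ab9a7a7b-709f-4bb6-9f33-c56942f30ab5 | phlogiston-1"
    "                         | phlogiston                  "
    "| SHUTOFF | -          | Shutdown    | public=10.68.18.67"
)

print(instance_status(row))
print(is_running(row))
```

With the row above, the status cell is still SHUTOFF, which matches the "no dice" result: the start request was accepted but the instance did not reach ACTIVE.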
bd808 removed bd808 as the assignee of this task. Jun 30 2016, 2:00 AM
bd808 removed a project: User-bd808.
bd808 added a subscriber: bd808.

This needs migrating to labvirt1011 since all other instances are shortage.

Andrew closed this task as Resolved. Jul 5 2016, 1:57 PM
Andrew claimed this task.

I started this instance just now and it seems to be up and running OK.

I've also made another attempt to resolve the parent bug and I'm optimistic that this will stop happening.

JAufrecht set the point value for this task to 5. Sep 16 2016, 11:29 PM