Page MenuHomePhabricator

iridium (Phabricator host) went down, Possible cpu heat issue
Closed, ResolvedPublic

Description

[13:06:03] <icinga-wm> PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100%
[13:09:32] <icinga-wm> PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[13:09:43] <icinga-wm> PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[13:10:11] <icinga-wm> PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[13:11:25] <matt_flaschen> YuviPanda, Phabricator is 503'ing for me.
[13:15:08] <greg-g> matt_flaschen: 03:06 < icinga-wm> PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100%
[13:15:12] <greg-g> so, yeah :/
[13:15:43] <greg-g> !log Phabricator is down
[13:15:48] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:18:09] * matt_flaschen has changed the topic to: Wikimedia Platform operations, serious stuff | Status: Phabricator DOWN | Log: http://bit.ly/wikisal | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: yuvipanda
[13:39:55] <Krenair> do ops get paged when hosts go down? I don't know how icinga is set up for this sort of thing
[13:59:15] * Krinkle has changed the topic to: Wikimedia Platform operations, serious stuff | Status: Phabricator DOWN | Log: https://bit.ly/wikitech | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: yuvipanda
[14:02:07] <TimStarling> depends on the service
[14:08:33] <TimStarling> it says it's powered off
[14:11:39] <TimStarling> http://paste.tstarling.com/p/qbrBRb.html
[14:12:37] <Krenair> the system powered itself off?
[14:12:42] <TimStarling> apparently
[14:12:47] <TimStarling> I guess I will try turning it on
[14:13:40] <TimStarling> !log attempting to turn iridium back on via drac. "getraclog" says it powered itself off after resetting four times
[14:13:44] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:15:56] <TimStarling> I'm on the serial console now, it is booting
[14:16:22] <icinga-wm> RECOVERY - Host iridium is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[14:16:45] <TimStarling> well, phab is back up
[14:18:42] <TimStarling> !log iridium came back up, but mcelog reports high CPU temperature prior to the shutdown
[14:18:46] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:52] <icinga-wm> RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy
[14:22:01] <icinga-wm> RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[14:22:23] <icinga-wm> RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy

Event Timeline

Peachey88 created this task.Apr 4 2016, 4:52 AM
Restricted Application added subscribers: scfc, Aklapper. · View Herald TranscriptApr 4 2016, 4:52 AM

There has been several severs with cpu heat issues over the last few months. Re-applying thermal paste has been an effective fix. Iridium is still under warranty however, I think first step is thermal paste.

chasemp triaged this task as High priority.Apr 4 2016, 3:18 PM

Some details from an email @tstarling sent out as a notice

...So I powered it up, and it came up, but /var/log/mcelog from prior to
the shutdown showed it constantly coming in and out of CPU throttling
due to high temperature:

http://paste.tstarling.com/p/TaLyCY.html

@greg, are you ok w/ @Cmjohnson taking this down for maint before the normal wednesday window? (spoiler is it may very well go down before then anyway...so it seems prudent)

chasemp assigned this task to greg.Apr 4 2016, 3:23 PM

tossing your way because of the nature of the issue and need for immediate feedback for onsite

Restricted Application added a project: User-greg. · View Herald TranscriptApr 4 2016, 3:23 PM
Paladox added a subscriber: Paladox.Apr 4 2016, 3:23 PM
greg removed greg as the assignee of this task.Apr 4 2016, 3:25 PM
greg removed a project: User-greg.

@greg Downtime Max 10 minutes but not even that long. Can do whenever you're ready

greg added a comment.Apr 4 2016, 5:06 PM

@greg Downtime Max 10 minutes but not even that long. Can do whenever you're ready

@Cmjohnson Just confirmed with team. Do it at will today if you can.

greg assigned this task to Cmjohnson.Apr 4 2016, 5:06 PM
Cmjohnson removed Cmjohnson as the assignee of this task.Apr 4 2016, 5:53 PM

Clean off the old thermal paste and reapplied. Let's monitor for the next few days. Leaving task open

hashar added a subscriber: hashar.Apr 4 2016, 6:44 PM

@Cmjohnson do we have a system to monitor temperature? lm_sensors comes to mind, also found out Diamond has a collector for IPMI data https://github.com/BrightcoveOS/Diamond/wiki/collectors-IPMISensorCollector

Krinkle removed a subscriber: Krinkle.Apr 4 2016, 11:32 PM
Luke081515 moved this task from To Triage to Misc on the Phabricator board.Apr 5 2016, 4:31 PM

resolving this task. Please re-open if the heat issues return.

Restricted Application added subscribers: Southparkfan, TerraCodes. · View Herald TranscriptMay 5 2016, 4:03 PM
Cmjohnson closed this task as Resolved.May 5 2016, 4:03 PM
Cmjohnson claimed this task.