
labservices1001 down, suspected overheating
Closed, Resolved, Public

Description

Today labservices1001 went down and toolschecker complained about all tests failing.

The failure itself seems hardware-related; the host went down at 02:17.

$ sudo grep mcelog syslog
Dec  4 01:56:22 labservices1001 mcelog: mcelog read: Function not implemented
Dec  4 01:56:22 labservices1001 mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Dec  4 01:56:22 labservices1001 mcelog: Please check your system cooling. Performance will be impacted
Dec  4 01:56:22 labservices1001 mcelog: Processor 1 below trip temperature. Throttling disabled
Dec  4 01:58:52 labservices1001 mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Dec  4 01:58:52 labservices1001 mcelog: Please check your system cooling. Performance will be impacted
Dec  4 01:58:52 labservices1001 mcelog: Processor 3 below trip temperature. Throttling disabled
Dec  4 02:06:22 labservices1001 mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Dec  4 02:06:22 labservices1001 mcelog: Please check your system cooling. Performance will be impacted
Dec  4 02:06:22 labservices1001 mcelog: Processor 3 below trip temperature. Throttling disabled
Dec  4 02:08:52 labservices1001 mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Dec  4 02:08:52 labservices1001 mcelog: Please check your system cooling. Performance will be impacted
Dec  4 02:08:52 labservices1001 mcelog: Processor 3 below trip temperature. Throttling disabled
Dec  4 02:12:37 labservices1001 mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Dec  4 02:12:37 labservices1001 mcelog: Please check your system cooling. Performance will be impacted
Dec  4 02:12:37 labservices1001 mcelog: Processor 2 heated above trip temperature. Throttling enabled.
Dec  4 02:12:37 labservices1001 mcelog: Please check your system cooling. Performance will be impacted
Dec  4 02:42:00 labservices1001 mcelog: failed to prefill DIMM database from DMI data
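
To make it easier to see which processors were throttling and how often, the mcelog lines above can be tallied with a small script. This is only a sketch: the function name `count_throttle_events` is made up for illustration, and the regular expression assumes the message wording shown in the excerpt above.

```python
import re
from collections import Counter

# Matches mcelog lines such as:
#   "... mcelog: Processor 1 heated above trip temperature. Throttling enabled."
# The pattern is derived from the excerpt above; other mcelog versions may word it differently.
HEATED = re.compile(r"mcelog: Processor (\d+) heated above trip temperature")


def count_throttle_events(lines):
    """Return a Counter mapping processor number -> count of 'heated above trip' events."""
    counts = Counter()
    for line in lines:
        match = HEATED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Example usage (the syslog path is an assumption about the host layout):
#   with open("/var/log/syslog", errors="replace") as fh:
#       for cpu, n in sorted(count_throttle_events(fh).items()):
#           print(f"Processor {cpu}: {n} throttle event(s)")
```

Only the "heated above trip temperature" lines are counted; the matching "below trip temperature" lines are ignored, so the result is the number of throttle activations per processor.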

Before that, labservices1001 saw high disk I/O starting at around 01:00 UTC, which might have been a trigger:

https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?var-server=labservices1001%3A9100&var-datasource=eqiad%20prometheus%2Fops&from=1480812690403&to=1480818733445

Note that during this time DNS from labs instances seemed fine; why toolschecker failed needs investigation.

This task is to track the suspected overheating / hardware problem.

Event Timeline

Note that during this time dns from labs instances seemed fine, why toolschecker failed needs investigation

Well, it wasn't all fine. CI tests were failing with stuff like:

02:21:30 git.exc.GitCommandError: 'git remote prune --dry-run origin' returned with exit code 128
02:21:30 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/core/': Could not resolve host: gerrit.wikimedia.org'

This is now back up, yes? :) Incident report filed at https://wikitech.wikimedia.org/wiki/Incident_documentation/20161204-labservices1001 Closable? Follow-ups all filed and such?

@greg yeah now back up! this task is one of the followups though, I'll clarify it a bit

fgiunchedi renamed this task from "labservices1001 down" to "labservices1001 down, suspected overheating". Dec 6 2016, 6:57 PM
fgiunchedi updated the task description.
Andrew added a subscriber: Andrew.

A few things happen that should not when labservices1001 dies (we do not see these same failures when labservices1002 is down):

  • Puppet runs fail (in tools at least) attempting to resolve tools-redis-1001.tools.eqiad.wmflabs
  • Tools checker checks /all/ go offline it seems
  • CI tests (or some significant subset of them) fail to run adequately (see https://phabricator.wikimedia.org/T152340#2845120)

It's probable this is all due to hardcoded DNS resolution or some failure in this category.
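
One way to reproduce the symptoms listed above from an affected instance is to check whether the hostnames from the failures still resolve while one DNS server is down. The sketch below is a rough illustration, not an existing tool: `resolves` and `check_hosts` are hypothetical names, and the check goes through the system resolver, so it only shows what a client with that instance's resolver configuration would see rather than querying a specific nameserver.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves, False if the lookup fails.

    `resolver` defaults to the system resolver; it is a parameter only so
    the function can be exercised without network access.
    """
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        return False


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to whether it currently resolves."""
    return {name: resolves(name, resolver) for name in hostnames}


# Example usage with the hostnames mentioned in this task:
#   check_hosts([
#       "tools-redis-1001.tools.eqiad.wmflabs",
#       "gerrit.wikimedia.org",
#   ])
```

If lookups fail only while labservices1001 is down and not while labservices1002 is down, that would be consistent with clients depending on a single hardcoded resolver.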

fgiunchedi claimed this task.

I don't think we've seen this recurring.