
Analytics hosts showed high temperature alarms
Closed, Resolved, Public

Description

Several events over the past few days:

19:22  <icinga-wm> PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
04:37 #wikimedia-operations:  <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
06:30 #wikimedia-operations:  <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager

dmesg -T

[...]
[Sat Apr  9 17:46:09 2016] mce: [Hardware Error]: Machine check events logged
[Sat Apr  9 17:48:18 2016] CPU22: Package temperature above threshold, cpu clock throttled (total events = 378279009)
[Sat Apr  9 17:48:18 2016] CPU8: Package temperature above threshold, cpu clock throttled (total events = 378273836)
[Sat Apr  9 17:48:18 2016] CPU20: Package temperature above threshold, cpu clock throttled (total events = 378279595)
[Sat Apr  9 17:48:18 2016] CPU0: Package temperature above threshold, cpu clock throttled (total events = 378271750)
[Sat Apr  9 17:48:18 2016] CPU4: Package temperature above threshold, cpu clock throttled (total events = 378275623)
[Sat Apr  9 17:48:18 2016] CPU16: Package temperature above threshold, cpu clock throttled (total events = 378279736)
[Sat Apr  9 17:48:18 2016] CPU18: Package temperature above threshold, cpu clock throttled (total events = 378280021)
[...]

and /var/log/mcelog:

mcelog: failed to prefill DIMM database from DMI data
mcelog: Warning: cpu 0 offline?, imc_log not set
: No such file or directory
mcelog: Warning: cpu 1 offline?, imc_log not set
: No such file or directory
mcelog: Warning: cpu 2 offline?, imc_log not set
: No such file or directory
mcelog: Warning: cpu 3 offline?, imc_log not set
: No such file or directory
mcelog: Warning: cpu 4 offline?, imc_log not set
: No such file or directory
[...]
Hardware event. This is not a software error.
MCE 0
CPU 12 THERMAL EVENT TSC da97d489bc000
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88000bc3 MCGSTATUS 0
MCGCAP 1000c17 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
MCE 1
CPU 12 THERMAL EVENT TSC da97d48bb43f3
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 below trip temperature. Throttling disabled
STATUS 88010a82 MCGSTATUS 0
MCGCAP 1000c17 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
[...]
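
For anyone who wants more than a raw count of "Hardware event" lines, here is a minimal Python sketch that tallies throttle-on and throttle-off messages per CPU. It assumes the log only contains the two thermal messages shown in the excerpt above; the function name `count_thermal_events` is made up for illustration and is not part of mcelog.

```python
import re
from collections import Counter

# Matches the two thermal messages seen in the mcelog excerpt above.
THERMAL_RE = re.compile(r"^Processor (\d+) (heated above|below) trip temperature")


def count_thermal_events(lines):
    """Return (throttle_on, throttle_off) Counters keyed by CPU number."""
    throttle_on, throttle_off = Counter(), Counter()
    for line in lines:
        m = THERMAL_RE.match(line.strip())
        if not m:
            continue  # not a thermal message, e.g. STATUS or MCGCAP lines
        cpu = int(m.group(1))
        if m.group(2) == "heated above":
            throttle_on[cpu] += 1
        else:
            throttle_off[cpu] += 1
    return throttle_on, throttle_off


# Sample copied from the excerpt above; on a real host you would read
# /var/log/mcelog instead (path taken from this task).
sample = """\
Hardware event. This is not a software error.
MCE 0
CPU 12 THERMAL EVENT TSC da97d489bc000
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
Hardware event. This is not a software error.
MCE 1
CPU 12 THERMAL EVENT TSC da97d48bb43f3
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 below trip temperature. Throttling disabled
""".splitlines()

on, off = count_thermal_events(sample)
print(dict(on), dict(off))
```

A per-CPU breakdown like this could help tell whether only one socket is affected.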

Event Timeline

Cmjohnson noted there have been multiple machines lately with CPU temperature issues, where re-applying thermal paste has been an effective fix.

I think it's a good idea to consider taking this server down so he can re-apply thermal paste.

(Slightly off-topic, as this task is only about analytics1039.) There are more analytics servers with the same load pattern, though. I think some of them have similar errors in dmesg?

@Southparkfan: thanks for the info! This is the first host that explicitly shows thermal errors; the other one just rebooted for some reason (I didn't find the cause in the logs).

@Cmjohnson: what do you think?

Thanks to all!

Milimetric triaged this task as Medium priority. Apr 12 2016, 4:18 PM
elukey renamed this task from Analytics1039 host showed high temperature alarms to Analytics hosts showed high temperature alarms. Apr 13 2016, 7:21 AM
elukey updated the task description.
elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c'
analytics1041.eqiad.wmnet:
analytics1048.eqiad.wmnet:
analytics1026.eqiad.wmnet:
analytics1043.eqiad.wmnet:
analytics1046.eqiad.wmnet:
analytics1045.eqiad.wmnet:
analytics1027.eqiad.wmnet:
analytics1042.eqiad.wmnet:
analytics1050.eqiad.wmnet:
analytics1049.eqiad.wmnet:
analytics1044.eqiad.wmnet:
analytics1056.eqiad.wmnet:
analytics1051.eqiad.wmnet:
analytics1055.eqiad.wmnet:
          1 Hardware event. This is not a software error.
analytics1047.eqiad.wmnet:
analytics1057.eqiad.wmnet:
analytics1054.eqiad.wmnet:
analytics1053.eqiad.wmnet:
analytics1052.eqiad.wmnet:
analytics1040.eqiad.wmnet:
       9811 Hardware event. This is not a software error.
analytics1037.eqiad.wmnet:
analytics1015.eqiad.wmnet:
         47 Hardware event. This is not a software error.
analytics1031.eqiad.wmnet:
          4 Hardware event. This is not a software error.
analytics1001.eqiad.wmnet:
        374 Hardware event. This is not a software error.
analytics1034.eqiad.wmnet:
analytics1029.eqiad.wmnet:
analytics1002.eqiad.wmnet:
analytics1030.eqiad.wmnet:
analytics1036.eqiad.wmnet:
analytics1028.eqiad.wmnet:
analytics1035.eqiad.wmnet:
analytics1032.eqiad.wmnet:
      32486 Hardware event. This is not a software error.
analytics1038.eqiad.wmnet:
      22790 Hardware event. This is not a software error.
analytics1033.eqiad.wmnet:
        400 Hardware event. This is not a software error.
analytics1039.eqiad.wmnet:
      96934 Hardware event. This is not a software error.
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board.
Nuria removed a project: Analytics.

Updated list (excluding empty results):

elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c'

analytics1001.eqiad.wmnet:
        378 Hardware event. This is not a software error.

analytics1015.eqiad.wmnet:
         47 Hardware event. This is not a software error.

analytics1031.eqiad.wmnet:
          4 Hardware event. This is not a software error.

analytics1055.eqiad.wmnet:
          1 Hardware event. This is not a software error.

analytics1033.eqiad.wmnet:
        620 Hardware event. This is not a software error.

analytics1040.eqiad.wmnet:
      11288 Hardware event. This is not a software error.

analytics1032.eqiad.wmnet:
      37687 Hardware event. This is not a software error.

analytics1038.eqiad.wmnet:
      26164 Hardware event. This is not a software error.

analytics1039.eqiad.wmnet:
     111172 Hardware event. This is not a software error.
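
In case it is useful for picking which hosts to service first, below is a rough Python sketch that turns the default salt output above into a ranked list. It assumes the layout shown in this task (a `hostname:` line, optionally followed by one count line); `parse_salt_counts` is a hypothetical helper, not a salt API.

```python
import re

HOST_RE = re.compile(r"^(\S+):\s*$")
COUNT_RE = re.compile(r"^\s*(\d+) Hardware event\.")


def parse_salt_counts(text):
    """Map each host to its 'Hardware event' count (0 if no count line)."""
    counts = {}
    host = None
    for line in text.splitlines():
        m = HOST_RE.match(line)
        if m:
            host = m.group(1)
            counts[host] = 0  # hosts with empty output stay at 0
            continue
        m = COUNT_RE.match(line)
        if m and host is not None:
            counts[host] = int(m.group(1))
    return counts


# Subset of the output pasted above.
sample = """\
analytics1001.eqiad.wmnet:
        378 Hardware event. This is not a software error.
analytics1040.eqiad.wmnet:
      11288 Hardware event. This is not a software error.
analytics1032.eqiad.wmnet:
      37687 Hardware event. This is not a software error.
analytics1039.eqiad.wmnet:
     111172 Hardware event. This is not a software error.
"""

counts = parse_salt_counts(sample)
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for host, n in ranked:
    print(f"{n:>8}  {host}")
```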

@Cmjohnson we'd need some expert opinion for this task! What do you think about these alarms? The cluster often runs at high CPU utilization, so I would expect a rise in average temperatures, but...

@elukey: re-applying thermal paste is needed. Several servers have required it lately, and it appears to have fixed the issue.

@Cmjohnson sorry for the late answer! Can we schedule maintenance for a couple of servers to see if it fixes the issue? These are part of the Hadoop cluster so we can stop them anytime without any problem (just a minimal graceful stop of Java daemons is required).

elukey added a project: User-Elukey.

@Cmjohnson would you have time next week to apply the thermal paste to a couple of analytics hosts to see if they improve? I'll help shut them down beforehand.

It turns out we are out of thermal paste onsite, but I'll order more. Chris will be out for the majority of next week, but the paste will arrive while he is gone. Once he is back, he'll be able to apply some to affected hosts. (The order is via T159550.)

@elukey I have the thermal paste....want to plan for this on Thursday morning (my morning)?

Sure! Ping me on IRC when you are ready and I'll shut down one Hadoop node! Thanks!

Mentioned in SAL (#wikimedia-operations) [2017-03-30T17:32:55Z] <elukey> shutdown analytics1039 to apply new thermal paste - T132256

Chris applied the thermal paste and the host is up and running again. I will watch mcelog over the next few days to see if anything changes.

I checked mcelog today and the thermal errors stopped right after Chris applied the thermal paste!

New list of affected hosts:

elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c' --output=raw | grep Hardware | sort
{'analytics1001.eqiad.wmnet': '    410 Hardware event. This is not a software error.'}
{'analytics1002.eqiad.wmnet': '      4 Hardware event. This is not a software error.'}
{'analytics1028.eqiad.wmnet': '   6181 Hardware event. This is not a software error.'}
{'analytics1031.eqiad.wmnet': '   1989 Hardware event. This is not a software error.'}
{'analytics1032.eqiad.wmnet': ' 104796 Hardware event. This is not a software error.'}
{'analytics1033.eqiad.wmnet': '   8821 Hardware event. This is not a software error.'}
{'analytics1038.eqiad.wmnet': '  80371 Hardware event. This is not a software error.'}
{'analytics1040.eqiad.wmnet': '   1871 Hardware event. This is not a software error.'}
{'analytics1041.eqiad.wmnet': '    115 Hardware event. This is not a software error.'}
{'analytics1052.eqiad.wmnet': '     18 Hardware event. This is not a software error.'}
{'analytics1055.eqiad.wmnet': '      2 Hardware event. This is not a software error.'}
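
The `--output=raw` lines above are Python dict literals, so they can be read back with `ast.literal_eval`. The sketch below does that and flags hosts above a cutoff; the threshold of 1000 events is an arbitrary assumption for illustration, not a value anyone agreed on in this task, and `parse_raw_line` is a hypothetical helper.

```python
import ast


def parse_raw_line(line):
    """Parse one --output=raw line into (host, count)."""
    record = ast.literal_eval(line.strip())  # {'host': '  410 Hardware event...'}
    (host, output), = record.items()         # exactly one host per line
    return host, int(output.split()[0])


# Subset of the lines pasted above.
raw = [
    "{'analytics1001.eqiad.wmnet': '    410 Hardware event. This is not a software error.'}",
    "{'analytics1028.eqiad.wmnet': '   6181 Hardware event. This is not a software error.'}",
    "{'analytics1032.eqiad.wmnet': ' 104796 Hardware event. This is not a software error.'}",
    "{'analytics1038.eqiad.wmnet': '  80371 Hardware event. This is not a software error.'}",
    "{'analytics1055.eqiad.wmnet': '      2 Hardware event. This is not a software error.'}",
]

THRESHOLD = 1000  # assumed cutoff, purely illustrative

counts = dict(parse_raw_line(line) for line in raw)
candidates = sorted(h for h, n in counts.items() if n > THRESHOLD)
print(candidates)
```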

@Cmjohnson - Would you have time to apply new thermal paste to the above hosts? This is not super pressing and can be done even in a month, but it is definitely not a healthy state for the Analytics cluster :)

Ottomata mentioned this in Unknown Object (Task). Mar 31 2017, 3:15 PM

Tried again today:

===== NODE GROUP =====
(1) analytics1060.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
      1 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1032.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
   2245 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1029.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
      8 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1033.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
   1075 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1037.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
     24 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1040.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
   3890 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1038.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
    615 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1041.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
    138 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) analytics1028.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
    196 Hardware event. This is not a software error.

@Cmjohnson would it be possible, as a next step, to apply the thermal paste to analytics1032, analytics1033 and analytics1040?

This is also interesting:

===== NODE GROUP =====
(1) kafka1018.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
    191 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) kafka1014.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
 370334 Hardware event. This is not a software error.
===== NODE GROUP =====
(1) kafka1012.eqiad.wmnet
----- OUTPUT of 'grep "Hardware e...mcelog | uniq -c' -----
  65438 Hardware event. This is not a software error.

These nodes are critical to us, so I'd really like to apply the new thermal paste to one of the above and see how it goes (if possible).

Mentioned in SAL (#wikimedia-operations) [2017-05-09T16:00:59Z] <elukey> stopping Hadoop daemons and shutting down analytics[1032-1033,1040].eqiad.wmnet - T132256

analytics[1032-1033,1040].eqiad.wmnet have had the thermal paste replaced. One observation on all 3 is that the cpu1 paste was nearly non-existent.

@Cmjohnson thanks! analytics1040 shows memory errors on boot and I wasn't able to power it on. Would you mind checking whenever you have time?

Checked on analytics10[32,33] and mcelog shows no events after Chris' maintenance.

Hosts remaining to do:

  • analytics1060.eqiad.wmnet
  • analytics1029.eqiad.wmnet
  • analytics1037.eqiad.wmnet
  • analytics1038.eqiad.wmnet
  • analytics1041.eqiad.wmnet
  • analytics1028.eqiad.wmnet

@Cmjohnson do you have time in the next few days to do a couple of hosts?

@elukey we finished these...correct?

Nope, still to do :)

Updated list: analytics1028, 1029, 1037, 1060, 1050, 1041

The row distribution is:

elukey@neodymium:~$ sudo cumin 'analytics10[28,29,37,50,60,41]*' 'lldpcli show neighbors | grep SysName'

===== NODE GROUP =====
(2) analytics[1028-1029].eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw-c-eqiad
===== NODE GROUP =====
(1) analytics1060.eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw-a-eqiad
===== NODE GROUP =====
(2) analytics[1037,1041].eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw2-d-eqiad
===== NODE GROUP =====
(1) analytics1050.eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw-b-eqiad

So if we could do maintenance by row it would be much better. Not an urgent job though :)
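
To plan the maintenance by row, the cumin output above can be grouped mechanically. This is only a sketch: it assumes the row is the single letter in the switch name (`asw-c-eqiad` means row C, `asw2-d-eqiad` means row D), which is how the output above is being read here, and `expand_nodeset` is a small hypothetical helper that only handles the bracket forms seen in this task.

```python
import re
from collections import defaultdict


def expand_nodeset(nodeset):
    """Expand e.g. 'analytics[1028-1029].eqiad.wmnet' into individual hosts."""
    m = re.match(r"^(.*?)\[([^\]]+)\](.*)$", nodeset)
    if not m:
        return [nodeset]  # plain hostname, nothing to expand
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{len(lo)}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


def hosts_by_row(cumin_output):
    """Group hosts by the row letter found in the SysName of their switch."""
    rows = defaultdict(list)
    pending = []
    for line in cumin_output.splitlines():
        line = line.strip()
        m = re.match(r"^\(\d+\)\s+(\S+)$", line)
        if m:
            pending = expand_nodeset(m.group(1))
            continue
        m = re.search(r"SysName:\s+asw\d*-([a-z])-eqiad", line)
        if m:
            rows[m.group(1)].extend(pending)
            pending = []
    return dict(rows)


sample = """\
===== NODE GROUP =====
(2) analytics[1028-1029].eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw-c-eqiad
===== NODE GROUP =====
(1) analytics1060.eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw-a-eqiad
===== NODE GROUP =====
(2) analytics[1037,1041].eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw2-d-eqiad
===== NODE GROUP =====
(1) analytics1050.eqiad.wmnet
----- OUTPUT of 'lldpcli show nei...s | grep SysName' -----
    SysName:      asw-b-eqiad
"""

rows = hosts_by_row(sample)
for row in sorted(rows):
    print(row.upper(), rows[row])
```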

elukey changed the task status from Open to Stalled. Oct 6 2017, 1:22 PM

Can we go ahead and close the ticket?