
some elasticsearch servers in eqiad have overheating CPUs
Closed, Resolved · Public

Description

elastic1017.eqiad.wmnet shut down on its own on Saturday, June 24 at 08:21. Looking at kern.log I see a number of "Package temperature above threshold, cpu clock throttled" messages, which indicate overheating. A quick look with cumin (see log below) shows that a number of the older elasticsearch servers have similar behaviour, even if not as bad as elastic1017. The following servers have a non-zero count of "Package temperature above threshold" in kern.log:

  • elastic1017
  • elastic[1019-1020]
  • elastic[1023-1026]

It might make sense to reapply thermal paste on elastic1017-elastic1031 (the oldest batch of elasticsearch servers).

@Cmjohnson what do you think? Should we do it? Should we do the whole batch or just the ones that are exposing the issue? Ping me to arrange a schedule to take those servers down if needed. We can easily take 4 of them down at the same time, but it takes some time for the cluster to recover between two groups.
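As a minimal sketch of how recovery between two groups can be watched (assuming direct access to the Elasticsearch HTTP API on port 9200, with elastic1017 as an arbitrary entry point):

# check cluster status; it should be back to "green" before the next group goes down
curl -s 'http://elastic1017.eqiad.wmnet:9200/_cluster/health?pretty'
# or block until the cluster is green again (gives up after 30 minutes)
curl -s 'http://elastic1017.eqiad.wmnet:9200/_cluster/health?wait_for_status=green&timeout=30m&pretty'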

gehel@neodymium:~$ sudo cumin 'elastic*.wmnet' 'grep "Package temperature above threshold" /var/log/kern.log | wc -l'
72 hosts will be targeted:
elastic[2001-2036].codfw.wmnet,elastic[1017-1052].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                          
(1) elastic1025.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
4223                                                                            
===== NODE GROUP =====                                                          
(1) elastic1020.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
173                                                                             
===== NODE GROUP =====                                                          
(1) elastic1019.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
4286                                                                            
===== NODE GROUP =====                                                          
(65) elastic[2001-2036].codfw.wmnet,elastic[1018,1021-1022,1027-1052].eqiad.wmnet                                                                               
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
0                                                                               
===== NODE GROUP =====                                                          
(1) elastic1017.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
19611                                                                           
===== NODE GROUP =====                                                          
(1) elastic1023.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
30                                                                              
===== NODE GROUP =====                                                          
(1) elastic1024.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
155                                                                             
===== NODE GROUP =====                                                          
(1) elastic1026.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
4189                                                                            
================                                                                
PASS |███████████████████████| 100% (72/72) [00:00<00:00, 83.64hosts/s]         
FAIL |                                |   0% (0/72) [00:00<?, ?hosts/s]         
100.0% (72/72) success ratio (>= 100.0% threshold) for command: 'grep "Package te...kern.log | wc -l'.
100.0% (72/72) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Event Timeline

I have some suspicion that this is also related to our icinga latency warning triggering for cirrussearch results. Not entirely sure, but the timing seems related.

@Gehel, I have seen the task and will be on vacation July 3-10. Let's plan on some time after the 10th to do this.

Thanks

@Cmjohnson - @Gehel is on vacation until July 21st, so hopefully you two can re-connect at that time. :)

I think we should be able to take care of this before @Gehel comes back; the main sticking point will be having someone from ops able to depool elasticsearch servers while they are being worked on. @dcausse or I can handle removing nodes from the cluster, but depooling from LVS requires someone in ops. I suppose taking the nodes down should lead to LVS noticing and not sending requests there until they come back online, but proper depooling is probably preferred.
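For reference, "removing nodes from the cluster" presumably means excluding them from shard allocation. A minimal sketch using the Elasticsearch cluster settings API (the entry-point host and the exact node names are assumptions):

# ban a group of nodes from shard allocation before taking them down
curl -s -H 'Content-Type: application/json' -XPUT \
  'http://elastic1017.eqiad.wmnet:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.exclude._name": "elastic1017,elastic1018,elastic1019" } }'

# un-ban them once the work is done
curl -s -H 'Content-Type: application/json' -XPUT \
  'http://elastic1017.eqiad.wmnet:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.exclude._name": null } }'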

We think that these temperature issues may exacerbate the load issues we see on the elasticsearch cluster in eqiad.
Looking at this graph:

Sélection_015.png (attached graph, 266 KB)

We can see that CPU0 is overheating even when the system is not under high load.
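For a quick on-host check (assuming lm-sensors is installed), the package temperature and the throttling count can be read directly:

# current CPU package temperatures as reported by lm-sensors
sensors | grep -i 'package id'
# number of throttling events logged so far on this host
grep -c 'Package temperature above threshold' /var/log/kern.log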

Mentioned in SAL (#wikimedia-operations) [2017-07-31T12:39:01Z] <gehel> banning elastic10(17|18|19|20) to prepare for thermal paste - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T12:56:08Z] <gehel> un-banning elastic1020 since it seems to have impact on cluster performances - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T14:08:04Z] <gehel> shutting down elastic10(17|18|19) for thermal paste - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T14:36:19Z] <gehel> un-banning and repooling elastic10(17|18|19) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T14:36:57Z] <gehel> banning and repooling elastic10(20|21) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T16:15:37Z] <gehel> depooling and shutting down elastic102[0123] for thermal paste - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T16:30:51Z] <gehel> mistaken restart of elastic1030 as part of T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T17:25:28Z] <gehel> un-banning and repooling elastic102[012] - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T17:52:40Z] <gehel> un-banning and repooling elastic1023 - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-02T16:43:31Z] <gehel> depooling and shutting down elastic102[4567] - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-02T18:18:05Z] <gehel> un-banning and repooling elastic102[4567] - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-03T14:01:10Z] <gehel> depooling and shutting down elastic10(28|29|30|31|32) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-03T14:52:18Z] <gehel> unbanning and repooling elastic10(29|30|31|32) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-03T15:02:39Z] <gehel> unbanning and repooling elastic1028 - T168816

elastic1017-1031 have had thermal paste reapplied. Looking at grafana, this seems to have had the expected effect of lowering CPU temperatures.

Checking a few servers on grafana, it looks like temperatures are still down. This can be closed.