
some elasticsearch servers in eqiad have overheating CPUs
Closed, Resolved · Public

Description

elastic1017.eqiad.wmnet shut down on its own on Saturday, June 24 at 08:21. Looking at kern.log I see a number of "Package temperature above threshold, cpu clock throttled" messages, which indicate overheating. A quick look with cumin (see log below) shows that a number of the older elasticsearch servers have similar behaviour, even if not as bad as elastic1017. The following servers have a non-zero count of "Package temperature above threshold" in kern.log:

  • elastic1017
  • elastic[1019-1020]
  • elastic[1023-1026]

It might make sense to reapply thermal paste on elastic1017-elastic1031 (the oldest batch of elasticsearch servers).

@Cmjohnson what do you think? Should we do it? Should we do the whole batch or just the ones that are exposing the issue? Ping me to arrange a schedule to take those servers down if needed. We can easily take 4 of them down at the same time, but it takes some time for the cluster to recover between two groups.
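As a minimal sketch of how recovery between two groups can be watched (assuming direct access to the Elasticsearch HTTP API on port 9200, with elastic1017 as an arbitrary entry point):

# check cluster status; it should be back to "green" before the next group goes down
curl -s 'http://elastic1017.eqiad.wmnet:9200/_cluster/health?pretty'
# or block until the cluster is green again (gives up after 30 minutes)
curl -s 'http://elastic1017.eqiad.wmnet:9200/_cluster/health?wait_for_status=green&timeout=30m&pretty'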

gehel@neodymium:~$ sudo cumin 'elastic*.wmnet' 'grep "Package temperature above threshold" /var/log/kern.log | wc -l'
72 hosts will be targeted:
elastic[2001-2036].codfw.wmnet,elastic[1017-1052].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                          
(1) elastic1025.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
4223                                                                            
===== NODE GROUP =====                                                          
(1) elastic1020.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
173                                                                             
===== NODE GROUP =====                                                          
(1) elastic1019.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
4286                                                                            
===== NODE GROUP =====                                                          
(65) elastic[2001-2036].codfw.wmnet,elastic[1018,1021-1022,1027-1052].eqiad.wmnet                                                                               
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
0                                                                               
===== NODE GROUP =====                                                          
(1) elastic1017.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
19611                                                                           
===== NODE GROUP =====                                                          
(1) elastic1023.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
30                                                                              
===== NODE GROUP =====                                                          
(1) elastic1024.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
155                                                                             
===== NODE GROUP =====                                                          
(1) elastic1026.eqiad.wmnet                                                     
----- OUTPUT of 'grep "Package te...kern.log | wc -l' -----                     
4189                                                                            
================                                                                
PASS |███████████████████████| 100% (72/72) [00:00<00:00, 83.64hosts/s]         
FAIL |                                |   0% (0/72) [00:00<?, ?hosts/s]         
100.0% (72/72) success ratio (>= 100.0% threshold) for command: 'grep "Package te...kern.log | wc -l'.
100.0% (72/72) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Event Timeline

I have some suspicion that this is also related to our icinga latency warning triggering for cirrussearch results. Not entirely sure, but the timing seems related.

@Gehel, I have seen the task and will be on vacation July 3-10. Let's plan on some time after the 10th to do this.

Thanks

@Cmjohnson - @Gehel is on vacation until July 21st, so hopefully you two can re-connect at that time. :)

I think we should be able to take care of this before @Gehel comes back; the main sticking point will be having someone from ops able to depool elasticsearch servers while they are being worked on. @dcausse or I can handle removing nodes from the cluster, but depooling from LVS requires someone in ops. I suppose taking the nodes down should lead to LVS noticing and not sending requests there until they come back online, but proper depooling is probably preferred.
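For reference, "removing nodes from the cluster" presumably means excluding them from shard allocation. A minimal sketch using the Elasticsearch cluster settings API (the entry-point host and the exact node names are assumptions):

# ban a group of nodes from shard allocation before taking them down
curl -s -H 'Content-Type: application/json' -XPUT \
  'http://elastic1017.eqiad.wmnet:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.exclude._name": "elastic1017,elastic1018,elastic1019" } }'

# un-ban them once the work is done
curl -s -H 'Content-Type: application/json' -XPUT \
  'http://elastic1017.eqiad.wmnet:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.exclude._name": null } }'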

We think that these temperature issues may exacerbate the load issues we see on the elasticsearch cluster in eqiad.
Looking at this graph:

Sélection_015.png (attached graph, 266 KB)

We can see that CPU0 is overheating even when the system is not under high load.
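For a quick on-host check (assuming lm-sensors is installed), the package temperature and the throttling count can be read directly:

# current CPU package temperatures as reported by lm-sensors
sensors | grep -i 'package id'
# number of throttling events logged so far on this host
grep -c 'Package temperature above threshold' /var/log/kern.log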

Mentioned in SAL (#wikimedia-operations) [2017-07-31T12:39:01Z] <gehel> banning elastic10(17|18|19|20) to prepare for thermal paste - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T12:56:08Z] <gehel> un-banning elastic1020 since it seems to have impact on cluster performances - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T14:08:04Z] <gehel> shutting down elastic10(17|18|19) for thermal paste - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T14:36:19Z] <gehel> un-banning and repooling elastic10(17|18|19) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T14:36:57Z] <gehel> banning and repooling elastic10(20|21) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T16:15:37Z] <gehel> depooling and shutting down elastic102[0123] for thermal paste - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T16:30:51Z] <gehel> mistaken restart of elastic1030 as part of T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T17:25:28Z] <gehel> un-banning and repooling elastic102[012] - T168816

Mentioned in SAL (#wikimedia-operations) [2017-07-31T17:52:40Z] <gehel> un-banning and repooling elastic1023 - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-02T16:43:31Z] <gehel> depooling and shutting down elastic102[4567] - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-02T18:18:05Z] <gehel> un-banning and repooling elastic102[4567] - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-03T14:01:10Z] <gehel> depooling and shutting down elastic10(28|29|30|31|32) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-03T14:52:18Z] <gehel> unbanning and repooling elastic10(29|30|31|32) - T168816

Mentioned in SAL (#wikimedia-operations) [2017-08-03T15:02:39Z] <gehel> unbanning and repooling elastic1028 - T168816

elastic1017-1031 have had thermal paste reapplied. Looking at grafana, this seems to have had the expected effect of lowering CPU temperatures.

Checking a few servers on grafana, it looks like temperatures are still down. This can be closed.