
db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad)
Closed, Resolved, Public

Description

After the reboot for the move (T163895), db1063 has started to show IO issues: it is lagging behind, by as much as one hour, while nothing exceptional is going on on its current master: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&from=1493385305259&to=1493406905259

This is what I looked at:

  • It replicates (it is not stuck), just very slowly: it fell 1 hour behind in 4 hours
  • The BBU state is reported as Optimal, but it is too hot:

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3939 mV
Current: 0 mA
Temperature: 78 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : High
  Learn Cycle Requested                   : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x8238 
Relative State of Charge: 100 %
Charger Status: Unknown
Remaining Capacity: 529 mAh
  • The disks do not have errors:
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
Media Error Count: 0
Other Error Count: 0
Drive has flagged a S.M.A.R.T alert : No
  • If I enable non-transactional persistence (disable fsync per commit and for binlogs), it recovers well
  • Other slaves do not have issues keeping up with the codfw master with durable settings
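
To put the "1 hour behind in 4 hours" figure in perspective, here is a rough sketch of the arithmetic. It assumes the lag grew steadily from zero over the window, which the description does not state explicitly; `apply_rate_percent` is a hypothetical helper name.

```shell
# Hypothetical helper: given the lag accumulated (hours) over a wall-clock
# window (hours), print the share of the master's event stream the replica
# managed to apply, as a whole percentage.
apply_rate_percent() {
    awk -v lag="$1" -v win="$2" \
        'BEGIN { printf "%d\n", 100 * (win - lag) / win }'
}

# 1 hour of lag accumulated over 4 hours of wall-clock time
apply_rate_percent 1 4   # prints 75
```

Under that assumption the replica was applying roughly three quarters of what the master produced.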

Event Timeline

SET GLOBAL innodb_flush_log_at_trx_commit=0;
SET GLOBAL sync_binlog=0;

Seems to be helping. I had tried disabling semi_sync replication, but that didn't work.

Oh, I got it:

Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

The only reason I can see is:

Temperature: 78 C
Temperature                             : High

while on db1062 I see:

Temperature: 47 C
Temperature                             : OK
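
For comparing hosts quickly, the two Temperature lines can be pulled out of the BBU status output with a small filter. This is only a sketch: `bbu_temperature` is a made-up name, and it assumes the line layout shown in this task.

```shell
# Hypothetical helper: reads `megacli -AdpBbuCmd -GetBbuStatus -a0` output on
# stdin and prints "<celsius> <firmware temperature status>".
bbu_temperature() {
    awk '
        /^Temperature: [0-9]+ C/ { celsius = $2 }   # e.g. "Temperature: 78 C"
        /^ *Temperature +: /     { status = $NF }   # e.g. "  Temperature   : High"
        END { print celsius, status }
    '
}

# Using the two lines seen on db1063:
printf '%s\n' 'Temperature: 78 C' \
    '  Temperature                             : High' | bbu_temperature
# prints: 78 High
```

On a real host the input would come from the megacli command itself, piped into the function.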

I am going to do a reboot and see if something happens.

Mentioned in SAL (#wikimedia-operations) [2017-04-28T19:30:00Z] <jynus> shutting down db1063 - I see high temperatures reported, and going up T164107

On boot:

$ megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature
Temperature: 64 C
  Temperature                             : OK
$ cat /sys/class/thermal/thermal_zone*/temp
61000
60000

This is now ok, but it is getting hotter:

$ megacli -LDInfo -L0 -a0 | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
$ megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature
Temperature: 68 C
  Temperature                             : OK
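
The Default vs Current cache policy comparison is what gave the problem away earlier, so it may be worth checking mechanically. A rough sketch, assuming the `-LDInfo` line layout shown above; `cache_policy_matches` is a hypothetical helper, not part of megacli.

```shell
# Hypothetical helper: reads `megacli -LDInfo ...` output on stdin.
# Exit status 0 if every "Current Cache Policy" equals the preceding
# "Default Cache Policy", 1 otherwise (the differing policy is printed).
cache_policy_matches() {
    awk -F': ' '
        /^Default Cache Policy:/ { def = $2 }
        /^Current Cache Policy:/ {
            if ($2 != def) { print "drift: " $2; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# The state seen before the reboot (controller fell back to WriteThrough):
printf '%s\n' \
    'Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU' \
    'Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU' \
    | cache_policy_matches || echo "cache policy differs from default"
```

A nonzero exit status here only says the two lines differ; it does not say why the controller changed policy.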

labsdb1011 which is in the same rack:

Controller Temperature (C): 60
jcrespo renamed this task from db1063 io (s5 master eqiad) performance is bad to db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad). Apr 28 2017, 8:04 PM
jcrespo added a project: DC-Ops.

I have forced:

megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

The server will get fried, but at least we won't have lag.

Fans and the other sensors look fine though:

12  | Fan1 RPM         | Fan                      | 3960.00    | RPM   | 'OK'
13  | Fan2 RPM         | Fan                      | 4080.00    | RPM   | 'OK'
14  | Fan3 RPM         | Fan                      | 4200.00    | RPM   | 'OK'
15  | Fan4 RPM         | Fan                      | 3000.00    | RPM   | 'OK'
16  | Fan5 RPM         | Fan                      | 3240.00    | RPM   | 'OK'
17  | Fan6 RPM         | Fan                      | 3360.00    | RPM   | 'OK'
18  | Inlet Temp       | Temperature              | 31.00      | C     | 'OK'
19  | Exhaust Temp     | Temperature              | 53.00      | C     | 'OK'
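
One number that is easy to derive from this table is the difference between exhaust and inlet temperature. A small sketch, assuming the pipe-delimited layout shown above with both an "Inlet Temp" and an "Exhaust Temp" row; `exhaust_inlet_delta` is a hypothetical name.

```shell
# Hypothetical helper: reads the pipe-delimited sensor table on stdin and
# prints exhaust minus inlet temperature in whole degrees Celsius.
exhaust_inlet_delta() {
    awk -F'|' '
        $2 ~ /Inlet Temp/   { inlet = $4 + 0 }
        $2 ~ /Exhaust Temp/ { exhaust = $4 + 0 }
        END { printf "%d\n", exhaust - inlet }
    '
}

# With the two db1063 rows from the table:
printf '%s\n' \
    "18  | Inlet Temp       | Temperature              | 31.00      | C     | 'OK'" \
    "19  | Exhaust Temp     | Temperature              | 53.00      | C     | 'OK'" \
    | exhaust_inlet_delta
# prints: 22
```

The result is only meaningful when both rows are present in the input.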

However:

root@db1063:~#  megacli -AdpBbuCmd  -a0 | grep Temper
Temperature: 77 C
  Temperature                             : High

And restbas1008 (which is an HP, not a Dell), in the same rack:

11  | Fan1A            | Fan                      | 3360.00    | RPM   | 'OK'
12  | Fan1B            | Fan                      | 3000.00    | RPM   | 'OK'
13  | Fan2A            | Fan                      | 3480.00    | RPM   | 'OK'
14  | Fan2B            | Fan                      | 3000.00    | RPM   | 'OK'
15  | Fan3A            | Fan                      | 3960.00    | RPM   | 'OK'
16  | Fan3B            | Fan                      | 3480.00    | RPM   | 'OK'
17  | Fan4A            | Fan                      | 3840.00    | RPM   | 'OK'
18  | Fan4B            | Fan                      | 3600.00    | RPM   | 'OK'
19  | Fan5A            | Fan                      | 3600.00    | RPM   | 'OK'
20  | Fan5B            | Fan                      | 3360.00    | RPM   | 'OK'
21  | Fan6A            | Fan                      | 3720.00    | RPM   | 'OK'
22  | Fan6B            | Fan                      | 3360.00    | RPM   | 'OK'
23  | Inlet Temp       | Temperature              | 21.00      | C     | 'OK'

There is nothing in the controller's log apart from the automatic switch to WriteThrough when it first detected that the BBU temperature was high:

seqNum: 0x00001e36
Time: Fri Apr 28 20:05:18 2017

Code: 0x00000091
Class: 1
Locale: 0x08
Event Description: Battery temperature is high
Event Data:
===========
None


seqNum: 0x00001e37
Time: Fri Apr 28 20:05:19 2017

Code: 0x000000c3
Class: 1
Locale: 0x08
Event Description: BBU disabled; changing WB virtual disks to WT, Forced WB VDs are not affected
Event Data:
===========
None
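
Event dumps in this format are long, so a one-line-per-event view helps when looking for the moment the policy changed. A minimal sketch, assuming each event carries a "Time:" line followed by an "Event Description:" line as above; `summarize_events` is a hypothetical name.

```shell
# Hypothetical helper: reads a controller event log dump on stdin and prints
# "<time> | <description>" for each event.
summarize_events() {
    awk '
        /^Time: /              { sub(/^Time: /, ""); when = $0 }
        /^Event Description: / { sub(/^Event Description: /, ""); print when " | " $0 }
    '
}

# With the first event from the log above:
printf '%s\n' 'seqNum: 0x00001e36' 'Time: Fri Apr 28 20:05:18 2017' '' \
    'Code: 0x00000091' 'Event Description: Battery temperature is high' \
    | summarize_events
# prints: Fri Apr 28 20:05:18 2017 | Battery temperature is high
```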
MCE 0
CPU 2 THERMAL EVENT TSC 3f67e99385dbc7 
TIME 1492766490 Fri Apr 21 09:21:30 2017
Processor 2 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
Running trigger `unknown-error-trigger'
STATUS 88000bc3 MCGSTATUS 0
MCGCAP 1000c19 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
MCE 1
CPU 2 THERMAL EVENT TSC 3f67e993860bd9 
TIME 1492766490 Fri Apr 21 09:21:30 2017
Processor 2 below trip temperature. Throttling disabled
Running trigger `unknown-error-trigger'
mcelog: Too many trigger children running already
STATUS 88030a82 MCGSTATUS 0
MCGCAP 1000c19 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
MCE 0
CPU 2 THERMAL EVENT TSC 436c547679ece9 
TIME 1493201400 Wed Apr 26 10:10:00 2017
Processor 2 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
Running trigger `unknown-error-trigger'
STATUS 88000bc3 MCGSTATUS 0
MCGCAP 1000c19 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
MCE 1
CPU 2 THERMAL EVENT TSC 436c54767a22cf 
TIME 1493201400 Wed Apr 26 10:10:00 2017
Processor 2 below trip temperature. Throttling disabled
Running trigger `unknown-error-trigger'
mcelog: Too many trigger children running already
STATUS 88020a82 MCGSTATUS 0
MCGCAP 1000c19 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 62

Is that error coming from the CPU or BBU?
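
The mcelog entries name a processor in each "heated above trip temperature" line, so one way to look at this is to count those events per processor. A rough sketch, assuming the message wording shown above; `count_throttle_events` is a hypothetical name.

```shell
# Hypothetical helper: reads mcelog output on stdin and prints how many
# "heated above trip temperature" events were logged per processor.
count_throttle_events() {
    awk '
        /heated above trip temperature/ { n[$2]++ }
        END { for (cpu in n) print "Processor " cpu ": " n[cpu] }
    '
}

# With the two throttling lines from the log above:
printf '%s\n' \
    'Processor 2 heated above trip temperature. Throttling enabled.' \
    'Processor 2 below trip temperature. Throttling disabled' \
    'Processor 2 heated above trip temperature. Throttling enabled.' \
    | count_throttle_events
# prints: Processor 2: 2
```

This only counts what mcelog recorded; it says nothing about the BBU sensor, which is read separately through megacli.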

At least it has not gone up:

Temperature: 78 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : High

I have been trying to increase the fan speed via ipmitool raw commands, but apparently that is not possible on the R720. The documentation is very vague.
I don't know if @RobH or @Cmjohnson might know more details about how the fans/ipmitool work on this chassis.
However, I am not sure it would help with the BBU issue anyway.

This is what I found at http://www.dell.com/support/article/us/en/19/SLN285596/drac---how-to-set-fan-speed-offset-values-in-idrac7-without-reboot?lang=EN

FAN speed Offset settings could be modified using ipmitool command. This will be supported feature in next racadm release for iDRAC7.

Install Dell BMC tools on a management station and ensure IPMI over LAN is enabled in iDRAC Network Configuration.

To set value to Low FAN speed offset run command:

ipmitool -I lanplus -H ipaddress of idrac -U username -P password raw 0x30 0xCE 0x00 0x09 0x07 0x00 0x00 0x00 0x07 0x00 0x02 0x02 0x02 0x00 0x00

To set value to high FAN speed offset run command:

ipmitool -I lanplus -H ipaddress of idrac -U username -P password raw 0x30 0xCE 0x00 0x09 0x07 0x00 0x00 0x00 0x07 0x00 0x02 0x02 0x02 0x01 0x00

FAN speed would change immediately after running the command successfully.
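
The two commands quoted above differ only in the second-to-last byte (0x00 for low, 0x01 for high). A small sketch that builds the command line without running it; `fan_offset_cmd` and the host name are hypothetical, the byte sequence is copied from the quoted article and has not been verified on this chassis.

```shell
# Hypothetical helper: prints (does not execute) the ipmitool command quoted
# from the Dell article for a "low" or "high" fan speed offset.
# Usage: fan_offset_cmd HOST USER low|high
fan_offset_cmd() {
    case "$3" in
        low)  offset_byte=0x00 ;;
        high) offset_byte=0x01 ;;
        *)    echo "usage: fan_offset_cmd HOST USER low|high" >&2; return 1 ;;
    esac
    echo "ipmitool -I lanplus -H $1 -U $2 -P <password> raw 0x30 0xCE 0x00 0x09 0x07 0x00 0x00 0x00 0x07 0x00 0x02 0x02 0x02 $offset_byte 0x00"
}

# idrac.example is a placeholder, not a real management address.
fan_offset_cmd idrac.example root high
```

Printing the command first makes it easier to review the raw bytes before sending anything to the BMC.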

$ ssh db1059.eqiad.wmnet 
root@db1059:~$ cat /sys/class/thermal/thermal_zone*/temp
56000
49000
root@db1059:~$ megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature
Temperature: 50 C
  Temperature                             : OK

$ ssh db1060.eqiad.wmnet 
root@db1060:~$ cat /sys/class/thermal/thermal_zone*/temp
52000
49000
root@db1060:~$ megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature
Temperature: 44 C
  Temperature                             : OK

Mentioned in SAL (#wikimedia-operations) [2017-05-01T15:46:55Z] <jynus> shutting down db1063 for maintenance T164107

I would support doing a switchover to db1049, just to be on the safe side for next week's switchover.

Manuel, I am looking at some options now with Chris (air flow, PSU, ...). We (and I mean I) will think about that on Tuesday, depending on the state we end up in today. Don't worry about it for now.

root@db1063:~$ megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature
Temperature: 49 C
  Temperature                             : OK
root@db1063:~$ cat /sys/class/thermal/thermal_zone*/temp
68000
52000

Apparently, this was the only thing needed:

[16:24:19] <cmjohnson1> marostegui: draining the flea power is what dell recommends
[16:24:26] <cmjohnson1> for just about everything!

But @Cmjohnson feel free to add extra comments here to help yourself or anyone else solving similar issues in the future. Thank you!

I will now revert the RAID changes.

jcrespo assigned this task to Cmjohnson.

Executed:

megacli -LDSetProp -NoCachedBadBBU -Immediate -Lall -aAll
$ megacli -LDInfo -LAll -aAll | grep 'Cache Policy'
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy   : Disk's Defaul

Excellent news! Thanks guys for working this out successfully!

db1063 is definitely in good shape to keep being the master for the switchover:

root@db1063:~#  megacli -AdpBbuCmd  -a0 | grep Tem
Temperature: 51 C
  Temperature

Actually, temperatures have apparently shifted sides:

$ cat /sys/class/thermal/thermal_zone*/temp
67000
53000