Since the reboot for the move (T163895), db1063 has started to show IO issues: it is lagging behind even though nothing exceptional is going on on its current master. It has lagged as much as one hour behind: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&from=1493385305259&to=1493406905259
This is what I looked at:
- It replicates (it is not stuck), just very slowly: it fell 1 hour behind over 4 hours
- The BBU state is Optimal, but it is running too hot (78 C, Temperature: High):
  BBU status for Adapter: 0
  BatteryType: BBU
  Voltage: 3939 mV
  Current: 0 mA
  Temperature: 78 C
  Battery State: Optimal
  BBU Firmware Status:
    Charging Status : None
    Voltage : OK
    Temperature : High
    Learn Cycle Requested : No
    Learn Cycle Active : No
    Learn Cycle Status : OK
    Learn Cycle Timeout : No
    I2c Errors Detected : No
    Battery Pack Missing : No
    Battery Replacement required : No
    Remaining Capacity Low : No
    Periodic Learn Required : No
    Transparent Learn : No
    No space to cache offload : No
    Pack is about to fail & should be replaced : No
    Cache Offload premium feature required : No
    Module microcode update required : No
  BBU GasGauge Status: 0x8238
  Relative State of Charge: 100 %
  Charger Status: Unknown
  Remaining Capacity: 529 mAh
- The disks do not have errors:
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
  Media Error Count: 0  Other Error Count: 0  Drive has flagged a S.M.A.R.T alert : No
- If I enable non-transactional persistence (disabling fsync per commit and for the binlogs), it recovers well
- Other slaves do not have issues keeping up with the codfw master with durable settings
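
To put a number on "1 hour behind in 4 hours", here is a rough sketch of the replica's effective apply rate. It assumes the lag started at zero at the beginning of the window and that the master wrote at a constant rate; `apply_rate` is a name made up for this illustration.

```python
def apply_rate(lag_seconds, window_seconds):
    """Fraction of the master's event stream the replica applied in the window.

    Assumes lag was zero at the start of the window and that the master
    produced events at a constant rate (both are simplifications).
    """
    return (window_seconds - lag_seconds) / window_seconds


# 1 hour of lag accumulated over a 4 hour window
rate = apply_rate(lag_seconds=3600, window_seconds=4 * 3600)
print(f"replica applied {rate:.0%} of the master's writes")
```

Under those assumptions the host applied about three hours of master writes in four hours, so with durable settings it is falling behind rather than slowly catching up.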
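
For reference, "disable fsync per commit and for binlogs" usually maps to the two standard MySQL/MariaDB variables `innodb_flush_log_at_trx_commit` and `sync_binlog`. The sketch below only prints the statements; the relaxed values shown are an assumption, since the exact values used on db1063 are not recorded here, and `set_global_statements` is a hypothetical helper.

```python
# Fully durable: fsync the InnoDB redo log and the binlog on every commit.
DURABLE = {"innodb_flush_log_at_trx_commit": 1, "sync_binlog": 1}

# Relaxed (assumed values, not taken from this ticket): redo log is synced
# about once per second, and binlog syncing is left to the OS.
RELAXED = {"innodb_flush_log_at_trx_commit": 2, "sync_binlog": 0}


def set_global_statements(settings):
    """Build the SET GLOBAL statements for a dict of variable -> value."""
    return [f"SET GLOBAL {name} = {value};" for name, value in settings.items()]


for statement in set_global_statements(RELAXED):
    print(statement)
```

Both variables are dynamic, so they can be changed at runtime and reverted with the `DURABLE` values once the host has caught up. While relaxed, a crash can lose the last second or so of committed transactions on that host.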