db1072 is broken and data isn't trustable anymore.
We should reimage it and clone it from another host with the correct schema and data.
db1072 is broken and data isn't trustable anymore.
|operations/mediawiki-config : master||db-eqiad.php: Repool db1072|
|operations/mediawiki-config : master||db-eqiad.php: Add comment about bad BBU|
|operations/mediawiki-config : master||db-eqiad.php: Restore db1073 original weight|
|operations/mediawiki-config : master||db-eqiad.php: Repool db1073 with less load|
|operations/mediawiki-config : master||db-eqiad.php: Depool db1073|
|operations/mediawiki-config : master||db-codfw,db-eqiad.php: Change db1072 IP and rack|
|Resolved||faidon||T155875 asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues|
|Resolved||None||T155999 DBA plan to mitigate asw-c2-eqiad reboots|
|Resolved||Marostegui||T156166 db1072 change_tag schema and dataset is not consistent|
|Resolved||Marostegui||T156226 Reimage and clone db1072|
Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:
The log can be found in /var/log/wmf-auto-reimage/201701300722_marostegui_15517.log.
Not sure why it says it failed, as it was done correctly as per all the logs.
Notice: Revoked certificate with serial 1768 Notice: Removing file Puppet::SSL::Certificate db1072.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/signed/db1072.eqiad.wmnet.pem' Notice: db1072.eqiad.wmnet storeconfigs removed db1072.eqiad.wmnet cleaning salt key cache for db1072.eqiad.wmnet Set Boot Device to pxe Current power status is on, power cycling Chassis Power Control: Cycle Seeking the Puppet certificate to sign .........................................................................................+ Notice: Signed certificate request for db1072.eqiad.wmnet Notice: Removing file Puppet::SSL::CertificateRequest db1072.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/requests/db1072.eqiad.wmnet.pem' Stopping default puppet agent and enabling puppet Spawning the first puppet run as well The first puppet run is ongoing, you can see what the result is in the file /root/db1072.eqiad.wmnet.puppetrun.log Seeking the SALT node key to add This is the time to start a puppet run on the host. ............................+ Node db1072.eqiad.wmnet is now signed and both puppet and salt should work.
It got rebooted finely and:
marostegui@db1066:~$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 8.7 (jessie) Release: 8.7 Codename: jessie marostegui@db1066:~$ uname -r 4.4.0-3-amd64
Which is exactly the same version and kernel that a host that got reimaged last week (db1066) got.
It even got the latest mariadb package that was built (wmf-mariadb10 10.0.29-2)
So I believe this reimage is actually fine.
And from the reimage command outout:
sudo -E wmf-auto-reimage -p T156226 db1072.eqiad.wmnet START To monitor the full log: tail -F /var/log/wmf-auto-reimage/201701300722_marostegui_15517.log IPMI Password: Successfully set Icinga downtime for hosts: ['db1072.eqiad.wmnet'] Running wmf-reimage on hosts: ['db1072.eqiad.wmnet'] wmf-reimage log is on puppetmaster1001.eqiad.wmnet:/root/db1072.eqiad.wmnet.log Run wmf-reimage on hosts: ['db1072.eqiad.wmnet'] Successfully completed wmf-reimage for hosts: ['db1072.eqiad.wmnet'] Successful uptime check on hosts:  Puppet run check completed for hosts:  Rebooted hosts:  Successful reboot on hosts:  Successful uptime check on hosts:  Puppet run check completed for hosts:  END
@Volans maybe you want to check something out?
I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key and difficult to fix in the current design of wmf-auto-reimage.
Yesterday I recloned db1072 but it has not been able to catch up with the master whereas db1073 (the server it was recloned from) had no problems.
In order to test if the storage was the bottleneck I switched to 0 innodb_flush_log_at_trx_commit and the server started to slowly catch up.
All the disks looked fine and no media errors, or possible degradation, so I started to check the RAID controller itself and noticed this:
root@db1072:~# megacli -LDInfo -L0 -a0 | grep "Current Cache Policy:" Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
My initial reaction was to check if the BBU was there and was on a healthy state and doesn't look so:
root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging' Charging Status : None Charger Status: Unknown Remaining Capacity: 599 mAh isSOHGood: Yes
This is the output of a server that was recently moved to another rack:
root@db1054:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging' Charging Status : None Charger Status: Complete Remaining Capacity: 554 mAh isSOHGood: Yes
There is also the issue with temperature on db1072.
root@db1072:~# megacli -AdpBbuCmd -a0 | grep Tempe Temperature: 78 C Temperature : High
db1073 (the host db1072 was cloned from):
root@db1073:~# megacli -AdpBbuCmd -a0 | grep Tempe Temperature: 53 C Temperature : OK
And another host:
root@db1054:~# megacli -AdpBbuCmd -a0 | grep Tempe Temperature: 53 C Temperature : OK
I have tried to change the Policy status on the fly to see if the controller would complain about no BBU present (or in a bad state) but it didn't:
root@db1072:~# megacli -LDSetProp WB -LALL -aALL Set Write Policy to WriteBack on Adapter 0, VD 0 (target id: 0) success Exit Code: 0x00
I have forced a relearn cycle before and after a reboot, but it doesn't work:
root@db1072:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog Adapter 0: BBU Learn Failed Exit Code: 0x01
@Cmjohnson can we re-seat the battery cables? I can power down the server for you when you are around.
Looks like forcing it to be WriteBack works (but it is bad anyways if the BBU is really broken):
root@db1072:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success Exit Code: 0x00
I will leave it like that to help the server to catch up until @Cmjohnson is on site (if we have bad luck and the server crashes we can reclone it, it is not pooled anyways).
The BBU looks good now and has been charging and now fully charged:
root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging' Charging Status : None Charger Status: Complete Remaining Capacity: 539 mAh isSOHGood: Yes
I have set the policy back to WriteThrough when the BBU fails:
root@db1072:~# megacli -LDInfo -LAll -aAll | grep "Cache Policy:" Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Server has caught up and tomorrow I will slowly pool it in to see how it behaves.