db1072 is broken and its data can no longer be trusted.
We should reimage it and clone it from another host with the correct schema and data.
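For reference, a minimal sketch of one common way to do the clone over the network (assuming the /srv/sqldata datadir path, a traditional netcat, and that mariadb is stopped on both hosts; the actual tooling used here may differ):

# On db1072 (destination), with mariadb stopped and the old, inconsistent datadir wiped:
nc -l -p 9210 | tar -C /srv/sqldata -xpf -

# On the source host, with mariadb stopped so the files are consistent:
tar -C /srv/sqldata -cpf - . | nc db1072.eqiad.wmnet 9210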
| Status | Assignee | Task |
|--|--|--|
| Resolved | faidon | T155875 asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues |
| Resolved | None | T155999 DBA plan to mitigate asw-c2-eqiad reboots |
| Resolved | Marostegui | T156166 db1072 change_tag schema and dataset is not consistent |
| Resolved | Marostegui | T156226 Reimage and clone db1072 |
Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:
['db1072.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201701300722_marostegui_15517.log.
Not sure why it says it failed, as it was done correctly as per all the logs.
Notice: Revoked certificate with serial 1768
Notice: Removing file Puppet::SSL::Certificate db1072.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/signed/db1072.eqiad.wmnet.pem'
Notice: db1072.eqiad.wmnet storeconfigs removed
db1072.eqiad.wmnet cleaning salt key cache for db1072.eqiad.wmnet
Set Boot Device to pxe
Current power status is on, power cycling
Chassis Power Control: Cycle
Seeking the Puppet certificate to sign
.........................................................................................+
Notice: Signed certificate request for db1072.eqiad.wmnet
Notice: Removing file Puppet::SSL::CertificateRequest db1072.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/requests/db1072.eqiad.wmnet.pem'
Stopping default puppet agent and enabling puppet
Spawning the first puppet run as well
The first puppet run is ongoing, you can see what the result is in the file /root/db1072.eqiad.wmnet.puppetrun.log
Seeking the SALT node key to add
This is the time to start a puppet run on the host.
............................+
Node db1072.eqiad.wmnet is now signed and both puppet and salt should work.
It got rebooted fine, and:
marostegui@db1072:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.7 (jessie)
Release:        8.7
Codename:       jessie
marostegui@db1072:~$ uname -r
4.4.0-3-amd64
That is exactly the same distribution version and kernel that db1066, a host reimaged last week, ended up with.
It even got the latest mariadb package that was built (wmf-mariadb10 10.0.29-2).
So I believe this reimage is actually fine.
And from the reimage command output:
sudo -E wmf-auto-reimage -p T156226 db1072.eqiad.wmnet
START
To monitor the full log: tail -F /var/log/wmf-auto-reimage/201701300722_marostegui_15517.log
IPMI Password:
Successfully set Icinga downtime for hosts: ['db1072.eqiad.wmnet']
Running wmf-reimage on hosts: ['db1072.eqiad.wmnet']
wmf-reimage log is on puppetmaster1001.eqiad.wmnet:/root/db1072.eqiad.wmnet.log
Run wmf-reimage on hosts: ['db1072.eqiad.wmnet']
Successfully completed wmf-reimage for hosts: ['db1072.eqiad.wmnet']
Successful uptime check on hosts:
Puppet run check completed for hosts:
Rebooted hosts:
Successful reboot on hosts:
Successful uptime check on hosts:
Puppet run check completed for hosts:
END
@Volans maybe you want to check something out?
I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key, and it is difficult to fix in the current design of wmf-auto-reimage.
Yesterday I recloned db1072, but it has not been able to catch up with the master, whereas db1073 (the server it was recloned from) had no problems.
To test whether the storage was the bottleneck, I switched innodb_flush_log_at_trx_commit to 0 and the server slowly started to catch up.
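For reference, a sketch of that kind of test (assuming the default mysql client socket; the variable is dynamic on MariaDB 10.0, so no restart is needed):

# Relax durability: sync the InnoDB log roughly once per second instead of at every commit
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 0;"
# Verify the change
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';"
# Restore full durability once the test is over
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1;"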
All the disks looked fine, with no media errors or signs of degradation, so I started checking the RAID controller itself and noticed this:
root@db1072:~# megacli -LDInfo -L0 -a0 | grep "Current Cache Policy:"
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
My initial reaction was to check whether the BBU was present and in a healthy state, and it doesn't look like it is:
root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
Charging Status : None
Charger Status: Unknown
Remaining Capacity: 599 mAh
isSOHGood: Yes
This is the output of a server that was recently moved to another rack:
root@db1054:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
Charging Status : None
Charger Status: Complete
Remaining Capacity: 554 mAh
isSOHGood: Yes
There is also an issue with the BBU temperature on db1072:
root@db1072:~# megacli -AdpBbuCmd -a0 | grep Tempe
Temperature: 78 C
Temperature : High
db1073 (the host db1072 was cloned from):
root@db1073:~# megacli -AdpBbuCmd -a0 | grep Tempe
Temperature: 53 C
Temperature : OK
And another host:
root@db1054:~# megacli -AdpBbuCmd -a0 | grep Tempe
Temperature: 53 C
Temperature : OK
I tried changing the cache policy on the fly to see if the controller would complain about the BBU being missing (or in a bad state), but it didn't:
root@db1072:~# megacli -LDSetProp WB -LALL -aALL
Set Write Policy to WriteBack on Adapter 0, VD 0 (target id: 0) success
Exit Code: 0x00
I have forced a relearn cycle, both before and after a reboot, but it doesn't work:
root@db1072:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog
Adapter 0: BBU Learn Failed
Exit Code: 0x01
@Cmjohnson can we re-seat the battery cables? I can power down the server for you when you are around.
It looks like forcing WriteBack works (although that is risky anyway if the BBU really is broken):
root@db1072:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll
Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success
Exit Code: 0x00
I will leave it like that to help the server catch up until @Cmjohnson is on site (if we are unlucky and the server crashes we can just reclone it; it is not pooled anyway).
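In the meantime, a simple (hypothetical) way to keep an eye on the catch-up, assuming standard replication on the default instance:

# Print the replication lag once a minute; 0 means it has caught up with the master
while true; do mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master; sleep 60; done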
The BBU looks good now; it has been charging and is now fully charged:
root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
Charging Status : None
Charger Status: Complete
Remaining Capacity: 539 mAh
isSOHGood: Yes
I have set the policy back so that it falls back to WriteThrough if the BBU fails:
root@db1072:~# megacli -LDInfo -LAll -aAll | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
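The exact invocation isn't pasted above, but roughly this restores the BBU-aware behaviour (a sketch only, using the same megacli syntax as earlier; the command actually run may have differed):

# WriteBack while the BBU is healthy, falling back to no write cache (WriteThrough) if the BBU goes bad
megacli -LDSetProp WB -LALL -aALL
megacli -LDSetProp NoCachedBadBBU -LALL -aALL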
The server has caught up, and tomorrow I will slowly pool it in to see how it behaves.