Reimage and clone db1072
Closed, Resolved (Public)

Description

db1072 is broken and its data can no longer be trusted.
We should reimage it and clone it from another host that has the correct schema and data.

We should take the opportunity, now that it is depooled, to move it to another rack, since all the API hosts for s1 (including this one) are in D1.
We could move it to B2, for instance.

Marostegui moved this task from Triage to Next on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2017-01-27T15:48:28Z] <marostegui> Stop mysql and shutdown db1072 for maintenance - T156226

Change 334669 had a related patch set uploaded (by Cmjohnson):
Updating IP for db1072 to match rack change to B2. T156226

https://gerrit.wikimedia.org/r/334669

Change 334670 had a related patch set uploaded (by Marostegui):
db-codfw,db-eqiad.php: Change db1072 IP and rack

https://gerrit.wikimedia.org/r/334670

Change 334669 merged by Cmjohnson:
Updating IP for db1072 to match rack change to B2. T156226

https://gerrit.wikimedia.org/r/334669

Change 334670 merged by jenkins-bot:
db-codfw,db-eqiad.php: Change db1072 IP and rack

https://gerrit.wikimedia.org/r/334670

Mentioned in SAL (#wikimedia-operations) [2017-01-27T16:07:26Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: db1072 change IP - T156226 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-01-27T16:08:15Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: db1072 change IP - T156226 (duration: 00m 40s)

db1072 has been moved to B2
DNS updated
db-eqiad,codfw files updated
MySQL and replication started fine.
tendril updated

Pending: reimage and reclone

Thanks @Cmjohnson

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db1072.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201701300722_marostegui_15517.log.

Completed auto-reimage of hosts:

['db1072.eqiad.wmnet']

Of which those FAILED:

set(['db1072.eqiad.wmnet'])


Not sure why it says it failed, since according to all the logs it completed correctly.

Notice: Revoked certificate with serial 1768
Notice: Removing file Puppet::SSL::Certificate db1072.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/signed/db1072.eqiad.wmnet.pem'
Notice: db1072.eqiad.wmnet storeconfigs removed
db1072.eqiad.wmnet
cleaning salt key cache for db1072.eqiad.wmnet
Set Boot Device to pxe
Current power status is on, power cycling
Chassis Power Control: Cycle
Seeking the Puppet certificate to sign
.........................................................................................+
Notice: Signed certificate request for db1072.eqiad.wmnet
Notice: Removing file Puppet::SSL::CertificateRequest db1072.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/requests/db1072.eqiad.wmnet.pem'
Stopping default puppet agent and enabling puppet
Spawning the first puppet run as well
The first puppet run is ongoing, you can see what the result is in the file /root/db1072.eqiad.wmnet.puppetrun.log
Seeking the SALT node key to add
This is the time to start a puppet run on the host.
............................+
Node db1072.eqiad.wmnet is now signed and both puppet and salt should work.

It rebooted fine, and:

marostegui@db1066:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 8.7 (jessie)
Release:	8.7
Codename:	jessie
marostegui@db1066:~$ uname -r
4.4.0-3-amd64

That is exactly the same release and kernel that db1066, a host reimaged last week, got.
It even got the latest MariaDB package that was built (wmf-mariadb10 10.0.29-2).

So I believe this reimage is actually fine.

Change 334973 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1073

https://gerrit.wikimedia.org/r/334973

And from the reimage command output:

sudo -E wmf-auto-reimage -p T156226 db1072.eqiad.wmnet
START
To monitor the full log:
tail -F /var/log/wmf-auto-reimage/201701300722_marostegui_15517.log
IPMI Password:
Successfully set Icinga downtime for hosts: ['db1072.eqiad.wmnet']
Running wmf-reimage on hosts: ['db1072.eqiad.wmnet']
wmf-reimage log is on puppetmaster1001.eqiad.wmnet:/root/db1072.eqiad.wmnet.log
Run wmf-reimage on hosts: ['db1072.eqiad.wmnet']
Successfully completed wmf-reimage for hosts: ['db1072.eqiad.wmnet']
Successful uptime check on hosts: []
Puppet run check completed for hosts: []
Rebooted hosts: []
Successful reboot on hosts: []
Successful uptime check on hosts: []
Puppet run check completed for hosts: []
END
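The output above shows the reimage itself succeeding while every post-install check ("uptime check", "Puppet run check", "Rebooted hosts") reports an empty host list, which is why the run is counted as failed. A hypothetical sanity check for this pattern (not part of wmf-auto-reimage; the transcript lines are inlined as sample data) could look like:

```shell
#!/bin/sh
# Hypothetical check: scan a wmf-auto-reimage transcript and warn when
# wmf-reimage itself succeeded but the follow-up checks ran against an
# empty host list -- the symptom shown above.
log=$(cat <<'EOF'
Successfully completed wmf-reimage for hosts: ['db1072.eqiad.wmnet']
Successful uptime check on hosts: []
Puppet run check completed for hosts: []
Rebooted hosts: []
EOF
)
status=ok
if printf '%s\n' "$log" | grep -q 'completed wmf-reimage' &&
   printf '%s\n' "$log" | grep -qF 'uptime check on hosts: []'; then
    status=warn
    echo "WARN: reimage completed but post-install checks saw no hosts"
fi
```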

@Volans maybe you want to check something out?

Change 334973 merged by jenkins-bot:
db-eqiad.php: Depool db1073

https://gerrit.wikimedia.org/r/334973

Mentioned in SAL (#wikimedia-operations) [2017-01-30T08:01:06Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1073 - T156226 (duration: 02m 45s)

I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key and difficult to fix in the current design of wmf-auto-reimage.

Mentioned in SAL (#wikimedia-operations) [2017-01-30T08:04:53Z] <marostegui> Stop mysql db1073 to use it to clone db1072 - T156226

> I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key and difficult to fix in the current design of wmf-auto-reimage.

Thanks for the heads up :-)
As I said, to me it looked fine, but it is good to know this is not the first time it has happened!

Thanks again

Change 334987 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1072,db1073 with less load

https://gerrit.wikimedia.org/r/334987

Change 334987 merged by jenkins-bot:
db-eqiad.php: Repool db1073 with less load

https://gerrit.wikimedia.org/r/334987

Mentioned in SAL (#wikimedia-operations) [2017-01-30T10:03:50Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1073 with less weight - T156226 (duration: 00m 49s)

Change 335023 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restore db1073 original weight

https://gerrit.wikimedia.org/r/335023

Change 335023 merged by jenkins-bot:
db-eqiad.php: Restore db1073 original weight

https://gerrit.wikimedia.org/r/335023

Mentioned in SAL (#wikimedia-operations) [2017-01-30T15:39:55Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1073 with its original weight - T156226 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2017-01-31T07:49:45Z] <marostegui> Reboot db1072 to force BBU recharge - T156226

Yesterday I recloned db1072, but it has not been able to catch up with the master, whereas db1073 (the server it was recloned from) had no problems.
To test whether the storage was the bottleneck, I switched innodb_flush_log_at_trx_commit to 0 and the server slowly started to catch up.
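The exact statements used are not shown in the task; a minimal sketch of that dynamic switch (and its trade-off) would be:

```sql
-- Relax redo-log durability so InnoDB flushes the log roughly once per
-- second instead of fsyncing at every commit. Up to ~1 second of committed
-- transactions can be lost on a crash -- tolerable here only because the
-- host is depooled and can be recloned.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;

-- Restore full durability once the replica has caught up:
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```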

All the disks looked fine, with no media errors or signs of degradation, so I started checking the RAID controller itself and noticed this:

root@db1072:~# megacli -LDInfo -L0 -a0 | grep "Current Cache Policy:"
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
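That WriteThrough fallback is exactly the kind of condition a monitoring check could catch. A minimal sketch (the state names and logic are mine, not an existing check; on a real host the sample line would come from `megacli -LDInfo -L0 -a0`):

```shell
#!/bin/sh
# Hypothetical check: parse the "Current Cache Policy" line from megacli
# output and flag a fallback to WriteThrough, which usually means the
# BBU is missing, discharged, or failed.
policy_line='Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU'
case "$policy_line" in
  *WriteThrough*) cache_state=degraded
                  echo "WARN: controller cache is in WriteThrough (BBU problem?)" ;;
  *WriteBack*)    cache_state=ok ;;
  *)              cache_state=unknown ;;
esac
```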

My initial reaction was to check whether the BBU was present and in a healthy state, and it doesn't look like it is:

root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
  Charging Status              : None
Charger Status: Unknown
Remaining Capacity: 599 mAh
isSOHGood: Yes

This is the output of a server that was recently moved to another rack:

root@db1054:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
  Charging Status              : None
Charger Status: Complete
Remaining Capacity: 554 mAh
isSOHGood: Yes
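The telling difference is `Charger Status: Unknown` on db1072 versus `Complete` on the healthy host. A hypothetical parser for that field (the "ok"/"suspect" classification is my own; db1072's sample output is inlined for illustration):

```shell
#!/bin/sh
# Hypothetical check: extract "Charger Status" from megacli BBU output and
# treat anything other than Complete/Charging as suspect.
bbu_output='Charger Status: Unknown
Remaining Capacity: 599 mAh
isSOHGood: Yes'
charger=$(printf '%s\n' "$bbu_output" | sed -n 's/^Charger Status: //p')
case "$charger" in
  Complete|Charging) bbu_state=ok ;;
  *)                 bbu_state=suspect
                     echo "WARN: BBU charger status is '$charger'" ;;
esac
```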

There is also a temperature issue on db1072:

root@db1072:~# megacli -AdpBbuCmd  -a0 | grep Tempe
Temperature: 78 C
  Temperature                             : High

db1073 (the host db1072 was cloned from):

root@db1073:~# megacli -AdpBbuCmd  -a0 | grep Tempe
Temperature: 53 C
  Temperature                             : OK

And another host:

root@db1054:~#  megacli -AdpBbuCmd  -a0 | grep Tempe
Temperature: 53 C
  Temperature                             : OK
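78 C against 53 C on the two comparison hosts is a large gap, and the controller itself already flags it as "High". A numeric threshold check is easy to sketch (the 60 C limit below is an assumption for illustration, not a vendor figure):

```shell
#!/bin/sh
# Hypothetical threshold check on the BBU temperature reported by megacli.
# db1072's sample line is inlined; 60 C is an assumed alert limit.
temp_line='Temperature: 78 C'
temp=$(printf '%s\n' "$temp_line" | awk '{print $2}')
limit=60
if [ "$temp" -gt "$limit" ]; then
    temp_state=high
    echo "WARN: BBU temperature ${temp}C exceeds ${limit}C"
else
    temp_state=ok
fi
```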

I tried changing the write policy on the fly to see whether the controller would complain about the BBU being missing (or in a bad state), but it didn't:

root@db1072:~# megacli -LDSetProp WB -LALL -aALL

Set Write Policy to WriteBack on Adapter 0, VD 0 (target id: 0) success

Exit Code: 0x00

I have forced a relearn cycle both before and after a reboot, but it fails:

root@db1072:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Failed

Exit Code: 0x01

@Cmjohnson can we re-seat the battery cables? I can power down the server for you when you are around.
Thanks!

Looks like forcing WriteBack works (though it is risky if the BBU really is broken, since cached writes would be lost on a power failure):

root@db1072:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success

Exit Code: 0x00

I will leave it like that to help the server catch up until @Cmjohnson is on site (if we are unlucky and the server crashes we can reclone it; it is not pooled anyway).

Change 335195 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Add comment about bad BBU

https://gerrit.wikimedia.org/r/335195

Change 335195 merged by jenkins-bot:
db-eqiad.php: Add comment about bad BBU

https://gerrit.wikimedia.org/r/335195

Mentioned in SAL (#wikimedia-operations) [2017-01-31T08:40:33Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add a warning about a possible bad BBU on db1072 - T156226 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-02-01T07:31:09Z] <marostegui> Force WB policy on the raid controller db1072 - T156226

Requested a new BBU from Dell. Will update the task with shipping information.
Confirmed: Request 943118173 was successfully submitted.

Mentioned in SAL (#wikimedia-operations) [2017-02-06T16:51:40Z] <marostegui> Stop MySQL and shutdown db1072 for raid and BBU replacement - T156226

Marostegui closed this task as "Resolved". Feb 6 2017, 5:58 PM

The BBU looks good now; it has been charging and is now fully charged:

root@db1072:~# megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
  Charging Status              : None
Charger Status: Complete
Remaining Capacity: 539 mAh
isSOHGood: Yes

I have set the policy back to the default, which falls back to WriteThrough if the BBU fails:

root@db1072:~# megacli -LDInfo -LAll -aAll | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

The server has caught up, and tomorrow I will slowly pool it back in to see how it behaves.
Thanks @Cmjohnson

Change 336376 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1072

https://gerrit.wikimedia.org/r/336376

Change 336376 merged by jenkins-bot:
db-eqiad.php: Repool db1072

https://gerrit.wikimedia.org/r/336376

Mentioned in SAL (#wikimedia-operations) [2017-02-07T06:58:49Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1072 - T156226 (duration: 00m 50s)