db1067 will be the future s1 master, but it needs to be moved to row C (any rack) in order to have the enwiki master in a row that requires no more switch maintenance in the near future.
@Cmjohnson please confirm if C6 is doable from your side.
Event Timeline
Change 433346 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1067
Change 433346 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1067
Mentioned in SAL (#wikimedia-operations) [2018-05-16T10:16:55Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1067 as it will be moved to a different rack - T193835 (duration: 01m 21s)
Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:01:21Z] <marostegui> Stop MySQL on db1067 - T193835
Change 433416 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] New production IP db1067
Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:18:28Z] <marostegui> Power off db1067 for rack move - T193835
Change 433417 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db1067 IP
Change 433417 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db1067 IP
Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:29:01Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db1067 IP - T193835 (duration: 01m 17s)
Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:34:44Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db1067 IP - T193835 (duration: 01m 21s)
This has been successfully moved.
MySQL is back up. I am waiting for the DNS change to fully propagate before repooling and closing this task.
Thanks @Cmjohnson
I am investigating why it has the current cache policy set to WriteThrough:
root@db1067:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
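The mismatch between the default and current write policy can also be spotted programmatically. The following is a minimal sketch, not the tooling used in this task: it assumes the `megacli -ldinfo` output has been captured as text with one field per line, and the function name `cache_policy_mismatch` is hypothetical.

```python
def cache_policy_mismatch(ldinfo_output: str):
    """Return (default, current) write policies if they differ, else None."""
    policies = {}
    for line in ldinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key = key.strip()
        if key in ("Default Cache Policy", "Current Cache Policy"):
            # The write policy is the first comma-separated field.
            policies[key] = value.split(",")[0].strip()
    default = policies.get("Default Cache Policy")
    current = policies.get("Current Cache Policy")
    if default is not None and current is not None and default != current:
        return default, current
    return None


# Sample lines copied from the db1067 output in this task.
LDINFO_SAMPLE = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

print(cache_policy_mismatch(LDINFO_SAMPLE))  # ('WriteBack', 'WriteThrough')
```

A result other than `None` means the controller has fallen back from its configured policy, which is what is being investigated here.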
The BBU looks good (apart from the temperature alert):
root@db1067:~# megacli -AdpBbuCmd -a0

BBU status for Adapter: 0
BatteryType: BBU
Voltage: 3957 mV
Current: 0 mA
Temperature: 76 C
Battery State: Optimal

BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : High
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No

BBU GasGauge Status: 0x0238
Relative State of Charge: 100 %
Charger Status: Complete
Remaining Capacity: 542 mAh
Full Charge Capacity: 542 mAh
isSOHGood: Yes
Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0
Relative State of Charge: 100 %
Absolute State of charge: 0 %
Remaining Capacity: 542 mAh
Full Charge Capacity: 542 mAh
Run time to empty: Battery is not being charged.
Average time to empty: 43 Min.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 1
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0
Date of Manufacture: 07/18, 2011
Design Capacity: 90 mAh
Design Voltage: 0 mV
Specification Info: 0
Serial Number: 0
Pack Stat Configuration: 0x0000
Manufacture Name:
Firmware Version : 0148 03
Device Name:
Device Chemistry:
Battery FRU: N/A
Module Version = 0148 03
Transparent Learn = 1
App Data = 0

BBU Properties for Adapter: 0
Auto Learn Period: 90 Days
Next Learn time: None
Learn Delay Interval:0 Hours
Auto-Learn Mode: Disabled

Exit Code: 0x00
The temperature of the BBU is very high compared to other hosts, so I think we should replace it with another one. As this is the candidate master for s1, it is better to be on the safe side and replace the BBU now, while it is not yet a master.
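To compare this reading against other hosts, the two temperature fields can be pulled out of the BBU status text. This is a rough sketch under assumptions: the `megacli -AdpBbuCmd` output is available as text with one field per line, and the helper name `bbu_temperature` is hypothetical rather than part of any existing tool.

```python
import re


def bbu_temperature(bbu_output: str):
    """Return (degrees_celsius, firmware_flag) parsed from BBU status text."""
    degrees = None
    flag = None
    for line in bbu_output.splitlines():
        match = re.match(r"\s*Temperature\s*:\s*(.+?)\s*$", line)
        if not match:
            continue
        value = match.group(1)
        reading = re.fullmatch(r"(\d+)\s*C", value)
        if reading:
            # Numeric sensor reading, e.g. "Temperature: 76 C".
            degrees = int(reading.group(1))
        else:
            # Firmware status flag, e.g. "Temperature : High".
            flag = value
    return degrees, flag


# Sample lines copied from the db1067 output in this task.
BBU_SAMPLE = """\
Temperature: 76 C
Battery State: Optimal
Temperature : High
"""

print(bbu_temperature(BBU_SAMPLE))  # (76, 'High')
```

Running the same helper over the output from several hosts would give the numbers needed to back up the comparison.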
@Cmjohnson do you have spare BBUs?
As discussed with @Cmjohnson, I am closing this task and will create a new one for the BBU issues, as it will be easier to find in the future with a specific task.
Mentioned in SAL (#wikimedia-operations) [2018-05-22T05:24:53Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1067 - T193835 (duration: 01m 19s)