
Move db1067 to row C
Closed, Resolved · Public

Description

db1067 will be the future s1 master, but it needs to be moved to row C (any rack) in order to have the enwiki master in a row that requires no more switch maintenance in the near future.
@Cmjohnson please confirm if C6 is doable from your side.

Event Timeline

Restricted Application added a project: Operations. · May 4 2018, 5:42 AM
Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. · May 4 2018, 5:42 AM
Marostegui moved this task from Triage to In progress on the DBA board.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · May 4 2018, 2:57 PM
Marostegui moved this task from In progress to Next on the DBA board. · May 5 2018, 2:28 PM

This has been scheduled for Wednesday, May 16th.

Change 433346 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1067

https://gerrit.wikimedia.org/r/433346

Change 433346 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1067

https://gerrit.wikimedia.org/r/433346

Mentioned in SAL (#wikimedia-operations) [2018-05-16T10:16:55Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1067 as it will be moved to a different rack - T193835 (duration: 01m 21s)

Marostegui moved this task from Next to In progress on the DBA board. · May 16 2018, 10:31 AM

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:01:21Z] <marostegui> Stop MySQL on db1067 - T193835

Change 433416 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] New production IP db1067

https://gerrit.wikimedia.org/r/433416

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:18:28Z] <marostegui> Power off db1067 for rack move - T193835

Change 433416 merged by Marostegui:
[operations/dns@master] New production IP db1067

https://gerrit.wikimedia.org/r/433416

Change 433417 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db1067 IP

https://gerrit.wikimedia.org/r/433417

Change 433417 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db1067 IP

https://gerrit.wikimedia.org/r/433417

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:29:01Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db1067 IP - T193835 (duration: 01m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:34:44Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db1067 IP - T193835 (duration: 01m 21s)

This has been successfully moved.
MySQL is back up. I am waiting for the DNS change to fully propagate before repooling and closing this task.

Thanks @Cmjohnson
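For reference, the "wait for DNS before repooling" step could be scripted roughly as below. This is only a sketch, not what was actually run here: the function name `wait_for_dns`, the retry counts, and the hostname/IP in the usage comment are all placeholders, and it only checks the resolver of the machine it runs on, which is a weaker check than full propagation.

```python
import socket
import time


def wait_for_dns(hostname, expected_ip, resolve=socket.gethostbyname,
                 attempts=30, delay_s=10.0, sleep=time.sleep):
    """Poll the local resolver until hostname resolves to expected_ip.

    Returns True as soon as the expected address is seen, False if it is
    still not seen after `attempts` lookups. `resolve` and `sleep` are
    injectable so the loop can be exercised without network access.
    """
    for attempt in range(attempts):
        try:
            if resolve(hostname) == expected_ip:
                return True
        except OSError:
            # socket.gaierror is a subclass of OSError: treat a failed
            # lookup like a stale answer and retry.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Hypothetical usage (placeholder name and address, not the real ones):
# if wait_for_dns("dbhost.example.org", "192.0.2.10"):
#     print("DNS updated, safe to repool")
```

Only repool once this returns True from the hosts that actually connect to the database, since each of them may have its own resolver cache.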

I am investigating why the current cache policy is set to WriteThrough, although the default is WriteBack:

root@db1067:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
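The mismatch above (default WriteBack, current WriteThrough) could also be detected automatically. A minimal sketch, assuming the output format shown in this task; the function names are hypothetical and this is not an existing check:

```python
import re

# Excerpt of the `megacli -ldinfo -l0 -a0` output pasted in this task.
SAMPLE_LDINFO = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy   : Disk's Default
"""


def cache_policies(ldinfo_text):
    """Return (default, current) cache policies as lists of settings."""
    found = {}
    for line in ldinfo_text.splitlines():
        # Only the "Default"/"Current" lines are wanted; "Disk Cache Policy"
        # does not match because the line must start with one of the two.
        m = re.match(r"\s*(Default|Current) Cache Policy\s*:\s*(.+)", line)
        if m:
            found[m.group(1)] = [part.strip() for part in m.group(2).split(",")]
    if "Default" not in found or "Current" not in found:
        raise ValueError("cache policy lines not found in megacli output")
    return found["Default"], found["Current"]


def write_cache_degraded(ldinfo_text):
    """True if the default write policy is WriteBack but the current one is not."""
    default, current = cache_policies(ldinfo_text)
    return default[0] == "WriteBack" and current[0] != "WriteBack"


print(write_cache_degraded(SAMPLE_LDINFO))
```

With "No Write Cache if Bad BBU" set, the controller falls back to WriteThrough whenever it considers the BBU unhealthy, so a mismatch like this one points at the battery.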

The BBU looks good (apart from the temperature alert):

root@db1067:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3957 mV
Current: 0 mA
Temperature: 76 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : High
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0238
Relative State of Charge: 100 %
Charger Status: Complete
Remaining Capacity: 542 mAh
Full Charge Capacity: 542 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 100 %
  Absolute State of charge: 0 %
  Remaining Capacity: 542 mAh
  Full Charge Capacity: 542 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 43 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 1
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00

The BBU temperature is much higher than on other hosts, so I think we should replace it. As this is the candidate master for s1, it is better to be on the safe side and replace the BBU now, while the host is not yet a master.

@Cmjohnson do you have spare BBUs?
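To compare this reading against other hosts, the temperature can be pulled out of the `megacli -AdpBbuCmd -a0` output with something like the following. This is a rough sketch, assuming the output format shown above; the function names are hypothetical, and treating any firmware flag other than "OK" as a problem is an assumption based on the `Voltage : OK` line in the same output.

```python
import re

# Excerpt of the BBU status output pasted in this task.
SAMPLE_BBU = """\
BatteryType: BBU
Voltage: 3957 mV
Current: 0 mA
Temperature: 76 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : High
"""


def bbu_temperature(bbu_text):
    """Return (degrees_celsius, firmware_flag) from megacli BBU output."""
    # Numeric reading, e.g. "Temperature: 76 C" (colon directly after the key).
    reading = re.search(r"^\s*Temperature:\s*(-?\d+)\s*C\s*$", bbu_text, re.MULTILINE)
    # Firmware flag, e.g. "Temperature      : High" (padding before the colon).
    flag = re.search(r"^\s*Temperature\s+:\s*(\S+)\s*$", bbu_text, re.MULTILINE)
    if reading is None or flag is None:
        raise ValueError("temperature lines not found in megacli output")
    return int(reading.group(1)), flag.group(1)


def bbu_temperature_alert(bbu_text):
    """True if the firmware reports a temperature flag other than OK (assumption)."""
    _, flag = bbu_temperature(bbu_text)
    return flag.lower() != "ok"


print(bbu_temperature(SAMPLE_BBU), bbu_temperature_alert(SAMPLE_BBU))
```

Running the same parser across the other database hosts would give actual numbers to compare against, rather than a by-eye comparison.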

Marostegui closed this task as Resolved. · May 16 2018, 8:04 PM

As discussed with @Cmjohnson, I am closing this task and will create a new one for the BBU issues, as a specific task will be easier to find in the future.

Mentioned in SAL (#wikimedia-operations) [2018-05-22T05:24:53Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1067 - T193835 (duration: 01m 19s)

Vvjjkkii renamed this task from Move db1067 to row C to 0mdaaaaaaa. · Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 0mdaaaaaaa to Move db1067 to row C. · Jul 1 2018, 6:30 PM
Marostegui closed this task as Resolved.
Marostegui assigned this task to Cmjohnson.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.