
db1078 s3 primary DB master BBU pre-failure
Closed, ResolvedPublic

Description

I noticed that the battery on db1078 (s3 primary master) has been recharging for the last 3 days, which is strange. Looking at the HW logs, I found this:

/system1/log1/record7
  Targets
  Properties
    number=7
    severity=Caution
    date=02/14/2019
    time=19:10
    description=Smart Storage Battery pre-failure (Battery 1). Action: 1. Consult server troubleshooting guide. 2 Gather AHS log and contact Support

@Cmjohnson or @RobH please advise.

Event Timeline

Restricted Application added a project: Operations. Mar 25 2019, 6:42 AM
Marostegui triaged this task as High priority. Mar 25 2019, 6:42 AM

Setting this to high priority as this is the s3 primary database master.

Marostegui moved this task from Triage to In progress on the DBA board. Mar 25 2019, 6:43 AM

Your case was successfully submitted. Please note your Case ID: 5337355107 for future reference.

Any update from HP?

Thanks for the update @Cmjohnson!
Can the BBU on these HP hosts be replaced with no disruption, or should we plan a failover for this host?
Thanks!

@Cmjohnson let us know that the BBU has arrived and that he will need to power the server down to replace it.
So we need to do a failover and failback to db1075 (the previous master, until we had to do an emergency failover due to a PDU maintenance, T213858: s3 master emergency failover (db1075)).
We need to create a task to request read-only time once we have agreed on possible dates. Considering that we should normally give around 2 weeks' notice (unless we are in an emergency, which I don't think we fully are), and given that Easter is coming soon, I propose Wednesday 24th at 5:00 AM UTC.
@jcrespo would that be OK with you? If so, I will create a task for the read-only.

Marostegui added a comment. Edited · Apr 4 2019, 8:08 AM

We have agreed to aim for 11th April, to avoid the risk of the master going down unexpectedly during the upcoming Easter holidays, when there will be less coverage.

Change 501479 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1075 to master

https://gerrit.wikimedia.org/r/501479

Change 501480 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s3 to read only

https://gerrit.wikimedia.org/r/501480

Change 501481 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1075 to master

https://gerrit.wikimedia.org/r/501481

Change 501483 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/501483

@jcrespo would you mind taking a look at the above patches? ^
I have also updated our etherpad with the plan.

Thanks!

Mentioned in SAL (#wikimedia-operations) [2019-04-11T04:18:49Z] <marostegui> Start topology changes to move s3 slaves under db1075 T219115

Mentioned in SAL (#wikimedia-operations) [2019-04-11T04:32:22Z] <marostegui> Disable puppet on db1078 and db1075 T219115

Change 501479 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1075 to master

https://gerrit.wikimedia.org/r/501479

Change 501480 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s3 to read only

https://gerrit.wikimedia.org/r/501480

Change 501481 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1075 to master

https://gerrit.wikimedia.org/r/501481

Mentioned in SAL (#wikimedia-operations) [2019-04-11T05:00:18Z] <marostegui> Starting s3 failover from db1078 to db1075 - T219115

Mentioned in SAL (#wikimedia-operations) [2019-04-11T05:01:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s3 on read-only T219115 (duration: 00m 37s)

Mentioned in SAL (#wikimedia-operations) [2019-04-11T05:02:48Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover s3 master eqiad from db1078 to db1075 T219115 (duration: 00m 36s)

Mentioned in SAL (#wikimedia-operations) [2019-04-11T05:03:55Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove s3 ready only T219115 (duration: 00m 36s)

Change 501483 merged by Marostegui:
[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/501483

@Cmjohnson can we schedule the BBU replacement for Monday 15th? db1078 is no longer a master.

The failover was performed successfully:
Times in UTC:

Read only starts: 05:01:05
Read only stops: 05:03:55

Total read-only time: 2 minutes 50 seconds
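
For reference, the read-only window can be recomputed from the two SAL timestamps above. This is only a minimal sketch: the function name `read_only_window` is made up for illustration and is not part of any tooling used in this failover.

```python
from datetime import datetime

# Timestamps copied from the SAL entries above (UTC).
read_only_start = datetime.fromisoformat("2019-04-11T05:01:05")
read_only_stop = datetime.fromisoformat("2019-04-11T05:03:55")


def read_only_window(start, stop):
    """Return the read-only duration as a (minutes, seconds) tuple."""
    total_seconds = int((stop - start).total_seconds())
    return divmod(total_seconds, 60)


minutes, seconds = read_only_window(read_only_start, read_only_stop)
print(f"Total read-only time: {minutes:02d}:{seconds:02d}")
```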

Change 504019 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Reduce db1078 load in preparation for depool

https://gerrit.wikimedia.org/r/504019

Change 504022 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1078 for hardware maintenance

https://gerrit.wikimedia.org/r/504022

Change 504019 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Reduce db1078 load in preparation for depool

https://gerrit.wikimedia.org/r/504019

Change 504022 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1078 for hardware maintenance

https://gerrit.wikimedia.org/r/504022

Mentioned in SAL (#wikimedia-operations) [2019-04-16T16:41:25Z] <jynus> disabling notifications on db1078 T219115

Mentioned in SAL (#wikimedia-operations) [2019-04-16T16:43:55Z] <jynus> upgrading and shutting down db1078 T219115

jcrespo claimed this task. Apr 16 2019, 6:16 PM

This is fixed, but I am not closing it yet because I cannot repool the server (deployment schedule conflict). I will repool it tomorrow and resolve the task.

Change 504513 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1078 with full weight

https://gerrit.wikimedia.org/r/504513

Change 504513 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1078 with full weight

https://gerrit.wikimedia.org/r/504513

jcrespo closed this task as Resolved. Apr 17 2019, 10:01 AM
jcrespo reassigned this task from jcrespo to Cmjohnson.

Repooled and fixed.