codfw: (1) spare pool system for temp allocation as database failover
Closed, Resolved · Public

Description

After T125827 we have run out of failover hosts for x1. A permanent solution will arrive with T158669, but it will be delayed a few weeks (the codfw failover will happen earlier). We could pool some of the existing servers (the main ones for enwiki, commons, etc.), but I would prefer not to, in order to keep resources separate: a dedicated server, even a less powerful one (not many resources are required), would be preferred for reliability reasons.

We need a small slave in case the worst-case scenario happens: we only need 200 GB of storage and 8 or 16 GB of RAM, and we do not need RAID or SSDs. It will probably never get used, except if a problem appears on the production machine.

We are asking for a spare for just one month, until the failover is done and the new boxes arrive.

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Remove tempdb2001
[operations/puppet@production] mariadb: Get ready to decomission tempdb2001
[operations/mediawiki-config@master] db-codfw.php: Depool tempdb2001

Event Timeline

jcrespo created this task. Mar 29 2017, 3:14 PM
Restricted Application added a project: Operations. Mar 29 2017, 3:14 PM
jcrespo renamed this task from Adquire temporary box for x1 failover (spare available?) to Adquire temporary box for x1 failover on codfw (spare available?). Mar 29 2017, 3:14 PM
Reedy renamed this task from Adquire temporary box for x1 failover on codfw (spare available?) to Aquire temporary box for x1 failover on codfw (spare available?). Mar 29 2017, 3:19 PM

I think we could temporarily use es2002, but I am asking first in case a more suitable machine is available.

jcrespo renamed this task from Aquire temporary box for x1 failover on codfw (spare available?) to Acquire temporary box for x1 failover on codfw (spare available?). Mar 29 2017, 3:21 PM

I think that is the right term, sorry, I had to look it up in the dictionary. English is not my strong point as a non-native-speaker. Sorry again.

RobH renamed this task from Acquire temporary box for x1 failover on codfw (spare available?) to codfw: (1) spare pool system for temp allocation as database failover. Mar 29 2017, 3:42 PM
jcrespo updated the task description. Mar 29 2017, 3:48 PM
RobH assigned this task to faidon. Mar 29 2017, 3:52 PM
RobH added a subscriber: faidon.

So, for this I have 3 spare machines in codfw. One of them is being used to restore graphite data (short-term use, one month or less), and this request will use one of the others.

WMF6407 - Dual Intel® Xeon® Processor E5-2640 (2.6/8c), 64GB RAM, Dual 1TB SATA.

Discussion with @jcrespo in IRC resulted in his approval of this specification for the temporary allocation.

I've escalated to @faidon to approve the spare system allocation.

RobH edited projects, added hardware-requests; removed procurement. Mar 29 2017, 3:53 PM
RobH moved this task from Backlog to Pending Approval on the hardware-requests board.

That's totally fine, approved.

RobH claimed this task. Mar 29 2017, 4:12 PM
jcrespo moved this task from Triage to In progress on the DBA board. Apr 3 2017, 3:59 PM
RobH changed the task status from Open to Stalled. Apr 5 2017, 5:21 PM
RobH triaged this task as Medium priority.

I'm setting this to stalled and normal priority, as this task will also serve to reclaim the system in a month's time. I've created T162290 to track the setup of tempdb2001.

Marostegui moved this task from In progress to Done on the DBA board. Apr 17 2017, 7:17 AM

I am moving this to "Done" on the DBA workboard, as it is done from our side along with T162290, so it is easier for us (DBAs) to see what's going on at the moment on our land :-)
Once we have set up the definitive server for this, I will update this task so we can decommission the temporary one.

tempdb2001 is not going to be used anymore, but before it returns to the pool of spares, we need to retire it from puppet and mediawiki-config.

I will take care of that tomorrow, as I also need to remove it from the .my.cnf.erb file, where we added an exception for it.

Mentioned in SAL (#wikimedia-operations) [2017-05-04T06:17:29Z] <marostegui> Stop MySQL on tempdb2001 to take a backup and prepare to decomission - T161712

Change 351769 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool tempdb2001

https://gerrit.wikimedia.org/r/351769

Change 351769 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool tempdb2001

https://gerrit.wikimedia.org/r/351769

Mentioned in SAL (#wikimedia-operations) [2017-05-04T06:26:25Z] <marostegui@naos> Synchronized wmf-config/db-codfw.php: Depool tempdb2001, no longer needed - T161712 (duration: 01m 08s)

Change 351772 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Get ready to decomission tempdb2001

https://gerrit.wikimedia.org/r/351772

Change 351772 merged by Marostegui:
[operations/puppet@production] mariadb: Get ready to decomission tempdb2001

https://gerrit.wikimedia.org/r/351772

Change 351777 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Remove tempdb2001

https://gerrit.wikimedia.org/r/351777

Change 351777 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Remove tempdb2001

https://gerrit.wikimedia.org/r/351777

Mentioned in SAL (#wikimedia-operations) [2017-05-04T07:58:42Z] <marostegui@naos> Synchronized wmf-config/db-codfw.php: Remove tempdb2001 from config files as it will be decommissioned - T161712 (duration: 01m 25s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T07:59:57Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Remove tempdb2001 from config files as it will be decommissioned - T161712 (duration: 01m 07s)

Marostegui added a comment. Edited May 4 2017, 8:00 AM

Position where tempdb2001 slave was stopped at: https://phabricator.wikimedia.org/P5374
Backup at: dbstore2002:/srv/tmp/tempdb2001.tar.gz
MySQL remains stopped.

@RobH I have merged https://gerrit.wikimedia.org/r/#/c/351772/, https://gerrit.wikimedia.org/r/#/c/351769/ and https://gerrit.wikimedia.org/r/#/c/351777/, so the host is all yours to be decommissioned.
If I have missed something, please let me know.
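
For reference, the order of operations recorded in this task (stop MySQL and back up, depool, prepare puppet, remove from config, hand over) can be written down as a small checklist. This is only a sketch: the step names and the `next_step` helper are hypothetical labels made up for illustration, not part of any Wikimedia tooling.

```python
from typing import Optional, Sequence

# Hypothetical step names, in the order the actions appear in this task's log.
TEARDOWN_STEPS = (
    "stop_mysql_and_take_backup",       # SAL entry at 06:17, backup on dbstore2002
    "depool_from_mediawiki_config",     # change 351769
    "prepare_puppet_for_removal",       # change 351772
    "remove_from_mediawiki_config",     # change 351777
    "hand_over_to_dc_ops",              # host returns to the spares pool
)


def next_step(done: Sequence[str]) -> Optional[str]:
    """Return the next pending step, or None once every step is complete.

    Raises ValueError if the completed steps are not a prefix of the
    expected order, i.e. something was skipped or done out of order.
    """
    expected = TEARDOWN_STEPS[:len(done)]
    if tuple(done) != expected:
        raise ValueError(f"steps out of order: {list(done)!r}")
    if len(done) == len(TEARDOWN_STEPS):
        return None
    return TEARDOWN_STEPS[len(done)]
```

Calling `next_step([])` returns the first step, and passing the full list returns `None`, meaning the host is ready to be reclaimed.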

"Decommissioned" in the software sense; I think they will want it to return to spares. Just clarifying, in the very, very unlikely event that robh doesn't remember this server was "his" and we start to pull it off the racks. :-)

Thanks for the clarification :-)

RobH closed this task as Resolved. May 4 2017, 4:59 PM

Yes, this is a new system, so it's reclaimed to spares, not decommissioned.

I've created T164513 to track that, so this is being resolved.