
normalize eqiad restbase cluster - replace restbase1001-1006
Closed, ResolvedPublic

Description

This task outlines the proposed resolution to the issue of restbase1001-1006 being off-spec: they have 4 LFF disk bays, while all other restbase systems (both HP and Dell) have SFF disk bays and can hold twice as many disks as the original restbase1001-1006.

The majority of this proposal is an update from an IRC conversation between @mark & @RobH on 2016-02-04. This task will be the master tracking task for the ordering of 6 new restbase systems for eqiad and the reclaiming of the existing restbase1001-1006 into operations spares.

After a discussion with the services team, it's been decided that operations will absorb the HP restbase1001-1006 into the spare misc servers pool (as we own them, they are not leased), and we'll generate new quotes/orders for 6 new restbase systems in eqiad that better match the specification of restbase1007-1009 & restbase2001-2006.

We'll be removing the Samsung 1TB SSDs from restbase1001-1006 and installing them, along with the newly ordered additional SSDs, in the newly ordered restbase systems. The current systems have LFF disk bays, but their sleds adapt SFF disks into LFF bays, so we'll swap the new orders' non-Samsung 2.5" SFF disks into the existing old systems and put the existing Samsung SSDs into the new systems. Each new system will need to ship with 5 SFF disks.

As this is the master tracking task for the system swaps, there will be multiple procurement and hardware-requests sub-tasks/blockers linked from this.

Event Timeline

RobH claimed this task.
RobH raised the priority of this task from to High.
RobH updated the task description. (Show Details)
RobH added projects: hardware-requests, SRE, RESTBase.
RobH added subscribers: RobH, mark, GWicke.
RobH mentioned this in Unknown Object (Task).Feb 4 2016, 5:49 PM
RobH mentioned this in Unknown Object (Task).Feb 4 2016, 5:51 PM
RobH mentioned this in Unknown Object (Task).Feb 5 2016, 7:17 PM
RobH added a subtask: Unknown Object (Task).Feb 5 2016, 7:20 PM
RobH closed subtask Unknown Object (Task) as Declined.Feb 9 2016, 5:44 PM

for the sake of normalization, at the end of the current expansion we'll have:

eqiad: 9x machines / 128GB ram / 2x processors / 5x 1TB SSD = 45TB
codfw: 6x machines / 128GB ram / 2x processors / 7x 1TB SSD = 42TB

IOW, in terms of storage we'll have 3TB more in eqiad than in codfw; RAM-wise they differ too, but that is easier to deal with for e.g. capacity planning.
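The storage totals quoted above are simple arithmetic, which can be sanity-checked with a throwaway shell snippet (a sketch, not part of any tooling):

```shell
# Raw SSD capacity per DC = machines x SSDs per machine x 1TB
eqiad_tb=$((9 * 5))   # 9 machines, 5x 1TB SSD each
codfw_tb=$((6 * 7))   # 6 machines, 7x 1TB SSD each
echo "eqiad: ${eqiad_tb}TB  codfw: ${codfw_tb}TB  delta: $((eqiad_tb - codfw_tb))TB"
# -> eqiad: 45TB  codfw: 42TB  delta: 3TB
```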

I was reviewing the size of codfw vs eqiad, and 7x 1TB per machine in codfw appears to be too big of a 'blast radius'. In the interest of normalizing codfw vs eqiad, I think we should also normalize the number of machines and SSDs per machine; IOW, get 3 more machines in codfw and have all codfw machines run 5x 1TB SSDs like eqiad does.

I've put together the following for a proposed sequence of tasks, and an estimation of the time required for each. Hopefully this will be helpful in composing an overall timeline with expected completion date.

Completing the expansion of restbase100[7-9]

task                            est. duration   comments
bootstrap 1008-b                1.3d            on-going
decomm 1008-a (128 tokens)      0.7d
bootstrap 1008-a (256 tokens)   1.0d
bootstrap 1009-b                1.6d
decomm 1009-a (256 tokens)      0.9d
bootstrap 1009-a (256 tokens)   1.2d

Note: These times take into account the quantity of data to be moved, at a concurrency of 3 streams of 4.5MB/s.

Note: The timing of 1009 depends on the completion of the currently on-going RAID expansion.
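As a sanity check on how such estimates fall out of the stated rate: 3 streams at 4.5MB/s gives an aggregate of 13.5MB/s, and duration is simply data volume divided by that rate. The helper below is a sketch; the 1500GB example volume is an assumption for illustration, not a figure from this task:

```shell
# Estimate bootstrap duration in days for a given data volume (GB),
# at the aggregate streaming rate of 3 x 4.5 MB/s = 13.5 MB/s.
est_days() {
  awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 1024 / 13.5 / 86400 }'
}

est_days 1500   # ~1.5TB of data streams in about 1.3 days
```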

Replacing restbase100[1-6]

For each in rack A, B, and D:

seq.   task               est. duration
1      bootstrap 10xx-a   0.9d
2      bootstrap 10xx-b   0.7d
3      bootstrap 10xx-a   0.6d
4      bootstrap 10xx-b   0.5d
5      decomm             0.6d
6      decomm             0.7d

The idea here is to work rack-by-rack: add two new hardware nodes, bootstrap two instances on each, and finally decommission the two existing nodes and repurpose the hardware. The process then moves on to the next rack.
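The per-rack procedure could be sketched as pseudo-shell. Hostnames here (`10xx`, `10yy`, `10zz`, `10ww`) are placeholders, and the actual bootstrap mechanics (puppet enablement, Cassandra startup, waiting for the node to report Up/Normal) are only summarized in the echoes:

```shell
# Per rack: bootstrap four new instances (two per new host), one at a
# time, waiting for each to finish streaming before starting the next.
for inst in restbase10xx-a restbase10xx-b restbase10yy-a restbase10yy-b; do
  echo "bootstrap ${inst}: enable in puppet, start cassandra, wait for UN"
done

# Then decommission the two old nodes; 'nodetool decommission' streams
# a node's data to the remaining replicas before it leaves the ring.
for old in restbase10zz restbase10ww; do
  echo "decommission ${old}: nodetool decommission, then reclaim hardware"
done
```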

Note: The timing here depends on the arrival and racking of the new hardware.

Note: These times take into account the quantity of data to be moved, at a concurrency of 3 streams of 4.5MB/s. However, as more nodes are added, higher stream concurrency becomes possible; provided the impact to production nodes allows, we might achieve even higher rates. The potential for higher throughput is greatest for steps 1 and 3, and to a lesser degree 5 and 6, as contention becomes less of a factor.
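When raising stream rates, Cassandra's outbound streaming cap can be adjusted at runtime with standard `nodetool` subcommands (shown as an ops fragment against a live node; the 200 value mirrors the throughput bump logged later in this task):

```shell
# Inspect the current outbound streaming cap (in megabits/s)
nodetool getstreamthroughput

# Raise the cap to 200 megabits/s; this takes effect immediately and
# does not persist across a Cassandra restart.
nodetool setstreamthroughput 200
```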

restbase200[1-6] / codfw datacenter

This is still something of a question mark, as it's not clear (to me at least) whether the plan is to add disks to the existing nodes, or to add 3 additional ones (though either way it should look like some combination of the above).

The new restbase servers are on-site. Let's coordinate which 2 servers to start with.

1001/1002 are in row A
1003/1004 are in row C
1005/1006 are in row D

> The new restbase servers are on-site. Let's coordinate which 2 servers to start with.

That's great news; Thanks!

> 1001/1002 are in row A
> 1003/1004 are in row C
> 1005/1006 are in row D

It probably doesn't matter a whole lot; I'd vote for starting at the beginning with row A (which also happens to be a bit heavier on utilization than the others).

@Cmjohnson thanks! seems fine to go with row A to me too, let me know how I can help

mark closed subtask Unknown Object (Task) as Resolved.Mar 2 2016, 11:51 AM

Mentioned in SAL [2016-03-09T20:19:45Z] <urandom> decommissioning restbase1001.eqiad.wmnet : T125842

Mentioned in SAL [2016-03-10T10:09:16Z] <godog> decommissioning restbase1002.eqiad.wmnet : T125842

Mentioned in SAL [2016-03-10T14:08:00Z] <urandom> increasing outbound stream throughput on restbase1002.eqiad.wmnet to 200mbps : T125842

Mentioned in SAL [2016-03-10T23:53:47Z] <urandom> Starting Cassandra cleanup op on restbase10{07,10,11}-{a,b}.eqiad.wmnet : T125842
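The cleanup op in the SAL entry above expands to six Cassandra instances. A sketch of iterating over them (the per-instance nodetool wrapper used on multi-instance hosts is an assumption here, hence only an echo):

```shell
# restbase10{07,10,11}-{a,b}.eqiad.wmnet -> 6 instances in all.
# 'nodetool cleanup' rewrites SSTables, dropping data for token ranges
# a node no longer owns after the ring has grown.
for host in restbase10{07,10,11}; do
  for inst in a b; do
    echo "nodetool cleanup on ${host}-${inst}.eqiad.wmnet"
  done
done
```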

Mentioned in SAL [2016-03-16T12:36:55Z] <godog> bootstrapping restbase1012-a T125842

Change 277843 had a related patch set uploaded (by Eevans):
restbase1012.eqiad.wmnet: enable instance 'b'

https://gerrit.wikimedia.org/r/277843

Change 277843 merged by Gehel:
restbase1012.eqiad.wmnet: enable instance 'b'

https://gerrit.wikimedia.org/r/277843

Mentioned in SAL [2016-03-17T16:18:52Z] <urandom> bootstrapping restbase1012-b.eqiad.wmnet : T125842

Change 278285 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: bootstrap restbase1013-a

https://gerrit.wikimedia.org/r/278285

Change 278285 merged by Filippo Giunchedi:
cassandra: bootstrap restbase1013-a

https://gerrit.wikimedia.org/r/278285

@fgiunchedi Regarding the remaining 2 restbases... I will not have enough SSDs to add the last 2 (can only do 1). Any chance you could offline 1003-1006?

If you can go with only the 4 new restbase servers, then I can get restbase1014/15 updated all at once.

@Cmjohnson yup, we'll be decommissioning restbase1003 and restbase1004 early next week once restbase1013 is fully in service

Change 278402 had a related patch set uploaded (by Eevans):
enable instance 'b'; restbase1013-b

https://gerrit.wikimedia.org/r/278402

Change 278402 merged by Ori.livneh:
enable instance 'b'; restbase1013-b

https://gerrit.wikimedia.org/r/278402

Mentioned in SAL [2016-03-19T01:54:05Z] <urandom> bootstrapping restbase1013-b.eqiad.wmnet : T125842

Mentioned in SAL [2016-03-22T12:14:15Z] <godog> nodetool decommission restbase1003 T125842

Mentioned in SAL [2016-03-22T20:46:11Z] <urandom> decommissioning restbase1004-a.eqiad.wmnet : T125842

Mentioned in SAL [2016-03-23T15:17:34Z] <urandom> Starting cleanups on restbase10{08,12,13}-{a,b}.eqiad.wmnet : T125842

Mentioned in SAL [2016-03-23T15:18:35Z] <urandom> CORRECTION: Starting cleanups on restbase10{08,10,11}-{a,b}.eqiad.wmnet : T125842

Change 284145 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: remove restbase100[56]

https://gerrit.wikimedia.org/r/284145

Change 284146 had a related patch set uploaded (by Filippo Giunchedi):
remove restbase100[56]

https://gerrit.wikimedia.org/r/284146

Change 284145 merged by Filippo Giunchedi:
cassandra: remove restbase100[56]

https://gerrit.wikimedia.org/r/284145

Change 284146 merged by Filippo Giunchedi:
remove restbase100[56]

https://gerrit.wikimedia.org/r/284146

These systems are being replaced via the sub-tasks. Since the hardware request has been granted, I'm resolving this task to clear up the requests board.