Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002
Closed, Resolved · Public

Description

We are adding a new service, "mobileapps", and we want to move sca to jessie too. So we are creating a new service cluster, based on jessie, to which we can progressively migrate services when they are ready to work on jessie. In the end, our plan is to reimage the sca100x servers once all services work on jessie, and to redistribute the services across the two clusters.

Event Timeline

Joe created this task. Jul 29 2015, 2:39 PM
Joe updated the task description. (Show Details)
Joe raised the priority of this task from to Needs Triage.
Joe added a subscriber: Joe.
Restricted Application added subscribers: Matanya, Aklapper. Jul 29 2015, 2:39 PM
RobH assigned this task to mark. Jul 29 2015, 3:42 PM
RobH added a subscriber: RobH.

This is being discussed in IRC, as it needs Mark to also sign off on it.

I'll note that this will use up two really high performance misc systems we have in eqiad: Dell PowerEdge R420, dual Intel Xeon E5-2450 v2 2.50GHz, 64GB Memory, (4) 3TB Disks. (We have two more similar with SSDs and H310 controllers, otherwise everything else is lesser.)

I think this was mentioned during the ops meeting (Marko confirms in irc), so it isn't an unexpected request.

Assigning to Mark for his review.

Joe added a comment. Jul 29 2015, 3:46 PM

I can add that I tried to avoid spares with SSDs, and the ones out of warranty. I expect 48 GB of RAM to be a minimum requirement for a service cluster, honestly; 64 GB is likely to be needed. Also, while most node services are pretty light on the CPU, having a decent number of cores is pretty important.

@RobH if you have any better suggestion, please do so :)

mark added a comment. Jul 30 2015, 9:26 AM

We can also buy (or rent) servers, if that would be better. :)

We can also buy (or rent) servers, if that would be better. :)

For this temporary transition, these are fine.

A couple of notes:

  • The SCA cluster at this point has really minimal usage in pretty much all aspects (CPU, Memory, Disk Space/IOPS, Network). The boxes that power it are already overprovisioned (more on that below). http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520A%2520eqiad&tab=m&vn=&hide-hf=false
  • SCA cluster machines are: PowerEdge R420, 16 HT-enabled cores, 64GB RAM, 2x 7200RPM SATA disks. They are already overpowered for what they do. I would be fine with getting lower-spec'ed boxes for this cluster and returning them to the spare pool. SSDs are obviously not needed, RAM usage is relatively low (zotero aside, which, being xulrunner-powered, tends to leak memory), and CPU usage is minimal.
  • Coupling mobileapps with the migration to Jessie for SCA is not something I am fond of. I see no reason for unnecessary coupling as it will only cause bottlenecks and further delays. I would rather we decoupled it and moved ahead with deploying mobileapps independently of the upgrade.
  • I honestly don't see SCB as a temporary measure/transition. Since temporary is the new permanent, and given that at least zotero and apertium are going to take a long time to become jessie compliant (zotero might not ever happen btw), SCB should be under warranty for sure and should be comparable to SCA. Should all the services ever be migrated out of SCA, renaming SCB to SCA (assuming we are pedantic enough about it) and returning the original SCA machines to the spare pool sounds like a better plan to me.

With the above in mind I propose:

  • WMF5842
  • WMF5843

Different rack rows, PowerEdge R420 with 32 GB Memory, (2) 500GB Disks, and well within warranty

MobileApps might be a disruptor of that. In the first phase, the service won't have RESTBase storage/caching, which means that all requests from all of the BetaApp users (currently ~150K of them, I believe) will hit the service directly. Also, in September VE becomes the new editing default on enwiki, which means Citoid is likely to see a major increase in traffic. Zotero likewise, ofc.

  • Coupling mobileapps with the migration to Jessie for SCA is not something I am fond of. I see no reason for unnecessary coupling as it will only cause bottlenecks and further delays. I would rather we decoupled it and moved ahead with deploying mobileapps independently of the upgrade.

I think it provides just the right opportunity to do so. Putting a new service on a system we know we are going to have to re-image seems less than ideal. Besides, it puts just the right amount of incentive for other services to be moved. Let's decrease our technical debt!

  • I honestly don't see SCB as a temporary measure/transition. Since temporary is the new permanent, and given that at least zotero and apertium are going to take a long time to become jessie compliant (zotero might not ever happen btw), SCB should be under warranty for sure and should be comparable to SCA. Should all the services ever be migrated out of SCA, renaming SCB to SCA (assuming we are pedantic enough about it) and returning the original SCA machines to the spare pool sounds like a better plan to me.

In my view, there is no such thing as SCB as a temporary solution. We can either (a) take the spares, put the services there and move them back; or (b) have a permanent SCB cluster LB-ed with SCA (with the same services). I'd like to go in the direction of (b), frankly.

I am not an expert in apertium, but it seems there are Jessie packages for it. Zotero is, indeed, a separate issue completely. So maybe we should start investigating that one before moving forward? If it is not moveable, could an option be to put it in a VM (say, in Ganeti) and have Citoid call it there?

With the above in mind I propose:

  • WMF5842
  • WMF5843

    Different rack rows, PowerEdge R420 with 32 GB Memory, (2) 500GB Disks, and well within warranty

Sounds good to me.

MobileApps might be a disruptor of that. In the first phase, the service won't have RESTBase storage/caching, which means that all requests from all of the BetaApp users (currently ~150K of them, I believe) will hit the service directly. Also, in September VE becomes the new editing default on enwiki, which means Citoid is likely to see a major increase in traffic. Zotero likewise, ofc.

Indeed. But we have no estimates for either in terms of req/s, do we? Until we do, and given that we currently have a lot of room to spare according to those graphs, I think we should assume the current cluster is enough. If not, we can always scale horizontally.

  • Coupling mobileapps with the migration to Jessie for SCA is not something I am fond of. I see no reason for unnecessary coupling as it will only cause bottlenecks and further delays. I would rather we decoupled it and moved ahead with deploying mobileapps independently of the upgrade.

I think it provides just the right opportunity to do so. Putting a new service on a system we know we are going to have to re-image seems less than ideal. Besides, it puts just the right amount of incentive for other services to be moved. Let's decrease our technical debt!

I've never said anything about reimaging. In fact, the way this ticket goes, I am reading it more as a gradual replacement of SCA with SCB. Which is a better way to move forward anyway. Slower and with a lot more room for maneuver. And hence no coupling of SCB to mobileapps is needed.

  • I honestly don't see SCB as a temporary measure/transition. Since temporary is the new permanent, and given that at least zotero and apertium are going to take a long time to become jessie compliant (zotero might not ever happen btw), SCB should be under warranty for sure and should be comparable to SCA. Should all the services ever be migrated out of SCA, renaming SCB to SCA (assuming we are pedantic enough about it) and returning the original SCA machines to the spare pool sounds like a better plan to me.

In my view, there is no such thing as SCB as a temporary solution. We can either (a) take the spares, put the services there and move them back; or (b) have a permanent SCB cluster LB-ed with SCA (with the same services). I'd like to go in the direction of (b), frankly.

(b), but with the services being LB-ed with SCA only during the migration rather than permanently. Let's call it (c) for now.

I am not an expert in apertium, but it seems there are Jessie packages for it.

Those are maintained by @KartikMistry and I am sincerely hoping they can be used. I would love to move apertium to jessie, and so would Kartik, it seems. There is still work to be done, though, according to T106385.

Zotero is, indeed, a separate issue completely. So maybe we should start investigating that one before moving forward? If it is not moveable, could an option be to put it in a VM (say, in Ganeti) and have Citoid call it there?

If we go with (c) we won't have to. It will give us the time we need to do the last bullet in T92468 (is there a ticket for that btw? There should be one if not), which would be great. Moving zotero to jessie provides us with pretty much zero benefits as far as the service itself goes. So let's leave it for last, hoping we can ditch it in the meantime.

With the above in mind I propose:

  • WMF5842
  • WMF5843

    Different rack rows, PowerEdge R420 with 32 GB Memory, (2) 500GB Disks, and well within warranty

Sounds good to me.

OK, I'll amend the subject and create the necessary paperwork (tasks, DNS changes, etc.)

akosiaris renamed this task from Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 to Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002. Aug 5 2015, 11:32 AM
akosiaris set Security to None.

Change 229700 had a related patch set uploaded (by Alexandros Kosiaris):
Introduce scb100{1,2}.eqiad.wmnet

https://gerrit.wikimedia.org/r/229700

Change 229710 had a related patch set uploaded (by Alexandros Kosiaris):
Introduce scb100{1,2}.eqiad.wmnet

https://gerrit.wikimedia.org/r/229710

Indeed. But we have no estimates for either in terms of req/s, do we? Until we do, and given that we currently have a lot of room to spare according to those graphs, I think we should assume the current cluster is enough. If not, we can always scale horizontally.

No, we don't, unfortunately. But I agree that we can react later if need be.

I've never said anything about reimaging. In fact, the way this ticket goes, I am reading it more as a gradual replacement of SCA with SCB. Which is a better way to move forward anyway. Slower and with a lot more room for maneuver. And hence no coupling of SCB to mobileapps is needed.

Hm, this sheds a different light on the issue. I was under the impression that we needed to discuss whether or not to have SCB alongside SCA, and thought SCA was here to stay and that we'd just switch the servers to Jessie.

Keeping them both would be a good option when new services come along (and/or traffic increases on current ones). This is, of course, highly dependent on the actual number of services, but IMHO this trend of putting more and more services on the same machine(s) cannot scale.

In the midst of this, I still do not understand your point about MobileApps. Are you saying that, effectively, we could put it on SCA now and then migrate it later to SCB? Since we're getting a move on setting up SCB, I think it's worth the (reasonable) wait to go directly to SCB.

(b), but with the services being LB-ed with SCA only during the migration rather than permanently. Let's call it (c) for now.

Let's :)

Those are maintained by @KartikMistry and I am sincerely hoping they can be used. I would love to move apertium to jessie, and so would Kartik, it seems. There is still work to be done, though, according to T106385.

Yup, we had a chat with @KartikMistry and @santhosh the other day and they explained they plan to complete the packaging work soon (TM).

If we go with (c) we won't have to. It will give us the time we need to do the last bullet in T92468 (is there a ticket for that btw? There should be one if not), which would be great. Moving zotero to jessie provides us with pretty much zero benefits as far as the service itself goes. So let's leave it for last, hoping we can ditch it in the meantime.

If SCA is to live for a little while longer, yes, that makes sense to do.

OK, I'll amend the subject and create the necessary paperwork (tasks, DNS changes, etc.)

Great, thnx! Could you give a guesstimate for when we could put the first service there?

In the midst of this, I still do not understand your point about MobileApps. Are you saying that, effectively, we could put it on SCA now and then migrate it later to SCB?

I am saying exactly that.

Since we're getting a move on setting up SCB, I think it's worth the (reasonable) wait to go directly to SCB.

T108184 is already a first blocker. Maybe it will be solved soon, maybe not. If not, the wait might not be reasonable, and the coupling will have just caused unnecessary delay. Not that this service has not seen its ton of delays already.

T108184 is already a first blocker. Maybe it will be solved soon, maybe not. If not, the wait might not be reasonable, and the coupling will have just caused unnecessary delay.

*sigh*

Not that this service has not seen its ton of delays already.

Yes, literally. That's why I am a bit eager to get it out.

Let's give T108184 a bit of time (couple of days) and decide then.

scb1001 => wmf5843 (row A)
scb1002 => wmf5842 (row B)
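
For reference, the `scb100{1,2}.eqiad.wmnet` shorthand used in the patch titles and the assignment above can be sketched as follows. This is only an illustrative Python snippet, not part of the actual Gerrit changes; the helper `expand_braces` and the `ASSIGNMENTS` table are hypothetical names, and the data is copied from this task.

```python
import re

def expand_braces(pattern):
    """Expand a single {a,b,...} group, e.g. 'scb100{1,2}.eqiad.wmnet'."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        # No brace group: the pattern is already a single hostname.
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + alt + tail for alt in match.group(1).split(",")]

# Hostname -> (asset tag, rack row), as stated in this task.
ASSIGNMENTS = {
    "scb1001": ("WMF5843", "A"),
    "scb1002": ("WMF5842", "B"),
}

hosts = expand_braces("scb100{1,2}.eqiad.wmnet")

# The two hosts should sit in different rack rows, as proposed earlier.
rows = {ASSIGNMENTS[host.split(".")[0]][1] for host in hosts}

for host in hosts:
    tag, row = ASSIGNMENTS[host.split(".")[0]]
    print(f"{host}: asset tag {tag}, row {row}")
```

The set of rows having two members is a quick check that the hosts are spread across rack rows rather than sharing one.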

Change 229700 merged by Alexandros Kosiaris:
Introduce scb100{1,2}.eqiad.wmnet

https://gerrit.wikimedia.org/r/229700

Change 229710 merged by Alexandros Kosiaris:
Introduce scb100{1,2}.eqiad.wmnet

https://gerrit.wikimedia.org/r/229710

Cluster has been installed and is ready to start accepting services

akosiaris closed this task as Resolved. Aug 10 2015, 6:08 PM

Resolving this

For posterity and clarity's sake, I just briefed ops in the meeting about the hardware and got an implicit OK that we are good to go.