
Switch ORES to dedicated cluster
Closed, Resolved (Public)

Description

Once T165171: rack/setup/install ores1001-1009 and T165170: rack/setup/install ores2001-2009 are complete, we'll want to switch our deployment from scb100[1-4].eqiad.wmnet and scb200[1-6].codfw.wmnet to ores100[1-9].eqiad.wmnet and ores200[1-9].codfw.wmnet.

  • Confirm setup of ORES on ores100[1-9] / ores200[1-9]
  • Update the scap config in the deploy repository to deploy to the new ores100[1-9] / ores200[1-9] hosts (see the sketch below)
  • Re-route traffic to ores100[1-9] / ores200[1-9]
  • Disable and uninstall ORES on scb100[1-4] / scb200[1-6]
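
As a rough sketch of the scap config step: scap3 deployments normally list their destination hosts in a targets file referenced from scap/scap.cfg, so the change boils down to swapping the scb hostnames for the new ores ones. The file names and layout below are assumptions for illustration, not the actual ores deploy repository:

    # scap/scap.cfg (hypothetical excerpt)
    [global]
    dsh_targets: targets

    # scap/targets (hypothetical) -- scb100[1-4] / scb200[1-6] entries replaced
    ores1001.eqiad.wmnet
    ores1002.eqiad.wmnet
    ...
    ores1009.eqiad.wmnet
    ores2001.codfw.wmnet
    ...
    ores2009.codfw.wmnet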


Event Timeline

Halfak renamed this task from "Switch ORES to dedicated cluster (CODFW)" to "Switch ORES to dedicated cluster (EQIAD)". Jun 16 2017, 3:14 PM
Halfak updated the task description.
Halfak renamed this task from "Switch ORES to dedicated cluster (EQIAD)" to "Switch ORES to dedicated cluster". Jun 16 2017, 7:03 PM
Halfak updated the task description.
Ladsgroup moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

@Halfak @Ladsgroup
I'm curious about how we plan to use the codfw boxes: will they simply be a warm standby, or will they be included in the worker pool?

I think the goal here is to totally switch from the SCB* nodes to the new ORES* nodes.

@akosiaris Hi! It looks like we're ready to rock... I'd be happy to patch or deploy anything here, when the time is right. Let me know!

Yeah, I am still finishing some other tasks and will circle back to this next week. There's a good question I've been asking myself: should we fully switch from the scb* nodes to the ores* nodes? Or could we finally split the functionality of web services and scoring? Namely, have celery workers exclusively on ores* nodes and have scb* nodes handle the web requests?

The former is definitely faster. We will probably be up and running on the new nodes in a day or so.

The latter is cleaner: it decouples the web request serving infrastructure from the scoring infrastructure, which lets us handle future expansions in a more fine-grained way, makes capacity guesstimations easier, and makes operations slightly easier (rebooting an ores* node in this scenario requires zero preparation). It will take a bit more time (1-2 days?) to get the new cluster up and running, as some refactoring of puppet code will have to take place, but that will also help clean up the puppet code and remove some tech debt. It will also resemble the labs environment a bit more, where celery and uwsgi reside in different VMs, since ORES was designed right from the start with that architecture in mind.
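
To make the proposed split concrete, here is a minimal, hypothetical sketch (not ORES's actual code) of how a web tier and a scoring tier can be decoupled with Celery: the uwsgi side only enqueues jobs on a shared broker, while the celery workers on the dedicated nodes do the memory-heavy scoring. The broker host, app name, and score_revision task are made up for illustration:

    # scoring_tasks.py -- shared by web and worker nodes (illustrative only)
    from celery import Celery

    # Broker/result backend host is an assumption; in production it would be
    # whatever Redis instance the cluster already uses.
    app = Celery(
        "scoring",
        broker="redis://redis.example.internal:6379/0",
        backend="redis://redis.example.internal:6379/1",
    )

    @app.task(name="scoring.score_revision")
    def score_revision(context, model, rev_id):
        # Placeholder for the expensive model scoring that should only ever
        # run on the dedicated worker (ores*) nodes.
        return {"context": context, "model": model, "rev_id": rev_id, "score": None}

    # On a web (uwsgi) node, a request handler would only enqueue and wait:
    #     result = score_revision.delay("enwiki", "damaging", 123456)
    #     score = result.get(timeout=15)
    # while `celery -A scoring_tasks worker` runs only on the worker nodes.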

Just to answer the above question: we definitely don't want the new boxes to be a warm standby. We want them to assume at least the scoring role, if not both the scoring role and the web request serving role.

+1 to @akosiaris' notes. I'm not sure how services will feel about us keeping the web workers on scb* nodes, but personally, I don't see an issue. They require a fraction of the memory and mostly block on HTTP-based IO to APIs.

I'd like to run a stress test on the new hardware as we prepare to bring it online. I want to get a clear measure of the kind of throughput we can expect and where we start to overload. I feel like we've been relying on assumptions for a bit too long, and this is a good opportunity to re-calibrate. If we move ORES fully to the ores* nodes, we can safely do such a test before moving over fully. If we want to do the hybrid setup, I'd like to switch from active-active to serving from one datacenter so we can run the test against the other datacenter.

> +1 to @akosiaris' notes. I'm not sure how services will feel about us keeping the web workers on scb* nodes, but personally, I don't see an issue. They require a fraction of the memory and mostly block on HTTP-based IO to APIs.

I don't see an issue either. Given that practically all the complaints about ORES have been about celery workers consuming memory, it should be fine.

> I'd like to run a stress test on the new hardware as we prepare to bring it online. I want to get a clear measure of the kind of throughput we can expect and where we start to overload. I feel like we've been relying on assumptions for a bit too long, and this is a good opportunity to re-calibrate. If we move ORES fully to the ores* nodes, we can safely do such a test before moving over fully. If we want to do the hybrid setup, I'd like to switch from active-active to serving from one datacenter so we can run the test against the other datacenter.

OK. We can arrange for that. How do you plan to do that stress test? Would you require the hosts to be fully installed with the ORES software? Both roles (web service and celery workers)? Only one? None of the above?

I'd like to stress the production-like setup, so I'd want to send requests directly to the load balancer, which would send them to the web nodes. That might be crazy, in which case I'd be OK with simulating the load balancer. Let's move the discussion of stress test details to T169246: Stress/capacity test new ores* cluster. :)
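
For reference, a stress test along those lines can be as simple as a script that fires concurrent scoring requests at the chosen endpoint and reports throughput and latency. The sketch below is a generic illustration; the endpoint URL, request shape, and concurrency numbers are assumptions, not the actual test run for T169246:

    import concurrent.futures
    import statistics
    import time
    import urllib.request

    # Hypothetical target; in the hybrid scenario this would be the LVS
    # endpoint of the datacenter being tested.
    URL = "https://ores.example.org/v3/scores/enwiki/damaging/123456"

    def one_request():
        start = time.monotonic()
        with urllib.request.urlopen(URL, timeout=30) as resp:
            resp.read()
        return time.monotonic() - start

    def run(concurrency=50, total=1000):
        latencies = []
        started = time.monotonic()
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            for latency in pool.map(lambda _: one_request(), range(total)):
                latencies.append(latency)
        elapsed = time.monotonic() - started
        print(f"throughput: {total / elapsed:.1f} req/s")
        print(f"median latency: {statistics.median(latencies):.3f}s")
        print(f"p95 latency: {statistics.quantiles(latencies, n=20)[-1]:.3f}s")

    if __name__ == "__main__":
        run()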

akosiaris changed the task status from Open to Stalled. Jul 7 2017, 9:13 AM

Stalling this while T169246 takes place

awight changed the task status from Stalled to Open. Dec 14 2017, 9:22 PM

Unstalling, now that the stress testing is complete.

And https://gerrit.wikimedia.org/r/#/c/410398/ finishes the migration. All that is left is to disable ORES on the scb* boxes, which I think it's prudent to hold off on for a bit (a couple of days at most).

I've been investigating another SCB issue related to memory usage and noticed that, for some reason, ORES on scb1002 is still receiving precache requests from ChangeProp (see Grafana).

I have no clue where these requests are coming from, as ChangeProp uses LVS to access ORES, so after @akosiaris' change there should be nothing coming from CP to SCB. I will try to locate where this is coming from.

Also, interestingly, this graph shows non-zero request rates to SCB.

Ok, the LVS interface on scb is still registered, so on scb ores.svc.eqiad.wmnet still resolves locally: inet 10.2.2.10/32 scope global lo:LVS valid_lft forever preferred_lft forever

> Ok, the LVS interface on scb is still registered, so on scb ores.svc.eqiad.wmnet still resolves locally: inet 10.2.2.10/32 scope global lo:LVS valid_lft forever preferred_lft forever

Yup, that's exactly it. It will be fixed by https://gerrit.wikimedia.org/r/#/c/408560/9/hieradata/role/common/scb.yaml,unified. With the way things are going ORES-wise (really well, that is; /me knocks on wood), that should be merged on Tuesday EU morning.
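
For future reference, here is a quick way to check whether a host is still claiming the service IP locally. This is only a diagnostic sketch (it assumes a Linux host with iproute2 and Python 3), not the actual fix, which is the puppet change linked above:

    import socket
    import subprocess

    SERVICE = "ores.svc.eqiad.wmnet"

    # Resolve the LVS service name, then check whether that address is still
    # configured on a local interface (the lo:LVS label seen above).
    service_ip = socket.gethostbyname(SERVICE)
    local_addrs = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout

    if service_ip in local_addrs:
        print(f"{SERVICE} ({service_ip}) is still bound locally -- "
              "requests to the service name will hit this host")
    else:
        print(f"{SERVICE} ({service_ip}) is not bound locally")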

akosiaris claimed this task.

And ORES finally no longer resides on the scb boxes :-). I've cleaned up the code, logs, and systemd units, and stopped the software instances for both uwsgi and celery on all nodes. I think we can close this.