
New Service Request: ORES
Closed, Resolved · Public

Description

Description: https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service
Timeline: We're hoping to deploy by the end of Nov.
Diagram:

ORES.png (302×664 px, 21 KB)

Technologies: python(flask, sklearn, celery), redis
Point person: Aaron Halfaker (@Halfak)

The diagram above depicts a common request flow that comes from the user's browser.

  1. The request reaches the nginx load balancer.
  2. nginx passes it to one of the two web app servers.
  3. If the score is cached in redis, the web app returns it.
  4. If the score is not cached in redis, the celery cluster generates it and stores it in redis.

The only modification that a user can perform is to request that a score be generated and cached. Otherwise the service is read-only.
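The cache-or-compute flow in steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the ORES implementation: `ScoreCache` is an in-memory stand-in for redis, the `compute` callable stands in for a job sent to the celery cluster, and the key layout and the model name `example_model` are hypothetical.

```python
from typing import Callable, Dict, Optional


class ScoreCache:
    """In-memory stand-in for the redis score cache (hypothetical interface)."""

    def __init__(self) -> None:
        self._store: Dict[str, float] = {}

    def get(self, key: str) -> Optional[float]:
        return self._store.get(key)

    def set(self, key: str, value: float) -> None:
        self._store[key] = value


def get_score(rev_id: int, model: str, cache: ScoreCache,
              compute: Callable[[int, str], float]) -> float:
    """Return the cached score if present; otherwise compute and cache it."""
    key = f"{model}:{rev_id}"  # hypothetical key layout
    cached = cache.get(key)
    if cached is not None:
        return cached  # step 3: cache hit
    score = compute(rev_id, model)  # step 4: stand-in for the celery workers
    cache.set(key, score)
    return score


if __name__ == "__main__":
    def fake_scorer(rev_id: int, model: str) -> float:
        # Placeholder for real model scoring.
        return 0.5

    cache = ScoreCache()
    first = get_score(123, "example_model", cache, fake_scorer)   # computed
    second = get_score(123, "example_model", cache, fake_scorer)  # served from cache
    print(first, second)
```

Because generating and caching a score is the only write path, a repeated request for the same revision and model never reaches the scorer a second time in this sketch.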

Event Timeline

Halfak raised the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak subscribed.
Restricted Application added a subscriber: Aklapper.

From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster (they've 64 cores and a lot of RAM each...). So it'll be two different celery setups rather than one (I presume?). We'd also need to find a common redis host somewhere for the actual caching.

chasemp triaged this task as Medium priority. (Nov 3 2015, 7:38 PM)

> From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster (they've 64 cores and a lot of RAM each...). So it'll be two different celery setups rather than one (I presume?). We'd also need to find a common redis host somewhere for the actual caching.

I guess that two celery instances can be set up on scb100[12] and use the same Redis instance/cluster.

Please provide some estimates related to:

  • the (projected) request rate
  • memory consumption

I also think that we should be careful about the interaction with other production services running on this cluster. There is not much isolation between services currently (only processes and firejail), so one mis-behaving service can wreak a lot of havoc.

@GWicke, what interactions are you worried about? Is this specific to ORES or maybe a more general concern?

@Halfak, it's a general concern, but something computationally intense and research-driven like ORES is especially difficult to gauge in that regard.

What other services are currently running in scb? Is there any isolation between the multiple services running in sca?

> From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster (they've 64 cores and a lot of RAM each...). So it'll be two different celery setups rather than one (I presume?). We'd also need to find a common redis host somewhere for the actual caching.

Yes, two different celery workers, one on each SCB node. For now the workers will share hosts with the app servers, resource-wise at least, although that is not optimal architecturally. The nginx in front is obviously not going to be part of the software deployed on scb. It's user-facing and we've got a whole stack for that in production. We do need to figure out, though, what we are going to do with that redis. The way I read this, we could host it on the same nodes again, but that is not really good architecturally and I'd rather we did not. Since it stores the results of the revscoring process, I assume that's data we do not want to lose, so we want the datastore to exhibit reliability, right?

In fact, ORES is the first service (aside from RESTBase, which is a special case on its own) that uses a datastore and a worker model. We should start architecting service clusters on the appserver node/worker node/datastore node pattern, I guess.

> What other services are currently running in scb? Is there any isolation between the multiple services running in sca?

a) mobileapps. minimal CPU/Memory requirements.
https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520B%2520eqiad&tab=m&vn=&hide-hf=false
a2) There is T96017, which basically means "consolidate SCA to SCB" (that's what SCB was created for anyway).

Given the reasoning in b) below, completing that task will not change much as far as a) goes.

b) There is some isolation, namely firejail. It is not, however, CPU/memory isolation. We have had no real reason to do so up to now, as all of the services on sca require minimal CPU/memory as well.

https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Service+Cluster+A+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

> @Halfak, it's a general concern, but something computationally intense and research-driven like ORES is especially difficult to gauge in that regard.

+1

> Yes, two different celery workers, one on each SCB node. For now the workers will share hosts with the app servers, resource-wise at least, although that is not optimal architecturally. The nginx in front is obviously not going to be part of the software deployed on scb. It's user-facing and we've got a whole stack for that in production. We do need to figure out, though, what we are going to do with that redis. The way I read this, we could host it on the same nodes again, but that is not really good architecturally and I'd rather we did not. Since it stores the results of the revscoring process, I assume that's data we do not want to lose, so we want the datastore to exhibit reliability, right?

I was thinking the same. Relying on redis might be a good first step, but I'm inclined to say we'd need to switch to something more permanent.

> In fact, ORES is the first service (aside from RESTBase, which is a special case on its own) that uses a datastore and a worker model. We should start architecting service clusters on the appserver node/worker node/datastore node pattern, I guess.

The first thing that comes to mind is having a dedicated server/cluster for data stores, but this is rather lame, as it might create more problems than it solves. But, yeah, we need to see whether this master/worker/db pattern will be used more in production, and if so, we need to come up with a general solution rather than do one-offs.

> a) mobileapps. minimal CPU/Memory requirements.
> https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520B%2520eqiad&tab=m&vn=&hide-hf=false

Note that we plan to implement pre-generation for the MobileApps service soon, which means that the service will get called on each article edit, so the graph will look completely different.

> b) There is some isolation, namely firejail. It is not, however, CPU/memory isolation. We have had no real reason to do so up to now, as all of the services on sca require minimal CPU/memory as well.

We should implement this ASAP.
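To make concrete what per-process memory isolation means here, the following is a minimal sketch of one POSIX mechanism: capping a child process's address space with `setrlimit`. It is an assumption-laden illustration only; the thread does not say which mechanism was eventually chosen for the SCB cluster, and `run_with_memory_cap` is a hypothetical helper name. The sketch covers memory only, not CPU.

```python
import resource
import subprocess
import sys
from typing import List


def run_with_memory_cap(cmd: List[str], max_bytes: int) -> int:
    """Run cmd in a child whose address space is capped (POSIX only)."""

    def apply_limit() -> None:
        # Runs in the child after fork and before exec, so the cap
        # applies only to the launched command, not to the parent.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    completed = subprocess.run(
        cmd,
        preexec_fn=apply_limit,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    gib = 1024 ** 3
    # A child that tries to allocate far more than the cap should fail
    # on its own instead of starving its neighbours.
    greedy = [sys.executable, "-c", "x = bytearray(8 * 1024**3)"]
    print(run_with_memory_cap(greedy, 2 * gib))
```

With a cap like this, a worker that allocates too much memory gets an allocation failure itself rather than pushing co-located services into swap or the OOM killer.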

akosiaris claimed this task.

Resolving since ORES has been in production for the past 2 weeks