New Service Request: ORES
Closed, Resolved (Public)

Authored By

	Halfak
	Nov 3 2015, 4:16 PM

Description

Description: https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service
Timeline: We're hoping to deploy by the end of Nov.
Diagram:

Technologies: python(flask, sklearn, celery), redis
Point person: Aaron Halfaker (@Halfak)

The diagram above depicts a common request flow that comes from the user's browser.

nginx load balancer
two web app servers
if the score is cached in redis, return that
if the score is not cached in redis, use celery cluster to generate it and store the score in redis.
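
The cache-or-compute step in the flow above can be sketched roughly as follows. This is only an illustration, not the actual ORES code: the in-memory cache stands in for redis, `fake_compute_score` stands in for a celery task, and all names here (`InMemoryScoreCache`, `get_score`, the `(model, rev_id)` key shape) are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical cache key: (model name, revision id).
ScoreKey = Tuple[str, int]


class InMemoryScoreCache:
    """Stand-in for the redis score cache (assumption: a real setup would use a redis client)."""

    def __init__(self) -> None:
        self._scores: Dict[ScoreKey, dict] = {}

    def get(self, key: ScoreKey) -> Optional[dict]:
        # Returns None on a cache miss.
        return self._scores.get(key)

    def set(self, key: ScoreKey, score: dict) -> None:
        self._scores[key] = score


def get_score(
    cache: InMemoryScoreCache,
    compute_score: Callable[[str, int], dict],
    model: str,
    rev_id: int,
) -> dict:
    """Return the cached score if present; otherwise compute it, store it, and return it."""
    key = (model, rev_id)
    cached = cache.get(key)
    if cached is not None:
        return cached
    # Cache miss: in the described setup this work would go to the celery cluster.
    score = compute_score(model, rev_id)
    cache.set(key, score)
    return score


def fake_compute_score(model: str, rev_id: int) -> dict:
    """Placeholder for the celery task; the returned score is made up."""
    return {"model": model, "rev_id": rev_id, "prediction": False}


if __name__ == "__main__":
    cache = InMemoryScoreCache()
    first = get_score(cache, fake_compute_score, "damaging", 12345)
    second = get_score(cache, fake_compute_score, "damaging", 12345)
    print(first == second)
```

The second call is served from the cache, so the compute function runs only once per (model, revision) pair. That matches the note that generating and caching a score is the only write a user can trigger.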

The only modification that a user can perform is to request that a score be generated and cached. Otherwise the service is read-only.

Related Objects

Status	Assigned	Task
Resolved	akosiaris	T117560 New Service Request: ORES
Resolved	Halfak	T110072 Security Review of Revscoring
Resolved	Halfak	T115534 Set up backpressure for ORES (Limit queue sizes in Celery)

Event Timeline

Halfak created this task. Nov 3 2015, 4:16 PM

Halfak raised the priority of this task to Needs Triage.

Halfak updated the task description.

Halfak added projects: SRE, Service-deployment-requests, Services, Machine-Learning-Team (Active Tasks).

Halfak subscribed.

Restricted Application added a project: acl*sre-team. Nov 3 2015, 4:16 PM

Restricted Application added a subscriber: Aklapper.

Halfak added a subtask: T110072: Security Review of Revscoring. Nov 3 2015, 4:17 PM

Restricted Application added a subscriber: Matanya. Nov 3 2015, 4:17 PM

Halfak added a subtask: T115534: Set up backpressure for ORES (Limit queue sizes in Celery). Nov 3 2015, 4:18 PM

Halfak set Security to None.

Halfak added a subscriber: yuvipanda.

chasemp subscribed. Nov 3 2015, 5:36 PM

Restricted Application added a subscriber: StudiesWorld. Nov 3 2015, 5:36 PM

RobH removed a project: acl*sre-team. Nov 3 2015, 7:22 PM

From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster (they've 64 cores and a lot of RAM each...). So it'll be two different celery setups rather than one (I presume?). We'd also need to find a common redis host somewhere for the actual caching.

chasemp triaged this task as Medium priority. Nov 3 2015, 7:38 PM

Legoktm subscribed. Nov 3 2015, 7:40 PM

In T117560#1778582, @yuvipanda wrote:

From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster (they've 64 cores and a lot of RAM each...). So it'll be two different celery setups rather than one (I presume?). We'd also need to find a common redis host somewhere for the actual caching.

I guess that two celery instances can be set up on scb100[12] and use the same Redis instance/cluster.

Please provide some estimates related to:

the (projected) request rate
memory consumption

I also think that we should be careful about the interaction with other production services running on this cluster. There is currently not much isolation between services (only processes and firejail), so one misbehaving service can wreak a lot of havoc.

@GWicke, what interactions are you worried about? Is this specific to ORES or maybe a more general concern?

@Halfak, it's a general concern, but something computationally intense and research-driven like ORES is especially difficult to gauge in that regard.

What other services are currently running in scb? Is there any isolation between the multiple services running in sca?

In T117560#1778582, @yuvipanda wrote:

From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster (they've 64 cores and a lot of RAM each...). So it'll be two different celery setups rather than one (I presume?). We'd also need to find a common redis host somewhere for the actual caching.

Yes, two different celery workers, one on each SCB node. For now the workers will share hosts with the app servers, resource-wise at least, although that is not optimal architecturally. The nginx in front is obviously not going to be part of the software deployed on scb. It's user-facing and we have a whole stack for that in production. We do need to figure out, though, what we are going to do with that redis. The way I read this, we could host it on the same nodes again, but that is not really good architecturally and I would rather we did not. Since it stores the results of the revscoring process, I assume that's data we do not want to lose, so we want the datastore to be reliable, right?

In fact, ORES is the first service (aside from RESTBase, which is a special case of its own) that uses a datastore and a worker model. I guess we should start architecting service clusters on the appserver node/worker node/datastore node pattern.

In T117560#1783345, @yuvipanda wrote:

What other services are currently running in scb? Is there any isolation between the multiple services running in sca?

a) mobileapps. minimal CPU/Memory requirements.
https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520B%2520eqiad&tab=m&vn=&hide-hf=false
a2) There is T96017, which basically means "consolidate SCA to SCB" (that's what SCB was created for anyway).

Given the reasoning in b) below, that task's completion will not change much as far as a) goes.

b) There is some isolation, namely firejail. It is not, however, CPU/memory isolation. Up to now we had no real reason to add that, as all of the services on sca require minimal CPU/memory as well.

https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Service+Cluster+A+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

In T117560#1783344, @GWicke wrote:

@Halfak, it's a general concern, but something computationally intense and research-driven like ORES is especially difficult to gauge in that regard.

In T117560#1785218, @akosiaris wrote:

Yes, two different celery workers, one on each SCB node. For now the workers will share hosts with the app servers, resource-wise at least, although that is not optimal architecturally. The nginx in front is obviously not going to be part of the software deployed on scb. It's user-facing and we have a whole stack for that in production. We do need to figure out, though, what we are going to do with that redis. The way I read this, we could host it on the same nodes again, but that is not really good architecturally and I would rather we did not. Since it stores the results of the revscoring process, I assume that's data we do not want to lose, so we want the datastore to be reliable, right?

I was thinking the same. Relying on redis might be a good first step, but I'm inclined to say we'd need to switch to something more permanent.

In fact, ORES is the first service (aside from RESTBase, which is a special case of its own) that uses a datastore and a worker model. I guess we should start architecting service clusters on the appserver node/worker node/datastore node pattern.

The first thing that comes to mind is having a dedicated server/cluster for data stores, but that is rather lame, as it might create more problems than necessary. But, yeah, we need to see whether this master/worker/db pattern will be used more in production, and if so, we need to come up with a general solution rather than doing one-offs.

a) mobileapps. minimal CPU/Memory requirements.
https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520B%2520eqiad&tab=m&vn=&hide-hf=false

Note that we plan to implement pre-generation for the MobileApps service soon, which means that the service will get called on each article edit, so the graph will look completely different.

b) There is some isolation, namely firejail. It is not, however, CPU/memory isolation. Up to now we had no real reason to add that, as all of the services on sca require minimal CPU/memory as well.

We should implement this ASAP.

Halfak closed subtask T115534: Set up backpressure for ORES (Limit queue sizes in Celery) as Resolved. Nov 20 2015, 2:38 PM

Halfak closed subtask T110072: Security Review of Revscoring as Resolved. Jan 21 2016, 3:40 PM

mobrovac moved this task from Inbox to Backlog on the Service-deployment-requests board. Jan 28 2016, 7:11 PM

Ladsgroup moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Feb 18 2016, 9:18 PM

Halfak moved this task from Backlog to Monitor (long term) on the Machine-Learning-Team (Active Tasks) board. Feb 20 2016, 5:04 PM