
[Epic] Deploy Revscoring/ORES service in Prod
Closed, Resolved (Public)

Description

This card is done when ORES is deployed in the production network.

Note that this card was originally a [Discussion] card and was later changed to an engineering task.

Aaron/Dario: Talk to Mark and Gabriel about scaling and the move towards Prod:

  • more sys than prod
  • Aaron, Dario, Yuvi to meet w/Gabriel to discuss whether they will adopt this as a service; if yes, we would need to work on our process with them
  • already started conversation with Mark re: where services like Revscoring will live:
    • need non-Prod/non-Labs place for Revscoring to live = meso-level support

In parallel with T106860: Write down current process and ideal process for Revscoring (request from Wikimania 2015)


Event Timeline


Meeting scheduled for 7/28 @ 10:30 PDT

Halfak updated the task description. Jul 24 2015, 7:25 PM
DarTar triaged this task as High priority. Jul 24 2015, 11:00 PM
DarTar updated the task description.
DarTar moved this task from Staged to In Progress on the Research board.
GWicke added a subscriber: GWicke. Jul 28 2015, 7:20 PM

Notes from the meeting: https://etherpad.wikimedia.org/p/revscoring_and_services

My summary:

  • to ensure security and allow experimentation, we'd ideally want to deploy this in a semi-prod vlan, accessible from prod but without access to the prod-internal network; however, this depends on scarce network engineering resources, which makes it unrealistic in the short term
  • as a result, decision is to deploy service on the prod network (for now)
    • the service is already puppetized
    • yuvi will take lead on python packaging
    • services will help with general deploy workflow, monitoring, logging, in collaboration with research & ops
    • hardware requirements are moderate (currently ~2 cores?), two hw boxes for redundancy should be sufficient with caching / storage
      • use SCB cluster / see T96017?
    • services will provide a public API and caching via RESTBase
GWicke renamed this task from Talk to Mark and Gabriel about scaling and moving Revscoring towards Prod to Revscoring in Production. Jul 28 2015, 7:21 PM
GWicke added subscribers: mobrovac, yuvipanda.
Halfak renamed this task from Revscoring in Production to [Discussion] Revscoring in Production. Jul 30 2015, 8:59 PM
He7d3r updated the task description. Jul 30 2015, 10:54 PM

Timeline from my perspective:

  1. Get packaging / puppet conversion to use packages done by the end of August. Helped by @awight and @madhuvishy
  2. Get Extension:ORES into a deployable state by the end of August. @Legoktm has been doing great on this
  3. Start the process for provisioning some hardware for this. I think one of the server spares can run the celery bits (so it can take the CPU load) and we can keep the uwsgi server in SCA.
  4. Get Extension:ORES out as a beta feature by end of next month!!!!!!1
  5. Everyone buys everyone else involved in this lots of alcohol or other drinks of choice.

Need to check if we need performance / security review of this.

> Need to check if we need performance / security review of this.

For the MW extension? We will need a security review at least, perf reviews are optional.

@Legoktm yeah, but also for the service itself.

@yuvipanda, yes, service will need its own review if it's running on production hardware or on a project domain. If someone can make separate Tasks for each and tag them with Security-Review, that would be best.

(also hahaa at optimistic schedules :P)

Joe added a subscriber: Joe. Aug 24 2015, 1:40 PM

Just for the record, there is no such thing as a "semi prod vlan". Please wait for @csteipp and maybe Moritz to take a look at this.

So list of things that need to be done to actually get this deployed from an operational perspective:

  1. Security Review
  2. Performance Review(??)
  3. Figure out how we're going to expose this to the internet
  4. Figure out which hardware this will live on

Other things that can happen in parallel:

  1. Graphite metrics
  2. Centralized logging.

> Just for the record, there is no such thing as a "semi prod vlan".

Indeed, sadly. It would be great if we could partition off services that don't need access to any internal infrastructure from the regular production network. We want to be able to do requests *from* production to this service, but the service's network access should ideally be limited to public production APIs only.

I think there is a wider need for better network isolation, and a semi-prod vlan could be a stepping stone in that direction. Another option that was brought up for use cases like HTML dumps was bare metal in the labs network. This is a wider discussion, which I think is just starting to happen.

@GWicke @Joe let's take that discussion to T95185? Suffice to say, it's irrelevant to ORES at this point.

Halfak closed this task as Resolved. Sep 19 2015, 2:10 PM

This was deployed with all the blockers still open?

Can someone point to the production url where the service is running?

This is not deployed in a prod network. The service lives in wmflabs.

See ores.wmflabs.org and https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service

yuvipanda reopened this task as Open. Nov 20 2015, 4:18 AM

Not sure why this was closed...

This card is a [Discussion]. That discussion happened. We should either have a new card for revscoring actually making it into production, or rewrite this card's description.

Halfak renamed this task from [Discussion] Revscoring in Production to Deploy Revscoring/ORES service in Prod. Nov 20 2015, 2:35 PM
Halfak updated the task description.

I've seen a few of these tags popping up in task titles. Where are they documented?

No documentation I know of. We just use them as a folksonomy within the revscoring project.

Things that still need to happen:

  1. Import and build debs into production repository
  2. Modify puppet to use debs instead of pip
  3. Set up Redis instances on oresdb hosts
  4. Set up ores on scb cluster
  5. Set up LVS for ORES
  6. Set up varnish endpoint

@akosiaris, I just updated the blocked-by tasks to include tasks for each of the notes that @yuvipanda left. I didn't fill in much for details. Please feel free to ping me if you need more.

Ladsgroup renamed this task from Deploy Revscoring/ORES service in Prod to [Epic] Deploy Revscoring/ORES service in Prod. Mar 12 2016, 7:08 AM
Ladsgroup added a project: Epic.
ggellerman moved this task from Backlog to Radar on the Research-Backlog board. Mar 17 2016, 10:28 PM

@akosiaris, can this be assigned to you since you already started work?

The Epic one? Er, yeah, sure.

Krinkle added a subscriber: Krinkle. Jun 8 2016, 9:21 PM
Ladsgroup closed this task as Resolved. Jun 14 2016, 9:53 PM
He7d3r added a subscriber: He7d3r. Jun 26 2016, 12:37 PM

Now we have https://ores.wmflabs.org/ and https://ores.wikimedia.org/. Was this task about implementing the latter one?

It was not about the first one.

Yeah. This was about getting ores.wikimedia.org online.

Our plan is to keep ores.wmflabs.org online for the foreseeable future. We'll have a deprecation announcement coming soon to encourage people to move over to ores.wikimedia.org. Eventually ores.wmflabs.org will be reserved for experimental modeling and processing strategies. So tools that rely on experimental or new models will likely keep using it, which will give us real usage patterns for testing performance improvements and that sort of thing.
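
For tool authors planning that move, a minimal sketch of switching a client from the Labs host to the production host might look like the following. It only swaps the hostname and leaves the rest of the URL untouched; the helper name `to_production` and the example path are hypothetical, and this assumes both hosts accept the same paths, which this thread does not confirm.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames taken from the discussion above.
LABS_HOST = "ores.wmflabs.org"
PROD_HOST = "ores.wikimedia.org"


def to_production(url: str) -> str:
    """Return url with the Labs host replaced by the production host.

    URLs that do not point at the Labs host are returned unchanged.
    Hypothetical helper: it assumes paths and query strings are valid
    on both hosts.
    """
    parts = urlsplit(url)
    if parts.hostname != LABS_HOST:
        return url
    # Keep path, query and fragment; force https on the production host.
    return urlunsplit(parts._replace(scheme="https", netloc=PROD_HOST))


if __name__ == "__main__":
    # The path below is a placeholder, not a documented endpoint.
    print(to_production("https://ores.wmflabs.org/some/path?x=1"))
```

Tools that depend on experimental models would skip this rewrite and keep pointing at ores.wmflabs.org.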