[Discuss] ORES model development and deployment processes
Closed, Declined (Public)

Description

Development:

  • Models are trained and tested on stat100* machines
  • Models are pushed to git LFS from stat100* machines

Deployment

  • Models are pulled via git LFS on deployment-prep
  • Deployment-prep + scap push models out to ORES cluster
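
The two stages above can be sketched as the Git LFS commands each side would run. This is only an illustration, not the team's actual tooling: the `*.model` pattern, the example model path, the `origin`/`master` names, and the helper functions are all assumptions, and the scap step is left out because its exact invocation is not described here. The sketch only builds and prints the commands; it does not execute them.

```python
import shlex


def publish_commands(model_path, message):
    """Commands a developer might run on a training host (e.g. a stat100*
    machine) to version a trained model with Git LFS.
    The '*.model' pattern and the branch name are hypothetical."""
    return [
        ["git", "lfs", "track", "*.model"],          # store model files via LFS
        ["git", "add", ".gitattributes", model_path],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", "master"],          # uploads LFS objects too
    ]


def fetch_commands():
    """Commands the deployment host might run to obtain the model files
    before they are pushed out to the cluster."""
    return [
        ["git", "pull", "origin", "master"],
        ["git", "lfs", "pull"],                       # download the LFS objects
    ]


def render(commands):
    """Render command lists as shell-safe lines for review (dry run)."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Hypothetical model path, used purely for illustration.
    print(render(publish_commands("models/example.model", "Add retrained model")))
    print(render(fetch_commands()))
```

Keeping the model files in the same repository history as the code is what lets a given deployment be traced back to the exact model it shipped with.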

This task is done when the discussion arrives at some consensus on next steps. Next steps can be just the filing of a task for "later" or real action to be taken immediately.

Event Timeline

In response to T214089#4954811:

I think building models in hadoop is an excellent proposal. Regretfully, it's very painful as a developer to do something like this in hadoop right now. We research/development folk still often use the computational resources on the stat machines because they're a better tool for the job. Even if we were to build the models in hadoop, we'd still be pushing them to git via LFS. Having models versioned with code is highly desirable.

One other note is that, should the stat machines be turned off, hadoop would become unavailable to us. So considering our means for accessing hadoop in the same class as the hadoop cluster seems reasonable, no?

As I'd said, the deployment pipeline will continue to work without the stat machines. However, we'll struggle to do new development on some models that require massive CPU and memory resources. We do still have backup options in WMFLabs, with a large RAM VM configured with ORES' production environment. Ultimately, we can use vagrant on our laptops if all else fails.

As we mentioned earlier, stat machines are not to be used to deploy to prod. There are models being trained in hadoop right now, but as you said, that process needs to be easier. We also need to find an easier path to deploy binaries/data calculated in the cluster to prod. You can follow discussions about this here: https://phabricator.wikimedia.org/T213976

One other note is that, should the stat machines be turned off, hadoop would become unavailable to us.

No, that is incorrect: hadoop jobs and stat machines are decoupled.

Stat machines are not used to deploy to prod for ORES.

Meeting scheduled for Thursday, Feb 28th @ 16:30 UTC. I've preemptively made a notes document here: https://etherpad.wikimedia.org/p/ores_usecases_for_ml_infrastructure

Halfak triaged this task as Medium priority.

This discussion seems to be stalled. I'm not sure that it should be assigned to me. @Nuria, did you have any specific goals you intend to achieve with this discussion? Can we resolve based on our last meeting, or would you need some follow-up?

Rather than closing, you can move it to blocked and leave it open. I do not think anything is happening in the near future; maybe some work will start in this area in Q1, but I am not sure that will be the case.

Nuria changed the task status from Open to Stalled. Mar 19 2019, 3:09 PM

Moved to ML from radar column.

Aklapper changed the task status from Stalled to Open. Jun 9 2020, 6:08 AM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.

(Smallprint, as general orientation for task management: If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead. If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks... → Edit Subtasks. If this task is stalled on an upstream project, then the Upstream tag should be added. If this task requires info from the task reporter, then there should be instructions which info is needed. If this task needs retesting, then the TestMe tag should be added. If this task is either out of scope and nobody should ever work on this, or nobody else managed to reproduce the problem described in this task, then this task should have the "Declined" status. If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

elukey subscribed.

We are moving to Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

I am closing old tasks related to ORES since it is being deprecated. Please re-open if you feel that any work could be done on Lift Wing.