[Discuss] ORES model development and deployment processes
Open, Stalled, Normal · Public

Description

Development:

  • Models are trained and tested on stat100* machines
  • Models are pushed to git LFS from stat100* machines

Deployment

  • Models are pulled via git LFS on deployment-prep
  • Deployment-prep + scap push models out to ORES cluster
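A minimal sketch of the git LFS steps described above, written as Python helpers that only assemble the command lines. The `*.model` tracking pattern and the `models/example.model` path are hypothetical placeholders, not the actual ORES repository layout, and the scap step is left out because its exact invocation is not given in this task.

```python
import shlex
from typing import List


def publish_model_commands(model_path: str, message: str) -> List[List[str]]:
    """Commands one might run on a training host to version a model via git LFS.

    The "*.model" pattern is an assumption for illustration only.
    """
    return [
        ["git", "lfs", "track", "*.model"],         # store matching files as LFS objects
        ["git", "add", ".gitattributes", model_path],
        ["git", "commit", "-m", message],
        ["git", "push"],                            # uploads LFS objects with the commit
    ]


def fetch_model_commands() -> List[List[str]]:
    """Commands one might run on the deployment host before handing off to scap."""
    return [
        ["git", "pull"],
        ["git", "lfs", "pull"],                     # download the model binaries
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them; nothing is run here.
    steps = publish_model_commands("models/example.model", "Add retrained model")
    for cmd in steps + fetch_model_commands():
        print(shlex.join(cmd))
```

Building the commands as lists keeps the sketch side-effect free; passing each list to `subprocess.run(cmd, check=True)` would execute it.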

This task is done when the discussion arrives at some consensus on next steps. Next steps can be just the filing of a task for "later" or real action to be taken immediately.

Event Timeline

Halfak created this task. Feb 15 2019, 3:03 PM
Restricted Application added a subscriber: Aklapper. Feb 15 2019, 3:03 PM

In response to T214089#4954811:

I think that building models in hadoop is an excellent proposal. Regrettably, it's very painful as a developer to do something like this in hadoop right now. We research/development folk still often use the computational resources on the stat machines because they're a better tool for the job. Even if we were to build the models in hadoop, we'd still be pushing them to git via LFS. Having models versioned with code is highly desirable.

One other note is that, should the stat machines be turned off, hadoop would become unavailable to us. So considering our means for accessing hadoop in the same class as the hadoop cluster seems reasonable, no?

As I said, the deployment pipeline will continue to work without the stat machines. However, we'll struggle to do new development on some models that require massive CPU and memory resources. We still have backup options in WMFLabs, with a large-RAM VM configured with ORES' production environment. Ultimately, we can use vagrant on our laptops if all else fails.

Halfak updated the task description. Feb 15 2019, 3:07 PM
Halfak updated the task description.
fdans moved this task from Incoming to Radar on the Analytics board. Feb 18 2019, 4:28 PM
Nuria added a subscriber: Nuria. Feb 18 2019, 5:26 PM

As we mentioned earlier, stat machines are not to be used to deploy to prod. There are models being trained in hadoop right now but, as you said, that process needs to be easier. We also need to find an easier path to deploy binaries/data calculated in the cluster to prod; you can follow discussions about this here: https://phabricator.wikimedia.org/T213976

One other note is that, should the stat machines be turned off, hadoop would become unavailable to us.

No, that is incorrect: hadoop jobs and stat machines are decoupled.

Stat machines are not used to deploy to prod for ORES.

Meeting scheduled for Thursday, Feb 28th @ 1630 UTC. I've preemptively made a notes document here: https://etherpad.wikimedia.org/p/ores_usecases_for_ml_infrastructure

Halfak triaged this task as Normal priority. Feb 26 2019, 10:18 PM
Halfak claimed this task.
Halfak removed Halfak as the assignee of this task. Mar 19 2019, 2:55 PM

This discussion seems to be stalled. I'm not sure that it should be assigned to me. @Nuria, did you have any specific goals you intend to achieve with this discussion? Can we resolve based on our last meeting or would you need some follow-up?

Nuria added a comment. Edited · Mar 19 2019, 3:09 PM

Rather than closing, you can move it to blocked and leave it open. I do not think anything is happening in the near future. Maybe some work will start in this area in Q1, but I am not sure that will be the case.

Nuria changed the task status from Open to Stalled. Mar 19 2019, 3:09 PM

Moved to ML from radar column.