
Design how we'll train models which depend on private data
Open, Low, Public

Description

For example, the draftquality models require non-public, deleted article content. We shouldn't be copying that to labs boxes.

One workaround might be to calculate features on our development machine, then export just the evaluated feature values to ores-compute, omitting the article text.
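A minimal sketch of that workaround, assuming each observation is a dict holding a revision ID, a label, and the private text. The two feature functions and the field names are hypothetical stand-ins, not the actual draftquality feature list. Only the computed values are written out; the text never reaches the export.

```python
import csv
import io

# Hypothetical feature extractors, for illustration only. The real
# draftquality features would come from the model's own feature list.
def char_count(text):
    return len(text)

def word_count(text):
    return len(text.split())

FEATURES = [("char_count", char_count), ("word_count", word_count)]

def export_feature_rows(observations, out):
    """Write one TSV row per observation: ID, label, feature values.

    The article text is read to compute features but is never written.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["rev_id", "label"] + [name for name, _ in FEATURES])
    for obs in observations:
        values = [extract(obs["text"]) for _, extract in FEATURES]
        writer.writerow([obs["rev_id"], obs["label"]] + values)

# Made-up sample data standing in for deleted article content.
observations = [
    {"rev_id": 101, "label": "spam", "text": "Buy cheap widgets now"},
    {"rev_id": 102, "label": "OK", "text": "A short stub about a river"},
]
buffer = io.StringIO()
export_feature_rows(observations, buffer)
exported = buffer.getvalue()
```

Whether the feature values themselves are safe to export is a separate question: some features could leak fragments of the text, so the feature list would need a review before anything is copied to ores-compute.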

A longer-term solution would be to secure the training compute box, but that would have to be in the production cluster. I'm not sure if the security tradeoff makes sense there.

Event Timeline

I checked on the new incoming stat* boxes. They will be running Debian Stretch, so we'll be able to use them to train models. We should use the labs boxes and permissions in the meantime.

We discussed a two-step solution:

  1. For now, protect the files on labs by making them readable by a Un*x group including only NDA users.
  2. Extract and build these models on stat machines in the future.
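Step 1 could look roughly like the sketch below, which strips world access from a file or directory and optionally hands it to a restricted group. The helper name and the sample file name are hypothetical, and the group itself (whatever the NDA-only group ends up being called) would have to exist already; the demo skips the group change so it runs anywhere.

```python
import os
import shutil
import stat
import tempfile

def restrict_to_group(path, group=None):
    """Make path accessible to its owner and group only.

    If `group` is given, the path's group is changed first; the group
    must already exist and the caller must be allowed to assign it.
    Returns the resulting permission bits.
    """
    if group is not None:
        shutil.chown(path, group=group)
    # Directories need the execute bit to be traversable.
    mode = 0o750 if os.path.isdir(path) else 0o640
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)

# Demo on a throwaway directory, without changing the group.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "deleted_articles.tsv")  # hypothetical file name
    with open(sample, "w") as f:
        f.write("placeholder\n")
    file_mode = restrict_to_group(sample)
    dir_mode = restrict_to_group(tmp)
```

This only helps if every parent directory is also closed to other users, and it does nothing about root access on the labs host, which is part of why it is only the interim step.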

The new stat machines are ready. See T165366: rack/setup/install replacement stat1006 (stat1003 replacement). We should be able to start training models there. Note that this machine is in the prod cluster -- all who have access are NDA'd.

We'll need to double-check our enchant dictionaries. They are the only thing that is *really* OS-dependent.
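One way to do that check, sketched under the assumption that the models load their dictionaries through pyenchant: compare the language tags we need against what the machine reports. The tags in the example are made up, and `installed_dictionaries` is a hypothetical helper that only works where pyenchant and its system library are installed.

```python
def missing_dictionaries(required, available):
    """Return the required language tags that are not available, sorted."""
    available = set(available)
    return sorted(tag for tag in set(required) if tag not in available)

def installed_dictionaries():
    """List the language tags enchant can load on this machine.

    Imported lazily so the comparison above works without pyenchant.
    """
    import enchant
    return enchant.list_languages()

# Example with made-up tag lists; on a real host, pass
# installed_dictionaries() as the second argument.
example_missing = missing_dictionaries(
    ["en_US", "fr_FR", "pt_PT"],
    ["en_US", "en_GB", "pt_PT"],
)
```

Running the same comparison on the labs box and on the stat machine would show which dictionary packages still need installing before the models are rebuilt there.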