[Spike] Semi-supervised machine learning
Open, Low · Public · Spike


Pattern is roughly:

  1. Label small random sample
  2. Train model
  3. Make predictions on new data -- auto-label confident observations
  4. GOTO 2
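
The loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: the nearest-centroid model, the distance-based confidence score, the 0.8 threshold, and all function names (`train_centroids`, `predict_with_confidence`, `self_train`) are hypothetical stand-ins, not the classifiers or features used in production, and the data is synthetic.

```python
import random


def train_centroids(X, y):
    """Step 2: fit a toy nearest-centroid model (one mean vector per class)."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_with_confidence(model, features):
    """Return (label, confidence). Assumes the model has seen at least two classes."""
    dists = {
        label: sum((a - b) ** 2 for a, b in zip(features, centroid)) ** 0.5
        for label, centroid in model.items()
    }
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    # Confidence is high when the nearest centroid is much closer than the rest.
    confidence = 1.0 - dists[label] / total if total > 0 else 1.0
    return label, confidence


def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.8, max_rounds=5):
    """Repeat: train, auto-label confident observations, retrain."""
    X_train, y_train = list(X_labeled), list(y_labeled)
    pool = list(X_unlabeled)
    model = train_centroids(X_train, y_train)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for features in pool:  # step 3: predict on new data
            label, conf = predict_with_confidence(model, features)
            (confident if conf >= threshold else remaining).append((features, label))
        if not confident:
            break  # nothing new to learn from
        for features, label in confident:
            X_train.append(features)
            y_train.append(label)
        pool = [features for features, _ in remaining]
        model = train_centroids(X_train, y_train)  # step 4: GOTO 2
    return model, len(X_train)


# Synthetic demo data: two well-separated 2-D blobs (an assumption for illustration).
rng = random.Random(0)


def blob(center, n):
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]


seed_X = blob(0.0, 5) + blob(5.0, 5)  # step 1: small labeled sample
seed_y = ["good"] * 5 + ["damaging"] * 5
pool_X = blob(0.0, 100) + blob(5.0, 100)  # unlabeled observations
test_X = blob(0.0, 50) + blob(5.0, 50)
test_y = ["good"] * 50 + ["damaging"] * 50

model, n_train = self_train(seed_X, seed_y, pool_X)
hits = sum(predict_with_confidence(model, x)[0] == y for x, y in zip(test_X, test_y))
test_accuracy = hits / len(test_y)
print(n_train, test_accuracy)
```

The confidence threshold is the main design choice here: set too low, the model reinforces its own mistakes; set too high, few observations are ever auto-labeled.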

This task is done when we have experimented with training a model this way and compared its predictions against a labeled test set. We'll need a solid testing strategy.

Event Timeline

Halfak renamed this task from Semi-supervised machine learning to [Spike] Semi-supervised machine learning. Aug 16 2016, 4:48 PM
Halfak added a project: Spike.
Halfak updated the task description.

This will likely be especially useful when we have large feature vectors implemented (T132580) and we start working with hashing vectorization in the wild (T128087).

I talked to @Sabya on IRC. Here are the steps I recommended.

  1. Read up on methods.
  2. Take our labeled data for damaging/not and split it into train and test sets.
  3. Build a model on the training set.
  4. Run the model against a random sample of revisions and keep the revisions that are strongly scored (high confidence of "damaging" or not).
  5. Train a new model on the training set + the strongly-labeled observations.
  6. Test both models against the test set and see if the new one does better.
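
A rough sketch of steps 2 through 6, to make the comparison concrete. Everything here is an assumption for illustration: the one-feature sigmoid model, the 0.9 "strongly scored" cutoff, the 30% test split, and the names `fit`, `score`, `accuracy`, and `run_experiment` are hypothetical, and the data is synthetic rather than real revision data.

```python
import math
import random


def fit(X, y):
    """Toy 1-D model: midpoint between the class means, scaled by their gap."""
    pos = [x for x, label in zip(X, y) if label]
    neg = [x for x, label in zip(X, y) if not label]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return {"midpoint": (mean_pos + mean_neg) / 2.0, "scale": mean_pos - mean_neg}


def score(model, x):
    """Pseudo-probability that x is damaging (True)."""
    z = model["scale"] * (x - model["midpoint"])
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(model, rows):
    hits = sum((score(model, x) >= 0.5) == label for x, label in rows)
    return hits / len(rows)


def run_experiment(labeled, unlabeled, test_fraction=0.3, strong=0.9, seed=0):
    # Step 2: split the labeled data into train and test sets.
    rows = list(labeled)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]
    X_train = [x for x, _ in train]
    y_train = [label for _, label in train]

    # Step 3: build the baseline model on the training set.
    baseline = fit(X_train, y_train)

    # Step 4: keep only the strongly scored unlabeled observations.
    auto = []
    for x in unlabeled:
        p = score(baseline, x)
        if p >= strong or p <= 1.0 - strong:
            auto.append((x, p >= 0.5))

    # Step 5: retrain on the training set plus the auto-labeled observations.
    augmented = fit(X_train + [x for x, _ in auto],
                    y_train + [label for _, label in auto])

    # Step 6: compare both models on the same held-out test set.
    return {
        "baseline_accuracy": accuracy(baseline, test),
        "augmented_accuracy": accuracy(augmented, test),
        "auto_labeled": len(auto),
    }


# Synthetic stand-in data: damaging ~ N(3, 1), not damaging ~ N(-3, 1).
rng = random.Random(42)
labeled = ([(rng.gauss(3.0, 1.0), True) for _ in range(20)]
           + [(rng.gauss(-3.0, 1.0), False) for _ in range(20)])
unlabeled = [rng.gauss(rng.choice([-3.0, 3.0]), 1.0) for _ in range(200)]

result = run_experiment(labeled, unlabeled)
print(result)
```

The test set is split off before any auto-labeling so that neither model ever trains on it; otherwise the comparison in step 6 would not be meaningful.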

@Halfak which classifier algorithm should I use? The current production algorithms or HashingVectorizer + GradientBoosting?

If it's easy to do so, I'd say "both".

Restricted Application changed the subtype of this task from "Task" to "Spike". · Jan 19 2021, 10:38 PM