Implement native NN model in revscoring
Open, Needs Triage, Public


Here's the basic interface of a scorer model:

The gist is that a ScorerModel contains the following members:

  • features: A list of revscoring.Feature or FeatureVector
  • score: A method that takes a set of extracted features as an argument and produces a JSON blob as output
  • info: A ModelInfo object that contains a model name, version, fitness statistics, etc.
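
To make the three members concrete, here is a minimal Python sketch of an object with that shape. It does not use revscoring itself: ToyModelInfo and ThresholdScorerModel are hypothetical names, plain strings stand in for revscoring.Feature objects, and the "sum the features and compare to a threshold" logic is only a placeholder for a real model.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Sequence


@dataclass
class ToyModelInfo:
    """Hypothetical stand-in for the info object: name, version, statistics."""
    name: str
    version: str
    statistics: Dict[str, float] = field(default_factory=dict)


class ThresholdScorerModel:
    """Toy model exposing the three members: features, score, info."""

    def __init__(self, features: Sequence[str], threshold: float, info: ToyModelInfo):
        # Feature names stand in for revscoring.Feature objects (assumption).
        self.features = list(features)
        self.threshold = threshold
        self.info = info

    def score(self, feature_values: Sequence[float]) -> Dict[str, Any]:
        """Take extracted feature values and return a JSON-serializable dict."""
        if len(feature_values) != len(self.features):
            raise ValueError("expected %d feature values" % len(self.features))
        total = float(sum(feature_values))
        return {"prediction": total >= self.threshold, "total": total}


model = ThresholdScorerModel(
    features=["chars_added", "words_added"],  # hypothetical feature names
    threshold=10.0,
    info=ToyModelInfo(name="toy", version="0.0.1"),
)
doc = model.score([7.0, 5.0])
blob = json.dumps(doc)  # the score output serializes to a JSON blob
```

A neural-network model would presumably keep the same three members and swap the body of score for a forward pass.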

See the existing scikit-learn-based models for how we implement this.

@Isaac already did a bunch of work with this for topic modeling. You can find his code on stat1007.eqiad.wmnet.

This is the script that I used to preprocess your article text + labels to fastText format (and split into training/val/test): /home/isaacj/fastText/drafttopic/
This is the script that I used to train / evaluate a model on the preprocessed text: /home/isaacj/fastText/drafttopic/
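
For readers without access to those scripts, the preprocessing step can be sketched roughly as follows. This is not the script referenced above, only an illustration under stated assumptions: fastText's supervised input format puts one example per line with each label prefixed by `__label__`, and the helper names, label names, and 80/10/10 split fractions here are hypothetical.

```python
import random


def to_fasttext_line(text, labels):
    """Format one example as '__label__A __label__B some text'."""
    prefix = " ".join("__label__" + label.replace(" ", "_") for label in labels)
    # fastText reads one example per line, so collapse newlines and extra spaces.
    clean = " ".join(text.split())
    return (prefix + " " + clean).strip()


def split_examples(lines, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle deterministically and split into (train, val, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Hypothetical articles and topic labels, for illustration only.
examples = [
    ("Article text number %d\nwith a second line" % i,
     ["Culture.Music", "Geography.Europe"])
    for i in range(10)
]
lines = [to_fasttext_line(text, labels) for text, labels in examples]
train, val, test = split_examples(lines)
```

Each of the three lists can then be written to its own file, one line per example, for training and evaluation.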

Event Timeline

Halfak created this task. Jan 6 2020, 6:42 PM
Halfak renamed this task from Implement native fasttext model in revscoring to Implement native NN topic model in revscoring. Jan 13 2020, 5:49 PM

We looked into fasttext but it is too single-purpose. @Isaac made some great progress implementing the last layer in Keras. Could you share your work with us here?

@Halfak: I moved the code to stat1005 so I can hopefully get access to the GPUs there for any further testing. But here's what I have thus far:

  • code for training these models is as follows: stat1005:/home/isaacj/topic_models/
  • on stat1005, you can run source activate /home/isaacj/p3_ml/bin/activate to access my Python3 virtualenv with the appropriate libraries.
  • running it as python with no additional parameters will train a fastText model. Running it with --model_type keras will train a Keras model. And then there are a bunch of arguments you can pass per the code.
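
The command-line behavior described in the last bullet could look something like the sketch below. Only the --model_type flag and its keras value come from the description above; the "fasttext" default value, the --learning_rate and --epochs arguments, and the function names are assumptions made for illustration, and the actual script may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train a topic model (illustrative sketch).")
    # With no arguments the fastText path is taken; "fasttext" as the
    # literal default value is an assumption.
    parser.add_argument("--model_type", choices=["fasttext", "keras"], default="fasttext")
    # Hypothetical extra arguments; defaults mirror the settings reported for the runs.
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--epochs", type=int, default=10)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder dispatch: a real script would call its training code here.
    if args.model_type == "keras":
        return "would train a Keras model for %d epochs" % args.epochs
    return "would train a fastText model for %d epochs" % args.epochs


default_run = main([])
keras_run = main(["--model_type", "keras", "--epochs", "5"])
```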

Also some results looking at different models trained under this paradigm:

For all models: learning rate = 0.1; epochs = 10; vocab size = 300K unless otherwise stated

== fastText model using 50-dimensional embeddings but no pre-trained embeddings; minCount = 10 (vocab size = 89K) ==
Precision: 0.847 micro; 0.829 macro
Recall: 0.677 micro; 0.601 macro
F1: 0.745 micro; 0.690 macro
PR-AUC: 0.967 micro; 0.959 macro
Avg pre.: 0.813 micro; 0.762 macro

== fastText model using pretrained 50-dimensional skipgram fastText embeddings (with further fine-tuning) ==
Precision: 0.842 micro; 0.824 macro
Recall: 0.731 micro; 0.678 macro
F1: 0.780 micro; 0.741 macro
PR-AUC: 0.974 micro; 0.972 macro
Avg pre.: 0.839 micro; 0.801 macro

== Keras model using 50-dimensional skipgram fastText embeddings with no fine-tuning ==
Precision: 0.738 micro; 0.690 macro
Recall: 0.326 micro; 0.239 macro
F1: 0.430 micro; 0.334 macro
Avg pre.: 0.615 micro; 0.517 macro

== Keras model using fastText-finetuned 50-dimensional skipgram fastText embeddings with no additional fine-tuning == 
Precision: 0.835 micro; 0.809 macro
Recall: 0.505 micro; 0.395 macro
F1: 0.609 micro; 0.511 macro
Avg pre.: 0.756 micro; 0.690 macro
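
For anyone reading the micro and macro columns above, here is a small pure-Python sketch of how the two averages differ for multi-label predictions. It is not the evaluation code used for these results. It assumes micro averaging pools true/false positive counts over all labels, macro averaging takes the unweighted mean of per-label scores, and an undefined ratio is treated as 0.0. The labels A and B are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per example."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for true_set, pred_set in zip(y_true, y_pred):
        for label in labels:
            if label in pred_set and label in true_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in true_set:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute the metrics once.
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = precision_recall_f1(*pooled)
    # Macro: compute the metrics per label, then take the unweighted mean.
    per_label = [precision_recall_f1(*counts[label]) for label in labels]
    macro = tuple(sum(m[i] for m in per_label) / len(labels) for i in range(3))
    return micro, macro


# "A" is a common label that is always predicted correctly;
# "B" is a rare label that is missed the one time it occurs.
y_true = [{"A"}, {"A"}, {"A"}, {"A", "B"}]
y_pred = [{"A"}, {"A"}, {"A"}, {"A"}]
micro, macro = micro_macro(y_true, y_pred, ["A", "B"])
```

In this toy case the single missed rare label lowers micro recall only to 0.8 but macro recall to 0.5, because macro averaging weighs every label equally regardless of how often it occurs.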
Halfak removed kevinbazira as the assignee of this task. Mon, Feb 3, 5:43 PM
Halfak added a subscriber: kevinbazira.
Halfak renamed this task from Implement native NN topic model in revscoring to Implement native NN model in revscoring. Wed, Feb 12, 9:42 PM
Halfak removed a project: drafttopic-modeling.
Halfak updated the task description.
Halfak moved this task from Untriaged to Research on the Scoring-platform-team board.