
Implement native NN model in revscoring
Closed, ResolvedPublic

Description

Here's the basic interface of a scorer model: https://github.com/wikimedia/revscoring/blob/master/revscoring/scoring/models/model.py

The gist is that a ScorerModel contains the following members:

  • features: A list of revscoring.Feature or FeatureVector
  • score: A method that takes a set of extracted features as an argument and produces a JSON blob as output
  • info: A ModelInfo object that contains a model name, version, fitness statistics, etc.

See https://github.com/wikimedia/revscoring/blob/master/revscoring/scoring/models/sklearn.py for how we implement this for scikit-learn-based models.
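
To make the target concrete, here is a minimal sketch of what a native NN scorer could look like against that interface. The class name, constructor arguments, and helper shapes are hypothetical (the real revscoring base class carries more machinery), but score() mirrors the prediction/probability JSON blob described above:

import numpy as np

class NNScorerModel:
    # hypothetical sketch, not the actual revscoring base class
    def __init__(self, features, labels, weights, info):
        self.features = features  # list of revscoring Feature / FeatureVector
        self.labels = labels      # topic label names
        self.weights = weights    # e.g. a (n_labels, n_features) output matrix
        self.info = info          # ModelInfo-like metadata (name, version, statistics)

    def score(self, feature_values):
        # multi-label case: one independent sigmoid per label
        raw = np.dot(self.weights, np.asarray(feature_values, dtype=float))
        probs = 1 / (1 + np.exp(-raw))
        return {
            "prediction": [l for l, p in zip(self.labels, probs) if p >= 0.5],
            "probability": {l: float(p) for l, p in zip(self.labels, probs)},
        }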

@Isaac already did a bunch of work with this for topic modeling. You can find his code on stat1007.eqiad.wmnet.

This is the script that I used to preprocess your article text + labels to fastText format (and split into training/val/test): /home/isaacj/fastText/drafttopic/drafttopic_article_fasttext_preprocess.py
This is the script that I used to train / evaluate a model on the preprocessed text: /home/isaacj/fastText/drafttopic/drafttopic_article_fasttext_model.py
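
For reference, fastText's supervised input format is one document per line, with each label prefixed by __label__ (the same prefix that gets stripped from predictions in the code further down). A minimal conversion sketch, using labels and tokens borrowed from the Galapagos example below as hypothetical inputs:

# hypothetical inputs: one article's topic labels plus its tokenized text
labels = ['Culture.Media.Video_games', 'Culture.Internet_culture']
tokens = ['P31', 'Q7889', 'P136', 'Q270948']

# one training example in fastText supervised format
line = ' '.join('__label__' + l for l in labels) + ' ' + ' '.join(tokens)
# -> '__label__Culture.Media.Video_games __label__Culture.Internet_culture P31 Q7889 P136 Q270948'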

Event Timeline

Halfak renamed this task from Implement native fasttext model in revscoring to Implement native NN topic model in revscoring. Jan 13 2020, 5:49 PM

We looked into fasttext but it is too single-purpose. @Isaac made some great progress implementing the last layer in Keras. Could you share your work with us here?

@Halfak : I moved the code to stat1005 so I can hopefully get access to the GPUs there for any further testing. But here's what I have thus far:

  • the code for training these models is at stat1005:/home/isaacj/topic_models/topic_modeling.py
  • on stat1005, you can run source /home/isaacj/p3_ml/bin/activate to access my Python3 virtualenv with the appropriate libraries.
  • running it as python topic_modeling.py with no additional parameters will train a fastText model; running it with --model_type keras will train a Keras model. There are also a bunch of other arguments you can pass, per the code.
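
For context, the Keras variant is essentially the fastText architecture re-expressed as averaged embeddings feeding a sigmoid output layer. The sketch below is not the actual script; the shapes (300K vocab, 50-dimensional embeddings, 64 topic labels) are assumptions taken from the results and code comments elsewhere in this task, and the random matrix stands in for pre-trained fastText vectors:

import numpy as np
from tensorflow import keras

# assumed shapes per the results below: 300K vocab, 50-d embeddings, 64 labels
vocab_size, embed_dim, n_labels = 300_000, 50, 64
pretrained = np.random.rand(vocab_size, embed_dim).astype('float32')  # stand-in for fastText vectors

# frozen pre-trained embeddings (the "no fine-tuning" setting below)
embedding = keras.layers.Embedding(vocab_size, embed_dim, trainable=False)
embedding.build((None,))             # create the weight so we can set it
embedding.set_weights([pretrained])  # load the pre-trained vectors

model = keras.Sequential([
    embedding,
    # average the word vectors into a document vector, as fastText does
    keras.layers.GlobalAveragePooling1D(),
    # the "last layer": one independent sigmoid per topic (multi-label)
    keras.layers.Dense(n_labels, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')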

Also some results looking at different models trained under this paradigm:

For all models: learning rate 0.1; epochs 10; vocab size = 300K unless otherwise stated

== fastText model using 50-dimensional embeddings but no pre-trained embeddings; minCount == 10 (vocab size = 89k) ==
Precision: 0.847 micro; 0.829 macro
Recall: 0.677 micro; 0.601 macro
F1: 0.745 micro; 0.690 macro
PR-AUC: 0.967 micro; 0.959 macro
Avg pre.: 0.813 micro; 0.762 macro

== fastText model using pretrained 50-dimensional skipgram fastText embeddings (with further fine-tuning) ==
Precision: 0.842 micro; 0.824 macro
Recall: 0.731 micro; 0.678 macro
F1: 0.780 micro; 0.741 macro
PR-AUC: 0.974 micro; 0.972 macro
Avg pre.: 0.839 micro; 0.801 macro

== Keras model using 50-dimensional skipgram fastText embeddings with no fine-tuning ==
Precision: 0.738 micro; 0.690 macro
Recall: 0.326 micro; 0.239 macro
F1: 0.430 micro; 0.334 macro
Avg pre.: 0.615 micro; 0.517 macro

== Keras model using fastText-finetuned 50-dimensional skipgram fastText embeddings with no additional fine-tuning == 
Precision: 0.835 micro; 0.809 macro
Recall: 0.505 micro; 0.395 macro
F1: 0.609 micro; 0.511 macro
Avg pre.: 0.756 micro; 0.690 macro

Halfak added a subscriber: kevinbazira.
Halfak renamed this task from Implement native NN topic model in revscoring to Implement native NN model in revscoring. Feb 12 2020, 9:42 PM
Halfak removed a project: drafttopic-modeling.
Halfak updated the task description.
Halfak moved this task from Unsorted to Research on the Machine-Learning-Team board.

After going through mwtext, I realized that while ideally we would have a pipeline that does not depend on fasttext whatsoever (neither for training nor prediction), it's actually okay to depend on fasttext for training so long as the end models do not require fasttext (and all the headache of deploying it to worker nodes, fitting in with its strict requirements about input format, etc.). This is a much better situation because in the past I haven't been able to get Keras or other libraries to train nearly as nicely as fasttext does, so I would still like to be able to use fasttext for the training part of any model development.

I took a swing at taking a trained fasttext model and making predictions with it via numpy plus the extracted embeddings / weights matrices. The code below produces the expected output, so I think we're in a good place to "use" fasttext models directly for prediction in production (not just for training embeddings).

Note: I'm still using fasttext in the code below, but only to look up word embeddings, so that part could easily be replaced with gensim etc.:

import fasttext
import math
import numpy as np

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def standard_fasttext(m, input_str, threshold=0.5):
    # get all fasttext predictions above threshold for input data
    preds = [(l.replace('__label__', ''), '{0:.3f}'.format(s))
             for l, s in zip(*m.predict(input_str, k=-1)) if s >= threshold]
    print("Normal fasttext prediction: {0}".format(preds))

def manual_fasttext(m, input_str, threshold=0.5):
    # run fasttext model "manually"
    # 1) average the input word embeddings into a document vector;
    #    ignore OOV words and append the end-of-sentence token ('</s>')
    docvec = np.average([m.get_word_vector(w) for w in input_str.split() if w in m.words]
                        + [m.get_word_vector('</s>')], axis=0)

    # 2) dot product of the output weights matrix (64 x 50) with the
    #    document vector (50,) gives the raw output vector (64,)
    raw_out = np.dot(m.get_output_matrix(), docvec)

    # 3) convert these values to probabilities by passing each through a sigmoid function
    pro_out = [sigmoid(s) for s in raw_out]

    # 4) get the predicted labels, ranked by probability
    top_k = np.argsort(pro_out)[::-1]

    # 5) show results above threshold
    print("Manual fasttext prediction:",
          [(m.get_labels()[i].replace('__label__', ''), '{0:.3f}'.format(pro_out[i]))
           for i in top_k if pro_out[i] >= threshold])

# example below is Galapagos (video game): https://www.wikidata.org/wiki/Q16981582
# you'll note a small difference in output probabilities that presumably comes
# from rounding errors somewhere; in all my testing, while it will affect the
# ranking of high-confidence topics like below, I never saw the difference grow
# larger than ~0.001

>>> m = fasttext.load_model('model.bin')
>>> input_str = 'P577 P31 Q7889 P404 Q208850 P1933 P400 Q1406 P136 Q270948 P495 Q30 P123 Q173941'
>>> threshold = 0.02  # note: normally this would be 0.5 but I set it low for debugging purposes

>>> standard_fasttext(m, input_str, threshold=threshold)
Normal fasttext prediction: [('Culture.Media.Media*', '1.000'), ('Culture.Media.Video_games', '1.000'), ('Culture.Internet_culture', '1.000'), ('STEM.STEM*', '0.024'), ('STEM.Technology', '0.021'), ('Culture.Media.Software', '0.020')]

>>> manual_fasttext(m, input_str, threshold=threshold)
Manual fasttext prediction: [('Culture.Media.Video_games', '1.000'), ('Culture.Internet_culture', '1.000'), ('Culture.Media.Media*', '1.000'), ('STEM.STEM*', '0.025'), ('STEM.Technology', '0.021'), ('Culture.Media.Software', '0.021')]
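
Building on this, the fasttext dependency could be dropped entirely at prediction time by exporting the vocabulary, embeddings, and output matrix once and shipping only numpy arrays plus a small JSON file. A minimal sketch under that assumption (the file names are hypothetical; m is the loaded model from above):

import json
import numpy as np

# export once at training time (this part still needs fasttext)
words = m.get_words()
np.save('word_vecs.npy', np.array([m.get_word_vector(w) for w in words]))
np.save('output_matrix.npy', m.get_output_matrix())
np.save('eos_vec.npy', m.get_word_vector('</s>'))
with open('vocab.json', 'w') as f:
    json.dump({'words': words, 'labels': m.get_labels()}, f)

# load at prediction time (numpy only, no fasttext import required)
word_vecs = np.load('word_vecs.npy')
output_matrix = np.load('output_matrix.npy')
eos_vec = np.load('eos_vec.npy')
with open('vocab.json') as f:
    vocab = json.load(f)
word2idx = {w: i for i, w in enumerate(vocab['words'])}

def predict_numpy(input_str, threshold=0.5):
    # same five steps as manual_fasttext above, minus the fasttext library
    vecs = [word_vecs[word2idx[w]] for w in input_str.split() if w in word2idx]
    docvec = np.average(vecs + [eos_vec], axis=0)
    probs = 1 / (1 + np.exp(-np.dot(output_matrix, docvec)))
    order = np.argsort(probs)[::-1]
    return [(vocab['labels'][i].replace('__label__', ''), float(probs[i]))
            for i in order if probs[i] >= threshold]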
Isaac claimed this task.

I'm going to resolve this -- we did some exploration of what this would mean, and Dibya (an Outreachy intern) even implemented a fastText model within drafttopic (T254289), but found that while its training was significantly faster, its performance was only slightly better. Revscoring is currently largely on hold, though, so if this work is picked up, it will likely look somewhat different.