
Add wikidata to articletopic pipeline
Closed, Resolved · Public

Authored By: Isaac
Jun 2 2020, 9:46 PM

Description

The current ORES articletopic model pipeline trains topic classification models for Arabic, Czech, English, Korean, and Vietnamese: https://github.com/wikimedia/drafttopic/blob/master/Makefile

We would like to add Wikidata to this pipeline. The current drafttopic pipeline is:

  • pre-train word embeddings via fastText (this actually lives in the mwtext library)
  • download training data via APIs and merge with labels
  • freeze these embeddings, form an article embedding by averaging the embeddings of the words in an article, and then train a gradient-boosted classifier in sklearn over this article embedding (see the sketch below)
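
A minimal sketch of that third step (hypothetical variable names; the real pipeline is multi-label, training one binary classifier per topic via revscoring, so this single-label version only illustrates the averaging):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def article_embedding(tokens, word_vectors, dim=50):
    """Average the frozen word embeddings of the in-vocabulary tokens."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# `articles` (a list of token lists) and `labels` are hypothetical inputs.
X = np.stack([article_embedding(tokens, word_vectors) for tokens in articles])
clf = GradientBoostingClassifier(
    n_estimators=150, max_depth=5, max_features="log2", learning_rate=0.1
)
clf.fit(X, labels)
```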

We will definitely retain the first two parts (pre-trained embeddings + download data via APIs) as they help to fix the vocabulary size and are generally useful for other models as well. While shifting from APIs to dumps would be useful for scaling up training, it is not an essential change. The rest we will experiment with, namely:

  • Performance of the gradient-boosted classifier vs. a fastText model (we will implement fastText in revscoring and compare; a training sketch follows this list)
  • Impact of balanced vs. imbalanced data -- i.e. in Wikidata, biographies occur very frequently and Mathematics-related items much less so. In the existing pipeline, the data is artificially balanced so that there are close to the same number of Biography and Mathematics articles. The model statistics are then adjusted to account for this balancing. We will test how the model performs when this balancing is not done -- i.e. the model is trained on the original distribution of topics.
  • Impact of adding more training data -- currently, the model is trained on ~64,000 data points (at least 1000 data points per topic). fastText trains more quickly and so should allow us to increase the training data without substantially impacting training time. We have almost 6M labeled data points, so there is a lot of opportunity to grow the training set if doing so has a substantial positive impact on model performance.
  • Other standard hyperparameter tuning (e.g., learning rate, embeddings dimensionality, vocab size)
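
For the fastText side of this comparison, training would look roughly like the sketch below (using the fasttext Python bindings; the input file name and label format are assumptions, while the parameter values match the runs reported later):

```python
import fasttext

# train.txt: one article per line, its tokens plus __label__<topic> markers
# (fastText's supervised input format).
model = fasttext.train_supervised(
    input="train.txt",
    loss="ova",    # one-vs-all: an independent binary loss per topic label
    epoch=25,
    dim=50,        # must match the pre-trained embedding dimension
    lr=0.1,
    pretrainedVectors="word2vec/wikidata-20200501-learned_vectors.50_cell.vec",
)
# k=-1 returns all labels; threshold keeps those above the given probability.
print(model.predict("some tokenized article text", k=-1, threshold=0.5))
```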

Out of scope:

  • there are two branches of the ORES pipeline: drafttopic (predict topics for first drafts of an article) and articletopic (predict topics for current versions of an article). For Wikidata, we are focusing just on the articletopic facet. Future work could expand this to also include a drafttopic facet, though maintaining separate models might be unnecessary for Wikidata.

Event Timeline


Subsequent comments will have the performance reports for the different Wikidata models. The goal is to see how performance changes as we vary factors like the classifier (fastText vs. GradientBoosting), vocab size, and training data size.


Run-1

Classifier: GradientBoosting
Parameters: n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1
Number of Samples: 63944, balanced samples
Vocab Size: 10000
Embeddings dimension: 50

Overall performance

recall(micro=0.719, macro=0.621)
precision(micro=0.7, macro=0.554)
f1(micro=0.703, macro=0.571)
accuracy(micro=0.978, macro=0.99)
roc_auc(micro=0.956, macro=0.951)
pr_auc(micro=0.721, macro=0.543)

Run-2

Classifier: GradientBoosting
Parameters: n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1
Number of Samples: 63944, balanced samples
Vocab Size: 50000
Embeddings dimension: 50

Overall performance

recall(micro=0.732, macro=0.644)
precision(micro=0.724, macro=0.574)
f1(micro=0.724, macro=0.593)
accuracy(micro=0.981, macro=0.991)
roc_auc(micro=0.963, macro=0.961)
pr_auc(micro=0.746, macro=0.576)

Note
Except for the vocab size (increased from 10k to 50k), this model was trained under the same conditions as Run-1. Overall performance went up by 2-3%. Increasing the vocabulary size further didn't yield much additional improvement.

Run-3

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.vec
Number of Samples: 63944, balanced samples
Pretrained vectors vocab size: 10000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.735, macro=0.628)
precision(micro=0.672, macro=0.545)
f1(micro=0.696, macro=0.573)
accuracy(micro=0.977, macro=0.989)
roc_auc(micro=0.957, macro=0.95)
pr_auc(micro=0.722, macro=0.56)

Run-4

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec minCount=1000000
Number of Samples: 255785, balanced samples (at least 4000 per label)
Pretrained vectors vocab size: 50000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.755, macro=0.661)
precision(micro=0.688, macro=0.556)
f1(micro=0.714, macro=0.59)
accuracy(micro=0.978, macro=0.99)
roc_auc(micro=0.963, macro=0.957)
pr_auc(micro=0.751, macro=0.597)

This uses a much larger training set than the previous runs. Consequently, there are some improvements in performance, e.g. better recall. Performance wasn't as good with the 10k-vocab pre-trained vectors.
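
As a side note, drawing a balanced sample with a per-label floor like this run's "at least 4000 per label" could look roughly like the following (a hypothetical sketch, not the actual drafttopic sampling code; with multi-label items, an article can be drawn for several of its topics):

```python
import random
from collections import defaultdict

def balanced_sample(observations, per_label=4000):
    """Draw up to `per_label` items for each topic label."""
    by_label = defaultdict(list)
    for obs in observations:
        for label in obs["labels"]:  # each item can carry several labels
            by_label[label].append(obs)
    sample = []
    for items in by_label.values():
        random.shuffle(items)
        sample.extend(items[:per_label])
    return sample
```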

Run-5

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.vec minCount=1000000
Number of Samples: 63961, imbalanced samples
Pretrained vectors vocab size: 10000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.752, macro=0.625)
precision(micro=0.87, macro=0.805)
f1(micro=0.803, macro=0.699)
accuracy(micro=0.963, macro=0.982)
roc_auc(micro=0.964, macro=0.958)
pr_auc(micro=0.849, macro=0.714)

Run-6

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec minCount=1000000
Number of Samples: 63961, imbalanced samples
Pretrained vectors vocab size: 50000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.774, macro=0.651)
precision(micro=0.874, macro=0.82)
f1(micro=0.819, macro=0.721)
accuracy(micro=0.965, macro=0.983)
roc_auc(micro=0.97, macro=0.964)
pr_auc(micro=0.867, macro=0.746)

There was a nice improvement in performance compared to Run-5 just from using the pre-trained vectors with the larger vocab size.

Run-7

Classifier: GradientBoosting
Parameters: n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1
Number of Samples: 63961, imbalanced samples
Vocab Size: 50000
Embeddings dimension: 50

Overall performance

recall(micro=0.75, macro=0.591)
precision(micro=0.887, macro=0.79)
f1(micro=0.809, macro=0.67)
accuracy(micro=0.966, macro=0.983)
roc_auc(micro=0.967, macro=0.947)
pr_auc(micro=0.858, macro=0.68)

This is great -- what I'm seeing here @Dibyaaaaax is that the GBC model mostly performs very similarly to the fastText model when given the same data, but its recall does suffer for low-data topics. We'll have to discuss whether fastText's slightly higher performance warrants the complexity of permanently adding a new fastText class to revscoring and making sure it would work in production. I'll mention some other things we discussed, but jump in if you have more concrete data:

  • GBC models train in >2 hours whereas fastText trains in ~2 minutes. Makes me wonder whether the HistGradientBoostingClassifier would provide the same performance as GBC (and be super easy to implement) but train much more quickly (see the first sketch after this list).
  • Even though you've got fastText set up for training, I'm not certain what it would look like in production if we decided the performance was worth it. It fine-tunes the word embeddings it's provided, so it produces a second set of embeddings that are slightly different from the ones trained via mwtext. We could maybe just dump those fine-tuned embeddings to a file and reproduce fastText's inference with numpy, as in T242013#6155316 (see the second sketch after this list).
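
On the first point, a HistGradientBoostingClassifier swap could be as small as the sketch below (in scikit-learn versions around 0.23 the estimator is still experimental, hence the extra import; `X` and `labels` are the hypothetical inputs from the sketch in the task description):

```python
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier

# Roughly analogous settings to the GBC runs above; histogram binning of
# the features is what makes this variant much faster to train.
clf = HistGradientBoostingClassifier(max_iter=150, max_depth=5, learning_rate=0.1)
clf.fit(X, labels)
```

On the second point, reproducing fastText's inference in numpy might look something like this (a sketch assuming the fine-tuned input vectors and the output-layer matrix have been dumped from the trained model):

```python
import numpy as np

def predict_topics(tokens, input_vectors, output_matrix, topic_labels, threshold=0.5):
    """Mimic fastText supervised inference: average the input vectors of the
    known words, apply the linear output layer, and (for loss=ova) give each
    label an independent sigmoid probability."""
    vecs = [input_vectors[t] for t in tokens if t in input_vectors]
    if not vecs:
        return []
    hidden = np.mean(vecs, axis=0)
    scores = output_matrix @ hidden          # (n_labels, dim) x (dim,)
    probs = 1.0 / (1.0 + np.exp(-scores))
    return [(t, p) for t, p in zip(topic_labels, probs) if p >= threshold]
```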

I'm excited to hear how the unbalanced vs. balanced models compare on the exact same test set without any corrections to their performance!

I tested some of the above-mentioned models on a test dataset (~150k items) that is completely disjoint from the data the models were trained on. All of the models performed more or less the same on this new dataset. Since none of them performed significantly better than the others, and all other models in drafttopic are trained on a balanced dataset using GBC, we have decided to stick with the same setup for Wikidata too.


n: 149873
Vocab: 10k

Classifier | Trained on | recall | precision | f1 | accuracy | roc_auc | pr_auc
fastText | 63961, unbalanced dataset | (micro=0.794, macro=0.655) | (micro=0.813, macro=0.737) | (micro=0.801, macro=0.688) | (micro=0.966, macro=0.985) | (micro=0.969, macro=0.959) | (micro=0.84, macro=0.69)
fastText | 63944, balanced dataset | (micro=0.791, macro=0.69) | (micro=0.8, macro=0.681) | (micro=0.792, macro=0.675) | (micro=0.965, macro=0.984) | (micro=0.967, macro=0.961) | (micro=0.833, macro=0.686)
Gradient Boosting | 63961, unbalanced dataset | (micro=0.775, macro=0.614) | (micro=0.83, macro=0.725) | (micro=0.798, macro=0.66) | (micro=0.968, macro=0.985) | (micro=0.966, macro=0.951) | (micro=0.83, macro=0.642)
Gradient Boosting | 63944, balanced dataset | (micro=0.789, macro=0.674) | (micro=0.805, macro=0.7) | (micro=0.792, macro=0.679) | (micro=0.964, macro=0.984) | (micro=0.966, macro=0.962) | (micro=0.828, macro=0.664)
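
For anyone reproducing the micro/macro numbers above, they can be computed with scikit-learn along these lines (`y_true` and `y_score` are hypothetical (n_samples, n_labels) arrays; the reports themselves come from revscoring's model statistics):

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_pred = (y_score >= 0.5).astype(int)
for avg in ("micro", "macro"):
    # micro pools all label decisions; macro averages per-label scores,
    # which is why rare topics drag the macro numbers down.
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    auc = roc_auc_score(y_true, y_score, average=avg)
    print(f"{avg}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
```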