
Add wikidata to articletopic pipeline
Closed, Resolved · Public

Authored By: Isaac
Jun 2 2020, 9:46 PM

Description

The current ORES articletopic model pipeline trains topic classification models for Arabic, Czech, English, Korean, and Vietnamese: https://github.com/wikimedia/drafttopic/blob/master/Makefile

We would like to add Wikidata to this pipeline. The current drafttopic pipeline is:

  • pre-train word embeddings via fastText (this actually lives in the mwtext library)
  • download training data via APIs and merge with labels
  • freeze these embeddings, form an article embedding by averaging the embeddings of the words in an article, and then train a gradient-boosted classifier in sklearn over this article embedding (see the sketch below)
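
A minimal sketch of that third step (hypothetical variable names; the real pipeline is multi-label, training one binary classifier per topic via revscoring, so this single-label version only illustrates the averaging):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def article_embedding(tokens, word_vectors, dim=50):
    """Average the frozen word embeddings of the in-vocabulary tokens."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# `articles` (a list of token lists) and `labels` are hypothetical inputs.
X = np.stack([article_embedding(tokens, word_vectors) for tokens in articles])
clf = GradientBoostingClassifier(
    n_estimators=150, max_depth=5, max_features="log2", learning_rate=0.1
)
clf.fit(X, labels)
```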

We will definitely retain the first two parts (pre-trained embeddings + download data via APIs) as they help to fix the vocabulary size and are generally useful for other models as well. While shifting from APIs to dumps would be useful for scaling up training, it is not an essential change. The rest we will experiment with, namely:

  • Performance of the gradient-boosted classifier vs. a fastText model (we will implement fastText in revscoring and compare; a training sketch follows this list)
  • Impact of balanced vs. imbalanced data -- i.e. in Wikidata, biographies occur very frequently and Mathematics-related items much less so. In the existing pipeline, the data is artificially balanced so that there are close to the same number of Biography and Mathematics articles. The model statistics are then adjusted to account for this balancing. We will test how the model performs when this balancing is not done -- i.e. the model is trained on the original distribution of topics.
  • Impact of adding more training data -- currently, the model is trained on ~64,000 data points (at least 1000 data points per topic). fastText trains more quickly and so should allow us to increase the training data without substantially impacting training time. We have almost 6M labeled data points, so there is a lot of opportunity to grow the training set if doing so has a substantial positive impact on model performance.
  • Other standard hyperparameter tuning (e.g., learning rate, embeddings dimensionality, vocab size)
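
For the fastText side of this comparison, training would look roughly like the sketch below (using the fasttext Python bindings; the input file name and label format are assumptions, while the parameter values match the runs reported later):

```python
import fasttext

# train.txt: one article per line, its tokens plus __label__<topic> markers
# (fastText's supervised input format).
model = fasttext.train_supervised(
    input="train.txt",
    loss="ova",    # one-vs-all: an independent binary loss per topic label
    epoch=25,
    dim=50,        # must match the pre-trained embedding dimension
    lr=0.1,
    pretrainedVectors="word2vec/wikidata-20200501-learned_vectors.50_cell.vec",
)
# k=-1 returns all labels; threshold keeps those above the given probability.
print(model.predict("some tokenized article text", k=-1, threshold=0.5))
```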

Out of scope:

  • there are two branches of the ORES pipeline: drafttopic (predict topics for first drafts of an article) and articletopic (predict topics for current versions of an article). For Wikidata, we are focusing just on the articletopic facet. Future work could expand this to also include a drafttopic facet, though maintaining separate models might be unnecessary for Wikidata.

Event Timeline


Subsequent comments will have the performance reports for the different Wikidata models. The goal is to see how performance changes as we vary factors like the classifier (fastText vs. GradientBoosting), vocab size, and training data size.


Run-1

Classifier: GradientBoosting
Parameters: n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1
Number of Samples: 63944, balanced samples
Vocab Size: 10000
Embeddings dimension: 50

Overall performance

recall(micro=0.719, macro=0.621)
precision(micro=0.7, macro=0.554)
f1(micro=0.703, macro=0.571)
accuracy(micro=0.978, macro=0.99)
roc_auc(micro=0.956, macro=0.951)
pr_auc(micro=0.721, macro=0.543)

Run-2

Classifier: GradientBoosting
Parameters: n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1
Number of Samples: 63944, balanced samples
Vocab Size: 50000
Embeddings dimension: 50

Overall performance

recall(micro=0.732, macro=0.644)
precision(micro=0.724, macro=0.574)
f1(micro=0.724, macro=0.593)
accuracy(micro=0.981, macro=0.991)
roc_auc(micro=0.963, macro=0.961)
pr_auc(micro=0.746, macro=0.576)

Note
Except for the vocab size (increased from 10k to 50k), this model was trained under the same conditions as Run-1. Overall performance went up by 2-3%. Increasing the vocabulary size further didn't yield much additional improvement.

Run-3

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.vec
Number of Samples: 63944, balanced samples
Pretrained vectors vocab size: 10000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.735, macro=0.628)
precision(micro=0.672, macro=0.545)
f1(micro=0.696, macro=0.573)
accuracy(micro=0.977, macro=0.989)
roc_auc(micro=0.957, macro=0.95)
pr_auc(micro=0.722, macro=0.56)

Run-4

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec minCount=1000000
Number of Samples: 255785, balanced samples (at least 4000 per label)
Pretrained vectors vocab size: 50000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.755, macro=0.661)
precision(micro=0.688, macro=0.556)
f1(micro=0.714, macro=0.59)
accuracy(micro=0.978, macro=0.99)
roc_auc(micro=0.963, macro=0.957)
pr_auc(micro=0.751, macro=0.597)

This uses a much larger training set than the previous runs. Consequently, there are some improvements in performance, e.g. better recall. Performance wasn't as good with the 10k-vocab pre-trained vectors.
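
As a side note, drawing a balanced sample with a per-label floor like this run's "at least 4000 per label" could look roughly like the following (a hypothetical sketch, not the actual drafttopic sampling code; with multi-label items, an article can be drawn for several of its topics):

```python
import random
from collections import defaultdict

def balanced_sample(observations, per_label=4000):
    """Draw up to `per_label` items for each topic label."""
    by_label = defaultdict(list)
    for obs in observations:
        for label in obs["labels"]:  # each item can carry several labels
            by_label[label].append(obs)
    sample = []
    for items in by_label.values():
        random.shuffle(items)
        sample.extend(items[:per_label])
    return sample
```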

Run-5

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.vec minCount=1000000
Number of Samples: 63961, imbalanced samples
Pretrained vectors vocab size: 10000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.752, macro=0.625)
precision(micro=0.87, macro=0.805)
f1(micro=0.803, macro=0.699)
accuracy(micro=0.963, macro=0.982)
roc_auc(micro=0.964, macro=0.958)
pr_auc(micro=0.849, macro=0.714)

Run-6

Classifier: fastText
Parameters: loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec minCount=1000000
Number of Samples: 63961, imbalanced samples
Pretrained vectors vocab size: 50000
Pretrained vectors embeddings dimension: 50

Overall performance

recall(micro=0.774, macro=0.651)
precision(micro=0.874, macro=0.82)
f1(micro=0.819, macro=0.721)
accuracy(micro=0.965, macro=0.983)
roc_auc(micro=0.97, macro=0.964)
pr_auc(micro=0.867, macro=0.746)

There was a nice improvement in performance compared to Run-5 just from using the pre-trained vectors with the larger vocab size.

Run-7

Classifier: GradientBoosting
Parameters: n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1
Number of Samples: 63961, imbalanced samples
Vocab Size: 50000
Embeddings dimension: 50

Overall performance

recall(micro=0.75, macro=0.591)
precision(micro=0.887, macro=0.79)
f1(micro=0.809, macro=0.67)
accuracy(micro=0.966, macro=0.983)
roc_auc(micro=0.967, macro=0.947)
pr_auc(micro=0.858, macro=0.68)

This is great -- what I'm seeing here @Dibyaaaaax is that the GBC model mostly performs very similarly to the fastText model when given the same data, but its recall does suffer for low-data topics. We'll have to discuss whether fastText's slightly higher performance warrants the complexity of permanently adding a new fastText class to revscoring and making sure it would work in production. I'll mention some other things we discussed, but jump in if you have more concrete data:

  • GBC models train in >2 hours whereas fastText trains in ~2 minutes. Makes me wonder whether the HistGradientBoostingClassifier would provide the same performance as GBC (and be super easy to implement) but train much more quickly (see the first sketch after this list).
  • Even though you've got fastText set up for training, I'm not certain what it would look like in production if we decided the performance was worth it. It fine-tunes the word embeddings it's provided, so it produces a second set of embeddings that are slightly different from the ones trained via mwtext. We could maybe just dump those fine-tuned embeddings to a file and reproduce fastText's inference with numpy, as in T242013#6155316 (see the second sketch after this list).
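
On the first point, a HistGradientBoostingClassifier swap could be as small as the sketch below (in scikit-learn versions around 0.23 the estimator is still experimental, hence the extra import; `X` and `labels` are the hypothetical inputs from the sketch in the task description):

```python
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier

# Roughly analogous settings to the GBC runs above; histogram binning of
# the features is what makes this variant much faster to train.
clf = HistGradientBoostingClassifier(max_iter=150, max_depth=5, learning_rate=0.1)
clf.fit(X, labels)
```

On the second point, reproducing fastText's inference in numpy might look something like this (a sketch assuming the fine-tuned input vectors and the output-layer matrix have been dumped from the trained model):

```python
import numpy as np

def predict_topics(tokens, input_vectors, output_matrix, topic_labels, threshold=0.5):
    """Mimic fastText supervised inference: average the input vectors of the
    known words, apply the linear output layer, and (for loss=ova) give each
    label an independent sigmoid probability."""
    vecs = [input_vectors[t] for t in tokens if t in input_vectors]
    if not vecs:
        return []
    hidden = np.mean(vecs, axis=0)
    scores = output_matrix @ hidden          # (n_labels, dim) x (dim,)
    probs = 1.0 / (1.0 + np.exp(-scores))
    return [(t, p) for t, p in zip(topic_labels, probs) if p >= threshold]
```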

I'm excited to hear how the unbalanced vs. balanced models compare on the exact same test set without any corrections to their performance!

I tested some of the above-mentioned models on a test dataset (~150k items) that is completely disjoint from the data the models were trained on. All of the models performed more or less the same on this new dataset. Since none of them performed significantly better than the others, and all other models in drafttopic are trained on a balanced dataset using GBC, we have decided to stick with the same setup for Wikidata too.


n: 149873
Vocab: 10k

Classifier | Trained on | recall | precision | f1 | accuracy | roc_auc | pr_auc
fastText | 63961, unbalanced dataset | (micro=0.794, macro=0.655) | (micro=0.813, macro=0.737) | (micro=0.801, macro=0.688) | (micro=0.966, macro=0.985) | (micro=0.969, macro=0.959) | (micro=0.84, macro=0.69)
fastText | 63944, balanced dataset | (micro=0.791, macro=0.69) | (micro=0.8, macro=0.681) | (micro=0.792, macro=0.675) | (micro=0.965, macro=0.984) | (micro=0.967, macro=0.961) | (micro=0.833, macro=0.686)
Gradient Boosting | 63961, unbalanced dataset | (micro=0.775, macro=0.614) | (micro=0.83, macro=0.725) | (micro=0.798, macro=0.66) | (micro=0.968, macro=0.985) | (micro=0.966, macro=0.951) | (micro=0.83, macro=0.642)
Gradient Boosting | 63944, balanced dataset | (micro=0.789, macro=0.674) | (micro=0.805, macro=0.7) | (micro=0.792, macro=0.679) | (micro=0.964, macro=0.984) | (micro=0.966, macro=0.962) | (micro=0.828, macro=0.664)
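
For anyone reproducing the micro/macro numbers above, they can be computed with scikit-learn along these lines (`y_true` and `y_score` are hypothetical (n_samples, n_labels) arrays; the reports themselves come from revscoring's model statistics):

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_pred = (y_score >= 0.5).astype(int)
for avg in ("micro", "macro"):
    # micro pools all label decisions; macro averages per-label scores,
    # which is why rare topics drag the macro numbers down.
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    auc = roc_auc_score(y_true, y_score, average=avg)
    print(f"{avg}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
```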