The current ORES articletopic model pipeline trains topic classification models for Arabic, Czech, English, Korean, and Vietnamese: https://github.com/wikimedia/drafttopic/blob/master/Makefile
We would like to add Wikidata to this pipeline. The current drafttopic pipeline is:
- pre-train word embeddings via fastText (this actually happens in the mwtext library)
- download training data via APIs and merge with labels
- freeze these embeddings, form an article embedding by averaging the embeddings of the words in an article, and train a gradient-boosted classifier in sklearn over this article embedding
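The third step above can be sketched roughly as follows. This is an illustrative toy, not the pipeline's actual code: the word vectors, topic labels, and helper names are all made up, and the real pipeline loads frozen fastText vectors trained by mwtext rather than a hand-written dict.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical frozen word embeddings (in the real pipeline these come
# from fastText vectors pre-trained via mwtext); 4-dimensional for brevity.
word_vectors = {
    "physics": np.array([0.9, 0.1, 0.0, 0.2]),
    "equation": np.array([0.8, 0.2, 0.1, 0.1]),
    "singer": np.array([0.1, 0.9, 0.8, 0.0]),
    "album": np.array([0.0, 0.8, 0.9, 0.1]),
}

def article_embedding(words, vectors):
    """Average the frozen word vectors for the words in an article."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No in-vocabulary words: fall back to a zero vector.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)

# Toy labeled set: (article tokens, topic label).
articles = [
    (["physics", "equation"], "STEM"),
    (["equation", "physics", "physics"], "STEM"),
    (["singer", "album"], "Culture"),
    (["album", "singer", "album"], "Culture"),
]

X = np.array([article_embedding(tokens, word_vectors) for tokens, _ in articles])
y = [label for _, label in articles]

# Gradient-boosted classifier trained over the averaged article embeddings.
clf = GradientBoostingClassifier(n_estimators=10).fit(X, y)
```

Note that the embeddings stay frozen: only the classifier on top is trained, which is part of what the fastText-model experiment below would change.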
We will definitely retain the first two parts (pre-trained embeddings + downloading data via APIs) as they help to fix the vocabulary size and are generally useful for other models as well. While shifting from APIs to dumps would be useful for scaling up training, it is not an essential change. The rest we will experiment with, namely:
- Performance of gradient-boosted classifier vs. a fastText model (we will implement fastText support in revscoring and compare)
- Impact of balanced vs. imbalanced data -- e.g., in Wikidata, biographies occur very frequently while Mathematics-related items are much rarer. In the existing pipeline, the data is artificially balanced so that there are close to the same number of Biography and Mathematics articles, and the model statistics are then adjusted to account for this balancing. We will test how the model performs without this balancing -- i.e., when the model is trained on the original distribution of topics.
- Impact of adding more training data -- currently, the model is trained on ~64,000 data points (at least 1000 data points per topic). fastText trains more quickly, so it should allow us to increase the amount of training data without substantially increasing training time. We have almost 6M labeled data points, so there is a lot of room to grow the training set if doing so has a substantial positive impact on model performance.
- Other standard hyperparameter tuning (e.g., learning rate, embeddings dimensionality, vocab size)
Out of scope:
- there are two branches of the ORES pipeline: drafttopic (predict topics for first drafts of an article) and articletopic (predict topics for current versions of an article). For Wikidata, we are focusing just on the articletopic facet. Future work could expand this to include a drafttopic facet as well, though maintaining separate models might be unnecessary for Wikidata.