The current ORES articletopic model pipeline trains topic classification models for Arabic, Czech, English, Korean, and Vietnamese: [[https://github.com/wikimedia/drafttopic/blob/master/Makefile]]
We would like to add Wikidata to this pipeline. The current drafttopic pipeline is:
* pre-train word embeddings via fastText (actually in the mwtext library)
* download training data via APIs and merge with labels
* freeze these embeddings, form article embeddings by averaging the embeddings of the words in an article, and then train a gradient-boosted classifier in sklearn over this article embedding
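The frozen-embedding-plus-classifier step above can be sketched roughly as follows. This is a toy illustration, not the actual drafttopic code: the vocabulary, vectors, and topic labels are all made up, and the real pipeline loads pre-trained fastText vectors rather than random ones.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for frozen pre-trained word embeddings (hypothetical tiny
# 4-d vocabulary; the real vectors come from fastText via mwtext).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in
         ["wikipedia", "article", "math", "proof", "singer", "album"]}

def article_embedding(tokens):
    """Average the frozen word vectors for the words in an article."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

# Toy labeled articles (labels are illustrative, not the real taxonomy).
articles = [(["math", "proof", "article"], "STEM"),
            (["singer", "album", "article"], "Culture")] * 10

X = np.stack([article_embedding(toks) for toks, _ in articles])
y = [label for _, label in articles]

# Gradient-boosted classifier trained over the article embeddings.
clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict([article_embedding(["math", "proof", "article"])])[0])
```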
Unknowns:
We will definitely retain the first two parts (pre-trained embeddings + downloading data via APIs) as they help to fix the vocabulary size and are generally useful for other models as well. While shifting from APIs to dumps would be useful for scaling up training, it is not an essential change. The rest we will experiment with, namely:
* Performance of a gradient-boosted classifier vs. a fastText model (we will implement fastText in revscoring and compare). The current articletopic models follow a two-step process: 1) pre-train word embeddings via fastText, 2) freeze these embeddings, form article embeddings by averaging the embeddings of the words in an article, and train a gradient-boosted classifier in sklearn over this article embedding. The current [[https://github.com/geohci/wikidata-topic-model|Wikidata model]] follows a single-step process: train a fastText supervised model end-to-end (a similar average-embedding architecture, but with a simple fully-connected linear classifier on top of the article embeddings). The word embeddings and model weights can be extracted from the fastText model such that predictions can be made without depending on fastText (see T242013#6155316), but it remains to be seen whether this approach is compatible with the ORES architecture, which is built around the sklearn-based gradient-boosted classifiers used by the other models.
* Impact of balanced vs. imbalanced data -- i.e., in Wikidata, biographies occur very frequently and Mathematics-related items much less so. In the existing pipeline, the data is artificially balanced so that there are close to the same number of Biography and Mathematics articles, and the model statistics are then adjusted to take this balancing into account. We will test how the model performs when this balancing is not done -- i.e., the model is trained on the original distribution of topics.
* Impact of adding more training data -- currently, the model is trained on ~64,000 data points (at least 1,000 data points per topic). fastText trains more quickly and so should allow us to increase the training data without substantially impacting the time it takes to train the model. We have almost 6M labeled data points, so there is a lot of opportunity to grow the training data set if doing so has a substantial positive impact on model performance.
* APIs vs. dumps -- current training strategies have focused on using the Wikidata dumps for training, but it looks like the ORES architecture uses the APIs. It is not hard to make the switch (see [[https://github.com/geohci/wikidata-topic-model/blob/master/app/app.py#L134|this code]] used by the experimental API), but it is something we should consider.
* Other standard hyperparameter tuning (e.g., learning rate, embedding dimensionality, vocab size)
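On the fastText-without-fastText question: the official fastText Python bindings expose `get_input_matrix()` and `get_output_matrix()`, so once those matrices are extracted, prediction reduces to averaging word vectors and applying a linear layer plus softmax. The sketch below illustrates that math with hypothetical random matrices and made-up label names (a real extraction would also need to handle subword n-grams and fastText's end-of-sentence token, which this simplification ignores).

```python
import numpy as np

# Hypothetical extracted matrices; in practice these would come from a
# trained fastText supervised model via model.get_input_matrix() and
# model.get_output_matrix().
dim, n_labels = 4, 3
rng = np.random.default_rng(1)
input_matrix = rng.normal(size=(6, dim))     # one row per vocabulary word
output_matrix = rng.normal(size=(n_labels, dim))
word_to_id = {w: i for i, w in enumerate(
    ["wikidata", "item", "human", "taxon", "film", "award"])}
labels = ["Biography", "STEM", "Culture"]    # illustrative label names

def predict(tokens):
    """Replicate fastText supervised prediction without fastText:
    average the word vectors, then apply a linear layer + softmax.
    (Simplified: ignores subword n-grams and the EOS token.)"""
    ids = [word_to_id[t] for t in tokens if t in word_to_id]
    emb = input_matrix[ids].mean(axis=0)
    scores = output_matrix @ emb
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return dict(zip(labels, probs))

probs = predict(["human", "award"])
```

Whether this dependency-free path fits the revscoring/ORES model interface is exactly the open question noted above.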
Out of scope:
* There are two branches of the ORES pipeline: drafttopic (predict topics for first drafts of an article) and articletopic (predict topics for current versions of an article). For Wikidata, we are initially focusing just on the articletopic facet. Future work could expand this to also include a drafttopic facet, though maintaining separate models might be unnecessary for Wikidata.