
Replace {maptpx} for Topic Modeling in WDCM
Closed, Resolved · Public

Description

  • The {maptpx} LDA implementation used for topic modeling in WDCM needs to be replaced.
  • Main reason: the {maptpx} R package has not been actively maintained for some time.
  • The current alternative under consideration: the MALLET LDA implementation in the {SpeedReader} R package.

Also:

  • our results indicate that neither perplexity-based nor Bayes-factor (BF) based model selection yields human-interpretable topics of Wikidata items (some attempts to fix this are tracked in T203238);
  • coherence measures will be introduced into model selection; since we are changing the algorithm anyway, now is the right time to implement this step too.

If this fails for any reason, we will either stick with {maptpx} and introduce coherence measures there, or migrate to Spark's MLlib implementation of LDA.

Event Timeline

GoranSMilovanovic created this task.
  • This is going to be a general WDCM task, not something implemented specifically for WDCM Wikipedia Semantics (a.k.a. WDCM Sitelinks).
  • Removing parent task.

Following thorough experiments with Python's Gensim and Apache Spark, both of which use the same online LDA estimation procedure, these are the conclusions:

  • we will replace {maptpx}, but not with Gensim's or Spark's LDA routines;
  • instead, we will go with the R {text2vec} package and use its WarpLDA implementation (a minimal sketch of the new pipeline follows below).
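
For reference, a minimal sketch of what the new pipeline looks like, using the movie_review data shipped with {text2vec} as a stand-in corpus; the priors, topic number, and iteration settings here are illustrative, not the WDCM production values:

```
library(text2vec)
data("movie_review")

# Tokenize and build a document-term matrix
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it),
                          term_count_min = 10,
                          doc_proportion_max = 0.2)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Fit a WarpLDA topic model; hyperparameters are illustrative
lda_model <- LDA$new(n_topics = 50,
                     doc_topic_prior = 0.1,
                     topic_word_prior = 0.01)
doc_topic <- lda_model$fit_transform(dtm,
                                     n_iter = 1000,
                                     convergence_tol = 0.001,
                                     n_check_convergence = 25)

# Inspect the top terms per topic
lda_model$get_top_words(n = 10)
```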

Rationale:

  • for some time now, users of Gensim have noticed strange behavior of its perplexity measure:
  • namely, the perplexity of a hold-out sample in cross-validation increases with the number of topics in the model (see the sketch after this list),
  • which is quite uncommon, not to say impossible;
  • I have checked all the code and the math in Gensim, and there does not seem to be a bug in their computation of perplexity;
  • furthermore, I was able to demonstrate empirically the following interesting fact: the phenomenon described above occurs only in the context of cross-validation, and never when the corpus is fitted as a whole;
  • this implies that there is probably something about the math of online LDA estimation that we do not fully understand;
  • my working hypothesis is that the algorithm is biased toward finding optimal solutions with a smaller number of topics as the size of the corpus decreases, which might be empirically plausible but is certainly not necessarily true;
  • I suspect the size of the hold-out sample in cross-validation is thus what generates the problem.
  • The authors of Gensim now recommend using coherence measures in place of perplexity;
  • we already use coherence-based model selection in LDA to support our WDCM Sitelinks and WDCM Titles dashboards;
  • however, I am not ready to go with this here: we want to work with a routine that exactly reproduces the known and expected behavior of a topic model.
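
To make the setup under discussion concrete, here is a minimal sketch of the hold-out perplexity computation in question, written against {text2vec} (the anomaly itself was observed in Gensim); the split ratio, topic grid, and iteration counts are illustrative:

```
library(text2vec)
data("movie_review")
set.seed(42)

it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Split the documents into a training set and a hold-out set
train_idx <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))
dtm_train <- dtm[train_idx, ]
dtm_test  <- dtm[-train_idx, ]

# If the anomaly described above held here, the hold-out
# perplexity would grow as k grows
for (k in c(10, 50, 100)) {
  lda <- LDA$new(n_topics = k)
  lda$fit_transform(dtm_train, n_iter = 500, progressbar = FALSE)
  # Infer doc-topic distributions for the hold-out documents only
  doc_topic_test <- lda$transform(dtm_test)
  p <- perplexity(dtm_test,
                  topic_word_distribution = lda$topic_word_distribution,
                  doc_topic_distribution  = doc_topic_test)
  cat(sprintf("k = %3d  hold-out perplexity = %.1f\n", k, p))
}
```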

{text2vec} seems to be the way to go:

  • this R package is actively maintained, unlike {maptpx} (which I have liked very much), which we seek to replace;
  • its WarpLDA procedure scales easily in RAM and provides rapid estimation of topic models:
  • I have tested corpora with tens of thousands of documents and tens of thousands of features, across 10 to 300 topics,
  • and the package was tested on the reference Wikipedia corpus;
  • it currently runs a serial implementation, but given its performance we can easily fit many models in parallel (see the sketch after this list).
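
A sketch of what the parallel fitting could look like, using base R's {parallel}; the topic grid and core count are illustrative choices:

```
library(text2vec)
library(parallel)
data("movie_review")

it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Each WarpLDA fit is serial, so fit one model per core across a
# grid of topic numbers (mclapply forks; use mc.cores = 1 on Windows)
k_grid <- seq(10, 300, by = 10)
results <- mclapply(k_grid, function(k) {
  lda <- LDA$new(n_topics = k)
  doc_topic <- lda$fit_transform(dtm, n_iter = 500, progressbar = FALSE)
  # Return plain R objects (not the model, whose C++ pointer
  # would not survive the fork boundary)
  list(k = k,
       doc_topic = doc_topic,
       top_words = lda$get_top_words(n = 10, lambda = 1))
}, mc.cores = max(1, detectCores() - 1))
```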

Next steps:

  • install {text2vec} (see below);
  • change the WDCM ML procedures;
  • implement and deploy.
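
The first step is a standard CRAN install:

```
install.packages("text2vec")
```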

@RazShuty

  • All changes implemented: {text2vec} procedures now replace {maptpx};
  • deploying now; the test run was successful, so the dashboards are already updated from the new ML back-end;
  • first run in production scheduled for tomorrow, June 7, 2019;
  • observed behavior: too many topics in the optimal model under perplexity-based model selection;
  • action: we will switch the WDCM model selection procedures from perplexity-based to coherence-based (as in WDCM Sitelinks and Titles); a sketch follows below.
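
A hedged sketch of what the coherence-based selection can look like in {text2vec}; the mean_logratio metric, the window-based term co-occurrence matrix, and the topic grid are illustrative choices, not the final WDCM configuration:

```
library(text2vec)
data("movie_review")

it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

# Term co-occurrence matrix used as the reference for coherence;
# a skip-gram window TCM approximates document-level co-occurrence
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Score a grid of topic numbers by mean topic coherence
coh_by_k <- sapply(c(10, 25, 50, 100), function(k) {
  lda <- LDA$new(n_topics = k)
  lda$fit_transform(dtm, n_iter = 500, progressbar = FALSE)
  top_words <- lda$get_top_words(n = 10, lambda = 1)
  mean(coherence(top_words, tcm, metrics = "mean_logratio"))
})
# Select the k with the highest mean coherence
```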

Closing the ticket.