
Replace {maptpx} for Topic Modeling in WDCM
Closed, Resolved · Public

Description

  • The {maptpx} LDA implementation used for topic modeling in WDCM needs to be replaced.
  • Main reason: the {maptpx} R package has not been actively maintained for some time.
  • The current alternative under consideration: the MALLET LDA implementation in the {SpeedReader} R package.

Also:

  • our results indicate that neither perplexity-based nor Bayes-factor (BF) based model selection yields human-interpretable topics of Wikidata items (some attempts to fix this are tracked in T203238);
  • coherence measures will be introduced into model selection; since we are changing the algorithm anyway, now is the right time to implement this step too.

If this fails for any reason, we will either stick with {maptpx} and introduce coherence measures there, or migrate to Spark's MLlib implementation of LDA.

Event Timeline

GoranSMilovanovic created this task.
  • This is going to be a general WDCM task, not something implemented specifically for WDCM Wikipedia Semantics (a.k.a. WDCM Sitelinks).
  • Removing parent task.

Following thorough experiments with Python's Gensim and Apache Spark, both of which use the same online LDA estimation procedure, these are the conclusions:

  • we will replace {maptpx}, but not with Gensim's or Spark's LDA routines;
  • instead, we will go with the R {text2vec} package and use its WarpLDA implementation (a minimal sketch of the new pipeline follows below).
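
For reference, a minimal sketch of what the new pipeline looks like, using the movie_review data shipped with {text2vec} as a stand-in corpus; the priors, topic number, and iteration settings here are illustrative, not the WDCM production values:

```
library(text2vec)
data("movie_review")

# Tokenize and build a document-term matrix
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it),
                          term_count_min = 10,
                          doc_proportion_max = 0.2)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Fit a WarpLDA topic model; hyperparameters are illustrative
lda_model <- LDA$new(n_topics = 50,
                     doc_topic_prior = 0.1,
                     topic_word_prior = 0.01)
doc_topic <- lda_model$fit_transform(dtm,
                                     n_iter = 1000,
                                     convergence_tol = 0.001,
                                     n_check_convergence = 25)

# Inspect the top terms per topic
lda_model$get_top_words(n = 10)
```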

Rationale:

  • for some time now, users of Gensim have noticed strange behavior of its perplexity measure:
  • namely, the perplexity of a hold-out sample in cross-validation increases with the number of topics in the model (see the sketch after this list),
  • which is quite uncommon, not to say impossible;
  • I have checked all the code and the math in Gensim, and there does not seem to be a bug in their computation of perplexity;
  • furthermore, I was able to demonstrate empirically the following interesting fact: the phenomenon described above occurs only in the context of cross-validation, and never when the corpus is fitted as a whole;
  • this implies that there is probably something about the math of online LDA estimation that we do not fully understand;
  • my working hypothesis is that the algorithm is biased toward finding optimal solutions with a smaller number of topics as the size of the corpus decreases, which might be empirically plausible but is certainly not necessarily true;
  • I suspect the size of the hold-out sample in cross-validation is thus what generates the problem.
  • The authors of Gensim now recommend using coherence measures in place of perplexity;
  • we already use coherence-based model selection in LDA to support our WDCM Sitelinks and WDCM Titles dashboards;
  • however, I am not ready to go with this here: we want to work with a routine that exactly reproduces the known and expected behavior of a topic model.
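
To make the setup under discussion concrete, here is a minimal sketch of the hold-out perplexity computation in question, written against {text2vec} (the anomaly itself was observed in Gensim); the split ratio, topic grid, and iteration counts are illustrative:

```
library(text2vec)
data("movie_review")
set.seed(42)

it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Split the documents into a training set and a hold-out set
train_idx <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))
dtm_train <- dtm[train_idx, ]
dtm_test  <- dtm[-train_idx, ]

# If the anomaly described above held here, the hold-out
# perplexity would grow as k grows
for (k in c(10, 50, 100)) {
  lda <- LDA$new(n_topics = k)
  lda$fit_transform(dtm_train, n_iter = 500, progressbar = FALSE)
  # Infer doc-topic distributions for the hold-out documents only
  doc_topic_test <- lda$transform(dtm_test)
  p <- perplexity(dtm_test,
                  topic_word_distribution = lda$topic_word_distribution,
                  doc_topic_distribution  = doc_topic_test)
  cat(sprintf("k = %3d  hold-out perplexity = %.1f\n", k, p))
}
```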

{text2vec} seems to be the way to go:

  • this R package is actively maintained, unlike {maptpx} (which I have liked very much), which we seek to replace;
  • its WarpLDA procedure scales easily in RAM and provides rapid estimation of topic models:
  • I have tested corpora with tens of thousands of documents and tens of thousands of features, across 10 to 300 topics,
  • and the package was tested on the reference Wikipedia corpus;
  • it currently runs a serial implementation, but given its performance we can easily fit many models in parallel (see the sketch after this list).
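
A sketch of what the parallel fitting could look like, using base R's {parallel}; the topic grid and core count are illustrative choices:

```
library(text2vec)
library(parallel)
data("movie_review")

it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Each WarpLDA fit is serial, so fit one model per core across a
# grid of topic numbers (mclapply forks; use mc.cores = 1 on Windows)
k_grid <- seq(10, 300, by = 10)
results <- mclapply(k_grid, function(k) {
  lda <- LDA$new(n_topics = k)
  doc_topic <- lda$fit_transform(dtm, n_iter = 500, progressbar = FALSE)
  # Return plain R objects (not the model, whose C++ pointer
  # would not survive the fork boundary)
  list(k = k,
       doc_topic = doc_topic,
       top_words = lda$get_top_words(n = 10, lambda = 1))
}, mc.cores = max(1, detectCores() - 1))
```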

Next steps:

  • install {text2vec} (see below);
  • change the WDCM ML procedures;
  • implement and deploy.
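
The first step is a standard CRAN install:

```
install.packages("text2vec")
```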

@RazShuty

  • All changes implemented: {text2vec} procedures now replace {maptpx};
  • deploying now; the test run was successful, so the dashboards are already updated from the new ML back-end;
  • first run in production scheduled for tomorrow, June 7, 2019;
  • observed behavior: too many topics in the optimal model under perplexity-based model selection;
  • action: we will switch the WDCM model selection procedures from perplexity-based to coherence-based (as in WDCM Sitelinks and Titles); a sketch follows below.
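
A hedged sketch of what the coherence-based selection can look like in {text2vec}; the mean_logratio metric, the window-based term co-occurrence matrix, and the topic grid are illustrative choices, not the final WDCM configuration:

```
library(text2vec)
data("movie_review")

it <- itoken(movie_review$review, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = movie_review$id)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

# Term co-occurrence matrix used as the reference for coherence;
# a skip-gram window TCM approximates document-level co-occurrence
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Score a grid of topic numbers by mean topic coherence
coh_by_k <- sapply(c(10, 25, 50, 100), function(k) {
  lda <- LDA$new(n_topics = k)
  lda$fit_transform(dtm, n_iter = 500, progressbar = FALSE)
  top_words <- lda$get_top_words(n = 10, lambda = 1)
  mean(coherence(top_words, tcm, metrics = "mean_logratio"))
})
# Select the k with the highest mean coherence
```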

Closing the ticket.