
Implement Topic Coherence scores for WDCM (S)itelinks
Closed, Resolved (Public)

Description

  • Implement topic coherence scores in WDCM (S)itelinks topic models;
  • Compare to the currently used Diversity score (i.e. Shannon's index re-scaled).
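For reference, a minimal sketch of what a "Shannon's index re-scaled" diversity score could look like. This assumes the re-scaling is a division by the maximum possible entropy, log(n), so that the score falls in [0, 1]; the task does not specify the re-scaling actually used in WDCM, and the function name `diversity` is hypothetical.

```python
import math


def diversity(weights):
    """Shannon entropy of a weight vector, divided by log(n).

    Assumption: 're-scaled' means normalized by the maximum entropy log(n),
    giving 0 for a fully concentrated vector and 1 for a uniform one.
    """
    n = len(weights)
    total = sum(weights)
    if n < 2 or total <= 0:
        raise ValueError("need at least two weights with a positive sum")
    # Zero weights contribute nothing to the entropy (0 * log 0 := 0).
    probs = [w / total for w in weights if w > 0]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return entropy / math.log(n)


uniform_score = diversity([1, 1, 1, 1])       # evenly spread weights
concentrated_score = diversity([1, 0, 0, 0])  # all weight on one entry
```

Under this assumption the score measures only how evenly the weight is spread. It does not measure whether the highly weighted items belong together, which is what a coherence score targets.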

Event Timeline

GoranSMilovanovic created this task.
  • Following an examination of several topic coherence measures reported in the recent literature, the decision is to implement a version of the C_uci measure as defined in:

Exploring the Space of Topic Coherence Measures. Michael Röder, Andreas Both, Alexander Hinneburg. Published 2015 in WSDM. DOI:10.1145/2684822.2685324

Our version

  • will use the Normalized Pointwise Mutual Information (NPMI) instead of the originally used PMI; empirical studies have shown that C_uci with NPMI predicts human judgments of topic coherence better than C_uci with PMI;
  • will not rely on term (item, in our case) frequency obtained from context vectors (P_sw in the paper), but on the frequencies computed directly from the Document-Term (Wikipedia-Item, in our case) matrix.
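The two modifications above can be sketched as follows. This is an illustration under stated assumptions, not the WDCM implementation: probabilities are estimated as the share of documents (Wikipedias) in which an item is present, using a binarized Document-Term matrix, and the names `npmi` and `topic_coherence` are hypothetical. The edge cases (items that never co-occur, items present in every document) are handled by convention.

```python
import math
from itertools import combinations


def npmi(p_i, p_j, p_ij):
    """Normalized PMI in [-1, 1] from marginal and joint probabilities."""
    if p_ij == 0.0:
        return -1.0  # convention: the two items never co-occur
    if p_ij == 1.0:
        return 1.0   # convention (assumption): both items are in every document
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


def topic_coherence(doc_term, top_items):
    """Mean NPMI over all pairs of a topic's top items.

    doc_term: list of rows (documents), each a list of per-item counts.
    top_items: column indices of the topic's top M items.
    """
    n_docs = len(doc_term)
    # Set of documents in which each top item is present (count > 0).
    present = {
        t: {d for d, row in enumerate(doc_term) if row[t] > 0}
        for t in top_items
    }
    scores = []
    for a, b in combinations(top_items, 2):
        p_a = len(present[a]) / n_docs
        p_b = len(present[b]) / n_docs
        p_ab = len(present[a] & present[b]) / n_docs
        scores.append(npmi(p_a, p_b, p_ab))
    return sum(scores) / len(scores)


# Toy matrix: items 0 and 1 always appear together, item 2 never joins them.
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
coherent = topic_coherence(toy, [0, 1])
incoherent = topic_coherence(toy, [0, 2])
```

In the toy matrix, the pair that always co-occurs scores at the top of the NPMI range and the pair that never co-occurs scores at the bottom. A topic's score is the average over all pairs of its top M items.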

The implementation will be used (a) to select the number of topics K in the topic model for a given WDCM semantic category (if it makes sense), and (b) to report on topic coherence on the dashboard.
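Use (a) could look roughly like the sketch below: for each candidate K, take the top M items of every topic, score each topic, and keep the K with the highest mean score. The names `top_items` and `select_k`, the shape of `models`, and the choice of the mean as the aggregate are assumptions for illustration. `score_topic` stands for any coherence function that takes a list of item indices, and the scorer used here is a toy stand-in.

```python
def top_items(weights, m):
    """Indices of the m largest topic-item weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:m]


def select_k(models, score_topic, m=15):
    """Pick the K whose topics have the highest mean coherence.

    models: dict mapping K to a list of K topic weight vectors.
    score_topic: callable taking a list of item indices, returning a float.
    """
    mean_scores = {}
    for k, topics in models.items():
        scores = [score_topic(top_items(w, m)) for w in topics]
        mean_scores[k] = sum(scores) / len(scores)
    best_k = max(mean_scores, key=mean_scores.get)
    return best_k, mean_scores


# Toy scorer (assumption): items {0, 1} belong together, as do items {2, 3}.
def toy_score(items):
    s = set(items)
    return 1.0 if s <= {0, 1} or s <= {2, 3} else 0.0


toy_models = {
    1: [[0.4, 0.1, 0.4, 0.1]],                        # one mixed topic
    2: [[0.5, 0.4, 0.1, 0.0], [0.0, 0.1, 0.4, 0.5]],  # two clean topics
}
best_k, mean_scores = select_k(toy_models, toy_score, m=2)
```

Because the score depends only on the top M items of each topic, the choice of M can change which K is selected.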

  • Topic coherence measure implemented;
  • Running model selection with respect to topic coherence.
  • Results: as expected, LDA models selected with respect to topic coherence have more topics than those selected by purely statistical measures (perplexity).
  • Currently using M = 5 top topic terms (items, in our case);
  • Now experimenting with M = 10 and M = 15, the values most frequently mentioned in the literature.
  • Prima facie validity with M = 15 top topic items seems acceptable.
  • This is online and can be tested.
  • The 'Distinctive Classes' annotations should be changed to reflect only the classes that typically annotate the top M topic items.
  • Implementing changes in annotations will be tracked on T203238.
  • Closing.