
Evaluate list-building tools for ad-hoc topic modeling
Closed, ResolvedPublic

Description

Perform a quantitative evaluation of the list-building tools (see API [1]) developed in T266768. The tools aim to define custom (ad-hoc) topics, in contrast to the pre-defined ORES topics. The question is how well these approaches work. One relevant task is to automatically generate lists of articles belonging to a given topic, such as climate change. We use wikiproject labels as a ground-truth dataset for different (arbitrary) topics. Starting from suitable input article(s) of a given wikiproject, we compare the output of the list-building tools with the articles contained in the corresponding wikiproject.

  • Generate a curated dataset of wikiprojects and contained articles (overlaps with T238437)
  • Identify input articles characterizing the corresponding wikiproject
  • Query the different list-building tools and quantify the overlap with the ground truth (see the sketch after the reference below)

[1] https://list-building.toolforge.org/
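
A minimal sketch of how one of the list-building tools could be queried for a seed article; the endpoint path, parameter names, and response shape below are assumptions for illustration, not the tool's documented API (the actual tool lives at https://list-building.toolforge.org/ [1]).

```
import requests

# Hypothetical endpoint path; the real API is served from https://list-building.toolforge.org/
LIST_BUILDING_URL = "https://list-building.toolforge.org/api/v1/related"

def get_related_articles(seed_title, lang="en", limit=100):
    """Query a list-building tool for articles related to a seed article."""
    params = {"title": seed_title, "lang": lang, "limit": limit}  # assumed parameter names
    response = requests.get(LIST_BUILDING_URL, params=params, timeout=30)
    response.raise_for_status()
    # assumed response shape: {"results": [{"title": ...}, ...]}
    return [item["title"] for item in response.json()["results"]]

if __name__ == "__main__":
    print(get_related_articles("Climate change")[:10])
```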

Event Timeline

Update week 2021-02-01:

  • None

Update week 2021-02-08:

  • started to explore Isaac's dataset containing the wikiproject labels for articles in enwiki, together with importance/quality ratings

Update week 2021-02-15:

  • continued exploratory analysis of the data
  • this is mostly to help decide on the parameters for the evaluation data, specifically:
    • which wikiprojects should we include? (minimum number of articles, minimum level of activity)
    • which articles should we include? (only high priority/quality)
    • how to find a single seed article (or wikidata item) needed as input for the list-building tool for a given wikiproject (e.g. search from title; see the sketch below)
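
A sketch of one way to pick a seed article for a wikiproject: search enwiki for the project's topic string and read the page's Wikidata item via the standard MediaWiki Action API. Taking the top search hit as the seed is an assumption for illustration, not the final selection rule.

```
import requests

API = "https://en.wikipedia.org/w/api.php"

def find_seed_article(topic):
    """Return (title, wikidata_qid) for the top enwiki search result for `topic`."""
    search = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": topic,
        "srlimit": 1, "format": "json",
    }, timeout=30).json()
    title = search["query"]["search"][0]["title"]
    # look up the page's Wikidata item via its page properties
    pages = requests.get(API, params={
        "action": "query", "prop": "pageprops", "ppprop": "wikibase_item",
        "titles": title, "format": "json",
    }, timeout=30).json()["query"]["pages"]
    qid = next(iter(pages.values()))["pageprops"]["wikibase_item"]
    return title, qid

# e.g. find_seed_article("Climate change") -> the seed title and its Wikidata QID
```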

Update week 2021-02-22:

  • started to set up the pipeline for the list-building evaluation
    • based on Isaac's wikiprojects dataset, I derived a subset of wikiprojects to use for the evaluation of list-building; we have 1486 different wikiprojects with at least 100 articles that have an importance rating (top, high, mid, low); articles without importance ratings are discarded to make sure that the articles are relevant for the wikiprojects
    • identified a seed article for each wikiproject by randomly sampling one of the articles with top-importance in that wikiproject
    • using the seed article, we query the list-building tools to generate lists of related articles; we compare the overlap with the articles in the wikiproject via standard precision and recall metrics
  • implemented and ran a baseline from cirrussearch (via morelike) to have a lower bound on precision and recall (see the sketch below)
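
Below is a sketch of the morelike baseline and of the overlap metrics used in the evaluation: query CirrusSearch for articles "morelike" the seed, then compute precision and recall of the returned list against the set of articles labelled with the wikiproject. The helper names are illustrative; the actual pipeline differs in details.

```
import requests

API = "https://en.wikipedia.org/w/api.php"

def morelike_baseline(seed_title, limit=50):
    """CirrusSearch 'morelike:' query: articles similar to the seed article.
    Note: srlimit is capped at 50 for most clients; longer lists need paging via sroffset."""
    params = {
        "action": "query", "list": "search",
        "srsearch": f"morelike:{seed_title}",
        "srlimit": limit, "format": "json",
    }
    hits = requests.get(API, params=params, timeout=30).json()["query"]["search"]
    return [h["title"] for h in hits]

def precision_recall(generated, ground_truth):
    """Overlap between a generated list and the wikiproject's article set."""
    generated, ground_truth = set(generated), set(ground_truth)
    n_hits = len(generated & ground_truth)
    precision = n_hits / len(generated) if generated else 0.0
    recall = n_hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```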

Update week 2021-03-01:

  • none (focus week)

Update week 2021-03-08:

  • prepared a ground-truth dataset
  • integrated the 3 different tools into the evaluation pipeline
  • currently running the list-building for all tools (+ morelike baseline) on ~1400 different wikiprojects
  • aim for next week is to aggregate performance metrics across the wikiprojects and write up the results

Update week 2021-03-15:

  • started comparison of different list-building tools with morelike-baseline
  • for approximately half of the projects, morelike actually yields the best coverage of the articles in the wikiprojects
  • for the other half, the new list-building tools better capture the articles contained in the wikiprojects, with reader-based list-building often yielding the best performance (though in some cases the content-based and wikidata-based tools yield better results)
  • aims for next week are to check: i) whether there is a pattern in which list-building tools work best for which wikiprojects (e.g. whether reader-based methods work well for wikiprojects related to, say, geography; a sketch of this per-wikiproject comparison is included below); ii) whether pooling the results from different list-building tools actually yields better coverage of a wikiproject than any single method. The latter most closely captures the use-case of the list-building tool to support campaign organizers, where users can pick whichever tool yields the most useful results for their specific case and interest.
  • I also started to discuss with Alex Stinson, who is interested in testing these tools with event organizers in practice; this would be interesting because it constitutes a more realistic evaluation, but also because we could hopefully evaluate in a non-English language (so far, we have the wikiprojects ground-truth dataset only for English) and thus take advantage of the fact that the list-building tools are language-agnostic and can be readily applied to any language
  • the plan is to write up these results next week
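
A sketch of the per-wikiproject comparison described above: given a recall (coverage) score for each method on each wikiproject, count how often each method gives the best coverage. The data layout and method names are illustrative assumptions, not the pipeline's actual structures.

```
from collections import Counter

def best_method_counts(recall_by_project):
    """recall_by_project: {project: {"morelike": 0.12, "reader": 0.31, ...}} (assumed layout)."""
    winners = Counter()
    for scores in recall_by_project.values():
        winners[max(scores, key=scores.get)] += 1
    return winners

example = {
    "WikiProject A": {"morelike": 0.20, "reader": 0.35, "content": 0.15, "wikidata": 0.10},
    "WikiProject B": {"morelike": 0.30, "reader": 0.25, "content": 0.22, "wikidata": 0.18},
}
print(best_method_counts(example))  # Counter({'reader': 1, 'morelike': 1})
```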

Update week 2021-03-22:

  • dug a little deeper into this analysis. the picture that emerges is that the different methods are very complementary in how they capture the articles contained in each wikiproject
  • the overlap among the different lists is very low (among the 100 items from each list, there are very few items in common); on average (over different wikiprojects) the Jaccard index is ~0.05 to 0.1
  • the improvement of one list with respect to another is not marginal: often one list-building method provides very poor coverage while another provides very good coverage; for example, there are hundreds of wikiprojects for which the "reader-based" list yields coverage that is at least twice as good as the baseline (that is, an improvement of 100% or more in the number of articles that match the articles contained in the wikiproject)
  • there seems to be no consistent pattern in terms of whether a specific method works best when aggregating different wikiprojects into topics (e.g. the different wikiprojects related to "Biography")
  • this suggests that a good strategy for the tool is to pool the results from the different lists (see the sketch below)
  • discussing with Isaac, we realized that it would be good to check how these results hold up for at least one other non-English wiki; Isaac has already prepared the data and I should be able to repeat this analysis quickly in the next week (together with writing this up)
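
A sketch of the two checks described above: the pairwise Jaccard index between the lists produced by the different methods, and the recall of the pooled (union) list against the wikiproject's article set. The input format ({method_name: [titles]} per wikiproject) is an assumption for illustration.

```
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pairwise_jaccard(lists_by_method):
    """Jaccard index for every pair of methods' lists for one wikiproject."""
    return {
        (m1, m2): jaccard(lists_by_method[m1], lists_by_method[m2])
        for m1, m2 in combinations(sorted(lists_by_method), 2)
    }

def pooled_recall(lists_by_method, ground_truth):
    """Recall of the union of all methods' lists against the wikiproject article set."""
    pooled = set().union(*(set(lst) for lst in lists_by_method.values()))
    truth = set(ground_truth)
    return len(pooled & truth) / len(truth) if truth else 0.0
```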

Update week 2021-03-29:

Update week 2021-06-28:

Closing this task as all todos have been completed.