
Link-based and text-based topic evaluations October 2020
Closed, Resolved · Public · Nov 6 2020

Description

In T234272: Newcomer tasks: evaluate topic matching prototypes, we evaluated three different approaches for finding lists of articles around a given topic. For each of the three approaches, we did this by looking at 10 random articles from each topic (that were also newcomer tasks), and counting how many of those 10 seemed to fit in the topic. After doing this in four languages, we decided that the ORES text-based model was best. In T240517: [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics), we built that model into ElasticSearch and used it for the newcomer tasks feature.

A problem with the text-based model is that separate models have to be trained for every language, which is a lot of human and machine time. Because of this, text-based models only exist in five languages right now, meaning low topic coverage in all other languages.

Therefore, the Research team built the "link-based" model, which is easy to apply to every language and gives high coverage. Whereas the text-based model determines topics based on the words in an article, the link-based model looks at an article's wikilinks, and determines topics based on which other articles are linked.

In this task, we are going to evaluate the new link-based model in five languages. Alongside it, we are going to re-evaluate the text-based model, to make sure we have an apples-to-apples comparison. If the link-based model performs strongly enough, we can start using the link-based model, and instantly be able to bring all Wikipedias to full topic coverage!

Here's how we'll do it (hopefully, this will be a smoother process than last time):

  1. Open this spreadsheet.
  2. Start on the "link-based" tab for your language.
  3. Open each article and decide if it belongs to the topic listed in the "topic" column. There are ten random articles for each of the 64 topics.
  4. In the "Correct?" column, put a 1 if the article belongs to that topic, and a 0 if it does not.
  5. Go to the "text-based" tab for your language and repeat the process.

When complete, we'll calculate accuracy for each model and decide how to proceed.
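The per-model calculation can be sketched as follows. This is a minimal illustration, assuming the spreadsheet ratings are exported as (topic, rating) pairs; it is not the actual analysis script.

```python
# Sketch of the accuracy summary, assuming ratings exported from the
# spreadsheet as (topic, rating) pairs, where rating is 1 (article fits
# the topic) or 0 (it does not).
from collections import defaultdict

def summarize(ratings):
    """Return overall precision and the topics scoring below 7/10."""
    by_topic = defaultdict(list)
    for topic, rating in ratings:
        by_topic[topic].append(rating)
    total = sum(len(rs) for rs in by_topic.values())
    overall = sum(r for rs in by_topic.values() for r in rs) / total
    weak = [t for t, rs in by_topic.items() if sum(rs) < 7]
    return overall, weak

# Example: one topic rated 10/10, another 5/10.
ratings = [("physics", 1)] * 10 + [("education", 1)] * 5 + [("education", 0)] * 5
overall, weak = summarize(ratings)
# overall == 0.75, weak == ["education"]
```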

lang | Link based | Text based
ar   | done       | done
cs   | done       | done
en   | done       |
fr   | done       | done
vi   | done       | done

Details

Due Date
Nov 6 2020, 7:00 PM

Event Timeline

@PPham @Dyolf77_WMF @Urbanecm @Trizek-WMF -- this is the validation work for the new topic model. Please see the instructions in the task description. I'll be working on English. I expect this to be about three hours of work, and I would like us to be complete by the end of next week.

@Dyolf77_WMF and @Trizek-WMF, perhaps you would like to collaborate on French, but it is up to you.

Is it worth checking the enwiki-text-based recommendations as well? If it turns out that those are significantly better than link-based, we could use the latter as a fallback.

@Tgr -- I think you're talking about the English scores crosswalked to other languages, right? @Isaac and I talked about that, and we were disinclined to mix-and-match model scores inside a wiki (because the crosswalks have like 60% for lots of wikis). Would that mixing and matching cause any engineering issues? I am also loathe to ask people to evaluate another 640 articles for their language, but we could consider it.

Uh, yeah, that's quite a bit of work for something that's unlikely to produce good results.

Engineering-wise, I think it would be fine. Mixing would happen in the job that ships data to ElasticSearch, and seems like a fairly simple change. Not sure how easy it is to differentiate between "the algorithm didn't handle the page" and "the algorithm didn't recommend anything" - e.g. if you want to fall back from the enwiki score to the link score, and some article does have an enwiki sitelink, but all the recommendations are below the confidence threshold (so basically the algorithm says "this page doesn't really belong in any topic"), you might not be able to prevent the job from interpreting that as no crosswalk coverage for the page and falling back to link-based. But I don't think that has any significance in practice.
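The fallback described above could be sketched roughly like this. All names (the score fields, the threshold, the function) are illustrative assumptions, not the actual pipeline code, and the ambiguity Tgr points out is marked in a comment.

```python
# Hypothetical sketch of the proposed fallback: prefer crosswalked enwiki
# text-based scores, fall back to the link-based model. Field names and the
# threshold value are assumptions, not the real pipeline's.
CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff

def topics_for_page(page):
    text_scores = page.get("enwiki_text_scores")  # None if no enwiki sitelink
    if text_scores is not None:
        confident = {t: s for t, s in text_scores.items()
                     if s >= CONFIDENCE_THRESHOLD}
        if confident:
            return confident
        # Ambiguous case: the text model saw the page but everything fell
        # below the threshold ("this page doesn't really belong in any
        # topic"). As noted above, this is hard to distinguish from "no
        # crosswalk coverage", so we fall through to link-based either way.
    link_scores = page.get("link_scores", {})
    return {t: s for t, s in link_scores.items() if s >= CONFIDENCE_THRESHOLD}
```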

Trizek-WMF triaged this task as Medium priority.

Claiming for coordination.

Urbanecm_WMF changed the subtype of this task from "Task" to "Deadline".Oct 26 2020, 7:03 PM

My notes from evaluating link-based topic model:

  • General notes
    • I noticed that several disambiguation pages were included - they're technical pages that don't really fit any topic
  • Topic-specific notes:
    • business-and-economics seems to have a lot of false positives (notably, more articles about cars [i.e. products of companies] than articles about actual companies/businesses)
    • comics-and-anime has a lot of articles about The Simpsons, which doesn't quite match the topic (although it's related, comics-and-anime is too specific for The Simpsons to fit there)
    • education performs quite badly - a lot of articles weren't really related to education
    • all of the geographical articles were about specific landforms around the world, rather than geographical terms. This might confuse users, who would expect geographical terms to be there. I rated them all zero for that reason, as I would expect terms rather than landforms (which should/could be in categories like "Europe", because that's more specific)
    • linguistics contains a disproportionate number of disambiguation pages
    • earth-and-environment contains articles about extinct species, like https://en.wikipedia.org/wiki/Shingopana

My notes from evaluating text-based topic model:

  • Topic-specific notes
    • biography has a surprisingly low average (though still high compared to the other topics), given that biography is one of the topics with a very clear definition (in my opinion, it's quite easy to determine whether an article is a biography or not) - the text-based model even suggests numbers as biographies
    • comics-and-anime has a lot of articles about The Simpsons, which doesn't quite match the topic (although it's related, comics-and-anime is too specific for The Simpsons to fit there; this applies even more than in the link-based model)
    • all of the geographical articles were about specific landforms around the world, rather than geographical terms. This might confuse users, who would expect geographical terms to be there. I rated them all zero for that reason, as I would expect terms rather than landforms (which should/could be in categories like "Europe", because that's more specific)
    • linguistics contains a disproportionate number of disambiguation pages
    • radio has a lot of articles related to television
    • stem has surprisingly low "correctness score"
    • western-africa has surprisingly low accuracy - a lot of articles don't really fit the topic definition

So sorry I'm this late.
Here are some comments for the link-based:

  • Some flora/fauna are listed in the geography topics, perhaps because in other languages they are said to be endemic to that place, but you cannot deduce that from the Vietnamese articles, so I gave 0 points.
  • The STEM topic is all flora/fauna. Still okay I guess, but as a user I would expect other kinds of articles and more diversity.
  • Disambiguation pages still show up.

Overall I still gave 0 points here and there, but the ratio of 0s to 1s is not alarming at all; I'd say it's acceptable.

Trizek-WMF renamed this task from Link-based topic evaluations October 2020 to Link-based and text-based topic evaluations October 2020.Nov 2 2020, 3:36 PM
Trizek-WMF changed Due Date from Oct 30 2020, 7:00 AM to Nov 6 2020, 7:00 PM.
Trizek-WMF updated the task description. (Show Details)

New due date for Text-based review: Friday, Nov 6, 8:00 PM.

Notes:

  • I set all disambiguation pages to 0, except the ones that are for sure about a given topic (for instance, two towns with the same complicated name, listed under "western-Europe")
  • In some cases, a topic only lists articles from the same area. For instance, Southern Africa only covers South Africa articles, which is a bias.
  • Internet & culture and Linguistics are clearly weaknesses for both models. They are the topics that collect the most zeros for French.

Notes for text-based:

  • The accuracy rate is much lower for text-based than for link-based.
  • Species articles appear everywhere: to the extent that "politics and government" is 1/10 because the other 9 are species articles.
  • We still have disambiguation pages; I scored them all 0.

I'd say we choose link-based.

@MMiller_WMF, can I close this task (even if English Wikipedia is not done)?

We are using the data from this task in the current development.

This task has been closed for a while, but I'm leaving a summary of takeaways now that we've had some time to analyze and consider the feedback more in depth:

  • I did the basic analysis of link-based vs. text-based in the spreadsheet in the task description. Summary stats are in the table below. Overall, the link-based model performed as well as or better than the text-based model. The link-based model also had fewer topics where performance was below 7/10 (70%), which is the basic standard of performance we expect.
  • Disambiguation pages. This is something we'll have to filter out at some stage, because it's recurring feedback and disambiguation pages would make confusing recommendations in a lot of contexts. I know that the pageprops API provides this information, but I'm not sure whether CirrusSearch can also filter on disambiguation pages. Wikidata maintains some of this information too; in the past I haven't found it particularly consistent, but it could be used to filter out a lot of the noise at least.
  • Education as a low-performing topic. Looking at the WikiProjects that provide us with our initial education labels, there's a strong US bias, which might account for some of the low performance. I'll have to see if there is some way to counteract that.
  • Geography topics being "correct" but all from the same region: I've started work on building a country-level classifier that will hopefully solve this problem and frankly be much more useful to editors looking for articles about their area (T263646). Similar patterns were noted in a few other topics too (e.g., all flora/fauna in STEM), which suggests that we may also need an approach for diversifying the articles that show up under a topic given that some wikis do have a very large amount of content about very specific sub-topics.
Language (model)   | Overall Precision | Topics < 7/10 correct
Arabic (text)      | 90.9%             | 4
Arabic (links)     | 94.7%             | 2
Czech (text)       | 74.5%             | 17
Czech (links)      | 81.4%             | 11
English (text)     | 85.5%             | 7
English (links)    | 89.5%             | 6
French (text)      | 79.7%             | 11
French (links)     | 88.0%             | 5
Vietnamese (text)  | 79.7%             | 14
Vietnamese (links) | 91.3%             | 2
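The disambiguation filtering mentioned in the takeaways could be done with the standard MediaWiki pageprops query (the `disambiguation` page prop is set on disambiguation pages). This is a rough stdlib-only sketch against the enwiki endpoint, with response parsing split out; it is an illustration, not production code, and doesn't handle title normalization or batching beyond 50 titles.

```python
# Sketch: drop disambiguation pages using action=query&prop=pageprops with
# ppprop=disambiguation. Disambiguation pages come back with a "pageprops"
# key; regular articles do not.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def disambig_titles(api_response):
    """Extract disambiguation-page titles from a prop=pageprops response."""
    pages = api_response["query"]["pages"]
    return {p["title"] for p in pages if "pageprops" in p}

def drop_disambiguation(titles):
    """Return only the titles that are not disambiguation pages."""
    params = urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": "|".join(titles),  # up to 50 titles per request
        "format": "json",
        "formatversion": "2",
    })
    with urlopen(f"{API}?{params}") as f:
        resp = json.load(f)
    return [t for t in titles if t not in disambig_titles(resp)]
```

The parsing helper can be exercised with a canned response, without hitting the API.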