
Link-based and text-based topic evaluations October 2020
Closed, Resolved | Public | Nov 6 2020

Description

In T234272: Newcomer tasks: evaluate topic matching prototypes, we evaluated three different approaches for finding lists of articles around a given topic. For each of the three approaches, we did this by looking at 10 random articles from each topic (that were also newcomer tasks), and counting how many of those 10 seemed to fit in the topic. After doing this in four languages, we decided that the ORES text-based model was best. In T240517: [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics), we built that model into ElasticSearch and used it for the newcomer tasks feature.

A problem with the text-based model is that separate models have to be trained for every language, which is a lot of human and machine time. Because of this, text-based models only exist in five languages right now, meaning low topic coverage in all other languages.

Therefore, the Research team built the "link-based" model, which is easy to apply to every language and gives high coverage. Whereas the text-based model determines topics based on the words in an article, the link-based model looks at an article's wikilinks, and determines topics based on which other articles are linked.

In this task, we are going to evaluate the new link-based model in five languages. Alongside it, we are going to re-evaluate the text-based model, to make sure we have an apples-to-apples comparison. If the link-based model performs strongly enough, we can start using the link-based model, and instantly be able to bring all Wikipedias to full topic coverage!

Here's how we'll do it (hopefully, this will be a smoother process than last time):

  1. Open this spreadsheet.
  2. Start on the "link-based" tab for your language.
  3. Open each article and decide if it belongs to the topic listed in the "topic" column. There are ten random articles for each of the 64 topics.
  4. In the "Correct?" column, put a 1 if the article belongs to that topic, and a 0 if it does not.
  5. Go to the "text-based" tab for your language and repeat the process.

When complete, we'll calculate accuracy for each model and decide how to proceed.
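The accuracy calculation could be sketched as follows. This is a minimal illustration, not the script actually used: it assumes the spreadsheet ratings have been exported as (topic, correct) pairs, and the function names and sample ratings are hypothetical.

```python
from collections import defaultdict


def accuracy_by_topic(rows):
    """Return {topic: share of articles rated 1} for (topic, correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for topic, correct in rows:
        totals[topic] += 1
        hits[topic] += correct  # correct is 1 (fits the topic) or 0 (does not)
    return {topic: hits[topic] / totals[topic] for topic in totals}


def overall_accuracy(rows):
    """Return the share of all rated articles that were marked 1."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(correct for _, correct in rows) / len(rows)


# Hypothetical ratings for illustration only; real data has 10 rows per topic.
link_based = [
    ("biography", 1),
    ("biography", 1),
    ("education", 0),
    ("education", 1),
]

print(accuracy_by_topic(link_based))
print(overall_accuracy(link_based))
```

Running the same two functions on the "link-based" and "text-based" tabs for one language gives directly comparable numbers, since both tabs use the same 0/1 rating scheme.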

lang  Link based  Text based
ar    done        done
cs    done        done
en    done
fr    done        done
vi    done        done

Details

Due Date
Nov 6 2020, 7:00 PM

Event Timeline

@PPham @Dyolf77_WMF @Urbanecm @Trizek-WMF -- this is the validation work for the new topic model. Please see the instructions in the task description. I'll be working on English. I expect this to be about three hours of work, and I would like us to be complete by the end of next week.

@Dyolf77_WMF and @Trizek-WMF, perhaps you would like to collaborate on French, but it is up to you.

MMiller_WMF set Due Date to Oct 30 2020, 7:00 AM. Oct 21 2020, 11:43 PM
Tgr added a comment. Oct 22 2020, 3:26 AM

Is it worth checking the enwiki-text-based recommendations as well? If it turns out that those are significantly better than link-based, we could use the latter as a fallback.

@Tgr -- I think you're talking about the English scores crosswalked to other languages, right? @Isaac and I talked about that, and we were disinclined to mix-and-match model scores inside a wiki (because the crosswalks have like 60% for lots of wikis). Would that mixing and matching cause any engineering issues? I am also loath to ask people to evaluate another 640 articles for their language, but we could consider it.

Tgr added a comment. Oct 22 2020, 5:17 AM

Uh, yeah, that's quite a bit of work for something that's unlikely to produce good results.

Engineering-wise, I think it would be fine. Mixing would happen in the job that ships data to ElasticSearch, and seems like a fairly simple change. Not sure how easy it is to differentiate between "the algorithm didn't handle the page" and "the algorithm didn't recommend anything" - e.g. if you want to fall back from the enwiki score to the link score, and some article does have an enwiki sitelink, but all the recommendations are below the confidence threshold (so basically the algorithm says "this page doesn't really belong in any topic"), you might not be able to prevent the job from interpreting that as no crosswalk coverage for the page and falling back to link-based. But I don't think that has any significance in practice.
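To make the distinction concrete, here is a rough sketch, not the actual job code. It assumes the job can represent "no crosswalk coverage" as `None` and "scored, but nothing above the threshold" as a dict of scores that filters down to empty. The function name, the score format, and the threshold value are all hypothetical.

```python
from typing import Dict, Optional, Tuple

THRESHOLD = 0.5  # hypothetical confidence threshold


def pick_topics(
    crosswalk_scores: Optional[Dict[str, float]],
    link_scores: Dict[str, float],
    threshold: float = THRESHOLD,
) -> Tuple[str, Dict[str, float]]:
    """Choose which model's topics to ship for one page.

    crosswalk_scores is None when the page has no enwiki sitelink
    (the algorithm didn't handle the page). It is a dict, possibly with
    every score below the threshold, when the page was scored.
    """

    def above(scores: Dict[str, float]) -> Dict[str, float]:
        return {topic: s for topic, s in scores.items() if s >= threshold}

    if crosswalk_scores is None:
        # No crosswalk coverage: fall back to the link-based scores.
        return "link-based", above(link_scores)
    # The page was scored; an empty result means "no topic", not "no coverage".
    return "enwiki-crosswalk", above(crosswalk_scores)
```

With this shape, `pick_topics({"sports": 0.2}, {"music": 0.9})` keeps the empty crosswalk result instead of falling back. If the job only sees the filtered list, the two cases become indistinguishable, which is the situation described above.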

Trizek-WMF triaged this task as Medium priority.

Claiming for coordination.

Urbanecm_WMF changed the subtype of this task from "Task" to "Deadline". Oct 26 2020, 7:03 PM
Dyolf77_WMF moved this task from Backlog to In review on the User-Dyolf77 board. Oct 27 2020, 5:24 PM
Trizek-WMF updated the task description. Oct 28 2020, 4:42 PM

My notes from evaluating link-based topic model:

  • General notes
    • I noticed that several disambiguation pages were included; they are technical pages that don't really fit any topic
  • Topic-specific notes:
    • business-and-economics seems to have a lot of false positives (notably, more articles about cars [i.e. products of companies] than articles about actual companies/businesses)
    • comics-and-anime has a lot of articles about The Simpsons, which doesn't quite match the topic (although it's related, comics-and-anime is too specific for The Simpsons to fit there)
    • education performs quite poorly; a lot of articles weren't really related to education
    • all of the geographical articles were about specific reliefs (landforms) all around the world, rather than geographical terms. This might confuse users who would expect geographical terms to be there. I rated all of them as zero for that reason, as I would expect terms to be there rather than reliefs (which should/could be in categories like "Europe", because that's more specific)
    • linguistics contains a disproportionate number of disambiguation pages
    • earth-and-environment contains articles about extinct species, like https://en.wikipedia.org/wiki/Shingopana

My notes from evaluating text-based topic model:

  • Topic-specific notes
    • biography has a surprisingly low average (although it is still high compared to the other topics), given that biography is one of the topics with a very clear definition (in my opinion, it's quite easy to determine whether an article is a biography or not); the text-based model even suggests numbers as biographies
    • comics-and-anime has a lot of articles about The Simpsons, which doesn't quite match the topic (although it's related, comics-and-anime is too specific for The Simpsons to fit there; this applies even more than in the link-based model)
    • all of the geographical articles were about specific reliefs (landforms) all around the world, rather than geographical terms. This might confuse users who would expect geographical terms to be there. I rated all of them as zero for that reason, as I would expect terms to be there rather than reliefs (which should/could be in categories like "Europe", because that's more specific)
    • linguistics contains a disproportionate number of disambiguation pages
    • radio has a lot of articles related to television
    • stem has a surprisingly low "correctness score"
    • western-africa has surprisingly low coverage: a lot of the articles don't really fit the topic definition
Urbanecm_WMF updated the task description. Oct 30 2020, 4:01 PM
PPham updated the task description. Nov 1 2020, 2:04 AM
PPham added a comment. Edited Nov 1 2020, 2:10 AM

So sorry I'm this late.
Here are some comments for the link-based:

  • Some flora/fauna are listed in the geography topics, perhaps because in other languages they are said to be endemic to that place, but you cannot deduce it from the articles in Vietnamese, so 0 points.
  • STEM topic is all flora/fauna. Still okay I guess, but as a user I would expect other kinds of articles and more diversity.
  • Disambiguation pages still show up.

Overall I still got 0-points here and there, but the ratio of 0-points to 1-points is not alarming at all; I'd say it's acceptable.

Trizek-WMF renamed this task from Link-based topic evaluations October 2020 to Link-based and text-based topic evaluations October 2020. Nov 2 2020, 3:36 PM
Trizek-WMF changed Due Date from Oct 30 2020, 7:00 AM to Nov 6 2020, 7:00 PM.
Trizek-WMF updated the task description.

New due date for Text-based review: Friday, Nov 6, 8:00 PM.

Trizek-WMF updated the task description. Nov 3 2020, 3:40 PM
Trizek-WMF updated the task description.

Notes:

  • I set all disambiguation pages to 0, except the ones that are for sure about a given topic (for instance, two towns with the same complicated name, listed under "western-Europe")
  • In some cases, some topics are only listing articles from the same area. For instance, Southern Africa is only covering South Africa articles, which is a bias.
  • Internet & culture and Linguistics are clearly weaknesses for both models. They are the ones that collect most zeros for French.
PPham updated the task description. Nov 6 2020, 3:54 AM
PPham added a comment. Nov 6 2020, 3:58 AM

Notes for text-based:

  • The rate of accuracy is much lower in text-based than in link-based.
  • Species articles appear everywhere, to the extent that "politics and government" is 1/10 because the other 9 are species articles.
  • We still have disambiguation pages; I scored them all 0.

I'd say we choose link-based.

@MMiller_WMF, can I close this task (even if English Wikipedia is not done)?

Trizek-WMF closed this task as Resolved. Nov 30 2020, 10:40 AM

We are using the data from this in the current development.