
Newcomer tasks: evaluate new ORES topic models
Closed, Resolved (Public)

Description

In T234272: Newcomer tasks: evaluate topic matching prototypes, ambassadors evaluated several different methods for topic matching: morelike, ORES, and free-text. From those results, we decided to prefer ORES, but to rebuild the ORES models so that they perform better, have a more detailed ontology, and have full coverage in non-English languages.

The new models are ready, and we want to evaluate these, too. There are two models we're evaluating:

  • The "crosswalk model": this model is built and scored on English Wikipedia, and the scores are then applied to the local-language articles that also exist on English Wikipedia (see the sketch below).
  • The "local model": this model is first built on English Wikipedia, then rebuilt in each local language to ensure full coverage of all local articles.
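
For intuition, here is a minimal sketch of what the crosswalk approach amounts to, written against the public Wikidata and ORES APIs. It is illustrative only: the production pipeline works from dumps and database tables rather than live API calls, and the "articletopic" model name, the 0.5 cutoff, and the exact response shapes are assumptions on my part.

```python
# Minimal, illustrative sketch of the crosswalk idea: find the enwiki counterpart of a
# local-wiki article via Wikidata sitelinks, score it with the English topic model, and
# carry the topic labels over. Assumed details: the "articletopic" ORES model, the 0.5
# cutoff, and the response shapes shown here.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
ORES_API = "https://ores.wikimedia.org/v3/scores"

def enwiki_title(local_title, local_wiki="viwiki"):
    """Follow Wikidata sitelinks from a local article to its enwiki counterpart."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities", "sites": local_wiki, "titles": local_title,
        "props": "sitelinks", "format": "json"}).json()
    for entity in resp.get("entities", {}).values():
        return entity.get("sitelinks", {}).get("enwiki", {}).get("title")
    return None

def latest_rev_id(title, api="https://en.wikipedia.org/w/api.php"):
    """Latest revision id of an enwiki article (ORES scores are keyed by revision)."""
    resp = requests.get(api, params={
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "ids", "format": "json"}).json()
    page = next(iter(resp["query"]["pages"].values()))
    return page["revisions"][0]["revid"]

def crosswalk_topics(local_title, local_wiki="viwiki", threshold=0.5):
    """Topics for a local article, as predicted by the English model on the English article."""
    en_title = enwiki_title(local_title, local_wiki)
    if en_title is None:
        return []  # the article exists only locally, so the crosswalk model cannot cover it
    rev = latest_rev_id(en_title)
    scores = requests.get(f"{ORES_API}/enwiki/{rev}/articletopic").json()
    probs = scores["enwiki"]["scores"][str(rev)]["articletopic"]["score"]["probability"]
    return sorted(topic for topic, p in probs.items() if p >= threshold)
```

The local model, by contrast, is trained and run on the local wiki's own article text, so it also covers articles that have no English counterpart.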

There are some important details about how these models' ontology works:

  • There are now 64 topics, which is 25 more than before. This provides more detail.
  • "STEM" refers to "Science, Technology, Engineering, Math".
  • There are topics that have a *, such as STEM.STEM*. These are the "catch-all" topics. For instance, STEM.STEM* should list all kinds of science articles, and Geography.Regions.Asia.Asia* should list all kinds of Asia articles, regardless of region.

Here is how to evaluate the models:

  1. Go to the "ORES (2020)" tab in this spreadsheet.
  2. Open up the new prototype here.
  3. Choose your language.
  4. For each topic in the dropdown, select some of the task types and run two searches:
    • Crosswalk: run one search with the second switch turned on -- the one that says "Only get tasks with topics returned from enwiki ORES, ignore local wiki ORES models."
    • Local: run one search with the second switch turned off.
    • Do not use the first switch at all, the one that says, "Only return article when topic is top-ranked match from ORES (does not do anything if no topics are selected)." That should be left off.
  5. After each of the two searches, look at the first ten articles that are returned, and count how many are good matches for the selected topic. That count out of ten is the score for the search (a small sketch of how scores are recorded follows these steps).
  6. Record the crosswalk score (switch turned on) in the "crosswalk" column of the spreadsheet.
  7. Record the local score (switch turned off) in the "local" column of the spreadsheet.
  8. If you notice any problems or odd patterns with the models that should be investigated or fixed, please add them as a comment in the spreadsheet. For instance, you may notice that a topic about science is showing a lot of articles about sports, or that a topic doesn't return any articles at all.
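
For reference, here is a tiny sketch of how one row of the "ORES (2020)" spreadsheet is filled in; the topic name, counts, and comment below are hypothetical placeholders, not real results.

```python
# Illustrative only: one spreadsheet row per (language, topic), one score per model.
# A "score" is simply the number of good matches among the first ten results (0-10).
from dataclasses import dataclass

@dataclass
class TopicEvaluation:
    topic: str
    crosswalk_good: int  # good matches in the top 10 with the enwiki-ORES-only switch ON
    local_good: int      # good matches in the top 10 with that switch OFF
    comment: str = ""    # any problems or odd patterns worth investigating

# Hypothetical example values, not actual evaluation results.
row = TopicEvaluation(topic="STEM.STEM*", crosswalk_good=9, local_good=7,
                      comment="a couple of sports articles showed up")
print(f"{row.topic}: crosswalk {row.crosswalk_good}/10, local {row.local_good}/10")
```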

Details

Due Date
Feb 23 2020, 8:00 AM

Event Timeline

MMiller_WMF added subscribers: PPham, Dyolf77_WMF, Urbanecm and 2 others.

@Dyolf77_WMF @Urbanecm @PPham @revi -- this task is ready for you. Because there are two versions of the model, and each has 64 topics, this may take a long time. Please speak up if you are concerned about how much time this will take. We would like to have these results one week from now (Feb 23).

Trizek-WMF triaged this task as Medium priority. Feb 18 2020, 4:55 PM

@Dyolf77_WMF @revi @PPham @Urbanecm -- one clarification that comes from @Dyolf77_WMF. There are a series of topics that are "Geography". Those topics are not meant to show geographical features -- rather, they are meant to include all kinds of things that are inside or related to that geography. For instance, the topic Geography.Regions.Americas.South America should contain South American food, music, buildings, people, etc -- not just South American rivers, cities, mountains.

The topic that is meant for geographical features themselves is called Geography.Geographical.

I've added an explanation column to the spreadsheet to clarify some of these things.

@Halfak -- as the ambassadors are reviewing the models, they are noticing some things and will post comments here. I want to post something that I noticed today. Here are some examples of articles that have high "Culture.Biography.Women" scores:

Articles like these that are not actual biographies of women seem pretty common. Some of them are articles about women-related organizations. Some are common female names (e.g. "Emily"). And for some I don't see what would trigger the model (e.g. "UTC-08:30"). What do you think?


@Halfak Also in "Culture.Linguistics", which is supposed to be "About languages, grammar, dialects", many of the results are name disambiguation pages. I don't think I'd want name disambiguations if I selected Linguistics, or any disambiguations at all. What do you think?

Also, there are too many wrestlers showing up in Entertainment and Television. 1-2 out of 10 would be okay, but in Vietnamese roughly half or more of the results are wrestlers. Is that normal?

Note: I see a disproportionately high number of results for Geography.Regions.Africa.Central Africa from unrelated areas (e.g. Korean people, Japanese anime, etc.). The local model has 13040 results (none of the immediately visible ones are related), while the crosswalk has just 27 results.

Two comments from my side:

  • A very low score for Geography.Regions.Africa.Central Africa (crosswalk and local)
  • A recurring result: articles from this specific template are shown when I choose different topics with the crosswalk option.

A few more:

  • If possible, please set the tool to open links in new tabs. It would save my life.
  • Maybe display the first paragraph of the page, to give an idea of what the article is about?

(Otherwise done.)

I'm done on my part.

  • I think for now we should go with the crosswalk model. Its results are better because the quality of local articles is very low, so the algorithm has a hard time determining the correct topics for them.
  • We should exclude name disambiguation pages from all of the topics -- or perhaps all disambiguation pages (although only name disambiguations are showing up).
  • "Culture.Media.Radio" doesn't have many articles; maybe we should reconsider including it?
  • Within STEM, maybe we should combine Technology and Engineering?

@Halfak -- we are finished evaluating the crosswalk and local models in Vietnamese, Arabic, and Korean. Czech is still ongoing, but I think we have enough results to send you for your thoughts and recommendations, and to see if there are improvements you want to make. You can see the ambassadors' comments above in this task, and I've also consolidated them below, along with comments left in the evaluation spreadsheet.

  • Firstly, these models clearly improve over the morelike algorithm that we have been using, and do an excellent job across most topics -- with 9/10 or 10/10 scores from the local models on 53 of 64 topics in Arabic, 33 of 64 in Vietnamese, and 37 of 64 in Korean.
  • In Korean and Vietnamese, the crosswalk model performed better on average than the local models. We have two theories for this, below. If the crosswalk model really does outperform the local models, do you think we should use it instead?
    • Local articles about a given subject can be less developed than the English article, and so the local model may have less text with which to accurately classify it.
    • Or maybe the local models do just fine on the articles that happen to crosswalk from English, but worse on the articles that only exist in the local language, because those articles, being about more obscure and less global subjects, are harder for the model to classify.
  • Some of the worst-performing topics include:
    • Culture.Linguistics
    • Culture.Media.Radio
    • Geography.Regions.Africa.Central Africa (worst across all languages)
    • Geography.Regions.Africa.Southern Africa
    • Geography.Regions.Asia.South Asia
    • History and Society.Society
    • STEM.Mathematics
    • Culture.Sports (in Vietnamese)
  • It seems common that articles about women-related organizations or subjects end up in the Culture.Biography.Women topic; some examples of articles with high scores are given earlier in this task.
  • Culture.Linguistics contains many articles that are disambiguations of names, like this one. Perhaps we should handle disambiguation pages differently, or exclude them from the results we give to newcomers.
  • We noticed a surprisingly high ratio of articles about professional wrestling under Culture.Media.Entertainment.
  • In Arabic Wikipedia, there are a lot of articles about Moroccan sports championships being drawn into surprising topics. The thing that all those articles have in common is this template.
  • In History and Society.Education in Vietnamese, there are articles about international organizations such as the IMF, Conservation International, the International Federation of Surveyors, and Transparency International.

Hopping in here because this is some really fantastic feedback on the taxonomy!

Culture.Linguistics contains many articles that are disambiguations of names, like this one. Perhaps we should handle disambiguation pages differently, or exclude them from the results we give to newcomers.

The Culture.Linguistics connection to names is because of WikiProject Anthroponymy (yaml). Many of those name pages aren't actually disambiguation pages in English -- e.g., the Robert example. I prefer the route of filtering out disambiguation pages after the fact and leaving Anthroponymy in, because it legitimately does belong in Linguistics. I'm willing to be convinced otherwise, though. Interestingly, many of the pages it covers are actually redirects in English -- e.g., the page Churchill, which redirects to Winston Churchill.
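
To make "filtering out disambiguation pages after the fact" concrete, here is a minimal sketch using the MediaWiki API's pageprops. It assumes the wiki runs the Disambiguator extension (which sets the "disambiguation" page property); the function names are mine, and a real implementation would batch titles rather than query them one at a time.

```python
# Sketch: drop disambiguation pages from a list of suggested articles, using the
# "disambiguation" page property set by the Disambiguator extension.
import requests

def is_disambiguation(title, api="https://en.wikipedia.org/w/api.php"):
    """True if the page carries the 'disambiguation' page property."""
    resp = requests.get(api, params={
        "action": "query", "prop": "pageprops", "ppprop": "disambiguation",
        "titles": title, "format": "json"}).json()
    page = next(iter(resp["query"]["pages"].values()))
    return "disambiguation" in page.get("pageprops", {})

def filter_suggestions(titles, api="https://en.wikipedia.org/w/api.php"):
    """Keep only non-disambiguation pages (queries one title at a time for clarity)."""
    return [t for t in titles if not is_disambiguation(t, api)]
```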

For History and Society.Society, there is a mixture of ethnic groups, general human rights, and a few odd inclusions such as Forestry (yaml) -- I'm happy to work with others to clean that up a bit. Probably could move the different ethnic/language groups to Linguistics or appropriate geographies and let the topic focus on Sociology, Feminism, Human Rights, etc.

We noticed a surprisingly high ratio of articles about professional wrestling under Culture.Media.Entertainment.

Entertainment is kinda a mixed bag (yaml) that does include wrestling and could probably be dispersed to other topics.

It seems common that articles that are about women-related organizations or subjects end up in the Culture.Biography.Women topic.

I've also been doing some more thinking about the Women's biographies topic, and I am not sure it should be something that we seek to predict (or at least we should be very hesitant to surface it in interfaces). I just don't know how we could ever get good enough predictions for it without relying on explicit cues like Wikidata properties. For my Wikidata-based topic models (or the article recommendations that we've put together for WikiGap T244587), I've been explicitly filtering so that the Culture.Biography.Women prediction is only returned if the Wikidata properties for sex-or-gender match a predefined list. I think I would be much more comfortable with us using the topic, for instance, to make sure that when we recommend biographies to be created/improved, we are not just recommending men.
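
For concreteness, a minimal sketch of that kind of Wikidata gate is below. The allow-list of QIDs is deliberately tiny and illustrative, the function names are mine, and a real implementation would use a carefully curated list and batch the lookups.

```python
# Sketch: only keep a Culture.Biography.Women prediction when the article's Wikidata
# item has a sex-or-gender (P21) value in an allow-list.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
# Q6581072 = female, Q1052281 = transgender female -- illustrative, not a complete list.
WOMEN_QIDS = {"Q6581072", "Q1052281"}

def sex_or_gender_qids(title, wiki="enwiki"):
    """P21 values of the Wikidata item linked to the given article."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities", "sites": wiki, "titles": title,
        "props": "claims", "format": "json"}).json()
    qids = set()
    for entity in resp.get("entities", {}).values():
        for claim in entity.get("claims", {}).get("P21", []):
            value = claim["mainsnak"].get("datavalue", {}).get("value", {})
            if "id" in value:
                qids.add(value["id"])
    return qids

def gate_women_biography(title, predicted_topics, wiki="enwiki"):
    """Drop Culture.Biography.Women unless P21 confirms the subject matches the allow-list."""
    if "Culture.Biography.Women" not in predicted_topics:
        return predicted_topics
    if sex_or_gender_qids(title, wiki) & WOMEN_QIDS:
        return predicted_topics
    return [t for t in predicted_topics if t != "Culture.Biography.Women"]
```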

@Halfak @Isaac -- also helpful might be that you can now actually test the whole workflow in beta:

You'll have to enable your homepage via your preferences, and then you can select articles through the suggested edits module. The articles shown are the result of two steps:

  • We consolidated some of the 64 ORES topics into a smaller set of 39 topics via this mapping.
  • We then applied the thresholds in T244297, using them to select 250 articles with scores above the threshold, which are then sorted randomly. (A rough sketch of these two steps follows this list.)
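
A rough sketch of those two steps, with purely illustrative mapping and threshold values (the real 64-to-39 mapping and the per-topic thresholds live in the mapping linked above and in T244297):

```python
# Sketch: consolidate ORES topics into the smaller UI topic set, keep only articles whose
# score clears the per-topic threshold, and return up to 250 of them in random order.
import random

# Hypothetical fragment of the 64 -> 39 consolidation mapping (not the real mapping).
TOPIC_MAPPING = {
    "STEM.Technology": "STEM.Technology",
    "STEM.Engineering": "STEM.Technology",  # example of two ORES topics merged into one UI topic
}

# Hypothetical per-UI-topic thresholds; T244297 defines the real values.
THRESHOLDS = {"STEM.Technology": 0.7}
DEFAULT_THRESHOLD = 0.5  # assumption for topics without an explicit threshold

def select_articles(scored_articles, ui_topic, sample_size=250):
    """scored_articles: iterable of (title, ores_topic, score) tuples."""
    matching = [
        title
        for title, ores_topic, score in scored_articles
        if TOPIC_MAPPING.get(ores_topic) == ui_topic
        and score >= THRESHOLDS.get(ui_topic, DEFAULT_THRESHOLD)
    ]
    random.shuffle(matching)
    return matching[:sample_size]
```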

Hopping in here because this is some really fantastic feedback on the taxonomy!

Culture.Linguistics contains many articles that are disambiguations of names, like this one. Perhaps we should handle disambiguation pages differently, or exclude them from the results we give to newcomers.

The Culture.Linguistics connections to names is because of WikiProject Anthroponymy (yaml). Many of those name pages aren't actually disambiguation pages in English -- e.g., the Robert example. I prefer the route of filtering out disambiguation pages after the fact and leaving Anthroponymy in because it legitimately does belong in Linguistics. I'm willing to be convinced otherwise though. Interestingly too, many of the pages it covers are actually redirects in English -- e.g., the page Churchill, which redirects to Winston Churchill.

@Isaac The cases you mentioned -- name pages that aren't actually disambiguation pages -- only apply to English, though. In Vietnamese it's entirely different: half of the first 10 articles in "Culture.Linguistics" are name disambiguations, and they literally are disambiguation pages (Isabella, Andrew...). The other half are date articles (27 tháng 2 (Feb 27), 18 tháng 4 (Apr 18)...).

We noticed a surprisingly high ratio of articles about professional wrestling under Culture.Media.Entertainment.

Entertainment is kinda a mixed bag (yaml) that does include wrestling and could probably be dispersed to other topics.

The problem is that more than half of the results are about wrestling and wrestlers, and in my opinion that's way too high. I'm concerned about the unbalanced ratio between wrestlers and other subjects. Or is it because there are more articles about wrestlers in my language compared to other subjects?

My thoughts (I still have a quarter left to go through):

  • There are some topics for which the local model is much worse than the crosswalk model.
  • On the other hand, some topics had Czech-specific results among their local results -- which obviously cannot happen with the crosswalk model.
  • Generally, the local and crosswalk models perform similarly to each other -- I don't see any user-facing difference between them.
  • The geographical topics are split too finely, IMO -- while I can tell what is related to the Americas and what is related to Europe from reading the first few lines of an article, I'm not really educated on the differences between North and South America. Someone who's fond of geography may appreciate that, but it's not good for the general public IMO.

@Urbanecm -- thank you for the notes. About the geographic topics: we thought maybe people would want to select the geography that they live in. Like, a European user might select "Europe" and an African user might select "Africa", etc. What do you think about that?

I'm capturing cleanup work in T246909: Follow-up cleanup to topic models

I made subtasks for filtering out disambiguation articles and for cleaning up History & Society.Society.

Re. women-related orgs showing up under Culture.Biography.Women: is that really a problem? From a topical-interest perspective, it seems about right. It seems that having "Women" appear under "Biography" might be the real problem. For example, this is likely due to projects like WikiProject Women scientists tagging articles like https://en.wikipedia.org/wiki/Women_in_science, which is not a biography but is certainly interesting to people who are interested in actual women who are scientists. Should we accept this as a limitation, change the name of the topic, or do something with Wikidata to filter out women's-interest topics that are not literally about women?

Re. the domination of Entertainment by wrestlers, this is a funny and weird phenomenon. Wrestlers obviously belong in Entertainment, but their dominating the class is problematic. For whatever reason, English Wikipedians seem to love wrestlers and have built extensive coverage of them. What strategies could we use to filter wrestlers out of general entertainment? I'm honestly not sure. I guess we could give wrestlers their own category. Given the massive interest and coverage, it seems like that might be productive.

Wrt. some subtype dominating a topic, note that this might not affect the Growth use case, as we might end up not using the scores in any way other than a threshold cutoff (see T242476: Newcomer tasks: when selecting multiple topics, one topic should not dominate over the others for more discussion). In general it would still be nice to avoid it, of course.

@MMiller_WMF, @Trizek-WMF: The Due Date set for this open task is three months ago. Can you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!

@Halfak -- is this task still part of your work, or should we resolve it?

I think the evaluation is complete and we can resolve this task. There are improvements we'd like to make as follow-up. Those have been deprioritized at the moment, but they are still on our backlog.