
Filter out disambiguation pages in topic labels
Open, Low, Public

Description

Culture.Linguistics contains many articles that are disambiguations of names, like this one. Perhaps we should handle disambiguation pages differently, or exclude them from the results we give to newcomers.

The Culture.Linguistics connection to names comes from WikiProject Anthroponymy (yaml). Many of those name pages aren't actually disambiguation pages in English -- e.g., the Robert example. I prefer filtering out disambiguation pages after the fact and leaving Anthroponymy in, because it legitimately belongs in Linguistics. I'm willing to be convinced otherwise, though. Interestingly, many of the pages it covers are actually redirects in English -- e.g., the page Churchill, which redirects to Winston Churchill.

We should implement a strategy for filtering out disambiguation pages as part of our modeling pipeline.

Note that the check has to be made per wiki: we'll need to know whether a specific page on a specific wiki is a disambiguation page in order to exclude it from the set.
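
One possible way to do the per-wiki check, sketched here as an illustration rather than as the pipeline's actual approach: on wikis running the Disambiguator extension, disambiguation pages carry a `disambiguation` page property, which the MediaWiki Action API exposes through `prop=pageprops`. The helper names below (`build_query_url`, `fetch_pageprops`, `disambiguation_titles`, `filter_out_disambiguations`) and the User-Agent string are hypothetical. The sketch assumes the titles passed in are already in canonical form (no redirects or normalization to resolve) and that each batch is small enough for a single request.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(wiki_host, titles):
    """Build an Action API URL asking only for the 'disambiguation' page prop."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": 2,
    }
    return "https://{}/w/api.php?{}".format(
        wiki_host, urllib.parse.urlencode(params)
    )


def fetch_pageprops(wiki_host, titles):
    """Fetch the API response for one batch of titles (performs a network call)."""
    request = urllib.request.Request(
        build_query_url(wiki_host, titles),
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "disambig-filter-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def disambiguation_titles(response):
    """Return the titles in the response that carry the 'disambiguation' prop."""
    pages = response.get("query", {}).get("pages", [])
    return {
        page["title"]
        for page in pages
        if "disambiguation" in page.get("pageprops", {})
    }


def filter_out_disambiguations(titles, response):
    """Drop titles flagged as disambiguation pages, preserving input order."""
    flagged = disambiguation_titles(response)
    return [title for title in titles if title not in flagged]
```

Because the property is looked up against one wiki's API, the same title has to be checked separately for each wiki. Filtering this way happens after scoring, so it does not require changing which pages the model scores.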

Event Timeline

@Halfak -- are you saying that this would exclude disambiguation pages from model training? Or from model scoring? I still think that disambiguation pages should get model scores, because it would be weird to have some article pages with no scores. We can exclude them on our side, and keep them from being recommended to newcomers as tasks to do.

Halfak triaged this task as High priority. Mar 23 2020, 4:56 PM
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.
Halfak lowered the priority of this task from High to Low. May 4 2020, 5:16 PM

It doesn't seem like there's a straightforward way to fix this. The model does look like it's working pretty well, so I'm deprioritizing this work.