Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages)
Closed, Resolved · Public

Description

Consider the following sequence of events in the context of Portuguese Wikipedia:

  • March 2018: São Carlos was moved to São Carlos (desambiguação)
  • March 2018: Discussão:São Carlos was moved to Discussão:São Carlos (desambiguação)
  • March 2018: São Carlos (São Paulo) was moved to São Carlos
  • March 2018: Discussão:São Carlos (São Paulo) was moved to Discussão:São Carlos
  • April 2020: São Carlos was moved to São Carlos (São Paulo)
  • April 2020: Discussão:São Carlos was moved to Discussão:São Carlos (São Paulo)
  • April 2020: São Carlos (desambiguação) was moved to São Carlos
  • April 2020: Discussão:São Carlos (desambiguação) was moved to Discussão:São Carlos
  • April 2020: Labels were extracted from ptwiki-20200301-pages-meta-history*.xml*.bz2, including these:
{"timestamp": "20081220171253", "project": "marca de projeto", "wp10": "3", "page_title": "S\u00e3o Carlos"}
{"timestamp": "20171109163710", "project": "marca de projeto", "wp10": "4", "page_title": "S\u00e3o Carlos"}

Both of these timestamps appear in the history of Discussão:São Carlos (São Paulo), so the labels refer to the article São Carlos (São Paulo).
However, when the text was fetched from ptwiki's API, it came from São Carlos, which is now a disambiguation page, and not from the article São Carlos (São Paulo) to which the labels refer.
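
To make the mismatch concrete, here is a minimal sketch of the lookup problem. It is only an illustration: the page IDs `1` and `2` and the `move` helper are made up, and it models just the April 2020 moves of the subject pages, not articlequality's actual code.

```python
def move(titles, old, new):
    """Rename a page: the title changes, the page ID stays the same."""
    titles[new] = titles.pop(old)


# Hypothetical page IDs, for illustration only.
ARTICLE, DISAMBIG = 1, 2

# State of the wiki when the 20200301 dump was taken (after the March 2018 moves).
titles = {
    "São Carlos": ARTICLE,
    "São Carlos (desambiguação)": DISAMBIG,
}

# The label stores only the title seen in the dump.
label_title = "São Carlos"
labelled_page = titles[label_title]

# April 2020 moves, applied before the text is fetched from the API.
move(titles, "São Carlos", "São Carlos (São Paulo)")
move(titles, "São Carlos (desambiguação)", "São Carlos")

# Looking the same title up again now resolves to a different page.
fetched_page = titles[label_title]
print(labelled_page == ARTICLE, fetched_page == DISAMBIG)
```

The title stored with the label still resolves, but to a different page than the one that was assessed, so the lookup does not fail and nothing flags the error.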

I don't know how often this mismatch between the text and the labels happens in the full datasets, but it should be fixed.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'm really not sure how we could track this better. Do you have some suggestions @He7d3r?

Halfak triaged this task as Low priority. May 4 2020, 4:57 PM
Halfak moved this task from Unorganized to Maintenance/cleanup on the Machine-Learning-Team board.

While the dumps are processed, we could store the <id> of the talk pages instead of their <title>s. Then, an API query such as
https://pt.wikipedia.org/w/api.php?action=query&format=json&prop=info&pageids=18363&formatversion=2&inprop=subjectid
will return the <id> of the associated subject page (the one whose text we are interested in). This should work when pages are moved, since page moves do not change the pageid (but it is not guaranteed if the page is deleted and restored).
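
A rough sketch of how that lookup could be wired up, assuming the talk page IDs were stored while processing the dump. The function names `build_subject_query` and `map_talk_to_subject` are hypothetical (not taken from articlequality), and the `subjectid` value in the sample response is invented to keep the example offline:

```python
from urllib.parse import urlencode

API_URL = "https://pt.wikipedia.org/w/api.php"


def build_subject_query(talk_page_ids):
    """Build the prop=info query URL for a batch of talk page IDs."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "info",
        "pageids": "|".join(str(i) for i in talk_page_ids),
        "formatversion": "2",
        "inprop": "subjectid",
    }
    return API_URL + "?" + urlencode(params)


def map_talk_to_subject(response):
    """Map talk page ID -> subject page ID from a formatversion=2 response.

    Pages without a "subjectid" (e.g. missing or deleted pages) are skipped,
    so the caller can decide what to do with unresolved labels.
    """
    mapping = {}
    for page in response.get("query", {}).get("pages", []):
        if "subjectid" in page:
            mapping[page["pageid"]] = page["subjectid"]
    return mapping


# Sample response shape; the subjectid 11111 is invented for illustration.
sample = {
    "batchcomplete": True,
    "query": {
        "pages": [
            {"pageid": 18363, "ns": 1, "subjectid": 11111},
            {"pageid": 99999, "missing": True},
        ]
    },
}

print(build_subject_query([18363]))
print(map_talk_to_subject(sample))
```

The text would then be fetched by the subject page ID rather than by title, so later page moves would not change which page is read.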

See https://github.com/wikimedia/articlequality/pull/126

I like it. Thanks for the PR. :)