Provide historical redirect flag in Data Lake edit data
Open, Low, Public

Description

We already have page_is_redirect_latest in the mediawiki_page_history table, but it would be very useful to have an is_redirect flag for each revision.

This would allow us to analyze article trends—such as deletion rates, average edit rates, or article counts—over time while properly filtering out redirects (which can significantly alter the conclusions). Knowing whether a page is currently a redirect can help with this, but redirects are sometimes turned into regular pages and vice versa. For example, on the English Wikipedia, it's relatively common that a new user creates an article which duplicates an existing one, and a patroller reacts by turning that new page into a redirect to the existing article.

Event Timeline

nshahquinn-wmf renamed this task from Provide historical redirect information in Data Lake edit data to Provide historical redirect flag in Data Lake edit data. Mar 22 2017, 7:21 PM

We will be able to do this once our changes to text parsing (content, not metadata) are final.

Nuria triaged this task as Medium priority. Mar 27 2017, 3:40 PM
Nuria moved this task from Incoming to Dashiki on the Analytics board.

Loading and preprocessing of the XML dumps has been tested, and it works.
Text parsing works as well.
However, some issues remain, the biggest being that the #REDIRECT command in MediaWiki is internationalized: depending on the language, multiple names can be used.
A first step would be to find out where the list of command names for each project is kept.

Reedy helped me find it, of course :) Thanks!!!

Here it is for French, and it's in the same place in all the Messages*.php files in that same folder:
https://github.com/wikimedia/mediawiki/blob/master/languages/messages/MessagesFr.php#L144
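
For illustration, here is a minimal sketch of how a per-revision is_redirect flag could be derived from revision text once the localized names are known. This is not the actual Data Lake implementation: the REDIRECT_ALIASES table, the function names, and the fallback to the English #REDIRECT are all assumptions, and the alias entries are illustrative placeholders that would really be loaded from the Messages*.php files.

```python
import re

# Hypothetical alias table keyed by language code. The entries are
# illustrative only; the real lists live in the Messages*.php files.
REDIRECT_ALIASES = {
    "en": ["#REDIRECT"],
    "fr": ["#REDIRECTION", "#REDIRECT"],
}


def build_redirect_pattern(aliases):
    """Compile a regex matching any alias followed by a [[target]] link."""
    # Longest aliases first, so a shorter alias that is a prefix of a
    # longer one never shadows it in the alternation.
    ordered = sorted(aliases, key=len, reverse=True)
    alternatives = "|".join(re.escape(alias) for alias in ordered)
    return re.compile(
        r"^\s*(?:" + alternatives + r")\s*:?\s*\[\[[^\]]+\]\]",
        re.IGNORECASE,
    )


def is_redirect(wikitext, lang):
    """Return True if the revision text starts with a redirect command."""
    # Assumption: fall back to the English keyword for unknown languages.
    aliases = REDIRECT_ALIASES.get(lang, ["#REDIRECT"])
    return build_redirect_pattern(aliases).match(wikitext) is not None
```

Applied to every revision of a page, a check like this would give the historical flag; a production version would compile each pattern once per language rather than on every call.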

We cannot do this without plain text parsing, so moving this to Q4

Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Nuria lowered the priority of this task from Medium to Low. Apr 5 2018, 4:50 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.