Incorrect backlog cutoff for redirects in the New Page Patrol queue
Open, Normal · Public · 2 Story Points

Description

The backlog length for the New Pages Queue for articles is 90 days, but several weeks ago editors realized that redirects were dropping off of the queue after only 30 days. In practice, this means that many if not most redirects will not be reviewed. This problem would be solved if the backlogs were both 90 days long.

Note that currently there should be no redirects older than 20-something days in the queue, as once we noticed the problem a few editors made sure to keep the back of the queue patrolled.

Event Timeline

Rosguill created this task. · Jul 4 2019, 4:31 AM
Restricted Application added a subscriber: Aklapper. · Jul 4 2019, 4:31 AM
Rosguill updated the task description. · Jul 4 2019, 4:32 AM
JJMC89 added a subscriber: JJMC89.

It looks like cron/updatePageTriageQueue.php is using a hardcoded 30 days instead of the configured PageTriageMaxAge to remove pages from the queue.
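The difference between the two cutoffs can be sketched as follows. This is an illustration in Python, not the extension's PHP code: the function name and the sample dates are hypothetical, and the 90-day value assumes PageTriageMaxAge matches the article backlog length given in the description.

```python
from datetime import datetime, timedelta, timezone

HARDCODED_REDIRECT_DAYS = 30   # the value the cron script appears to use
CONFIGURED_MAX_AGE_DAYS = 90   # assumed value of PageTriageMaxAge


def is_expired(created, now, max_age_days):
    """Return True if a page created at `created` is older than the cutoff."""
    return now - created > timedelta(days=max_age_days)


# Hypothetical redirect created 45 days before the check runs.
now = datetime(2019, 7, 4, tzinfo=timezone.utc)
redirect_created = now - timedelta(days=45)

dropped_with_hardcoded = is_expired(redirect_created, now, HARDCODED_REDIRECT_DAYS)
dropped_with_configured = is_expired(redirect_created, now, CONFIGURED_MAX_AGE_DAYS)
print(dropped_with_hardcoded, dropped_with_configured)
```

Under these assumptions the 45-day-old redirect is dropped by the hardcoded cutoff but would stay in the queue under the configured one.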

Restricted Application added a project: Growth-Team. · Jul 4 2019, 9:23 PM
Rosguill updated the task description. · Jul 8 2019, 7:34 AM
JTannerWMF moved this task from Inbox to External on the Growth-Team board.
JTannerWMF added subscribers: Niharika, JTannerWMF.

We are tagging the Community-Tech team, as this is a top wishlist item. CC: @Niharika

Niharika triaged this task as Normal priority. · Jul 9 2019, 11:34 PM
Niharika set the point value for this task to 2.
Niharika moved this task from To be estimated/discussed to Estimated on the Community-Tech board.
ifried added subscribers: MusikAnimal, ifried. · Edited · Jul 24 2019, 9:24 PM

Update: When @MusikAnimal looked into the history of the code, this behavior (i.e. redirects dropped after 30 days in the queue) appeared to be an intentional choice/by design, though we do not know the reason why (perhaps database storage issues?). @Barkeep49, do you know why this decision may have been made?

The patch that changed redirects to purge after 30 days was https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/24939, which was merged by @kaldari. Perhaps he remembers the reasoning?

My suspicion is simply performance. At the time of writing, there are 31,741 redirects in the queue (reviewed + unreviewed, over 30 days). This is versus 22,646 mainspace pages + 4,086 drafts + 17,983 user pages = 44,715 non-redirects. So if we triple the redirects (90 days), we have 95,223 pages -- over twice as many as all other types of pages combined. It's surely the case that the database servers in 2012 were not as powerful as they are today, so maybe we can get away with storing that much more data. It's still an awful lot, though.
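The arithmetic above can be checked with a short Python snippet. The counts are the ones quoted in the comment; tripling the redirect count assumes redirects are created at a roughly uniform rate, which is only an approximation.

```python
# Counts quoted in the comment above.
redirects_30_days = 31_741
mainspace, drafts, user_pages = 22_646, 4_086, 17_983

non_redirects = mainspace + drafts + user_pages

# Assumption: a 90-day window holds about three times the 30-day count.
redirects_90_days = redirects_30_days * 3

ratio = redirects_90_days / non_redirects
print(non_redirects, redirects_90_days, round(ratio, 2))
```

This reproduces the 44,715 and 95,223 figures, with redirects at roughly 2.13 times all other page types combined.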

We should weigh the costs and benefits. Patrolling redirects seems much less specialized than new page reviewing. I suspect your average [[offensive term]] redirect to [[famous person]] will get picked up by recent changes patrollers, no? Do we have any idea how many redirects are typically deleted or corrected as a result of NPP?

This discussion might be better held at https://en.wikipedia.org/wiki/Wikipedia_talk:New_pages_patrol/Reviewers but pinging @Rosguill and @DannyS712 as the two people who have dived deep into redirects most recently.

I'd say that on a typical day of patrolling the back end of the queue, I'll go through 150-300 articles, send 5-10 to RfD, tag around 5 with G5 or R3, and either retarget or convert-to-dab 5 more. Attack redirects are less frequent; I'll come across a handful per week.

These numbers can swing quite a bit though, because errors and/or vandalism on redirects are often repeated by the same editor multiple times in a row. There have been days where 30+ redirects end up being bundled together for RfD, and on some days I've run into 10+ G5s in a row.

DannyS712 added a comment. · Edited · Jul 25 2019, 12:25 AM

For me, part of the issue is how I review redirects: I have a Chrome extension that lets me mass-open links in new tabs. I go to the Special:NewPages feed of unpatrolled redirects and open 100 at a time, then close any that need a second look or aren't obviously acceptable, and then mark the remaining tabs as patrolled. The rate limit on patrolling means that I have to pause after each one. Something I've been thinking about when it comes to redirect patrolling is extending my bot's task of patrolling redirects to create a pseudo-group of "autopatrolled redirect creators" to ease the load of the thousands of redirects that need to be patrolled, and I was waiting for T223828 to start investigating the task and looking for consensus. I've opened a preliminary discussion at https://en.wikipedia.org/wiki/Wikipedia_talk:New_pages_patrol/Reviewers#Initial_thoughts_-_autopatrolled_redirects. The bot task could save a few dozen per week, but I think personally the ratelimit on patrolling redirects is a big hurdle.

Okay so it sounds like there is a healthy amount of redirect patrolling using Page Curation. I suppose let's talk to the DBAs about this. If they think all those extra rows are fine, then there's no reason not to bump the expiry to 90 days, provided you're okay with the longer backlog.

I think personally the ratelimit on patrolling redirects is a big hurdle.

Are you sure it's a rate limit on patrolling, and not the MediaWiki API itself?

DannyS712 added a comment. · Edited · Jul 25 2019, 1:08 AM

The popup window says "An error occurred while marking the page as reviewed: You've exceeded your rate limit. Please wait some time and try again." This only appears to arise when using the Page Curation toolbar; using the "mark this page as patrolled" button at the bottom when the toolbar is disabled works fine (though that button moves around on the page if there is a redirect template, so the toolbar is more convenient). https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=ratelimits says that my rate limits are:

Ratelimits
{
    "batchcomplete": "",
    "query": {
        "userinfo": {
            "id": 34581532,
            "name": "DannyS712",
            "ratelimits": {
                "move": {
                    "user": {
                        "hits": 8,
                        "seconds": 60
                    },
                    "extendedmover": {
                        "hits": 16,
                        "seconds": 60
                    }
                },
                "edit": {
                    "user": {
                        "hits": 90,
                        "seconds": 60
                    }
                },
                "badcaptcha": {
                    "user": {
                        "hits": 30,
                        "seconds": 60
                    }
                },
                "emailuser": {
                    "user": {
                        "hits": 20,
                        "seconds": 86400
                    }
                },
                "changeemail": {
                    "user": {
                        "hits": 4,
                        "seconds": 86400
                    }
                },
                "rollback": {
                    "user": {
                        "hits": 10,
                        "seconds": 60
                    },
                    "rollbacker": {
                        "hits": 100,
                        "seconds": 60
                    }
                },
                "purge": {
                    "user": {
                        "hits": 30,
                        "seconds": 60
                    }
                },
                "linkpurge": {
                    "user": {
                        "hits": 30,
                        "seconds": 60
                    }
                },
                "renderfile": {
                    "user": {
                        "hits": 700,
                        "seconds": 30
                    }
                },
                "renderfile-nonstandard": {
                    "user": {
                        "hits": 70,
                        "seconds": 30
                    }
                },
                "cxsave": {
                    "user": {
                        "hits": 10,
                        "seconds": 30
                    }
                },
                "urlshortcode": {
                    "user": {
                        "hits": 50,
                        "seconds": 120
                    }
                },
                "pagetriage-mark-action": {
                    "user": {
                        "hits": 1,
                        "seconds": 3
                    }
                },
                "pagetriage-tagging-action": {
                    "user": {
                        "hits": 1,
                        "seconds": 10
                    }
                },
                "thanks-notification": {
                    "user": {
                        "hits": 10,
                        "seconds": 60
                    }
                },
                "badoath": {
                    "user": {
                        "hits": 10,
                        "seconds": 60
                    }
                }
            }
        }
    }
}
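To put the two PageTriage entries above in perspective, here is a rough Python sketch that converts each limit into actions per minute and estimates the minimum time for the 100-tab batch described earlier. It assumes a simple model in which the limit is exactly `hits` actions per `seconds` window; the real throttle may behave differently at window boundaries.

```python
import json

# The two PageTriage limits copied from the API response above.
ratelimits_json = """
{
    "pagetriage-mark-action": {"user": {"hits": 1, "seconds": 3}},
    "pagetriage-tagging-action": {"user": {"hits": 1, "seconds": 10}}
}
"""

limits = json.loads(ratelimits_json)

# Actions allowed per minute under each limit.
per_minute = {
    name: groups["user"]["hits"] * 60 / groups["user"]["seconds"]
    for name, groups in limits.items()
}

# Approximate minimum time to mark a batch of 100 pages as reviewed.
mark = limits["pagetriage-mark-action"]["user"]
batch_size = 100
batch_seconds = batch_size * mark["seconds"] / mark["hits"]

print(per_minute)
print(batch_seconds)
```

Under this model, marking pages as reviewed is capped at about 20 per minute, so a batch of 100 takes roughly five minutes regardless of how quickly the reviewer works.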

The popup window says An error occurred while marking the page as reviewed: You've exceeded your rate limit. Please wait some time and try again.
...

Got it, I see now where this is happening in the code https://github.com/wikimedia/mediawiki-extensions-PageTriage/blob/master/includes/Api/ApiPageTriageAction.php#L31-L33. I think we could easily exempt redirects from this, if there was consensus to do so.

DannyS712 added a comment. · Edited · Jul 25 2019, 1:23 AM

I'm not sure that is the best idea though - per BEANS I've emailed you my concern - feel free to post it here if you think it isn't an issue

Not quite BEANS-worthy in my opinion, but thanks for the caution! I wanted to state that removing this throttling for redirects is merely technically possible. Is there anyone else hitting the rate limit? I ask because you have a bot account to get around it, no? Anyway we might be getting a little off-topic :)

I'll try to do the math to see just how much of an impact the extra redirects will have on database storage, considering there's associated metadata too (pagetriage_page_tags), and not just the rows in pagetriage_page. I also noticed that while we don't expose things like category/reference counts (and even AfC state!) for redirects, there are still rows for them in the database, so maybe fixing that would give us more wiggle room.
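One possible way to frame that estimate, as a Python sketch only: the function is hypothetical, it assumes a uniform creation rate, and `tags_per_page` is a placeholder that would need to be measured from pagetriage_page_tags rather than taken from this example.

```python
def extra_rows(current_pages, current_days, target_days, tags_per_page):
    """Estimate additional rows if the retention window grows.

    Assumes pages are created at a uniform rate, so the page count
    scales linearly with the length of the window.
    """
    scaled_pages = current_pages * target_days // current_days
    extra_pages = scaled_pages - current_pages
    return {
        "pagetriage_page": extra_pages,
        "pagetriage_page_tags": extra_pages * tags_per_page,
    }


# 31,741 redirects is the 30-day count quoted earlier in this task.
# tags_per_page=10 is a placeholder, not a measured value.
estimate = extra_rows(31_741, 30, 90, tags_per_page=10)
print(estimate)
```

The point of the sketch is that the metadata table grows by a multiple of the page table, so dropping unused tag rows for redirects would reduce the second figure directly.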

I removed the Community-Tech tag from this ticket, as the potential changes are not a part of the Community Wishlist Survey.