
Extend PageTriageMaxAge (noindex) for unpatrolled articles at enwiki
Closed, DeclinedPublic

Description

In 2012, an RFC decided that unpatrolled articles should be NOINDEXED until reviewed and accepted. In 2016, it was implemented differently as discussed here. At that time, it was assumed that new articles would be patrolled within 90 days and it would be sufficient to only honor NOINDEX on mainspace articles for 90 days. The New Page Reviewer user group is at present unable to review articles within 90 days (there is a backlog of 14,000) and requests that the no-indexing of unreviewed articles be extended indefinitely (as discussed here) per the 2012 RFC.

Request: Set wgPageTriageMaxAge to a value higher than 90. Suggested: 365 days, later infinite duration (null), per discussion


Grafana showing pages returning a NOINDEX state: https://grafana.wikimedia.org/d/GDZR_4IVz/pagetriage-debugging?orgId=1&from=now-7d&to=now&refresh=1m

Event Timeline


@Stang

Hello everyone, Phabricator is not a place to reach community consensus, please discuss on English Wikipedia, thanks.

Quite right - this is not the place to relitigate a recent consensus.
The issue has been discussed a total of 3 times, each time reaching a consensus. Following the 10-year-old strong consensus, a WMF staff member confirmed that it would be developed, but it was later conveniently swept under the carpet. It has been discussed and reaffirmed recently (see the links I posted above). Consensus has not changed. @IAmChaos reiterates the exact reasons for which the consensus was reached.

A patch has been written; can we please now have it implemented? Thanks.

The discussion has now been listed at Village Pump (Proposals) for 12 days, which has brought a couple more opinions. There are many who feel that there is clear consensus between the original RFC and the recent NPP discussion to extend until reviewed. There are still some who think this is a major change and needs more discussion. However, I see no one objecting to an extension of the time period. Can we get the extension to 365 implemented NOW, at least as an interim step? This would address the immediate problem and leave more time to debate the indefinite issue and whether it has consensus.

If someone else has already tested or otherwise verified this functionality, then ignore this. But from a quick glance, it appears that this feature might not work as intended.

Change 807910 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Allow disabling the noindex age threshold

https://gerrit.wikimedia.org/r/807910

This patch, as I understand it, modifies the private PageTriage isPageNew() function such that it always returns true, regardless of when a page was created or whether it has been triaged yet. This is an internal function that is not used by many things, which is why it can be so basic: it doesn't really affect the meaning of "new" in places where one might refer to such a concept on-wiki. I believe its main purpose is through shouldShowNoIndex(), where it is combined with doesPageNeedTriage(), which checks the database for whether the page has been triaged yet.

It appears this would result in large numbers of pages that predate PageTriage (possibly a majority of the site?) being considered untriaged and thus marked as noindex, while also probably not being triage-able, as they lack an entry in the database indicating they are pending review. It's possible that it might also not work correctly with mechanisms for skipping/excluding triage or deleting/purging triage metadata.

@Krinkle : It is most unlikely that this affects a majority of the 6.5 million articles on the site.

As far as we at en.Wiki have understood, there are no technical constraints, as confirmed by @MusikAnimal, and a patch had been made by @taavi, tested, and approved at https://gerrit.wikimedia.org/r/808424. This functionality is a request for Page Triage/Curation (New Page Review), the official policy and only new-article control system on en.Wiki.

As well as the 2012 RfC, NOINDEX was part of the WMF's original concept for the Page Curation development in 2012, and again in its extensive 2018 improvement to accommodate further features. This was an important development project, and this feature, like a couple of other details, was overlooked when the new system went live and again when it was later updated.
The 2022 consensus was by way of affirmation that NOINDEX is still needed, and now urgently required. The coordinators @mb and @Novem Linguae are still waiting for it to be implemented.

There are currently 150,000 pagetriage records in total to date (ref Quarry).

This includes a number of records for pages that aren't triaged, and records for pages that weren't (yet) excluded at the time by namespace configurations (ref Quarry 2). For example, the 150K records include a reviewed=false (0) record for the 2009-created talk page at https://en.wikipedia.org/?curid=25198255 ("Wikipedia talk:WikiProject Faroe Islands/Archive 1"), as well as many records for pages that no longer exist (possibly from before PageTriage's cleanup mechanism was developed).

What mechanism is preventing the millions of articles, both those from before PageTriage was deployed as well as all articles created during PageTriage that weren't triaged within the respective maxage that applied at the time, from getting retroactively no-indexed?

As previously mentioned 'millions' would be a great exaggeration. There are 'only' 6.5 million articles in the entire enWikipedia.

I've only been on Wikipedia for 17 years and I was only an admin for 9 of those, but there has always been a system of sorts for patrolling new pages and checking them as reviewed. For historical reasons the old feed still continues at Special:NewPages, and it was being used well before my time. It had no curation system. Its other problem was that pages could be patrolled as 'appropriate' by absolutely anyone and everyone.

The vast majority of pages were nevertheless patrolled after a fashion, but as the encyclopedia reached the zenith of its (at that time exponential) growth, it was impossible for the editors, whoever they were, to cope with the flow and check all pages within the deadline. Hence the new page triage feed and the curation tool were developed in 2012. In 2016 a user right was created to ensure the quality of patrolling, and ACPERM was deployed in 2018 to help stem the tide. While initially extremely successful, due to the English Wikipedia becoming increasingly accessible in non-English-speaking countries and regions of economic development, it is no longer having an effect and new solutions are required.

Although there are around 750 authorised reviewers, only about half of them have ever patrolled a page after asking for the user right, and only around 100 of the remainder make any significant number of patrols at all. It is impossible to maintain a sustainable and workable backlog for clearance within the current time limit, and the situation will only get worse. The NOINDEX feature is now required for the patrollers to have sufficient time to check the pages.

I do not see the relevance of Wikipedia talk:WikiProject Faroe Islands/Archive 1. We are concerned only with Wikipedia mainspace articles, not WikiProjects and/or their talk pages.
It will have to be accepted that pre-PageTriage unpatrolled articles will remain indexed by search engines, so retroactive NOINDEX is not part of the equation. However, increasing abuse not only of Wikipedia but of several user rights themselves makes it imperative that all the more recent pages are triaged as soon, and as quickly, as possible.

Now that you have all the background you need, please ask the developers to implement the patch for this change without further delay.

What mechanism is preventing the millions of articles, both those from before PageTriage was deployed as well as all articles created during PageTriage that weren't triaged within the respective maxage that applied at the time, from getting retroactively no-indexed?

This concern is what I was also asking above: there is often a conflation between "patrolled" pages, "reviewed" pages, and how that does or doesn't impact indexing related to pagetriage. TheresNoTime suggested above this patch is only for some namespaces. New PAGE patrolling is still a workflow, it's just not something that most of the "new page reviewers" concern themselves with much (as they are focused on content pages) - I'm not seeing any current support for changing the indexability of non-content pages in this request - and care should be taken to ensure that unintended side effects are not introduced.

Obviously no one has bothered to read this discussion with @kaldari and @Roan Kattouw. I think there is still a lot of misunderstanding about the role of new page patrollers/reviewers and the workflow involved. I've been deeply concerned with 'NPP' as far back as I can remember, and I was overjoyed when Erik Möller agreed to allocate funds and personally collaborate closely with some patrollers and me to develop the new system.
Has anyone here actually taken a moment to read WP:NPP and WP:NPR? I have - I wrote them.

New page patrolling is new article patrolling: that is what is meant by it and that's what the reviewers do; "patrolled" pages and "reviewed" pages are synonyms for the same thing. No one from the NPP community who made this request is thinking in the slightest of an impact on all namespaces. I for one have focused solely, all these years, on finding ways to keep trash articles out of the encyclopedia corpus while encouraging those new users whose first creations show some potential. It's disheartening for the reviewers that after all these years much of the WMF and some of the volunteers on the sidelines still adamantly stand by their perspective that growth in the quantity of articles is more important than improvement of quality.

Wikipedia is an encyclopedia, and beyond its editors, few people are bothered about its backroom and the 54 million talk pages, policies, and guidelines. There is therefore naturally little or no support for changing the indexability of non-content pages; it's not what we're asking for. Our concern is to maintain a clean encyclopedia and its reputation for reliability. It would be completely impossible and out of the question for the reviewers to patrol all the other namespace pages as well; if that's what the devs at Phab want, they'll have to find another solution.

So let's please not keep looking under every rock for reasons to refuse this urgently required request.

What mechanism is preventing the millions of articles, both those from before PageTriage was deployed as well as all articles created during PageTriage that weren't triaged within the respective maxage that applied at the time, from getting retroactively no-indexed?

This concern is what I was also asking above: there is often a conflation between "patrolled" pages, "reviewed" pages, and how that does or doesn't impact indexing related to pagetriage. TheresNoTime suggested above this patch is only for some namespaces. New PAGE patrolling is still a workflow, it's just not something that most of the "new page reviewers" concern themselves with much (as they are focused on content pages) - I'm not seeing any current support for changing the indexability of non-content pages in this request - and care should be taken to ensure that unintended side effects are not introduced.

For clarity, this patch should just affect the namespaces mentioned based on the flow I described above, but I relied on the assumption that shouldShowNoIndex was only called for pages in those namespaces in this extension — however as @Krinkle rightly points out, I entirely missed the fact that all pages will call the onArticleViewFooter hook, which in turn will call shouldShowNoIndex.

It appears @Krinkle is correct that this will, as it currently stands, cause every non-reviewed page to be NOINDEX'd regardless of namespace.

That being said, pages which weren't added to the page triage table will cause doesPageNeedTriage to return null, which as a falsey value will cause shouldShowNoIndex to return false — I've not yet looked, but I'm hoping only pages in the configured namespaces are added to the page triage table?
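
For reference, the hook flow being described has roughly this shape (a simplified sketch, not the extension's actual code; see PageTriage's Hooks.php for the real implementation):

// Every article view fires this hook, regardless of namespace, so
// namespace filtering cannot rely on the hook itself being skipped.
public static function onArticleViewFooter( $article, $patrolFooterShown ) {
	if ( self::shouldShowNoIndex( $article ) ) {
		// Emit a noindex robots meta tag on this page view.
		$article->getContext()->getOutput()->setRobotPolicy( 'noindex' );
	}
}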

Making everything in File:, Help:, etc become non-indexable seems like a net-negative - and is certainly beyond what the proposers are asking (which is about content namespaces), especially if it includes pages that were already "patrolled" but not "reviewed" in these non-content namespaces.

I agree causing non-article namespaces to be noindex by default and yet also be unreviewable would be a bad thing.

However my more urgent concern is that well over 90% of articles would be no-indexed as well. Has this concern been understood and confirmed as not being the case, or is it still a real possibility? From my quick analysis it appears that all articles ever written would get retroactively no-indexed unless they were reviewed in PageTriage during the 30-90 day window after their creation. E.g. articles that predate PageTriage, or whose review maxage expired in years past, would retroactively become no-indexed and yet also not be in the review queue. I assume that, even if we developed a new mechanism to add millions of (old) articles (back) to the queue, it would be undesired to retroactively no-index them until they get reviewed. Noting that over all the years we've had PageTriage, about 0.1M pages have been reviewed.

To be clear - it remains possible that I've missed a piece of code that would prevent this from happening. I invite you to verify this in local development, Beta Cluster, or test wiki, etc.

We are talking about mainspace only AFAIK, and as far as I understand, there would be no point in un-indexing articles that have already dropped off the queue into the search engines' clutches.

@Krinkle. It took me a couple days to wrap my head around this one but I think I understand well enough to give an answer now. Here's the brains of the NOINDEX check.

private static function shouldShowNoIndex( Article $article ) {
	//...
	return $wgPageTriageNoIndexUnreviewedNewArticles
		&& PageTriageUtil::doesPageNeedTriage( $article->getPage() )
		&& self::isPageNew( $article->getPage() );
}

So let's pretend we have an old article that predates the installation of PageTriage, and step through these three conditionals:

private static function shouldShowNoIndex( Article $article ) {
	//...
	return true        // We have the noindex feature turned on
		&& null    // We have a page so old it predates the installation of PageTriage. However this function will return null (see explanation at end of this post), which is falsey
		&& true;   // Always return true because of taavi's new code
}

So we can simplify to...

private static function shouldShowNoIndex( Article $article ) {
	//...
	return true
		&& false
		&& true;
}

private static function shouldShowNoIndex( Article $article ) {
	//...
	return false;
}

shouldShowNoIndex() returns false, so NOINDEX is not emitted for old articles.

In what part of the code is the old article detected, you might ask? It's not in the isPageNew() function where @taavi made his change (the 3rd conditional above). Rather, it's in the PageTriageUtil::doesPageNeedTriage() function, the 2nd conditional above. Here, specifically:

if ( !$row ) {
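	// No row in pagetriage_page: the page predates PageTriage, sits outside
	// its watched namespaces, or was purged by the 30-day cleanup cron job.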
	return null;
}

If the row isn't found in the pagetriage_page SQL table (which happens all the time: for very old articles, for articles not in PageTriage's watched namespaces, and for articles whose entries are deleted by the cron job that gobbles up marked-as-reviewed articles after 30 days), then the SQL query returns no rows and the function returns null, which is falsey.

I actually typed all this up, then realized this was pretty easy to test. So I went ahead and tested it and can confirm correct behavior in my localhost wiki. But I already did all this typing, so you guys are stuck with my explanation I guess :)

Hopefully that makes sense, and hopefully I understood you correctly. Feel free to correct me if I'm missing something!

Change 808424 abandoned by Stang:

[operations/mediawiki-config@master] enwiki: Raise wgPageTriageMaxAge to indefinite

Reason:

per Krinkle's comment

https://gerrit.wikimedia.org/r/808424

Some queries & numbers for y'all:

A quick query to count per-namespace gives us this breakdown:

Namespace            Unreviewed pages   Notes
Main (0)             8732
Talk (1)             235
User (2)             11898              Already NOINDEX'd
User talk (3)        205
Project (4)          384
Project talk (5)     13
File talk (7)        1
MediaWiki (8)        8
Template (10)        1134
Template talk (11)   13
Help (12)            6
Help talk (13)       1
Category (14)        130
Portal (100)         122
Draft (118)          48978              Already NOINDEX'd
Draft talk (119)     22
TimedText (710)      2
Module (828)         6
Module talk (829)    1

enwiki certainly doesn't have enough volunteers using pagetriage to deal with that - their backlog is what started this request in the first place. Any options to only have pagetriage control indexing in content namespaces?

Well at first glance it looks bad, but 60,876 of that 71,894 are already NOINDEX'd, no?

Some quick maths gives me 71894 (total unreviewed) - 48978 (Draftspace) - 11898 (Userspace) - 8732 (Mainspace) = 2286 additionally undesired NOINDEX'd pages?


Draftspace shouldn't be indexed anyway. Drafts are either new article submissions that have been rejected by the New Page patrollers ('draftified'), or articles created directly in draft space, possibly coming from the Article Wizard, and are a work in progress. Either way, as soon as they are moved to mainspace they will (or should) appear in the New Page Feed for review at NPP.

@Novem_Linguae Thank you. The part that I had missed is that absence from the pagetriage system is interpreted as effectively reviewed=1, i.e. already reviewed or, more precisely, not needing review.

Consider my concern addressed. I'll follow up with a request for this to be documented better. I'm not sure with whom yet, as the extension is pending ownership, but alas, that's a problem for another day.

As for whether the change should go ahead, that is, as it always has been, a community decision. It looks like the conversation above brought some interesting data to light that may sway the community to not want the same configuration as proposed, or something in between, but I'll leave y'all to decide unless you specifically want my input on something. Happy to help as always.

Change 808424 restored by Stang:

[operations/mediawiki-config@master] enwiki: Raise wgPageTriageMaxAge to indefinite

https://gerrit.wikimedia.org/r/808424

Thank you very much @Krinkle. I appreciate your input! I agree about improving the documentation. I happen to have been working on documentation the last few days over at mw:Extension:PageTriage (diff), including a NOINDEX section. Hopefully that helps.

@TheresNoTime's data above may indicate some bugs in the cron job in non-mainspace or non-PageTriage namespaces. The way these non-PageTriage namespace entries in the pagetriage_page table arise, I believe, is through page moves. In non-PageTriage namespaces, according to the cron job code comments, I would expect old entries to be deleted 30 days after the page move, but she gave an example of some that aren't, e.g. en:Talk:Israeli Defense Forces. Let me dig into that a bit more. The volume looks pretty low (2,283 rows in non-PageTriage namespaces), so if it is a bug, I think it's a small one.

TheresNoTime changed the task status from Stalled to In Progress. Edited Jul 20 2022, 3:12 AM

It feels like there's been enough discussion on-wiki (wrt. consensus) and here (wrt. implementation) to call this un-stalled.

@TheresNoTime Thank you for un-stalling this.
There certainly has been more than sufficient discussion on-Wiki. It was part of the original development, and consensus has been affirmed twice since - please see the folded comments above - in 2016 and again this year.
Can it now please be implemented without any further delay?

@TheresNoTime, @Kudpung - One of the main reasons that a 90 day limit was imposed was to limit the potential for the noindex template feature to be abused. (See lengthy discussion here.) That issue needs to be addressed prior to any changes to the limit. My personal opinion is that the security risk outweighs the risk of non-patrolled articles being indexed by Google (especially since truly problematic new articles such as vandalism or attack pages are typically deleted within a few days of their creation).

@kaldari, thanks for your insights. That thread you linked is pretty long, but it seems like the concern is that a vandal could place __NOINDEX__ in a template without folks noticing.

If we agree that this is worth patching, one solution could be to do a 90 day check in the noindex magic word code path, but not in the unpatrolled code path. That could be done by creating a 2nd config variable. Also open to other ideas.
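
As a very rough sketch of that idea (the second config variable and the helper functions here are hypothetical names, not the eventual patch):

private static function shouldShowNoIndex( Article $article ) {
	global $wgPageTriageNoIndexUnreviewedNewArticles, $wgPageTriageMaxNoIndexAge;
	$page = $article->getPage();

	// Unpatrolled code path: no age cap, so unreviewed articles keep
	// NOINDEX until they are reviewed, as requested in this task.
	if ( $wgPageTriageNoIndexUnreviewedNewArticles
		&& PageTriageUtil::doesPageNeedTriage( $page )
	) {
		return true;
	}

	// Magic word code path: keep a bounded window (e.g. 90 days), so a
	// vandal adding __NOINDEX__ via a template can only affect new pages.
	return self::hasNoIndexMagicWord( $page )
		&& self::pageAgeInDays( $page ) <= $wgPageTriageMaxNoIndexAge;
}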

@kaldari Thank you for reminding me of that discussion, Ryan. NPP is now nearly bankrupt despite my having created the user right for it in order to introduce some quality and competency into the system.
I am not a software engineer and I do not pretend to understand the purely technical issues, but we need this request addressed urgently now to avoid NPP collapsing altogether.

We now need more than ever to prevent Wikipedia becoming a free-for-all for every spammer and paid editor. My personal opinion is that the non-indexing by Google outweighs any security issues, because getting articles indexed for their SEO is precisely what those who abuse Wikipedia are hoping to achieve.

@Kudpung - If NPP can't clear all new articles within 90 days, I would definitely consider that bankrupt. Despite the number of new articles per day steadily declining over the years, the length of the unpatrolled backlog has only crept longer and longer. It looks like the current backlog stretches at least to January (disregarding converted redirects, moves, undeletions, etc.). What's to prevent the backlog from growing further than 365 days? The idea that it could take a year for a new article on Wikipedia to become truly "live" is disheartening. Sure, it might discourage a few spammers, but it will also surely discourage a large number of good faith contributors. As a long-time leader in the Wikipedia community, I'm sure you understand that and appreciate the implications. And as admirable as your efforts to instill rigor and quality control into NPP have been, it seems apparent that bigger changes are needed than technical band-aids like this. Looking through the older unpatrolled pages, I'm struck by the fact that nearly all of them have actually been reviewed by other editors and cleaned-up or tagged for problems. It's unfortunate that the patrolling process has become largely divorced from this organic decentralized reviewing process. Have you considered ways that these processes could be brought back together? As the failure of Citizendium and similar projects shows us, too much rigor and control kills the power of crowd-sourcing.

Change 815835 had a related patch set uploaded (by Novem Linguae; author: NovemLinguae):

[mediawiki/extensions/PageTriage@master] Add $wgPageTriageMaxNoIndexAge configuration variable

https://gerrit.wikimedia.org/r/815835

What is the actual status of this request right now? Is it stuck at code review or something?

Hi @Kudpung — it appears you are correct, this task is currently stalled on code review. There are two patches which, once merged, will "resolve" this task. They are:

  • 815835: Add $wgPageTriageMaxNoIndexAge configuration variable by @Novem_Linguae — awaiting @Novem_Linguae to make some suggested changes.
  • 808424: enwiki: Raise wgPageTriageMaxAge to indefinite by @Stang — I believe this is stalled waiting on the above patch to be reviewed and merged.

Community Consensus - This thread had two objections about community consensus, but one person withdrew their objection. We couldn't get a formal close of the thread at WT:NPPR, but consensus in favor seems clear, and it was advertised with a link at WP:VPR, so was also advertised to the wider community. I think this takes care of the community consensus part, which is probably why the status was changed from "stalled" to "in progress".

Technical - Kaldari raised a technical objection, worrying that vandals would add NOINDEX to templates, which would NOINDEX a bunch of articles and be hard to track down. I wrote a patch to fix this, and it is stuck in code review. I'm busy with my day job for about another week, then I will have time to address it. If someone else wants to take a stab at addressing the code review feedback sooner, go right ahead.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/815835

The code review comment is:

I think I found a loophole in this logic. Say the NOINDEX age is at 90 days, and the other variable is indefinite. Now if I have an unreviewed article that's 100 days old, wouldn't adding NOINDEX actually remove the noindex property from that article?

To fix this, I think the logic needs to change so that there's one variable for all unreviewed articles and a second variable for all other articles.
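
To make the loophole concrete, the failure mode is a structure where the magic-word age check gets to decide alone (again a sketch with hypothetical helper names, not the actual patch):

// Buggy shape: once the magic word is present, its age cap wins, so
// adding __NOINDEX__ to an unreviewed 100-day-old article returns false
// here and actually *removes* the noindex state.
if ( self::hasNoIndexMagicWord( $page ) ) {
	return self::pageAgeInDays( $page ) <= $wgPageTriageMaxNoIndexAge;
}
return PageTriageUtil::doesPageNeedTriage( $page );

// Safe shape: evaluate the two paths independently and OR them, so
// neither condition can cancel the other.
return PageTriageUtil::doesPageNeedTriage( $page )
	|| ( self::hasNoIndexMagicWord( $page )
		&& self::pageAgeInDays( $page ) <= $wgPageTriageMaxNoIndexAge );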

Not only did an RFC decide in 2012 that unpatrolled articles should be NOINDEXed until reviewed and accepted, it was also in the MediaWiki manifesto at Page Curation. 10 years is a long time to wait, and there's no knowing how many hundreds of thousands of junk articles reside today in the corpus. Let's not wait another 10 years for a code review.

Change 844083 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/PageTriage@master] [WIP, untested] Hooks: Log to statsd when a page is noindex'd

https://gerrit.wikimedia.org/r/844083

Okay, so... as I mentioned on the patch, I'm confident we're good to proceed — however, I'd really like to be both confident and prepared for something going awry.

The idea behind this patch (which was suggested by @Novem_Linguae, thank you) is to get a baseline count of how often a page is served with NOINDEX — we'll use a statsv counter to get a general "feel" of what the status quo is. Then, when we make this change, we can see how this count is influenced (and make quicker decisions if something negative occurs).

I don't want this to hold things up any longer than absolutely necessary, so reviews of my implementation would be greatly appreciated.
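
For reference, the counter being added is of roughly this shape (a sketch using MediaWiki's standard statsd factory; the metric name is illustrative):

use MediaWiki\MediaWikiServices;

// Inside the noindex decision path: bump a counter every time a page view
// is served with NOINDEX, so Grafana can chart a baseline before and
// after the configuration change.
MediaWikiServices::getInstance()->getStatsdDataFactory()
	->increment( 'extension.PageTriage.noindex' );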

Oh well, we've been waiting for 10 years for this so what's another few weeks (or months)?

Change 844083 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Hooks: Log to statsd when a page is noindex'd

https://gerrit.wikimedia.org/r/844083

Change 844040 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.5] Hooks: Log to statsd when a page is noindex'd

https://gerrit.wikimedia.org/r/844040

Change 844040 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.5] Hooks: Log to statsd when a page is noindex'd

https://gerrit.wikimedia.org/r/844040

Mentioned in SAL (#wikimedia-operations) [2022-10-19T20:07:54Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-19T20:08:20Z] <samtar@deploy1002> samtar and samtar: Backport for [[gerrit:844040|Hooks: Log to statsd when a page is noindex'd (T310974)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet

Change 845012 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.6] Hooks: Log to statsd when a page is noindex'd

https://gerrit.wikimedia.org/r/845012

Change 815835 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Add $wgPageTriageMaxNoIndexAge configuration variable

https://gerrit.wikimedia.org/r/815835

Mentioned in SAL (#wikimedia-operations) [2022-10-20T18:05:53Z] <TheresNoTime> Backporting [[gerrit:845012]] for T310974 to wmf.6

Change 845012 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.6] Hooks: Log to statsd when a page is noindex'd

https://gerrit.wikimedia.org/r/845012

Mentioned in SAL (#wikimedia-operations) [2022-10-20T18:09:55Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:845012|Hooks: Log to statsd when a page is noindex'd (T310974)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-20T18:10:15Z] <samtar@deploy1002> samtar and samtar: Backport for [[gerrit:845012|Hooks: Log to statsd when a page is noindex'd (T310974)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-20T18:18:04Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:845012|Hooks: Log to statsd when a page is noindex'd (T310974)]] (duration: 08m 08s)

Novem_Linguae renamed this task from Extend PageTriageMaxAge for unpatrolled articles at enwiki to Extend PageTriageMaxAge (noindex) for unpatrolled articles at enwiki. Oct 27 2022, 11:12 AM

@Stang — could you schedule https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808424 for deployment next week (Monday/Tuesday in the UTC late deployment window, if at all possible)?


nb. Might it be worth a mediawiki-config patch to set:

'wgPageTriageMaxAge' => [
	'default' => 90,
],

(even though that is already the default) prior to 808424? That way we're avoiding a patch which both "introduces" a defined configuration variable and sets it.
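
If that is done first, the follow-up in 808424 would presumably reduce to a one-line override, something like this sketch (null meaning no age limit, per the patch description):

'wgPageTriageMaxAge' => [
	'default' => 90,
	// enwiki: unreviewed articles keep NOINDEX indefinitely
	'enwiki' => null,
],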

@Stang — could you schedule https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808424 for deployment next week (Monday/Tuesday in the UTC late deployment window, if at all possible)?

Sure, scheduled at Oct 31 afternoon window.

Change 850106 had a related patch set uploaded (by Stang; author: Stang):

[operations/mediawiki-config@master] Define a default value for wgPageTriageMaxAge

https://gerrit.wikimedia.org/r/850106

Change 850106 merged by jenkins-bot:

[operations/mediawiki-config@master] Define a default value for wgPageTriageMaxAge

https://gerrit.wikimedia.org/r/850106

Mentioned in SAL (#wikimedia-operations) [2022-10-27T13:23:50Z] <lucaswerkmeister-wmde@deploy1002> Started scap: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-27T13:24:09Z] <lucaswerkmeister-wmde@deploy1002> lucaswerkmeister-wmde and stang: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-27T13:29:24Z] <lucaswerkmeister-wmde@deploy1002> Finished scap: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] (duration: 05m 33s)

The original bug report states that "The New Page Reviewer user group is at present unable to review articles within 90 days (there is a backlog of 14,000)." We should note that the NPP backlog on the English Wikipedia was recently reduced to zero and the backlog does not seem to be climbing again significantly -- there are only 142 unreviewed pages as I write.

So, the issue no longer seems especially pressing.

I certainly agree that the issue appears to be less pressing (which is a great thing, and kudos to the English Wikipedia volunteers!) — it would still be great to move forward at our current planned rate while the attention and motivation is here 🙂

@Stang further off-wiki displeasure at the config patch has been raised, and I'm losing the will to engage with this. I'd recommend pulling the patch from the schedule.

@TheresNoTime, off-Wiki discussions carry no weight at all. This has been more than sufficiently discussed and approved, and your help with it has been enormously appreciated. One user's late complaint does not overturn a consensus. If Phabricator would work faster and if the other WMF employees here would prioritise issues that improve the quality of new pages, there would be no need for requests for this feature and any others like it. Is Code Review done by volunteers or something? What's holding this up now?

There have been mentions of holding another RFC, although to my knowledge this hasn't happened, and it's getting a bit late in that regard.

off-Wiki discussions carry no weight at all.

Yes they do; this whole discussion is off wiki, for instance, and it has a lot of weight. You just don't like those opinions and stampede over them, @Kudpung.

I personally have always been heavily opposed, for reasons stated multiple times on wiki and in Phabricator, and I've also opposed this time around (I was not the only one). NOINDEX is bad and creates an SEO pit that is hard to crawl out of.

  1. If a page has been NOINDEXed for a while, Google basically treats it like a 404 page and will not visit (crawl) the URL for potentially months, so it cannot organically discover that the NOINDEX was removed.
  2. Our crawling is accelerated by actively signalling Google, I believe, but as there is no edit, no new revision, and no linksupdate, Google doesn't get any signal for this state change by PageTriage.
  3. Additionally, there is no active CDN purge when there is a flip (as far as I can determine), so we have to wait until someone edits the page before the noindex change really starts taking effect, or wait for the frontend caches to expire naturally.
  4. So basically we just have to wait for Google to crawl us by chance (a full reindex of Wikipedia is pretty rare though; I think it's about every 3 months?). So that's a potential delay of up to 3 months of Google + our cache + PageTriage delay, and the older the page is, the worse it gets.
  5. NOINDEX heavily influences the website's relevance, because recent things that people search for are not in the index (the site cannot provide an answer), yet other sites most likely can, so you become less relevant to Google visitors compared to other sites, which influences your overall score.
  6. Any mirrors of Wikipedia that carry the same content and strip the noindex (because ad revenue) can now be considered the 'original content publishers', and this will further derank all of Wikipedia as a 'non-original publisher'.
  7. All of these effects are hard to measure and/or verify, as Google keeps the exact workings closed to outsiders, so we can't quantify them, which makes this an unprovable he-said-they-said situation.

But in my opinion flipping NOINDEX back to INDEX is bad because search engines are bad at it, especially for smaller topics with fewer edits that get reviewed very late, and for topics that are 'recent'. And this change further aggravates an already bad technical solution. Some people (mostly NPP and some BLP people) think this is warranted, and almost everyone else doesn't really have an opinion on it, as the topic is too complex to care about.

So hashing this out in an RFC seems pretty pointless to me. Very few people will understand the intricacies and the NPP crowd will all back each other up over any sort of objections, so I'm definitely NOT going to waste any of my time on an RfC. As far as I care, someone can deploy the patch.

But we should really think about how this state change gets communicated to third parties, because now it's heavily delayed.

Since I was asked to look at this, I mostly agree with TheDJ's assessment here. NOINDEX and SEO changes are hard to gauge the scope of because we're at the mercy of Google and others' blackbox algorithms. Flipping on and off NOINDEX is not like a light switch, it's much slower and usually has other negative effects (especially #5 and #6). I can provide a bit of anecdotal evidence to back up this pain as we ran into all of these during the GlobalUserPage rollout. I think we're better off spending developer time in building/improving tools to keep the backlog down rather than trying to hide pages if/when the backlog gets too large.

(fwiw I don't think any on-wiki RfC is needed at this time (in other words, it's *not* a Community-consensus-needed issue), rather it needs a proper technical evaluation.)

@Legoktm

I think we're better off spending developer time in building/improving tools

At the moment it's not using any developer time. All the work on the patch has been done by volunteers from the en.Wiki. Thank you for confirming that this does not need yet another on-Wiki RfC - consensus has not changed.

Can those of you expressing concerns about the technical aspects of NOINDEX explain why this patch makes the problem worse (that is, why NOINDEXing after however many months it takes to get reviewed is worse than NOINDEXing after 3 months as is currently done)? I'm not seeing a clear answer to that anywhere.

Novem_Linguae changed the task status from In Progress to Open. Jan 30 2023, 7:29 AM

Marking as declined for now to reflect reality. When the NPP team and I did a big push to implement this in 2022, various folks had small objections that added up into enough resistance to stall this. No reason to work on it if it's just going to get stalled. We will focus our energies elsewhere for now.

Change 808424 abandoned by Stang:

[operations/mediawiki-config@master] enwiki: Raise wgPageTriageMaxAge to indefinite

Reason:

per T310974#9162755

https://gerrit.wikimedia.org/r/808424