Pages in main namespace are being noindexed despite being older than 90 days
Closed, Resolved · Public · 3 Story Points

Description

It looks like something is malfunctioning in PageTriage (or core) and __NOINDEX__ is effectively noindexing articles in main namespace older than 90 days. See documentation of how it's supposed to work at https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing#Indexing_of_articles_(%22mainspace%22).

From bug report at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#Important_pages_hidden_from_search_engines.

Event Timeline

kaldari created this task. Dec 3 2018, 6:57 PM
Restricted Application added a project: Growth-Team. Dec 3 2018, 6:57 PM
Restricted Application added a subscriber: Aklapper.
kaldari triaged this task as High priority. Dec 3 2018, 6:57 PM
kaldari updated the task description.
Xeno added a subscriber: Xeno. Dec 3 2018, 8:36 PM
JJMC89 added a subscriber: JJMC89. Dec 4 2018, 6:27 AM

This is running afoul of T149538 so appears to be a regression.

kostajh added a subscriber: kostajh. Dec 5 2018, 3:36 AM
Catrope added a subscriber: Catrope. Dec 5 2018, 7:01 PM

As far as I can tell, it's not true that the __NOINDEX__ magic word and the {{NOINDEX}} template do not work on pages older than 90 days (which is what the documentation on enwiki claims). The __NOINDEX__ magic word does not expire, it's effective forever. But it's also disabled in the main namespace, so the only thing noindexing pages in the main namespace should be PageTriage.

FYI, the 90 days comes from $wgPageTriageMaxAge.

JJMC89 added a comment. Dec 6 2018, 3:34 AM

As far as I can tell, it's not true that the __NOINDEX__ magic word and the {{NOINDEX}} template do not work on pages older than 90 days (which is what the documentation on enwiki claims). The __NOINDEX__ magic word does not expire, it's effective forever. But it's also disabled in the main namespace, so the only thing noindexing pages in the main namespace should be PageTriage.

__NOINDEX__ shouldn't work at all in mainspace. As far as I can tell, it is functioning correctly.

According to shouldShowNoIndex, {{NOINDEX}} (wgPageTriageNoIndexTemplates in InitialiseSettings.php) shouldn't cause PageTriage to noindex the article if the article is older than 90 days (third requirement). @kostajh's patch for T203008: Ensure retrieval and storage of article metadata touched isArticleNew, which is used to check this requirement.

shouldShowNoIndex doc
	/**
	 * Determines whether to set noindex for the article specified
	 *
	 * Returns true if all of the following are true:
	 *   1. The page includes a template that triggers noindexing
	 *   2. The page was at some point in the triage queue
	 *   3. The page is younger than the maximum age for "new pages"
	 * or all of the following are true:
	 *   1. $wgPageTriageNoIndexUnreviewedNewArticles is true
	 *   2. The page is in the triage queue and has not been triaged
	 *   3. The page is younger than the maximum age for "new pages"
	 * Note that we always check the age of the page last since that is
	 * potentially the most expensive check (if the data isn't cached).
	 *
	 * @param Article $article
	 * @return bool
	 */
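
The two AND-groups in this doc comment can be restated as a small predicate. What follows is a sketch in Python rather than the extension's PHP, and it is not the PageTriage implementation: the `PageState` fields, the function name, and the strict "less than" age comparison are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    """Hypothetical snapshot of the facts the doc comment refers to."""
    has_noindex_template: bool  # page includes a template that triggers noindexing
    was_ever_in_queue: bool     # page was at some point in the triage queue
    in_queue_unreviewed: bool   # page is in the queue and has not been triaged
    age_days: int               # days since the page was created


def should_show_noindex(page: PageState,
                        noindex_unreviewed_new_articles: bool,
                        max_age_days: int = 90) -> bool:
    """Mirror the two requirement groups; the age check runs last."""
    template_branch = page.has_noindex_template and page.was_ever_in_queue
    unreviewed_branch = (noindex_unreviewed_new_articles
                         and page.in_queue_unreviewed)
    if not (template_branch or unreviewed_branch):
        return False
    # Both groups share the same (potentially expensive) age requirement.
    return page.age_days < max_age_days
```

Under this reading, a page that transcludes a noindexing template but is older than the maximum age should never be noindexed by PageTriage, which is the requirement the report says was being violated.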

@JJMC89 I suppose we certainly can try to break some more "90+ day" articles. At the time of the report I checked the rendered HTML on:
https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company

and got:
<meta name="robots" content="noindex,nofollow"/>
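
A check like this can be scripted against saved page HTML. The sketch below uses only Python's standard library; the class and function names are hypothetical, and it assumes the page source has already been fetched into a string.

```python
from html.parser import HTMLParser
from typing import Optional


class RobotsMetaParser(HTMLParser):
    """Record the content of the first <meta name="robots"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.robots: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> are also routed here
        # by HTMLParser's default handle_startendtag.
        if tag != "meta" or self.robots is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            self.robots = attributes.get("content")


def robots_policy(html: str) -> Optional[str]:
    """Return the robots policy string, or None if the page sets none."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.robots
```

A return value of `None` means the page carries no robots meta tag at all, which is distinct from an explicit `index,follow`.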

JJMC89 added a comment. Dec 6 2018, 3:56 AM

I tested a random old page and got <meta name="robots" content="noindex,nofollow"/> when using {{NOINDEX}} but not when I tried __NOINDEX__.

So it seems that, at the very least, the third requirement ("The page is younger than the maximum age for "new pages"") isn't being enforced.

TheDJ added a subscriber: TheDJ. Dec 6 2018, 11:49 PM

It seems to at least be sort-of working correctly. For example, these two articles were both nominated for speedy deletion (and thus include the {{NOINDEX}} template):

kaldari added a comment. Edited Dec 13 2018, 2:38 AM

@kostajh - Looks like the bug was introduced here: https://phabricator.wikimedia.org/diffusion/EPTR/browse/master/includes/Hooks.php$385. The reason this produces the bug is that wfTimestamp( TS_MW, false ) returns the current timestamp, not false or null as you would probably expect. Thus each time PageTriage tried to look up the age of an old article that wasn't in PageTriage's DB tables, it would think the article was brand new.
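
The failure mode described above can be modelled in a few lines. This is a Python sketch with hypothetical helper names, not the PageTriage code and not the merged patch: `to_mw_timestamp` only imitates the reported behaviour of `wfTimestamp( TS_MW, false )`, where a missing input silently becomes the current time. The guarded variant shows one possible way to avoid the problem, under the assumption that a page with no recorded creation date should not be treated as new.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MW_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp in the style of TS_MW


def to_mw_timestamp(value: Optional[datetime], now: datetime) -> str:
    """Imitate the reported pitfall: a missing value becomes 'now'."""
    chosen = value if value is not None else now
    return chosen.strftime(MW_FORMAT)


def is_new_buggy(created: Optional[datetime], now: datetime,
                 max_age_days: int = 90) -> bool:
    """Convert first, compare second: unknown dates look brand new."""
    stamp = datetime.strptime(to_mw_timestamp(created, now), MW_FORMAT)
    stamp = stamp.replace(tzinfo=timezone.utc)
    return now - stamp < timedelta(days=max_age_days)


def is_new_guarded(created: Optional[datetime], now: datetime,
                   max_age_days: int = 90) -> bool:
    """Handle the missing value explicitly before any conversion."""
    if created is None:
        # Assumption: no recorded creation date means "not a new page".
        return False
    return is_new_buggy(created, now, max_age_days)
```

With `created=None`, `is_new_buggy` reports the page as new regardless of its real age, which matches the symptom of old articles being noindexed.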

Change 479369 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[mediawiki/extensions/PageTriage@master] Fixing creation date lookup for noindexing

https://gerrit.wikimedia.org/r/479369

kaldari set the point value for this task to 3. Dec 13 2018, 2:53 AM
kaldari moved this task from Inbox to Current Sprint on the Growth-Team board.
kaldari edited projects, added Growth-Team (Current Sprint); removed Growth-Team.
kaldari moved this task from Incoming to Code Review on the Growth-Team (Current Sprint) board.
kaldari claimed this task.
TheDJ awarded a token. Dec 13 2018, 7:26 AM

Change 479369 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@master] Fixing creation date lookup for noindexing

https://gerrit.wikimedia.org/r/479369

Change 479449 had a related patch set uploaded (by Kosta Harlan; owner: Kaldari):
[mediawiki/extensions/PageTriage@wmf/1.33.0-wmf.8] Fixing creation date lookup for noindexing

https://gerrit.wikimedia.org/r/479449

Change 479449 abandoned by Kosta Harlan:
Fixing creation date lookup for noindexing

Reason:
Let's let this ride the train next week

https://gerrit.wikimedia.org/r/479449

@Etonkovidova - Here's how it should all pan out after the patch is deployed:

  • Unreviewed article, less than 90 days old, no {{NOINDEX}} transclusion: outputs <meta name="robots" content="noindex,nofollow"/>
  • Reviewed article, less than 90 days old, no {{NOINDEX}} transclusion: doesn't noindex
  • Unreviewed article, more than 90 days old, no {{NOINDEX}} transclusion: doesn't noindex
  • Reviewed article, more than 90 days old, no {{NOINDEX}} transclusion: doesn't noindex
  • Unreviewed article, less than 90 days old, {{NOINDEX}} transclusion: outputs <meta name="robots" content="noindex,nofollow"/>
  • Reviewed article, less than 90 days old, {{NOINDEX}} transclusion: outputs <meta name="robots" content="noindex,nofollow"/>
  • Unreviewed article, more than 90 days old, {{NOINDEX}} transclusion: doesn't noindex
  • Reviewed article, more than 90 days old, {{NOINDEX}} transclusion: doesn't noindex
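
The eight cases above reduce to one rule: a page is noindexed only while it is younger than 90 days, and then only if it is unreviewed or transcludes {{NOINDEX}}. The Python sketch below enumerates that matrix; the function name is hypothetical, and it assumes every page was in the triage queue at some point and that $wgPageTriageNoIndexUnreviewedNewArticles is true, as on English Wikipedia.

```python
from itertools import product


def expect_noindex(reviewed: bool, older_than_max_age: bool,
                   has_template: bool) -> bool:
    """Condensed form of the eight expected outcomes listed above."""
    return (not older_than_max_age) and (has_template or not reviewed)


# Same order as the list: no-template cases first, then template cases.
matrix = [
    (reviewed, older, has_template,
     expect_noindex(reviewed, older, has_template))
    for has_template, older, reviewed in product([False, True], repeat=3)
]

for reviewed, older, has_template, noindex in matrix:
    print(
        "Reviewed" if reviewed else "Unreviewed",
        "more than 90 days" if older else "less than 90 days",
        "with {{NOINDEX}}" if has_template else "no {{NOINDEX}}",
        "-> noindex" if noindex else "-> indexable",
        sep=", ",
    )
```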

@Etonkovidova - note that you can see which articles transclude a {{NOINDEX}} template at https://en.wikipedia.org/wiki/Category:Noindexed_articles.

Checked the above testcases (thx, @kaldari!) in testwiki (wmf.9) - all is correct, except the following case:

(1)

Reviewed article, less than 90 days old, {{NOINDEX}} transclusion: outputs <meta name="robots" content="noindex,nofollow"/>

An example of such a page: https://test.wikipedia.org/wiki/MavetunaZilant30_1. The page was created on Nov 26, 2018 and reviewed via PageTriage. Applying {{NOINDEX}} and adding magic words via VE did not make the page "noindex,nofollow" (I checked the output with $('meta[name=robots]').attr('content')). Could it be just the testwiki setup?

(2) VE presents the page option for excluding the page, promising to exclude the page from search. The option is present independently of exclude logic (e.g. the page is in main namespace, the page is older than 90 days etc)

(3) What was not checked:
Looking at Category:Wikipedia templates which apply NOINDEX, there are about 70 templates that apply {{NOINDEX}}. I spot-checked only a few of the AfC templates, and all seem to be functioning as expected (AfC templates exclude articles from indexing).
Another interesting avenue for testing would be to check what was reported on this ticket: {{NOINDEX}} was included in the {{Undisclosed paid}} template (the template does not exist on testwiki).

(4) According to Wikipedia:Controlling search engine indexing

On English Wikipedia the entire User: namespace, User talk:, Draft: and Draft talk: namespaces are automatically noindexed via a software setting.

testwiki behaves differently.

Checked the above testcases (thx, @kaldari!) in testwiki (wmf.9) - all is correct, except the following case:

(1)

Reviewed article, less than 90 days old, {{NOINDEX}} transclusion: outputs <meta name="robots" content="noindex,nofollow"/>

An example of such a page: https://test.wikipedia.org/wiki/MavetunaZilant30_1. The page was created on Nov 26, 2018 and reviewed via PageTriage. Applying {{NOINDEX}} and adding magic words via VE did not make the page "noindex,nofollow" (I checked the output with $('meta[name=robots]').attr('content')). Could it be just the testwiki setup?

This is expected since __NOINDEX__ shouldn't function in content namespaces and {{NOINDEX}} is not configured on testwiki.

(2) VE presents the page option for excluding the page, promising to exclude the page from search. The option is present independently of exclude logic (e.g. the page is in main namespace, the page is older than 90 days etc)

AFAIK, it has always been that way in VE.

Due to the configuration differences (see below), I recommend checking everything on enwiki once wmf.9 is deployed there.

InitialiseSettings.php
'wgPageTriageNoIndexTemplates' => [
	'default' => [],
	'enwiki' => [ 'NOINDEX' ]
],

...

'wgNamespaceRobotPolicies' => [
	...
	'enwiki' => [
		NS_USER => 'noindex,follow', // T104797
		NS_USER_TALK => 'noindex,follow',
		118 => 'noindex,nofollow', // draft
		119 => 'noindex,nofollow', // draft talk
	],
	...
],

...

'wmgExemptFromUserRobotsControlExtra' => [
	// When wmgAllowRobotsControlInAllNamespaces is false (the default),
	// __NOINDEX__ and __INDEX__ will be ignored for these namespaces,
	// as well as for namespaces in $wgContentNamespaces.
	'default' => [],
	'enwiki' => [ 118, 119 ], // draft and draft talk
	...
],
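
The comment on wmgExemptFromUserRobotsControlExtra implies a simple membership rule for whether __NOINDEX__ and __INDEX__ are honoured in a given namespace. The Python sketch below restates only that rule as worded in the comment; it is not the MediaWiki implementation, the function name is hypothetical, and treating namespace 0 as the only content namespace is an assumption made for the example.

```python
from typing import Collection

NS_MAIN = 0
NS_USER = 2
NS_DRAFT = 118
NS_DRAFT_TALK = 119


def magic_word_honoured(namespace: int,
                        content_namespaces: Collection[int],
                        exempt_extra: Collection[int],
                        allow_in_all_namespaces: bool = False) -> bool:
    """Return True if __NOINDEX__/__INDEX__ would take effect here."""
    if allow_in_all_namespaces:
        return True
    # Ignored in content namespaces and in the extra exempt namespaces.
    exempt = set(content_namespaces) | set(exempt_extra)
    return namespace not in exempt


# Illustrative values: content namespaces are assumed, the extras
# come from the enwiki entry shown above.
ENWIKI_CONTENT = {NS_MAIN}
ENWIKI_EXEMPT_EXTRA = {NS_DRAFT, NS_DRAFT_TALK}
```

With these values the magic word has no effect in the main, Draft, and Draft talk namespaces, which is consistent with the observation that only PageTriage should be noindexing mainspace pages.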

Thx, @JJMC89. Since VE adds __NOINDEX__, which is disabled in many namespaces (as per the Namespace_control documentation), there are not many use cases for that option.

I'll check the test cases on enwiki with wmf.9.

Change 480884 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[operations/mediawiki-config@master] Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki

https://gerrit.wikimedia.org/r/480884

Thanks for catching that @JJMC89! I'll deploy a config change for Test Wikipedia tomorrow so it matches English Wikipedia.

Change 480884 merged by jenkins-bot:
[operations/mediawiki-config@master] Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki

https://gerrit.wikimedia.org/r/480884

Etonkovidova closed this task as Resolved. Dec 20 2018, 9:09 PM

Re-checked the following in testwiki after the patch deploy - all works as intended:

Reviewed article, less than 90 days old, {{NOINDEX}} transclusion: outputs <meta name="robots" content="noindex,nofollow"/>

To follow up on

(2) VE presents the page option for excluding the page, promising to exclude the page from search. The option is present independently of exclude logic (e.g. the page is in main namespace, the page is older than 90 days etc)

I filed the task T212458.