
action=info always returns search engine index status "Not indexable" when curid is specified
Open, Low, Public

Description

Author: richardg_uk

Description:
If the curid parameter is specified with action=info, the search engine index status is always reported as "Not indexable", which is generally incorrect.

Presumably the returned information is intended to relate to the canonical page, regardless of how it is specified in the URL, so that action=info results should be identical irrespective of whether the page is identified by curid ("Page ID") or by title ("Display title").

Something seems to be going wrong with passing the canonical title to setIndexPolicy() or getRobotPolicy() in InfoAction::pageInfo().

Example with the [[Trumptonshire]] article on enwiki:

http://en.wikipedia.org/w/index.php?title=Trumptonshire&action=info
-> Search engine status: "Indexable" (as expected)

http://en.wikipedia.org/w/index.php?curid=11652196&action=info
-> Search engine status: "Not indexable" (unexpected)

http://en.wikipedia.org/w/index.php?title=Trumptonshire&curid=11652196&action=info
-> Search engine status: "Not indexable" (unexpected)

(If both curid and title are specified, title is ignored. This is presumably by design.)

  • created by bug 38531 - "Add (search) index status to MediaWiki's info action"
  • tracking bug 38450 - "Reimplement MediaWiki's info action (tracking)"

Version: 1.21.x
Severity: normal

Details

Reference
bz42867

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:58 AM
bzimport added a project: MediaWiki-General.
bzimport set Reference to bz42867.
bzimport added a subscriber: Unknown Object (MLST).
bzimport created this task. Dec 8 2012, 7:26 PM

I have a feeling that this is a misunderstanding because of the misleading wording, which would make it a duplicate of bug 43935?
Richard, could you take a look?

richardg_uk wrote:

Thanks for the pointer, but this bug is a technical error caused by an internal inconsistency, different from the terminology issue in bug 43935.

Note also that http://en.wikipedia.org/robots.txt disallows *all* subpages of http://en.wikipedia.org/w/

So all 3 URLs above should certainly show the *same* search status.

Since the info parameter is intended to provide information about the canonical page identified by the request (not about the URL through which the information happens to be requested), the search status should always be that of the canonical page.

So, as previously stated, all 3 examples in comment #0 ought to display:
-> Search engine status: "Indexable"

Note that the info results are unaffected by whether the entry point is "/w/index.php" or "/wiki/PAGENAME", as demonstrated in these further examples:

Ex 4: http://en.wikipedia.org/wiki/Trumptonshire?action=info
-> Search engine status: "Indexable" (as expected)

Ex 5: http://en.wikipedia.org/wiki/?curid=11652196&action=info
-> Search engine status: "Not indexable" (unexpected)

Ex 6: http://en.wikipedia.org/wiki/Trumptonshire?curid=11652196&action=info
-> Search engine status: "Not indexable" (unexpected)

richardg_uk wrote:

MZMcBride has pointed out elsewhere (bug 43935 comment 4) that the reported "Search engine status" corresponds to the <meta name="robots" content="noindex,follow" /> tag controlled by namespace settings or the NOINDEX page behavior switch.

That explains the cause of this bug: replacing "action=info" with "action=view" in each request returns a wikipage that contains the noindex meta tag IF AND ONLY IF "Not indexable" is shown as the search engine status on the corresponding "action=info" page.
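The "if and only if" relationship can be illustrated with a minimal sketch, assuming both the meta tag on the view page and the label on the info page are derived from one shared robot policy. The helper names sketchRobotsMetaTag() and sketchSearchEngineStatus() are hypothetical and are not MediaWiki functions; omitting the tag for the default "index,follow" policy is likewise an assumption of this sketch.

```php
<?php
// Hypothetical helpers for illustration only; neither exists in MediaWiki.

// Renders the robots meta tag for a policy. Assumption: the default
// "index,follow" policy needs no tag, so an empty string is returned.
function sketchRobotsMetaTag( array $policy ) {
	$content = $policy['index'] . ',' . $policy['follow'];
	if ( $content === 'index,follow' ) {
		return '';
	}
	return '<meta name="robots" content="' . $content . '" />';
}

// Derives the label shown on the info page from the same policy.
function sketchSearchEngineStatus( array $policy ) {
	return $policy['index'] === 'noindex' ? 'Not indexable' : 'Indexable';
}

$policies = array(
	array( 'index' => 'index', 'follow' => 'follow' ),
	array( 'index' => 'noindex', 'follow' => 'follow' ),
);
foreach ( $policies as $policy ) {
	$hasNoindexTag = strpos( sketchRobotsMetaTag( $policy ), 'noindex' ) !== false;
	$notIndexable = sketchSearchEngineStatus( $policy ) === 'Not indexable';
	// The two observations agree because they share a single input.
	var_dump( $hasNoindexTag === $notIndexable );
}
```

Because both outputs depend only on the policy array, anything that changes the policy for a given URL (such as a curid parameter) changes the tag and the label together.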

However, this behavior is still inappropriate, because the user expects to see information about the canonical page, regardless of which URL entry point or parameters are used to identify it.

For example, "Redirects to this page" would be zero if the information related to the URL, but it instead always reports the number of redirects to the canonical page, as expected.

Compare, for example:

Ex 7: http://en.wikipedia.org/w/index.php?title=Wikipedia:Assume_good_faith&action=info
-> Search engine status: "Indexable"
-> Number of page watchers: "276"
-> Redirects to this page: "30"

Ex 8: http://en.wikipedia.org/w/index.php?curid=502959&action=info
-> Search engine status: "Not indexable" (bug!)
-> Number of page watchers: "276"
-> Redirects to this page: "30"

The search engine status is the only info reported inconsistently.

Article::getRobotPolicy contains the condition for this:

} elseif ( $this->getContext()->getRequest()->getInt( 'curid' ) ) {
	# For ?curid=x urls, disallow indexing
	return array(
		'index'  => 'noindex',
		'follow' => 'follow'
	);
}

So this works as designed.
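To make the effect of that condition concrete, here is a simplified stand-alone model of the curid branch. It is only a sketch: sketchRobotPolicy() is a hypothetical function, the $query array stands in for the request object, intval() approximates getInt(), and $defaultPolicy stands in for whatever the namespace settings or the NOINDEX switch would otherwise produce.

```php
<?php
// Hypothetical, simplified stand-in for the curid branch quoted above.
// $query models the URL query parameters; $defaultPolicy models the policy
// that would apply if no curid were present.
function sketchRobotPolicy( array $query, array $defaultPolicy ) {
	// Approximates getInt( 'curid' ): missing or non-numeric becomes 0.
	$curid = isset( $query['curid'] ) ? intval( $query['curid'] ) : 0;
	if ( $curid ) {
		// For ?curid=x urls, disallow indexing
		return array( 'index' => 'noindex', 'follow' => 'follow' );
	}
	return $defaultPolicy;
}

$default = array( 'index' => 'index', 'follow' => 'follow' );

$byTitle = sketchRobotPolicy(
	array( 'title' => 'Trumptonshire', 'action' => 'info' ), $default
);
$byCurid = sketchRobotPolicy(
	array( 'curid' => '11652196', 'action' => 'info' ), $default
);
$byBoth = sketchRobotPolicy(
	array( 'title' => 'Trumptonshire', 'curid' => '11652196', 'action' => 'info' ), $default
);

echo $byTitle['index'], "\n"; // index
echo $byCurid['index'], "\n"; // noindex
echo $byBoth['index'], "\n";  // noindex
```

In this model the policy depends on the presence of curid in the URL, not on the page it identifies, which matches the three results reported in the original description.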