
?action=info incorrectly states that a new page is indexed when it is not
Open, In Progress, MediumPublic


?action=info checks page_props to detect whether a page is indexed, so the result will be incorrect on new unpatrolled pages.

Steps to replicate the issue (include links if applicable):

  • enwiki
  • Find a new-ish page in mainspace that is unreviewed and visit it.
  • go to "Tools -> Page information"

What happens?:

  • "Indexing by robots" says "Allowed"

2022-12-16_132920.png (1×1 px, 251 KB)

2022-12-16_133019.png (190×833 px, 21 KB)

What should have happened instead?:

  • "Indexing by robots" should say "Not allowed" or equivalent

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

  • My guess is that PageTriage tweaks whether noindex is shown, but does not make a corresponding update to "Page information -> Indexing by robots". This should be fixed to avoid confusion.

Possibly this can be fixed by using the InfoAction hook.
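A rough sketch of what such a handler might look like. The InfoAction hook signature and the header-basic section of $pageInfo are real MediaWiki core details; the class name and the isPageUnreviewed() helper are hypothetical stand-ins for PageTriage's actual review check, not the eventual patch:

```php
<?php
// Hypothetical sketch for PageTriage/includes/Hooks.php.
// isPageUnreviewed() is an assumed helper, not a real PageTriage method.
class PageTriageInfoHooks {

	public static function onInfoAction( IContextSource $context, array &$pageInfo ) {
		$title = $context->getTitle();
		if ( !self::isPageUnreviewed( $title ) ) {
			return;
		}
		// Walk the "Basic information" rows and override the robot policy
		// row that core's InfoAction already added from the ParserOutput.
		foreach ( $pageInfo['header-basic'] as &$row ) {
			if ( $row[0]->getKey() === 'pageinfo-robot-policy' ) {
				$row[1] = 'noindex,nofollow';
			}
		}
	}
}
```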

See a discussion where this came up.

additional discussion from merged task

Event Timeline

Restricted Application added a subscriber: Aklapper.

I suspect the same problem might occur with blocked users, by the way, since we have T13443: Auto-noindex user/user talk pages for blocked users.

Xaosflux triaged this task as Medium priority. Jul 23 2017, 4:54 PM

This caused a lot of frustration for me today, when I was trying to help someone determine why their new article wasn't showing up in search indexes. Eventually I looked at the HTML source of the page and saw the noindex tag.

Could probably write a patch in PageTriage/includes/Hooks.php that takes advantage of the InfoAction hook.

MPGuy2824 changed the task status from Open to In Progress. Feb 25 2023, 10:47 AM
MPGuy2824 claimed this task.

Change 892028 had a related patch set uploaded (by MPGuy2824; author: MPGuy2824):

[mediawiki/extensions/PageTriage@master] Correcting the indexable property shown in '?action=info'

Suggestions from @kostajh:

  1. It might make sense to have a separate row for PageTriage's "noindex" rule. So, leave "pageinfo-robot-policy" alone, and have a new row for e.g. "pagetriage-robot-policy" with a field name like "PageTriage index policy". That way the user can see what core's Article::getRobotPolicy() returns, as well as what PageTriage overrides it with.
  2. A second thought is that perhaps it would make sense for Article to execute a new hook, something like onArticleGetRobotPolicy, to allow extensions to override the value. Then you wouldn't need onInfoAction, because PageTriage's implementation of onArticleGetRobotPolicy would set the policy.

Since this is an either/or situation, we should continue the discussion here.
To me, #1 seems like it will cause confusion. #2 sounds cleaner, except that it will also need a change to MediaWiki core.
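For the record, option #2 might look something like this. The ArticleGetRobotPolicy hook does not exist in core, so both the hook point and the handler below are purely illustrative (and isPageUnreviewed() is again an assumed helper):

```php
// Hypothetical hook point inside core's Article::getRobotPolicy(),
// just before the computed policy is returned:
$policy = [ 'index' => $indexPolicy, 'follow' => $followPolicy ];
$this->getHookRunner()->onArticleGetRobotPolicy( $this, $policy );
return $policy;

// Hypothetical handler in PageTriage: force noindex on unreviewed pages.
public static function onArticleGetRobotPolicy( Article $article, array &$policy ) {
	if ( self::isPageUnreviewed( $article->getTitle() ) ) {
		$policy['index'] = 'noindex';
	}
}
```

With this approach PageTriage would not need an onInfoAction handler at all, since core's info page would pick up the overridden policy directly.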

As I wrote on the Gerrit patch, this field is telling you the robot policy associated with the ParserOutput object (and stored in the database), yet PageTriage writes to the OutputPage object when it sets its policy, which explains the mismatch.

A reader looking at that would presumably want to know the final indexing setting, not an intermediary parser snapshot. If we also want to show the stepwise process, fine, but the user story here is that what you see is not what you get.

Playing devil's advocate: why is it important for a reader to know this information? It seems a little advanced to me. I can see how it would be useful in an editor workflow, but not for a reader. For an editor, it seems useful for confirming an edit, given that the magic words __NOINDEX__ and __INDEX__ don't seem to work in all namespaces. (However, it also seems strange that we allow anonymous users to modify the robots policy, as that seems like a vector for attack.)

Another thing to note is that the robots policy varies by action type. For example, on action=info the robots policy is noindex,nofollow,max-image-preview:standard, while on a view action it's max-image-preview:standard (which means use the defaults). So the presence of this field, and the information it provides on the action=info table, is just plain confusing and misleading.

  • I'd argue that &action=info is mainly for technical users, not readers or editors. A reader or editor is rarely going to need the page ID, for example.
  • I think "Indexing by robots" refers to whether the page presents a noindex property or not after all other parsing and processing. That seems like the most intuitive reading of it to me. If this is for some reason ambiguous or unclear, perhaps it should be renamed.
  • I think that if we assume the second bullet above to be true, then this ticket is a clear bug and should be fixed.

I used the term "reader" above to mean the actor reading that output; the user story is that this actor expects the message to reflect the final state.

I think "Indexing by robots" refers to whether the page presents a noindex property or not after all other parsing and processing. That seems like the most intuitive reading of it to me. If this is for some reason ambiguous or unclear, perhaps it should be renamed.

Right, but it's not doing that, and I don't think it's currently possible to do, given that the policy varies per ?action and is decided via a hook. To report it accurately, the info page would need to pretend to render the article view in order to extract that information. This is not an issue inside the PageTriage extension but an issue inside MediaWiki core.

Making sure we are all talking about the same output here:

image.png (568×831 px, 48 KB)

On the above, it seems quite reasonable that someone reading it would assume it is the value that will be made available to these 'robots' on future reads. Which extension or component produced it isn't relevant to that user story. If this is the output of only one thing, its label (pageinfo-robot-policy) should be changed so it does not appear to be authoritative.

"Indexing by robots" is from MediaWiki:Pageinfo-robot-policy.
I tried to link it with this:

[[Wikipedia:Controlling search engine indexing|Indexing by robots]]

Unfortunately, page information just displayed the code as text without making a link so I deleted it again. Is there a way to tell whether a MediaWiki message allows wikitext without trying it on a live installation or diving into MediaWiki's source code?

Is there a way to tell whether a MediaWiki message allows wikitext without trying it on a live installation or diving into MediaWiki's source code?

Not that I'm aware of. That decision is made at the time it's used. It's also possible to sometimes parse a message and sometimes render it as text if it's used in multiple places.

I really like the idea of linking here for more context, but perhaps it's better to add more structure to this and make it a generalized feature?


In this case, pageinfo-robot-policy would have a corresponding message pageinfo-robot-policy-more-info-url, which would be used to create the information link.
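For illustration, the i18n entries might look like this. pageinfo-robot-policy is a real core message; the -more-info-url key is the proposed addition, and the URL is only an example:

```json
{
	"pageinfo-robot-policy": "Indexing by robots",
	"pageinfo-robot-policy-more-info-url": "https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Page_information"
}
```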

Thanks Jdlrobson. I don't know the implementation part, but for editors I think it would be simpler and more flexible to allow wikitext. shows other examples of links, but some of them are only on part of the text and wouldn't make proper sense on the whole text. shows that MediaWiki makes links on "Number of redirects to this page" and "Number of subpages of this page". If wikitext is allowed in the other fields then I suppose it could potentially cause a conflict later if MediaWiki ever tries to add its own links there without replacing the message name.

True. Looks like there is prior art there. Feel free to submit a patch for making it parse the message consistently with pageinfo-subpages-name. I'll happily review.

I don't know PHP and cannot make patches but I would certainly prefer a wikitext solution. Looking at, maybe we shouldn't worry about future conflicts if wikitext is allowed. The two fields with links have the link text in message names ending in -name: pageinfo-redirects-name and pageinfo-subpages-name. None of the other message names are like that. Indexing by robots is pageinfo-robot-policy. If customized wikitext is allowed and MediaWiki later wants to add its own link there, can it be expected to use a new message pageinfo-robot-policy-name for the link text, and ignore a customized pageinfo-robot-policy?

Change 892028 abandoned by MPGuy2824:

[mediawiki/extensions/PageTriage@master] Correcting the indexable property shown in '?action=info'


Consensus seems to be against doing it this way.

There are some interesting comments on the gerrit patch. Anyone taking this up should go through them.