SearchPageGenerator returns non-existing pages when the search API returns cross-wiki results
Open, Needs Triage, Public

Description

Wikipedia's internal search returns file and file-page matches from Commons even if the file description page does not exist on the local wiki. SearchPageGenerator returns these, which causes strange behavior (e.g. the search-and-replace script throws a series of errors about non-existent pages). There should probably be an option to filter these out.

Tgr created this task. Nov 22 2015, 5:48 AM
Tgr updated the task description.
Tgr raised the priority of this task to Needs Triage.
Tgr added subscribers: Tgr, binbot.
jayvdb added a subscriber: jayvdb.

SearchPageGenerator uses API search.

I see File pages from Commons appearing when I search for "file:...", but not without file:

https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=file:Campos%20de%20Cari%C3%B1ena,%20Espa%C3%B1a
https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=Campos%20de%20Cari%C3%B1ena,%20Espa%C3%B1a

Is "file:..." required to trigger this behaviour, or can it happen other ways?

IMO they should never have been included in the API results by default; the caller should have to explicitly request that the API search module include pages which do not exist (and never have existed) on the wiki.

Any idea what release the API search module started including non-local file pages by default?

Pywikibot should definitely have a way to remove these from the generated list of search results.

Is it possible to call the API so that these items are not included at all? I looked into the API parameter interwiki, which is disabled by default, but it looks like that isn't used for these File: pages.

Otherwise Pywikibot probably needs to filter out page records that include the missing flag.
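A minimal sketch of that client-side filter, not Pywikibot's actual code. It assumes each page record is a dict shaped like an entry in the API's JSON query result, and that a page with no local copy carries a "missing" key. The function name `drop_missing` and the sample records are hypothetical.

```python
def drop_missing(pages):
    """Return only the page records that do not carry the 'missing' flag."""
    return [page for page in pages if "missing" not in page]


# Hypothetical records, shaped like entries in an API query result.
sample_pages = [
    {"pageid": 123, "ns": 0, "title": "Example article"},
    # Assumed shape of a record for a file that only exists on Commons.
    {"ns": 6, "title": "File:Example.jpg", "missing": "", "known": ""},
]

local_pages = drop_missing(sample_pages)
print([page["title"] for page in local_pages])
```

Testing for the presence of the key rather than its value keeps the check independent of how the flag is serialized.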

Anomie added a subscriber: Anomie. Dec 9 2015, 5:04 PM

SearchPageGenerator uses API search.

I see File pages from Commons appearing when I search for "file:...", but not without file:

https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=file:Campos%20de%20Cari%C3%B1ena,%20Espa%C3%B1a
https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=Campos%20de%20Cari%C3%B1ena,%20Espa%C3%B1a

Is "file:..." required to trigger this behaviour, or can it happen other ways?

Add gsrnamespace=6, or a list of namespaces including 6.
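For illustration, a request with an explicit namespace list could be built as follows. This is only a sketch: the helper name `search_url` is hypothetical, and the other parameters are copied from the URLs quoted above.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def search_url(query, namespaces):
    """Build a generator=search URL restricted to the given namespace ids."""
    params = {
        "action": "query",
        "generator": "search",
        "gsrsearch": query,
        # Multiple values are pipe-separated; urlencode escapes "|" as %7C.
        "gsrnamespace": "|".join(str(ns) for ns in namespaces),
    }
    return API_URL + "?" + urlencode(params)


# Search the main namespace (0) together with the File namespace (6).
print(search_url("Campos de Cariñena, España", [0, 6]))
```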

IMO they should never have been included in the API results by default; the caller should have to explicitly request that the API search module include pages which do not exist (and never have existed) on the wiki.

action=search is intended to work equivalently to Special:Search. And the search result does exist on the wiki, thanks to the way shared files work.

Any idea what release the API search module started including non-local file pages by default?

Probably due to the CirrusSearch extension.

SearchPageGenerator uses API search.

I see File pages from Commons appearing when I search for "file:...", but not without file:

https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=file:Campos%20de%20Cari%C3%B1ena,%20Espa%C3%B1a
https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=Campos%20de%20Cari%C3%B1ena,%20Espa%C3%B1a

Is "file:..." required to trigger this behaviour, or can it happen other ways?

Add gsrnamespace=6, or a list of namespaces including 6.

Thanks. Is that the only case where non-local pages are included in the results (by default)?

I checked GlobalUserPage, and they don't appear in search results. I'm interested in determining the scope and likely impact of this bug, and in looking for workarounds. The bug is already being resolved by implementing ExistingPageBot as a base class of the scripts (this hasn't been done for replace.py yet).
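The idea behind that approach, skipping pages that do not exist locally before a script touches them, can be sketched as a generator wrapper. This is not ExistingPageBot's implementation; it only assumes that each page object offers an `exists()` method, as Pywikibot page objects do. `existing_only` and `FakePage` are hypothetical names used so the example runs without a wiki connection.

```python
def existing_only(pages):
    """Yield only the pages whose exists() check returns True."""
    for page in pages:
        if page.exists():
            yield page


class FakePage:
    """Hypothetical stand-in for a page object, for demonstration only."""

    def __init__(self, title, exists):
        self.title = title
        self._exists = exists

    def exists(self):
        return self._exists


results = [
    FakePage("Example article", True),
    FakePage("File:Only on Commons.jpg", False),
]
print([page.title for page in existing_only(results)])
```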

If this bug only occurs with File pages, and the pywikibot user explicitly requested file pages, it is quite an obscure bug, and the bug title and description need to be made more precise.

IMO they should never have been included in the API results by default; the caller should have to explicitly request that the API search module include pages which do not exist (and never have existed) on the wiki.

action=search is intended to work equivalently to Special:Search.

Of course it is desirable that all Special:Search results can be fetched by the API, but I hope and pray the API has a little more thought put into it, and strives for the higher goals of being an API instead of a UI.

An existing example where the API and UI differ: changes in $wgNamespacesToBeSearchedDefault are not reflected in the API. Search namespaces do not default to $wgNamespacesToBeSearchedDefault.
E.g. English Wikisource includes 102, 106 & 114 in the default namespaces in the UI, but that doesn't occur in the API:
https://en.wikisource.org/w/api.php?action=help&modules=query+search

Are you sure that the API search cannot have a parameter to control whether these non-local pages are included? It can be enabled by default, like the redirects default was flipped in v1.23.

It is very odd that Search API parameters interwiki and redirect exist, but these non-local files cannot be excluded...?

Thanks. Is that the only case where non-local pages are included in the results (by default)?

No idea, you'd have to ask someone who knows about CirrusSearch stuff.

I do know that there's code for a feature of searches returning interwiki results as a sidebar, and those are returned separately.

I checked GlobalUserPage, and they don't appear in search results.

I wonder if they'd consider that a bug?

An existing example where the API and UI differ: changes in $wgNamespacesToBeSearchedDefault are not reflected in the API. Search namespaces do not default to $wgNamespacesToBeSearchedDefault.

Hmm. If it wouldn't be a breaking change I'd consider fixing the srnamespace default to use that setting.

It is very odd that Search API parameters interwiki and redirect exist, but these non-local files cannot be excluded...?

'srinterwiki' controls whether that sidebar thing I mentioned is used. ApiQuerySearch doesn't have a 'redirect' parameter.

jayvdb added a subscriber: XZise. Edited Dec 9 2015, 11:48 PM

Thanks. Is that the only case where non-local pages are included in the results (by default)?

No idea, you'd have to ask someone who knows about CirrusSearch stuff.

I do know that there's code for a feature of searches returning interwiki results as a sidebar, and those are returned separately.

I checked GlobalUserPage, and they don't appear in search results.

I wonder if they'd consider that a bug?

An existing example where the API and UI differ: changes in $wgNamespacesToBeSearchedDefault are not reflected in the API. Search namespaces do not default to $wgNamespacesToBeSearchedDefault.

Hmm. If it wouldn't be a breaking change I'd consider fixing the srnamespace default to use that setting.

Hmm. It might only be the default mentioned in the help that is inaccurate, assuming the commit message of 53cd372ea9b is correct. (https://gerrit.wikimedia.org/r/#/c/159477/ doesn't contain any further discussion.) I do recall this being discussed on IRC with @XZise. There is probably some task which sparked that code.

Hmm. It might only be the default mentioned in the help that is inaccurate, assuming the commit message of 53cd372ea9b is correct. (https://gerrit.wikimedia.org/r/#/c/159477/ doesn't contain any further discussion.) I do recall this being discussed on IRC with @XZise. There is probably some task which sparked that code.

I've done a quick test on en.wikisource.
https://en.wikisource.org/w/index.php?search=Maxwell&title=Special%3ASearch&go=Go indicates it is searching (for me) in namespaces main (0), Author, Page, Index & Translation, and the first result is Author:James Clerk Maxwell, as expected.
However, https://en.wikisource.org/w/api.php?action=query&generator=search&gsrsearch=Maxwell doesn't appear to use my user preferences or site config, as the first result is Best Company v. Maxwell and the results do not include Author: pages.