
page is a redirect but not a redirect
Closed, Resolved, Public

Description

When attempting to get the redirects to a page using page.backlinks(filterRedirects=True), IsNotRedirectPage is raised. However, the page that the exception claims is not a redirect is in fact a redirect.

page.getReferences(redirectsOnly=True) gives the same exception.

page.getReferences(follow_redirects=False, redirectsOnly=True) yields the same page twice.

Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pywikibot
>>> site = pywikibot.Site('en', 'wikipedia')
>>> page = pywikibot.Page(site, 'File:1979–80 National Football League (Ireland) final.jpg')
>>> page.backlinks(filterRedirects=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared/pywikipedia/core/pywikibot/tools/__init__.py", line 1423, in wrapper
    return obj(*__args, **__kw)
  File "/shared/pywikipedia/core/pywikibot/page.py", line 1062, in backlinks
    content=content
  File "/shared/pywikipedia/core/pywikibot/site.py", line 3649, in pagebacklinks
    if redir.getRedirectTarget() == page:
  File "/shared/pywikipedia/core/pywikibot/page.py", line 1664, in getRedirectTarget
    return self.site.getredirtarget(self)
  File "/shared/pywikipedia/core/pywikibot/site.py", line 3166, in getredirtarget
    raise IsNotRedirectPage(page)
pywikibot.exceptions.IsNotRedirectPage: Page [[en:File:1979-80 National Football League (Ireland) final.jpg]] is not a redirect page.
>>> 
>>> page2 = pywikibot.Page(site, 'File:1979-80 National Football League (Ireland) final.jpg')
>>> page2.isRedirectPage()
True
>>> page2.getRedirectTarget()
Page('File:1979–80 National Football League (Ireland) final.jpg')
>>> page3 = page2.getRedirectTarget()
>>> page3
Page('File:1979–80 National Football League (Ireland) final.jpg')
>>> page == page3
True
>>> 
>>> page.getReferences(redirectsOnly=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared/pywikipedia/core/pywikibot/tools/__init__.py", line 1423, in wrapper
    return obj(*__args, **__kw)
  File "/shared/pywikipedia/core/pywikibot/page.py", line 1038, in getReferences
    content=content
  File "/shared/pywikipedia/core/pywikibot/tools/__init__.py", line 1423, in wrapper
    return obj(*__args, **__kw)
  File "/shared/pywikipedia/core/pywikibot/site.py", line 3718, in pagereferences
    namespaces=namespaces, content=content),
  File "/shared/pywikipedia/core/pywikibot/site.py", line 3649, in pagebacklinks
    if redir.getRedirectTarget() == page:
  File "/shared/pywikipedia/core/pywikibot/page.py", line 1664, in getRedirectTarget
    return self.site.getredirtarget(self)
  File "/shared/pywikipedia/core/pywikibot/site.py", line 3166, in getredirtarget
    raise IsNotRedirectPage(page)
pywikibot.exceptions.IsNotRedirectPage: Page [[en:File:1979-80 National Football League (Ireland) final.jpg]] is not a redirect page.
>>> 
>>> list(page.getReferences(follow_redirects=False, redirectsOnly=True))
[FilePage('File:1979-80 National Football League (Ireland) final.jpg'), FilePage('File:1979-80 National Football League (Ireland) final.jpg')]

Note: The only difference between page and page2 is an en dash (–) versus a hyphen (-) in the title.
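
An illustrative check of the single differing character (a standalone snippet, not part of the session above):

import unicodedata

# The two titles differ in exactly one position.
t1 = 'File:1979–80 National Football League (Ireland) final.jpg'  # en dash
t2 = 'File:1979-80 National Football League (Ireland) final.jpg'  # hyphen
pairs = [(a, b) for a, b in zip(t1, t2) if a != b]
print(pairs)                          # [('–', '-')]
print(unicodedata.name(pairs[0][0]))  # EN DASH
print(unicodedata.name(pairs[0][1]))  # HYPHEN-MINUS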

Event Timeline

Xqt triaged this task as High priority. May 9 2018, 5:48 PM
Xqt removed Xqt as the assignee of this task. May 10 2018, 4:44 PM
Xqt subscribed.

It is even more curious:

>>> import pwb, pywikibot as py
>>> s = py.Site('en')
>>> p1 = py.Page(s, u'File:1979–80 National Football League (Ireland) final.jpg')
>>> for p in p1.backlinks(followRedirects=False, filterRedirects=True):
	print p.isRedirectPage(), p.pageid
	

True 51399802
False 51399802
>>>

This means we get two identical entries from the backlinks API generator, but different results for isRedirectPage().

zhuyifei1999 added a subscriber: Anomie.

prop=imageinfo is following the redirect but prop=info is not, causing a discrepancy between how MediaWiki and Pywikibot interpret the data.

We can set iilimit to a very high value, or have some sort of 'strict' mode for prop=imageinfo, or... what's the best way to resolve this, @Anomie?
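
If the behaviour described above holds (imageinfo silently resolving the redirect while info does not), it should be observable with a plain API call; a sketch using requests, with the hyphen title from the description (the exact response shape may vary):

import requests

API = 'https://en.wikipedia.org/w/api.php'
# The hyphen spelling, i.e. the redirect page from the description.
title = 'File:1979-80 National Football League (Ireland) final.jpg'

data = requests.get(API, params={
    'action': 'query', 'format': 'json', 'titles': title,
    'prop': 'info|imageinfo', 'iiprop': 'timestamp|sha1',
}).json()
page = next(iter(data['query']['pages'].values()))
print('redirect' in page)     # prop=info: flags the page as a redirect
print(page.get('imageinfo'))  # prop=imageinfo: data for the redirect target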

prop=imageinfo is showing the data for the image associated with File:1979-80 National Football League (Ireland) final.jpg. MediaWiki's logic for finding "the image associated with a title" follows redirects behind the scenes.

So what happens here at the API level is:

  • The first query returns the data for info, categoryinfo, and userinfo, and also the imageinfo for the 2017-02-07T20:14:37Z upload.
  • The second query returns only imageinfo for the 2016-08-23T10:52:33Z upload. No info, no categoryinfo, and no userinfo.

The intention is that the results of the two queries should be merged together so that the client has the full picture of the information available for the page. Pywikibot instead incorrectly assumes that the second query alone gives some sort of full picture.

It doesn't help that the imageinfo module is crufty and needs a rewrite to behave a bit more sanely, i.e. to have "current file version for a set of pages" and "all file versions for one page" modes like prop=revisions does.

The most correct thing for pywikibot's "backlinks" generator to do is to keep continuing the query and merging the result sets until it gets the batchcomplete flag in the response.
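
A minimal sketch of what such a merge loop could look like, using a plain requests client rather than Pywikibot's actual generator machinery (all names here are illustrative):

import requests

API = 'https://en.wikipedia.org/w/api.php'

def merged_batches(params):
    """Yield page dicts, but only once their batch is complete.

    Results from successive continuation requests are merged per pageid
    until the response carries the 'batchcomplete' flag.
    """
    base = dict(params, action='query', format='json')
    cont = {}
    pages = {}
    while True:
        data = requests.get(API, params={**base, **cont}).json()
        for pageid, page in data.get('query', {}).get('pages', {}).items():
            merged = pages.setdefault(pageid, {})
            for key, value in page.items():
                if isinstance(value, list):
                    # e.g. extra imageinfo entries arriving in a later reply
                    merged.setdefault(key, []).extend(value)
                else:
                    merged.setdefault(key, value)
        if 'batchcomplete' in data:
            # Only now does every collected page have its full data.
            yield from pages.values()
            pages = {}
        cont = data.get('continue')
        if cont is None:
            break

for page in merged_batches({'generator': 'backlinks',
                            'gbltitle': 'File:Example.jpg',
                            'gblfilterredir': 'redirects',
                            'prop': 'info|imageinfo'}):
    print(page.get('title'), 'redirect' in page)

With a loop like this, the two duplicate FilePage entries from the description would collapse into a single record carrying both the info and the imageinfo data.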

> The most correct thing for pywikibot's "backlinks" generator to do is to keep continuing the query and merging the result sets until it gets the batchcomplete flag in the response.

Merging until 'batchcomplete' is not realistic if a certain generator yields too many results (say, a generator for transclusions of a widely used template yields hundreds of thousands of results). Other methods I thought of, such as estimating the number of results that will be yielded for different logic, or merging until a certain pageid is gone from 'pageids', could make the logic in pywikibot unnecessarily complex.

@Xqt The easiest backwards-compatible 'workaround' that doesn't break the ZOI rule I could think of is to remove prop=imageinfo from pywikibot.data.api.PageGenerator. The framework, as I understand it, should reload the imageinfo from the API if it has not been pre-loaded but is explicitly requested from a bot's code. Does that sound sane to you, or is prop=imageinfo critical for some bots or the framework?

> Merging until 'batchcomplete' is not realistic if a certain generator yields too many results

No, batchcomplete is specifically designed to do the right thing with generators.

If you were to complain about prop=transcludedin rather than generator=transcludedin, you might have a point.

> No, batchcomplete is specifically designed to do the right thing with generators.

Oh right, I re-read the docs and see what you mean.

@Anomie While testing how batchcomplete interacts with pageid, I tested the same query for File:Example.jpg on commons, which contains many redirects and multiple file revisions, but I got batchcomplete with only a single imageinfo for each file name:
https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&maxlag=5&prop=info%7Cimageinfo%7Ccategoryinfo&meta=userinfo&indexpageids=1&continue=&generator=backlinks&inprop=protection&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata&uiprop=blockinfo%7Chasmsg&gbltitle=File%3AExample.jpg&gblfilterredir=redirects&gbllimit=500
I was expecting a continue for the imageinfo query. Is it expected that there is no continuation?

One idea would be to raise iilimit to max instead of using the default of 1 here. I guess we don't reach the limit of 5000 imageinfo results for a bot. That would be good enough until the imageinfo module is rewritten (T89971).
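
As a sketch, only the request parameters would change (following the API sandbox link above; iilimit='max' is an assumption about what the workaround would look like):

params = {
    'action': 'query',
    'format': 'json',
    'generator': 'backlinks',
    'gbltitle': 'File:Example.jpg',
    'gblfilterredir': 'redirects',
    'gbllimit': 500,
    'prop': 'info|imageinfo',
    'iiprop': 'timestamp|user|sha1',
    'iilimit': 'max',  # instead of the default 1; 500 (5000 for bots)
}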

It's a bug in Pywikibot to not handle 'batchcomplete' properly anyhow.

> It's a bug in Pywikibot to not handle 'batchcomplete' properly anyhow.

Sure, and it is caused by the imageinfo content, but currently it could easily be solved by reading that list at once instead of step by step, as long as we don't use a separate generator for it.

> @Anomie While testing how batchcomplete interacts with pageid, I tested the same query for File:Example.jpg on commons, which contains many redirects and multiple file revisions, but I got batchcomplete with only a single imageinfo for each file name:
> https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&maxlag=5&prop=info%7Cimageinfo%7Ccategoryinfo&meta=userinfo&indexpageids=1&continue=&generator=backlinks&inprop=protection&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata&uiprop=blockinfo%7Chasmsg&gbltitle=File%3AExample.jpg&gblfilterredir=redirects&gbllimit=500
> I was expecting a continue for the imageinfo query. Is it expected that there is no continuation?

... Ok, imageinfo is weirder than I thought. If there is one page produced by the generator (or specified directly), it continues. If there is more than one, it stops at the limit.

Change 432767 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Solve wrong redirect status found for redirect filter in backlinks

https://gerrit.wikimedia.org/r/432767

Change 432767 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Solve wrong redirect status found for redirect filter in backlinks

https://gerrit.wikimedia.org/r/432767

Change 433962 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Don't patch the site.unconnected_pages

https://gerrit.wikimedia.org/r/433962

Change 433962 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Don't patch the site.unconnected_pages

https://gerrit.wikimedia.org/r/433962

Vvjjkkii renamed this task from page is a redirect but not a redirect to ybdaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Xqt as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.