
unconnected_pages generator doesn't seem to return all pages
Open, High, Public

Description

One of my bots didn't run for a while, so I had to catch up on the unconnected pages (see https://nl.wikipedia.org/w/index.php?title=Speciaal:OngekoppeldePaginas&limit=5000&offset=0&namespace=0 ). I use the -unconnectedpages command-line option (https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/pywikibot/pagegenerators.py$985 ). I noticed that when I run "python pwb.py touch.py -lang:nl -family:wikipedia -namespaces:0 -unconnectedpages", I don't get all pages.

The generator uses site.unconnected_pages ( https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/pywikibot/site.py$6803 ), which via https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/pywikibot/site.py$1915 gets an api.PageGenerator ( https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/pywikibot/data/api.py$2971 ), which is a subclass of QueryGenerator ( https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/pywikibot/data/api.py$2568 ).

I think the continue handling is going wrong here. Have a look at https://nl.wikipedia.org/w/api.php?action=query&list=querypage&qppage=UnconnectedPages&format=json&qpoffset=10 . qpoffset is used for paging and as the continue parameter. I don't think we're using qpoffset.

More info at https://www.mediawiki.org/wiki/API:Querypage
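
For illustration, here is a minimal sketch of how continuation via qpoffset could be followed for list=querypage. It is not Pywikibot's implementation. The names iter_querypage and make_fake_fetch are hypothetical, the fetch callable stands in for a real HTTP request, and the response shape (a "continue" object carrying the next qpoffset, rows under query.querypage.results) is an assumption based on the API:Querypage page linked above.

```python
def iter_querypage(fetch, qppage, limit=10):
    """Yield result rows from list=querypage, following API continuation.

    `fetch` is any callable that takes a dict of API parameters and
    returns the parsed JSON response (assumed interface, not Pywikibot's).
    """
    params = {
        "action": "query",
        "list": "querypage",
        "qppage": qppage,
        "qplimit": limit,
        "format": "json",
    }
    while True:
        data = fetch(dict(params))
        yield from data.get("query", {}).get("querypage", {}).get("results", [])
        if "continue" not in data:
            break
        # Carry over every continuation value, including the next qpoffset.
        params.update(data["continue"])


def make_fake_fetch(titles):
    """Hypothetical offline stand-in for a request to api.php."""
    def fetch(params):
        offset = int(params.get("qpoffset", 0))
        limit = int(params["qplimit"])
        chunk = titles[offset:offset + limit]
        data = {
            "query": {
                "querypage": {
                    "name": params["qppage"],
                    "results": [{"ns": 0, "title": t} for t in chunk],
                }
            }
        }
        if offset + limit < len(titles):
            # More rows remain: tell the client where to resume.
            data["continue"] = {"qpoffset": offset + limit, "continue": "-||"}
        return data
    return fetch


TITLES = [f"Page {i}" for i in range(25)]
ROWS = list(iter_querypage(make_fake_fetch(TITLES), "UnconnectedPages"))
print(len(ROWS))
```

With 25 fake titles and a batch size of 10, the loop issues three requests (qpoffset 0, 10, 20) and stops when the response no longer carries a "continue" object.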

Event Timeline

I think the continue handling is going wrong here

That's very likely. I recently tried to fix a continue handling issue in PropertyGenerator, see T196876, but similar issues might also exist in other generators. If so, we should look for a more general solution.

Xqt triaged this task as High priority. Sep 23 2018, 7:16 PM

If all the generators listed on https://www.mediawiki.org/wiki/API:Querypage follow the same logic in Pywikibot as the unconnected pages one, then probably all of them have the same issue. Maybe make a new subclass QueryPageGenerator that wraps around https://www.mediawiki.org/wiki/API:Querypage ?

qpoffset is used; the issue is the same as T173293.

The last request that fetches data is

pywikibot.data.api.Request<wikipedia:nl->'/w/api.php?gqppage=UnconnectedPages&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=querypage&action=query&indexpageids=&continue=gqpoffset||userinfo&gqplimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gqpoffset=10000'>

while

pywikibot.data.api.Request<wikipedia:nl->'/w/api.php?gqppage=UnconnectedPages&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=querypage&action=query&indexpageids=&continue=gqpoffset||userinfo&gqplimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gqpoffset=10500'>

yields

{'batchcomplete': '', 'query': {'querypage': {'name': 'UnconnectedPages'}}, 'limits': {'imageinfo': 500}}

and no more data can be fetched.
I do not know what sets the limit of 10000 in the API.
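
To make the comparison of the two requests concrete, the following sketch parses both query strings (copied from the requests quoted in this comment) and reports which parameters differ. The helper names query_params and changed_params are hypothetical; only the standard library is used.

```python
from urllib.parse import parse_qs, urlsplit

# Request paths copied from the two Request reprs quoted in this comment.
LAST_WITH_DATA = (
    "/w/api.php?gqppage=UnconnectedPages&prop=info|imageinfo|categoryinfo"
    "&inprop=protection"
    "&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max"
    "&generator=querypage&action=query&indexpageids="
    "&continue=gqpoffset||userinfo&gqplimit=500&meta=userinfo"
    "&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gqpoffset=10000"
)
FIRST_EMPTY = (
    "/w/api.php?gqppage=UnconnectedPages&prop=info|imageinfo|categoryinfo"
    "&inprop=protection"
    "&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max"
    "&generator=querypage&action=query&indexpageids="
    "&continue=gqpoffset||userinfo&gqplimit=500&meta=userinfo"
    "&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gqpoffset=10500"
)


def query_params(url):
    """Return the query string as a flat dict, keeping blank values."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


def changed_params(first, second):
    """Map each differing parameter to its (first, second) value pair."""
    a, b = query_params(first), query_params(second)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


DIFF = changed_params(LAST_WITH_DATA, FIRST_EMPTY)
print(DIFF)
```

The only parameter that changes between the two requests is gqpoffset, so the continuation value is being sent; the empty response comes from the server side.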

So you only get the pages with namespace=0 that are among the 10500 pages yielded.
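
The effect can be sketched with a small offline simulation. Everything here is hypothetical: the function simulate, the toy numbers, and the way the cap is modelled (no rows returned once the offset exceeds max_offset). What actually sets the limit in the API is not known from this thread. The point is only that filtering by namespace on the client cannot recover pages the server never returned.

```python
def simulate(total, batch, max_offset):
    """Page through `total` fake rows, with the server refusing offsets
    above `max_offset`, then filter for namespace 0 on the client side.

    Returns (rows seen, namespace-0 rows seen, namespace-0 rows existing).
    """
    # Hypothetical data: even ids are in namespace 0, odd ids in namespace 4.
    pages = [{"id": i, "ns": 0 if i % 2 == 0 else 4} for i in range(total)]

    seen = []
    offset = 0
    while offset <= max_offset:
        chunk = pages[offset:offset + batch]
        if not chunk:
            break
        seen.extend(chunk)
        offset += batch

    wanted = [p for p in pages if p["ns"] == 0]
    got = [p for p in seen if p["ns"] == 0]
    return len(seen), len(got), len(wanted)


# Toy scale: batch 5 and max offset 20 mirror gqplimit=500 and offset 10000.
SEEN, GOT, WANTED = simulate(total=40, batch=5, max_offset=20)
print(SEEN, GOT, WANTED)
```

In this toy run the client sees 25 of 40 rows and therefore finds 13 of the 20 namespace-0 pages; the remaining 7 sit beyond the cut-off and are never yielded.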