
unconnected_pages generator doesn't seem to return all pages
Open, High, Public


One of my bots didn't run for a while, so I had to catch up on the unconnected pages (see ). I use the -unconnectedpages command-line option ($985). I noticed that when I run "python -lang:nl -family:wikipedia -namespaces:0 -unconnectedpages", I don't get all pages.

The generator uses site.unconnected_pages ($6803), which, using $1915, gets an api.PageGenerator ($2971), which is a subclass of QueryGenerator ($2568).

I think the continue handling is going wrong here. Have a look at . The qpoffset is used both for the paging and for the continue parameter, but I don't think we're actually using the qpoffset.
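To illustrate what correct continue handling looks like, here is a minimal sketch (hypothetical code, not Pywikibot's actual implementation) of paging through a querypage generator. The `fetch(params)` callable is a stand-in for the HTTP layer and is assumed to return the parsed JSON response. The key point is that every key from the API's `continue` block, including `gqpoffset`, must be merged into the next request:

```python
def querypage_pages(fetch, page="UnconnectedPages", limit=500):
    """Yield page dicts from a querypage generator, merging 'continue'.

    `fetch(params)` is a hypothetical helper that performs the API
    request and returns the decoded JSON response.
    """
    params = {"action": "query", "format": "json",
              "generator": "querypage", "gqppage": page,
              "gqplimit": limit}
    while True:
        data = fetch(dict(params))
        yield from data.get("query", {}).get("pages", {}).values()
        if "continue" not in data:
            break  # query exhausted
        # Crucial step: carry over EVERY continue value, including
        # gqpoffset; dropping it would re-fetch the first batch forever
        # or stop early -- the symptom described in this task.
        params.update(data["continue"])
```

If the generic continue code drops or overwrites `gqpoffset` before the next request is built, the paging silently stops after the first window of results.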

More info at

Event Timeline

I think the continue handling is going wrong here

That's very likely. I recently tried to fix a continue handling issue in PropertyGenerator, see T196876, but similar issues might also exist in other generators. If so, we should look for a more general solution.
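A more general solution could centralize the continue merging in one shared helper that every generator uses, so no generator-specific offset key is ever lost. A sketch of such a helper (hypothetical, not existing Pywikibot code):

```python
def merge_continue(params, response):
    """Return the parameters for the next API request, or None when
    the query is exhausted.

    Merges every key from the response's 'continue' block (gqpoffset,
    plcontinue, ...) into a copy of the current parameters, so each
    generator gets correct paging without special-casing its own key.
    """
    if "continue" not in response:
        return None  # no more data to fetch
    next_params = dict(params)
    next_params.update(response["continue"])
    return next_params
```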

Xqt triaged this task as High priority. Sep 23 2018, 7:16 PM

If all the generators listed on follow the same logic in Pywikibot as the unconnected pages one, then probably all of them have the same issue. Maybe make a new subclass QueryPageGenerator that wraps around ?
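The suggested wrapper could look something like the sketch below (a hypothetical class, not existing Pywikibot code): instead of relying on the generic continue handling, it advances `gqpoffset` itself between batches. As before, `fetch(params)` is an assumed stand-in for the HTTP layer returning parsed JSON:

```python
class QueryPageGenerator:
    """Hypothetical sketch of a generator wrapping a Special:QueryPage
    list, paging by explicitly advancing gqpoffset."""

    def __init__(self, fetch, special_page, limit=500):
        self.fetch = fetch  # hypothetical HTTP helper
        self.special_page = special_page
        self.limit = limit

    def __iter__(self):
        offset = 0
        while True:
            data = self.fetch({
                "action": "query",
                "generator": "querypage",
                "gqppage": self.special_page,
                "gqplimit": self.limit,
                "gqpoffset": offset,
            })
            yield from data.get("query", {}).get("pages", {}).values()
            if "continue" not in data:
                break  # the special page list is exhausted
            # Advance to the offset the API hands back for the
            # next batch, rather than trusting generic handling.
            offset = int(data["continue"]["gqpoffset"])
```

All querypage-backed generators could then share this one paging implementation instead of each re-deriving the continue logic.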

qpoffset is used; the issue is the same as in T173293.

The last request that fetches data (on wikipedia:nl) is:

/w/api.php?gqppage=UnconnectedPages&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=querypage&action=query&indexpageids=&continue=gqpoffset||userinfo&gqplimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gqpoffset=10000

and the response is:

{'batchcomplete': '', 'query': {'querypage': {'name': 'UnconnectedPages'}}, 'limits': {'imageinfo': 500}}

and no more data can be fetched.
I do not know what sets the limit of 10000 in the API.

So you only get the pages in namespace 0 that are among the 10500 pages yielded.