
pagegenerator -unconnectedpages skips items older than 3 years
Open, High, Public

Description

On the page https://fr.wiktionary.org/wiki/Spécial:UnconnectedPages?limit=5000&namespace=14 there are currently thousands of pages.

C:\pwb>pwb.py newitem -lang:fr -family:wiktionary -unconnectedpages -namespace:14 -touch -pageage:7
Page age is set to 7 days so only pages created
before 2017-08-07T05:22:07Z will be considered.
Last edit is set to 7 days so only pages last edited
before 2017-08-07T05:22:07Z will be considered.
Retrieving 18 pages from wiktionary:fr.

But there are 22 unconnected pages.
And there are thousands of connected pages needing a purge; the newest of them has a creation timestamp of
6 août 2014 à 09:37 (2014-08-06)

I tested it over a longer period, and the boundary between pages picked up by the pagegenerator and pages it ignored was somewhere between 2014-08-06 and 2014-08-22, so I suppose the limit is 3 years.

Event Timeline

JAnD created this task. Aug 14 2017, 5:35 AM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Aug 14 2017, 5:35 AM
Mpaa added a subscriber: Mpaa. Edited Aug 16 2017, 10:12 PM

I am not sure I get your point. Let me see if I get it right.

This query returns 1400+ Categories:
https://fr.wiktionary.org/wiki/Sp%C3%A9cial:UnconnectedPages?limit=5000&namespace=14

When you run the script below, you get 20+ pages
C:\pwb>pwb.py newitem -lang:fr -family:wiktionary -unconnectedpages -namespace:14 -touch -pageage:7

IMO, the explanation is that pywikibot, in this case, filters on namespace=14 only after retrieving pages of all sorts.
The API returns at most 10000 pages, and 22 of them are categories.
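The effect described above can be illustrated with a small sketch. This is not pywikibot's actual code: `fetch_capped`, `filter_namespace`, the sample page list and the tiny cap of 8 are all hypothetical stand-ins for the capped query-page result and the client-side namespace filter.

```python
from itertools import islice

CATEGORY_NS = 14  # namespace number of categories


def fetch_capped(all_pages, cap):
    """Yield at most `cap` pages, mimicking a capped query-page result."""
    return islice(all_pages, cap)


def filter_namespace(pages, namespace):
    """Client-side filter: keep only titles in the given namespace."""
    return [title for ns, title in pages if ns == namespace]


# Hypothetical data: 20 unconnected pages, every fourth one is a category.
all_pages = [(CATEGORY_NS if i % 4 == 3 else 0, f"Page {i}") for i in range(20)]

# Categories that exist in total versus categories that survive the cap.
total_categories = filter_namespace(all_pages, CATEGORY_NS)
seen_categories = filter_namespace(fetch_capped(all_pages, cap=8), CATEGORY_NS)

print(len(total_categories), len(seen_categories))
```

Because the filter runs only on what was fetched, any category that sits beyond the cap is never seen, no matter how many exist on the wiki.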

This is the request done by pywikibot. You can try to increase the gqpoffset and see the returned data:
pywikibot.data.api.Request<wiktionary:fr->'/w/api.php?gqppage=UnconnectedPages&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&generator=querypage&action=query&indexpageids=&continue=gqpoffset||userinfo&gqplimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gqpoffset=500

Or try to replicate it in the API Sandbox:
https://fr.wiktionary.org/wiki/Sp%C3%A9cial:ApiSandbox#action=query&format=json&prop=info&list=&continue=gqpoffset%7C%7C&generator=querypage&inprop=&intestactions=&gqppage=UnconnectedPages&gqplimit=500
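For stepping through the offsets without the sandbox, the request URLs can be generated with a short script. This is only a sketch: it uses a reduced subset of the parameters from the request above, `build_query_url` is a hypothetical helper, and the full endpoint URL is assumed from the `/w/api.php` path on fr.wiktionary. It only builds the URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint, derived from the '/w/api.php' path in the request above.
API_ENDPOINT = "https://fr.wiktionary.org/w/api.php"


def build_query_url(offset, limit=500):
    """Build a querypage-generator URL for UnconnectedPages at a given offset."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "querypage",
        "gqppage": "UnconnectedPages",
        "gqplimit": limit,
        "gqpoffset": offset,
        "prop": "info",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# One URL per batch of 500 results, for the first four batches.
urls = [build_query_url(offset) for offset in range(0, 2000, 500)]
for url in urls:
    print(url)
```

Opening these URLs one after another shows which namespaces appear in each batch.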

You can search for Categories also here:
https://fr.wiktionary.org/w/index.php?title=Sp%C3%A9cial:UnconnectedPages&limit=5000

Mpaa added a subscriber: Anomie. Aug 17 2017, 8:12 AM

@Anomie, am I correct?
Is there a way to get from the generator only pages in a given namespace?

ApiQueryQueryPage does not support additional parameters that a random query page might use. That includes the "namespace" parameter being used there.

If you're willing to ignore an "Unrecognized parameter: namespace." warning, it does seem to currently work to pass the parameter to the API query anyway (example). I can't promise that will continue working, though, or that if it breaks that it will get fixed.
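A minimal sketch of that workaround, under stated assumptions: `build_namespace_query_url` and `unexpected_warnings` are hypothetical helpers, the endpoint URL is assumed, and the response layout (`{"warnings": {"main": {"*": "..."}}}`) is an assumption about the legacy JSON format, not something confirmed in this task. The sample response is made up for illustration, and the script does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint for fr.wiktionary.
API_ENDPOINT = "https://fr.wiktionary.org/w/api.php"
EXPECTED_WARNING = "Unrecognized parameter: namespace."


def build_namespace_query_url(namespace, limit=500):
    """Build a querypage URL that also passes the unsupported 'namespace' parameter."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "querypage",
        "gqppage": "UnconnectedPages",
        "gqplimit": limit,
        "namespace": namespace,  # not officially supported; may stop working
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def unexpected_warnings(response):
    """Return all warning texts except the one we chose to ignore."""
    texts = []
    # Assumed layout: {"warnings": {"<module>": {"*": "<text>"}}}
    for module_warnings in response.get("warnings", {}).values():
        text = module_warnings.get("*", "")
        if text and text != EXPECTED_WARNING:
            texts.append(text)
    return texts


# Made-up response carrying only the warning we expect to see.
sample_response = {"warnings": {"main": {"*": EXPECTED_WARNING}}}

url = build_namespace_query_url(14)
print(url)
print(unexpected_warnings(sample_response))
```

Filtering out only the one known warning keeps any other warning visible, which matters because this behaviour is not guaranteed to keep working.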

Xqt triaged this task as High priority. Sep 23 2018, 7:16 PM