
Pywikibot CategorizedPageGenerator yields more files than present in category
Open, Needs Triage · Public

Description

There are about 11K files in Category:License review needed, but the generator yields more than 20K files.

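# -*- coding: utf-8 -*-
import pywikibot
from pywikibot import pagegenerators


def main():
    # Default site taken from the local user configuration.
    site = pywikibot.Site()
    cat = pywikibot.Category(site, 'License_review_needed')
    # Iterate over every member of the category.
    gen = pagegenerators.CategorizedPageGenerator(cat)
    # Print a running count next to each title to compare with the category total.
    for count, page in enumerate(gen, start=1):
        file_name = page.title()
        print("%d - %s" % (count, file_name))


if __name__ == "__main__":
    try:
        main()
    finally:
        # Always release Pywikibot resources, even if main() raises.
        pywikibot.stopme()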

Run at https://repl.it/repls/SeashellHilariousActivecell#main.py
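
One way to check whether the surplus comes from repeated pages rather than extra category members is to tally the titles the generator yields. This is only a minimal sketch: `summarize_titles` and the sample list are hypothetical helpers for illustration, not part of Pywikibot, and it assumes the titles have been collected as plain strings.

```python
from collections import Counter


def summarize_titles(titles):
    """Return (total yielded, unique titles, {title: count} for repeats)."""
    counts = Counter(titles)
    total = sum(counts.values())
    repeated = {title: n for title, n in counts.items() if n > 1}
    return total, len(counts), repeated


# Hypothetical sample standing in for page.title() values from the generator.
sample = ["File:A.jpg", "File:B.jpg", "File:A.jpg"]
total, unique, repeated = summarize_titles(sample)
print(total, unique, repeated)
```

In the script above, the same function could be fed `(page.title() for page in gen)`; if `unique` is close to the category size while `total` is much larger, the generator is repeating pages.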

Event Timeline

Eatcha created this task. · Jul 15 2020, 9:33 AM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. · Jul 15 2020, 9:33 AM
Eatcha updated the task description. · Jul 15 2020, 9:48 AM
Mpaa added a subscriber: Mpaa. · Edited · Thu, Oct 1, 3:01 PM

I see:

Media in category "License review needed"
The following 200 files are in this category, out of 17,418 total. <-- this is probably cached

https://petscan.wmflabs.org/ gives 17406

and the script generates 17406

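I had to remove the metadata property in the code to avoid issues like T253591.
With the metadata property requested, I was getting repeated pages, probably because PageGenerator was receiving large metadata responses between queries. Note that the two requests logged below carry the same gcmcontinue and differ only in iicontinue: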

8768 8749 - (Category('Category:License review needed'), FilePage('File:Foolad FC vs Esteghlal FC, 24 June 2020 - 36.jpg'))
8769 8750 - (Category('Category:License review needed'), FilePage('File:Foolad FC vs Esteghlal FC, 24 June 2020 - 35.jpg'))
8770 /w/api.php?gcmtitle=Category:License+review+needed&gcmprop=ids|title|sortkey&gcmtype=page|file&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue=gcmcontinue||info|categoryinfo|userinfo&gcmlimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gcmcontinue=file|3039313338393437320a564c4144494d495220505554494e20494e204952414e20283135292e4a5047|91389472&iicontinue=Charles_O'Malley,_the_Irish_dragoon_(IA_charlesomalleyir00leve_0).pdf|20200627053136
8771 8751 - (Category('Category:License review needed'), FilePage('File:Vladimir Putin in Iran (15).jpg'))
8772 8752 - (Category('Category:License review needed'), FilePage('File:Vladimir Putin in Iran (14).jpg'))

9269 9249 - (Category('Category:License review needed'), FilePage('File:Foolad FC vs Esteghlal FC, 24 June 2020 - 36.jpg'))
9270 9250 - (Category('Category:License review needed'), FilePage('File:Foolad FC vs Esteghlal FC, 24 June 2020 - 35.jpg'))
9271 /w/api.php?gcmtitle=Category:License+review+needed&gcmprop=ids|title|sortkey&gcmtype=page|file&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue=gcmcontinue||info|categoryinfo|userinfo&gcmlimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gcmcontinue=file|3039313338393437320a564c4144494d495220505554494e20494e204952414e20283135292e4a5047|91389472&iicontinue=Dictionnaire_universel_de_la_langue_française_(IA_dictionnaireuniv00bois).pdf|20200623103453
9272 9251 - (Category('Category:License review needed'), FilePage('File:Vladimir Putin in Iran (15).jpg'))
9273 9252 - (Category('Category:License review needed'), FilePage('File:Vladimir Putin in Iran (14).jpg'))
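
The gcmcontinue value in both logged requests can be unpacked to see where the category listing was meant to resume. The sketch below assumes the value has the shape `type|hex-encoded sortkey|page id`, which is an interpretation of the logged string rather than a documented guarantee; `decode_gcmcontinue` is a hypothetical helper.

```python
# Value copied verbatim from the logged requests above.
GCMCONTINUE = (
    "file|3039313338393437320a564c4144494d495220505554494e20494e"
    "204952414e20283135292e4a5047|91389472"
)


def decode_gcmcontinue(value):
    """Split an assumed 'type|hex sortkey|page id' token and decode the hex part."""
    member_type, hex_sortkey, page_id = value.split("|")
    sortkey = bytes.fromhex(hex_sortkey).decode("utf-8")
    return member_type, sortkey, int(page_id)


print(decode_gcmcontinue(GCMCONTINUE))
# → ('file', '091389472\nVLADIMIR PUTIN IN IRAN (15).JPG', 91389472)
```

The decoded sortkey names the same file, Vladimir Putin in Iran (15), that follows each logged request, so both requests resume the category listing at the same member.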

https://commons.wikimedia.org/w/api.php?gcmtitle=Category:License+review+needed&gcmprop=ids|title|sortkey&gcmtype=page|file&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue=gcmcontinue||info|categoryinfo|userinfo&gcmlimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gcmcontinue=file|3039313338393437320a564c4144494d495220505554494e20494e204952414e20283135292e4a5047|91389472&iicontinue=Charles_O'Malley,_the_Irish_dragoon_(IA_charlesomalleyir00leve_0).pdf|20200627053136

https://commons.wikimedia.org/w/api.php?gcmtitle=Category:License+review+needed&gcmprop=ids|title|sortkey&gcmtype=page|file&prop=info|imageinfo|categoryinfo&inprop=protection&iiprop=timestamp|user|comment|url|size|sha1|metadata&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue=gcmcontinue||info|categoryinfo|userinfo&gcmlimit=500&meta=userinfo&uiprop=blockinfo|hasmsg&maxlag=5&format=json&gcmcontinue=file|3039313338393437320a564c4144494d495220505554494e20494e204952414e20283135292e4a5047|91389472&iicontinue=Dictionnaire_universel_de_la_langue_française_(IA_dictionnaireuniv00bois).pdf|20200623103453
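
To confirm which parameters actually differ between the two requests, the query strings can be parsed and compared. This is a rough sketch: `differing_params` is a hypothetical helper, and only the continuation parameters are copied here, since the remaining parameters are identical in both URLs.

```python
from urllib.parse import parse_qs

# Continuation parameters copied from the two request URLs above.
GCM = (
    "gcmcontinue=file|3039313338393437320a564c4144494d495220505554494e20494e"
    "204952414e20283135292e4a5047|91389472"
)
FIRST = GCM + (
    "&iicontinue=Charles_O'Malley,_the_Irish_dragoon_"
    "(IA_charlesomalleyir00leve_0).pdf|20200627053136"
)
SECOND = GCM + (
    "&iicontinue=Dictionnaire_universel_de_la_langue_française_"
    "(IA_dictionnaireuniv00bois).pdf|20200623103453"
)


def differing_params(query_a, query_b):
    """Return the sorted names of parameters whose values differ."""
    a, b = parse_qs(query_a), parse_qs(query_b)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


print(differing_params(FIRST, SECOND))
# → ['iicontinue']
```

Only iicontinue changes, which fits the observation that the same batch of category members is returned again while the image metadata is still being paged through.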