I was running a script that finds all files in a category recursively (that is, including all of its subcategories) using the -catr flag. I found that some pages are yielded more than once.
Here is a simple bot to recreate this:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from __future__ import absolute_import, unicode_literals

    import pywikibot
    from pywikibot import pagegenerators


    def main(*args):
        generator = None
        local_args = pywikibot.handle_args(args)
        site = pywikibot.Site('commons', 'commons')
        genFactory = pagegenerators.GeneratorFactory(site)
        for arg in local_args:
            genFactory.handleArg(arg)
        generator = genFactory.getCombinedGenerator(gen=generator)
        if not generator:
            pywikibot.bot.suggest_help(missing_generator=True)
        else:
            pregenerator = pagegenerators.PreloadingGenerator(generator)
            site.login()
            old_pages = set()
            for i, page in enumerate(pregenerator):
                if page.exists() and not page.isRedirectPage():
                    pywikibot.output(str(i) + '. ' + page.title())
                    if page.title() in old_pages:
                        print('Found ' + page.title() + ' more than once.')
                        return
                    old_pages.add(page.title())
                    # _ = str(raw_input("More ?"))


    if __name__ == "__main__":
        main()
When I run it with python name.py -catr:Ogg_sound_files, it stops at around item 2200 on the file File:Thalie Envolée - Charles Baudelaire - La beauté.oga, which appears at positions 2165 and 2200.
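To see every position at which a title recurs, instead of stopping at the first repeat, the positions can be collected per title. This is only a minimal sketch: `duplicate_positions` is a hypothetical helper, not part of pywikibot, and the sample titles are made up for illustration.

```python
from collections import defaultdict


def duplicate_positions(titles):
    """Map each repeated title to the list of indexes at which it was seen.

    Hypothetical helper for illustration; it accepts any iterable of strings.
    """
    positions = defaultdict(list)
    for index, title in enumerate(titles):
        positions[title].append(index)
    # Keep only the titles that were seen more than once.
    return {title: seen for title, seen in positions.items() if len(seen) > 1}


if __name__ == "__main__":
    # Made-up sample data, not real bot output.
    sample = ['File:A.oga', 'File:B.ogg', 'File:A.oga', 'File:C.ogg', 'File:A.oga']
    for title, seen in duplicate_positions(sample).items():
        print(title, seen)
```

In the bot above, the input could presumably be the titles of the pages coming out of the generator, for example `(page.title() for page in pregenerator)`.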
I also ran this for a while and got the list: https://tools.wmflabs.org/paste/view/5e83ddea
I analyzed this list with Python's Counter, using most_common() to rank the pages by how often they occur:
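    from collections import Counter

    # Read the saved bot output; result lines look like "<index>. <title>".
    with open('./tmp/llog.txt') as _file:
        lines = _file.readlines()

    pages = []
    for line in lines:
        if line[0].isdigit():
            pages.append(line.split('.', 1)[1].strip())

    print(len(pages))
    # Print the count followed by the title for the 40 most frequent pages.
    for i, j in Counter(pages).most_common()[:40]:
        print(j, i)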
This shows that the top 68 items are all of the form "File:Thalie Envolée - .*.oga", and each of them was found 4 times. Next come some "Chinese tones" files such as File:Yue-好.ogg and File:Chinese tone 35.png, which occur 3 times each, and so on.
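A summary like the one above (how many titles occur 4 times, how many 3 times, and so on) can be computed by counting the counts. This is a small sketch under assumptions: `count_histogram` is a hypothetical helper name, and the sample list is made up rather than taken from the real log.

```python
from collections import Counter


def count_histogram(pages):
    """Return {occurrence count: number of distinct titles with that count}."""
    per_title = Counter(pages)          # title -> number of occurrences
    return Counter(per_title.values())  # occurrences -> number of titles


if __name__ == "__main__":
    # Made-up sample data, not real bot output.
    sample = ['File:A.oga', 'File:A.oga', 'File:B.ogg',
              'File:C.ogg', 'File:C.ogg', 'File:C.ogg']
    for occurrences, n_titles in sorted(count_histogram(sample).items(), reverse=True):
        print(occurrences, 'times:', n_titles, 'title(s)')
```

Passing the `pages` list built from the log file would give the number of distinct titles at each repetition level.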