
PageGenerator generating same page multiple times
Closed, Resolved · Public

Description

I was running a script that found all files in a category (recursively, i.e. in all sub-categories as well) using the -catr flag. I found that some of the pages are yielded multiple times.

Here is a simple bot to reproduce this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import absolute_import, unicode_literals

import pywikibot
from pywikibot import pagegenerators


def main(*args):
    generator = None
    local_args = pywikibot.handle_args(args)
    site = pywikibot.Site('commons', 'commons')
    genFactory = pagegenerators.GeneratorFactory(site)
    for arg in local_args:
        genFactory.handleArg(arg)

    generator = genFactory.getCombinedGenerator(gen=generator)
    if not generator:
        pywikibot.bot.suggest_help(missing_generator=True)
    else:
        pregenerator = pagegenerators.PreloadingGenerator(generator)
        site.login()
        old_pages = set()
        for i, page in enumerate(pregenerator):
            if page.exists() and not page.isRedirectPage():
                pywikibot.output(str(i) + '. ' + page.title())
                if page.title() in old_pages:
                    print('Found ' + page.title() + ' more than once.')
                    return
                old_pages.add(page.title())
            # _ = str(raw_input("More ?"))


if __name__ == "__main__":
    main()

When I run it with python name.py -catr:Ogg_sound_files, it stops around item 2200 at File:Thalie Envolée - Charles Baudelaire - La beauté.oga, which appears at both 2165 and 2200.

I also ran this for a while and got the list: https://tools.wmflabs.org/paste/view/5e83ddea
On analyzing this with Python's Counter, I found the most_common() pages as follows:

from collections import Counter

pages = []
with open('./tmp/llog.txt') as _file:
    for line in _file:
        # Output lines look like "2165. File:...", i.e. index, dot, title.
        if line[0].isdigit():
            pages.append(line.split('.', 1)[1].strip())

print(len(pages))
for title, count in Counter(pages).most_common()[:40]:
    print(count, title)

This shows that the top 68 items are all of the form "File:Thalie Envolée - .*.oga" and were each found 4 times. Next come some "Chinese tones" files like File:Yue-好.ogg and File:Chinese tone 35.png, which occur 3 times, and so on.

Event Timeline

Also, I am checking not page.isRedirectPage(), so redirects should not be affecting this.

I believe this happens because the file exists in more than one sub-category of the original category given to -catr.

The files related to Thalie Envolée have:

  • Ogg sound files -> Ogg files by language -> Ogg sound files of spoken French -> Thalie Envolée -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga
  • Ogg sound files -> Ogg files by language -> Ogg sound files of spoken French -> Thalie Envolée -> Thalie Envolée - Opus 1 -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga
  • Ogg sound files -> Ogg sound files of audiobooks -> Thalie Envolée -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga
  • Ogg sound files -> Ogg sound files of audiobooks -> Thalie Envolée -> Thalie Envolée - Opus 1 -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga

The only solution to this, as I see it, would be to store all yielded items in a set() and check whether a page has already been generated.
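For illustration, a minimal de-duplicating wrapper along those lines could look like the sketch below. The dedup name and the choice to key on page.title() are my own, not an existing API; it just demonstrates the set-based approach and its memory trade-off.

def dedup(generator):
    """Yield each page from generator only once, keyed on its title.

    Sketch only: every seen title stays in memory, so the set grows
    with the size of the category tree.
    """
    seen = set()
    for page in generator:
        title = page.title()
        if title in seen:
            continue
        seen.add(title)
        yield page

# In the script above one could then wrap the combined generator:
# pregenerator = pagegenerators.PreloadingGenerator(dedup(generator))

If I remember correctly, pagegenerators also ships a DuplicateFilterPageGenerator that works in essentially this way.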

Change 428944 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Don't yield duplicates with Category.articles(recurse=True)

https://gerrit.wikimedia.org/r/428944

Xqt triaged this task as High priority.

I'm not sure about storing yielded articles in a set. It might hog a lot of memory for large categories.

In pagegenerators we also have such filters where pages are stored in a dict, but that doesn't save any memory either. One idea would be to store the Page._link instead of the whole page, which can grow a lot once additional attributes are cached. Storing Page.title() doesn't save memory either:

>>> import pwb, pywikibot as py, sys
>>> s = py.Site()
>>> p = py.Page(s, 'user:Xqt')
>>> t = p.title()
>>> l = p._link
>>> p.__sizeof__()
16
>>> t.__sizeof__()
46
>>> l.__sizeof__()
16
>>> sys.getsizeof(p)
32
>>> sys.getsizeof(t)
46
>>> sys.getsizeof(l)
32

I only trust sys.getsizeof for built-in objects; try Pympler instead. The size also depends on whether the content of the page has been fetched or not... Anyway, perhaps using page.title() would be OK for all common purposes.
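For example, Pympler's asizeof reports the recursive size of an object (following references), which gives a more realistic picture than sys.getsizeof. A quick sketch, assuming Pympler is installed; the numbers will vary with which attributes the page has already cached:

from pympler import asizeof

import pywikibot

site = pywikibot.Site()
page = pywikibot.Page(site, 'User:Xqt')

# asizeof() follows references, so cached attributes are counted too.
print(asizeof.asizeof(page))          # whole Page object
print(asizeof.asizeof(page.title()))  # just the title string
print(asizeof.asizeof(page._link))    # the Link object only

page.get()  # fetching the text makes the Page considerably larger
print(asizeof.asizeof(page))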

Change 428944 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Don't yield duplicates with Category.articles(recurse=True)

https://gerrit.wikimedia.org/r/428944