
PageGenerator generating same page multiple times
Closed, Resolved · Public

Description

I was running a script that found all files in a category (recursively, i.e. in all sub-categories as well) using the -catr flag. I found that some of the pages are yielded multiple times.

Here is a simple bot to reproduce this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import absolute_import, unicode_literals

import pywikibot
from pywikibot import pagegenerators


def main(*args):
    generator = None
    local_args = pywikibot.handle_args(args)
    site = pywikibot.Site('commons', 'commons')
    genFactory = pagegenerators.GeneratorFactory(site)
    for arg in local_args:
        genFactory.handleArg(arg)

    generator = genFactory.getCombinedGenerator(gen=generator)
    if not generator:
        pywikibot.bot.suggest_help(missing_generator=True)
    else:
        pregenerator = pagegenerators.PreloadingGenerator(generator)
        site.login()
        old_pages = set()
        for i, page in enumerate(pregenerator):
            if page.exists() and not page.isRedirectPage():
                pywikibot.output(str(i) + '. ' + page.title())
                if page.title() in old_pages:
                    print('Found ' + page.title() + ' more than once.')
                    return
                old_pages.add(page.title())
            # _ = str(raw_input("More ?"))


if __name__ == "__main__":
    main()

When I run it with python name.py -catr:Ogg_sound_files, it stops around item 2200 at File:Thalie Envolée - Charles Baudelaire - La beauté.oga, which appears at both 2165 and 2200.

I also ran this for a while and got the list: https://tools.wmflabs.org/paste/view/5e83ddea
On analyzing this with Python's Counter, I found the most_common() pages as follows:

from collections import Counter

pages = []
with open('./tmp/llog.txt') as _file:
    for line in _file:
        # Output lines look like "2165. File:...", i.e. index, dot, title.
        if line[0].isdigit():
            pages.append(line.split('.', 1)[1].strip())

print(len(pages))
for title, count in Counter(pages).most_common()[:40]:
    print(count, title)

This shows that the top 68 items are all of the form "File:Thalie Envolée - .*.oga" and were each found 4 times. Next come some "Chinese tones" files like File:Yue-好.ogg and File:Chinese tone 35.png, which occur 3 times, and so on.

Event Timeline

Also, I am checking not page.isRedirectPage(), so redirects should not be affecting this.

I believe this happens because the file exists in more than one sub-category of the original category given to -catr.

The files related to Thalie Envolée have:

  • Ogg sound files -> Ogg files by language -> Ogg sound files of spoken French -> Thalie Envolée -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga
  • Ogg sound files -> Ogg files by language -> Ogg sound files of spoken French -> Thalie Envolée -> Thalie Envolée - Opus 1 -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga
  • Ogg sound files -> Ogg sound files of audiobooks -> Thalie Envolée -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga
  • Ogg sound files -> Ogg sound files of audiobooks -> Thalie Envolée -> Thalie Envolée - Opus 1 -> File:Thalie Envolée - Charles Baudelaire - La beauté.oga

The only solution to this, as I see it, would be to store all yielded items in a set() and check whether a page has already been generated.
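For illustration, a minimal de-duplicating wrapper along those lines could look like the sketch below. The dedup name and the choice to key on page.title() are my own, not an existing API; it just demonstrates the set-based approach and its memory trade-off.

def dedup(generator):
    """Yield each page from generator only once, keyed on its title.

    Sketch only: every seen title stays in memory, so the set grows
    with the size of the category tree.
    """
    seen = set()
    for page in generator:
        title = page.title()
        if title in seen:
            continue
        seen.add(title)
        yield page

# In the script above one could then wrap the combined generator:
# pregenerator = pagegenerators.PreloadingGenerator(dedup(generator))

If I remember correctly, pagegenerators also ships a DuplicateFilterPageGenerator that works in essentially this way.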

Change 428944 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Don't yield duplicates with Category.articles(recurse=True)

https://gerrit.wikimedia.org/r/428944

Xqt triaged this task as High priority.

I'm not sure about storing yielded articles in a set. It might hog a lot of memory for large categories.

In pagegenerators we also have such filters where pages are stored in a dict, but that doesn't save any memory either. One idea would be to store the Page._link instead of the whole page, which can grow a lot once additional attributes are cached. Storing Page.title() doesn't save memory either:

>>> import pwb, pywikibot as py, sys
>>> s = py.Site()
>>> p = py.Page(s, 'user:Xqt')
>>> t = p.title()
>>> l = p._link
>>> p.__sizeof__()
16
>>> t.__sizeof__()
46
>>> l.__sizeof__()
16
>>> sys.getsizeof(p)
32
>>> sys.getsizeof(t)
46
>>> sys.getsizeof(l)
32

I only trust sys.getsizeof for built-in objects; try Pympler instead. The size also depends on whether the content of the page has been fetched or not... Anyway, perhaps using page.title() would be OK for all common purposes.
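For example, Pympler's asizeof reports the recursive size of an object (following references), which gives a more realistic picture than sys.getsizeof. A quick sketch, assuming Pympler is installed; the numbers will vary with which attributes the page has already cached:

from pympler import asizeof

import pywikibot

site = pywikibot.Site()
page = pywikibot.Page(site, 'User:Xqt')

# asizeof() follows references, so cached attributes are counted too.
print(asizeof.asizeof(page))          # whole Page object
print(asizeof.asizeof(page.title()))  # just the title string
print(asizeof.asizeof(page._link))    # the Link object only

page.get()  # fetching the text makes the Page considerably larger
print(asizeof.asizeof(page))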

Change 428944 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Don't yield duplicates with Category.articles(recurse=True)

https://gerrit.wikimedia.org/r/428944