[Pywikibot] proofreadpage.IndexPage.page_gen with option 'content=False' need a lot of time to proceed
Closed, Resolved · Public · Bug Report

Description

I'm working on a Python script that uses the IndexPage page generator:

IndexPage.page_gen(content=False)

According to the documentation, this should work without preloading every page (which can be very time- and bandwidth-consuming if the Index is, say, 500 or 900 pages long), but instead it does nothing. Removing the option, or setting 'content=True', works as usual.

Event Timeline

@Ninovolador: do you mean that the content is loaded anyway even though that flag is set to False?

I guess this is due to the quality filter, which needs the text. @Mpaa: am I right?

It doesn't work at all. It does not work as a generator. There is no prompt or console output; it just hangs.

No, it works, for example:

>>> import pywikibot
>>> from pywikibot.proofreadpage import IndexPage
>>> site = pywikibot.Site('wikisource:en')
>>> ip = IndexPage(site, 'index:Popular Science Monthly Volume 1.djvu')
>>> gen = ip.page_gen(content=False)
>>> next(gen)
ProofreadPage('Page:Popular Science Monthly Volume 1.djvu/2')

Maybe it needs a lot of time to load all the pages. But this is due to the sorting function; I propose to remove that in https://gerrit.wikimedia.org/r/c/pywikibot/core/+/1007342
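To illustrate why a sorting step can make a generator appear to hang, here is a minimal, self-contained sketch. It does not use Pywikibot; `slow_pages` is a hypothetical stand-in for a lazy page generator, and each recorded fetch stands for one slow request. It only shows the general Python behaviour that `sorted()` has to consume its whole input before the first item becomes available.

```python
def slow_pages(n, log):
    """Hypothetical stand-in for a lazy page generator; records each fetch."""
    for i in range(1, n + 1):
        log.append(i)  # pretend this is one slow network request
        yield i


# Lazy: only the first item is produced before next() returns.
lazy_log = []
first_lazy = next(slow_pages(500, lazy_log))

# Sorted: sorted() consumes the whole generator before anything is returned.
sorted_log = []
first_sorted = next(iter(sorted(slow_pages(500, sorted_log))))

print(len(lazy_log), len(sorted_log))
```

In this toy model the lazy call performs one fetch before returning, while the sorted variant performs all 500 first. If each fetch is slow, the caller sees no output for a long time.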

I see! I gave it a couple of minutes and it didn't work. It's strange that when I set content=True it is almost instantaneous.

But thanks for the patch!

Change 1007387 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [bugfix] remove content parameter of ItemPage.page_gen method

https://gerrit.wikimedia.org/r/1007387

I ran some tests using this script:

import time

import pywikibot
from pywikibot import proofreadpage

site = pywikibot.Site('es', fam='wikisource')


def treat_page(page):
    # Reading page.body requires the page text, so this forces a load.
    page.body = page.body + ""


# Start the timer before the index is created, as in the original measurement.
time1 = time.time()

index = "Cautiverio feliz, y razón de las guerras dilatadas de Chile.pdf"
INDEX = proofreadpage.IndexPage(site, title="Index:" + index)

# Change content to True for the comparison run.
gen = INDEX.page_gen(end=100, content=False)

for page in gen:
    treat_page(page)

time1 = time.time() - time1

print('content=False: ', time1)

I got these results, in seconds (changing the boolean accordingly):
For content=True:
content=True: 3.071580171585083
For content=False:
content=False: 48.33420515060425 (first try, with a ConnectionError exception)
content=False: 53.96983456611633

I think the reason is that with content=False the content is loaded page by page because the content is required by the filter, whereas content=True does a bulk load which might be faster.
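The difference between loading page by page and loading in bulk can be sketched with a toy model. This is not Pywikibot code: `FakeSite`, `fetch_one`, `fetch_many` and the batch size of 50 are all hypothetical names and values chosen for illustration. The sketch only counts simulated round trips to show why a bulk load might be faster.

```python
class FakeSite:
    """Hypothetical stand-in for a wiki API that counts round trips."""

    def __init__(self, texts):
        self.texts = texts
        self.requests = 0

    def fetch_one(self, title):
        # One simulated request per page.
        self.requests += 1
        return self.texts[title]

    def fetch_many(self, titles, groupsize=50):
        # One simulated request per batch of `groupsize` pages
        # (the batch size is an assumption for this sketch).
        result = {}
        for start in range(0, len(titles), groupsize):
            self.requests += 1
            for title in titles[start:start + groupsize]:
                result[title] = self.texts[title]
        return result


titles = [f"Page:Example.pdf/{i}" for i in range(1, 101)]
texts = {title: f"text of {title}" for title in titles}

one_by_one = FakeSite(texts)
loaded_single = {title: one_by_one.fetch_one(title) for title in titles}

bulk = FakeSite(texts)
loaded_bulk = bulk.fetch_many(titles)

print(one_by_one.requests, bulk.requests)
```

Both approaches end up with the same 100 texts, but the page-by-page variant makes 100 simulated requests while the batched variant makes 2. With real network latency per request, that gap would be consistent with the timings reported above, though the sketch does not prove that this is the cause.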

Xqt triaged this task as Medium priority.
Xqt renamed this task from [Pywikibot] proofreadpage.IndexPage.page_gen with option 'content=False' does nothing to [Pywikibot] proofreadpage.IndexPage.page_gen with option 'content=False' need a lot of time to proceed. Feb 28 2024, 4:29 PM
Xqt added a project: Performance Issue.
Xqt moved this task from Backlog to Needs Review on the Pywikibot board.

Change 1007387 merged by jenkins-bot:

[pywikibot/core@master] [bugfix] remove content parameter of ItemPage.page_gen method

https://gerrit.wikimedia.org/r/1007387