
Provide parallel processing of the BaseBot generator property
Open, Needs Triage, Public, Feature

Description

Feature summary:

We have had asynchronous put since Pywikibot 1.0 (compat), which shortens a script's execution time somewhat because the put throttle's waiting time no longer blocks further processing. This is useful when the processing time for a single page is long compared to saving it, but the put throttle would otherwise add to the total run time.

Instead of asynchronous put, parallel processing of the pages retrieved from generators (run through the BaseBot.generator property, for example) should be implemented, e.g. as an additional option.

Use case

  • archivebot.py needs a lot of time to process all talk pages on which a template is transcluded. For example, a parallel run takes only about 45 minutes for a complete pass on en-wiki, whereas a sequential run takes an estimated 9 hours. (Admittedly this could also be improved by keeping results in a local database for the second and subsequent runs.)

Possible implementations

  • concurrent.futures
  • threading
  • Awaitables

https://docs.python.org/3/library/concurrency.html?highlight=concurrent%20future
https://docs.python.org/3/library/asyncio-task.html
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.tools.html#pywikibot.tools.ThreadedGenerator
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.tools.html#pywikibot.tools.ThreadList
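
A minimal sketch of the concurrent.futures variant, assuming a hypothetical helper named process_parallel and a worker function treat (neither is part of the BaseBot API; a stand-in range generator is used here instead of a real page generator):

```python
from concurrent.futures import ThreadPoolExecutor

def process_parallel(generator, treat, max_workers=4):
    """Run *treat* on every item yielded by *generator* using a thread pool.

    This is only an illustration of the proposed option; the names
    ``treat`` and ``max_workers`` are hypothetical, not existing API.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() keeps the generator's order in the results and
        # re-raises any worker exception when the results are consumed.
        return list(executor.map(treat, generator))

# Stand-in generator and worker instead of a page generator and bot task:
results = process_parallel(range(10), lambda n: n * n)
```

The same idea could also be built on pywikibot.tools.ThreadedGenerator or plain threading; the executor version is shown because it handles worker exceptions and pool shutdown on its own.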

Benefits

  • ~10 times faster than serial processing of generators
  • async put is not necessary in most cases because other threads keep working during put throttle wait cycles
    • any exception handling happens inside the processed task
    • no `stopme()` call is needed to wait until the put queue is done; otherwise some callbacks might be unavailable after the Bot class is left
    • T104809 can be solved probably

Known problems

  • Logging functions like pywikibot.info must either be used only on entry or exit of the parallel task method/function, or their messages must start with the thread name/identifier. Otherwise the output is confusing because there is no context telling which page/generator item a message belongs to. Another idea would be a priority queue for cache_output, but then the task id would have to be passed somehow with the logging call.
  • Any ui input function acquires an RLock to queue any other output; the queued messages are flushed when the RLock is released. It still has to be checked what happens if one task is waiting for ui input and another task also calls a ui input method. It should work, because any stream output is made only after acquiring the blocking lock.
  • Parallel processing is already implemented in some scripts such as weblinkchecker (threading) and fixing_redirects (concurrent.futures). In such cases the generator probably shouldn't be parallelized as well.
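
To illustrate the logging problem above, here is a sketch that prefixes each message with the worker thread's name so interleaved output keeps its context. The info function is a stand-in for pywikibot.info, not the real implementation:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def info(message):
    """Stand-in for pywikibot.info that prefixes the current thread name,
    so messages from interleaved tasks can be told apart."""
    print('[{}] {}'.format(threading.current_thread().name, message))

def treat(title):
    """Hypothetical per-page task: log on entry, return a result."""
    info('processing {}'.format(title))
    return title.upper()

# thread_name_prefix makes the worker names readable in the log output.
with ThreadPoolExecutor(max_workers=2, thread_name_prefix='worker') as ex:
    titles = list(ex.map(treat, ['Foo', 'Bar', 'Baz']))
```

This produces lines like `[worker_0] processing Foo`; without the prefix, two concurrent tasks logging the same template name would be indistinguishable.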

Good to know

  • comms.threadedhttp was given up with the switch from httplib2 to requests and was never adapted afterwards
  • See T57889