
Provide parallel processing of the BaseBot generator property
Open, Needs Triage, Public, Feature

Description

Feature summary:

We have had asynchronous put since Pywikibot 1.0 (compat), which shortens a script's execution time somewhat because the put throttle's waiting time no longer blocks further processing. This is useful when the processing time for a single page is long compared to saving it, but the put throttle would otherwise add to the total run time.

Instead of asynchronous put, parallel processing of the pages retrieved from generators (run through the BaseBot.generator property, for example) should be implemented, e.g. as an additional option.

Use case

  • archivebot.py needs a lot of time to process all talk pages on which a template is transcluded. For example, a parallel run takes only about 45 minutes for a complete pass on en-wiki, whereas a sequential run takes an estimated 9 hours. (Admittedly this could also be improved by keeping results in a local database for the second and subsequent runs.)

Possible implementations

  • concurrent.futures
  • threading
  • Awaitables

https://docs.python.org/3/library/concurrency.html?highlight=concurrent%20future
https://docs.python.org/3/library/asyncio-task.html
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.tools.html#pywikibot.tools.ThreadedGenerator
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.tools.html#pywikibot.tools.ThreadList
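
A minimal sketch of the concurrent.futures variant, assuming a hypothetical helper named process_parallel and a worker function treat (neither is part of the BaseBot API; a stand-in range generator is used here instead of a real page generator):

```python
from concurrent.futures import ThreadPoolExecutor

def process_parallel(generator, treat, max_workers=4):
    """Run *treat* on every item yielded by *generator* using a thread pool.

    This is only an illustration of the proposed option; the names
    ``treat`` and ``max_workers`` are hypothetical, not existing API.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() keeps the generator's order in the results and
        # re-raises any worker exception when the results are consumed.
        return list(executor.map(treat, generator))

# Stand-in generator and worker instead of a page generator and bot task:
results = process_parallel(range(10), lambda n: n * n)
```

The same idea could also be built on pywikibot.tools.ThreadedGenerator or plain threading; the executor version is shown because it handles worker exceptions and pool shutdown on its own.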

Benefits

  • ~10 times faster than serial processing of generators
  • async put is not necessary in most cases because other threads keep working during put throttle wait cycles
    • any exception handling happens inside the processed task
    • no `stopme()` call is needed to wait until the put queue is done; otherwise some callbacks might be unavailable after the Bot class is left
    • T104809 can be solved probably

Known problems

  • Logging functions like pywikibot.info must either be used only on entry or exit of the parallel task method/function, or their messages must start with the thread name/identifier. Otherwise the output is confusing because there is no context telling which page/generator item a message belongs to. Another idea would be a priority queue for cache_output, but then the task id would have to be passed somehow with the logging call.
  • Any ui input function acquires an RLock to queue any other output; the queued messages are flushed when the RLock is released. It still has to be checked what happens if one task is waiting for ui input and another task also calls a ui input method. It should work, because any stream output is made only after acquiring the blocking lock.
  • Parallel processing is already implemented in some scripts such as weblinkchecker (threading) and fixing_redirects (concurrent.futures). In such cases the generator probably shouldn't be parallelized as well.
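
To illustrate the logging problem above, here is a sketch that prefixes each message with the worker thread's name so interleaved output keeps its context. The info function is a stand-in for pywikibot.info, not the real implementation:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def info(message):
    """Stand-in for pywikibot.info that prefixes the current thread name,
    so messages from interleaved tasks can be told apart."""
    print('[{}] {}'.format(threading.current_thread().name, message))

def treat(title):
    """Hypothetical per-page task: log on entry, return a result."""
    info('processing {}'.format(title))
    return title.upper()

# thread_name_prefix makes the worker names readable in the log output.
with ThreadPoolExecutor(max_workers=2, thread_name_prefix='worker') as ex:
    titles = list(ex.map(treat, ['Foo', 'Bar', 'Baz']))
```

This produces lines like `[worker_0] processing Foo`; without the prefix, two concurrent tasks logging the same template name would be indistinguishable.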

Good to know

  • comms.threadedhttp was given up with the switch from httplib2 to requests and was never adapted afterwards
  • See T57889