
Improve support for asynchronous requests (saving/preloading pages)
Open, LowPublicFeature

Description

Currently, the generator functions use yield, which is not thread-safe. PWB should offer a thread-safe version using one of the many interesting suggestions from http://www.dabeaz.com/generators/Generators.pdf , http://anandology.com/blog/using-iterators-and-generators/ , etc.


Version: core-(2.0)
Severity: enhancement
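
A minimal illustration of the problem: when two threads consume the same generator, CPython may raise ValueError ('generator already executing') if one thread calls next() while the other is still inside the generator frame (whether the race actually triggers depends on thread scheduling).

```python
import threading

def pages():
    # stand-in for a typical pagegenerators-style generator
    for i in range(100000):
        yield 'Page %d' % i

gen = pages()

def consumer():
    try:
        for page in gen:   # both threads advance the same generator object
            pass
    except ValueError as e:
        print(e)           # "generator already executing"

threads = [threading.Thread(target=consumer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```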

Details

Reference
bz55889

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 2:16 AM
bzimport set Reference to bz55889.
bzimport added a subscriber: Unknown Object (????).

What is the goal you want to achieve by this? Remember that threads in Python are useless for computations, due to the GIL.

As I understand it, I/O happens outside of the GIL. As the API requests are the most time-consuming part of many of my robots (and more precisely the connection to the servers), being able to make requests from several threads should somewhat improve performance (as long as the throttling is not too aggressive).

I've noticed that the preloading limit is now only 50 pages, making this problem even more pressing when working with many small pages. It's probably also a good idea for things like image upload/download.

If it helps, we can run some tests to see whether performance improves for a simple file downloader.

I see. There are already some features in place, but we are maybe not using asynchronous requests at all points where it might be useful.

First of all, connections should be re-used - this is already a feature in the httplib2 library.

The next layer, comms.threadedhttp, supports asynchronous requests ('features' would be a closer term - basically, you create a request and then wait for a lock to be released). However, I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.
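
To make that concrete, here is a rough sketch of that 'wait for a lock' pattern (purely illustrative; the real comms.threadedhttp classes and attribute names differ):

```python
import threading
import queue

class AsyncRequest(object):
    """The 'create a request, then wait on a lock' pattern, illustrated."""

    def __init__(self, uri):
        self.uri = uri
        self.data = None
        self.lock = threading.Semaphore(0)  # released once the reply is stored

    def wait(self):
        self.lock.acquire()                 # blocks until the worker releases it
        return self.data

request_queue = queue.Queue()

def http_worker():
    while True:
        req = request_queue.get()
        # a real worker would perform the HTTP call here
        req.data = 'response for %s' % req.uri
        req.lock.release()

threading.Thread(target=http_worker, daemon=True).start()

req = AsyncRequest('https://en.wikipedia.org/w/api.php')
request_queue.put(req)
# ... the caller is free to do other work here ...
print(req.wait())                           # blocks only if the reply is not in yet
```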

For saving pages, which (I think) is the most relevant place for async requests, we already have support: requests that do not return a reply that has to be handled can be handled asynchronously - see Page.put_async.
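
For reference, a typical put_async call looks roughly like this (a minimal sketch; parameter and callback names have varied between releases, so check the current signature):

```python
import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Project:Sandbox')

def saved(page, err):
    # called from the background save thread once the request has been processed;
    # err is None on success, otherwise the exception that was raised
    if err is not None:
        pywikibot.error('Saving %s failed: %s' % (page.title(), err))

page.text += '\ntest edit'
# queues the save and returns immediately; a separate thread performs the edit
page.put_async(page.text, summary='testing put_async', callback=saved)
```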

For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).
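
A generic sketch of such a look-ahead wrapper (hypothetical; no such helper exists in pagegenerators) might be:

```python
import threading
import queue

def prefetching(generator, lookahead=1):
    """Yield items from generator while a background thread fetches ahead."""
    buffer = queue.Queue(maxsize=lookahead)
    sentinel = object()

    def producer():
        for item in generator:
            buffer.put(item)     # blocks once 'lookahead' items are waiting
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item

# usage: for page in prefetching(PreloadingGenerator(gen), lookahead=1): ...
```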

(In reply to comment #3)

The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.

I've noticed that while writing the answer to Gerard's questions today :)

For saving pages, which (I think) is the most relevant place for async request, we already have support, where requests that do not return a reply that has to be handled can be handled asynchronously - see Page.put_async.

I've experimented with put_async with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps trying again and again, keeping the thread blocked.

Instead, the request should probably be de-queued, processed and, if a callback has been registered, the callback should be called in order to allow the bot to re-queue the request. This, however, could cause trouble if the order of the requests is important. The bot can receive a callback, but AFAIK it cannot remove already queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but this is a point that should be considered.
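
For example, such a callback-based retry could look roughly like this (a hypothetical sketch; it assumes put_async invokes the callback as callback(page, error)):

```python
import pywikibot

MAX_ATTEMPTS = 3

def retrying_callback(text, attempt=1):
    def callback(page, err):
        if err is None:
            return                       # the save went through
        if attempt >= MAX_ATTEMPTS:
            pywikibot.error('Giving up on %s: %s' % (page.title(), err))
            return
        # re-queue the failed save instead of retrying inside the worker thread
        page.put_async(text, summary='retrying save',
                       callback=retrying_callback(text, attempt + 1))
    return callback

# page.put_async(new_text, summary='...', callback=retrying_callback(new_text))
```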

Another possible issue, that PWB can't really do much about, is that one can get a 504 even if the save is successful, making the re-queueing useless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.

For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).

This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?

I think there were some good ideas brought up on this bug. Should we start a thread on the mailing list so we can gather more input on this?

Change 172023 had a related patch set uploaded by John Vandenberg:
Asynchronous HTTP requests

https://gerrit.wikimedia.org/r/172023

Patch-For-Review

Change 172023 merged by jenkins-bot:
Asynchronous HTTP requests

https://gerrit.wikimedia.org/r/172023

jayvdb set Security to None.
jayvdb removed a subscriber: Unknown Object (????).

FWIW, Change 172023 made threadedhttp.Request more usable and more accessible: the new function http.fetch now returns a threadedhttp.Request object, whereas the only previous function, http.request, returned unicode. (It also means we can replace threadedhttp/httplib2 with another library more easily.)
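
For example (a minimal sketch; the attribute names are from memory and may differ between releases):

```python
from pywikibot.comms import http

# fetch() returns the full request/response object rather than just the body,
# so callers can inspect the status, headers and raw content as well
response = http.fetch('https://www.mediawiki.org/w/api.php')
print(response.status)    # HTTP status code
print(response.content)   # decoded response body
```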

Regarding the initial focus of this task - thread-safe generators - it would be easy to add a tool (pywikibot/tools.py) which wraps any generator with a semaphore. Letting threaded apps wrap just the outer generator allows them to have multiple consumers of that generator, without every generator needing locking, which would slow down unthreaded apps. I agree this is a low priority, as it is simple to use a managed 'worker' model like weblinkchecker.py does, where a single thread hands out tasks to worker threads.
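
A sketch of such a wrapper (hypothetical; nothing like this is in pywikibot/tools.py yet):

```python
import threading

class LockedGenerator(object):
    """Wrap a generator with a lock so that several consumer threads can
    safely pull items from it."""

    def __init__(self, generator):
        self._gen = generator
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:       # only one thread advances the generator at a time
            return next(self._gen)

    next = __next__            # Python 2 compatibility

# each consumer thread iterates the same wrapped object:
# pages = LockedGenerator(PreloadingGenerator(some_generator))
```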

(In reply to comment #3)

The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.

I've noticed that while writing the answer to Gerard's questions today :)

For saving pages, which (I think) is the most relevant place for async request, we already have support, where requests that do not return a reply that has to be handled can be handled asynchronously - see Page.put_async.

I've experimented with put_async with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps trying again and again, keeping the thread blocked.

I would like to experiment with having an async thread pool available, to avoid this being a deal breaker. Another approach is to move failed requests from the main async thread to a 'failed request' thread, which manages them differently and escalates repeated failures so that it eventually kills the job if the error rate is too high. My first excursion in this area is https://gerrit.wikimedia.org/r/#/c/176691/ , which explores whether there are bugs in the existing multiple-threads implementation.
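
A rough sketch of that two-queue idea (purely illustrative; not part of the patch linked above):

```python
import threading
import queue

work_queue = queue.Queue()      # normal asynchronous saves
failed_queue = queue.Queue()    # requests that raised an error
abort = threading.Event()       # set when the error rate becomes too high

MAX_FAILURES = 20

def async_worker():
    # process queued requests; hand failures over instead of retrying in place
    while not abort.is_set():
        job = work_queue.get()
        try:
            job()               # e.g. a bound page-save callable
        except Exception as exc:
            failed_queue.put((job, exc))

def failure_worker():
    # manage failed requests separately and escalate if too many accumulate
    failures = 0
    while True:
        job, exc = failed_queue.get()
        failures += 1
        if failures > MAX_FAILURES:
            abort.set()         # stop processing; the job is effectively killed
            break
        work_queue.put(job)     # give the request one more chance
```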

Instead, the request should probably be de-queued, processed and, if a callback has been registered, the callback should be called in order to allow the bot to re-queue the request. This, however, could cause trouble if the order of the requests is important. The bot can receive a callback, but AFAIK it cannot remove already queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but this is a point that should be considered.

comms.http now allows for additional callbacks, which can be experimented with to develop failover/resending strategies, etc.

Another possible issue, that PWB can't really do much about, is that one can get a 504 even if the save is successful, making the re-queueing useless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.

For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).

This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?

It would be good if preloading could be set by command-line options, so operators can override a script's default settings for workloads where its default preloading settings are not ideal.
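
For example, a script could accept a hypothetical -preload:N option and pass the value to PreloadingGenerator (the name of the batch-size parameter has changed between releases, e.g. step/groupsize):

```python
import pywikibot
from pywikibot import pagegenerators

def main(*args):
    group_size = 50                       # the script's default
    local_args = pywikibot.handle_args(args)
    for arg in local_args:
        # hypothetical option; no such global option exists yet
        if arg.startswith('-preload:'):
            group_size = int(arg[len('-preload:'):])

    site = pywikibot.Site()
    gen = site.allpages(total=500)
    # groupsize controls how many pages are fetched per API request
    for page in pagegenerators.PreloadingGenerator(gen, groupsize=group_size):
        pywikibot.output(page.title())

if __name__ == '__main__':
    main()
```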

Also, wikidata tasks are now regularly slowed down because they use at least two sites (the wikibase server and the client), regularly switching between them. The same problem exists to a lesser extent with scripts that use a shared media host (Wikimedia Commons) plus a client site.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM

threadedhttp was replaced by requests a long time ago (https://gerrit.wikimedia.org/r/c/pywikibot/core/+/213977), but further asyncio development should wait until support for Python 3.5 (or even 3.6) has been dropped.