
Improve support for asynchronous requests (saving/preloading pages)
Open, LowPublicFeature

Description

Currently, the generator functions use yield, which is not thread-safe. PWB should offer a thread-safe version using one of the many interesting suggestions from http://www.dabeaz.com/generators/Generators.pdf , http://anandology.com/blog/using-iterators-and-generators/ , etc.


Version: core-(2.0)
Severity: enhancement
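
A minimal illustration of the problem: when two threads consume the same generator, CPython may raise ValueError ('generator already executing') if one thread calls next() while the other is still inside the generator frame (whether the race actually triggers depends on thread scheduling).

```python
import threading

def pages():
    # stand-in for a typical pagegenerators-style generator
    for i in range(100000):
        yield 'Page %d' % i

gen = pages()

def consumer():
    try:
        for page in gen:   # both threads advance the same generator object
            pass
    except ValueError as e:
        print(e)           # "generator already executing"

threads = [threading.Thread(target=consumer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```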

Details

Reference
bz55889

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 2:16 AM
bzimport set Reference to bz55889.
bzimport added a subscriber: Unknown Object (????).

What is the goal you want to achieve by this? Remember that threads in Python are useless for computations, due to the GIL.

As I understand it, I/O happens outside of the GIL. As the API requests are the most time-consuming part of many of my robots (and more precisely the connection to the servers), being able to make requests from several threads should somewhat improve performance (as long as the throttling is not too aggressive).

I've noticed that the preloading limit is now only 50 pages, making this problem even more pressing when working with many small pages. It's probably also a good idea for things like image upload/download.

If it helps, we can run some tests to see whether performance improves for a simple file downloader.

I see. There are already some features in place, but we are maybe not using asynchronous requests at all points where it might be useful.

First of all, connections should be re-used - this is already a feature in the httplib2 library.

The next layer, comms.threadedhttp, supports asynchronous requests ('features' would be a closer term - basically, you create a request and then wait for a lock to be released). However, I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.
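
To make that concrete, here is a rough sketch of that 'wait for a lock' pattern (purely illustrative; the real comms.threadedhttp classes and attribute names differ):

```python
import threading
import queue

class AsyncRequest(object):
    """The 'create a request, then wait on a lock' pattern, illustrated."""

    def __init__(self, uri):
        self.uri = uri
        self.data = None
        self.lock = threading.Semaphore(0)  # released once the reply is stored

    def wait(self):
        self.lock.acquire()                 # blocks until the worker releases it
        return self.data

request_queue = queue.Queue()

def http_worker():
    while True:
        req = request_queue.get()
        # a real worker would perform the HTTP call here
        req.data = 'response for %s' % req.uri
        req.lock.release()

threading.Thread(target=http_worker, daemon=True).start()

req = AsyncRequest('https://en.wikipedia.org/w/api.php')
request_queue.put(req)
# ... the caller is free to do other work here ...
print(req.wait())                           # blocks only if the reply is not in yet
```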

For saving pages, which (I think) is the most relevant place for async requests, we already have support: requests that do not return a reply that has to be handled can be handled asynchronously - see Page.put_async.
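
For reference, a typical put_async call looks roughly like this (a minimal sketch; parameter and callback names have varied between releases, so check the current signature):

```python
import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Project:Sandbox')

def saved(page, err):
    # called from the background save thread once the request has been processed;
    # err is None on success, otherwise the exception that was raised
    if err is not None:
        pywikibot.error('Saving %s failed: %s' % (page.title(), err))

page.text += '\ntest edit'
# queues the save and returns immediately; a separate thread performs the edit
page.put_async(page.text, summary='testing put_async', callback=saved)
```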

For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).
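
A generic sketch of such a look-ahead wrapper (hypothetical; no such helper exists in pagegenerators) might be:

```python
import threading
import queue

def prefetching(generator, lookahead=1):
    """Yield items from generator while a background thread fetches ahead."""
    buffer = queue.Queue(maxsize=lookahead)
    sentinel = object()

    def producer():
        for item in generator:
            buffer.put(item)     # blocks once 'lookahead' items are waiting
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item

# usage: for page in prefetching(PreloadingGenerator(gen), lookahead=1): ...
```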

(In reply to comment #3)

The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.

I've noticed that while writing the answer to Gerard's questions today :)

For saving pages, which (I think) is the most relevant place for async request, we already have support, where requests that do not return a reply that has to be handled can be handled asynchronously - see Page.put_async.

I've experimented with put_async with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps trying again and again, keeping the thread blocked.

Instead, the request should probably be de-queued, processed and, if a callback has been registered, the callback should be called in order to allow the bot to re-queue the request. This, however, could cause trouble if the order of the requests is important. The bot can receive a callback, but AFAIK it cannot remove already queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but this is a point that should be considered.
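
For example, such a callback-based retry could look roughly like this (a hypothetical sketch; it assumes put_async invokes the callback as callback(page, error)):

```python
import pywikibot

MAX_ATTEMPTS = 3

def retrying_callback(text, attempt=1):
    def callback(page, err):
        if err is None:
            return                       # the save went through
        if attempt >= MAX_ATTEMPTS:
            pywikibot.error('Giving up on %s: %s' % (page.title(), err))
            return
        # re-queue the failed save instead of retrying inside the worker thread
        page.put_async(text, summary='retrying save',
                       callback=retrying_callback(text, attempt + 1))
    return callback

# page.put_async(new_text, summary='...', callback=retrying_callback(new_text))
```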

Another possible issue, that PWB can't really do much about, is that one can get a 504 even if the save is successful, making the re-queueing useless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.

For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).

This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?

I think there were some good ideas brought up on this bug. Should we start a thread on the mailing list so we can gather more input on this?

Change 172023 had a related patch set uploaded by John Vandenberg:
Asynchronous HTTP requests

https://gerrit.wikimedia.org/r/172023

Patch-For-Review

Change 172023 merged by jenkins-bot:
Asynchronous HTTP requests

https://gerrit.wikimedia.org/r/172023

jayvdb set Security to None.
jayvdb removed a subscriber: Unknown Object (????).

FWIW, Change 172023 made threadedhttp.Request more usable and more accessible: the new function http.fetch now returns a threadedhttp.Request object, whereas the only previous function, http.request, returned unicode. (It also means we can replace threadedhttp/httplib2 with another library more easily.)
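
For example (a minimal sketch; the attribute names are from memory and may differ between releases):

```python
from pywikibot.comms import http

# fetch() returns the full request/response object rather than just the body,
# so callers can inspect the status, headers and raw content as well
response = http.fetch('https://www.mediawiki.org/w/api.php')
print(response.status)    # HTTP status code
print(response.content)   # decoded response body
```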

Regarding the initial focus of this task - thread-safe generators - it would be easy to add a tool (pywikibot/tools.py) which wraps any generator with a semaphore. Letting threaded apps wrap just the outer generator allows them to have multiple consumers of that generator, without every generator needing locking, which would slow down unthreaded apps. I agree this is a low priority, as it is simple to use a managed 'worker' model like weblinkchecker.py does, where a single thread hands out tasks to worker threads.
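
A sketch of such a wrapper (hypothetical; nothing like this is in pywikibot/tools.py yet):

```python
import threading

class LockedGenerator(object):
    """Wrap a generator with a lock so that several consumer threads can
    safely pull items from it."""

    def __init__(self, generator):
        self._gen = generator
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:       # only one thread advances the generator at a time
            return next(self._gen)

    next = __next__            # Python 2 compatibility

# each consumer thread iterates the same wrapped object:
# pages = LockedGenerator(PreloadingGenerator(some_generator))
```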

(In reply to comment #3)

The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.

I've noticed that while writing the answer to Gerard's questions today :)

For saving pages, which (I think) is the most relevant place for async request, we already have support, where requests that do not return a reply that has to be handled can be handled asynchronously - see Page.put_async.

I've experimented with put_async with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps trying again and again, keeping the thread blocked.

I would like to experiment with having an async thread pool available, to avoid this being a deal breaker. Another approach is to move failed requests from the main async thread to a 'failed request' thread, which manages them differently and escalates repeated failures so that it eventually kills the job if the error rate is too high. My first excursion in this area is https://gerrit.wikimedia.org/r/#/c/176691/ , which explores whether there are bugs in the existing multiple-threads implementation.
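
A rough sketch of that two-queue idea (purely illustrative; not part of the patch linked above):

```python
import threading
import queue

work_queue = queue.Queue()      # normal asynchronous saves
failed_queue = queue.Queue()    # requests that raised an error
abort = threading.Event()       # set when the error rate becomes too high

MAX_FAILURES = 20

def async_worker():
    # process queued requests; hand failures over instead of retrying in place
    while not abort.is_set():
        job = work_queue.get()
        try:
            job()               # e.g. a bound page-save callable
        except Exception as exc:
            failed_queue.put((job, exc))

def failure_worker():
    # manage failed requests separately and escalate if too many accumulate
    failures = 0
    while True:
        job, exc = failed_queue.get()
        failures += 1
        if failures > MAX_FAILURES:
            abort.set()         # stop processing; the job is effectively killed
            break
        work_queue.put(job)     # give the request one more chance
```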

Instead, the request should probably be de-queued, processed and, if a callback has been registered, the callback should be called in order to allow the bot to re-queue the request. This, however, could cause trouble if the order of the requests is important. The bot can receive a callback, but AFAIK it cannot remove already queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but this is a point that should be considered.

comms.http now allows for additional callbacks, which can be experimented with to develop failover/resending strategies, etc.

Another possible issue, that PWB can't really do much about, is that one can get a 504 even if the save is successful, making the re-queueing useless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.

For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).

This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?

It would be good if preloading could be set by command-line options, so operators can override a script's default settings for workloads where its default preloading settings are not ideal.
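
For example, a script could accept a hypothetical -preload:N option and pass the value to PreloadingGenerator (the name of the batch-size parameter has changed between releases, e.g. step/groupsize):

```python
import pywikibot
from pywikibot import pagegenerators

def main(*args):
    group_size = 50                       # the script's default
    local_args = pywikibot.handle_args(args)
    for arg in local_args:
        # hypothetical option; no such global option exists yet
        if arg.startswith('-preload:'):
            group_size = int(arg[len('-preload:'):])

    site = pywikibot.Site()
    gen = site.allpages(total=500)
    # groupsize controls how many pages are fetched per API request
    for page in pagegenerators.PreloadingGenerator(gen, groupsize=group_size):
        pywikibot.output(page.title())

if __name__ == '__main__':
    main()
```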

Also, wikidata tasks are now regularly slowed down because they use at least two sites (the wikibase server and the client), regularly switching between them. The same problem exists to a lesser extent with scripts that use a shared media host (Wikimedia Commons) plus a client site.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM

threadedhttp was replaced by requests a long time ago (https://gerrit.wikimedia.org/r/c/pywikibot/core/+/213977), but further asyncio development should wait until support for Python 3.5 (or even 3.6) has been dropped.