Page MenuHomePhabricator

Provide direct access to a Guzzle HTTP client
Closed, ResolvedPublic

Description

Shellbox (T260330) requires MediaWiki to post potentially large HTTP requests that are assembled in PHP in a custom multipart/mixed format, and to receive large responses in the same format. In the library I used PSR-7 abstractions, allowing this to theoretically be done without requiring either the requests or responses to be held in memory.

However, MediaWiki's existing abstractions for HTTP requests do not allow response streaming. In the WIP patch, I had to buffer the response into a local temporary file.

The obvious way to implement a PSR-7 HTTP client is to just use Guzzle. Guzzle is already a dependency. The only thing that is missing is a factory function (say HttpRequestFactory::createGuzzleClient()), and possibly a subclass. The returned client object should respect the maximum timeout configuration $wgHTTPMaxTimeout and $wgHTTPMaxConnectTimeout, and it should have a default User-Agent which comes from HttpRequestFactory::getUserAgent().

Guzzle has asynchronous HTTP support. If I understand it correctly, this feature is implicitly used for response streaming. That's how the caller can read data from the response stream while the HTTP request is still active and the server is still sending data.

Event Timeline

Edit: ignore this comment. See my followup comment for why it was misguided.

The only thing that is missing is a factory function (say HttpRequestFactory::createGuzzleClient()), and possibly a subclass

Alternatively, can we make HttpRequestFactory::create() unconditionally return a Guzzle client?

HttpRequestFactory::create() checks Http::$httpEngine to determine what type of client to create (GuzzleHttpRequest, CurlHttpRequest, or PhpHttpRequest). But Http::$httpEngine was deprecated since 1.34 with the advice "just use the default engine". If Http::$httpEngine is not set, HttpRequestFactory::create() will set it to "guzzle".

The HISTORY file says:

Http::$httpEngine is deprecated and has no replacement. The default 'guzzle'
  engine will eventually be made the only engine for HTTP requests.

Maybe now is the time?

Codesearch does find a number of references to Http:$httpEngine in various extensions: https://codesearch.wmcloud.org/search/?q=httpEngine&i=nope&files=&repos= , mostly to force use of curl. I have not exhaustively ensured that GuzzleHttpRequest would happily handle all those cases.

Some things we could do:

  1. make HttpRequestFactory::create() ignore Http::$httpEngine. Leave Http::$httpEngine in place. Extension code can set it all it likes, but it won't do anything. To be clear - I don't suggest doing this without first making sure it won't break anything.
  2. do #1, and also remove Http::$httpEngine completely. We'd need to modify extensions to not reference Http::$httpEngine, at least when running against > 1.35. This may require some backward compatibility code - I haven't checked the individual extension compatibility policies - but that seems straightforward.
  3. do #1 and #2, and also deprecate CurlHttpRequest and PhpHttpRequest for eventual removal. Reducing the amount of custom code was, after all, part of the reason we added Guzzle as a vendor library.

Yet another option would be to make HttpRequestFactory::create() unconditionally return a GuzzleHttpRequest, and also introduce a deprecated-from-birth HttpRequestFactory::createCurlClient() method. We could then modify extensions to use that function (with backward compatibility code as necessary) so that extensions (for now) continue to use a CurlHttpRequest in all cases. This would be slightly safer, as it wouldn't risk changing the run-time behavior for extensions in unexpected ways. But it would add a bit more code to someday remove if we eventually retire CurlHttpRequest.

I'd be happy to work on any of the above.

After further discussion, I was confused about Tim's suggestion. Shellbox, being a library, needs a GuzzleHttp\Client, not a MediaWiki GuzzleHttpRequest. I retract my lengthy comment regarding deprecations and code removal.

The reason it needs a GuzzleHttp\Client not a GuzzleHttpRequest is, as I said in the task description, because it needs response streaming. The WIP patch is already using a GuzzleHttpRequest, but GuzzleHttpRequest finishes reading the response before it returns. It can take a write callback, but control is inverted compared to Guzzle which allows the caller to pull from the response stream. The multipart reader I wrote assumes this control direction, which is convenient and consistent with PSR-7.

I should note that I haven't actually tested and confirmed correct streaming behaviour in Guzzle, but I'm pretty sure it will work, based on the code and docs.

Change 634541 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Add a function to HttpRequestFactory for creating a GuzzleHttp\Client

https://gerrit.wikimedia.org/r/634541

The patch I just uploaded is very basic - it simply adds HttpRequestFactory::createGuzzleClient(), and an associated test. I suspect it could be expanded a bit to make things more convenient. But as I'm not sure exactly how you'll be using it, I decided to start small and expand rather than trying to guess what would be helpful.

Change 634541 merged by jenkins-bot:
[mediawiki/core@master] Add a function to HttpRequestFactory for creating a GuzzleHttp\Client

https://gerrit.wikimedia.org/r/634541

BPirkle claimed this task.
BPirkle moved this task from In Progress to Done on the MW-on-K8s board.

Change 636505 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] PHPUnit tests: prevent real Guzzle requests from tests.

https://gerrit.wikimedia.org/r/636505

Change 636505 merged by jenkins-bot:
[mediawiki/core@master] PHPUnit tests: prevent real Guzzle requests from tests.

https://gerrit.wikimedia.org/r/636505

I should note that I haven't actually tested and confirmed correct streaming behaviour in Guzzle, but I'm pretty sure it will work, based on the code and docs.

It in fact does not work. Guzzle just reads the whole response into a php://temp stream to invert the control flow.