
How should the MachineVision extension interact with external APIs from production?
Closed, Resolved · Public

Description

The MachineVision extension requests image metadata from external machine vision providers when an image is uploaded or a maintenance script is run. It's designed to be entirely self-contained (in contrast with the ContentTranslation extension, which communicates with external APIs through the cxserver Node.js service). The planned initial machine vision provider is Google Cloud Vision, and the extension contains a GoogleCloudVisionHandler class that communicates with Google Cloud Vision via its official PHP client library.

This task is about connecting to external APIs from MediaWiki in production. What needs to happen to allow the MachineVision extension to talk to external APIs from Wikimedia production? Do we need to make any changes to the architecture to allow it to do so?

Event Timeline

It looks like what we need to do is ensure that our outbound requests can use an HTTP proxy. I'll create a dedicated subtask for that.

I see an http_proxy value defined in hieradata/common.yaml, but is it exposed to the MediaWiki appservers somehow?

As far as UX is concerned...

The HTTP request should not happen on page load; it should be deferred and either run in the background (scheduled job) or on the client (JavaScript). If it's run on the client, an API module would need to be created to function as a proxy to the external service.

Hi, I assumed the fetching of such data would indeed happen via an async job upon image upload. Anything synchronous is effectively discouraged, even post-send, since even with relatively aggressive timeouts it's easy to clog up a lot of PHP workers waiting for an external provider that is lagging.

As far as the proxy is concerned: yes, your code will need to support setting an HTTP/HTTPS proxy that will be passed down from the configuration.
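As a rough sketch of the async-job approach described above — the job name, hook, and parameters here are hypothetical illustrations, not the extension's actual implementation:

```php
<?php
// Sketch only: on upload complete, enqueue a job instead of calling the
// machine vision provider synchronously in the web request. The job type
// 'fetchImageLabels' and its parameters are hypothetical.
$jobQueueGroup = JobQueueGroup::singleton();
$jobQueueGroup->push( new JobSpecification(
	'fetchImageLabels',
	[ 'title' => $uploadedFile->getTitle()->getDBkey() ]
) );
```

A job runner would then perform the external HTTP request in the background and store the resulting labels in the DB, keeping PHP web workers free.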

> As far as UX is concerned...
>
> The HTTP request should not happen on page load; it should be deferred and either run in the background (scheduled job) or on the client (JavaScript). If it's run on the client, an API module would need to be created to function as a proxy to the external service.

I would assume that running it on the client would defeat the purpose of storing such information in a trustworthy way.

On the other hand, I assume we would soon hit whatever rate limiting Google has on that API if we just run it from production (all requests will be coming from 2 IPs). We will need to tune the concurrency of such jobs, and also add rate limiting of sorts in change-propagation, I guess.

> I would assume that running it on the client would defeat the purpose of storing such information in a trustworthy way.
>
> On the other hand, I assume we would soon hit whatever rate limiting Google has on that API if we just run it from production (all requests will be coming from 2 IPs). We will need to tune the concurrency of such jobs, and also add rate limiting of sorts in change-propagation, I guess.

If it's running through a proxy on production, that shouldn't be a problem...?

The HTTP requests for labels happen asynchronously in a deferred update on upload complete, or when a maintenance script is run. They're fetched from the DB when the user requests the related special page.

What I specifically need to know here is how to get the proxy info from the environment, and if there are any other barriers unique to production for making external HTTP requests. I see that there is an http_proxy setting in hiera, but as far as I can tell from operations-wmf-config that isn't exposed to the MW appservers.

> The HTTP requests for labels happen asynchronously in a deferred update on upload complete, or when a maintenance script is run. They're fetched from the DB when the user requests the related special page.
>
> What I specifically need to know here is how to get the proxy info from the environment, and if there are any other barriers unique to production for making external HTTP requests. I see that there is an http_proxy setting in hiera, but as far as I can tell from operations-wmf-config that isn't exposed to the MW appservers.

I would assume your code will have to allow for a configuration variable to set the HTTP proxy, and then we'll provide access to it via mediawiki-config and the appropriate stanzas.
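A sketch of how the extension side might read such a configuration variable through MediaWiki's config service — the setting name `MachineVisionHttpProxy` is a hypothetical placeholder:

```php
<?php
use MediaWiki\MediaWikiServices;

// Read the proxy setting registered by the extension (name hypothetical);
// fall back to no proxy when the variable is unset or false, e.g. outside
// production.
$config = MediaWikiServices::getInstance()->getMainConfig();
$proxy = $config->get( 'MachineVisionHttpProxy' ) ?: null;
```

wmf-config would then assign the production proxy to that variable per realm, as discussed below.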

@Joe That sounds good, thanks. I'll update the code accordingly.

Basically what you need is:

  • a setting for the domains to exclude from proxying
  • one for the proxy itself (if present)

I'm not sure whether MediaWiki already supports all that in its libraries, but I would assume as much. I'll check and get back to you.

While investigating this yesterday I found that the Google client library that our Google request handler uses is built around Guzzle. From the docs, it looks like if HTTPS_PROXY is set in the environment, requests to the outside world should Just Work (TM). Further, Guzzle uses cURL as its underlying transport by default, so an http_proxy environment variable might get picked up by cURL itself.

Otherwise, it looks like the Google client library does not provide for an explicit proxy option at request time, so if I need to pass in a variable from extension config (which appears to be the case), then I'll have to update the handler to not use the Google client library and instead use one of the concrete MWHttpRequest subclasses (CurlHttpRequest or GuzzleHttpRequest) for these requests so that I can pass in the proxy setting.

So after some quick grepping: we already define a proxy in mediawiki-config, and it can be retrieved at $wmfLocalServices['urldownloader'], so:

  • if it's possible to pass the proxy to your library, it should be done the way we do it for other things, e.g. in mediawiki-config:

    $wgMachineVisionProxy = ( $wmfRealm !== 'labs' ) ? $wmfLocalServices['urldownloader'] : false;

but we definitely can't set the HTTPS_PROXY environment variable globally.

I would assume it shouldn't be too hard to contribute upstream an option for setting a specific proxy that doesn't come from environment variables, if it comes to that?

At a glance, ImageAnnotatorClient takes an httpHandler config option, which is a callable that processes PSR-7 requests, so you can just drop a Guzzle client in there:

// Build a Guzzle client configured with the proxy, then inject it as the
// REST transport's HTTP handler so every API request goes through it.
$guzzle = new \GuzzleHttp\Client( [ 'proxy' => $proxy ] );
$client = new ImageAnnotatorClient( [
    'transportConfig' => [
        'rest' => [
            'httpHandler' => function ( \Psr\Http\Message\RequestInterface $request ) use ( $guzzle ) {
                return $guzzle->send( $request );
            },
        ],
    ],
] );

https://github.com/googleapis/google-cloud-php-vision/blob/416e2cbc0b5f00b6b64ab295bc9580f1c4c1e244/src/V1/Gapic/ImageAnnotatorGapicClient.php#L201-L211
https://googleapis.github.io/gax-php/master/Google/ApiCore/Transport/RestTransport.html#method_build

Change 547741 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] MachineVision: Use an HTTP proxy in production

https://gerrit.wikimedia.org/r/547741

Mholloway claimed this task.

Replacing the client library was identified during the security readiness review as likely to be more secure and performant. The extension now offers configurable HTTP proxy support. Resolving this.

Change 547741 merged by Mholloway:
[operations/mediawiki-config@master] MachineVision: Use an HTTP proxy in production

https://gerrit.wikimedia.org/r/547741