
How should the MachineVision extension interact with external APIs from production?
Closed, Resolved · Public

Description

The MachineVision extension requests image metadata from external machine vision providers when an image is uploaded or a maintenance script is run. It's designed to be entirely self-contained (in contrast with the ContentTranslation extension, which communicates with external APIs through the cxserver Node.js service). The planned initial machine vision provider is Google Cloud Vision, and the extension contains a GoogleCloudVisionHandler class that communicates with Google Cloud Vision via its official PHP client library.

This task is about connecting to external APIs from MediaWiki in production. What needs to happen to allow the MachineVision extension to talk to external APIs from Wikimedia production? Do we need to make any changes to the architecture to allow it to do so?

Event Timeline

It looks like what we need to do is ensure that our outbound requests can use an HTTP proxy. I'll create a dedicated subtask for that.

I see an http_proxy value defined in hieradata/common.yaml, but is it exposed to the MediaWiki appservers somehow?

As far as UX is concerned...

The HTTP request should not happen on page load; it should be deferred and either run in the background (scheduled job) or on the client (JavaScript). If it's run on the client, an API module would need to be created to function as a proxy to the external service.

Hi, I assumed the fetching of such data would indeed happen via an async job upon image upload. Anything synchronous is effectively discouraged, even post-send, since even with relatively aggressive timeouts it's easy to clog up a lot of PHP workers waiting for an external provider that is lagging.

As far as the proxy is concerned: yes, your code will need to support setting an HTTP/HTTPS proxy that will be passed down from the configuration.
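As a rough sketch of the async-job approach described above — the job name, hook, and parameters here are hypothetical illustrations, not the extension's actual implementation:

```php
<?php
// Sketch only: on upload complete, enqueue a job instead of calling the
// machine vision provider synchronously in the web request. The job type
// 'fetchImageLabels' and its parameters are hypothetical.
$jobQueueGroup = JobQueueGroup::singleton();
$jobQueueGroup->push( new JobSpecification(
	'fetchImageLabels',
	[ 'title' => $uploadedFile->getTitle()->getDBkey() ]
) );
```

A job runner would then perform the external HTTP request in the background and store the resulting labels in the DB, keeping PHP web workers free.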

> As far as UX is concerned...
>
> The HTTP request should not happen on page load; it should be deferred and either run in the background (scheduled job) or on the client (JavaScript). If it's run on the client, an API module would need to be created to function as a proxy to the external service.

I would assume that running it on the client would defeat the purpose of storing such information in a trustworthy way.

On the other hand, I assume we would soon hit whatever rate limiting Google has on that API if we just run it from production (all requests will be coming from 2 IPs). We will need to tune the concurrency of such jobs, and also add rate limiting of sorts in change-propagation, I guess.

> I would assume that running it on the client would defeat the purpose of storing such information in a trustworthy way.
>
> On the other hand, I assume we would soon hit whatever rate limiting Google has on that API if we just run it from production (all requests will be coming from 2 IPs). We will need to tune the concurrency of such jobs, and also add rate limiting of sorts in change-propagation, I guess.

If it's running through a proxy on production, that shouldn't be a problem...?

The HTTP requests for labels happen asynchronously in a deferred update on upload complete, or when a maintenance script is run. They're fetched from the DB when the user requests the related special page.

What I specifically need to know here is how to get the proxy info from the environment, and if there are any other barriers unique to production for making external HTTP requests. I see that there is an http_proxy setting in hiera, but as far as I can tell from operations-wmf-config that isn't exposed to the MW appservers.

> The HTTP requests for labels happen asynchronously in a deferred update on upload complete, or when a maintenance script is run. They're fetched from the DB when the user requests the related special page.
>
> What I specifically need to know here is how to get the proxy info from the environment, and if there are any other barriers unique to production for making external HTTP requests. I see that there is an http_proxy setting in hiera, but as far as I can tell from operations-wmf-config that isn't exposed to the MW appservers.

I would assume your code will have to allow for a configuration variable to set the HTTP proxy, and then we'll provide access to it via mediawiki-config and the appropriate stanzas.
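A sketch of how the extension side might read such a configuration variable through MediaWiki's config service — the setting name `MachineVisionHttpProxy` is a hypothetical placeholder:

```php
<?php
use MediaWiki\MediaWikiServices;

// Read the proxy setting registered by the extension (name hypothetical);
// fall back to no proxy when the variable is unset or false, e.g. outside
// production.
$config = MediaWikiServices::getInstance()->getMainConfig();
$proxy = $config->get( 'MachineVisionHttpProxy' ) ?: null;
```

wmf-config would then assign the production proxy to that variable per realm, as discussed below.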

@Joe That sounds good, thanks. I'll update the code accordingly.

Basically what you need is:

  • a setting for the domains to exclude from proxying
  • one for the proxy itself (if present)

I'm not sure whether MediaWiki already supports all that in its libraries, but I would assume as much. I'll check and get back to you.

While investigating this yesterday I found that the Google client library that our Google request handler uses is built around Guzzle. From the docs, it looks like if HTTPS_PROXY is set in the environment, requests to the outside world should Just Work (TM). Further, Guzzle uses cURL as its underlying transport by default, so an http_proxy environment variable might get picked up by cURL itself.

Otherwise, it looks like the Google client library does not provide for an explicit proxy option at request time, so if I need to pass in a variable from extension config (which appears to be the case), then I'll have to update the handler to not use the Google client library and instead use one of the concrete MWHttpRequest subclasses (CurlHttpRequest or GuzzleHttpRequest) for these requests so that I can pass in the proxy setting.

So after some quick grepping: we already define a proxy in mediawiki-config, and it can be retrieved at $wmfLocalServices['urldownloader'], so:

  • if it's possible to pass the proxy to your library, it should be done the way we do it for other things, e.g. in mediawiki-config:

    $wgMachineVisionProxy = ( $wmfRealm !== 'labs' ) ? $wmfLocalServices['urldownloader'] : false;

but we definitely can't set the HTTPS_PROXY environment variable globally.

I would assume it shouldn't be too hard to contribute upstream an option for setting a specific proxy that doesn't come from environment variables, if it comes to that?

At a glance, ImageAnnotatorClient takes an httpHandler config option, which is a callable that processes PSR-7 requests, so you can just drop a Guzzle client in there:

// Build a Guzzle client configured with the proxy, then inject it as the
// REST transport's HTTP handler so every API request goes through it.
$guzzle = new \GuzzleHttp\Client( [ 'proxy' => $proxy ] );
$client = new ImageAnnotatorClient( [
    'transportConfig' => [
        'rest' => [
            'httpHandler' => function ( \Psr\Http\Message\RequestInterface $request ) use ( $guzzle ) {
                return $guzzle->send( $request );
            },
        ],
    ],
] );

https://github.com/googleapis/google-cloud-php-vision/blob/416e2cbc0b5f00b6b64ab295bc9580f1c4c1e244/src/V1/Gapic/ImageAnnotatorGapicClient.php#L201-L211
https://googleapis.github.io/gax-php/master/Google/ApiCore/Transport/RestTransport.html#method_build

Change 547741 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] MachineVision: Use an HTTP proxy in production

https://gerrit.wikimedia.org/r/547741

Mholloway claimed this task.

Replacing the client library was identified during the security readiness review as likely to be more secure and performant. The extension now offers configurable HTTP proxy support. Resolving this.

Change 547741 merged by Mholloway:
[operations/mediawiki-config@master] MachineVision: Use an HTTP proxy in production

https://gerrit.wikimedia.org/r/547741