Move Wikisource OCR's API proxy to production
Closed, Invalid | Public | 1 Estimated Story Points

Description

Right now, the Wikisource OCR service is using an API proxy hosted on Wikimedia Cloud Services. This is problematic from both a reliability and a privacy perspective (since Wikisource OCR is already a default tool on some Wikisources and soon will be on many more). We need to move to an API proxy that is hosted on a production server.

The Language team is already using an API proxy that is on a production server: cxserver.wikimedia.org. This proxy already handles sending requests to the Google Cloud API (among other APIs) and thus we may be able to share use of that server or set up a very similar service. The Language team's service was set up by @akosiaris from SRE and @KartikMistry on the Language team.

Event Timeline

@akosiaris - The Community Tech team needs basically the same thing that was set up for the Language team, just for a different Google Cloud Services API. Do you think it would make sense for them to both share the same proxy service or should they be handled as separate services?

Note also T243736#5849451. The current google-api-proxy instance is still open to anyone in the world who has the API key. I was never able to tell if the issue was on Google's side or our own, but we may want to make sure the IP restrictions are working properly before moving the proxy to production.

Naike set the point value for this task to 1. Sep 11 2020, 10:49 AM

I checked with @ifried; the client work on this will be in Q3 20-21. It would be helpful to them if the proxy was moved sometime before then.

This was brought to my attention yesterday by @WDoranWMF, sorry for missing it and many thanks for the ping.

Having read the task, I am afraid there are multiple misunderstandings in the description of the task.

Let me start with the fact that the Language team IS NOT using an API proxy in production. cxserver.wikimedia.org is the public endpoint of the cxserver[1] service, powered by the cxserver[2] software, which is actively maintained by that team. It exposes an API that is able to translate between languages using a variety of backends (some of which do indeed need to be reached via an outgoing proxy). It is definitely NOT a proxy and most definitely NOT reusable for generic proxy functionality.

Now, on to Google OCR for Wikisource. I wasn't aware of this until today, but from my reading[3], it is a gadget that relies on a Toolforge tool and needs to be included on a per-project basis; for a list, look at [3]. For enwikisource, for example, the corresponding gadget page is https://en.wikisource.org/wiki/MediaWiki:Gadget-GoogleOCR.js and the corresponding JS script is https://wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript. In that JS, there is the following declaration:

var toolUrl = "//ws-google-ocr.toolforge.org/api.php";

referencing the tool in toolforge, the code for which can be found in [4], a PHP codebase powering that tool. That tool, judging from [5] seems to be using a proxy indeed.
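To make the dependency concrete, here is a minimal sketch of how a gadget could build a request URL from that declaration. Only the `toolUrl` value comes from the script quoted above; the function name `buildOcrRequestUrl` and the query parameter names `image` and `lang` are assumptions made for illustration and may not match what the gadget or the tool actually use.

```javascript
// Value quoted from the gadget script. The leading "//" makes the URL
// protocol-relative, so a browser reuses the scheme of the current page.
const toolUrl = "//ws-google-ocr.toolforge.org/api.php";

// Hypothetical helper: the parameter names "image" and "lang" are
// assumptions, not confirmed by the thread.
function buildOcrRequestUrl(imageUrl, lang) {
  const params = new URLSearchParams({ image: imageUrl, lang: lang });
  // Outside a browser there is no page scheme to inherit, so https is
  // prepended explicitly here.
  return "https:" + toolUrl + "?" + params.toString();
}
```

Whatever the exact parameters are, the point is that every OCR request from the wiki goes to the Toolforge host first, and only from there on to Google.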

So, with all of the above, if this task is about having the toolforge tool use a proxy that is in production, I am afraid that's not possible. No software hosted in toolforge (or WMCS for that matter) is able to reach out to internal production infrastructure, as per our policies. If the task is about moving the tool to production, that's an entirely different discussion and way larger in scope than what is described in this task.

[1] https://cxserver.wikimedia.org/v2?doc
[2] https://gerrit.wikimedia.org/g/mediawiki/services/cxserver/+/refs/heads/master
[3] https://wikisource.org/wiki/Wikisource:Google_OCR
[4] https://gerrit.wikimedia.org/g/labs/tools/wikisource-ocr/+/refs/heads/master
[5] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/wikisource-ocr/+/refs/heads/master/config.php.dist#8

@akosiaris - Thanks for the reply and clearing up my misunderstandings. We have no need to keep the OCR service on Toolforge and would like to eventually move it into a MediaWiki extension (which would also make it easier for all the Wikisource projects to utilize). But it won't make any sense for us to do that unless there is an API proxy available in production that can communicate with the Google Vision API. What API proxy is cxserver.wikimedia.org utilizing to communicate with Google? Would it be possible for us to use that as well or have a similar proxy set up for this service?

Happy to be of service. I have to say that cxserver uses no API proxy. It uses the HTTP forwarding proxy that all other services and MediaWiki use. If the OCR functionality is moved into a MediaWiki extension, then all it needs to use is the $wgCopyUploadProxy setting and it will be good to go.
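For readers unfamiliar with the distinction, a forwarding proxy does not wrap or rewrite the remote API; the client addresses the real endpoint and merely routes the connection through the proxy. The sketch below shows the general idea using Node.js-style request options. It is not how MediaWiki implements `$wgCopyUploadProxy`; the function name `forwardProxyRequestOptions` and the proxy address in the usage note are hypothetical.

```javascript
// Hypothetical helper: build Node.js http.request() options that route a
// request through an HTTP forwarding proxy.
function forwardProxyRequestOptions(proxyUrl, targetUrl) {
  const proxy = new URL(proxyUrl);
  const target = new URL(targetUrl);
  const proxyPort = Number(proxy.port) || 80;

  if (target.protocol === "https:") {
    // For HTTPS the client asks the proxy to open a tunnel with CONNECT,
    // then speaks TLS to the target through that tunnel.
    return {
      host: proxy.hostname,
      port: proxyPort,
      method: "CONNECT",
      path: target.hostname + ":" + (target.port || "443"),
    };
  }

  // For plain HTTP the client sends the absolute target URL to the proxy.
  // GET is hardcoded only to keep the illustration short.
  return {
    host: proxy.hostname,
    port: proxyPort,
    method: "GET",
    path: target.href,
    headers: { Host: target.host },
  };
}
```

Called with a placeholder proxy such as `http://proxy.example.internal:8080` and an HTTPS target on `vision.googleapis.com`, this returns a `CONNECT` request for `vision.googleapis.com:443`. The target API stays the one the client names, which is why no service-specific API proxy is needed.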

@akosiaris - Thanks for that info! That's super helpful!

@akosiaris - I updated the documentation at Manual:$wgAllowCopyUploads. Feel free to tweak further.