
Security review for Google MT for Content Translation
Closed, ResolvedPublic

Description

Project Information

  • Name of tool/project: Content Translation / cxserver
  • Project home page: mediawiki.org/wiki/Content_translation
  • Name of team requesting review: Language
  • Primary contact: Kartik Mistry/ Santhosh Thottingal
  • Target date for deployment: September 2016
  • Link to code repository / patchset:
  • Programming Language(s) Used: JavaScript/PHP

Description of the tool/project

Google Translate is an external machine translation service that will provide Content Translation with a limited translation capacity. It will be integrated into the Content Translation tool.

Description of how the tool will be used at WMF

Users of Content Translation will use Google Machine Translation.

Dependencies

None.

Has this project been reviewed before?

No.

Working test environment

No. Let me know if a local key/setup is needed.

Post-deployment

Name of team responsible for tool/project after deployment and primary contact: Language (Kartik Mistry)

Event Timeline


@Bawolff, Some updates on this:

In https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/461386 we changed what we send to external MT engines. We now send minimized HTML, with all attributes removed from all tags except the id attribute. The translated HTML from the MT engines is then expanded by re-applying the attributes we stripped in the earlier step. This was done mainly to reduce the payload sent to MT engines (some of them count the characters sent to the API).

As a result, any attributes set by external MT engines in the MT output are completely discarded.

This is not the same as sending plain text to the MT engines, but in security terms it is practically very close to that, without compromising the ability to return the semantic structure of the HTML.
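To make the reduce/expand step concrete, here is a simplified standalone sketch of the idea. The helper names are hypothetical, and the real cxserver implementation works on a parsed document via its lineardoc library rather than regexes:

```javascript
// Strip every attribute except id, remembering the originals keyed by id.
function reduceHtml( html ) {
	const stored = {};
	const reduced = html.replace( /<(\w+)([^>]*)>/g, ( match, tag, attrs ) => {
		const idMatch = attrs.match( /\bid="([^"]*)"/ );
		if ( !idMatch ) {
			return '<' + tag + '>';
		}
		stored[ idMatch[ 1 ] ] = attrs.trim();
		return '<' + tag + ' id="' + idMatch[ 1 ] + '">';
	} );
	return { reduced, stored };
}

// Re-apply the stored attributes to the translated HTML. Anything an MT
// engine injected is discarded, because only the stored attributes survive.
function expandHtml( html, stored ) {
	return html.replace( /<(\w+)([^>]*)>/g, ( match, tag, attrs ) => {
		const idMatch = attrs.match( /\bid="([^"]*)"/ );
		if ( idMatch && stored[ idMatch[ 1 ] ] ) {
			return '<' + tag + ' ' + stored[ idMatch[ 1 ] ] + '>';
		}
		return '<' + tag + '>';
	} );
}

const source = '<p id="p1" class="lead" onclick="evil()">Hello <b>world</b></p>';
const { reduced, stored } = reduceHtml( source );
// reduced is now '<p id="p1">Hello <b>world</b></p>'
```

The round trip restores the original attributes from our side, so an attribute added by the MT engine (say, an onmouseover on the b tag) never makes it back into the expanded document.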

@Bawolff Does this address your earlier concerns about the translated content we get from MT engines?

I think you still need to sanitize -- for example, the <script> tag has its payload delivered by the node children, not the attributes. Potentially <style> is dangerous in the same way, if there is a CSS vulnerability. But again, a tag whitelist should be sufficient for this.

I think you still need to sanitize -- for example, the <script> tag has its payload delivered by the node children, not the attributes. Potentially <style> is dangerous in the same way, if there is a CSS vulnerability. But again, a tag whitelist should be sufficient for this.

We already do that general whitelisting/blacklisting of tags using DOMPurify.
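For illustration only, the tag-whitelisting principle looks roughly like this. cxserver relies on DOMPurify rather than hand-rolled regexes, and the tag list below is invented for the sketch:

```javascript
// Tags we keep; anything else is stripped. Illustrative list, not DOMPurify's.
const ALLOWED_TAGS = new Set( [ 'p', 'b', 'i', 'a', 'span', 'div' ] );

function whitelistTags( html ) {
	// Drop script/style elements together with their payload, since their
	// danger lives in the node children rather than the attributes.
	const withoutPayloads = html.replace( /<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, '' );
	// Then strip any remaining tag that is not on the whitelist.
	return withoutPayloads.replace( /<\/?(\w+)\b[^>]*>/g, ( match, tag ) =>
		ALLOWED_TAGS.has( tag.toLowerCase() ) ? match : ''
	);
}
```

Note the first pass: for script and style, removing only the tags is not enough, because the payload is the element's text content, which is exactly the concern raised above.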

Change 470543 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Add test for MT result sanitizing

https://gerrit.wikimedia.org/r/470543

Change 470543 merged by Santhosh:
[mediawiki/services/cxserver@master] Add test for MT result sanitizing

https://gerrit.wikimedia.org/r/470543

Mentioned in SAL (#wikimedia-operations) [2018-11-06T04:42:16Z] <kartik@deploy1001> Started deploy [cxserver/deploy@ddb0031]: Update cxserver to 17f9a10 (T144467, T198699, T208386)

Mentioned in SAL (#wikimedia-operations) [2018-11-06T04:47:42Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@ddb0031]: Update cxserver to 17f9a10 (T144467, T198699, T208386) (duration: 05m 26s)

sbassett added a subscriber: Bawolff.

Hey everybody-

I'm going to have a look at this in addition to the work @Bawolff has already done. I should have some variety of analysis completed later this week or early next week. The Security-Team thanks you for your patience on this.

@santhosh et al-

I've had some time to review the code here (in master) and had a few questions/observations:

  1. SAFE_FOR_JQUERY doesn't appear to be used within DOMPurify's sanitize config (@Bawolff's concern in T144467#3851643). Can we confirm or disconfirm whether the service will send sanitized content to jQuery's html method? Or is DOMPurify only being used to modify html sent to an MT service, as is implied by the comment at T144467#4694645?
  2. Per https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/461386/, it seems that Yandex is the only translation service that sends/receives html at this point. Is this true? If so, is it still necessary to call sanitize within lib/mt/Google.js (via translateReducedHtml within translate)?
  3. I'm still not seeing anything in regards to addressing potential http leaks, as @Bawolff noted in T144467#3852115. If, as mentioned in (1), Yandex is the only MT service expected to send and receive html now, then we should only have to worry about that service. But some solution (be it parsoid or something along the lines of cure53's experimental http leaks preventive hook for DOMPurify https://github.com/cure53/DOMPurify/blob/master/demos/hooks-proxy-demo.html) should be implemented. I see there was discussion of potentially using CSP at https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/363156/, but for now, CSP is in report-only mode on most wikis and I'm not certain we have a great idea of when enforce mode will be deployed or exactly what that will look like. So it would be my recommendation to proactively deal with any potential 3rd-party http leaks (basically any leak outside of wikimedia wikis) as part of cxserver.
  4. A bit unrelated to this task, but still security-related - a quick npm audit found a low prototype pollution vulnerability within the currently-pinned service-runner package (dep: lodash). Can service-runner be upgraded to use lodash >=4.17.5?

Finally, I'm still evaluating DOMPurify, but it looks fairly solid IMO so far for sanitizing potential XSS payloads. I'd still like to review it a bit more and also further investigate solutions for potential http leaks. Of course if parsoid eventually becomes a part of the solution here, DOMPurify would no longer be required.

  1. A bit unrelated to this task, but still security-related - a quick npm audit found a low prototype pollution vulnerability within the currently-pinned service-runner package (dep: lodash). Can service-runner be upgraded to use lodash >=4.17.5?

This needs to be fixed via: https://github.com/gwicke/kad/pull/1

Change 477459 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Use SAFE_FOR_JQUERY in the DOMPurify config

https://gerrit.wikimedia.org/r/477459

Thanks for the review!

  1. SAFE_FOR_JQUERY doesn't appear to be used within DOMPurify's sanitize config (@Bawolff's concern in T144467#3851643). Can we confirm or disconfirm whether the service will send sanitized content to jQuery's html method? Or is DOMPurify only being used to modify html sent to an MT service, as is implied by the comment at T144467#4694645?

The sanitized content will go to the CX translation editor front end, which is VisualEditor and uses jQuery. So I have now also added that configuration: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/477459

  2. Per https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/461386/, it seems that Yandex is the only translation service that sends/receives html at this point. Is this true? If so, is it still necessary to call sanitize within lib/mt/Google.js (via translateReducedHtml within translate)?

Google also accepts HTML input and returns translated HTML. The Google service was added after that patch.

  3. I'm still not seeing anything in regards to addressing potential http leaks, as @Bawolff noted in T144467#3852115. If, as mentioned in (1), Yandex is the only MT service expected to send and receive html now, then we should only have to worry about that service. But some solution (be it parsoid or something along the lines of cure53's experimental http leaks preventive hook for DOMPurify https://github.com/cure53/DOMPurify/blob/master/demos/hooks-proxy-demo.html) should be implemented. I see there was discussion of potentially using CSP at https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/363156/, but for now, CSP is in report-only mode on most wikis and I'm not certain we have a great idea of when enforce mode will be deployed or exactly what that will look like. So it would be my recommendation to proactively deal with any potential 3rd-party http leaks (basically any leak outside of wikimedia wikis) as part of cxserver.

As of now, we have two MT engines that send/receive HTML: Yandex and Google.

As demonstrated in the tests at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/470543, cxserver will not send or accept any HTML attributes other than the id attribute. We remove all attributes except id before sending to the external MT engine, and we remove all attributes except id from the received HTML. Then we reapply all of the original attributes to that DOM. This was originally done to reduce the payload sent to external services that bill on a per-character basis, but it also helps with security.

I think that addresses the concerns mentioned in https://github.com/cure53/DOMPurify/tree/master/demos#hook-to-proxy-all-links-link. In addition, all href values in links go through MediaWiki title sanitization when they are adapted to target-language wiki links. This happens on top of the machine translation output and in the same request (the v2/translate API).
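The link-adaptation idea can be sketched as follows. The function and the character rules here are invented for illustration; the real code applies MediaWiki's actual title sanitization. The point is that every href is rebuilt from a page title against the target wiki, so an engine cannot inject an arbitrary external URL:

```javascript
// Rebuild an href as a title link on the target wiki. Hypothetical helper;
// the character class below is only a rough stand-in for MediaWiki's real
// title validity rules.
function adaptHref( href, targetDomain ) {
	const title = decodeURIComponent( href.replace( /^.*\/wiki\//, '' ) )
		.replace( /[\u0000-\u001f<>[\]|{}]/g, '' )
		.trim();
	if ( !title ) {
		return null;
	}
	return 'https://' + targetDomain + '/wiki/' +
		encodeURIComponent( title.replace( / /g, '_' ) );
}
```

Even a hostile href like javascript:alert(1) comes out as a harmless title link on the target wiki rather than a live javascript: URL.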

Let me know if anything more is required.

  4. A bit unrelated to this task, but still security-related - a quick npm audit found a low prototype pollution vulnerability within the currently-pinned service-runner package (dep: lodash). Can service-runner be upgraded to use lodash >=4.17.5?

This is coming from service-runner and affects all services. We have asked the Services team to update the dependencies.

Finally, I'm still evaluating DOMPurify, but it looks fairly solid IMO so far for sanitizing potential XSS payloads. I'd still like to review it a bit more and also further investigate solutions for potential http leaks. Of course if parsoid eventually becomes a part of the solution here, DOMPurify would no longer be required.

We were initially planning to port the Parsoid sanitizer as a library and use it here, but we changed that plan after finding the less complex solution explained above in T144467#4692200. DOMPurify now acts as a second layer of sanitization, just to catch anything that gets past our HTML compression and expansion code.

@santhosh -

This need to fix via: https://github.com/gwicke/kad/pull/1
This is coming from service-runner and affects all services. We have asked the Services team to update the dependencies.

This should be fine, especially since npm audit considers this a low vulnerability. Something to keep an eye on though, perhaps with a sub-task.

The sanitized content will go to the CX translation editor front end, which is VisualEditor and uses jQuery. So I have now also added that configuration: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/477459

This looks good to me and I also replied to @Nikerabbit's comment on the gerrit patch set. Unless there are certain undesirable side effects of the SAFE_FOR_JQUERY flag, I would advise that we follow the vendor's recommendation of enabling said flag for sanitized html being rendered by jQuery's html() (or other dom-writing) methods.

Google also accepts HTML input and returns translated HTML. The Google service was added after that patch.

Ok, that's fine. And we'd definitely want to keep the DOMPurify-based sanitization in place for the Google MT client.

As of now, we have two MT engines that send/receive HTML: Yandex and Google.

As demonstrated in the tests at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/470543, cxserver will not send or accept any HTML attributes other than the id attribute. We remove all attributes except id before sending to the external MT engine, and we remove all attributes except id from the received HTML. Then we reapply all of the original attributes to that DOM. This was originally done to reduce the payload sent to external services that bill on a per-character basis, but it also helps with security.

I think that addresses the concerns mentioned in https://github.com/cure53/DOMPurify/tree/master/demos#hook-to-proxy-all-links-link. In addition, all href values in links go through MediaWiki title sanitization when they are adapted to target-language wiki links. This happens on top of the machine translation output and in the same request (the v2/translate API).

Just to confirm - lineardoc is something developed in-house using JavaScript's sax module, correct? It seems to work as intended via the tests within test/lineardoc/LinearDoc.test.js. However, I tried testing some of the translation process locally and did come across some issues. For example, sending the contents of cure53's http leaks reference file (https://github.com/cure53/HTTPLeaks/blob/master/leak.html) through the lineardoc reduction/expansion process, I end up with P7886. This actually looks pretty good, despite a couple of obvious shortcomings with various css and svg syntax. I've also seen DOMPurify.sanitize fail in certain situations, such as when the aforementioned P7886 is passed to it, along with large XSS test payloads such as https://pastebin.com/raw/48WdZR6L. This may not be that big of a deal, especially when the Google client is limiting the length of the source html to 5000 characters and these examples may be outside of the likely threat model, but it is at least mildly concerning in regards to the performance of these translation services. Perhaps further testing or the Language Team taking ownership of these risks is in order.

We were initially planning to port the Parsoid sanitizer as a library and use it here, but we changed that plan after finding the less complex solution explained above in T144467#4692200.

Yes, that sounds fine. In discussing this a bit more with the Security Team, I believe we're ok with using DOMPurify in cases like this.

Once I have some clarity around the lineardoc/sanitize issues mentioned above, I'd be fine signing off on the security review for these particular pieces (the MT clients) of the cxserver service.

Just to confirm - lineardoc is something developed in-house using JavaScript's sax module, correct?

Correct.

It seems to work as intended via the tests within test/lineardoc/LinearDoc.test.js. However, I tried testing some of the translation process locally and did come across some issues. For example, sending the contents of cure53's http leaks reference file (https://github.com/cure53/HTTPLeaks/blob/master/leak.html) through the lineardoc reduction/expansion process, I end up with P7886. This actually looks pretty good, despite a couple of obvious shortcomings with various css and svg syntax.

The content inside style, script, and svg elements should never be translated, so we are thinking of not sending it to the MT services at all. That content can be kept on our side, just as we hold the attributes. If we do this, we no longer need to worry about modified content inside CSS or SVGs. We did not do this before because those tags have not appeared in CX's use cases so far.
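The hold-back idea could be sketched like this, with hypothetical helper names and comment placeholders; the actual change is in lineardoc's Doc#reduce, which stores the content alongside the stripped attributes rather than in comments:

```javascript
// Pull the payload of non-translatable elements out before sending to MT,
// leaving numbered placeholders, and restore it afterwards.
function extractNonTranslatable( html ) {
	const stored = [];
	const reduced = html.replace(
		/<(style|script|svg)\b[^>]*>[\s\S]*?<\/\1>/gi,
		( match ) => {
			stored.push( match );
			return '<!--nt:' + ( stored.length - 1 ) + '-->';
		}
	);
	return { reduced, stored };
}

function restoreNonTranslatable( html, stored ) {
	return html.replace( /<!--nt:(\d+)-->/g, ( match, i ) => stored[ Number( i ) ] );
}
```

Since the style/script/svg payload never leaves our side, an MT engine has no opportunity to modify it.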

I did the first part of this at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/477732

Change 477757 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Doc#reduce: Remove and store the content of non-translatable items

https://gerrit.wikimedia.org/r/477757

@santhosh -

A summary of where I think we're at right now:

  1. I applied your patches from https://gerrit.wikimedia.org/r/477732 and https://gerrit.wikimedia.org/r/477757, which seem to work well in alleviating the concerns around <style> and <svg> content. For now, within the process flow of 1) reducing html via lineardoc 2) sending said html off to the Google or Yandex MT services 3) expanding the received translated html 4) sending it through DOMPurify.sanitize, the only potential for http leaks I'm seeing are if they already existed within the source html. One odd thing I did notice is that, within my simulated translateHtml method, if I prepend html - as if an MT service had unexpectedly done so - the lineardoc expansion process seems to break a bit. I believe we are operating under the assumption that none of the MT services will do this, but the insertion of unexpected, additional html from an MT (prepending, appending or inserting within the source html) might be a scenario worthy of further testing.
  2. After further testing, the issues that I mentioned within T144467#4799236 seem to be due to DOMPurify's handling of certain stylesheet elements, e.g.
    1. <link rel="stylesheet" href="https://leaking.via/link-stylesheet" />
    2. <link rel="alternate stylesheet" href="https://leaking.via/link-alternate-stylesheet" />
    3. <style>@import 'https://leaking.via/css-import-string';</style>

      So something like:
const DOMPurify = createDOMPurify( ( new jsdom.JSDOM( '' ) ).window );
var safe = DOMPurify.sanitize( '<LINK REL="stylesheet" HREF="javascript:alert(1);">' );

results in an odd error:

const request = this._resourceLoader.fetch(url, {
TypeError: Cannot read property 'fetch' of null

Again, more something to be aware of which I just reported to DOMPurify: https://github.com/cure53/DOMPurify/issues/311

  3. More for my own curiosity as I assume it isn't a security issue, but I was wondering if you had any examples of the markup used for the textblocks/annotations described for the Youdao service, as noted within the Youdao MT client's docblock for translateHtml. I really just wanted to make sure it wasn't sending anything html- or xml-like to that service.
  4. Finally, there's been some more discussion re: the SAFE_FOR_JQUERY flag on the gerrit change set: https://gerrit.wikimedia.org/r/477459/

Change 477972 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Reject MT result if it is wrapped with other tags

https://gerrit.wikimedia.org/r/477972

  1. I applied your patches from https://gerrit.wikimedia.org/r/477732 and https://gerrit.wikimedia.org/r/477757, which seem to work well in alleviating the concerns around <style> and <svg> content. For now, within the process flow of 1) reducing html via lineardoc 2) sending said html off to the Google or Yandex MT services 3) expanding the received translated html 4) sending it through DOMPurify.sanitize, the only potential for http leaks I'm seeing are if they already existed within the source html.

Yes, thanks for confirming. I fixed some code quality issues in those patches today. The source HTML for a CX translation comes from the editor we use there, which is VisualEditor, and it operates on the MediaWiki DOM Spec.

One odd thing I did notice is that, within my simulated translateHtml method, if I prepend html - as if an MT service had unexpectedly done so - the lineardoc expansion process seems to break a bit. I believe we are operating under the assumption that none of the MT services will do this, but the insertion of unexpected, additional html from an MT (prepending, appending or inserting within the source html) might be a scenario worthy of further testing.

Even though this is extremely unlikely to happen, I added a check for it: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/477972 (Reject MT result if it is wrapped with other tags)
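The wrapper check can be illustrated with a minimal sketch (names invented here; the actual guard lives in the cxserver patch above):

```javascript
// Return the lowercased tag name of the first element in an HTML fragment,
// or null if the fragment does not start with a tag.
function firstTag( html ) {
	const match = html.trim().match( /^<(\w+)/ );
	return match ? match[ 1 ].toLowerCase() : null;
}

// Reject the MT output when its root tag differs from the source's root tag,
// i.e. the engine wrapped the result in unexpected markup.
function isWrappedUnexpectedly( sourceHtml, mtHtml ) {
	return firstTag( sourceHtml ) !== firstTag( mtHtml );
}
```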

Again, more something to be aware of which I just reported to DOMPurify: https://github.com/cure53/DOMPurify/issues/311

Thanks, noted.

  3. More for my own curiosity as I assume it isn't a security issue, but I was wondering if you had any examples of the markup used for the textblocks/annotations described for the Youdao service, as noted within the Youdao MT client's docblock for translateHtml. I really just wanted to make sure it wasn't sending anything html- or xml-like to that service.

No, that service is not HTML-aware. By textblock we mean an instance of lineardoc's TextBlock. It corresponds to markup like <span class="sentence">This is <b>bold</b> and this is a <a href="#">link</a></span>. We don't send this directly to the service; we extract the plain text from it and send that. So we send "This is bold and this is a link". Once translated, the words in the sentence can appear in a different order depending on the target language. The annotations (the <b> and <a> tags) then need to be reapplied on top of the plain text by detecting where each word went in the target language. This is a complex process and is documented here. Finally, we wrap the result in the <span class="sentence"> textblock.
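The plain-text extraction half of that round trip is trivial to sketch (simplified; the real code is lineardoc's TextBlock, and the annotation re-application step is far more involved than this):

```javascript
// Strip all markup from a sentence textblock, leaving only the text that is
// actually sent to a non-HTML-aware engine such as Youdao.
function plainText( html ) {
	return html.replace( /<[^>]+>/g, '' );
}

plainText( '<span class="sentence">This is <b>bold</b> and this is a <a href="#">link</a></span>' );
// → 'This is bold and this is a link'
```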

@santhosh -

Thanks for all of the quick follow-up on this. https://gerrit.wikimedia.org/r/477972 looks good as additional hardening and the new test within test/mt/Yandex.test.js runs well. I think this is looking pretty good, and would like to leave the SAFE_FOR_JQUERY flag implementation up to you and the Language Team to discuss further, as I believe I've probably provided as much relevant feedback within https://gerrit.wikimedia.org/r/477459/ as I can. Also, thanks for the information regarding the specific implementation of the Youdao service. I'm going to review that a bit more, but for now I'm not seeing any issues there, as it seems to be a more restrictive (fewer html tags/markup) version of the 1) reduce html 2) send to MT service 3) expand translated html process.

@santhosh et al -

One more follow-up item - is apertium.wmflabs.org only used for dev/testing purposes, or is the plan for cxserver to use that within a production capacity at some point? My concern is that, as of now, it's a completely open API that isn't even running TLS. Thanks.

@santhosh et al -

One more follow-up item - is apertium.wmflabs.org only used for dev/testing purposes, or is the plan for cxserver to use that within a production capacity at some point? My concern is that, as of now, it's a completely open API that isn't even running TLS. Thanks.

The apertium.wmflabs.org labs instance is no longer used by production. It is only for testing purposes.

@KartikMistry - Ok, that sounds good. I just wanted to ensure apertium.wmflabs.org was only used for testing purposes and not in any production capacity. Thanks.

Change 477757 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Doc#reduce: Remove and store the content of non-translatable items

https://gerrit.wikimedia.org/r/477757

Change 477972 merged by Nikerabbit:
[mediawiki/services/cxserver@master] Reject MT result if it is wrapped with other tags

https://gerrit.wikimedia.org/r/477972

Change 477459 abandoned by Santhosh:
Use SAFE_FOR_JQUERY in the DOMPurify config

Reason:
Since the usefulness of this is doubtful, and considering the additional security enhancements we made in the machine translation clients, I am abandoning this patch.

https://gerrit.wikimedia.org/r/477459

@santhosh - abandoning r477459 should be fine, but I would note that sending any DOMPurified content to jQuery's dom-writing functions (html(), etc) should be thoroughly vetted so as to ensure any potentially-dangerous payloads are being sanitized (via some method similar to DOMPurify's SAFE_FOR_JQUERY feature) as expected.

For the audit, I started by searching for uses of getCXServerUrl. Because all of the ext.*.js files (except sitemapper) are used only by CX1, we can ignore those and concentrate on these four uses.

Of these:

  • mw.cx.MwApiRequestManager.prototype.fetchSourcePageContent is unused and should be removed.
  • mw.cx.init.Translation.prototype.fetchSourePageContent uses the /page/ API to load source text. We trust the parsoid output, but this code constructs mw.cx.dm.Translation which then uses ve.init.target.parseDocument to parse the html.
  • mw.cx.MachineTranslationService.prototype.fetchProviders calls the /list/ API and is not relevant.
  • mw.cx.MachineTranslationService.prototype.fetchTranslation calls the /translate/ api and needs an audit. It is only called from mw.cx.MachineTranslationService.prototype.translate which is only called from ve.init.mw.CXTarget.prototype.translateSection which is called from ve.init.mw.CXTarget.prototype.prefetchTranslationForSection (it does not use the value so we can ignore that) and from ve.init.mw.CXTarget.prototype.changeContentSource which passes the result to ve.init.mw.CXTarget.prototype.setSectionContent which processes it using ve.createDocumentFromHtml.

@sbassett Based on the above, and assuming VE does not internally use jQuery for further processing, we are not currently using jQuery to process data returned from the cxserver MT API. This code is quite stable, so the risk of introducing new usages through jQuery seems low as well.

Mentioned in SAL (#wikimedia-operations) [2018-12-20T12:48:15Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@16f65cb]: Update cxserver to 803baa4 (T210581, T211889, T144467, T209473) (duration: 04m 42s)

@Nikerabbit - This all sounds reasonable. From your analysis (thanks!) it appears the two sinks we eventually arrive at (from CXServer's perspective) are ve.init.target.parseDocument and ve.createDocumentFromHtml, both of which are trusted and do not use jQuery in any way. Furthermore, VE seems unconcerned with the flag within its own sanitization module: https://gerrit.wikimedia.org/r/plugins/gitiles/VisualEditor/VisualEditor/+/master/src/ve.sanitize.js#23

All (@santhosh, @Nikerabbit, @KartikMistry, etc.) - I'll plan to resolve this ticket today or tomorrow unless there are any additional concerns.