Security review for Google MT for Content Translation
Open, Normal, Public

Description

Project Information

  • Name of tool/project: Content Translation / cxserver
  • Project home page: mediawiki.org/wiki/Content_translation
  • Name of team requesting review: Language
  • Primary contact: Kartik Mistry/ Santhosh Thottingal
  • Target date for deployment: September 2016
  • Link to code repository / patchset:
  • Programming Language(s) Used: JavaScript/PHP

Description of the tool/project

Google Translate is an external machine translation service that provides Content Translation with limited capacity. It will be integrated into the Content Translation tool.

Description of how the tool will be used at WMF

Users of Content Translation will use Google Machine Translation.

Dependencies

None.

Has this project been reviewed before?

No.

Working test environment

No. Let me know if local key/setup is needed.

Post-deployment

Name of team responsible for tool/project after deployment and primary contact:
Language (Kartik Mistry)

Arrbee added a comment. Jan 5 2018, 5:47 PM

I am not sure how 3 tags got removed from this earlier. Am adding them back now.

Arrbee edited projects, added Services (watching); removed Services. Jan 5 2018, 5:48 PM
Arrbee added a project: Language-2017-Oct-Dec.

So I don't really feel that comfortable with DOMPurify, given that preventing HTTP leaks is outside their threat model. I'm also not really a fan of taking unsafe HTML and broadly trying to sanitize it, instead of having a small whitelist of known, expected HTML tags.

It would be at least as big as the whitelist for parsoid output, wouldn't it?

If feeding parsoid data in, yes.

Would it be possible to just use text with the Google API instead of HTML? That would seem safer overall.

We have the code for that, yes, but we do want to keep alignment between source and target elements (such as links). The text-based algorithm is not nearly as reliable as direct HTML translation.

Failing that, maybe we could use DOMPurify in combination with putting it in an iframe with a restrictive sandbox attribute/CSP. Not sure how doable this would be with the ContentTranslation extension.

Not to mention how one would do it with VisualEditor... In general I think it would make the MT API less useful if its clients needed complicated sandboxing.

The method used doesn't matter too much to me - I just want to make sure that the security of our users is in the hands of code we control, not other people's code. We don't need to be quite as paranoid as we would be with untrusted user input, but there should be no way an external service could compromise the security of our users.

Given everyone's going to be at allhands/devsummit next week, maybe we could discuss this bug in person.

Given everyone's going to be at allhands/devsummit next week, maybe we could discuss this bug in person.

So we did do that, but I was super tired during All Hands, so I'm not sure the discussion was very fruitful on account of me being overtired (sorry about that).

So thinking about this - what if we had the following workflow:

  1. We have the parsoid html to translate
  2. We send it off to google translate (or other backend)
  3. We send the result to rest api transform/html/to/wikitext
  4. We send that result to transform/wikitext/to/html to get back some safe html.

By round-tripping through the API's wikitext/HTML conversion process, this should sanitize the HTML and render it safe.
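A minimal sketch of that round trip, assuming the public RESTBase transform endpoints and Node 18+'s global fetch. The wiring and function names are illustrative, not cxserver code:

```javascript
// Build a RESTBase transform endpoint URL, e.g.
// https://en.wikipedia.org/api/rest_v1/transform/html/to/wikitext
function transformUrl(domain, from, to) {
  return `https://${domain}/api/rest_v1/transform/${from}/to/${to}`;
}

// Round-trip MT output through Parsoid's serializer and parser:
// HTML -> wikitext -> HTML, so unexpected markup is normalized away.
async function roundTripSanitize(domain, translatedHtml) {
  const post = async (url, body) => {
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body)
    });
    return res.text();
  };
  const wikitext = await post(
    transformUrl(domain, 'html', 'wikitext'), { html: translatedHtml });
  return post(transformUrl(domain, 'wikitext', 'html'), { wikitext });
}
```

The payload shape accepted by the transform endpoints may differ slightly from this sketch; the point is the html-to-wikitext-to-html composition.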

Ryuch added a subscriber: Ryuch. May 31 2018, 8:47 PM
Ryuch removed a subscriber: ryucheol.

@Bawolff, Some updates on this:

In https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/461386 we changed what we send to external MT engines. We now send minimized HTML, with all attributes removed from all tags except the id attribute. The translated HTML from the MT engines is then expanded by re-applying the attributes we stripped in the earlier step. This was done mainly to reduce the payload sent to MT engines (some of them count the characters sent to the API).

But, as a result, any attributes set by external MT engines in the MT output are completely discarded.

This is not the same as sending plain text to MT engines, but it is practically very close to that in terms of security, without compromising the ability to return the semantic structure of the HTML.
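The reduce/expand idea can be sketched roughly as follows. The helper names and the regex-based approach are illustrative only; the real implementation lives in cxserver's lineardoc, not in these regexes:

```javascript
// Strip every attribute except id before sending to the MT engine, remember
// the stripped attributes keyed by id, and re-apply them to the translated
// HTML afterwards. Attributes invented by the MT engine are discarded.
function reduceHtml(html) {
  const stored = new Map(); // id -> original attribute string
  const reduced = html.replace(/<([a-zA-Z][\w-]*)((?:\s+[^<>]*?)?)(\/?)>/g,
    (match, tag, attrs, selfClose) => {
      const idMatch = /\bid="([^"]*)"/.exec(attrs);
      if (!idMatch) return `<${tag}${selfClose}>`;
      stored.set(idMatch[1], attrs.trim());
      return `<${tag} id="${idMatch[1]}"${selfClose}>`;
    });
  return { reduced, stored };
}

function expandHtml(translated, stored) {
  // Drop whatever attributes the MT engine returned; restore only ours.
  return translated.replace(/<([a-zA-Z][\w-]*)((?:\s+[^<>]*?)?)(\/?)>/g,
    (match, tag, attrs, selfClose) => {
      const idMatch = /\bid="([^"]*)"/.exec(attrs);
      const original = idMatch && stored.get(idMatch[1]);
      return original ? `<${tag} ${original}${selfClose}>` : `<${tag}${selfClose}>`;
    });
}
```

So an `onclick` or `href` injected by the MT engine never survives expansion; only attributes we stripped ourselves come back.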

@Bawolff Does this address your earlier concerns about the translated content we get from MT engines?

cscott added a subscriber: cscott. Oct 24 2018, 8:40 PM

I think you still need to sanitize -- for example, the <script> tag has its payload delivered by the node children, not the attributes. Potentially <style> is dangerous in the same way, if there is a CSS vulnerability. But again, a tag whitelist should be sufficient for this.

I think you still need to sanitize -- for example, the <script> tag has its payload delivered by the node children, not the attributes. Potentially <style> is dangerous in the same way, if there is a CSS vulnerability. But again, a tag whitelist should be sufficient for this.

We already do that general whitelisting/blacklisting of tags using DOMPurify.
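As a dependency-free illustration of what such tag whitelisting accomplishes (cxserver actually uses DOMPurify; the tag list and helper below are assumptions for the sketch):

```javascript
// Tags allowed to survive sanitization (illustrative subset).
const ALLOWED_TAGS = new Set(['a', 'b', 'i', 'span', 'p', 'ul', 'ol', 'li']);

function stripDisallowedTags(html) {
  // Remove dangerous elements together with their children, so payloads
  // carried in node content (e.g. inside <script> or <style>) are dropped too.
  for (const tag of ['script', 'style', 'svg']) {
    html = html.replace(new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, 'gi'), '');
  }
  // Unwrap any remaining tag that is not on the whitelist, keeping its text.
  return html.replace(/<\/?([a-zA-Z][\w-]*)[^>]*>/g,
    (match, tag) => ALLOWED_TAGS.has(tag.toLowerCase()) ? match : '');
}
```

This is exactly the property cscott's comment calls for: `<script>` and `<style>` payloads live in node children, so whole-subtree removal, not attribute filtering, is what neutralizes them.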

Change 470543 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Add test for MT result sanitizing

https://gerrit.wikimedia.org/r/470543

Change 470543 merged by Santhosh:
[mediawiki/services/cxserver@master] Add test for MT result sanitizing

https://gerrit.wikimedia.org/r/470543

Mentioned in SAL (#wikimedia-operations) [2018-11-06T04:42:16Z] <kartik@deploy1001> Started deploy [cxserver/deploy@ddb0031]: Update cxserver to 17f9a10 (T144467, T198699, T208386)

Mentioned in SAL (#wikimedia-operations) [2018-11-06T04:47:42Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@ddb0031]: Update cxserver to 17f9a10 (T144467, T198699, T208386) (duration: 05m 26s)

leila added a subscriber: leila. Tue, Nov 27, 5:29 PM
sbassett added a subscriber: Bawolff.

Hey everybody-

I'm going to have a look at this in addition to the work @Bawolff has already done. I should have some variety of analysis completed later this week or early next week. The Security-Team thanks you for your patience on this.

@santhosh et al-

I've had some time to review the code here (in master) and had a few questions/observations:

  1. SAFE_FOR_JQUERY doesn't appear to be used within DOMPurify's sanitize config (@Bawolff's concern in T144467#3851643). Can we confirm or disconfirm whether the service will send sanitized content to jQuery's html method? Or is DOMPurify only being used to modify html sent to an MT service, as is implied by the comment at T144467#4694645?
  2. Per https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/461386/, it seems that Yandex is the only translation service that sends/receives html at this point. Is this true? If so, is it still necessary to call sanitize within lib/mt/Google.js (via translateReducedHtml within translate)?
  3. I'm still not seeing anything in regards to addressing potential http leaks, as @Bawolff noted in T144467#3852115. If, as mentioned in (1), Yandex is the only MT service expected to send and receive html now, then we should only have to worry about that service. But some solution (be it parsoid or something along the lines of cure53's experimental http leaks preventive hook for DOMPurify https://github.com/cure53/DOMPurify/blob/master/demos/hooks-proxy-demo.html) should be implemented. I see there was discussion of potentially using CSP at https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/363156/, but for now, CSP is in report-only mode on most wikis and I'm not certain we have a great idea of when enforce mode will be deployed or exactly what that will look like. So it would be my recommendation to proactively deal with any potential 3rd-party http leaks (basically any leak outside of wikimedia wikis) as part of cxserver.
  4. A bit unrelated to this task, but still security-related - a quick npm audit found a low prototype pollution vulnerability within the currently-pinned service-runner package (dep: lodash). Can service-runner be upgraded to use lodash >=4.17.5?

Finally, I'm still evaluating DOMPurify, but it looks fairly solid IMO so far for sanitizing potential XSS payloads. I'd still like to review it a bit more and also further investigate solutions for potential http leaks. Of course if parsoid eventually becomes a part of the solution here, DOMPurify would no longer be required.
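For illustration, a mitigation along the lines of cure53's proxy-hook demo could filter href/src values against an allowlist of Wikimedia domains before the content is rendered. The sketch below is a standalone approximation; the domain list and helper name are assumptions, not cxserver code:

```javascript
// Allow only relative URLs or URLs on Wikimedia-operated domains in
// href/src attributes, dropping everything else (the "http leak" vector).
const ALLOWED_HOSTS = /(^|\.)(wikipedia|wikimedia|mediawiki|wikidata)\.org$/;

function filterUrlAttributes(html) {
  return html.replace(/\s(href|src)="([^"]*)"/gi, (match, attr, value) => {
    // Relative URLs (no scheme, not protocol-relative) cannot leak off-wiki.
    if (!/^[a-z][a-z0-9+.-]*:|^\/\//i.test(value)) return match;
    try {
      const host = new URL(value, 'https://en.wikipedia.org').hostname;
      return ALLOWED_HOSTS.test(host) ? match : '';
    } catch (e) {
      return ''; // unparseable URL: drop the attribute entirely
    }
  });
}
```

A DOMPurify `afterSanitizeAttributes` hook could apply the same predicate per node instead of regex-rewriting the serialized HTML.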

  4. A bit unrelated to this task, but still security-related - a quick npm audit found a low prototype pollution vulnerability within the currently-pinned service-runner package (dep: lodash). Can service-runner be upgraded to use lodash >=4.17.5?

This needs to be fixed via: https://github.com/gwicke/kad/pull/1

Change 477459 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Use SAFE_FOR_JQUERY in the DOMPurify config

https://gerrit.wikimedia.org/r/477459

Thanks for the review!

  1. SAFE_FOR_JQUERY doesn't appear to be used within DOMPurify's sanitize config (@Bawolff's concern in T144467#3851643). Can we confirm or disconfirm whether the service will send sanitized content to jQuery's html method? Or is DOMPurify only being used to modify html sent to an MT service, as is implied by the comment at T144467#4694645?

The sanitized content will go to the CX translation editor front end, which is VisualEditor and uses jQuery. So I have now added that configuration as well: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/477459

  2. Per https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/461386/, it seems that Yandex is the only translation service that sends/receives html at this point. Is this true? If so, is it still necessary to call sanitize within lib/mt/Google.js (via translateReducedHtml within translate)?

Google also accepts HTML input and sends translated HTML. The Google service was added after that patch.

  3. I'm still not seeing anything in regards to addressing potential http leaks, as @Bawolff noted in T144467#3852115. If, as mentioned in (1), Yandex is the only MT service expected to send and receive html now, then we should only have to worry about that service. But some solution (be it parsoid or something along the lines of cure53's experimental http leaks preventive hook for DOMPurify https://github.com/cure53/DOMPurify/blob/master/demos/hooks-proxy-demo.html) should be implemented. I see there was discussion of potentially using CSP at https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/363156/, but for now, CSP is in report-only mode on most wikis and I'm not certain we have a great idea of when enforce mode will be deployed or exactly what that will look like. So it would be my recommendation to proactively deal with any potential 3rd-party http leaks (basically any leak outside of wikimedia wikis) as part of cxserver.

As of now, we have the Yandex and Google MT engines that send/receive HTML.

As demoed in the tests at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/470543, cxserver will not send or accept any HTML attributes other than the id attribute. We remove all attributes except id before sending to the external MT engine, and we remove all attributes except id from the received HTML. Then we reapply all attributes to that DOM. This was originally done to reduce the payload to those external services where billing is on a per-character basis, but it also helps with security.

I think that addresses the concerns mentioned in https://github.com/cure53/DOMPurify/tree/master/demos#hook-to-proxy-all-links-link. In addition, all href values in links go through MediaWiki title sanitization while being adapted to the target-language wiki link. This happens on top of the machine translation output and in the same request (the v2/translate API).

Let me know if anything more is required.

  4. A bit unrelated to this task, but still security-related - a quick npm audit found a low prototype pollution vulnerability within the currently-pinned service-runner package (dep: lodash). Can service-runner be upgraded to use lodash >=4.17.5?

This is coming from service-runner and affects all services. We have asked the Services team to update the dependencies.

Finally, I'm still evaluating DOMPurify, but it looks fairly solid IMO so far for sanitizing potential XSS payloads. I'd still like to review it a bit more and also further investigate solutions for potential http leaks. Of course if parsoid eventually becomes a part of the solution here, DOMPurify would no longer be required.

We were initially planning to port the Parsoid sanitizer as a library and use it here, but we changed that plan once we found the less complex solution explained above in T144467#4692200. DOMPurify now acts as a second layer of sanitization, just to catch anything that gets past our HTML compression and expansion code.

sbassett added a comment (edited). Wed, Dec 5, 12:46 AM

@santhosh -

This needs to be fixed via: https://github.com/gwicke/kad/pull/1
This is coming from service-runner and affects all services. We have asked the Services team to update the dependencies.

This should be fine, especially since npm audit considers this a low vulnerability. Something to keep an eye on though, perhaps with a sub-task.

The sanitized content will go to the CX translation editor front end, which is VisualEditor and uses jQuery. So I added that configuration also now https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/477459

This looks good to me and I also replied to @Nikerabbit's comment on the gerrit patch set. Unless there are certain undesirable side effects of the SAFE_FOR_JQUERY flag, I would advise that we follow the vendor's recommendation of enabling said flag for sanitized html being rendered by jQuery's html() (or other dom-writing) methods.

Google also accepts html input and sends translated HTML. Google service was added after that patch.

Ok, that's fine. And we'd definitely want to keep the DOMPurify-based sanitization in place for the Google MT client.

As of now, we have the Yandex and Google MT engines that send/receive HTML.

As demoed in the tests at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/470543, cxserver will not send or accept any HTML attributes other than the id attribute. We remove all attributes except id before sending to the external MT engine, and we remove all attributes except id from the received HTML. Then we reapply all attributes to that DOM. This was originally done to reduce the payload to those external services where billing is on a per-character basis, but it also helps with security.

I think that addresses the concerns mentioned in https://github.com/cure53/DOMPurify/tree/master/demos#hook-to-proxy-all-links-link. In addition, all href values in links go through MediaWiki title sanitization while being adapted to the target-language wiki link. This happens on top of the machine translation output and in the same request (the v2/translate API).

Just to confirm - lineardoc is something developed in-house using JavaScript's sax module, correct? It seems to work as intended via the tests within test/lineardoc/LinearDoc.test.js. However, I tried testing some of the translation process locally and did come across some issues. For example, sending the contents of cure53's http leaks reference file (https://github.com/cure53/HTTPLeaks/blob/master/leak.html) through the lineardoc reduction/expansion process, I end up with P7886. This actually looks pretty good, despite a couple of obvious shortcomings with various css and svg syntax. I've also seen DOMPurify.sanitize fail in certain situations, such as when the aforementioned P7886 is passed to it, along with large XSS test payloads such as https://pastebin.com/raw/48WdZR6L. This may not be that big of a deal, especially when the Google client is limiting the length of the source html to 5000 characters and these examples may be outside of the likely threat model, but it is at least mildly concerning in regards to the performance of these translation services. Perhaps further testing or the Language Team taking ownership of these risks is in order.

We were initially planning to port the Parsoid sanitizer as a library and use it here. But we changed that plan since we found a less complex solution, as explained above in T144467#4692200

Yes, that sounds fine. In discussing this a bit more with the Security Team, I believe we're ok with using DOMPurify in cases like this.

Once I have some clarity around the lineardoc/sanitize issues mentioned above, I'd be fine signing off on the security review for these particular pieces (the MT clients) of the cxserver service.

Just to confirm - lineardoc is something developed in-house using JavaScript's sax module, correct?

Correct.

It seems to work as intended via the tests within test/lineardoc/LinearDoc.test.js. However, I tried testing some of the translation process locally and did come across some issues. For example, sending the contents of cure53's http leaks reference file (https://github.com/cure53/HTTPLeaks/blob/master/leak.html) through the lineardoc reduction/expansion process, I end up with P7886. This actually looks pretty good, despite a couple of obvious shortcomings with various css and svg syntax

The content inside styles, scripts, and SVGs should never be translated, so we are thinking of not sending it to MT services at all. The content can be kept on our side, just like we hold back attributes. If we do this, we no longer need to worry about modified content inside CSS or SVGs. We did not do this before, since those tags have not appeared in CX's use cases so far.

I did the first part of this at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/477732
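The approach can be sketched like this (illustrative helper names, not the actual Doc#reduce implementation): stash the contents of non-translatable elements, send only a placeholder to the MT engine, and restore the original contents afterwards.

```javascript
// Elements whose contents must never reach the MT engine.
const NON_TRANSLATABLE = ['style', 'script', 'svg'];

function extractNonTranslatable(html) {
  const stash = [];
  for (const tag of NON_TRANSLATABLE) {
    const re = new RegExp(`(<${tag}\\b[^>]*>)([\\s\\S]*?)(</${tag}>)`, 'gi');
    html = html.replace(re, (match, open, body, close) => {
      // Keep the body on our side; the MT engine only sees a marker.
      stash.push(body);
      return `${open}__KEPT_${stash.length - 1}__${close}`;
    });
  }
  return { html, stash };
}

function restoreNonTranslatable(html, stash) {
  return html.replace(/__KEPT_(\d+)__/g, (match, i) => stash[Number(i)]);
}
```

Since the MT engine never sees the CSS/SVG payload, it cannot modify it, which removes that whole class of concerns.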

Change 477757 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Doc#reduce: Remove and store the content of non-translatable items

https://gerrit.wikimedia.org/r/477757

sbassett added a comment (edited). Thu, Dec 6, 3:49 AM

@santhosh -

A summary of where I think we're at right now:

  1. I applied your patches from https://gerrit.wikimedia.org/r/477732 and https://gerrit.wikimedia.org/r/477757, which seem to work well in alleviating the concerns around <style> and <svg> content. For now, within the process flow of 1) reducing html via lineardoc 2) sending said html off to the Google or Yandex MT services 3) expanding the received translated html 4) sending it through DOMPurify.sanitize, the only potential for http leaks I'm seeing are if they already existed within the source html. One odd thing I did notice is that, within my simulated translateHtml method, if I prepend html - as if an MT service had unexpectedly done so - the lineardoc expansion process seems to break a bit. I believe we are operating under the assumption that none of the MT services will do this, but the insertion of unexpected, additional html from an MT (prepending, appending or inserting within the source html) might be a scenario worthy of further testing.
  2. After further testing, the issues that I mentioned within T144467#4799236 seem to be due to DOMPurify's handling of certain stylesheet elements, e.g.
    1. <link rel="stylesheet" href="https://leaking.via/link-stylesheet" />
    2. <link rel="alternate stylesheet" href="https://leaking.via/link-alternate-stylesheet" />
    3. <style>@import 'https://leaking.via/css-import-string';</style>

      So something like:
const jsdom = require( 'jsdom' );
const createDOMPurify = require( 'dompurify' );
const DOMPurify = createDOMPurify( ( new jsdom.JSDOM( '' ) ).window );
const safe = DOMPurify.sanitize( '<LINK REL="stylesheet" HREF="javascript:alert(1);">' );

results in an odd error:

const request = this._resourceLoader.fetch(url, {
TypeError: Cannot read property 'fetch' of null

Again, this is more something to be aware of, which I just reported to DOMPurify: https://github.com/cure53/DOMPurify/issues/311

  3. More for my own curiosity as I assume it isn't a security issue, but I was wondering if you had any examples of the markup used for the textblocks/annotations described for the Youdao service, as noted within the Youdao MT client's docblock for translateHtml. I really just wanted to make sure it wasn't sending anything html- or xml-like to that service.
  4. Finally, there's been some more discussion re: the SAFE_FOR_JQUERY flag on the gerrit change set: https://gerrit.wikimedia.org/r/477459/

Change 477972 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Reject MT result if it is wrapped with other tags

https://gerrit.wikimedia.org/r/477972

  1. I applied your patches from https://gerrit.wikimedia.org/r/477732 and https://gerrit.wikimedia.org/r/477757, which seem to work well in alleviating the concerns around <style> and <svg> content. For now, within the process flow of 1) reducing html via lineardoc 2) sending said html off to the Google or Yandex MT services 3) expanding the received translated html 4) sending it through DOMPurify.sanitize, the only potential for http leaks I'm seeing are if they already existed within the source html.

Yes, thanks for confirming. I fixed some code quality issues on those patches today. The source HTML for a CX translation comes from the editor we use there, which is VisualEditor, and it operates on the MediaWiki DOM Spec.

One odd thing I did notice is that, within my simulated translateHtml method, if I prepend html - as if an MT service had unexpectedly done so - the lineardoc expansion process seems to break a bit. I believe we are operating under the assumption that none of the MT services will do this, but the insertion of unexpected, additional html from an MT (prepending, appending or inserting within the source html) might be a scenario worthy of further testing.

Even though it is extremely rare, I added a check for that: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/477972 (Reject MT result if it is wrapped with other tags)
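The shape of such a check might look like the following sketch (illustrative only; the actual check lives in the cxserver patch above): the translated fragment must start with the same root tag we sent out, otherwise the whole MT response is rejected.

```javascript
// Extract the tag name of the first element in an HTML fragment.
function rootTag(html) {
  const m = /^\s*<([a-zA-Z][\w-]*)/.exec(html);
  return m ? m[1].toLowerCase() : null;
}

// Reject the MT result if the engine wrapped our fragment in extra markup
// (e.g. a full <html><body> document around the translated section).
function assertNotWrapped(sourceHtml, translatedHtml) {
  if (rootTag(sourceHtml) !== rootTag(translatedHtml)) {
    throw new Error('Unexpected content in MT response');
  }
  return translatedHtml;
}
```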

Again, more something to be aware of which I just reported to DOMPurify: https://github.com/cure53/DOMPurify/issues/311

Thanks, noted.

  3. More for my own curiosity as I assume it isn't a security issue, but I was wondering if you had any examples of the markup used for the textblocks/annotations described for the Youdao service, as noted within the Youdao MT client's docblock for translateHtml. I really just wanted to make sure it wasn't sending anything html- or xml-like to that service.

No, those are not HTML-aware services. By textblock we mean an instance of lineardoc/TextBlock. It corresponds to markup like <span class="sentence">This is <b>bold</b> and this is a <a href="#">link</a></span>. We don't send this directly to the service; we extract the plain text from it and send that. So we send "This is bold and this is a link". Once translated, the words in this sentence can appear in a different order depending on the target language. The annotations (the <b> and <a> tags) need to be reapplied on top of this plain text by detecting where each word went in the target language. This is a complex process and is documented here. Finally, we need to wrap the result with the <span class="sentence"> textblock.
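For the text-only path described above, the extraction step amounts to stripping tags before sending. A toy sketch (the annotation re-application step, which is the hard part, is deliberately omitted here):

```javascript
// Extract the plain text actually sent to a text-only MT service like Youdao:
// drop every tag, keep only the character data.
function plainText(html) {
  return html.replace(/<[^>]+>/g, '');
}
```

This is why nothing HTML- or XML-like reaches the service: only the tag-free sentence goes over the wire.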

@santhosh -

Thanks for all of the quick follow-up on this. https://gerrit.wikimedia.org/r/477972 looks good as additional hardening and the new test within test/mt/Yandex.test.js runs well. I think this is looking pretty good, and would like to leave the SAFE_FOR_JQUERY flag implementation up to you and the Language Team to discuss further, as I believe I've probably provided as much relevant feedback within https://gerrit.wikimedia.org/r/477459/ as I can. Also, thanks for the information regarding the specific implementation of the Youdao service. I'm going to review that a bit more, but for now I'm not seeing any issues there, as it seems to be a more restrictive (fewer html tags/markup) version of the 1) reduce html 2) send to MT service 3) expand translated html process.

@santhosh et al -

One more follow-up item - is apertium.wmflabs.org only used for dev/testing purposes, or is the plan for cxserver to use that within a production capacity at some point? My concern is that, as of now, it's a completely open API that isn't even running TLS. Thanks.

@santhosh et al -

One more follow-up item - is apertium.wmflabs.org only used for dev/testing purposes, or is the plan for cxserver to use that within a production capacity at some point? My concern is that, as of now, it's a completely open API that isn't even running TLS. Thanks.

The apertium.wmflabs.org Labs instance is no longer used by production. It is only for testing purposes.

@KartikMistry - Ok, that sounds good. I just wanted to ensure apertium.wmflabs.org was only used for testing purposes and not in any production capacity. Thanks.

Change 477757 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Doc#reduce: Remove and store the content of non-translatable items

https://gerrit.wikimedia.org/r/477757

Change 477972 merged by Nikerabbit:
[mediawiki/services/cxserver@master] Reject MT result if it is wrapped with other tags

https://gerrit.wikimedia.org/r/477972

Change 477459 abandoned by Santhosh:
Use SAFE_FOR_JQUERY in the DOMPurify config

Reason:
Since the usefulness of this is doubtful, and also considering the additional security enhancements we made in the machine translation clients, I am abandoning this patch.

https://gerrit.wikimedia.org/r/477459

@santhosh - abandoning r477459 should be fine, but I would note that sending any DOMPurified content to jQuery's dom-writing functions (html(), etc.) should be thoroughly vetted to ensure any potentially dangerous payloads are sanitized (via some method similar to DOMPurify's SAFE_FOR_JQUERY feature) as expected.