Page MenuHomePhabricator

Content Translation Beta Feature security review
Closed, ResolvedPublic

Description

Content Translation need to be reviewed from security point of view for Beta Feature deployment in production.

Event Timeline

KartikMistry raised the priority of this task from to Needs Triage.
KartikMistry updated the task description. (Show Details)
KartikMistry added a project: LE-Sprint-81.
KartikMistry moved this task to Backlog on the LE-Sprint-81 board.
KartikMistry added a subscriber: KartikMistry.
csteipp claimed this task.Jan 5 2015, 7:42 PM
csteipp triaged this task as High priority.
csteipp added a project: MediaWiki-Core-Team.

Chris,

We have a few questions about this security review.

  1. Is a security review mandatory for deploying to production? If so, this may affect our planned deployment date.
  2. How long does a security review typically take? Is there any way we can make this a rush?
  3. Is there any documentation or files we can look at to make the process easier?

Thanks,

Joel

Chris,
We have a few questions about this security review.

  1. Is a security review mandatory for deploying to production? If so, this may affect our planned deployment date.

As a rule, yes. There are some cases where I'm ok not doing it, I have the beginnings of a formal policy that outline my thinking on it,

https://www.mediawiki.org/wiki/Does_my_application_need_a_security_review

But right now, our process is to have new extensions reviewed for security and performance before they are enabled in production.

  1. How long does a security review typically take? Is there any way we can make this a rush?

The earlier I'm involved, the easier it is. With nothing else on my plate (which unfortunately isn't the case right now) I could do your review in a little over a day (~1kloc/hr). Here is what I'm looking for:
https://www.mediawiki.org/wiki/Security_for_developers/Architecture

Assuming you guys made good choices about the features and were careful about how you did things, it should just be me reading your code and saying everything looks good. If I find issues, your team will need to fix them before it's deployed to production.

To speed things up, if you have a dataflow diagram, or any documentation about the security thought that you put into the project, that will help me review the security of the design much faster. Or if someone from your team can explain them to me, that also helps. Then it's just a quick review to make sure you didn't make major mistakes in the implementation.

  1. Is there any documentation or files we can look at to make the process easier?

https://www.mediawiki.org/wiki/Security_for_developers
https://www.mediawiki.org/wiki/Security_for_developers/Architecture

Thanks,
Joel

Ok, first issue. In poking around, the cx server is allowing reflected xss on that domain. Since it's on a wikimedia.org domain, we can't run that in production. Demo https://people.wikimedia.org/~csteipp/cx-1.html

I haven't gotten into the cx code enough to understand what it does for sanitization of the html, so I'm not sure how big of an issue this is to fix.

@csteipp, does this document https://www.mediawiki.org/wiki/Content_translation/cxserver help you to understand what cxserver does with the content and what is returned? cxserver is comparable to parsoid, eventhough cxserver is less sophisticated. It handles page content and process it, act as bridge for MT backends etc. I hope it helps. Happy to provide any clarification you need.

Arrbee moved this task from Backlog to In Progress on the LE-Sprint-81 board.Jan 7 2015, 7:38 AM
Arrbee added a subscriber: Arrbee.Jan 7 2015, 5:06 PM

An update: we moved the deployment to next week (12th January - Monday). I hope that will ease the rush a bit. Thanks.

@csteipp, does this document https://www.mediawiki.org/wiki/Content_translation/cxserver help you to understand what cxserver does with the content and what is returned? cxserver is comparable to parsoid, eventhough cxserver is less sophisticated. It handles page content and process it, act as bridge for MT backends etc. I hope it helps. Happy to provide any clarification you need.

@santhosh, that clarifies where the flaw was introduced. On that page it states,

/mt/:from/:to/:provider? - Get the machine translation of posted content from source language(from) to target language(to) using a the given MT service provider. If the provider is not given the default configured provider will be used. The HTML returned will be wellformed and valid. The translated content will have the annotations/markup from the source html, mapped to appropriate words. Currently one backend provider is configured - Apertium. The input HTML must be valid and well formed since there is a parsing involved before passing to apertium. Note that content to translate must be posted.

Unfortunately, there also should be a requirement there that javascript in the request is removed. It looks like there is some sanitization-- the text in <script> tags is removed. But to be effective, you need to either whitelist the elements/attributes that are allowed, or do an html->wikitext->html conversion. You could also get around this by encoding the returned data (json encoding where html entities are all encoded would work), and set a different content-type of the returned document.

modules/tools/ext.cx.tools.reference.js

In the ReferenceCard.prototype.start function, I'm fairly sure an attacker can embed a div with an id that matches the reference id you're using, and their own html in the data-mw attribute. Since it looks like the source article is being pulled from production I haven't been able to verify it, but in general, embedding then rendering html in an attribute that users can control on the page is a bad idea.

In general, you shouldn't be unserializing and serializing html with code like,

$( '<div>' ).html( mwData.body.html );

This is only exploitable on IE8, but also is bad for performance.

In general, you shouldn't be unserializing and serializing html with code like,
$( '<div>' ).html( mwData.body.html );
This is only exploitable on IE8, but also is bad for performance.

Sorry, that should be .html( whatever.html() ). In my notes I put line 213 above 196 on ext.cx.tools.reference.js.. so it's ok as is.

Not exploitable, but ext.cx.tools.reference.js line 251, you should verify that $section.data( 'source' ) is an integer before using it in $().

Unfortunately, there also should be a requirement there that javascript in the request is removed. It looks like there is some sanitization-- the text in <script> tags is removed. But to be effective, you need to either whitelist the elements/attributes that are allowed, or do an html->wikitext->html conversion. You could also get around this by encoding the returned data (json encoding where html entities are all encoded would work), and set a different content-type of the returned document.

Is there an existing whitelisting tool or even just a whitelist that we could use? Html->wikitext->html sounds excessive and bad for performance. Encoded json seems easiest to implement, but is that sufficient, since in the CX tool we add the output pretty much directly to the DOM?

Change 183447 had a related patch set uploaded (by Nikerabbit):
Abstract source and translation section access

https://gerrit.wikimedia.org/r/183447

Patch-For-Review

Change 183475 had a related patch set uploaded (by Nikerabbit):
Be more strict about the node returned by getElementById

https://gerrit.wikimedia.org/r/183475

Patch-For-Review

Change 183475 merged by jenkins-bot:
Be more strict about the node returned by getElementById

https://gerrit.wikimedia.org/r/183475

Is there an existing whitelisting tool or even just a whitelist that we could use? Html->wikitext->html sounds excessive and bad for performance. Encoded json seems easiest to implement, but is that sufficient, since in the CX tool we add the output pretty much directly to the DOM?

htmlpurifier is probably the closest match. We don't run it anywhere else in production, but should work fine.

Encoded json will work to prevent the xss when the service is (ab)used by itself, as long as you convert <,>,& to \u003c, \u003e, \u0026. That will prevent all browsers from interpreting the result as html.

Json encoding wont help if there is ever a way that unsanitized, user controlled data would ever be sent off to the mt service. As far as I can tell, it's only the (sanitized) existing article html. But if I'm wrong about that, or there's a chance that could happen with a feature in the future, then doing htmlpurifier, or a html/wikitext/html conversion would be safer.

htmlpurifier seems to be written in PHP, which is not ideal given that cxserver and cx extension are written in JavaScript. After some searching, I could not easily find similar tool for JavaScript which would allow customizing the whitelist easily.

I really do want to add sanitizing, because that will also protect users from third party services (which we don't currently have). We are planning to implement the json+encoding approach for now. In principle, the cxserver mt API can accept any user supplied string, but not in the context of our translation tool, which will only send source html of wikipedia articles provided by parsoid. This way we should have more time to work on a whitelisting based approach.

Change 183743 had a related patch set uploaded (by Jsahleen):
Security: Prevent XSS (cxserver)

https://gerrit.wikimedia.org/r/183743

Patch-For-Review

Change 183746 had a related patch set uploaded (by Jsahleen):
Security: Prevent XSS

https://gerrit.wikimedia.org/r/183746

Patch-For-Review

@csteipp @Nikerabbit

I have modified the communication between cxserver and the front end for MT. Response from the server is now sent as JSON with <,> and " converted to html entities. Decoded on the client.

Not sure I did this right so please verify. :)

Patches updated following comments from Krinkle. Code to encode as html entities and decode on the client side removed.

Patches updated following comments from Krinkle. Code to encode as html entities and decode on the client side removed.

Will this be deployed to wmflabs soon? I'd like to retest it. I'm not entirely sure if the fix is right.

I agree with Timo that you shouldn't unescape the html entities by hand on the client, which is why I recommended the return should look something like

{"html":"\u003cscript\003ealert(1);\u003c/script\003e"}

which is valid json, should get decoded to the original html as soon as you parse it, and there's no chance any browser is going to content sniff that to html.

@csteipp You can test using our instance on Beta Cluster at: http://en.wikipedia.beta.wmflabs.org/wiki/Special:ContentTranslation and cxserver using http://cxserver-beta.wmflabs.org (Extension code always run on master and cxserver code need manual updates after merge. I'll update once patch(es) are in.)

Change 183447 merged by jenkins-bot:
Abstract source and translation section access

https://gerrit.wikimedia.org/r/183447

Change 183746 abandoned by Jsahleen:
Security: Use new JSON format for /mt endpoint

Reason:
Bad dependency. Easier to just fix by starting new patch.

https://gerrit.wikimedia.org/r/183746

Change 183976 had a related patch set uploaded (by Jsahleen):
Security: Use new JSON format from /mt endpoint

https://gerrit.wikimedia.org/r/183976

Patch-For-Review

The Drafts feature allows for stored xss.

As one user, I add javascript to a draft with something like,

curl 'http://ca.wikipedia.beta.wmflabs.org/w/api.php' -H 'Cookie: GeoIP=US::38.0000:-97.0000:v4; uls-previous-languages=%5B%22ca%22%5D; mediaWiki.user.sessionId=PAyvQEpFzk6j1ovk2qKRksiyHBdBC1o1; cawikiUserID=25; cawikiUserName=CSteipp+OAuth; cawikiSession=REDACTED; centralauth_User=CSteipp+OAuth; centralauth_Session=REDACTED; cawiki-cx-published=%7B%22username%22%3A%22CSteipp%20OAuth%22%7D; cawiki-mw-tour=%7B%22version%22%3A1%2C%22tours%22%3A%7B%22cxpublish%22%3A%7B%22startTime%22%3A1420834298093%2C%22step%22%3A%22suggestmovestart%22%7D%7D%7D' --data 'action=cxpublish&format=json&from=en&to=ca&sourcetitle=Stored+procedure&title=Stored+procedure+(stored+XSS)&html=%3Cp%20id%3D%22cx8%22%20data-source%3D%228%22%20data-cx-state%3D%22mt%22%20data-cx-weight%3D%22296%22%20contenteditable%3D%22true%22%3E%3Cspan%20class%3D%22cx-segment%22%20data-segmentid%3D%229%22%3EUn%20%3Cb%3Eprocediment%20emmagatzemat%3C%2Fb%3E%20%C3%A9s%20una%20%3Ca%20class%3D%22cx-target-link%22%20data-linkid%3D%2210%22%20href%3D%22Subrutina%22%20rel%3D%22mw%3AWikiLink%22%20title%3D%22Subroutine%22%3Esubrutina%3C%2Fa%3E%20disponible%20a%20aplicacions%20que%20accedeixen%20un%20%3Ca%20class%3D%22cx-target-link%22%20data-linkid%3D%2212%22%20href%3D%22Base%20de%20dades%22%20rel%3D%22mw%3AWikiLink%22%20title%3D%22Database%20management%20system%22%3Esistema%20de%20base%20de%20dades%3C%2Fa%3E%20%3Ca%20class%3D%22cx-target-link%22%20data-linkid%3D%2211%22%20href%3D%22Base%20de%20dades%20relacional%22%20rel%3D%22mw%3AWikiLink%22%20title%3D%22Relational%20database%22%3Erelacional.%3C%2Fa%3E%20%3C%2Fspan%3E%3Cspan%20class%3D%22cx-segment%22%20data-segmentid%3D%2213%22%3EUn%20%3Cb%3Eproc%3C%2Fb%3Eediment%20emmagatzemat%20(de%20vegades%20cridat%20un%20proc%2C%20%3Cb%3Esproc%3C%2Fb%3E%2C%20%3Cb%3EStoPro%3C%2Fb%3E%2C%20%3Cb%3EStoredProc%3C%2Fb%3E%2C%20StoreProc%2C%20%3Cb%3Esp%3C%2Fb%3E%20o%20SP)%20%C3%A9s%20de%20fet%20emmagatzemat%20en%20el%20%3Ca%20class%3D%22cx-target-link%22%20data-linkid%3D%2214%22%20href%3D%22Diccionari%20de%20dades%22%20rel%3D%22mw%3AWikiLink%22%20title%3D%22Data%20dictionary%22%3Ediccionari%20de%20dada%3C%2Fa%3E%20de%20la%20base%20de%20dades.%3Cscript%3Ealert(1)%3B%3C%2Fscript%3E%3Cimg%20src%3D%22%23%22%20onerror%3Dalert(2)%20%2F%3E%3C%2Fspan%3E%3C%2Fp%3E&status=draft&sourcerevision=640502284&progress=%7B%22any%22%3A0%2E034705123695626686%2C%22human%22%3A0%2C%22mt%22%3A0%2E034705123695626686%2C%22mtSectionsCount%22%3A1%7D&token=REDACTED%2B%5C'

When I can send another user to the draft by sending them a url with draft=<draftid>, and javascript is executed in their browser.

There doesn't seem to be the ability to delete/suppress drafts, even though they are publicly available to all logged in users. If someone publishes something defamatory in a draft, and publishes a link to the draft, I'm not seeing a way for admins to delete it.

It's probably debatable if drafts need to be subjected to all of the anti-spam checks, but since they can be shared with other users, I think they need to at least be deleteable, if not integrate with CheckUser. Or did the team check with any functionaries and determine that wasn't needed?

csteipp added a subscriber: aaron.Jan 10 2015, 12:46 AM

Although the API will throttle and enforce blocks when a user actually publishes a translation, I'm a little concerned that there isn't any throttling on adding drafts. It would be trivial to script something that repeatedly added new, large drafts at a high rate, and the affect wouldn't be visible until the table/db started having performance issues, since there's no way to get visibility into all of the drafts that exist.

Also, blocked users will be able to add drafts too, so mitigating an attack will be difficult.

A few other minor issues:

  • Not exploitable, but it seems like to/from should be validated when saving a translation. Right now, any string is valid (http://ca.wikipedia.beta.wmflabs.org/wiki/Especial:ContentTranslationStats).
  • In Stats.php, @aaron should take a look at the query. If it turns into a DoS vector, we can disable the special page, but it would be nice if it started out with decent performance.
  • In Translator.php, you may want to add a comment for line 9 that newFromUser needs to be used, instead of directly creating the object, so CentralAuth checks are done. Otherwise unattached accounts will have null authorship.

@csteipp,

  • Drafts are private to the creator. User A cannot access draft by User B even if the user get draft id. See T78008
  • Drafts can be deleted by the users who created it. See T75979

@csteipp. To reiterate what Santhosh said, the drafts are only visible to the user who created them and they can be deleted. This would seem to take care of the comments about drafts you made. We already have approval from legal on storing drafts in the db since no one else can see them.

As for throttling saving drafts and the minor issues mentioned in the next to the last comment, are these critical? Do they need to be taken care of before we release?

My patches for using json with unicode sequences are here:

https://gerrit.wikimedia.org/r/#/c/183743/
https://gerrit.wikimedia.org/r/#/c/183976/

They need to be merged at the same time.

Change 183743 merged by jenkins-bot:
Security: Prevent XSS in /mt endpoint

https://gerrit.wikimedia.org/r/183743

Change 183976 merged by jenkins-bot:
Security: Use new JSON format from /mt endpoint

https://gerrit.wikimedia.org/r/183976

@csteipp,

  • Drafts are private to the creator. User A cannot access draft by User B even if the user get draft id. See T78008
  • Drafts can be deleted by the users who created it. See T75979

Is beta missing the patches for this? I'm able to share a draft link between two accounts (which executes javascript) as of just a few minutes ago.

@csteipp

It looks like the necessary patch is waiting on review from Amir. Should be available shortly.

Change 184579 had a related patch set uploaded (by Nikerabbit):
Validate language codes in CX API

https://gerrit.wikimedia.org/r/184579

Patch-For-Review

Change 184579 merged by jenkins-bot:
Validate language codes in CX API

https://gerrit.wikimedia.org/r/184579

@csteipp

All the security related patches are merged and on beta-labs now. Please verify that things are as you need them and if they are resolve this task.

If there are any issues, please email me. We are trying to get in the last deployment window tomorrow

Thanks,

Joel

csteipp closed this task as Resolved.Jan 14 2015, 12:22 AM

Unified list of issues and their fixes bellow for reference. I'm closing this, assuming the Language team is going to address T86723 in a reasonable amount of time.

cx server is allowing reflected xss

(Not exploitable) verify that $section.data( 'source' ) is an integer before using it in $().

(Not exploitable) html in the data-mw attribute

The drafts feature allows for stored xss

  • Fixed by making drafts private to the individual translator (T78008)

blocked users can add drafts

to/from should be validated when saving

add a comment on Translator.php line 9 that newFromUser needs to be used

  • Looks like several patches did this and fixed other places it wasn't used
bd808 moved this task from Done to Archive on the MediaWiki-Core-Team board.Jan 20 2015, 10:11 PM