Page MenuHomePhabricator

Security Review For Parsoid-PHP
Closed, ResolvedPublic

Description

Project Information

Description of the tool/project

I will just point to https://www.mediawiki.org/wiki/Parsoid for now. I can follow up with any additional info as required.
But, specifically, the context for this is that we are porting Parsoid to PHP which will be integrated into core as a composer library and will run in-process on REST API (see T221738 ) requests made to MediaWiki. We want to deploy this in July/August.

Parsoid/JS (currently deployed on the Wikimedia cluster) is not exposed directly to the internet. All requests to it go through RESTBase exposed REST API for wikis (Ex: https://en.wikipedia.org/api/rest_v1/ ). But, with the Parsoid/PHP offering which will be integrated into core, we can similarly deploy to an internal cluster that is not directly accessible on the internet and disable it on the app cluster and elsewhere where the MediaWiki API is exposed to the internet.

Description of how the tool will be used at WMF

This will replace Parsoid/JS which is used by VisualEditor, Flow, ContentTranslation, Android App, Kiwix. Eventually, this will serve all page view and edit views.

Dependencies

List dependencies, or upstream projects that this project relies on.
The composer.json file in the repository is the authoritative source for this.

  • In production mode, all except two libraries are used by mediawiki core. The two new libraries are wikipeg and zest. wikipeg is a wikimedia fork of the pegjs project zest is a wikimedia port of zest.js repository. Both these libraries are developed / maintained by Wikimedia engineers.
  • In developer mode, all except one library are used by mediawiki core. The exception is alea which is a wikimedia port of alea.js. This port is by Wikimedia engineers and is also maintained by us.

Has this project been reviewed before?

Not that I know of

Working test environment

Please link or describe setup process for setting up a test environment.
Parser Tests
php bin/parserTests.php runs tests in all but the selser mode across all test files (~6K tests and finishes in ~55 secs on my laptop).
php bin/parserTests.php --wt2html --wt2wt --html2wt --html2html --selser auto runs tests in the specified modes across all test files (~26K tests and takes ~5m on my laptop)
Add the --quiet option to suppress verbose output.

All parser tests should be green.

PHP Unit Tests:

composer test will run phpcs, phpunit, and phan jobs.

Running in integrated / production mode
For both of these above scenarios (parser tests and unit tests), you don't need a working MediaWiki installation. You can run those standalone simply by checking out the Parosid repository and running those commands after installing the vendor modules.

But, to test and review the Parsoid REST API, you will need to run in MW-integration mode. https://github.com/wikimedia/parsoid/tree/T227209/extension has the instructions for running Parsoid in integrated mode.

Parsoid/JS API is documented on wiki. Parsoid/PHP supports all of those endpoints (there are still test failures on the API tests and Arlo is working through those).

Parsoid/PHP in integrated mode also exists on scandium. Assuming you have access to the server, you can hit the Parsoid API by logging to that server or via an ssh tunnel. We'll send the curl commands to access those endpoints if you need them.

Running Parsoid/PHP against an external wiki
From a security review point of view, Parsoid when run in this mode is simply yet another MediaWiki API client. I am including it here simply for completeness' sake.

For development and debugging purposes, Parsoid also supports accessing an external MediaWiki installation via its action API. Parsoid/JS doesn't require that external wiki to have installed the ParsoidBatchAPI extension, but Parsoid/PHP depends on that currently.

The bin/parse.js script lets you point Parsoid to an arbitrary mediawiki installation. bin/parse.php also supports this. bin/parse.php --help should be informative if you wish to run this.

Post-deployment

Name of team responsible for tool/project after deployment and primary contact.

Related Objects

Event Timeline

sbassett renamed this task from {WIP] Security Review For Parsoid-PHP to [WIP] Security Review For Parsoid-PHP.Jul 9 2019, 8:59 PM

@sbassett, @Reedy, I had said I would cut a security branch end of day today for you to start security review. But, I am going to do it Wednesday morning from the hackathon to give us one more day on Tuesday to round out and fix a few more things.

@ssastry - no problem, thanks for the update.

ssastry renamed this task from [WIP] Security Review For Parsoid-PHP to Security Review For Parsoid-PHP.Aug 19 2019, 10:29 AM
ssastry updated the task description. (Show Details)

@sbassett, @Reedy I overestimated my abilities to be able to get this done during Wikimania.:-) But, we fixed a number of bugs in Parsoid in the interim (including a few crashers) so that is a good thing.

As in the description, I have cut a T227209 branch of the Parsoid repository. This version is 99.4% green on parser tests and has also had some initial performance tweaking done as well. We will continue correctness and performance work on the master branch and at different points between now and deployment, we can provide you with an aggregated diff of changes relative to this branch as necessary.

Please contact @Arlolra if you need any additional clarifications between now and Aug 28th. I will be checking email sporadically between now and then, but otherwise will be away.

@Arlolra, could you please go over the description and update / correct anything in there that needs updating / clarification? Thanks!

For development and debugging purposes, Parsoid also supports accessing an external MediaWiki installation via its action API. Parsoid/JS doesn't require that external wiki to have installed the ParsoidBatchAPI extension, but Parsoid/PHP depends on that currently (* TODO: Verify *).

Yeah, it currently depends on the BatchAPI to populate media info,
https://github.com/wikimedia/parsoid/blob/master/src/Config/Api/DataAccess.php#L155

@Reedy - could use your eyes on this. Thank you!

@Reedy - could use your eyes on this. Thank you!

@Reedy Maybe ping @Arlolra when you are about to begin so that he could update the branch with the latest changes on master.

Hey @ssastry - thanks for cutting the security review branch. @Reedy and I will plan to review that soon and reach out to @Arlolra with any questions.

Update: @Reedy and I are having a look at this and T230140 and should have some reports soon. We're going to timebox to maybe two weeks or so - hopefully that doesn't push back too much on any targeted deployment dates.

Update: @Reedy and I are having a look at this and T230140 and should have some reports soon. We're going to timebox to maybe two weeks or so - hopefully that doesn't push back too much on any targeted deployment dates.

Thanks! That sounds good. Should I rebase the security branch with latest changes to master?

Thanks! That sounds good. Should I rebase the security branch with latest changes to master?

@ssastry - Sure, that'd be great.

Should I rebase the security branch with latest changes to master?

Done, https://phabricator.wikimedia.org/diffusion/GPAR/history/T227209/

@ssastry @Arlolra - I've been looking at the T227209 branch of parsoid this week. I wanted to list some assumptions and questions here in an attempt to constrain the scope of this review as much as possible while still ensuring I'm covering all relevant pieces.

  1. Unless I'm missing something, everything within /guides seems relevant to the older Node-based parsoid service. I assume these will be updated or removed, eventually?
  2. Assuming the above is correct, I think the best/easiest way for me to actually security-test Parsoid/PHP is with bin/parse.php, as described within the Working test environment section above. I'd also be interested in testing within MW-integration mode, though if this would be fairly redundant (and seems a temporary testing solution anyways), I wouldn't want to waste time testing against it as well. I suppose I'm most interested in the best/easiest way to test Parsoid/PHP's inherent functionality. If that's not accomplished via bin/parse.php, or that simply is not representative enough of how Parsoid/PHP will be used within the context of Wikimedia projects, please let me know.
  3. I'm not planning to analyze any PHP or JS dependencies from composer or npm. A quick composer security:check and npm audit show there aren't any reported vulnerable packages anyways. An npm outdated shows 29 packages, but these do not appear to have corresponding security issues. I'm also curious how relevant the dependencies within package.json/npm-shrinkwrap.json are for Parsoid/PHP - can you shed some light on that?
  4. I'm guessing that I can ignore most things in /lib, /tools and /extension (as /extension is a temporary solution for testing) - please let me know if this is incorrect. While the tests are interesting, at around ~26k total tests, it's kind of an oppressive volume to do anything with in the context of a security review, though I have given them a very high-level overview. Anyhow, I've mainly been reviewing the 195-ish php files within /src as that seems to be where the bulk of Parsoid/PHP's functionality lies.

If the above sounds good and there are no lingering issues or concerns, I should be able to have some meaningful results later next week.

Yes, the only code relevant to Parsoid/PHP is in (a) src/ and (b) extension/ (c) tests/. All code in lib/, tools/ and some code in tests/ are Parsoid/JS code. So, the npm dependencies aren't relevant for Parsoid/PHP.

The code in extension/ will be moved to core when we are ready to use Parsoid/PHP strictly as a composer library. So, that code is relevant and the only difference is whether it will live in the core MediaWiki repo or in the Parsoid repo. That code in extension/ is how Parsoid will integrate with MediaWiki. So, the API code as well as config and data access functionality comes from there.

@ssastry - Sounds good, thanks for the clarifications.

Update: So far I'm not finding a lot that concerns me, which I guess is a good thing. A lot of the basic things I tend to look for don't really apply to Parsoid/PHP (security headers, tls, authn/z issues, encryption issues, policy compliance, code and project cleanliness, etc.) And certain classes of vulnerabilities like SQLi, CSRF, etc. obviously do not apply. And as noted previously, all relevant dependencies came back with a clean bill of health from composer security:check and snyk. I'd like to play around with more problematic inputs through next week if I can, just to ensure that I've done enough due diligence there. But so far, so good.

Update: So far I'm not finding a lot that concerns me, which I guess is a good thing. A lot of the basic things I tend to look for don't really apply to Parsoid/PHP (security headers, tls, authn/z issues, encryption issues, policy compliance, code and project cleanliness, etc.) And certain classes of vulnerabilities like SQLi, CSRF, etc. obviously do not apply. And as noted previously, all relevant dependencies came back with a clean bill of health from composer security:check and snyk. I'd like to play around with more problematic inputs through next week if I can, just to ensure that I've done enough due diligence there. But so far, so good.

Thanks for the update! :)

Security Review Summary - T227209 - October 2019
As was already stated within T227209#5531185, the code for Parsoid/PHP looks pretty good from a security standpoint. With any codebase this size, I wasn't able to test every line or code path, but I did itereate through several standard checks while attempting to break it as much as I could via bin/parse.php and MW-integration mode. I would currently classify this with an overall risk level of low with findings below:

Vulnerable Packages
Nothing found for any PHP dependencies via composer's security:check and snyk. npm audit and snyk found some vulnerable Node dependencies for the dev dependency nsp (oddly enough, the old package which turned into audit) and service-runner's ms dependency. As I understand it, these packages are not relevant for Parsoid/PHP. Reporting here for the sake of completeness.

Outdated Packages
npm outdated came back with around 15 outdated packages, again not security-related per se and not relevant for Parsoid/PHP as I understand it. Reporting here for the sake of completeness.

Static Analysis Findings
I might recommend setting up the SecurityCheckPlugin as a test script for Parsoid/PHP, as it's already using Phan. It probably makes sense to use the seccheck-generic script as described here though possibly not in Jenkins, at least not in any voting capacity. I ran the seccheck-generic config against the code base and one test did come back with an XSS warning:

tests/porting/hybrid/runDOMPass.php:470 SecurityCheck-XSS Echoing expression that was not html escaped (Caused by: tests/porting/hybrid/runDOMPass.php +462)

Obviously this isn't something user-facing or even something that would ever be executed within a production context within a client susceptible to XSS attacks, but it might be worth having a look at [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/parsoid/+/refs/heads/master/tests/porting/hybrid/runDOMPass.php#81 | runDOMPostProcessor ]] to see if the output ($out) is being generated as expected. Or setting a Phan docblock annotation to ignore it.

Additionally, I noticed that Phan's current config for Parsoid/PHP doesn't appear to leverage any of its bundled plugins. Some of these create a lot of noise (UnknownElementTypePlugin, NumericalComparison, WhitespacePlugin et al) but some of them are fairly useful from a security-adjacent standpoint. For example, there were a few things that caught my eye, which are probably fine, but might warrant some review:

src/Utils/TokenUtils.php:376 PhanPluginDollarDollar $$ Variables are not allowed.
src/Utils/TokenUtils.php:377 PhanPluginDollarDollar $$ Variables are not allowed.
src/Wt2Html/TT/ParserFunctions.php:116 PhanPluginAlwaysReturnMethod Method \Parsoid\Wt2Html\TT\ParserFunctions::noTrimRes has a return type of array, but may fail to return a value
tests/porting/hybrid/runDOMPass.php:411 PhanPluginDuplicateArrayKey Duplicate/Equivalent array key value('omit') detected in array - the earlier entry will be ignored.

These were found using some of the basic Phan plugins documented here. And here's a handy way to list and format some of the bundled plugins:

ls -p1 vendor/phan/phan/.phan/plugins/ |grep -Ev "(/|README)" |sed -e s/\.php// |xargs printf "\t\'%s\',\n"

General Security Issues

  1. I noticed a few issues with bin/parse.php. Since this is a test script, these may not really be issues:
    1. The --inputfile, --domain and --apiURL options do not do much in the way of validation. It's trivial to pass garbage data to them to cause PHP warnings and exceptions.
    2. The --html2wt and --wt2wt options seem to return unsanitized data, whereas the --html2html and --wt2html options do not. I assume this is by design, but obviously any potentially dangerous html/css/javascript could be returned with the --html2wt and --wt2wt options.
  2. While testing with bin/parse.php and the --html2html and --wt2html options I found an odd notice when sending data with the <poem> tag: PHP Notice: Uninitialized string offset: 0 in src/Ext/Poem/Poem.php on line 55
  3. I was able to get an exception in src/Utils/Util.php when testing with bin/parse.php --html2html with the following payload: <script\x20type=\"text/javascript\">javascript:alert(1);</script> (note the \x20 character): BadMethodCallException from line 61 of src/Utils/Util.php: Use lowercase tag names

Thanks @sbassett. We'll take a look and address these.

Change 542604 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Enable phan-taint-check-plugin

https://gerrit.wikimedia.org/r/542604

npm audit and snyk found some vulnerable Node dependencies for the dev dependency nsp (oddly enough, the old package which turned into audit) and service-runner's ms dependency. As I understand it, these packages are not relevant for Parsoid/PHP. Reporting here for the sake of completeness.

Correct, but other WMF projects use service-runner, so hopefully that's fixed upstream.

I might recommend setting up the SecurityCheckPlugin as a test script for Parsoid/PHP, as it's already using Phan.

Working on this in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/542604

@Arlolra, wonder if we should handle T234057: Get rid of hybrid testing code to get rid of some of the noise from that code (which isn't used in production and was only used in the earlier stages of porting to aid QA).

@sbassett , we are considering resolving T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic next week. But since that doesn't involve any interaction with Parsoid clients, is it okay from your POV to proceed with that before this security ticket is resolved? We'll address all security feedback from here before we resolve T229015: Tracking: Direct live production traffic at Parsoid/PHP.

@ssastry - with the overall risk level for this review being low and some of the issues turned up during the review already being dealt with, it should be fine to go ahead with T235902.

  1. While testing with bin/parse.php and the --html2html and --wt2html options I found an odd notice when sending data with the <poem> tag: PHP Notice: Uninitialized string offset: 0 in src/Ext/Poem/Poem.php on line 55

This was presumably fixed by,
https://github.com/wikimedia/parsoid/commit/c3804d755f7b01b61b757e7cae0a39ea6974e85e#diff-a41351ac3655dc5860d34e355a1a8e82R39

Change 547021 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Throw if Remex did any invalid name coercion

https://gerrit.wikimedia.org/r/547021

  1. I was able to get an exception in src/Utils/Util.php when testing with bin/parse.php --html2html with the following payload: <script\x20type=\"text/javascript\">javascript:alert(1);</script> (note the \x20 character): BadMethodCallException from line 61 of src/Utils/Util.php: Use lowercase tag names

This was pretty interesting. The "Use lowercase tag names" error was just some defensive coding we added during the port since DOM node names from the library we're using in JS are returned in uppercase and lowercase in PHP. However, Remex adds some name coercion for characters that libxml will throw on, which results in an uppercase "U".

See https://github.com/wikimedia/remex-html/blob/master/RemexHtml/DOM/DOMUtils.php#L22

So, the script\x20type turns into something like scriptU00005Cx20type.

When parsing (the wt2html direction), the sanitizer will prevent this from ever occurring. But serializing (html2wt) is another story, since we feed Remex client input. The above patch (T227209#5617108) returns a ClientError if given HTML of this form. That may be too strict though, we'll see.

Change 547021 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Throw if Remex did any invalid name coercion

https://gerrit.wikimedia.org/r/547021

Change 547681 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Leverage phan's bundled plugins

https://gerrit.wikimedia.org/r/547681

Change 547681 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Leverage phan's bundled plugins

https://gerrit.wikimedia.org/r/547681

Additionally, I noticed that Phan's current config for Parsoid/PHP doesn't appear to leverage any of its bundled plugins. Some of these create a lot of noise (UnknownElementTypePlugin, NumericalComparison, WhitespacePlugin et al) but some of them are fairly useful from a security-adjacent standpoint. For example, there were a few things that caught my eye, which are probably fine, but might warrant some review:

src/Utils/TokenUtils.php:376 PhanPluginDollarDollar $$ Variables are not allowed.
src/Utils/TokenUtils.php:377 PhanPluginDollarDollar $$ Variables are not allowed.
src/Wt2Html/TT/ParserFunctions.php:116 PhanPluginAlwaysReturnMethod Method \Parsoid\Wt2Html\TT\ParserFunctions::noTrimRes has a return type of array, but may fail to return a value
tests/porting/hybrid/runDOMPass.php:411 PhanPluginDuplicateArrayKey Duplicate/Equivalent array key value('omit') detected in array - the earlier entry will be ignored.

These were found using some of the basic Phan plugins documented here. And here's a handy way to list and format some of the bundled plugins:

ls -p1 vendor/phan/phan/.phan/plugins/ |grep -Ev "(/|README)" |sed -e s/\.php// |xargs printf "\t\'%s\',\n"

Leveraging bundled plugins was addressed in T227209#5637738

  1. I noticed a few issues with bin/parse.php. Since this is a test script, these may not really be issues:
    1. The --inputfile, --domain and --apiURL options do not do much in the way of validation. It's trivial to pass garbage data to them to cause PHP warnings and exceptions.

Yes, this is a test script, so not really a concern.

  1. The --html2wt and --wt2wt options seem to return unsanitized data, whereas the --html2html and --wt2html options do not. I assume this is by design, but obviously any potentially dangerous html/css/javascript could be returned with the --html2wt and --wt2wt options.

Yes, this is by design. Sanitization happens while generating html, not serialization.

I might recommend setting up the SecurityCheckPlugin as a test script for Parsoid/PHP, as it's already using Phan.

Working on this in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/542604

phan-taint-check-plugin@latest (either 2.1.0 or master) and mediawiki-phan-config@0.8.0 seem to have incompatible constraints in the version of phan required.

I'm going to leave that patch as WIP until that can be rectified but otherwise consider the above concerns addressed.

Should this task be resolved or do we want to look at the diff between the branch and master now?

https://github.com/wikimedia/parsoid/compare/T227209...master

phan-taint-check-plugin@latest (either 2.1.0 or master) and mediawiki-phan-config@0.8.0 seem to have incompatible constraints in the version of phan required.

I'm going to leave that patch as WIP until that can be rectified but otherwise consider the above concerns addressed.

Yes, this is a known issue, sorry if I lead you on a wild goose chase there. Hopefully the issue is fixed soon.

Should this task be resolved or do we want to look at the diff between the branch and master now?

https://github.com/wikimedia/parsoid/compare/T227209...master

I think we can resolve this task for now as everything from the posted review (T227209#5548653) has been addressed, I believe. Once Parsoid/PHP is closer to a full production deployment, perhaps another security review branch can be cut just to ensure that no significant security issues have been introduced post-review.

sbassett moved this task from Waiting to Done on the user-sbassett board.
sbassett moved this task from Waiting to Done on the deprecated-security-team-reviews board.

Since this task is resolved, I'm going to remove the branch. But, for posterity, it was the same as master 3d193c9.

Change 542604 abandoned by Arlolra:
[mediawiki/services/parsoid@master] [WIP] Enable phan-taint-check-plugin

Reason:
This should presumably already be covered with extension-phan

https://gerrit.wikimedia.org/r/542604