Page MenuHomePhabricator

Security Concept Review For Parsoid-PHP
Closed, ResolvedPublic

Description

Project Information

  • Name of project: Parsoid/PHP (PHP Port of Parsoid)
  • Project home page: https://www.mediawiki.org/wiki/Parsing/Notes/Moving_Parsoid_Into_Core / https://www.mediawiki.org/wiki/Parsoid
  • Name of team which owns the project: Parsing
  • Primary contact for the project: @ssastry
  • Target date for deployment: Q1 2019/20
  • Link to code repository: https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fservices%252Fparsoid
  • Is this a brand-new project: No
  • Has this project ever been reviewed before: (Phab tasks, etc.): Nothing formally as far as I know. But,we have had incident-specific reviews and any reviewing done in the context of clients that use Parsoid; We also had a long-standing nsp check for Parsoid's node modules which we stayed on top of.
  • Has any risk assessment (STRIDE, etc.) been performed: Not that I know of.
  • Is there an existing RFC or has this been presented to the community: Not relevant
  • Is this project tied to a team quarterly goal: This is tied to the Platform Evolution CDP
  • Does this project require its own privacy policy: No

Description of the project and how it will be used

I will just point to https://www.mediawiki.org/wiki/Parsoid for now. I can follow up with any additional info as required.
But, specifically, the context for this is that we are porting Parsoid to PHP which will be integrated into core as a composer library and will run in-process on REST API (see T221738 ) requests made to MediaWiki. We want to deploy this in July/August.

Parsoid/JS (currently deployed on the Wikimedia cluster) is not exposed directly to the internet. All requests to it go through RESTBase exposed REST API for wikis (Ex: https://en.wikipedia.org/api/rest_v1/ ). But, with the Parsoid/PHP offering which will be integrated into core, we can similarly deploy to an internal cluster that is not directly accessible on the internet and disable it on the app cluster and elsewhere where the MediaWiki API is exposed to the internet. But, this is all part of the security review - to figure out appropriate boundaries, exposed surfaces, and necessary hardening to prepare this codebase for deployment in a timely manner without compromising security and privacy needs when we go from Parsoid/JS that ran an independent stateless service to Parsoid/PHP that will run while bundled with MediaWiki.

Description of any sensitive data to be collected or exposed

None

Technologies employed

Whatever MediaWiki core uses (PHP 7.2, composer, phan, etc.)

Dependencies and vendor code

Dependencies are listed in the composer.json file of the repository

Working test environment

None provided as of yet.

Event Timeline

ssastry created this task.Apr 25 2019, 10:33 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 25 2019, 10:33 PM
ssastry updated the task description. (Show Details)Apr 25 2019, 10:37 PM
Tgr added a subscriber: Tgr.Apr 25 2019, 10:46 PM

Description of any sensitive data to be collected or exposed

Deleted/oversighted content, maybe? With Parsoid-JS this wasn't an issue since everything went through the MediaWiki API which had its own permission checking, with Parsoid-PHP it's less clear where that would happen (presumably in the MediaWiki bindings but that code mostly doesn't exist yet).

Also, in wikis where not everyone can read everything, you can expose private data by e.g. embedding it as a template and asking the parser to parse it. The existing parser is probably already vulnerable to this and MediaWiki in general does not give any guarantees. We might be able to improve the state of things with Parsoid if we want.

I would like to explicitly separate two projects: (a) deployment of Parsoid/PHP as a replacement for Parsoid/JS (b) making Parsoid/PHP the default wikitext engine on Wikimedia cluster.

This security concept review request is only for (a) and as I alluded to in the description, we can adopt more stringent isolation as necessary ... for example, by deploying this on a cluster that isn't exposed to the internet. And, we can choose not to expose the Parsoid REST API on other app servers (probably by disabling it via configuration or not even deploying the Parsoid/PHP code to those servers or whatever other mechanisms make sense).

In the time it takes us to go from (a) to (b), we can address other concerns that we need to factor in if the Parsoid REST API is exposed directly.

Description of any sensitive data to be collected or exposed

Deleted/oversighted content, maybe? With Parsoid-JS this wasn't an issue since everything went through the MediaWiki API which had its own permission checking, with Parsoid-PHP it's less clear where that would happen (presumably in the MediaWiki bindings but that code mostly doesn't exist yet).

Yes, Parsoid-PHP composer library code will not deal with cookies or authentication. It is meant to be a purely input -> output transformation library. The config objects that MediaWiki constructs and passes into Parsoid's PHP entry points will likely deal with authentication and access issues.

cscott added a subscriber: cscott.EditedApr 26 2019, 3:05 PM

I recall that we did have the issue with Parsoid being exposed to sensitive content, specifically usernames and deleted revisions. The details are foggy, but I vaguely recall that I had to remove some information from the <head> of the Parsoid document because it could expose information about deleted revisions.

In theory Parsoid previously was interacting with mediawiki core via the public action API, so it "shouldn't" have access to anything that's not otherwise available by the public API. And yet... there was this one case I need to hunt down where there was a loophole we had to plug. It may have been caching related; perhaps restbase ensured that it didn't return a deleted oversighted page but we could still return metadata from that page as part of the "previous revision" header information from Parsoid, if the follow-on page was generated and cached before the admin action occurred? It might have been username for the previous revision?

Further, once Parsoid is integrated with core, it will be doing direct requests to the backend DB instead of using the public API, and so there is a greater chance that we 'leak' information from the DB which should otherwise be hidden. Deleted/oversighted content, as @Tgr mentions, is probably the most obvious case. But technically Parsoid will have access (via the DB) to other sensitive content (like, say, IP addresses of Users) and we should pay close attention to the Parsoid-to-core interface code to make sure that sensitive data doesn't leak.

Integrated Parsoid will probably also be used on closed wikis (within the WMF, eg: officewiki), so again whatever access controls are on the individual pages will need to be applied to the content made available via the Parsoid API. Previously this was taken care of for us because our API request for the sensitive content would be rejected by the action API; now we will instead have to reproduce those access checks ourselves.

I recall that we did have the issue with Parsoid being exposed to sensitive content, specifically usernames and deleted revisions. The details are foggy, but I vaguely recall that I had to remove some information from the <head> of the Parsoid document because it could expose information about deleted revisions.

You are thinking of T125266: Remove user name and edit comment from html <head>

cscott added a comment.EditedApr 26 2019, 3:22 PM

Yes, indeed. I just finished digging through git log to find that patch, and came here and you'd beaten me to it. ;)

The interesting part of that task is T120409: RESTBase should honor wiki-wide deletion/suppression of users, which is part of the access control mechanism RESTBase had been doing for us, which we'd potentially need to reimplement if our API were directly accessible.

@ssastry - thanks for submitting this. I was just curious if there's a more concrete deployment date in mind than "Q1 2019/20". That would help us a bit more with our scheduling.

sbassett triaged this task as Normal priority.Apr 29 2019, 2:59 PM

@ssastry - thanks for submitting this. I was just curious if there's a more concrete deployment date in mind than "Q1 2019/20". That would help us a bit more with our scheduling.

Nope. Not yet. A lot depends on how the rest of the porting goes, and what we discover during additional QA testing and performance benchmarking. But, while we are going to try and be aggressive on our end to get it out sooner than later, it is probably safe to say end-July at the earliest. As we get further along, I'll provide additional updates on timeline.

sbassett lowered the priority of this task from Normal to Low.Apr 30 2019, 5:18 PM
sbassett updated the task description. (Show Details)
ssastry moved this task from Backlog to Deployment on the Parsoid-PHP board.May 2 2019, 4:40 AM

@ssastry - looking at https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fservices%252Fparsoid - are most of the patch sets here related to the new PHP-Parser development? Is there a specific start date or branch/topic we should be paying attention to right now for the new development? Or does master for mediawiki/services/parsoid fulfill that role right now? Trying to understand if it's worth the Security-Team's time to have a look at recent patch sets or if we should wait for a more stable release branch, etc. Thanks.

ssastry added a comment.EditedMay 7 2019, 6:27 PM

@ssastry - looking at https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fservices%252Fparsoid - are most of the patch sets here related to the new PHP-Parser development? Is there a specific start date or branch/topic we should be paying attention to right now for the new development? Or does master for mediawiki/services/parsoid fulfill that role right now? Trying to understand if it's worth the Security-Team's time to have a look at recent patch sets or if we should wait for a more stable release branch, etc. Thanks.

So, porting started in January with it picking in earnest in February. All PHP code is in the the master branch under the src/ directory with a composer.json file in the root directory. The vast majority of patches since then have been PHP code being ported and merged and tweaked. But, there have also been javascript patches when we decided to refactor code or fix something to aid and simplify the porting process.

I don't think it will be useful for you to track patches in gerrit since many of those patches tend to be raw ports and not all of them are immediately verifiable and testable till some other components get ported. If you want to look at PHP code, I can point you to code that has already been ported and tested and verified to a reasonable degree. When we are done with this, we expect about 50K+ lines of PHP code.

But one thing to note is that there is a sanitizer that sanitizes parsed tokens before it is built into a DOM. That sanitizer code is a very close mirror of the sanitizer code used in the PHP parser. But, the route is a circuitous one. PHP Parser Sanitizer was ported to Javascript in 2012 which now got ported back to PHP, and in the process, we have also updated both sanitizers to more closely match PHP parser's sanitizer behavior. Here is the PHP sanitizer code which we have tested and you can review right away. The equivalent JS code (which is currently in production) is here.

sbassett added a comment.EditedMay 7 2019, 6:48 PM

Ok, thanks, @ssastry. I suppose we (the Security-Team) can have a look at some of the code in src/, particularly the sanitizer code. Though as this isn't a code review (at least not yet) we probably won't get too deep in the weeds there. At some point, we should perform a more focused analysis and risk assessment, though we'd want to get to a more stable release point first.

sbassett claimed this task.Jun 11 2019, 7:25 PM
sbassett closed this task as Resolved.Jun 12 2019, 6:27 PM

The Security-Team is fine with this from a conceptual point of view. I'm going to resolve this task for now in favor of the incoming MW REST API and Parsoid-PHP security review requests to be submitted by @EvanProdromou and @ssastry, respectively.