As discussed in T177765#4867361, Proton should not have access to the internal Wikimedia network (*.wmnet, IP addresses), and should probably only have access to those external pages which are expected to be used for rendering the page (as a first approximation, Wikimedia domains only). So a web proxy, CSP injection, or some other mechanism for ensuring that is needed.
| Status | Assignee | Task |
| Resolved | ovasileva | T181079 [GOAL] Provide an expanded reading experience by improving the ways that users can download articles of interest for later consumption |
| Resolved | None | T181084 [EPIC] Deploy the mediawiki-services-chromium-render service (Proton) |
| Resolved | holger.knust | T210651 Switch all PDF render traffic to new Proton service |
| Resolved | Tgr | T213362 Limit what URLs Proton can access |
Mentioned In:
- rESCD9373c428125d: Add gistcdn.githack.com to host blacklist
- T210651: Switch all PDF render traffic to new Proton service
- T214975: proton experienced a period of high CPU usage, busy queue, lockups
- rMSCRaaf8131fd555: Add Puppeteer domain blacklist regexp
- rESCD97b702f8dc8b: Set blacklist regex
- rESCD1380c5024423: Set blacklist regex
- rESCDb64d29e48e2c: Set blacklist regex
- rMSCR898828301e61: Add Puppeteer domain blacklist regexp
- rMSCRc8a8f39f2678: Add Puppeteer domain blacklist regexp
- rMSCRb673cabf0fdc: Add Puppeteer domain blacklist regexp
- rMSCR6551e5dbdfea: Add Puppeteer domain blacklist regexp
- rMSCR975ccdceb91e: Add Puppeteer domain blacklist regexp
- rMSCR4a20acd66107: Add Puppeteer domain blacklist regexp
- rMSCRfafffb2a6417: Add Puppeteer domain blacklist regexp
- rMSCR76fa24dec9be: Add Puppeteer domain blacklist regexp
- rMSCR37aa301fc867: Add Puppeteer URL whitelist regexp
- rMSCRea16212799c0: Add Puppeteer URL whitelist regexp
- rMSCR1d38c961205c: Add Puppeteer URL whitelist regexp
- rMSCRe13755e9d31d: Add Puppeteer URL whitelist regexp
- rMSCRa2e597ecf796: Add Puppeteer URL whitelist regexp
- rMSCR41adab32f3c0: Add Puppeteer URL whitelist regexp
- rMSCR1c90da6dbdee: Add Puppeteer URL whitelist regexp
- rMSCR0931d4b20b21: Add Puppeteer URL whitelist regexp
- rMSCR0bb307af4f43: Add Puppeteer URL whitelist regexp
- rMSCRb940b462778a: Add Puppeteer URL whitelist regexp
- rMSCR79f77f392f5e: Add Puppeteer URL whitelist regexp
- T210652: Handoff Proton service to Reading Infrastructure
- T177765: Security review of mediawiki-services-chromium-render
Mentioned Here:
- T210651: Switch all PDF render traffic to new Proton service
- T177765: Security review of mediawiki-services-chromium-render
> As discussed in T177765#4867361, Proton should not have access to the internal Wikimedia network
This is actually not a problem. All the queries that the service receives contain the fixed domain and title to request. Moreover, external clients can only access it via RB, so they cannot control the domain the service receives as a param. Finally, Proton checks with RB that the title exists before requesting its contents.
@mobrovac I think what @Tgr was getting at is not the title/URL requested but whether the page content that Chromium loads has references to internal network resources, since the browser will try to fetch them (things like image srcs, for example).
I'm not sure how exploitable that kind of vector really is, but it is something we should really double-check.
Currently, the CSP for the HTML that Proton receives would be something like:
content-security-policy: default-src 'none'; media-src *; img-src *; style-src http://*.wikipedia.org https://*.wikipedia.org 'unsafe-inline'; frame-ancestors 'self'
I guess we could further restrict the CSP for Proton and rely on Chromium to limit where it can go.
IMO a stronger upstream CSP is nice to have but relying on just that is fragile; it is too easy to change it at the source without understanding what effect that will have on Proton. Whatever filtering is used should ideally be more self-contained than that.
Other options that were raised:
- Set up some lightweight standalone proxy, force all requests through it (using the --proxy-server Chromium option in Puppeteer), and do the filtering there. Easy, but one more moving part (see the sketch after this list).
- Use Puppeteer's request interception mode to abort non-whitelisted requests, as Piotr said above. Very easy, and it means the URL filtering code lives in the same place as the rest of Proton. The question is how reliable Puppeteer's request interception is; they have tests for it, so it's probably OK to rely on it not breaking without anyone noticing.
- Inject CSP within Puppeteer? I seem to recall this being mentioned in some discussion, but at a glance it does not seem possible.
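For the first option, the Puppeteer side would be tiny; a minimal sketch, assuming a hypothetical local filtering proxy at 127.0.0.1:3128 (the proxy itself, and its filtering rules, would be the extra moving part):

```
// Sketch only: route everything Chromium loads through a local filtering proxy.
// The proxy address below is a placeholder; the proxy and its rules live elsewhere.
const puppeteer = require('puppeteer');

async function renderPdf(url) {
    const browser = await puppeteer.launch({
        // Chromium flag passed straight through Puppeteer; subresource
        // requests (images, styles, media) also go through the proxy.
        args: ['--proxy-server=http://127.0.0.1:3128']
    });
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' });
        return await page.pdf({ format: 'A4' });
    } finally {
        await browser.close();
    }
}
```

With this, filtering is entirely the proxy's job and Proton's own code stays unchanged, at the cost of deploying and monitoring one more service.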
If we can do a combination of a stronger CSP (limit images/media to the wiki domain + uploads) and request interception, I think that would be pretty robust and still easy to do.
Regarding stronger CSP, we might need a bit of an investigation. We definitely do not want the CSP to vary by user agent, so we need to look at how narrow we can make the CSP without breaking anything. I think wikidomain + *.wikimedia.org + wikidata (just in case) should work well. I'll look into it.
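Just to illustrate the direction (purely an example, not a vetted policy - the exact source list is what that investigation needs to determine), a narrowed header for an enwiki render might look something like
content-security-policy: default-src 'none'; media-src https://upload.wikimedia.org; img-src https://upload.wikimedia.org https://*.wikipedia.org; style-src https://*.wikipedia.org https://*.wikimedia.org 'unsafe-inline'; frame-ancestors 'self'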
I agree that varying CSP by UA is non-ideal – it's fragile and yet another moving part (it'd have to be implemented in Varnish, right?).
We could actually inject the CSP using the very same request/response interception mechanism in Proton: intercept all requests and test them against a whitelist, and intercept all responses and override the CSP header. This has the advantage that all related code is in one place, as @Tgr states above.
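A rough, untested sketch of how that might look: as far as I can tell the headers of a response Chromium has already received can't be rewritten, so the handler would have to re-fetch the main document itself and hand Chromium a synthetic response carrying our CSP. node-fetch and the CSP value below are placeholders, not what Proton actually uses.

```
const fetch = require('node-fetch'); // assumed dependency for re-fetching the document

async function overrideCspViaInterception(page, csp) {
    await page.setRequestInterception(true);
    page.on('request', async (request) => {
        // (The whitelist test discussed above would sit here as well,
        // aborting anything that doesn't match.)
        if (request.resourceType() !== 'document') {
            return request.continue();
        }
        // Re-fetch the main document so we can hand Chromium a response
        // carrying our own CSP header instead of the upstream one.
        const upstream = await fetch(request.url());
        const body = await upstream.text();
        return request.respond({
            status: upstream.status,
            contentType: upstream.headers.get('content-type'),
            headers: { 'content-security-policy': csp },
            body
        });
    });
}
```

Whether re-fetching the document like this is acceptable (it adds an extra request per render) would need checking.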
Hrrm… I thought we might be able to listen to [[ https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#event-response | the response event ]]. I'd assumed that the headers property was mutable but you might be right.
As an interim solution - which will most probably provide enough security - let's provide:
- a config variable where we can store a regex to whitelist URLs (or possibly an array of regexes)
- if the config option is set, use Page request interception and abort any request that doesn't match the regex (a sketch follows below)
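Something like this, roughly (the config key name is made up; whatever the real Proton config exposes would be used instead):

```
// Sketch of the interim approach: a configurable whitelist regex plus
// request interception; if no regex is configured, behaviour is unchanged.
async function applyUrlWhitelist(page, conf) {
    const pattern = conf.puppeteerOptions && conf.puppeteerOptions.whitelistRegex; // hypothetical key
    if (!pattern) {
        return;
    }
    const whitelist = new RegExp(pattern);
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (whitelist.test(request.url())) {
            request.continue();
        } else {
            // Non-whitelisted URLs are aborted before Chromium opens a connection.
            request.abort();
        }
    });
}
```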
Does it solve your concerns @Tgr?
No, it just needs to be done. They can review at any time if they have the capacity, but Proton is already in production and even if the patch doesn't work it cannot make it *less* secure.
This did not go well; adding an external CSS import (tested via (1) and (2)) breaks the service: it just hangs and times out after 60 sec. (No log entries on the Proton channel; RESTBase logs the timeout, but that's not too informative).
As far as I can remember this is identical to how I tested it on Vagrant, so not quite sure how to debug it...
I forgot that we ended up using a whitelist instead of a blacklist (specifically .*:.*|[\d.]+|.*\.wmnet) so this is probably an unrelated bug. Also I can't reproduce it now: https://test.wikipedia.org/api/rest_v1/page/pdf/Test-T213362?new_pdf=1 works fine. Without new_pdf=1 it does time out though, so I'll just assume I got confused and tested it on Electron, not Proton. So apparently Electron hangs in the presence of @import rules, but it's about to be decommissioned, so we probably don't care.
On the other hand the external style is not applied, even though the URL filtering does not apply to it. Not sure what's going on there. https://eo.wikipedia.beta.wmflabs.org/api/rest_v1/page/pdf/Test-T213362?new_pdf=1 does work as expected, at least.
Confirmed on beta that the patch is working as expected. Not sure why @import gets filtered out in production anyway, but it's probably not a good use of time to debug why something worked correctly, now that the reason it theoretically shouldn't have is fixed.