Page MenuHomePhabricator

Parsoid/PHP responds with 404 for titles with slashes and revision
Closed, DuplicatePublic

Description

I noticed this while testing RESTBase against Parsoid/PHP. I'm not sure whether the problem is in Parsoid or the MW router, but when requesting the page bundle for a specific revision of a title that contains a slash in it, Parsoid/PHP responds with a 404:

$ curl -H'Host: en.wikipedia.beta.wmflabs.org' https://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/User%3APchelolo%2FOnDemand_Test/275844
{"messageTranslations":{"en":"The requested relative path (/en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/User:Pchelolo/OnDemand_Test/275844) did not match any known handler"},"httpCode":404,"httpReason":"Not Found"}

The same occurs when issuing the request for the html route, so this may be a problem with the MW REST router rather than Parsoid. Naturally, the page and revid exist:

$ curl -H'Host: en.wikipedia.beta.wmflabs.org' https://en.wikipedia.beta.wmflabs.org/wiki/User%3APchelolo%2FOnDemand_Test?oldid=275844
< HTTP/2 200 
< date: Sun, 13 Oct 2019 11:07:35 GMT
< content-type: text/html; charset=UTF-8
< server: deployment-mediawiki-07.deployment-prep.eqiad.wmflabs
< x-powered-by: PHP/7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1

Fetching page bundles for pages that do not have / in the name works as expected:

$ curl -H'Host: en.wikipedia.beta.wmflabs.org' https://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/html/User%3APchelolo/347883
< HTTP/2 200 
< date: Sun, 13 Oct 2019 11:11:38 GMT
< content-type: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"
< server: deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs
< x-powered-by: PHP/7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1

To make things even more interesting, if the page bundle of a title with / is requested without the revision, the request succeeds:

$ curl -H'Host: en.wikipedia.beta.wmflabs.org' https://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/html/User%3APchelolo%2FOnDemand_Test
< HTTP/2 200 
< date: Sun, 13 Oct 2019 11:14:38 GMT
< content-type: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"
< content-length: 1242
< server: deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs
< x-powered-by: PHP/7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1

Event Timeline

mobrovac triaged this task as High priority.Oct 13 2019, 11:16 AM
mobrovac created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 13 2019, 11:16 AM

Several of the new Core REST API endpoints accept titles. We should see if they are similarly affected.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/542457 proposes a REST API param validator related to titles and may (1) also be affected by this and (2) might be a good place, or at least inspiration, for a reusable solution.

I haven't looked too deep into it, but I think the problem is that the title gets decoded before looking up the appropriate handler, which then presents a challenge in this case. Parsoid declares the following (relevant) route handlers:

{
	"path": "/{domain}/v3/page/{format}/{title}",
	"class": "MWParsoid\\Rest\\Handler\\PageHandler",
	"factory": "MWParsoid\\Rest\\Handler\\PageHandler::factory",
	"method": "GET"
}, {
	"path": "/{domain}/v3/page/{format}/{title}/{revision}",
	"class": "MWParsoid\\Rest\\Handler\\PageHandler",
	"factory": "MWParsoid\\Rest\\Handler\\PageHandler::factory",
	"method": "GET"
}

So the same code path handles both cases, and thus works both with and without a revision parameter present. What I think is happening is that when the path en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/User%3APchelolo%2FOnDemand_Test/275844 is encountered by MW, it is first decoded into /en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/User:Pchelolo/OnDemand_Test/275844, but that cannot be matched to any of the above routes, as

format: pagebundle
title: User:Pchelolo
revision: OnDemand_Test
???: 275844

Conversely, because of this matching algorithm, the request for en.wikipedia.beta.wmflabs.org/v3/page/html/User%3APchelolo%2FOnDemand_Test (i.e. without the revision) succeeds because:

format: pagebundle
title: User:Pchelolo
revision: OnDemand_Test

However, it should be noted here that the request succeeds only because Parsoid can handle requests both with and without the revid, and not because the handler look-up algorithm matched the route correctly.

Ok, I was partially correct. The problem is indeed that MW tries to match the decoded path, which fails because of the slash. However, it turns out that that's not on MW, but rather on Apache: it fully decodes the path before passing it on to PHP.

Ok, I was partially correct. The problem is indeed that MW tries to match the decoded path, which fails because of the slash. However, it turns out that that's not on MW, but rather on Apache: it fully decodes the path before passing it on to PHP.

I just checked and we have a number of test titles in round trip testing that have a "/" in them and they all successfully go through rt testing. So, yes, this must be something not in the M/W-Parsoid combo. Is there a way to prevent Apache from decoding? Or, not sure if it is feasible for the M/W code to not fail on "/" because the revid ought to be a number .. but presumably that kind of type information isn't available in the config .. so, I think the qn. is if Apache can be configured not to decode titles.