Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS)
Open, Medium · Public · BUG REPORT

Assigned To
None
Authored By
DannyS712
Nov 14 2019, 2:04 AM
Referenced Files
F31166333: Capture.PNG
Nov 23 2019, 1:59 PM
F31075726: Screenshot from 2019-11-14 12-28-58.png
Nov 14 2019, 5:30 PM
F31075704: image.png
Nov 14 2019, 5:27 PM
Tokens
"Grey Medal" token, awarded by valerio.bozzolan. "Evil Spooky Haunted Tree" token, awarded by geraki.

Description

Recently, banwiki was created (T234768: Create Balinese Wikipedia).
The local name for the template namespace is Mal.

There is currently a page with the title Mal:;
However, trying to visit it via https://ban.wikipedia.org/wiki/Mal:; fails, with the title-invalid-empty message
It is accessible via https://ban.wikipedia.org/w/index.php?curid=2090, but it cannot be moved from there: https://ban.wikipedia.org/wiki/Kusus:Pindahkan_halaman/Mal:; fails with notargettext

See the database row:

MariaDB [banwiki_p]> SELECT * FROM page WHERE page_title = ';';
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|    2090 |             10 | ;          |                   |                0 |           0 | 0.879928170267 | 20191104014040 | 20191104014041     |       25661 |      132 | wikitext           | NULL      |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.01 sec)

Is there a maintenance script or command that can help?

Steps to reproduce

Visit one of these pages:

Or, instead, visit one of these websites and in the search bar type ; and click:

You will be redirected to the homepage, instead of visiting that page.

Reproduced:

  • Wed, Mar 17 2021, 22:54:38 CET, from Italy, both logged in and logged out
  • ...

Served from:

  • mw1366.eqiad.wmnet
  • mw1411.eqiad.wmnet
  • ...

Event Timeline

ema triaged this task as Medium priority. Nov 15 2019, 11:02 AM
ema moved this task from Triage to Caching on the Traffic board.

Hmm, it looks like ATS URL parsing is at fault here. ATS treats the semicolon as a separator between the URL path and the URL params. Even on the initial parsing of the request URL, before remapping, ATS already drops the semicolon:
DEBUG: <URL.cc:1606 (url_describe)> (http) PATH: "wiki/Mal:", PATH_LEN: 9,

and from the source code: https://github.com/apache/trafficserver/blob/master/proxy/hdrs/URL.cc#L1368-L1370

if (*cur == ';') {
  path_end = cur;
  goto parse_params1;
}
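
As a rough illustration of the effect (a Python sketch, not the ATS code; `split_at_params` is a hypothetical name), treating the first semicolon as the end of the path reproduces the truncated path and length from the debug line above:

```python
def split_at_params(path: str) -> tuple[str, str]:
    """Split a URL path at the first ';', RFC 2396-style (path, params)."""
    head, _, params = path.partition(";")
    return head, params


path, params = split_at_params("wiki/Mal:;")
print(repr(path), len(path), repr(params))  # 'wiki/Mal:' 9 ''
```

The trailing semicolon ends up as an empty params component, so nothing of it remains in the path that gets forwarded.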

BTW, Checking RFC 3986, I'm not sure that https://ban.wikipedia.org/wiki/Mal:; is a valid URL where path = /wiki/Mal:;

considering that the RFC lists the semicolon as a reserved character (https://tools.ietf.org/html/rfc3986#section-2.2):

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

@Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B should be perfectly okay and yet I get "bad title".

There's some confusion on historical standards interpretation here, I think. There are some ancient standards that reference the semicolon as a URI-level delimiter (even in ways that might not be application-layer defined), for example https://tools.ietf.org/html/rfc2396#section-1.6 .

However, the modern RFCs (and we're using a pretty loose definition of "modern"; still quite old and well-established) such as RFC 3986 don't treat the semicolon as anything special in the Path part of the URI. (Its presence in sub-delims doesn't make it any more special than other such characters, and interpretation is application-layer-specific within the Path component.)

We've spent some pages of typing on IRC on these topics which I won't echo all of here, but I think the most succinct modern reference that the semicolon shouldn't terminate the Path component of a URI is the end of the first paragraph of https://tools.ietf.org/html/rfc3986#section-3.3 , which states: "The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI."
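
For comparison, a small sketch using Python's standard `urllib.parse.urlsplit`, which follows this RFC 3986 rule and does not split off path parameters. The `?action=history#top` suffix in the second URL is only an illustrative assumption:

```python
from urllib.parse import urlsplit

# The semicolon stays in the path; only "?" or "#" terminate it.
parts = urlsplit("https://ban.wikipedia.org/wiki/Mal:;")
print(parts.path)  # /wiki/Mal:;

# Illustrative query and fragment (not taken from this task).
parts = urlsplit("https://ban.wikipedia.org/wiki/Mal:;?action=history#top")
print(parts.path, parts.query, parts.fragment)  # /wiki/Mal:; action=history top
```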

ATS is certainly unique in its interpretation here, among other modern-ish revproxies we've used. We'll have to loop through ATS developers at this point as well and try to figure out what the reasoning is for ATS's behavior, so this may take a little while to sort out. Worst case, we can patch the ATS URI parser locally if we have to. It's also notable that all of the relevant ATS code for handling the semicolon this way dates back to the initial git commit from 10 years ago...

@Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B should be perfectly okay and yet I get "bad title".

Right, we are currently discussing the issue with ATS developers, I'll update this task as soon as we have some news regarding this issue.

so, I've been doing some tests, and ATS doesn't drop the url-encoded version of the semicolon, so https://ban.wikipedia.org/wiki/Mal:%3B should work. @ema maybe some URL normalization step on varnish/ats-be is messing with us here?

So:

vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v -o /dev/null -H 'X-Forwarded-Proto: https' 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 400 Bad Request
< X-Cache: cp1075 miss, cp1075 pass
< X-Cache-Status: pass
vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v -o /dev/null -H 'X-Forwarded-Proto: https' 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 200 OK
< X-Cache: cp1089 miss, cp1075 pass
< X-Cache-Status: pass

but asking directly to ats-be on cp1075 gives a 200:

vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3128/wiki/Mal:%3B" -v -o /dev/null 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 200 OK
< X-Cache-Int: cp1075 miss

@ema please correct me if I'm wrong, but it looks to me like varnish-fe is url-decoding the semicolon so /wiki/Mal:%3B hits ats-be like /wiki/Mal:; and that messes with ATS.

And that's what ats-backend shows:

vgutierrez@cp1075:~$ sudo -i atslog-backend ban.wikipedia.org
Date:2019-11-19 Time:10:00:36 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:114 OriginServer:appservers-rw.discovery.wmnet OriginServerTime:114 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:400 OriginStatus:400 ReqURL:http://ban.wikipedia.org/wiki/Mal: BereqURL:GET https://appservers-rw.discovery.wmnet/wiki/Mal: HTTP/1.1 ReqHeader:User-Agent:curl/7.52.1 ReqHeader:Host:ban.wikipedia.org ReqHeader:X-Client-IP:127.0.0.1 ReqHeader:Cookie:- RespHeader:X-Cache-Int:cp1075 miss RespHeader:Backend-Timing:D=109906 t=1574157636005580

Well, now all templates (Mal) say "Galat script: no module." What just happened? Nevertheless, it works. For info, I used Google Chrome. I already tried Edge, and the same thing happens.

Capture.PNG (113×336 px, 4 KB)

That's definitely another issue. Created T238998: Mal:Navbox at banwiki says Module not found despite the module exists for that.

Just came across this on enwiki when I couldn't access the page - I had to query the page table and redirect using https://en.wikipedia.org/wiki/Special:Redirect/page/25247567
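
A minimal sketch of the two page-ID-based workaround URLs mentioned in this task (`index.php?curid=` from the description and `Special:Redirect/page/` above). The helper name `page_id_urls` is hypothetical, and the page ID still has to be looked up first, for example from the page table:

```python
def page_id_urls(host: str, page_id: int) -> dict[str, str]:
    """Build URLs that address a page by ID, avoiding the title in the path."""
    return {
        "curid": f"https://{host}/w/index.php?curid={page_id}",
        "redirect": f"https://{host}/wiki/Special:Redirect/page/{page_id}",
    }


print(page_id_urls("ban.wikipedia.org", 2090)["curid"])
print(page_id_urls("en.wikipedia.org", 25247567)["redirect"])
```

Neither form puts the semicolon in the URL path, which is why they sidestep the problem for viewing the page.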

I ran into this when I was unable to block https://en.wikipedia.org/w/index.php?target=Printf%28%22Herro+World%22%29%3B&title=Special:Contributions. All the user links in revision histories, etc., use path-style parameters which produce a broken link such as Special:Block/Printf("Herro_World");. Until this is fixed, we can use AbuseFilter to disallow creation of usernames/pages with semicolons.

@BBlack @ema @Vgutierrez Did you get a response on this from ATS developers? Did we file an issue or something? (I can't find one in https://github.com/apache/trafficserver/issues.)

BTW, Checking RFC 3986, I'm not sure that https://ban.wikipedia.org/wiki/Mal:; is a valid URL where path = /wiki/Mal:;

considering that the RFC lists the semicolon as a reserved character (https://tools.ietf.org/html/rfc3986#section-2.2):

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The semicolon being in this list is precisely what makes it a valid URL, not an invalid one. URLs containing = or & in the popular query string form are not invalid either.

The significance of these characters in the spec is that they must never be liberally percent-encoded or decoded as part of some kind of normalization logic, as they may be used for a special meaning. This is in contrast to e.g. other ASCII characters, which a browser or web server is allowed to proactively encode or decode based on its preferred form. For example, /P/P/, /P/%50/ and /%50/P/ can be considered equivalent (where %50 = P), but /P%2FP/ is definitely something else (where %2F = /).
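
A sketch of that normalization rule in Python, assuming the RFC 3986 "unreserved" set (letters, digits, `-`, `.`, `_`, `~`) is the only thing a traffic layer may decode. `normalize_path` is a hypothetical helper, not something Varnish or ATS provides:

```python
import re
import string

# RFC 3986 section 2.3: characters that are safe to decode.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_path(path: str) -> str:
    """Decode %XX only for unreserved characters; keep all other escapes."""

    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved (or other) character: keep it encoded, uppercase the hex.
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)


print(normalize_path("/P/%50/"))        # /P/P/
print(normalize_path("/%50/P/"))        # /P/P/
print(normalize_path("/P%2FP/"))        # /P%2FP/
print(normalize_path("/wiki/Mal:%3B"))  # /wiki/Mal:%3B
```

Under this rule /wiki/Mal:%3B and /wiki/Mal:; are never rewritten into each other; it is left to the application to decide whether they mean the same thing.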

The use case for ; being reserved is that CGI scripts often used ;a=foo;b=bar instead of the current convention of ?a=foo&b=bar as a way to pass query parameters.

For a URL like https://example/foo/bar?x=y;z=1#zed, the part sent to the server (path plus query) is /foo/bar?x=y;z=1. But whether the slashes, question marks, ampersands or semicolons actually have special significance in any given URI is entirely the concern of the underlying application. The traffic layers should make no assumptions about those. E.g. /Foo? is not equivalent to /Foo, and /Foo?a=1&b=2 is not equivalent to /Foo?b=2&a=1. While an empty query string and the order of params might be insignificant, that's not up to HTTP.

Krinkle updated the task description. (Show Details)

@MusikAnimal do we need a note here so that the title blacklist line gets removed once the fix is implemented?

Simple repro:

krinkle@people1002$ echo -e "Hello world.\n" > 'foo;'
krinkle@people1002$ cat foo\;
Hello world.

krinkle@people1002$ curl -I 'http://localhost/~krinkle/foo;'
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Fri, 25 Sep 2020 20:44:12 GMT
Content-Length: 14
…

$ curl -I 'https://people.wikimedia.org/~krinkle/foo;'
HTTP/2 404 
date: Fri, 25 Sep 2020 20:44:46 GMT
server: Apache
age: 0
x-cache: cp2041 miss, cp2029 pass

I suspect this is likely caused by the same issue in our Varnish or ATS configuration and confirms that Apache can and does serve it without issue.

With the dupe merger, maybe we owe a status update here:

We're pretty sure this is a bug in Apache Traffic Server. There are some obscure and/or interesting things about how that bug came to be, and various HTTP standards, and why an HTTP server would even care about a ; in the first place, but that's all relatively irrelevant. The bottom line is that these URLs should work as-is (without any need for percent-encoding), and we're pretty certain it's ATS that's breaking them. I'm sure we have some backlogged follow-up with upstream ATS to do here.

Aklapper renamed this task from Pages whose title ends with semicolon (;) are intermittently inaccessible to Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS). Feb 13 2021, 5:56 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Note: Pages whose titles contain multiple semicolons (;) are accessible even if they end with a semicolon:
https://el.wikipedia.org/wiki/Από_πού_ερχόμαστε;_Τι_είμαστε;_Πού_πάμε;
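
One possible explanation, sketched in Python purely as an assumption (it has not been confirmed against the ATS source in this task): if the proxy splits the path at the first semicolon and only writes the separator back when the params part is non-empty, then a lone trailing semicolon is lost while titles with several semicolons survive. `forward_path` is a hypothetical name:

```python
def forward_path(path: str) -> str:
    """Model of a proxy that splits at the first ';' and re-joins the parts."""
    head, _, params = path.partition(";")
    # Assumption: the ';' is re-emitted only when there are params to print.
    return f"{head};{params}" if params else head


print(forward_path("/wiki/Mal:;"))  # /wiki/Mal:
title = "/wiki/Από_πού_ερχόμαστε;_Τι_είμαστε;_Πού_πάμε;"
print(forward_path(title) == title)  # True
```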

Some observations:

@BBlack The last status update on this bug was ~18 months ago, and indicated the issue was an upstream bug and you were following up there, with a fallback to a WMF-specific patch if upstream got stuck. I see no indication there is any question this behaviour is a bug (cf. eg. Krinkle's comment above). It's also a problem that makes certain pages inaccessible on all projects, breaks contribution histories and other core features for certain users, and necessitates manually prohibiting a character in page and user names that is intended to be permitted.

IOW: an update would be appreciated.