
Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS)
Open, Medium, Public, BUG REPORT

Description

Recently, banwiki was created (T234768: Create Balinese Wikipedia).
The local name for the template namespace is Mal.

There is currently a page with the title Mal:;
However, trying to visit it via https://ban.wikipedia.org/wiki/Mal:; fails, with the title-invalid-empty message
It is accessible via https://ban.wikipedia.org/w/index.php?curid=2090, but it cannot be moved from there: https://ban.wikipedia.org/wiki/Kusus:Pindahkan_halaman/Mal:; fails with notargettext

See the database row:

MariaDB [banwiki_p]> SELECT * FROM page WHERE page_title = ';';
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|    2090 |             10 | ;          |                   |                0 |           0 | 0.879928170267 | 20191104014040 | 20191104014041     |       25661 |      132 | wikitext           | NULL      |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.01 sec)

Is there a maintenance script or command that can help?

Steps to reproduce

Visit one of these pages:

Or, instead, visit one of these websites and in the search bar type ; and click:

You will be redirected to the homepage, instead of visiting that page.

Reproduced:

  • Wed, Mar 17 2021, 22:54:38 CET, from Italy, both logged in and logged out
  • ...

Served from:

  • mw1366.eqiad.wmnet
  • mw1411.eqiad.wmnet
  • ...

Event Timeline

Aklapper renamed this task from Broken page on banwiki to Cannot access page on banwiki which ends with semicolon (;) in Chrome browser. Nov 14 2019, 7:37 PM
Aklapper renamed this task from Cannot access page on banwiki which ends with semicolon (;) in Chrome browser to Cannot access page on banwiki whose title ends with semicolon (;).

There is nothing wrong with that title. Mal:; is perfectly valid. Similarly, T238276 reports a problem with ;, which is also perfectly valid.

It seems to me that the semicolon is somehow being dropped when processing the request. Therefore requests for Mal:; instead return the results for Mal: (which is invalid, explaining the error messages seen here), and requests for ; instead return the results for empty title (which redirects to the main page, explaining the behaviors in T238276).

matmarex renamed this task from Cannot access page on banwiki whose title ends with semicolon (;) to Pages whose title ends with semicolon (;) are intermittently inaccessible. Nov 14 2019, 7:49 PM
matmarex added a subscriber: Wargo.

This affects all wikis. I could also reproduce with https://en.wikipedia.org/wiki/;, which will either redirect to the article about semicolons, or the main page.

I ran curl -I "https://ban.wikipedia.org/wiki/Mal:;" in a loop for a while to see if this affects particular servers.

Out of 307 attempts, 82 returned HTTP 200, and all of those were served by mw1273.eqiad.wmnet.

The remaining 225 attempts returned HTTP 400; they were served by 49 different servers (including mw1273, actually).

So mw1273 sometimes works correctly, and everything else is broken.
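
A tally like the one above can be reproduced offline from saved `curl -I` output. This is only a minimal sketch, assuming each response is kept as a text block with a status line and a `server:` header, as in the header dumps elsewhere in this task. The function name `tally_responses` and the sample blocks are made up for illustration; they are not the real measurements.

```python
from collections import Counter


def tally_responses(header_blocks):
    """Count (status, server) pairs from raw HTTP response header blocks."""
    counts = Counter()
    for block in header_blocks:
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        # The status line looks like "HTTP/2 400" or "HTTP/1.1 200 OK".
        status = int(lines[0].split()[1])
        server = "unknown"
        for ln in lines[1:]:
            name, _, value = ln.partition(":")
            if name.strip().lower() == "server":
                server = value.strip()
                break
        counts[(status, server)] += 1
    return counts


# Synthetic sample (hypothetical), NOT the data from the loop described above.
sample = [
    "HTTP/2 200\nserver: mw1273.eqiad.wmnet\n",
    "HTTP/2 400\nserver: mw1327.eqiad.wmnet\n",
    "HTTP/2 400\nserver: mw1273.eqiad.wmnet\n",
]
counts = tally_responses(sample)
for (status, server), n in sorted(counts.items()):
    print(status, server, n)
```

Grouping by status and server together makes it easy to spot a server that appears under both 200 and 400, as mw1273 did.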

I think this has something to do with the differences between ATS and Varnish normalizing request URLs (if they do any). @ema @BBlack Can you double check?

Yeah, this seems suspiciously likely. As for the inconsistency of results: as of this writing, requests arriving via our Singapore and San Francisco edges are handled entirely by ATS; Amsterdam has been in transition lately and is currently 75% converted (so different users will potentially see different things); and the core US sites are still mostly Varnish. We'll look into this!

I think this has something to do with the differences between ATS and Varnish normalizing request URLs (if they do any). @ema @BBlack Can you double check?

Good call. When it comes to ATS vs Varnish backends, I'm getting HTTP 400 with both.

cp3052 (ATS):

HTTP/2 400
date: Fri, 15 Nov 2019 10:15:40 GMT
content-type: text/html; charset=UTF-8
server: mw1327.eqiad.wmnet
x-powered-by: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
[...]
x-cache: cp3052 miss, cp3052 pass

cp3064/cp1089 (Varnish):

HTTP/2 400
date: Fri, 15 Nov 2019 10:15:49 GMT
server: mw1328.eqiad.wmnet
x-powered-by: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
[...]
x-cache: cp1089 pass, cp3064 pass, cp3052 pass

However, the TLS terminator used by cp3052 is ATS, while for instance on cp2010 we run nginx and I get a 200 there:

HTTP/2 200 
date: Fri, 15 Nov 2019 10:22:50 GMT
content-type: text/html; charset=UTF-8
server: mw1273.eqiad.wmnet
x-powered-by: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
[...]
x-cache: cp1085 hit/4, cp2006 miss, cp2010 hit/4

@Vgutierrez do you think ats-tls vs nginx can make a difference here?

ema triaged this task as Medium priority. Nov 15 2019, 11:02 AM
ema moved this task from Triage to Caching on the Traffic board.

Hmm, it looks like ATS URL parsing is at fault here. ATS uses a semicolon as a separator between the URL path and the URL params. Even on the initial parsing of the request URL, before remapping, ATS already drops the semicolon:
DEBUG: <URL.cc:1606 (url_describe)> (http) PATH: "wiki/Mal:", PATH_LEN: 9,

and from the source code: https://github.com/apache/trafficserver/blob/master/proxy/hdrs/URL.cc#L1368-L1370

if (*cur == ';') {
  path_end = cur;
  goto parse_params1;
}
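
To make the effect of that branch concrete, here is a simplified model in Python. It is not the ATS parser: the two function names are hypothetical, and the only behaviour modelled is which character ends the path. The leading slash is kept for readability, whereas the debug line above shows ATS storing the path without it.

```python
def path_semicolon_terminated(request_target):
    """Model of the quoted branch: the path ends at the first ';', '?' or '#'."""
    for i, ch in enumerate(request_target):
        if ch in ";?#":
            return request_target[:i]
    return request_target


def path_rfc3986(request_target):
    """RFC 3986 section 3.3: only '?' or '#' (or the end of the URI) ends the path."""
    for i, ch in enumerate(request_target):
        if ch in "?#":
            return request_target[:i]
    return request_target


for target in ("/wiki/Mal:;", "/wiki/;"):
    print(target, path_semicolon_terminated(target), path_rfc3986(target))
```

Under the first model, `/wiki/Mal:;` turns into `/wiki/Mal:` and `/wiki/;` turns into `/wiki/`, which matches the invalid-title error and the main-page redirect reported in this task.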

BTW, Checking RFC 3986, I'm not sure that https://ban.wikipedia.org/wiki/Mal:; is a valid URL where path = /wiki/Mal:;

considering that the RFC lists the semicolon as a reserved character (https://tools.ietf.org/html/rfc3986#section-2.2):

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

@Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B should be perfectly okay and yet I get "bad title".

There's some confusion on historical standards interpretation here, I think. There are some ancient standards that reference the semicolon as a URI-level delimiter (even in ways that might not be application-layer defined), for example https://tools.ietf.org/html/rfc2396#section-1.6 .

However, the modern RFCs (and we're using a pretty loose definition of "modern"; they are still quite old and well-established) such as RFC 3986 don't treat the semicolon as anything special in the Path part of the URI. (Its presence in sub-delims doesn't make it any more special than other such characters, and its interpretation is application-layer-specific within the Path component.)

We've spent some pages of typing on IRC on these topics which I won't echo all of here, but I think the most succinct modern reference that the semicolon shouldn't terminate the Path component of a URI is the end of the first paragraph of https://tools.ietf.org/html/rfc3986#section-3.3 , which states: "The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI."
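
As an aside, the two interpretations can be seen side by side in Python's standard library, which is unrelated to ATS and used here only as an illustration. `urllib.parse.urlsplit` leaves semicolons in the path, while the older-style `urllib.parse.urlparse` additionally splits a `;params` part off the last path segment.

```python
from urllib.parse import urlparse, urlsplit

url = "https://ban.wikipedia.org/wiki/Mal:;"

# urlsplit does not split off ";params": the semicolon stays in the path.
modern = urlsplit(url)
print(repr(modern.path))

# urlparse splits ";params" off the last path segment, so the path loses
# its trailing semicolon and the params component is an empty string.
legacy = urlparse(url)
print(repr(legacy.path), repr(legacy.params))
```

A parser that behaves like `urlparse` and then forwards only the path component would lose the semicolon in the same way as described in this task.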

ATS is certainly unique in its interpretation here, among other modern-ish revproxies we've used. We'll have to loop in the ATS developers at this point as well and try to figure out what the reasoning is for ATS's behavior, so this may take a little while to sort out. Worst case, we can patch the ATS URI parser locally if we have to. It's also notable that all of the relevant ATS code for handling the semicolon this way dates back to the initial git commit from 10 years ago...

@Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B should be perfectly okay and yet I get "bad title".

Right, we are currently discussing the issue with ATS developers, I'll update this task as soon as we have some news regarding this issue.

So, I've been doing some tests, and ATS doesn't drop the URL-encoded version of the semicolon, so https://ban.wikipedia.org/wiki/Mal:%3B should work. @ema, maybe some URL normalization step on varnish/ats-be is messing with us here?

So:

vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v -o /dev/null -H 'X-Forwarded-Proto: https' 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 400 Bad Request
< X-Cache: cp1075 miss, cp1075 pass
< X-Cache-Status: pass
vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v -o /dev/null -H 'X-Forwarded-Proto: https' 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 200 OK
< X-Cache: cp1089 miss, cp1075 pass
< X-Cache-Status: pass

but asking directly to ats-be on cp1075 gives a 200:

vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3128/wiki/Mal:%3B" -v -o /dev/null 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 200 OK
< X-Cache-Int: cp1075 miss

@ema please correct me if I'm wrong, but it looks to me like varnish-fe is url-decoding the semicolon so /wiki/Mal:%3B hits ats-be like /wiki/Mal:; and that messes with ATS.

And that's what ats-backend shows:

vgutierrez@cp1075:~$ sudo -i atslog-backend ban.wikipedia.org
Date:2019-11-19 Time:10:00:36 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:114 OriginServer:appservers-rw.discovery.wmnet OriginServerTime:114 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:400 OriginStatus:400 ReqURL:http://ban.wikipedia.org/wiki/Mal: BereqURL:GET https://appservers-rw.discovery.wmnet/wiki/Mal: HTTP/1.1 ReqHeader:User-Agent:curl/7.52.1 ReqHeader:Host:ban.wikipedia.org ReqHeader:X-Client-IP:127.0.0.1 ReqHeader:Cookie:- RespHeader:X-Cache-Int:cp1075 miss RespHeader:Backend-Timing:D=109906 t=1574157636005580
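
The hypothesis above can be sketched as a two-step pipeline. Both steps are assumptions made for illustration: `frontend_decode` stands in for whatever decoding varnish-fe may be doing, and `backend_truncate_at_semicolon` stands in for the ATS behaviour discussed earlier. Neither is real Varnish or ATS code.

```python
from urllib.parse import unquote


def frontend_decode(path):
    """Hypothetical front-end step: percent-decode the whole path."""
    return unquote(path)


def backend_truncate_at_semicolon(path):
    """Hypothetical back-end step: treat the first ';' as the end of the path."""
    return path.split(";", 1)[0]


requested = "/wiki/Mal:%3B"

# Straight to the back end: there is no literal ';', so nothing is dropped.
direct = backend_truncate_at_semicolon(requested)

# Through the front end first: %3B becomes ';' and is then dropped.
via_frontend = backend_truncate_at_semicolon(frontend_decode(requested))

print(direct)
print(via_frontend)
```

In this model the direct request keeps `/wiki/Mal:%3B`, while the request that passes through the decoding step ends up as `/wiki/Mal:`, the same path as the ReqURL in the log line above.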

Well, now all templates (Mal) say "Galat script: no module." What just happened? Nevertheless, it works. For info, I used Google Chrome. I already tried Edge, and the same thing happens.

Capture.PNG (113×336 px, 4 KB)

That's definitely another issue. Created T238998: Mal:Navbox at banwiki says Module not found despite the module exists for that.

Just came across this on enwiki when I couldn't access the page - I had to query the page table and redirect using https://en.wikipedia.org/wiki/Special:Redirect/page/25247567
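
For anyone else needing the same workaround, the two ID-based URL forms mentioned in this task can be built from a page_id. A small sketch follows; the helper name `page_id_urls` is made up, and the URL patterns are just the ones quoted in the comments here.

```python
def page_id_urls(host, page_id):
    """Build the ID-based URLs mentioned in this task for a given wiki host."""
    page_id = int(page_id)  # reject anything that is not a plain integer ID
    return {
        "curid": f"https://{host}/w/index.php?curid={page_id}",
        "special_redirect": f"https://{host}/wiki/Special:Redirect/page/{page_id}",
    }


urls = page_id_urls("ban.wikipedia.org", 2090)
print(urls["curid"])
print(urls["special_redirect"])
```

Neither form puts the title in the URL path, so there is no semicolon there to be dropped.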

I ran into this when I was unable to block https://en.wikipedia.org/w/index.php?target=Printf%28%22Herro+World%22%29%3B&title=Special:Contributions. All the user links in revision histories, etc., use path-style parameters which produce a broken link such as Special:Block/Printf("Herro_World");. Until this is fixed, we can use AbuseFilter to disallow creation of usernames/pages with semicolons.

@BBlack @ema @Vgutierrez Did you get a response on this from ATS developers? Did we file an issue or something? (I can't find one in https://github.com/apache/trafficserver/issues.)

BTW, Checking RFC 3986, I'm not sure that https://ban.wikipedia.org/wiki/Mal:; is a valid URL where path = /wiki/Mal:;

considering that the RFC lists the semicolon as a reserved character (https://tools.ietf.org/html/rfc3986#section-2.2):

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The semi-colon being in this list is precisely what makes it a valid URL, not an invalid one. URLs containing = or & in the popular query string form are not invalid either.

The significance of these characters in the spec is that they must never be liberally percent-encoded or decoded as part of some kind of normalization logic, as they may be used for a special meaning. This is in contrast to e.g. other ASCII characters, which a browser or web server is allowed to proactively encode or decode based on its preferred form. For example, /P/P/, /P/%50/ and /%50/P/ can be considered equivalent (where %50 = P), but /P%2FP/ is definitely something else (where %2F = /).
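
That rule can be sketched as a normalizer that decodes percent-escapes only when they stand for unreserved characters (RFC 3986 section 2.3) and leaves every other escape alone apart from uppercasing its hex digits. This is an illustrative sketch with a made-up function name, `normalize_path`, not what any of our traffic layers run.

```python
import re

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path):
    """Decode only escapes of unreserved characters; keep all other escapes."""

    def replace(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        # Reserved or other characters stay encoded, with uppercase hex digits.
        return "%" + match.group(1).upper()

    return PCT_ESCAPE.sub(replace, path)


for p in ("/P/%50/", "/%50/P/", "/P%2FP/", "/wiki/Mal:%3B"):
    print(p, normalize_path(p))
```

With this rule, /P/%50/ and /%50/P/ both normalize to /P/P/, while /P%2FP/ and /wiki/Mal:%3B come out unchanged. Going the other way, a literal ; is never turned into %3B either.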

The use case for ; being reserved is that CGI scripts often used ;a=foo;b=bar instead of the current convention of ?a=foo&b=bar as a way to pass query parameters.

For a URL like https://example/foo/bar?x=y;z=1#zed, what the server receives is /foo/bar?x=y;z=1 (strictly, the path /foo/bar plus the query x=y;z=1; the fragment is not sent). But whether the slashes, question marks, ampersands or semi-colons actually have special significance in any given URI is entirely the concern of the underlying application. The traffic layers should make no assumptions about those. E.g. /Foo? is not equivalent to /Foo, and /Foo?a=1&b=2 is not equivalent to /Foo?b=2&a=1. While an empty query string and the order of params might be insignificant to a given application, that's not up to HTTP.

Krinkle updated the task description. (Show Details)

I ran into this when I was unable to block https://en.wikipedia.org/w/index.php?target=Printf%28%22Herro+World%22%29%3B&title=Special:Contributions. All the user links in revision histories, etc., use path-style parameters which produce a broken link such as Special:Block/Printf("Herro_World");. Until this is fixed, we can use AbuseFilter to disallow creation of usernames/pages with semicolons.

@MusikAnimal do we need a note here so that the line gets removed from the title blacklist once the fix is implemented?

Simple repro:

krinkle@people1002$ echo -e "Hello world.\n" > 'foo;'
krinkle@people1002$ cat foo\;
Hello world.

krinkle@people1002$ curl -I 'http://localhost/~krinkle/foo;'
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Fri, 25 Sep 2020 20:44:12 GMT
Content-Length: 14
…

$ curl -I 'https://people.wikimedia.org/~krinkle/foo;'
HTTP/2 404 
date: Fri, 25 Sep 2020 20:44:46 GMT
server: Apache
age: 0
x-cache: cp2041 miss, cp2029 pass

I suspect this is caused by the same issue in our Varnish or ATS configuration. It also confirms that Apache can and does serve the file without issue.

With the dupe merger, maybe we owe a status update here:

We're pretty sure this is a bug in Apache Traffic Server. There are some obscure and/or interesting things about how that bug came to be, and various HTTP standards, and why an HTTP server would even care about a ; in the first place, but that's all relatively irrelevant. The bottom line is that these URLs should work as-is (without any need for percent-encoding), and we're pretty certain it's ATS that's breaking them. We have some backlogged followup with upstream ATS to do here, I'm sure.

Aklapper renamed this task from Pages whose title ends with semicolon (;) are intermittently inaccessible to Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS). Feb 13 2021, 5:56 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!