Pages whose title ends with semicolon (;) are intermittently inaccessible
Open, Medium · Public · BUG REPORT

Description

Recently, banwiki was created (T234768: Create Balinese Wikipedia).
The local name for the template namespace is Mal

There is currently a page with the title Mal:;
However, trying to visit it via https://ban.wikipedia.org/wiki/Mal:; fails, with the title-invalid-empty message
It is accessible via https://ban.wikipedia.org/w/index.php?curid=2090, but it cannot be moved from there: https://ban.wikipedia.org/wiki/Kusus:Pindahkan_halaman/Mal:; fails with notargettext

See the database row:

MariaDB [banwiki_p]> SELECT * FROM page WHERE page_title = ';';
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|    2090 |             10 | ;          |                   |                0 |           0 | 0.879928170267 | 20191104014040 | 20191104014041     |       25661 |      132 | wikitext           | NULL      |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.01 sec)

Is there a maintenance script or command that can help?

Event Timeline

Restricted Application added a project: User-DannyS712. · Nov 14 2019, 2:04 AM
Restricted Application added subscribers: Liuxinyu970226, Aklapper.
DannyS712 moved this task from Unsorted to Reports on the User-DannyS712 board. · Nov 14 2019, 2:14 AM
DannyS712 changed the subtype of this task from "Task" to "Bug Report".
DannyS712 added subscribers: Urbanecm, Ladsgroup, jhsoby.

Content is:
&#59;<noinclude>{{dokumentasi}}<!-- PLEASE ADD THIS TEMPLATE'S CATEGORIES AND INTERWIKIS TO THE /doc SUBPAGE, THANKS --></noinclude>

Maybe the page should just be deleted and recreated?

Cannot reproduce. I can access the URL https://ban.wikipedia.org/wiki/Mal:; in Firefox 70. Which browser and version is this about?

Also see T238276 which seems to be about the same issue.

https://ban.wikipedia.org/wiki/Mal:; in Chrome loads as a web page, but fails to load the actual wiki page

Google Chrome 78.0.3904.97

Firefox 70:

It fails inconsistently for me. It doesn't depend on the browser: I was able to get both of these results in Chrome, and also using curl.

Aklapper renamed this task from Broken page on banwiki to Cannot access page on banwiki which ends with semicolon (;) in Chrome browser. · Nov 14 2019, 7:37 PM
Aklapper renamed this task from Cannot access page on banwiki which ends with semicolon (;) in Chrome browser to Cannot access page on banwiki whose title ends with semicolon (;).

There is nothing wrong with that title. Mal:; is perfectly valid. Similarly, T238276 reports a problem with ;, which is also perfectly valid.

It seems to me that the semicolon is somehow being dropped when processing the request. Therefore requests for Mal:; instead return the results for Mal: (which is invalid, explaining the error messages seen here), and requests for ; instead return the results for an empty title (which redirects to the main page, explaining the behavior in T238276).

matmarex renamed this task from Cannot access page on banwiki whose title ends with semicolon (;) to Pages whose title ends with semicolon (;) are intermittently inaccessible. · Nov 14 2019, 7:49 PM
matmarex added a subscriber: Wargo.

This affects all wikis. I could also reproduce with https://en.wikipedia.org/wiki/;, which will either redirect to the article about semicolons, or the main page.

I ran curl -I "https://ban.wikipedia.org/wiki/Mal:;" in a loop for a while to see if this affects particular servers.

Out of 307 attempts, 82 returned HTTP 200, and all of those were served by mw1273.eqiad.wmnet.

The remaining 225 attempts returned HTTP 400; they were served by 49 different servers (including mw1273, actually).

So mw1273 sometimes works correctly, and everything else is broken.
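
A tally like the one above can be reproduced with a short script. The following is a minimal sketch, not the loop that was actually run: it assumes the `server` response header names the app server (as it does in the header dumps in this task), and `probe`, `tally` and `run` are hypothetical helper names.

```python
from collections import Counter
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def probe(url):
    """Send one HEAD request and return (status code, server header)."""
    req = Request(url, method="HEAD", headers={"User-Agent": "semicolon-probe/0.1"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status, resp.headers.get("server", "?")
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the status and headers are on the exception.
        return err.code, err.headers.get("server", "?")


def tally(results):
    """Count how often each (status, server) pair was seen."""
    return Counter(results)


def run(url, attempts):
    """Probe the URL repeatedly and print one line per (status, server) pair."""
    counts = tally(probe(url) for _ in range(attempts))
    for (status, server), n in sorted(counts.items()):
        print(status, server, n)
```

Calling `run("https://ban.wikipedia.org/wiki/Mal:;", 300)` would print the per-server breakdown; the attempt count is arbitrary.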

Restricted Application added a project: Operations. · Nov 14 2019, 7:56 PM
Ladsgroup added subscribers: ema, BBlack. · Edited · Nov 14 2019, 8:05 PM

I think this has something to do with the differences between how ATS and Varnish normalize request URLs (if they do any normalization). @ema @BBlack Can you double check?

Yeah, this seems suspiciously likely. As for the inconsistency of results: currently (as of this writing), requests arriving via our Singapore and San Francisco edges are handled entirely by ATS, and Amsterdam has been in transition lately and is currently 75% converted (so different users will potentially see different things), while the core US sites are still mostly Varnish. We'll look into this!

I think this has something to do with the differences between how ATS and Varnish normalize request URLs (if they do any normalization). @ema @BBlack Can you double check?

Good call. When it comes to ATS vs Varnish backends, I'm getting HTTP 400 with both.

cp3052 (ATS):

HTTP/2 400
date: Fri, 15 Nov 2019 10:15:40 GMT
content-type: text/html; charset=UTF-8
server: mw1327.eqiad.wmnet
x-powered-by: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
[..]
x-cache: cp3052 miss, cp3052 pass

cp3064/cp1089 (Varnish):

HTTP/2 400
date: Fri, 15 Nov 2019 10:15:49 GMT
server: mw1328.eqiad.wmnet
x-powered-by: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
[...]
x-cache: cp1089 pass, cp3064 pass, cp3052 pass

However, the TLS terminator used by cp3052 is ATS, while for instance on cp2010 we run nginx and I get a 200 there:

HTTP/2 200 
date: Fri, 15 Nov 2019 10:22:50 GMT
content-type: text/html; charset=UTF-8
server: mw1273.eqiad.wmnet
x-powered-by: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
[...]
x-cache: cp1085 hit/4, cp2006 miss, cp2010 hit/4

@Vgutierrez do you think ats-tls vs nginx can make a difference here?

ema triaged this task as Medium priority. · Fri, Nov 15, 11:02 AM
ema moved this task from Triage to Caching on the Traffic board.

Hmm, it looks like ATS URL parsing is at fault here. ATS uses a semicolon as a separator between the URL path and the URL params. Even on the initial parsing of the request URL, before remapping, ATS already drops the semicolon:
DEBUG: <URL.cc:1606 (url_describe)> (http) PATH: "wiki/Mal:", PATH_LEN: 9,

and from the source code: https://github.com/apache/trafficserver/blob/master/proxy/hdrs/URL.cc#L1368-L1370

if (*cur == ';') {
  path_end = cur;
  goto parse_params1;
}
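
To make the effect of that branch concrete, here is a rough Python model of it. It only illustrates the quoted `if` and is not a port of the ATS parser; `ats_like_split` is a hypothetical name.

```python
def ats_like_split(path):
    """Model of the quoted branch: the first ';' ends the path,
    and whatever follows it is treated as URL params."""
    head, _sep, params = path.partition(";")
    return head, params


path, params = ats_like_split("wiki/Mal:;")
print(repr(path), len(path))  # 'wiki/Mal:' 9
print(repr(params))           # ''
```

The truncated path and its length match the `PATH: "wiki/Mal:", PATH_LEN: 9` debug line above.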

BTW, checking RFC 3986, I'm not sure that https://ban.wikipedia.org/wiki/Mal:; is a valid URL where path = /wiki/Mal:;, considering that the RFC lists the semicolon as a reserved character (https://tools.ietf.org/html/rfc3986#section-2.2):

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

@Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B should be perfectly okay and yet I get "bad title".

BBlack added a comment. · Edited · Fri, Nov 15, 12:47 PM

There's some confusion about the interpretation of historical standards here, I think. There are some ancient standards that reference the semicolon as a URI-level delimiter (even in ways that might not be application-layer defined), for example https://tools.ietf.org/html/rfc2396#section-1.6 .

However, the modern RFCs (and we're using a pretty loose definition of "modern"; still quite old and well-established) such as RFC 3986 don't treat the semicolon as anything special in the Path part of the URI. (Its presence in sub-delims doesn't make it any more special than other such characters, and interpretation is application-layer-specific within the Path component.)

We've spent some pages of typing on IRC on these topics, which I won't echo in full here, but I think the most succinct modern reference showing that the semicolon shouldn't terminate the Path component of a URI is the end of the first paragraph of https://tools.ietf.org/html/rfc3986#section-3.3 , which states: "The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI."
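
As an illustration of that rule, the generic URI-splitting regular expression from RFC 3986, Appendix B can be applied to the URL from this task. This is a minimal sketch; `rfc3986_path` is a hypothetical helper name, and the second URL (with a query and fragment) is made up for contrast.

```python
import re

# Regular expression given in RFC 3986, Appendix B. Group 5 is the path.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def rfc3986_path(uri):
    """Return the path component according to the Appendix B regex."""
    return URI_RE.match(uri).group(5)


# Only '?' or '#' (or the end of the URI) terminates the path, so ';' stays in it.
print(rfc3986_path("https://ban.wikipedia.org/wiki/Mal:;"))        # /wiki/Mal:;
print(rfc3986_path("https://ban.wikipedia.org/wiki/Mal:;?a=b#c"))  # /wiki/Mal:;
```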

ATS is certainly unique in its interpretation here, among other modern-ish revproxies we've used. We'll have to loop in the ATS developers at this point as well and try to figure out what the reasoning is for ATS's behavior, so this may take a little while to sort out. Worst case, we can patch the ATS URI parser locally if we have to. It's also notable that all of the relevant ATS code for handling the semicolon this way dates back to the initial git commit from 10 years ago...

@Vgutierrez I guess what you quoted wouldn't be valid, but https://ban.wikipedia.org/wiki/Mal:%3B should be perfectly okay and yet I get "bad title".

Right, we are currently discussing the issue with ATS developers, I'll update this task as soon as we have some news regarding this issue.

So, I've been doing some tests, and ATS doesn't drop the URL-encoded version of the semicolon, so https://ban.wikipedia.org/wiki/Mal:%3B should work. @ema maybe some URL normalization step on varnish/ats-be is messing with us here?

So:

vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v -o /dev/null -H 'X-Forwarded-Proto: https' 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 400 Bad Request
< X-Cache: cp1075 miss, cp1075 pass
< X-Cache-Status: pass
vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v -o /dev/null -H 'X-Forwarded-Proto: https' 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 200 OK
< X-Cache: cp1089 miss, cp1075 pass
< X-Cache-Status: pass

But asking ats-be on cp1075 directly gives a 200:

vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3128/wiki/Mal:%3B" -v -o /dev/null 2>&1 |egrep "X-Cache|HTTP/1.1 (200|400)"
< HTTP/1.1 200 OK
< X-Cache-Int: cp1075 miss

@ema please correct me if I'm wrong, but it looks to me like varnish-fe is url-decoding the semicolon so /wiki/Mal:%3B hits ats-be like /wiki/Mal:; and that messes with ATS.

And that's what ats-backend shows:

vgutierrez@cp1075:~$ sudo -i atslog-backend ban.wikipedia.org
Date:2019-11-19 Time:10:00:36 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:114 OriginServer:appservers-rw.discovery.wmnet OriginServerTime:114 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:400 OriginStatus:400 ReqURL:http://ban.wikipedia.org/wiki/Mal: BereqURL:GET https://appservers-rw.discovery.wmnet/wiki/Mal: HTTP/1.1 ReqHeader:User-Agent:curl/7.52.1 ReqHeader:Host:ban.wikipedia.org ReqHeader:X-Client-IP:127.0.0.1 ReqHeader:Cookie:- RespHeader:X-Cache-Int:cp1075 miss RespHeader:Backend-Timing:D=109906 t=1574157636005580
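
The suspected chain can be modeled in a few lines. This is a sketch of the hypothesis in this comment, not confirmed behavior of either layer: it assumes the first hop percent-decodes the path and the second hop cuts it at the first semicolon, and `decode_then_truncate` is a hypothetical name.

```python
from urllib.parse import unquote


def decode_then_truncate(request_path):
    """Hypothetical two-hop model of the request path.

    Hop 1 (assumed front-end behavior): percent-decode the path.
    Hop 2 (assumed ATS behavior): cut the path at the first ';'.
    """
    decoded = unquote(request_path)
    truncated = decoded.partition(";")[0]
    return decoded, truncated


print(decode_then_truncate("/wiki/Mal:%3B"))  # ('/wiki/Mal:;', '/wiki/Mal:')
```

The truncated result matches the `/wiki/Mal:` seen in `ReqURL` and `BereqURL` in the log line above.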

Well, now all templates (Mal) say "Galat script: no module." What just happened? Nevertheless, it works. For info, I used Google Chrome. I already tried Edge, and the same thing happens.

That's definitely another issue. Created T238998: Mal:Navbox at banwiki says Module not found despite the module exists for that.

DannyS712 added a comment. · Edited · Thu, Nov 28, 12:06 PM

Just came across this on enwiki when I couldn't access the page - I had to query the page table and redirect using https://en.wikipedia.org/wiki/Special:Redirect/page/25247567
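
Until this is fixed, affected pages can be reached by page ID instead of by title, using the two URL forms already mentioned in this task. A small sketch that builds both; `page_id_urls` is a hypothetical helper, and the page ID still has to be looked up first (for example with the `page` table query in the description).

```python
def page_id_urls(host, page_id):
    """Build title-free URLs for a page, given its numeric page ID."""
    return {
        "curid": f"https://{host}/w/index.php?curid={page_id}",
        "redirect": f"https://{host}/wiki/Special:Redirect/page/{page_id}",
    }


print(page_id_urls("ban.wikipedia.org", 2090)["curid"])
# https://ban.wikipedia.org/w/index.php?curid=2090
print(page_id_urls("en.wikipedia.org", 25247567)["redirect"])
# https://en.wikipedia.org/wiki/Special:Redirect/page/25247567
```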