Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS)
Closed, Resolved | Public | BUG REPORT

Assigned To
None
Authored By
DannyS712
Nov 14 2019, 2:04 AM
Referenced Files
F31166333: Capture.PNG
Nov 23 2019, 1:59 PM
F31075726: Screenshot from 2019-11-14 12-28-58.png
Nov 14 2019, 5:30 PM
F31075704: image.png
Nov 14 2019, 5:27 PM

Description

Recently, banwiki was created (T234768: Create Balinese Wikipedia).
The local name for the template namespace is Mal.

There is currently a page with the title Mal:;
However, trying to visit it via https://ban.wikipedia.org/wiki/Mal:; fails, with the title-invalid-empty message
It is accessible via https://ban.wikipedia.org/w/index.php?curid=2090, but it cannot be moved from there: https://ban.wikipedia.org/wiki/Kusus:Pindahkan_halaman/Mal:; fails with notargettext

See the database row:

MariaDB [banwiki_p]> SELECT * FROM page WHERE page_title = ';';
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|    2090 |             10 | ;          |                   |                0 |           0 | 0.879928170267 | 20191104014040 | 20191104014041     |       25661 |      132 | wikitext           | NULL      |
+---------+----------------+------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.01 sec)

Is there a maintenance script or command that can help?

Steps to reproduce

Visit one of these pages:

Or, instead, visit one of these websites and in the search bar type ; and click:

You will be redirected to the homepage, instead of visiting that page.

Reproduced:

  • Wed, Mar 17 2021, 22:54:38 CET, from Italy, both logged in and logged out
  • ...

Served from:

  • mw1366.eqiad.wmnet
  • mw1411.eqiad.wmnet
  • ...

Event Timeline

BTW, Checking RFC 3986, I'm not sure that https://ban.wikipedia.org/wiki/Mal:; is a valid URL where path = /wiki/Mal:;

considering that the RFC lists the semicolon as a reserved character (https://tools.ietf.org/html/rfc3986#section-2.2):

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The semi-colon being in this list is precisely what makes it a valid URL, not an invalid one. URLs containing = or & in the popular query string form are not invalid either.

The significance of these characters in the spec is that they must never be liberally percent-encoded or decoded as part of some kind of normalization logic, as they may be used for a special meaning. This is in contrast to, e.g., unreserved ASCII characters, which a browser or web server is allowed to proactively encode or decode based on its preferred form. For example, /P/P/, /P/%50/ and /%50/P/ can be considered equivalent (where %50 = P), but /P%2FP/ is definitely something else (where %2F = /).
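
To make the equivalence rule concrete, here is a minimal Python sketch of a normalizer that only decodes percent-escapes of unreserved characters (RFC 3986 section 2.3) and leaves every other escape alone. The helper name normalize_path is made up for illustration; this is not code from Varnish, ATS or MediaWiki.

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path):
    """Hypothetical helper: decode escapes of unreserved characters only."""

    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved (and any other) escapes stay encoded; only the hex
        # digits are uppercased, which RFC 3986 treats as equivalent.
        return "%" + match.group(1).upper()

    return _PCT_ESCAPE.sub(repl, path)


print(normalize_path("/P/%50/"))        # escape of "P" is decoded
print(normalize_path("/P%2FP/"))        # escape of "/" is kept
print(normalize_path("/wiki/Mal:%3B"))  # escape of ";" is kept
```

Under this rule %3B and a literal ; are never folded into each other, which is the property the comment above says the traffic layers should preserve.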

The use case for ; being reserved is that CGI scripts often used ;a=foo;b=bar instead of the current convention of ?a=foo&b=bar as a way to pass query parameters.

For a URL like https://example/foo/bar?x=y;z=1#zed the request target sent to the server is /foo/bar?x=y;z=1 (the path /foo/bar plus the query x=y;z=1; the fragment is never sent). But whether the slashes, question marks, ampersands or semi-colons actually have special significance in any given URI is entirely the concern of the underlying application. The traffic layers should make no assumptions about those. E.g. /Foo? is not equivalent to /Foo, and /Foo?a=1&b=2 is not equivalent to /Foo?b=2&a=1. While an empty query string and the order of params might be insignificant, that's not up to HTTP.
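
As a rough illustration of that split in responsibilities, the Python sketch below uses the standard urllib.parse module to break the example URL into its RFC 3986 components, then contrasts a byte-for-byte comparison (what a traffic layer can safely assume) with an application-level comparison. The functions same_for_proxy and same_for_app are hypothetical, and the "parameter order does not matter" policy is just one example of what an application might decide.

```python
from urllib.parse import parse_qsl, urlsplit

parts = urlsplit("https://example/foo/bar?x=y;z=1#zed")
print(parts.path)      # path component
print(parts.query)     # query component; the semicolon is left untouched
print(parts.fragment)  # fragment, which is never sent to the server


def same_for_proxy(a, b):
    """A traffic layer may only treat byte-identical targets as the same."""
    return a == b


def same_for_app(a, b):
    """Hypothetical application policy: ignore parameter order and an
    empty query string. Only the application can know this is safe."""
    pa, pb = urlsplit(a), urlsplit(b)
    return pa.path == pb.path and (
        sorted(parse_qsl(pa.query)) == sorted(parse_qsl(pb.query))
    )


print(same_for_proxy("/Foo?a=1&b=2", "/Foo?b=2&a=1"))
print(same_for_app("/Foo?a=1&b=2", "/Foo?b=2&a=1"))
```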

Krinkle updated the task description.

I ran into this when I was unable to block https://en.wikipedia.org/w/index.php?target=Printf%28%22Herro+World%22%29%3B&title=Special:Contributions. All the user links in revision histories, etc., use path-style parameters which produce a broken link such as Special:Block/Printf("Herro_World");. Until this is fixed, we can use AbuseFilter to disallow creation of usernames/pages with semicolons.

@MusikAnimal do we need a note here so that the title blacklist line gets removed once the fix is implemented?

Simple repro:

krinkle@people1002$ echo -e "Hello world.\n" > 'foo;'
krinkle@people1002$ cat foo\;
Hello world.

krinkle@people1002$ curl -I 'http://localhost/~krinkle/foo;'
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Fri, 25 Sep 2020 20:44:12 GMT
Content-Length: 14
…

$ curl -I 'https://people.wikimedia.org/~krinkle/foo;'
HTTP/2 404 
date: Fri, 25 Sep 2020 20:44:46 GMT
server: Apache
age: 0
x-cache: cp2041 miss, cp2029 pass

I suspect this is caused by the same issue in our Varnish or ATS configuration. The repro confirms that Apache itself can and does serve such a URL without issue.

With the dupe merger, maybe we owe a status update here:

We're pretty sure this is a bug in Apache Traffic Server. There's some obscure and/or interesting things about how that bug came to be, and various HTTP standards, and why an HTTP server would even care about a ; in the first place, but that's all relatively-irrelevant. The bottom line is that these URLs should work as-is (without any need for percent-encoding), and we're pretty certain it's ATS that's breaking them. We have some backlogged followup with upstream ATS to do here I'm sure.

Aklapper renamed this task from "Pages whose title ends with semicolon (;) are intermittently inaccessible" to "Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS)". Feb 13 2021, 5:56 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Note: Pages whose title contains multiple semicolons (;) are accessible even if the title ends with a semicolon:
https://el.wikipedia.org/wiki/Από_πού_ερχόμαστε;_Τι_είμαστε;_Πού_πάμε;

Some observations:

@BBlack The last status update on this bug was ~18 months ago, and indicated the issue was an upstream bug and you were following up there, with a fallback to a WMF-specific patch if upstream got stuck. I see no indication there is any question this behaviour is a bug (cf. eg. Krinkle's comment above). It's also a problem that makes certain pages inaccessible on all projects, breaks contribution histories and other core features for certain users, and necessitates manually prohibiting a character in page and user names that is intended to be permitted.

IOW: an update would be appreciated.

Same problem with an article on WPJA (Japanese Wikipedia). While it works fine through the regular web interface:

$ curl -L 'https://ja.wikipedia.org/wiki/%E3%81%8A%E3%81%A7%E3%81%8B%E3%81%91%E3%83%AC%E3%82%B9%E3%82%BF%E3%83%BC%E3%82%8C%E3%82%8C%E3%82%8C%E3%81%AE%E3%82%8C(%5E%5E%3B'

It leads to a problem with the API:

$ curl -L 'https://ja.wikipedia.org/api/rest_v1/page/mobile-sections/%E3%81%8A%E3%81%A7%E3%81%8B%E3%81%91%E3%83%AC%E3%82%B9%E3%82%BF%E3%83%BC%E3%82%8C%E3%82%8C%E3%82%8C%E3%81%AE%E3%82%8C(%5E%5E%3B'
curl: (47) Maximum (50) redirects followed

Retagging with non-icebox Traffic to at least get an update

Since this bug was reported back in 2019, our CDN stack has changed a little bit: we currently use HAProxy + Varnish + ATS. Trying to fetch https://ban.wikipedia.org/wiki/Mal:; currently results in a 404 Bad title response, and the same happens with https://ban.wikipedia.org/wiki/Mal:%3B

For https://ban.wikipedia.org/wiki/Mal:;, the URL reaches ATS unaltered, ATS removes the trailing semi-colon, and as a consequence the URL hitting MediaWiki is https://ban.wikipedia.org/wiki/Mal:
For https://ban.wikipedia.org/wiki/Mal:%3B, Varnish translates /Mal:%3B to /Mal:; and again ATS removes the trailing semi-colon.
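
The two request flows described above can be modelled with a toy Python sketch. It only mirrors the behaviour reported in this comment: the function names are invented, none of this is actual Varnish (VCL) or ATS code, and it deliberately does not model the multi-semicolon titles that were reported as working earlier in this task.

```python
def varnish_decode_semicolon(path):
    """Toy model (assumption): Varnish decodes %3B to a literal ';'."""
    return path.replace("%3B", ";").replace("%3b", ";")


def ats_strip_trailing_semicolon(path):
    """Toy model (assumption): ATS drops one trailing ';' from the path."""
    return path[:-1] if path.endswith(";") else path


def path_seen_by_mediawiki(path):
    """Apply the two modelled steps in the order described: Varnish, then ATS."""
    return ats_strip_trailing_semicolon(varnish_decode_semicolon(path))


for requested in ("/wiki/Mal:;", "/wiki/Mal:%3B"):
    print(requested, "->", path_seen_by_mediawiki(requested))
```

In this model both spellings collapse to /wiki/Mal:, which has an empty title part and matches the 404 Bad title response observed for both URLs.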

The semi-colon (;) is a reserved character according to https://www.rfc-editor.org/rfc/rfc3986#section-2.2. That section explicitly states:

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.

IMHO we should comply with RFC 3986, start using the encoded version (%3B) of the semi-colon, and configure Varnish to stop decoding %3B to ;. However, this collides with the feedback that @BBlack provided earlier in this very same task: that these URLs should work as-is (without any need for percent-encoding), and that it is most likely ATS that's breaking them.

Some internal discussion must therefore happen to move forward with this task. I'll address this issue tomorrow during the Traffic team weekly meeting.
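
For reference, the percent-encoded form under discussion is easy to produce on the client side. The Python sketch below uses urllib.parse.quote; the helper encode_title_path and the choice to keep "/" and ":" literal are assumptions made for illustration, not how MediaWiki itself builds links.

```python
from urllib.parse import quote, unquote


def encode_title_path(title):
    """Hypothetical helper: build a /wiki/ path in which ';' is
    percent-encoded while '/' and ':' are kept literal."""
    return "/wiki/" + quote(title, safe="/:")


print(encode_title_path("Mal:;"))
print(encode_title_path('Special:Block/Printf("Herro_World");'))

# Decoding restores the literal semicolon, which is the step the
# proposal above would have Varnish stop performing.
print(unquote("/wiki/Mal:%3B"))
```

Producing the encoded form only helps if every layer in front of MediaWiki leaves %3B intact.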

Change 882663 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Possibly mitigate ATS bug with semicolon in Path

https://gerrit.wikimedia.org/r/882663

T261624 was merged here; in that ticket I asked:

On testing, I can see that it is no longer possible to create user names ending in semi-colons (at least on en.Wikipedia - not tested elsewhere); however the handling of legacy accounts with such names is, as demonstrated, sub-optimal. Perhaps we should force renames?

That question does not seem to have been addressed.

Change 882663 merged by BBlack:

[operations/puppet@production] Possibly mitigate ATS bug with semicolon in Path

https://gerrit.wikimedia.org/r/882663

With the merge above, I think this issue is at least mitigated for now. It's not a great long-term solution, but it should alleviate the user-facing side of this in practice.

I can reliably access the pages now. I think we can call this fixed. Thanks!

The question I highlighted in my last post remains unaddressed.

That question does not change the resolved status of this ticket though

I suggested that the blacklist entry disallowing user names ending in semi-colons be removed: https://meta.wikimedia.org/wiki/Talk:Title_blacklist#T238285_workaround

Even if that doesn't happen, I think we haven't forced renames in the past, e.g. T254045: One can register a username with an equals sign in it (except when required for technical reasons, like the SUL renames some years ago).

Change 912312 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Varnish/ATS semicolon workaround for Restbase

https://gerrit.wikimedia.org/r/912312

Change 912312 merged by BBlack:

[operations/puppet@production] Varnish/ATS semicolon workaround for Restbase

https://gerrit.wikimedia.org/r/912312