Page MenuHomePhabricator

Figure out how Varnish errorpage was enabled for empty 404 text/html from mw/index.php?action=raw
Closed, ResolvedPublic

Description

In a change that appears to have gone live yesterday or today, some raw actions are now returning an HTML 404 page. The previous behavior, I believe, was that it would return an empty document when a nonexistent page was requested. For example: https://meta.wikimedia.org/w/index.php?action=raw&title=User:Sage%20(Wiki%20Ed)/dashboard%20modules/nonexistent.json

This changes the behavior of the mediawiki_api Ruby gem when requesting the wikitext of a nonexistent title, and may affect other programmatic users of action=raw as well.

Event Timeline

For the record, the (more common) use of action=raw for CSS and JavaScript is not affected.

(These still return an empty response like before).

But urls with other extensions, or no extensions, now display the following 404 Not Found page: https://meta.wikimedia.org/w/index.php?action=raw&title=Example/nonexistent

Screen Shot 2018-03-22 at 13.44.44.png (796×1 px, 123 KB)

Note that this error page does not come from MediaWiki, it comes from Varnish.

Varnish is seeing the empty 404 response from MediaWiki, and deciding to replace the empty one with a more useful one. It does this because the content-type is communicated by MediaWiki as "text/html" and so Varnish believes it is safe to do so.

As far as I know, this logic was added several months ago when Thumbor was deployed, which (unlike our previous thumbnail generators) outputted errors via the status code only, not with a nice page. So we added logic in Varnish (the cache proxy we use), to turn those empty responses into a more useful one.

I'm not sure why this would've changed recently. I suppose we can try to track it down. However, it is very likely that the change was made intentionally to solve something elsewhere.

In this particular case I believe it is behaving correctly. When handling page-existence checks in your bot, it is fine to assume empty=non-existent. I suppose you know the danger there that a page can exist whilst being empty, but that's a choice you can make. However, you should not assume that empty is the only way to find that a page is empty. For that, one should first check the HTTP status code. If that is 404, it means the page doesn't exist. If that is 200 OK, it means the page exists. Beyond that, one could additionally treat empty as non-existent if preferred.

Back to the original issue, to avoid error pages for JSON (which is a supported content-type), you can set ctype=application/json:

https://meta.wikimedia.org/w/index.php?action=raw&title=Example/nonexistent.json&ctype=application/json

This continues to receive an empty response like before. So while I'm not yet sure which exact code commit caused the change in behaviour, it only affects cases where HTML is assumed.

Krinkle renamed this task from Investigate impact of change in index.php?action=raw 404 responses to How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw.Mar 22 2018, 8:58 PM
Krinkle added a project: Traffic.

@Krinkle thanks much! From that description, I'm guessing this won't affect many people. I was querying for arbitrary .json pages and needing to handle redlinks to such pages.

Krinkle subscribed.

The behavior change last week was not limited to JSON pages. My app uses the mediawiki ruby API gem's get_wikitext method, and it ended up posting the 404 html to nonexistent talk pages before I deployed the fix to how 404s are handled.

It might make sense to patch the gem to return empty string for 404s, which would restore the previous behavior. Then again, maybe no one else has code that was inadvertently relying on the old behavior to signal a nonexistent page like mine was. What do you think, @zeljkofilipin?

I'm not sure why this would've changed recently. I suppose we can try to track it down. However, it is very likely that the change was made intentionally to solve something elsewhere.

The change triggering this is https://gerrit.wikimedia.org/r/#/c/421267 (stop sending Accept-Encoding to the application layer).

When sending a request to mediawiki with Accept-Encoding: gzip, which was the behavior of varnish before https://gerrit.wikimedia.org/r/#/c/421267, I get the following response. Note the lack of Content-Length.

< HTTP/1.1 404 Not Found
< Date: Tue, 27 Mar 2018 10:20:28 GMT
* Server mw1269.eqiad.wmnet is not blacklisted
< Server: mw1269.eqiad.wmnet
< X-Powered-By: HHVM/3.18.6-dev
< X-Content-Type-Options: nosniff
< Cache-control: public, s-maxage=0, max-age=1209600
< P3P: CP="This is not a P3P policy! See https://meta.wikimedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
< Content-Encoding: gzip
< Vary: Accept-Encoding,Cookie,Authorization
< Backend-Timing: D=36073 t=1522146028820843
< Transfer-Encoding: chunked
< Content-Type: text/x-wiki; charset=UTF-8

When sending a request without Accept-Encoding, which is what varnish currently does, Content-Length is instead set to 0:

< HTTP/1.1 404 Not Found
< Date: Tue, 27 Mar 2018 10:26:17 GMT
* Server mw1268.eqiad.wmnet is not blacklisted
< Server: mw1268.eqiad.wmnet
< X-Powered-By: HHVM/3.18.6-dev
< X-Content-Type-Options: nosniff
< Cache-control: public, s-maxage=0, max-age=1209600
< P3P: CP="This is not a P3P policy! See https://meta.wikimedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
< Vary: Accept-Encoding,Cookie,Authorization
< Backend-Timing: D=27711 t=1522146377074949
< Content-Length: 0
< Content-Type: text/x-wiki; charset=UTF-8

We do make varnish generate a 404 page if the application-layer response has Content-Length==0 and status code >=400. Content-Length being unset is different from Content-Length being 0, hence the new behavior. @BBlack, @Gilles: should we perhaps consider CL:unset as CL:0 for the purpose of generating the synthetic error page from varnish, unless we do have cases where the applayer response is 404, CL is unset, but the response does have a body?

ema triaged this task as Medium priority.Mar 27 2018, 10:36 AM

It might make sense to patch the gem to return empty string for 404s, which would restore the previous behavior. Then again, maybe no one else has code that was inadvertently relying on the old behavior to signal a nonexistent page like mine was. What do you think, @zeljkofilipin?

It is unlikely that I will have the time to patch the gem, since the gem was developed to be used with Ruby Selenium framework, and it has been deprecated. See T139740: Port Selenium tests from Ruby to Node.js and T173488: Announce Selenium Ruby framework deprecation on appropriate mailing lists (QA, engineering, wikitech-l).

Feel free to patch the gem yourself. I should have the time to help with reviews and releasing new version of the gem.

So, recapping a bit what's already been mentioned above: the proximate cause of behavior change was the AE:gzip patch, but this was just unconvering a more-subtle problem that really originated with T169683 -> https://gerrit.wikimedia.org/r/#/c/365589/ . The reason that patch didn't affect this case back when it was deployed was that the MW stack (maybe hhvm or apache) was gzipping the empty response pointlessly (it seems to have been doing a lot of pointless gzipping before the AE:gzip fixup), and a gzipped version doesn't happen to have an explicit CL:0 (it would've had CL:20 if non-chunked, but it was chunked due to streaming gzip processing anyways).

We should revisit the intent and rationale in T169683 , since that seems to be in direct conflict with the goals here in this ticket.

The reason for the patch was that Thumbor wasn't generating pretty error outputs for a variety of [45]xx cases, and so it was decided that we should solve this for the general case, since it's not desirable to have body-silent [45]xx regardless of the generating application. As it turns out, apparently MediaWiki does have a use for body-silent 404s, at least.

Another issue with that patch is that Content-Length: 0 condition only catches a subset of body-free cases anyways (as the gzip issue highlights). It's easy to have body-free responses slip past the CL:0 condition by using any of: Transfer-Encoding: chunked, Content-encoding: gzip, or by lacking both of those and providing no CL header in the HTTP/1.0 -like "close-delimited" output behavior. But TL;DR on this issue is that we're best off leaving it as it is, as the other conditions aren't reliable indicators of empty bodies, either, so we just have to accept that we can't catch them.

I see a couple reasonable paths forward:

  1. We can give up on doing this for the general case (it doesn't work for some general sub-cases anyways), and move this VCL to be cache_upload -specific (where it only potentially affects swift/thumbor and maps) and not cache_text/cache_misc.
  2. We could keep it in the cluster-common VCL, but carve out an exception for MediaWiki 404s based on the selected backend.

Regardless, it seems fishy we're doing this in both the FE and BE VCL layers. Shouldn't this be BE-only if it's trying to translate for application-generated CL:0 responses? We'll need it to be BE-only to use approach (2) above anyways.

Sure, the behavior can be reverted in these ways, but the rationale to have more human-friendly error pages should apply here, shouldn't it?

The fact that it broke a deprecated gem isn't a big enough argument to go back to empty error pages in that context, imho. It seems more like an issue of fixing the gem for that case. The new error is semantically correct.

Yeah, I guess I was taking it as implicitly true that it's correct for MW to desire content-free 404s in the cases being complained about here. In any case, the only reliable way for MW to avoid it (as it does for e.g. https://meta.wikimedia.org/w/index.php?action=raw&title=Example/nonexistent.js&ctype=text/javascript) is to emit a 200 rather than a 404. Other hacks might temporarily work due to CE/TE nits avoiding the CL:0 condition, but won't be reliable in the long term... e.g. above:

[...] It might make sense to patch the gem to return empty string for 404s, which would restore the previous behavior.

If this means what it sounds like, it's a transient hack at best: it would still return a 404 status code, but having ruby return an "empty body" somehow triggers the switch from CL:0-like behavior to TE:chunked-like behavior, thus evading the explicit CL:0 filter in current VCL code, but not really evading the intent of the code (to replace empty 404 bodies). As transport protocols/endpoints/software undergo further subtle changes, workarounds that still use 404+empty but subtly evade the CL:0 condition will fail again in the future.

I think the new behavior isn't a significant problem; it'll only really be an issue for anyone who was relying on the old behavior (and we don't know of anyone else who was). It also isn't particularly a bug in the gem, which always returned a 404 code, and just switched to also returning the varnish HTML.

However, the error message is actively misleading: "Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes."

Agreed. If anything, this may also very well fix issues for people writing new things with MediaWiki because it's now finally working as intended by MediaWiki's PHP code.

The 404 for content-type wikitext is explicitly coded into mediawiki:/includes/actions/RawAction.php.

Thanks @ema and @BBlack for the analysis. I would not have thought of the Gzip change triggering this behaviour change, but that makes perfect sense indeed, given how MediaWiki was compressing "nothing" into compressed version encapsulating nothing, making it "something".

Krinkle renamed this task from How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw to Figure out how Varnish errorpage was enabled for empty 404 text/html from mw/index.php?action=raw.Mar 29 2018, 1:14 AM
Krinkle changed the task status from Declined to Resolved.
Krinkle assigned this task to BBlack.