ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content)
Closed, Resolved · Public · Bug Report

Description

When I tried to mark https://www.wikidata.org/wiki/Help:About_data for translation, I got the following:

Request from [snip] via cp4031.ulsfo.wmnet, ATS/8.0.5
Error: 502, Malformed Server Response Status at 2019-11-04 21:01:37 GMT

I tried again, and it worked properly

Event Timeline

Restricted Application added a subscriber: Aklapper.
Nikerabbit added a subscriber: Nikerabbit.

Jobs are started
2019-11-04T21:01:52 mw1300 INFO [Job: TranslationsUpdateJob][Request ID: XcCRvwpAMEwAAAxc7joAAACY][Title: Help:About data] Starting TranslationsUpdateJob

The log notice for the 502 request (_id:AW44OWRQghP2xm4v2gAB)
2019-11-04T21:01:55 cp4030.ulsfo.wmnet NOTICE -

So Translate worked correctly as far as I can see. Maybe the response was malformed, but there isn't enough information for me to debug. Also, this is the only occurrence in the last 60 days.

Just got it again: Request from [snip] via cp4029.ulsfo.wmnet, ATS/8.0.5 Error: 502, Malformed Server Response Status at 2019-11-14 18:33:15 GMT when marking https://www.wikidata.org/wiki/Wikidata:Lexicographical_data for translation

DannyS712 reopened this task as Open. Edited Nov 17 2019, 8:15 AM

And again: Request from [snip] via cp4032.ulsfo.wmnet, ATS/8.0.5 Error: 502, Malformed Server Response Status at 2019-11-17 08:14:44 GMT when marking https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions for translation

As I said above, there isn't enough information for me to debug.

I wonder if Traffic could check whether this is caused by a timeout, or if there is a way to debug why the response is considered malformed.

For 2019-11-14 18:33:15 GMT, ats-be in cp4030 is reporting 20191114.18h33m15s CONNECT: could not connect to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Special:PageTranslation' (setting last failure time) connect_result=5
Same on 2019-11-17 08:14:44 GMT in cp4032: 20191117.08h14m44s CONNECT: could not connect to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Special:PageTranslation' (setting last failure time) connect_result=5

I just got 502 again on commons during a normal edit (Request from [snip] via cp4030.ulsfo.wmnet, ATS/8.0.5 Error: 502, Malformed Server Response Status at 2019-11-20 00:25:16 GMT) - is it just me?

I don't think so; at 00:25 GMT we had 75 requests (including yours) failing against appservers-rw.discovery.wmnet from ats-be@cp4030:

vgutierrez@cp4030:/var/log/trafficserver$ fgrep 20191120.00h25m error.log |fgrep appservers-rw.discovery.wmnet |wc -l
75

I think this kind of issue becomes visible with non-idempotent requests like POSTs, and that's why you're noticing it when performing R/W actions like marking a page for translation or submitting an edit.

Again when trying to edit on species wiki: Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502, Server Hangup at 2019-11-20 07:36:11 GMT

From the ATS source code, it looks like ATS logs connect errors on the first attempt but is configured to perform 3 attempts, so a log line indicating a connection error isn't necessarily the same as a 502, at least for retryable requests:

//////////////////////////////////////////
// on the first connect attempt failure //
// record the failue                   //
//////////////////////////////////////////
if (0 == s->current.attempts) {
  Log::error("CONNECT:[%d] could not connect [%s] to %s for '%s'", s->current.attempts,
             HttpDebugNames::get_server_state_name(conn_state),
             ats_ip_ntop(&s->current.server->dst_addr.sa, addrbuf, sizeof(addrbuf)), url_string ? url_string : "<none>");
}
vgutierrez@cp1075:/var/log/trafficserver$ sudo -i traffic_ctl config match connect_attempts_max_retries
proxy.config.http.connect_attempts_max_retries: 3
proxy.config.http.connect_attempts_max_retries_dead_server: 1
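
To make the retry-vs-502 distinction concrete, here is a minimal sketch in Python of the behaviour described above for a retryable request. It is purely illustrative, not ATS code; attempt_connect(), log_error() and serve_502() are hypothetical helpers:

MAX_RETRIES = 3  # mirrors proxy.config.http.connect_attempts_max_retries

def forward_request(request):
    for attempt in range(MAX_RETRIES + 1):
        response = attempt_connect(request)      # hypothetical origin fetch
        if response is not None:
            return response                      # a later attempt succeeded: the client never sees an error
        if attempt == 0:
            log_error("CONNECT: could not connect ...")  # only the first failure is logged
    return serve_502(request)                    # every attempt failed: the client gets the 502

So a single CONNECT line in error.log can correspond either to a request that eventually succeeded on retry or to one that ended in a 502.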

I find this pretty worrisome for the following reasons (a rough sketch of the consequence follows after the list):

  1. Right now we have one remap rule that catches all the requests handled by appservers-rw. This means that ATS only tracks one counter of server connection retries for all the sites handled by appservers-rw.discovery.wmnet (wikipedia, wikidata, wikivoyage).
  2. If one mw server behind appservers-rw.discovery.wmnet misbehaves and triggers enough BAD_INCOMING_RESPONSE errors, it could cause ATS to mark appservers-rw.discovery.wmnet as down.
  3. At that point that ats-be instance would return 5xx (502/504) for every request that would otherwise be handled by appservers-rw.discovery.wmnet.
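
A rough sketch (assumed semantics, not ATS internals) of why a single shared failure counter is dangerous here: every site behind the one remap rule shares the same origin state, so enough failures caused by one misbehaving server can flip the whole origin to "down" and turn all of its traffic into 5xx.

class Origin:
    def __init__(self, name, max_failures):
        self.name = name
        self.max_failures = max_failures   # threshold is illustrative only
        self.failures = 0
        self.down = False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.down = True               # from now on every wiki behind this origin gets 502/504

# One shared object for wikipedia, wikidata, wikivoyage because of the single remap rule:
appservers_rw = Origin("appservers-rw.discovery.wmnet", max_failures=3)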

This is clearly unacceptable and very worrisome, if that's the case.

What would happen if we consistently returned some bad response for a specific URL because of a code deploy? I guess a 5xx from a backend doesn't translate to BAD_INCOMING_RESPONSE, but it's quite damaging anyway.

Nope, a 5xx doesn't translate to BAD_INCOMING_RESPONSE; it's actually specifically whitelisted:

case STATUS_CODE_SERVER_ERROR:
    TxnDebug("http_trans", "[is_response_valid] Response Error: Origin Server returned 500 - allowing");
    return true;
DannyS712 renamed this task from "Bug: 502 error when marking page for translation" to "502 errors on ATS/8.0.5". Nov 20 2019, 10:12 AM
DannyS712 changed the subtype of this task from "Task" to "Bug Report".

That lowers my level of worry, but there is of course the case you were talking about: a backend server in a bad state.

If a single ATS backend hits it enough times before pybal depools the misbehaving server, it can happen that this ats-be effectively depools all appservers. On the other hand, for most idempotent requests varnish-fe will try another backend (IIRC), so the impact would be mitigated.

One real problem is that some of our requests are incredibly long and might exceed the timeouts in ats-be; in that case pybal won't depool a backend, but we might still get the error counter increased, correct?

Correct, or at least that's my understanding after checking this piece of code:

bool
HttpTransact::is_response_valid(State *s, HTTPHdr *incoming_response)
{
  if (s->current.state != CONNECTION_ALIVE) {
    ink_assert((s->current.state == CONNECTION_ERROR) || (s->current.state == OPEN_RAW_ERROR) ||
               (s->current.state == PARSE_ERROR) || (s->current.state == CONNECTION_CLOSED) ||
               (s->current.state == INACTIVE_TIMEOUT) || (s->current.state == ACTIVE_TIMEOUT));

    s->hdr_info.response_error = CONNECTION_OPEN_FAILED;
    return false;
  }

@Joe on the other hand, ATS doesn't reap connections for hosts marked as down, and because ats-be uses keep-alive it should have plenty of available connections against appservers-rw.discovery.wmnet. I believe this is the reason why we aren't seeing a broader impact right now.

On cp1075:

$ sudo stap -ve 'probe process("/usr/bin/traffic_server").statement("retry_server_connection_not_open@./proxy/http/HttpTransact.cc:3612") { printf("%d retry %d max_retries resp_error %d %s\n", $s->current->attempts, $max_retries, $s->hdr_info->response_error, user_string_n($url_string, 128)) }'
[...]
0 retry 3 max_retries resp_error 4 https://appservers-rw.discovery.wmnet/wiki/Special:EntityData/Q59006642.json

Matching error in error.log:

20191120.15h54m54s CONNECT:[0] could not connect [BAD_INCOMING_RESPONSE] to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Special:EntityData/Q59006642.json'

We are probing HttpTransact::retry_server_connection_not_open at line 3612:

3594 void
3595 HttpTransact::retry_server_connection_not_open(State *s, ServerState_t conn_state, unsigned max_retries)
3596 {
3597   ink_assert(s->current.state != CONNECTION_ALIVE);
3598   ink_assert(s->current.state != ACTIVE_TIMEOUT);
3599   ink_assert(s->current.attempts <= max_retries);
3600   ink_assert(s->current.server->had_connect_fail());
3601   char addrbuf[INET6_ADDRSTRLEN];
3602 
3603   char *url_string = s->hdr_info.client_request.url_string_get(&s->arena);
3604 
3605   TxnDebug("http_trans", "[%d] failed to connect [%d] to %s", s->current.attempts, conn_state,
3606            ats_ip_ntop(&s->current.server->dst_addr.sa, addrbuf, sizeof(addrbuf)));
3607 
3608   //////////////////////////////////////////
3609   // on the first connect attempt failure //
3610   // record the failue                   //
3611   //////////////////////////////////////////
3612   if (0 == s->current.attempts) {
3613     Log::error("CONNECT:[%d] could not connect [%s] to %s for '%s'", s->current.attempts,
3614                HttpDebugNames::get_server_state_name(conn_state),
3615                ats_ip_ntop(&s->current.server->dst_addr.sa, addrbuf, sizeof(addrbuf)), url_string ? url_string : "<none>");
3616   }

$s->hdr_info->response_error is 4, aka MISSING_STATUS_CODE:

enum ResponseError_t {
    NO_RESPONSE_HEADER_ERROR,
    BOGUS_OR_NO_DATE_IN_RESPONSE,
    CONNECTION_OPEN_FAILED,
    MISSING_REASON_PHRASE,
    MISSING_STATUS_CODE,
    NON_EXISTANT_RESPONSE_HEADER,
    NOT_A_RESPONSE_HEADER,
    STATUS_CODE_SERVER_ERROR,
    TOTAL_RESPONSE_ERROR_TYPES
};
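
As a quick cross-check (illustrative only), mirroring the enum in Python confirms that the value 4 reported by the probe is indeed MISSING_STATUS_CODE, given that the enumerators above start at 0:

from enum import IntEnum

class ResponseError(IntEnum):
    NO_RESPONSE_HEADER_ERROR = 0
    BOGUS_OR_NO_DATE_IN_RESPONSE = 1
    CONNECTION_OPEN_FAILED = 2
    MISSING_REASON_PHRASE = 3
    MISSING_STATUS_CODE = 4
    NON_EXISTANT_RESPONSE_HEADER = 5
    NOT_A_RESPONSE_HEADER = 6
    STATUS_CODE_SERVER_ERROR = 7
    TOTAL_RESPONSE_ERROR_TYPES = 8

assert ResponseError(4) is ResponseError.MISSING_STATUS_CODE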

So somehow we're either failing to parse a valid response status code, or we're not getting a decent response from the origins.

Nice stap! Playing a little bit with debug logging for a specific client IP, I've been able to get the same information:

[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:6371 (is_response_valid)> (http_trans) [2494107211] [is_response_valid] Response Error: Missing status code
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3448 (handle_response_from_server)> (http_trans) [2494107211] [handle_response_from_server] (hrfs)
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3606 (retry_server_connection_not_open)> (http_trans) [2494107211] [0] failed to connect [2] to 10.2.2.1
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3627 (retry_server_connection_not_open)> (http_trans) [2494107211] [retry_server_connection_not_open] attempts now: 1, max: 3
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3518 (handle_response_from_server)> (http_trans) [2494107211] [handle_response_from_server] Error. Retrying...

It looks like the issue is happening on the mw servers as well: nginx (TLS termination for appservers-rw.discovery.wmnet) is complaining a lot about wikidata requests. For the request I've pasted in the previous comment:

2019/11/21 02:57:04 [error] 20871#20871: *165777371 upstream prematurely closed connection while reading upstream, client: 10.64.0.130, server: www.wikimedia.org, request: "GET /wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=14622 HTTP/1.1", upstream: "http://10.64.0.62:80/wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=14622", host: "www.wikidata.org"

Interestingly enough, in the apache access log that request looks OK:

2019-11-21T02:57:04	73916	10.64.0.62	proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304	14079	GET	http://www.wikidata.org/wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=14622	-	-	-	10.64.0.130	curl/7.52.1	-	-	-	10.64.0.62

The first thing I'd ask is why a 304 would have a content-length of 14079 bytes. It looks like we're doing something wrong somewhere, but it's not consistently reproducible in my tests. @Vgutierrez please post full request headers for repro cases so we can debug further; the issue seems to be that apache sends back a content-length of 14079 bytes for a 304.

Do I need to be concerned about my IP being released during discussion // should this task be private?

Just confirmed we only get 304s with a non-zero content-length for requests to Special:EntityData on wikidata.

I still don't have a reliable repro case.

@Joe it seems pretty easy to trigger:

vgutierrez@mw1330:~$ curl --resolve www.wikidata.org:80:10.64.32.32 "www.wikidata.org/wiki/Special:EntityData/Q38646387.json" -o /dev/null -v -H 'if-modified-since: Sat, 30 Mar 2019 07:05:28 GMT'
* Added www.wikidata.org:80:10.64.32.32 to DNS cache
* Hostname www.wikidata.org was found in DNS cache
*   Trying 10.64.32.32...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to www.wikidata.org (10.64.32.32) port 80 (#0)
> GET /wiki/Special:EntityData/Q38646387.json HTTP/1.1
> Host: www.wikidata.org
> User-Agent: curl/7.52.1
> Accept: */*
> if-modified-since: Sat, 30 Mar 2019 07:05:28 GMT
>
< HTTP/1.1 304 Not Modified
< Date: Thu, 21 Nov 2019 07:44:56 GMT
< Server: mw1330.eqiad.wmnet
< Expires: Thu, 21 Nov 2019 07:44:56 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Vary: Accept-Encoding,Cookie,Authorization
<
* Excess found in a non pipelined read: excess = 13881 url = /wiki/Special:EntityData/Q38646387.json (zero-length body)
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Curl_http_done: called premature == 0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host www.wikidata.org left intact

See how curl complains:

Excess found in a non pipelined read: excess = 13881 url = /wiki/Special:EntityData/Q38646387.json (zero-length body)

and apache shows the 304 with CL = 13881:

2019-11-21T07:44:56	85670	10.64.32.32	proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304	13881	GET	http://www.wikidata.org/wiki/Special:EntityData/Q38646387.json	-	-	-	-curl/7.52.1	-	-	-	10.64.32.32
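
For anyone who wants to check how widespread this is, here is a small sketch (not an existing WMF tool) that scans an access log in the tab-separated format of the sample lines above; it assumes the handler/status field is the 4th column (ending in "/304") and the response size is the 5th:

import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        continue
    handler, size, url = fields[3], fields[4], fields[6]
    if handler.endswith("/304") and size.isdigit() and int(size) > 0:
        print(size, url)

Feeding the apache access log through it should list exactly the 304s that carried a body, like the Special:EntityData ones above.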

https://bz.apache.org/bugzilla/show_bug.cgi?id=57198 seems relevant, but I think it should be fixed in our version of apache2 (2.4.25).

We should try to see what php-fpm answers using furl on the appservers.

Reproducing it with furl shows that php-fpm returns a 304 with a body, and it looks like apache2 2.4.25-3+deb9u7 is failing to handle that.

The patch seems to be there, see https://salsa.debian.org/apache-team/apache2/blob/debian/2.4.25-3+deb9u7/modules%2Fproxy%2Fmod_proxy_fcgi.c#L663-675:

else if (status == HTTP_NOT_MODIFIED
         || status == HTTP_PRECONDITION_FAILED) {
    /* Special 'status' cases handled:
     * 1) HTTP 304 response MUST NOT contain
     *    a message-body, ignore it.
     * 2) HTTP 412 response.
     * The break is not added since there might
     * be more bytes to read from the FCGI
     * connection. Even if the message-body is
     * ignored (and the EOS bucket has already
     * been sent) we want to avoid subsequent
     * bogus reads. */
    ignore_body = 1;

@Addshore @WMDE-leszek: it seems we are causing this. We should take a look.

So it looks like the apache fix doesn't cover every scenario. This small PoC triggers the issue:

<?php
http_response_code(304);
echo("test");
flush();

On mwdebug2001 this seems to be the mod_proxy_fcgi debug output:

Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:debug] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(913): [client 10.192.0.98:37452] AH01076: url: fcgi://localhost/srv/mediawiki/docroot/wikidata.org/w/index.php proxyname: (null) proxyport: 0
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:debug] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(920): [client 10.192.0.98:37452] AH01078: serving URL fcgi://localhost/srv/mediawiki/docroot/wikidata.org/w/index.php
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'UNIQUE_ID' value 'REDACTED'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_AUTHORIZATION' value ''
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_CONTENT_TYPE' value ''
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_CONTENT_LENGTH' value ''
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SCRIPT_URL' value '/wiki/Special:EntityData/Q38646387.json'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SCRIPT_URI' value 'http://www.wikidata.org/wiki/Special:EntityData/Q38646387.json'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'RW_PROTO' value 'http'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_HOST' value 'www.wikidata.org'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_USER_AGENT' value 'curl/7.52.1'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_ACCEPT' value '*/*'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'HTTP_IF_MODIFIED_SINCE' value 'Sat, 30 Mar 2019 07:05:28 GMT'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'PATH' value '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_SIGNATURE' value ''
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_SOFTWARE' value 'mwdebug2001.codfw.wmnet'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_NAME' value 'www.wikidata.org'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_ADDR' value '10.192.0.98'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_PORT' value '80'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'REMOTE_ADDR' value '10.192.0.98'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'DOCUMENT_ROOT' value '/srv/mediawiki/docroot/wikidata.org'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'REQUEST_SCHEME' value 'http'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'CONTEXT_PREFIX' value ''
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'CONTEXT_DOCUMENT_ROOT' value '/srv/mediawiki/docroot/wikidata.org'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_ADMIN' value 'webmaster@wikimedia.org'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SCRIPT_FILENAME' value '/srv/mediawiki/docroot/wikidata.org/w/index.php'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'REMOTE_PORT' value '37452'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'GATEWAY_INTERFACE' value 'CGI/1.1'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SERVER_PROTOCOL' value 'HTTP/1.1'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'REQUEST_METHOD' value 'GET'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'QUERY_STRING' value ''
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'REQUEST_URI' value '/wiki/Special:EntityData/Q38646387.json'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(308): [client 10.192.0.98:37452] AH01062: sending env var 'SCRIPT_NAME' value '/wiki/Special:EntityData/Q38646387.json'
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(592): [client 10.192.0.98:37452] FastCGI header (8 bytes)
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(592): [client 10.192.0.98:37452] ........         010600011ff80000
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(571): [client 10.192.0.98:37452] Headers from script 'index.php':
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   X-Powered-By: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   X-Content-Type-Options: nosniff
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   P3P: CP="See https://www.wikidata.org/wiki/Special:CentralAutoLogin/P3P for more info."
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   Access-Control-Allow-Origin: *
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   Status: 304
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   Content-language: en
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   X-Frame-Options: DENY
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   X-Analytics: ns=-1;special=EntityData
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   Vary: Accept-Encoding, Cookie, Authorization
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   Expires: Thu, 21 Nov 2019 12:02:13 GMT
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace4] [pid 4516:tid 140373593552640] util_script.c(572): [client 10.192.0.98:37452]   Cache-Control: private, must-revalidate, max-age=0
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(592): [client 10.192.0.98:37452] FastCGI header (8 bytes)
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(592): [client 10.192.0.98:37452] ........         0106000118110700
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(592): [client 10.192.0.98:37452] FastCGI header (8 bytes)
Nov 21 12:02:13 mwdebug2001 apache2[4516]: [proxy_fcgi:trace8] [pid 4516:tid 140373593552640] mod_proxy_fcgi.c(592): [client 10.192.0.98:37452] ........         0103000100080000

Nothing out of the ordinary, and I don't see any payload after the 304.

I enabled mod_dumpio with Valentin's test on mwdebug and I can see the body emitted from httpd, but no good info from proxy_fcgi.

Is there anything that we can quickly do on wikibase to fix this?

If so, please advise concretely what we should do.

Thanks!

I have created a Docker image for Debian stretch installing apache2 (same version as the mw app servers) + php7.2-fpm from Sury's repo + the following trivial config:

<Directory /var/www/>
        Options Indexes FollowSymLinks
        AllowOverride None
        Require all granted
       <FilesMatch ".php$">
           SetHandler "proxy:unix:/run/php/php7.2-fpm.sock|fcgi://localhost"
       </FilesMatch>
</Directory>

and

root@75ab41fd13ca:/# cat /var/www/test.php
<?php
http_response_code(304);
echo("test");
flush();

No repro:

root@75ab41fd13ca:/# curl localhost/test.php -i
HTTP/1.1 304 Not Modified
Date: Thu, 21 Nov 2019 18:07:28 GMT
Server: Apache/2.4.25 (Debian)

Edit: tried also with php7.2 from our wikimedia apt repo, same result.

Please note that I cannot reproduce on mwdebug using curl -i; see the difference:

vgutierrez@mwdebug2001:~$ curl --resolve en.wikipedia.org:80:10.192.0.98 http://en.wikipedia.org/w/vgutierrez.php -v
* Added en.wikipedia.org:80:10.192.0.98 to DNS cache
* Hostname en.wikipedia.org was found in DNS cache
*   Trying 10.192.0.98...
* TCP_NODELAY set
* Connected to en.wikipedia.org (10.192.0.98) port 80 (#0)
> GET /w/vgutierrez.php HTTP/1.1
> Host: en.wikipedia.org
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 304 Not Modified
< Date: Fri, 22 Nov 2019 02:26:32 GMT
< Server: mwdebug2001.codfw.wmnet
<
* Excess found in a non pipelined read: excess = 4 url = /w/vgutierrez.php (zero-length body)
* Curl_http_done: called premature == 0
* Connection #0 to host en.wikipedia.org left intact
vgutierrez@mwdebug2001:~$ curl --resolve en.wikipedia.org:80:10.192.0.98 http://en.wikipedia.org/w/vgutierrez.php -i
HTTP/1.1 304 Not Modified
Date: Fri, 22 Nov 2019 02:26:35 GMT
Server: mwdebug2001.codfw.wmnet

And using a dumb TCP client we can see the 4 excess bytes:

vgutierrez@mwdebug2001:~$ python3
[...]
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.connect(('10.192.0.98', 80))
>>> s.send("GET /w/vgutierrez.php HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n".encode())
58
>>> s.recv(1024)
b'HTTP/1.1 304 Not Modified\r\nDate: Fri, 22 Nov 2019 02:33:07 GMT\r\nServer: mwdebug2001.codfw.wmnet\r\n\r\ntest'
>>> s.close()

And I can reproduce the issue as well with docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2.
Dockerfile:

FROM docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2
RUN mkdir -p /tmp/php
RUN sed -i s/runuser/www-data/ /etc/apache2/apache2.conf
COPY poc.php /var/www/html

willikins:fpm-apache2 vgutierrez$ curl -v http://127.0.0.1:8080/poc.php
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET /poc.php HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 304 Not Modified
< Date: Fri, 22 Nov 2019 02:55:52 GMT
< Server: Apache/2.4.25 (Debian)
< Connection: Upgrade
<
* Excess found in a non pipelined read: excess = 4 url = /poc.php (zero-length body)
* Connection #0 to host 127.0.0.1 left intact
* Closing connection 0

versions used inside the container:

willikins:fpm-apache2 vgutierrez$ docker exec apache dpkg -l php7.2-fpm
ii  php7.2-fpm     7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1 amd64        server-side, HTML-embedded scripting language (FPM-CGI binary)
willikins:fpm-apache2 vgutierrez$ docker exec apache dpkg -l apache2
ii  apache2        2.4.25-3+deb9u9 amd64        Apache HTTP Server

Thanks, will remember to use -v :)

On buster + sury's php7.2-fpm + apache 2.4.38 I cannot repro:

root@497a89a9ac42:/# curl localhost/test.php -v
* Expire in 0 ms for 6 (transfer 0x5604525d2f50)
* Expire in 1 ms for 1 (transfer 0x5604525d2f50)
* Expire in 0 ms for 1 (transfer 0x5604525d2f50)
* Expire in 1 ms for 1 (transfer 0x5604525d2f50)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 149997 ms for 3 (transfer 0x5604525d2f50)
* Expire in 200 ms for 4 (transfer 0x5604525d2f50)
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test.php HTTP/1.1
> Host: localhost
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 304 Not Modified
< Date: Fri, 22 Nov 2019 07:22:15 GMT
< Server: Apache/2.4.38 (Debian)
<
* Connection #0 to host localhost left intact

I thought about using git bisect to nail down the commit that fixes the problem, but then I remembered http://svn.apache.org/r1837056 (plus https://svn.apache.org/r1843242 as a follow-up for mod_brotli/deflate). The fix handles the 304/204 body problem in the HTTP output filter, so after all the content generators (like mod_proxy_fcgi) and before sending the response to the external client.

The change is not small and contained in one file, but we could think about adding it to our current httpd package on Debian Stretch while waiting for Buster on the appservers.
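
To make the idea of that fix concrete, here is a rough analogue in Python, a WSGI-middleware sketch. The real change is in httpd's C output filter; this only illustrates "strip the body after the content generator, before the response goes to the client", and a complete version would also fix up headers such as Content-Length:

def strip_body_for_304_204(app):
    """Drop any body the wrapped application produced for 204/304 responses."""
    def wrapper(environ, start_response):
        seen = {}
        def capturing_start_response(status, headers, exc_info=None):
            seen["status"] = int(status.split()[0])
            return start_response(status, headers, exc_info)
        # Materialise the generator's output so we can inspect the status code.
        body = b"".join(app(environ, capturing_start_response))
        if seen.get("status") in (204, 304):
            return [b""]     # the generator's body is discarded, like the httpd fix does
        return [body]
    return wrapper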

Is there anything that we can quickly do on wikibase to fix this?

If so, please advise concretely what we should do.

Thanks!

@darthmon_wmde hi! So the issue seems to be that URLs like www.wikidata.org/wiki/Special:EntityData/Q38646387.json can lead to an HTTP 304 response with a body, which violates the HTTP specs. The Apache httpd version that we use on Debian Stretch currently doesn't handle this (namely, proactively dropping the body if one is found), so a fix in our code would definitely help.

Is there anything that we can quickly do on wikibase to fix this?

If so, please advise concretely what we should do.

Thanks!

In general, whenever your code decides to respond with a 304 Not Modified or a 204 No Content status code, no message body should be sent to the client.

While the FCGI specification[1] isn't extremely clear on this topic, the HTTP one is, and nginx, for instance, refuses to accept 304 responses with a body as valid; see https://trac.nginx.org/nginx/ticket/459

So the code for Special:EntityData should not emit a body when returning a 304 response code; otherwise any wikibase installation using nginx as a webserver would fail to respond to Special:EntityData.

[1] FCGI is the protocol PHP-FPM speaks; the specification is at https://github.com/fast-cgi/spec/blob/master/spec.md (rather obscure, in fact).
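
For concreteness, here is what "no body on a 304" looks like at the application level: a minimal sketch using Python's standard http.server, not the actual Wikibase/MediaWiki code (the resource timestamp and JSON payload are made up). When the conditional check succeeds, the handler sends the status line and headers and returns without writing anything to the body.

from http.server import BaseHTTPRequestHandler, HTTPServer
from email.utils import parsedate_to_datetime

LAST_MODIFIED = "Sat, 30 Mar 2019 07:05:28 GMT"   # assumed last-modified time of the resource

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ims = self.headers.get("If-Modified-Since")
        if ims and parsedate_to_datetime(ims) >= parsedate_to_datetime(LAST_MODIFIED):
            self.send_response(304)
            self.end_headers()
            return                                # crucially, nothing is written to self.wfile
        body = b'{"entities": {}}'                # placeholder payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

Requesting it with the same kind of curl command shown earlier (curl -v with an if-modified-since header) should then produce a clean 304 with no "Excess found in a non pipelined read" warning.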

Thanks @elukey and @Joe for translating from leet speak! I've filed T238901 about the problem in Wikibase, and we'll be looking into fixing the broken 304 response.

Addshore assigned this task to Ladsgroup.

I'll close this one along with the subtask then :)

I don't know if this is indeed done:
Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502, Cannot find server. at 2019-11-26 21:08:03 GMT when trying to load https://phabricator.wikimedia.org/T225055
Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502, Cannot find server. at 2019-11-26 21:08:25 GMT when trying to load https://phabricator.wikimedia.org/search/query/JJs0SOXpHF3Y/#R

I think you ran into a temporary blip in some unrelated DNS work (which is already dealt with), not this bug (502 errors can happen for real infra failure reasons, too!)

CDanis renamed this task from "502 errors on ATS/8.0.5" to "ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content)". Nov 26 2019, 9:17 PM