Page MenuHomePhabricator

Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Seeing a flood of these errors in fluorine:/a/mw-log/apache2.log:

Sep 30 22:24:17 mw1018: [proxy_fcgi:error] [pid 12557] [client 10.64.0.103:5054] AH01070: Error parsing script headers, referer: https://es.wikipedia.org/wiki/Estadio_Coloso_del_Ruca_Quimey

Sep 30 22:24:16 mw1019: [proxy_fcgi:error] [pid 30172] [client 10.64.0.105:52919] AH01068: Got bogus version 1, referer: https://commons.wikimedia.org/wiki/File:Bibliotheque_de_Toulouse_-_Pic_d-Aneto_(pd).jpg
Sep 30 22:24:16 mw1019: [proxy_fcgi:error] [pid 30172] (22)Invalid argument: [client 10.64.0.105:52919] AH01075: Error dispatching request to :, referer: https://commons.wikimedia.org/wiki/File:Bibliotheque_de_Toulouse_-_Pic_d-Aneto_(pd).jpg


Version: wmf-deployment
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:52 AM
bzimport set Reference to bz71487.
bzimport added a subscriber: Unknown Object (MLST).

According to httpd[1] "Got bogus version 1" refers to the first byte in every fastcgi record. Per the spec[2] the only valid value there is 1. Something odd is going on, but without a capture of the packets i couldn't say what.

[1] https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_fcgi.c#L525
[2] http://www.fastcgi.com/devkit/doc/fcgi-spec.html#S3.3

So, both these lines of errors have been showed to be basically bogus.

I think the only way to filter/silence them is to patch apache directly.

@erik: never look at trunk for apache sources, but to our specific version, which is 2.4.7

The only way to make these alarms go away is to backport patches that corrected logging.

IDK if we want to invest that kind of time in such an activity.

Krinkle renamed this task from proxy_fcgi errors in Apache logs to Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503).Feb 6 2015, 9:55 PM
Krinkle set Security to None.
Krinkle added a project: HHVM.
Krinkle removed a subscriber: Unknown Object (MLST).
Krinkle added subscribers: Bennylin, tstarling, Aklapper and 4 others.

Still seeing lots of these in logstash. https://logstash.wikimedia.org/#/dashboard/elasticsearch/apache2log

18,878 hits for this error in the last 15 minutes alone (most frequent error).

2015-02-06T21:52:23.000Z	error	proxy_fcgi	mw1020	(22)Invalid argument: AH01075: Error dispatching request to ...
2015-02-06T21:52:23.000Z	error	proxy_fcgi	mw1243	(22)Invalid argument: AH01075: Error dispatching request to ...
2015-02-06T21:52:23.000Z	error	proxy_fcgi	mw1210	(22)Invalid argument: AH01075: Error dispatching request to ...
2015-02-06T21:52:22.000Z	error	proxy_fcgi	mw1242	(22)Invalid argument: AH01075: Error dispatching request to ...
2015-02-06T21:52:22.000Z	error	proxy_fcgi	mw1185	(22)Invalid argument: AH01075: Error dispatching request to ...
	..
	referrer	  	http://fr.wikipedia.org/w/index.php?title=Portail:Triathlon/Index_th%C3%A9matique&action=submit
hashar subscribed.

Would be nice to get rid of those errors. They are quite spammy in logstash.

@Joe mentioned it should be fixed via https://bz.apache.org/bugzilla/show_bug.cgi?id=52879 , the patch has a follow up at https://bz.apache.org/bugzilla/show_bug.cgi?id=57198

The bug I mentioned earlier was for the AH01070 and has been fixed in our new apache package for jessie.

The still bogus errors got bogus version are still unsolved and there is an unfixed bug report for apache https://bz.apache.org/bugzilla/show_bug.cgi?id=57398

Quite interestingly, the bogus report only happens on regular appservers, not on the API ones; this might give us a chance to isolate it better.

Adding some random thoughts about the following snippet of code (current httpd trunk):

ap_log_rdata(APLOG_MARK, APLOG_TRACE8, r, "FastCGI header",
             farray, AP_FCGI_HEADER_LEN, 0);

ap_fcgi_header_fields_from_array(&version, &type, &rid,
                                 &clen, &plen, farray);

if (version != AP_FCGI_VERSION_1) {
    ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, APLOGNO(01068)
                  "Got bogus version %d", (int)version);
    rv = APR_EINVAL;
    break;
}

if (rid != request_id) {
    ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, APLOGNO(01069)
                  "Got bogus rid %d, expected %d",
                  rid, request_id);
    rv = APR_EINVAL;
    break;
}

Also in utils_fcgi.h:

/**
 * Possible values for the version field of ap_fcgi_header
 */
#define AP_FCGI_VERSION_1 1

Checked on fluorine and got logs like:

Jun 20 06:31:56 mw1254:  [proxy_fcgi:error] [pid 43696:tid 140476912961280] [client 10.64.0.105:36820] AH01068: Got bogus version 1, referer: https://commons.wikimedia.org/
Jun 20 06:31:56 mw1271:  [proxy_fcgi:error] [pid 140227:tid 139763034076928] [client 10.64.0.103:46746] AH01068: Got bogus version 215, referer: https://nl.wikipedia.org/

We could:

  1. set LogLevel proxy_fcgi trace8 temporarily on a single machine to leverage ap_log_rdata and double check version/rid
  1. Do some tests about the condition (version != AP_FCGI_VERSION_1), because it doesn't make any sense the "Got bogus version 1" error, since AP_FCGI_VERSION_1 is one.
  1. Figure out why we get also "Got bogus version X" with X > 1

The "got bogus version 1" is a logging bug and has been fixed in the jessie apache package.

Your suggestions make a lot of sense, in fact.

From the few bug reports about this I found, it seems this error happens under high concurrency. It doesn't seem to cause any user-facing error though.

With the following code snippet I obtain:

First test:
b:1 (char) or b:49 (int) is different than a:1!
Second test:
Equal!
#include <stdio.h>

#define ONE 1

int main() {

    unsigned int c = 1;
    unsigned char one_char = '1';
    printf("First test:\n");
    if (one_char != ONE) {
        printf("b:%c (char) or b:%d (int) is different than a:%d!\n",
                one_char, (int)one_char, ONE);
    } else {
        printf("Equal!\n");
    }
    printf("Second test:\n");
    if (c != ONE) {
        printf("c:%d is different than a:%d!\n",
                c, ONE);
    } else {
        printf("Equal!\n");
    }

    return 1;
}

Possibly a type coercion problem between int and unsigned char?
EDITED: didn't see Joe's answer before writing this :)

In T73487#2393128, @Joe wrote:

The "got bogus version 1" is a logging bug and has been fixed in the jessie apache package.

Might be useful to keep track of everything:

https://bz.apache.org/bugzilla/show_bug.cgi?id=56110
https://svn.apache.org/r1564756

From 2.4.7 to 2.4.10 there are these relevant changes (from https://www.apache.org/dist/httpd/CHANGES_2.4):

*) mod_proxy_fcgi: Use apr_socket_timeout_get instead of hard-coded
   30 seconds timeout. [Jan Kaluza]
*) mod_proxy_fcgi: Fix error message when an unexpected protocol version
   number is received from the application.  PR 56110.  [Jeff Trawick]
*) core: Detect incomplete request and response bodies, log an error and
   forward it to the underlying filters. PR 55475 [Yann Ylavic]
*) mod_proxy_fcgi: Fix sending of response without some HTTP headers
   that might be set by filters.  PR 55558. [Jim Riggs <jim riggs.me>]
*) mod_proxy_fcgi: Support iobuffersize parameter.  [Jeff Trawick]
*) mod_proxy_fcgi: Fix occasional high CPU when handling request bodies.
   [Jeff Trawick]

From 2.4.10 onwards:

2.4.19

*) mod_proxy_fcgi: Suppress HTTP error 503 and message 01075, 
   "Error dispatching request", when the cause appears to be 
   due to the client closing the connection. 
   PR58118.  [Tobias Adolph <adolph lrz.de>]

This one --^ might be easy to add as patch and afaiu is one of the main issues listed in this task. Related PR https://bz.apache.org/bugzilla/show_bug.cgi?id=58118

2.4.11 (that was not released, so 2.4.12):

Changes with Apache 2.4.11 (not released)

*) SECURITY: CVE-2014-3583 (cve.mitre.org)
   mod_proxy_fcgi: Fix a potential crash due to buffer over-read, with 
   response headers' size above 8K.  [Yann Ylavic, Jeff Trawick]

*) mod_proxy_fcgi: Provide some basic alternate options for specifying 
   how PATH_INFO is passed to FastCGI backends by adding significance to
   the value of proxy-fcgi-pathinfo. PR 55329. [Eric Covener]
 
*) mod_proxy_fcgi: Enable UDS backends configured with SetHandler/RewriteRule
   to opt-in to connection reuse and other Proxy options via explicitly
   declared "proxy workers" (<Proxy unix:... enablereuse=on max=...)
   [Eric Covener]

*) mod_proxy: Add "enablereuse" option as the inverse of "disablereuse".
   [Eric Covener]

*) mod_proxy_fcgi: Enable opt-in to TCP connection reuse by explicitly
   setting proxy option disablereuse=off. [Eric Covener] PR 57378.

*) mod_proxy_fcgi: Remove proxy:balancer:// prefix from SCRIPT_FILENAME
   passed to fastcgi backends. [Eric Covener]

*) mod_proxy_fcgi: Ignore body data from backend for 304 responses. PR 57198.
   [Jan Kaluza]

*) mod_proxy_fcgi, mod_authnz_fcgi: stop reading the response and issue an
   error when parsing or forwarding the response fails. [Yann Ylavic]

*) mod_proxy_fcgi: Fix faulty logging of large amounts of stderr from the
   application.  PR 56858.  [Manuel Mausz <manuel-asf mausz.at>]

Also the Bogus request id is surely a bug since you can read this comment in the function that calls dispatch (that logs the error):

/*
 * process the request and write the response.
 */
static int fcgi_do_request(apr_pool_t *p, request_rec *r,
                           proxy_conn_rec *conn,
                           conn_rec *origin,
                           proxy_dir_conf *conf,
                           apr_uri_t *uri,
                           char *url, char *server_portstr)
{
    /* Request IDs are arbitrary numbers that we assign to a
     * single request. This would allow multiplex/pipelining of
     * multiple requests to the same FastCGI connection, but
     * we don't support that, and always use a value of '1' to
     * keep things simple. */
    apr_uint16_t request_id = 1;

Interesting note about "AH01069: Got bogus rid 9030, expected 1": I checked access logs related to some error from https://logstash.wikimedia.org/#/dashboard/elasticsearch/apache2log and I always found something like:

2016-07-03T21:12:26     47794      proxy-server/503        0       GET     http://meta.wikimedia.org/w/index.php

This matches what I have noticed in the code, namely that mod-proxy-fcgi reads the FCGI response headers and skips the body if the request id is not 1. The value should be used to multiplex multiple request over the same FCGI connection but it is not supported by mod-proxy-fcgi, so it seems that something mangles the FCGI request id.

Not sure about what the "9030" value means since it is always present.

Mentioned in SAL [2016-07-05T12:11:35Z] <elukey> depooling/re-pooling mw1024.eqiad.wmnet to temporarily set up trace8 logging (503 investigation - T73487)

So since I am very lucky, I discovered that the FCGI header dump has been released only on 2.4.17:

https://github.com/apache/httpd/commit/1dc13b934b683427e2bb906ced97804ec5318266

I tried anyway on mw1024 and got:

[Tue Jul 05 14:11:11.927064 2016] [proxy_fcgi:trace4] [pid 21493:tid 139785842632448] util_script.c(523):
Content-Encoding: gzip, referer: https://ru.wikipedia.org

[Tue Jul 05 14:11:11.927070 2016] [proxy_fcgi:trace4] [pid 21493:tid 139785842632448] util_script.c(523):
Content-Length: 10048, referer: https://ru.wikipedia.org

[Tue Jul 05 14:11:11.927076 2016] [proxy_fcgi:trace4] [pid 21493:tid 139785842632448] util_script.c(523):
Vary: Accept-Encoding, Cookie, Authorization, referer: https://ru.wikipedia.org

[Tue Jul 05 14:11:11.927082 2016] [proxy_fcgi:trace4] [pid 21493:tid 139785842632448] util_script.c(523):
Last-modified: Fri, 11 Sep 2015 02:13:37 GMT, referer: https://ru.wikipedia.org

[Tue Jul 05 14:11:11.927110 2016] [proxy_fcgi:error] [pid 21493:tid 139785842632448]
AH01070: Error parsing script headers, referer: https://ru.wikipedia.org

[Tue Jul 05 14:11:11.927124 2016] [proxy_fcgi:error] [pid 21493:tid 139785842632448]
AH01069: Got bogus rid 9030, expected 1, referer: https://ru.wikipedia.org

[Tue Jul 05 14:11:11.927141 2016] [proxy_fcgi:error] [pid 21493:tid 139785842632448] (22)Invalid argument:
AH01075: Error dispatching request to :, referer: https://ru.wikipedia.org

Highlight: Content-Length: 10048 --> so it seems that a response is indeed returned (doesn't tell us anything about the fcgi headers)

Since AH01075 was resolved simply checking if the connection was aborted or not by the client, a similar problem might affect AH01069 that comes right before it in the code.

Mentioned in SAL [2016-07-06T11:54:19Z] <elukey> depooling mw1261.eqiad.wmnet to raise Apache's mod-fcgi to trace8 for 503 investigation - T73487 (this will probably slow down a bit the host)

Performed the same test on mw1261, same result. I compared one meta.wikimedia.org request ending up with AH01609 and a "regular" one, the only diff that I found was the absence of "Set-Cookie: centralauth_Token" in the former.

Mentioned in SAL [2016-07-06T13:35:53Z] <elukey> depooling mw1261.eqiad to restore previous fcgi logging settings (T73487)

well @elukey this is indeed interesting and gives an explanation of why we don't see these errors on the API cluster.

It *clould* be due to some centralAuth_token that apache cannot understand. Decidedly worth investigating better.

Mentioned in SAL [2016-07-06T15:00:17Z] <_joe_> depooling mw1261, installing an apache package with additional fixes (T73487)

In the end, we decided we need the following patches:

Work on porting both patches is ongoing, the second one in particular requires some pretty massive backporting and is probably not worth it besides for this investigation.

Mentioned in SAL [2016-07-07T10:11:06Z] <elukey> pooling mw1261 back to service with Apache mod-proxy-fcgi set to trace8 (T73487)

Mentioned in SAL [2016-07-07T12:05:29Z] <elukey> depooling mw1261 from service (T73487)

More information thanks to @Joe's patch to log FCGI headers.

Afaiu mod-proxy-fcgi does the following for each request:

  1. sends AP_FCGI_BEGIN_REQUEST to let hhvm knows that a request is incoming.
  2. sends the environment variables with all the details of the request to hhvm.
  3. reads HHVM's response headers first (rather than the whole response) and decides what to do. For example, it might discover that the FCGI_VERSION is wrong (not 1 since it is the only value supported) and abort (ending up in AH01075) or reading the rest of the response and then proceed with flushing.

In this particular case it seems that the response is correctly read and flushed (so first round of FCGI headers are ok) but then for some reason mod-proxy-fcgi decides to keep going reading again FCGI headers, getting garbage.

This "garbage" can be of two kind:

  1. not starting with 01, triggering "Bogus version X, expected 1" AH01068
  2. starting with 01 but not carrying 01 where the request id should be (only value supported), triggering "Bogus rid X, expected 1" AH01069

The final value is AH01075, that basically states "something went wrong while reading FCGI, returning 503"

So the next steps is to figure out:

  1. if we can find a repro use case, maybe checking the error.log on mw1261
  2. figure out if this issue has already been resolved by more up to date versions of mod-proxy-fcgi (for example checking https://github.com/apache/httpd/commits/2.4.x/modules/proxy/mod_proxy_fcgi.c)

For anybody that wants to help, there is a complete error log in mw1261.eqiad.wmnet:/home/elukey/error_log_elukey_httpd_patched.log (read allowed only to root due to sensitive data on it, please let me know if you want to read it but don't have access)

@elukey it would be useful to save the corresponding access log as well.

Yes definitely!

mw1261.eqiad.wmnet:/home/elukey/error_log_elukey_httpd_patched_07072016.log

mw1261.eqiad.wmnet:/home/elukey/access_log_elukey_httpd_patched_07072016.log

Finally we found a repro thanks to https://bz.apache.org/bugzilla/show_bug.cgi?id=49671.
Quoting the bug report:

If mod_proxy_fcgi is handling a request it encounters an error sending data to the client 
(most commonly because the client has disconnected) then it will often stop processing 
data in the middle of an FCGI record and return from dispatch() without having read 
the entire record from the FCGI server.

When dispatch() is next entered it tries to read a new record, 
but gets bogus data from the middle of the previous record and logs an error along these likes:

[Thu Jul 29 15:48:29 2010] [error] proxy: FCGI: Got bogus version 66
[Thu Jul 29 15:48:29 2010] [error] (22)Invalid argument: proxy: FCGI: Error dispatching request to :

That explains perfectly what I was observing in the error.log set to trace8. I managed to reproduce on my Vagrant VM simply configuring httpd/mod-proxy-fcgi to use FCGI to serve all files, not only PHP, and issuing a HTTP request for the file aborting it before completion.

So the 503s that we are seeing in the access_logs (plus all the related errors in error.log) should be related to client aborting connections.

The fix should be https://svn.apache.org/r1682544, released in 2.4.16. Joe is trying to backport the patch to 2.4.10, let's see if it works. If everything goes fine it might be good to follow up with the Debian maintainers.

I updated all the related mod-proxy-fcgi apache bz tasks with some info too.

The two patches needed are the following (even simpler than what I expected):

http://svn.apache.org/r1642855
http://svn.apache.org/r1726019

Nice to have but not really necessary;

svn.apache.org/r1650677

Deployed the patched version of httpd but the errors are still there. I added the following logging:

LogLevel notice proxy_fcgi:trace8 core:trace8 http_module:trace8

and this is the result for every Bogus version and AH01075:

[Fri Jul 08 15:41:29.246709 2016] [http:trace3] http_filters.c(1003):  Response sent with status 304, headers:
[Fri Jul 08 15:41:29.246729 2016] [http:trace5] http_filters.c(1010):  Date: Fri, 08 Jul 2016 15:41:29 GMT
[Fri Jul 08 15:41:29.246734 2016] [http:trace5] http_filters.c(1013):  Server: mw1261.eqiad.wmnet
[Fri Jul 08 15:41:29.246747 2016] [proxy_fcgi:error] AH01068: Got bogus version 116
[Fri Jul 08 15:41:29.246754 2016] [proxy_fcgi:error]  (22)Invalid argument: [client 10.64.0.102:24486] AH01075: Error dispatching request to :

So all the requests with 503 and problematic logs seems to be 304 Not modified (I turned on core and http_module logs to trace8 this time, not only proxy_fcgi). We added http://svn.apache.org/viewvc?view=revision&revision=1650677 that should have solved the issue, but apparently we are still missing something.

Will try to reproduce a 304 "problematic" response in my vagrant testing environment.

I copied an example of 304 in mw1061:/home/elukey/error_log_304.log (root:adm permissions)

Is there anybody that could give me an example of request that simulates this one? @ori maybe? The goal would be to reproduce this sneaky 304 and figure out what is going wrong with httpd.

I found a repro (If-Modified needs to be modified accordingly to force a 304):

curl "localhost/w/index.php?title=MediaWiki:Gadget-ImageAnnotator.js&action=raw&ctype=text/javascript" 
-H "Host: commons.wikimedia.org" 
-i 
--header 'If-Modified-Since: Tue, 25 Aug 2015 15:19:10 GMT'

Correspondent logs:

2016-07-08T21:15:17	45908	::1	
proxy-server/503	0	
GET	http://commons.wikimedia.org/w/index.php	-	
text/html	-	-	curl/7.38.0	--	-	::1

Jul  8 21:24:11 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815613789952] 
[client ::1:40136] AH01068: Got bogus version 32

Jul  8 21:24:11 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815613789952] (22)Invalid argument: 
[client ::1:40136] AH01075: Error dispatching request to :

The question now is why a 304 generates this mess. It seems not related to http://svn.apache.org/r1650677, but something different.

On the bright side, I verified that the 304 is delivered to the client (that was my shell) without any mention of a 503 or an error. So this issue seems to be only related to httpd internals spamming logs.

Interesting fact: I tried the same curl request on mw2244 (that does not have the fix to ignore 304 response body - http://svn.apache.org/r1650677) and this is the result:

Jul  8 21:56:03 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815790036736] [client ::1:43834] AH01070: Error parsing script headers
Jul  8 21:56:03 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815790036736] [client ::1:43834] AH01068: Got bogus version 32
Jul  8 21:56:03 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815790036736] (22)Invalid argument: [client ::1:43834] AH01075: Error dispatching request to :

The error message "Error parsing script headers" is exactly the one that the above patch corrected:

                                 APR_BRIGADE_INSERT_TAIL(ob, tmp_b);
                                 r->status = status;
                                 ap_pass_brigade(r->output_filters, ob);
-                                ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, APLOGNO(01070)
-                                              "Error parsing script headers");
-                                rv = APR_EINVAL;
+                                if (status == HTTP_NOT_MODIFIED) {
+                                    /* The 304 response MUST NOT contain
+                                     * a message-body, ignore it. */
+                                    ignore_body = 1;
+                                }
+                                else {
+                                    ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, APLOGNO(01070)
+                                                    "Error parsing script headers");
+                                    rv = APR_EINVAL;
+                                }
                                 break;

Since mw1261 does not show that error message, it seems that we are indeed in the case of a 304 response carrying payload.

we are either hitting a new mod-proxy-fcgi bug or we are just missing the commit that fixes it :)

So you at least have a proof AH01070: Error parsing script headers is fixed!

I found a repro (If-Modified needs to be modified accordingly to force a 304):

curl "localhost/w/index.php?title=MediaWiki:Gadget-ImageAnnotator.js&action=raw&ctype=text/javascript" 
-H "Host: commons.wikimedia.org" 
-i 
--header 'If-Modified-Since: Tue, 25 Aug 2015 15:19:10 GMT'

Correspondent logs:

2016-07-08T21:15:17	45908	::1	
proxy-server/503	0	
GET	http://commons.wikimedia.org/w/index.php	-	
text/html	-	-	curl/7.38.0	--	-	::1

Jul  8 21:24:11 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815613789952] 
[client ::1:40136] AH01068: Got bogus version 32

Jul  8 21:24:11 mw2244 apache2[78184]: [proxy_fcgi:error] 
[pid 78184:tid 139815613789952] (22)Invalid argument: 
[client ::1:40136] AH01075: Error dispatching request to :

The question now is why a 304 generates this mess. It seems not related to http://svn.apache.org/r1650677, but something different.

On the bright side, I verified that the 304 is delivered to the client (that was my shell) without any mention of a 503 or an error. So this issue seems to be only related to httpd internals spamming logs.

If you try this with telnet, not curl, you'll see that for some reason the 304 body is still returned. I fear I made an error when rebuilding the package or that the 304_no_body patch was incomplete.

After a bit of digging, it seems that the AH01075/AH01068 error messages can be reproduced on httpd trunk too. I opened a bugzilla PR to propose a fix, let's see how it goes: https://bz.apache.org/bugzilla/show_bug.cgi?id=59838 (more info about how to repro and basic info in there!)

So the next steps are:

  1. wait for the PR
  2. double check if the deb package built by Joe contains all the patches that is supposed to have (even if I am pretty confident that it has everything). Hopefully this will fix the 304 body problem returned but I suspect that we are missing another commit to backport..

So, by combining elukey's patch and the original patch for not sending out a body on a 304, the bogus logs are effectively gone,

As soon as we feel confident, we can upgrade the apache package across all jessies, and then resolve this ticket.

Awesome! As a side note, the logstash syslog filter discard a few Apache errors. From https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/files/logstash/filter-syslog.conf;f878f4c4634277750ade1d92a6e4594ec6300cb1$82-92

if [program] == "apache2" {
  # Ignore known mod_proxy_fcgi bugs
  if [message] =~ /AH01070: Error parsing script headers/ {
    drop {}
  }
  if [message] =~ /AH01068: Got bogus version \d/ {
    drop {}
  }
  if [message] =~ /AH01075: Error dispatching request to :/ {
    drop {}
  }

Might want to stop dropping them.

@hashar: thanks for the info! I was wondering why apache2log wasn't showing all the spam :)

@Joe: I found some occurrences of AH01075 due to client disconnects, I believe we forgot to add http://svn.apache.org/r1726019

@elukey I didn't add it on purpose, as I was pretty sure it was a minor effect; and yes, the final version of our package will have one big patch for mod_proxy_fcgi.c that includes that as well.

I just uploaded the latest version of the apache package to reprepro, now installing it on mw1262.

Checked if a 304 response would return the body with the new httpd version:

elukey@mw1261:~$ telnet localhost 80
Trying ::1...
Connected to localhost.
Escape character is '^]'.
GET /w/index.php?title=MediaWiki:Gadget-ImageAnnotator.js&action=raw&ctype=text/javascript HTTP/1.1
Host: commons.wikimedia.org
If-Modified-Since: Tue, 25 Aug 2015 15:19:10 GMT

HTTP/1.1 304 Not Modified
Date: Mon, 01 Aug 2016 07:34:46 GMT
Server: mw1261.eqiad.wmnet
Cache-control: public, s-maxage=300, max-age=1209600
Vary: Accept-Encoding,Cookie,Authorization

Connection closed by foreign host.

The body is not returned anymore but telnet seems to wait a bit after the headers have been returned. I suspect that this behavior might be related to one of the patches added that forces httpd to read the whole content of the response returned by HHVM before giving up, even if it will discard it right afterwards. This avoids subsequent spurious reads (mod_proxy_fcgi reading from HHVM) that turns into 503 and bogus logging.

So the new version of httpd seems working as it should but there might be the chance to improve even more if:

  • (preferred) httpd would be able to avoid waiting for the whole body from HHVM before terminating the HTTP response.
  • HHVM would be able to return a 304 itself rather than the whole content with a Last-Modified header and related.

At the moment the Jessie version of httpd returns 304s with payload so the new behavior will not add any latency to these calls, but will avoid to send unnecessary bytes.

So the new version of httpd seems working as it should but there might be the chance to improve even more if:

  • (preferred) httpd would be able to avoid waiting for the whole body from HHVM before terminating the HTTP response.

For example:

else if (status == HTTP_NOT_MODIFIED) {
    /* A 304 response MUST NOT contain
     * a message-body, so we must ignore it but
     * some extra steps need to be taken to
     * avoid inconsistencies.
     * The break is not added with connection
     * reuse set since there might be more bytes
     * to read like the message-body, that would 
     * trigger subsequent bogus reads (for example
     * the start of the message-body
     * interpreted as a FCGI header).
     * With connection reuse disabled (default)
     * we can safely break and force the end
     * of the FCGI processing phase since the
     * connection will be cleaned up later on. */
    ignore_body = 1;
    if (conn->close) {
        done = 1;
        break;
    }
}

I am going to follow up with upstream to see if this change is worth or not. I did some testing this morning with and without connection reuse (httpd-trunk) and it seems working fine.

Mentioned in SAL [2016-08-03T08:31:02Z] <elukey> upgrading httpd on mw126[34] to 2.4.10-10+deb8u4+wmf3 (T73487)

After a long chat with upstream we decided not to go ahead with this patch since it wouldn't give us a lot of benefit plus it wouldn't respect a lot the FCGI protocol.

Some gotchas for whoever is interested:

  1. The FCGI script/backend (in our case HHVM) follows the FCGI protocol and it should always send a FCGI End of Request header. httpd should wait for it even if in some cases (like 304s) the connection could theoretically be dropped beforehand. Adding shortcuts for performance improvements might not be the best thing in the longer term because it could introduce too much variability.
  1. I observed the delay between headers and end of request only with telnet, not with curl or other tools, so I got fooled by it. Bad Luca is bad.
  1. mod_proxy_fcgi sends and EOS (end of stream) bucket (basically either bytes or metadata grouped) right after it has seen the FCGI end of request, so the client will not wait for httpd to finish its internal stuff. The httpd thread will be blocked of course and there are some suggestions in the httpd dev mailing list about how to improve the waiting time after the FCGI end of Request header, but they would rely on other big changes in trunk. A bit overkill for us, might be something for the future if we see performance issues with this version of httpd (I don't expect anything like this though).

Summary: let's go ahead with the deployment of the current package :)

Mentioned in SAL [2016-08-04T10:33:03Z] <elukey> upgrading httpd on mw126[56] to 2.4.10-10+deb8u4+wmf3 (T73487)

Mentioned in SAL [2016-08-05T09:12:49Z] <elukey> upgrading httpd on mw126[78] to 2.4.10-10+deb8u4+wmf3 (T73487)

Mentioned in SAL [2016-08-10T14:40:39Z] <elukey> depooling mw1261 to install/test apache2_2.4.10-10+deb8u6+wmf1_amd64.deb (T73487). After basic checks the host will get back into service with weight 5.

Summary and news:

  1. Thanks to @Joe and @MoritzMuehlenhoff we have a new version of httpd for jessie-wikimedia (2.4.10-10+deb8u6+wmf2) that is based on the latest Debian upstream package version.
  2. All the jessie based appservers have the new patched httpd, meanwhile we decided not to work on the Ubuntu package to avoid major headaches.
  3. The HTTP 304 patch that we proposed to upstream has been merged to the 2.4.x branch and will be part of the next httpd release.

The issue will be completely resolved when all the appservers will be migrated to Jessie. I'd say that this phab task can be closed now.

Congratulations !

Might want a low priority task to later remove the logstash filter:

files/logstash/filter-syslog.conf
if [message] =~ /AH01075: Error dispatching request to :/ {
   drop {}
}

Change 306943 had a related patch set uploaded (by BryanDavis):
logstash: Stop dropping mod_proxy_fcgi warnings

https://gerrit.wikimedia.org/r/306943

As FYI we'll be able to merge the above patch only when https://phabricator.wikimedia.org/T143536 will be completed!

Change 306943 merged by Elukey:
logstash: Stop dropping mod_proxy_fcgi warnings

https://gerrit.wikimedia.org/r/306943

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:12 PM