Hello Traffic team,
we have been receiving data consistency checks alarms for Hadoop Webrequest data, all of them for the upload segment. I created several instances of varnishlog in tmux sessions on upload hosts with the following parameters:
sudo varnishlog -c -n frontend -L 5000 -T 1500 -q 'VSL or (Timestamp:Start and not Timestamp:Resp)' | tee timeouts.txt
So basically what I wanted to log was:
- Any frontend request with the VSL tag (usually containing errors like timeout or store overflow, related to the VSL library).
- Any frontend request with a Start timestamp but no correspondent Resp one.
After some days of logs on cp3044 this is the breakdown:
elukey@cp3044:~$ grep -c Begin timeouts.txt 7972 elukey@cp3044:~$ grep -c "store overflow" timeouts.txt 5041 elukey@cp3044:~$ grep -c "timeout" timeouts.txt 0 elukey@cp3044:~$ grep -c "HttpGarbage" timeouts.txt 1719 elukey@cp3044:~$ grep -c "facebookexternalhit/1.1" timeouts.txt 1209
So over 7972 requests logged:
- 5041 were relate to store overflow , so varnishkafka needs to be reconfigured to keep more incomplete VSL records in memory (atm only 5000)
- 1719 requests like the following:
* << Request >> 169588668 - Begin req 169588667 rxreq - Timestamp Start: 1476449181.932622 0.000000 0.000000 - Timestamp Req: 1476449181.932622 0.000000 0.000000 - BogoHeader Header has ctrl char 0x04 - HttpGarbage "GET%00" - ReqAcct 640 0 640 28 0 28 - End
- 1209 request from the Facebook crawler, requesting the following URI:
/wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg
The interesting bit is that the URI originates a HTTP 400 most of the times, but once in a while we get something like (IPs removed just in case, content available on cp3044:~elukey/timeouts.txt):
* << Request >> 27432747 - Begin req 27432746 rxreq - Timestamp Start: 1476662854.478842 0.000000 0.000000 - Timestamp Req: 1476662854.478842 0.000000 0.000000 - ReqStart 10.20.0.179 9470 - ReqMethod GET - ReqURL /wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg - ReqProtocol HTTP/1.1 - ReqHeader Host: upload.wikimedia.org - ReqHeader X-Client-IP: [IPv6 cut] - ReqHeader X-Forwarded-For: [IPv6 cut] - ReqHeader X-Forwarded-Proto: https - ReqHeader X-Connection-Properties: H2=0; SSR=0; SSL=[cut]; C=[cut]; EC=[cut]; - ReqHeader Connection: close - ReqHeader User-Agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) - ReqHeader Accept: */* - ReqHeader Accept-Encoding: deflate, gzip - ReqUnset X-Forwarded-For: [IPv6 cut] - ReqHeader X-Forwarded-For: [IPv6 cut], 10.20.0.179 - VCL_call RECV - ReqUnset Host: upload.wikimedia.org - ReqHeader Host: upload.wikimedia.org - VCL_acl MATCH wikimedia_trust "10.0.0.0"/8 - VCL_acl MATCH local_host "10.20.0.179" - ReqUnset X-Forwarded-For: [IPv6 cut], 10.20.0.179 - ReqHeader X-Forwarded-For: [IPv6 cut], 10.20.0.179 - ReqUnset X-Forwarded-For: [IPv6 cut], 10.20.0.179 - ReqHeader X-Forwarded-For: [IPv6 cut], 10.20.0.179 - ReqUnset X-Forwarded-For: [IPv6 cut], 10.20.0.179 - ReqHeader X-Forwarded-For: [IPv6 cut], 10.20.0.179 - ReqHeader via-nginx: 1 - ReqHeader X-Trusted-Proxy: - ReqUnset X-Trusted-Proxy: - ReqHeader X-Carrier: - ReqUnset X-Carrier: - Debug "VSLP picked preferred backend 10 for key 6db6c94d" - ReqHeader X-WMF-NOCOOKIES: 1 - ReqURL /wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg - VCL_return hash - ReqUnset Accept-Encoding: deflate, gzip - ReqHeader Accept-Encoding: gzip - VCL_call HASH - VCL_return lookup - ReqAcct 631 0 631 0 0 0 - End
I tried to access the upload URL from my browser and on cp3044 I can see correctly a log with Timestamp:Resp and the HTTP 400 status as expected. The big diff seems to be after the VSLP picked preferred backend 10 for key 6db6c94d Debug entry, so I am wondering if a backend down or temporary unavailable could cause these spurious logs.
I created https://gerrit.wikimedia.org/r/#/c/316306 to solve the first two issues (HttpGarbage and store overflow).