
Varnishlog with Start timestamp but no Resp one causing data consistency check alarms
Closed, Resolved · Public · 13 Estimated Story Points

Description

Hello Traffic team,

we have been receiving data consistency check alarms for Hadoop Webrequest data, all of them for the upload segment. I created several varnishlog instances in tmux sessions on upload hosts with the following parameters:

sudo varnishlog -c -n frontend -L 5000 -T 1500 -q 'VSL or (Timestamp:Start and not Timestamp:Resp)' | tee timeouts.txt

So basically what I wanted to log was:

  1. Any frontend request with the VSL tag (usually containing errors like timeout or store overflow, related to the VSL library).
  2. Any frontend request with a Start timestamp but no corresponding Resp one.

After a few days of logging on cp3044, this is the breakdown:

elukey@cp3044:~$ grep -c Begin timeouts.txt
7972
elukey@cp3044:~$ grep -c "store overflow" timeouts.txt
5041
elukey@cp3044:~$ grep -c "timeout" timeouts.txt
0
elukey@cp3044:~$ grep -c "HttpGarbage" timeouts.txt
1719
elukey@cp3044:~$ grep -c "facebookexternalhit/1.1" timeouts.txt
1209

So, of the 7972 requests logged:

  1. 5041 were related to store overflow, so varnishkafka needs to be reconfigured to keep more incomplete VSL records in memory (at the moment only 5000).
  2. 1719 were requests like the following:
*   << Request  >> 169588668
-   Begin          req 169588667 rxreq
-   Timestamp      Start: 1476449181.932622 0.000000 0.000000
-   Timestamp      Req: 1476449181.932622 0.000000 0.000000
-   BogoHeader     Header has ctrl char 0x04
-   HttpGarbage    "GET%00"
-   ReqAcct        640 0 640 28 0 28
-   End
  3. 1209 were requests from the Facebook crawler, all for the following URI:
/wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg

The interesting bit is that the URI results in an HTTP 400 most of the time, but once in a while we get something like the following (IPs removed just in case; full content available in cp3044:~elukey/timeouts.txt):

*   << Request  >> 27432747
-   Begin          req 27432746 rxreq
-   Timestamp      Start: 1476662854.478842 0.000000 0.000000
-   Timestamp      Req: 1476662854.478842 0.000000 0.000000
-   ReqStart       10.20.0.179 9470
-   ReqMethod      GET
-   ReqURL         /wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: upload.wikimedia.org
-   ReqHeader      X-Client-IP: [IPv6 cut]
-   ReqHeader      X-Forwarded-For: [IPv6 cut]
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      X-Connection-Properties: H2=0; SSR=0; SSL=[cut]; C=[cut]; EC=[cut];
-   ReqHeader      Connection: close
-   ReqHeader      User-Agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
-   ReqHeader      Accept: */*
-   ReqHeader      Accept-Encoding: deflate, gzip
-   ReqUnset       X-Forwarded-For: [IPv6 cut]
-   ReqHeader      X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   VCL_call       RECV
-   ReqUnset       Host: upload.wikimedia.org
-   ReqHeader      Host: upload.wikimedia.org
-   VCL_acl        MATCH wikimedia_trust "10.0.0.0"/8
-   VCL_acl        MATCH local_host "10.20.0.179"
-   ReqUnset       X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   ReqHeader      X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   ReqUnset       X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   ReqHeader      X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   ReqUnset       X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   ReqHeader      X-Forwarded-For: [IPv6 cut], 10.20.0.179
-   ReqHeader      via-nginx: 1
-   ReqHeader      X-Trusted-Proxy:
-   ReqUnset       X-Trusted-Proxy:
-   ReqHeader      X-Carrier:
-   ReqUnset       X-Carrier:
-   Debug          "VSLP picked preferred backend 10 for key 6db6c94d"
-   ReqHeader      X-WMF-NOCOOKIES: 1
-   ReqURL         /wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg
-   VCL_return     hash
-   ReqUnset       Accept-Encoding: deflate, gzip
-   ReqHeader      Accept-Encoding: gzip
-   VCL_call       HASH
-   VCL_return     lookup
-   ReqAcct        631 0 631 0 0 0
-   End

I tried to access the upload URL from my browser, and on cp3044 I can correctly see a log entry with Timestamp:Resp and the HTTP 400 status as expected. The big difference seems to come after the "VSLP picked preferred backend 10 for key 6db6c94d" Debug entry, so I am wondering whether a backend being down or temporarily unavailable could cause these spurious logs.

I created https://gerrit.wikimedia.org/r/#/c/316306 to solve the first two issues (HttpGarbage and store overflow).

Event Timeline


Change 316306 had a related patch set uploaded (by Elukey):
Tune varnishkafka-webrequest parameters

https://gerrit.wikimedia.org/r/316306

elukey triaged this task as Medium priority.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

I tried to compare a Miley Cyrus request that correctly logged a 400 (and Timestamp:Resp) with a "bad" one, and two things came up:

Full command:

elukey@cp3044:~$ diff ~elukey/miley_cyrus_ok.txt ~elukey/miley_cyrus_not_ok.txt

#1

< -   ReqHeader      Accept: */*
< -   ReqHeader      Accept-Encoding: deflate, gzip

#2

< -   VCL_call       MISS
< -   ReqHeader      X-CDIS: miss
< -   VCL_return     fetch
< -   Link           bereq 376061956 fetch
< -   Timestamp      Fetch: 1476774839.223294 0.334511 0.334511
< -   RespProtocol   HTTP/1.1
< -   RespStatus     400
< -   RespReason     Bad Request
[ other content cut ]
< -   VCL_call       DELIVER
< -   RespUnset      X-Cache-Int: cp1099 miss, cp3048 miss
< -   RespHeader     X-Cache-Int: cp1099 miss, cp3048 miss, cp3044 miss
< -   RespHeader     X-Cache: cp1099 miss, cp3048 miss, cp3044 miss
< -   RespUnset      X-Cache-Int: cp1099 miss, cp3048 miss, cp3044 miss
< -   RespHeader     X-Cache-Status: miss
[ other content cut ]
< -   VCL_return     deliver
< -   Timestamp      Process: 1476774839.223340 0.334557 0.000046
< -   Debug          "RES_MODE 2"
< -   RespHeader     Connection: close
< -   Timestamp      Resp: 1476774839.223366 0.334583 0.000026
< -   ReqAcct        631 0 631 1052 319 1371

I tried a curl request with --header "Accept-Encoding: " but couldn't reproduce the bad case, so there might be something going on with VCL_call MISS?

Change 316306 merged by Elukey:
Tune varnishkafka-webrequest parameters

https://gerrit.wikimedia.org/r/316306

Found other occurrences of the same issue but with different URIs on other hosts:

-   Debug          "VSLP picked preferred backend  6 for key 586e03e3"
-   ReqURL         /wikipedia/commons/thumb/c/c8/Canonici_-_Prospetto_biografico_delle_donne_italiane.djvu/page42-1024px-Canonici_-_Prospetto_biografico_delle_donne_italiane.djvu.jpg
-   VCL_return     hash
-   ReqUnset       accept-encoding: gzip, deflate, br
-   ReqHeader      Accept-Encoding: gzip
-   VCL_call       HASH
-   VCL_return     lookup
-   ReqAcct        690 0 690 0 0 0
-   End




-   Debug          "VSLP picked preferred backend  8 for key 7503265e"
-   ReqHeader      X-WMF-NOCOOKIES: 1
-   ReqURL         /wikipedia/commons/thumb/1/14/Mahuri.svg/1280px-Mahuri.svg.png
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   ReqAcct        354 0 354 0 0 0
-   End

I'll try to revise the varnishlog query to also include backend tags, but it is currently a bit difficult since "Timestamp:Start and not Timestamp:Resp" is always true for a backend transaction, so varnishlog ends up logging everything.

Going to try with

sudo varnishlog -g request -n frontend -L 10000 -T 1500 -q 'VSL or (Timestamp:Start and not Timestamp:Resp) and not HttpGarbage' | tee timeouts.txt

Some requests got logged with the new query string, but there is no trace of backend tags. Maybe this is a corner case of VSLP?

One thing worth mentioning: the ReqAcct tag in the logged requests without Timestamp:Resp always has this format:

- ReqAcct 668 0 668 0 0 0

From https://www.varnish-cache.org/docs/4.1/reference/vsl.html

ReqAcct - Request handling byte counts
Contains byte counts for the request handling. ESI sub-request counts are also added to their parent request. The body bytes count does not include transmission (ie: chunked encoding) overhead. The format is:

%d %d %d %d %d %d
|  |  |  |  |  |
|  |  |  |  |  +- Total bytes transmitted
|  |  |  |  +---- Body bytes transmitted
|  |  |  +------- Header bytes transmitted
|  |  +---------- Total bytes received
|  +------------- Body bytes received
+---------------- Header bytes received

So it seems that an HTTP header is received for the request (as expected) but no bytes are sent back to the client.

What kind of request is it? Maybe something like HttpGarbage that we shouldn't log?
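
For triage, a small helper along these lines (a hypothetical script, not part of our tooling) pulls such transactions out of a saved capture like timeouts.txt; it assumes the grouped varnishlog output format shown above:

# find_zero_tx.py: flag transactions whose ReqAcct reports bytes received
# but zero bytes transmitted back to the client.
import re
import sys

tx_id = None
for line in sys.stdin:
    m = re.search(r"<<\s*Request\s*>>\s*(\d+)", line)
    if m:
        tx_id = m.group(1)
        continue
    m = re.search(r"ReqAcct\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", line)
    if m and tx_id:
        hdr_rx, body_rx, tot_rx, hdr_tx, body_tx, tot_tx = map(int, m.groups())
        if tot_rx > 0 and tot_tx == 0:
            print(f"tx {tx_id}: received {tot_rx} bytes, transmitted 0")

Usage: python3 find_zero_tx.py < timeouts.txt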

After a chat with @ema we decided to test a very basic case, namely whether a TCP RST from the client could cause this issue. Nginx terminates TLS on our caching layer (speaking HTTP/1.x or H2 with the external client) and then opens a TCP socket to Varnish, speaking HTTP/1.0 on top of it. It seems that when the OpenSSL bug happens, Nginx sends a RST to Varnish.

So on my Vagrant VM I tested a simple Python script that causes a RST (easy enough with SO_LINGER set and a linger timeout of 0), but I wasn't able to reproduce the issue, since I found this in varnishlog:

  • Debug "Write error, retval = -1, len = 350, errno = Connection reset by peer"
  • Timestamp Resp: 1478085444.365348 10.001466 0.000036
  • ReqAcct 27 0 27 255 95 350
  • End

This is expected, since client disconnects happen all the time, and we would have been seeing tons of daily errors if a TCP RST were the cause. I suspect, though, that the more complex production scenario, together with certain upload requests triggering a backend fetch, can cause this behavior (the absence of Timestamp:Resp), but finding a repro might not be easy.
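
For reference, the RST trick looks roughly like this (a minimal sketch of the test script; the target host/port and request line are assumptions):

# rst_client.py: send a request, then abort the connection with a RST
# instead of a FIN by closing with SO_LINGER {l_onoff=1, l_linger=0}.
import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("127.0.0.1", 6081))   # assumed: Varnish listening locally
s.sendall(b"GET /index.php HTTP/1.0\r\nHost: localhost\r\n\r\n")
# l_onoff=1, l_linger=0: close() drops the connection with a RST
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
s.close()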

The issue reappeared in upload today (early UTC morning), all concentrated in ulsfo. I managed to capture 300MB of logs from varnishlog -g raw -n frontend, as suggested by upstream a while ago, now stored in cp4006:/home/elukey/raw_04_nov.log.

I checked tags with no transaction id (also suggested by upstream) but only found occurrences of ExpKill and BackendHealth, nothing related to Timestamp:Resp or similar.

A faulty logged request looks like this:

472422175 Begin          c sess 0 HTTP/1
472422175 SessOpen       c 10.128.0.106 42233 :3121 10.128.0.106 3121 1478256195.672734 561
472422175 Link           c req 472422176 rxreq
472422176 Begin          c req 472422175 rxreq
472422176 Timestamp      c Start: 1478256195.672745 0.000000 0.000000
472422176 Timestamp      c Req: 1478256195.672745 0.000000 0.000000
472422176 ReqStart       c 10.128.0.106 42233
472422176 ReqMethod      c GET
472422176 ReqURL         c /wikipedia/commons/thumb/6/63/Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg/720px-Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg
472422176 ReqProtocol    c HTTP/1.1
472422176 ReqHeader      c Host: upload.wikimedia.org
472422176 ReqHeader      c X-Client-IP: [IPv6 cut]:0:8000
472422176 ReqHeader      c X-Forwarded-For: [IPv6 cut]::8000
472422176 ReqHeader      c X-Forwarded-Proto: https
472422176 ReqHeader      c X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES256-GCM-SHA384; EC=prime256v1;
472422176 ReqHeader      c Connection: close
472422176 ReqHeader      c User-Agent: [cut]:
472422176 ReqHeader      c Accept: */*
472422176 ReqHeader      c Accept-Encoding: deflate, gzip
472422176 ReqUnset       c X-Forwarded-For: [IPv6 cut]::8000
472422176 ReqHeader      c X-Forwarded-For: [IPv6 cut]::8000, 10.128.0.106
472422176 VCL_call       c RECV
472422176 ReqUnset       c Host: upload.wikimedia.org
472422176 ReqHeader      c Host: upload.wikimedia.org
472422176 VCL_acl        c MATCH wikimedia_trust "10.0.0.0"/8
472422176 VCL_acl        c MATCH local_host "10.128.0.106"
472422176 ReqUnset       c X-Forwarded-For: [IPv6 cut]::8000, 10.128.0.106
472422176 ReqHeader      c X-Forwarded-For: [IPv6 cut]::8000, 10.128.0.106
472422176 ReqUnset       c X-Forwarded-For: [IPv6 cut]::8000, 10.128.0.106
472422176 ReqHeader      c X-Forwarded-For: [IPv6 cut]::8000, 10.128.0.106
472422176 ReqUnset       c X-Forwarded-For: [IPv6 cut]::8000, 10.128.0.106
472422176 ReqHeader      c X-Forwarded-For:[IPv6 cut]::8000, 10.128.0.106
472422176 ReqHeader      c via-nginx: 1
472422176 ReqHeader      c X-Trusted-Proxy:
472422176 ReqUnset       c X-Trusted-Proxy:
472422176 ReqHeader      c X-Carrier:
472422176 ReqUnset       c X-Carrier:
472422176 Debug          c "VSLP picked preferred backend  2 for key 91c72127"
472422176 ReqHeader      c X-WMF-NOCOOKIES: 1
472422176 ReqURL         c /wikipedia/commons/thumb/6/63/Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg/720px-Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg
472422176 VCL_return     c hash
472422176 ReqUnset       c Accept-Encoding: deflate, gzip
472422176 ReqHeader      c Accept-Encoding: gzip

One relevant detail is that ReqURL is logged twice only in the faulty requests.
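
Spotting these in a raw capture can be done with something like the following (hypothetical helper; it assumes the vxid-prefixed varnishlog -g raw format shown above):

# dup_requrl.py: list client transactions that log ReqURL more than once.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    parts = line.split(None, 3)   # vxid, tag, side, value
    if len(parts) == 4 and parts[1] == "ReqURL" and parts[2] == "c":
        counts[parts[0]] += 1

for vxid, n in sorted(counts.items()):
    if n > 1:
        print(f"vxid {vxid}: ReqURL logged {n} times")

Usage: python3 dup_requrl.py < raw_04_nov.log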

I tried the following on cp4006:

curl "http://localhost/wikipedia/commons/thumb/6/63/Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg/720px-Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg" --header "Host: upload.wikimedia.org" --header 'X-Forwarded-Proto: https' -i

The request takes ages to return the response (which is a 404).

The same request takes ages to complete on cp400[67] but completes very quickly on cp1099:

curl "http://localhost/wikipedia/commons/thumb/6/63/Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg/720px-Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg" --header "Host: upload.wikimedia.org" --header 'X-Forwarded-Proto: https' -i

Could it be a VSLP consistent hashing config issue?

Quick summary:

Brandon and Ema debugged the upload issue and figured out that it was related to the absence of a hit-for-pass code snippet in the upload VCL, fixed with https://gerrit.wikimedia.org/r/#/c/319869/

Varnish does a clever thing when multiple clients request the same resource from a backend: it coalesces the requests, making only one backend call and returning the result to all the clients. This way the backend is not hammered, but the client requests are put on a waiting list that can grow a lot.
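
For context, the general hit-for-pass pattern in Varnish 4 VCL looks roughly like this (a generic sketch, not the actual production change; see the Gerrit patch above for that):

sub vcl_backend_response {
    if (beresp.ttl <= 0s) {
        # Uncacheable response: create a hit-for-pass object so that
        # concurrent clients go straight to the backend instead of
        # piling up on the waiting list behind a single fetch.
        set beresp.uncacheable = true;
        set beresp.ttl = 120s;
        return (deliver);
    }
}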

Text caches in ulsfo and codfw have been migrated to Varnish 4, and the Analytics alarm triggered again. This time it was (mostly) for this request:

https://id.wikipedia.org/w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&action=render

It seems to be an issue with the waiting list again: a lot of requests logged without Timestamp:Resp.

I played a bit with Varnish's settings on my Vagrant VM and managed to reproduce the incomplete logs.

Varnish setup:

ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret \
-p first_byte_timeout=5 \
-p send_timeout=10 -p between_bytes_timeout=5 \
-p thread_pool_min=5 \
-p thread_pool_max=10 \
-p thread_pool_timeout=120 \
-p vsl_reclen=2048 -s malloc,256M

Varnish VCL:

backend default {
    .host = "127.0.0.1";
    .port = "80";
}

Apache httpd responds on port 80, serving a PHP script that sleeps for 20 seconds before returning.
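
For a self-contained version of the same repro, the slow origin can also be sketched in Python instead of Apache+PHP (a hypothetical stand-in; any origin that sleeps before answering will do):

# slow_backend.py: HTTP origin that sleeps 20 seconds per request.
# Binding port 80 needs root; otherwise adjust .port in the VCL backend.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(20)            # simulate the sleeping PHP script
        body = b"slow response\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 80), SlowHandler).serve_forever()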

Running ab with high concurrency (ab -n 10000 -c 1000 http://localhost:6081/index.php, with Varnish listening on port 6081) causes a lot of requests to fail and our dear incomplete logs to be emitted:

*   << Request  >> 623140
-   Begin          req 623139 rxreq
-   Timestamp      Start: 1478518277.456461 0.000000 0.000000
-   Timestamp      Req: 1478518277.456461 0.000000 0.000000
-   ReqStart       ::1 59121
-   ReqMethod      GET
-   ReqURL         /index.php?a=1
-   ReqProtocol    HTTP/1.0
-   ReqHeader      Host: localhost:6081
-   ReqHeader      User-Agent: ApacheBench/2.3
-   ReqHeader      Accept: */*
-   ReqHeader      X-Forwarded-For: ::1
-   VCL_call       RECV
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   ReqAcct        95 0 95 0 0 0
-   End

From http://book.varnish-software.com/4.0/chapters/Tuning.html:

Varnish operates with multiple pools of threads. When a connection is accepted, the connection is delegated to one of these thread pools. Afterwards, the thread pool either delegates the connection to an available thread, queues the request otherwise, or drops the connection if the queue is full.

I don't think that this queue refers to the listen_depth parameter, since the queued requests have already been accepted (so a socket has been created and the HTTP request has already been sent from the client to Varnish).

thread_queue_limit seems to be the likely candidate; it defaults to 20.
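
To double-check, the parameter can be inspected (and raised) at runtime, e.g.:

varnishadm param.show thread_queue_limit
varnishadm param.set thread_queue_limit 100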

The real problem here is a mismatch between the Varnish shm log's semantics and how we're interpreting it in varnishkafka and/or Hadoop. Log entries are terminated by End, and there's no real requirement, at least in practice, that other desirable records like Resp timestamps exist. I think there will always be corner cases and technical failures where the desirable records are missing from a transaction. The existence of such incomplete records shouldn't be treated as a data inconsistency or as a reason to report bugs in the analytics pipeline itself.

However, noticing these unusual records has been extremely useful in tracking down bugs (in Varnish and/or VCL) having little to do with shm logging itself.

We should probably come up with some kind of valid interpretation of such records, using whatever information is available, and flag them as a kind of error request that we can log, analyze, and report on regularly. The data in them is probably a treasure trove of insights into corner cases.

We had a discussion on #wikimedia-traffic about this and the Analytics team completely agrees with what Brandon said.

The main current issue is that some request logs might get bucketed into the wrong hour, causing false positives.
Example: the data consistency check for hour 12 sees sequence numbers 100->1000 for cp1008; because of a missing datetime/Timestamp:Resp, another log with sequence number 10 (which should have belonged to the previous hour) ends up in the same bucket.
Our data checks will then alert that seqnos 11->99 are missing, firing analytics alarms.
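
A toy illustration of the failure mode (hypothetical check, not the actual Hadoop job):

# toy_gap_check.py: one misbucketed sequence number inflates the
# "missing" range computed by a naive per-hour consistency check.
def missing_seqnos(seqnos):
    present = set(seqnos)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

hour_12 = list(range(100, 1001))      # what cp1008 really sent in hour 12
print(len(missing_seqnos(hour_12)))   # 0 -> no alarm

polluted = hour_12 + [10]             # one record misbucketed from hour 11
print(len(missing_seqnos(polluted)))  # 89 -> seqnos 11..99 look "missing"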

We have been working on https://gerrit.wikimedia.org/r/#/c/319582 over the past few days to split the data consistency check alarm in two:

  1. Requests logged with what we consider partial data, like a missing Timestamp:Resp. From what we have observed, these occurrences should be rare and almost surely related to real problems.
  2. Missing sequence numbers for each hour, NOT counting the logs described in 1).

Another thing we have discussed is whether or not Varnishkafka should generate a timestamp when Timestamp:Resp is not found among the VSL tags for a request log. The Analytics team prefers to explore this possibility only after https://gerrit.wikimedia.org/r/#/c/319582 has been tested.

Last but not least, no alarms were fired for upload/text during the past two days, so this issue can be closed.

elukey set the point value for this task to 13.
