Page MenuHomePhabricator

ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe
Open, NormalPublic

Description

for some reason, cp4027 and cp5007 are showing higher varnish-fe failed fetches rates than nodes on the same cluster & DC using nginx as TLS terminator. See for example https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend

Details

Related Gerrit Patches:

Event Timeline

Vgutierrez triaged this task as Normal priority.Oct 8 2019, 7:38 AM
Vgutierrez created this task.

It looks to me like this is some kind of timeout issue with POST requests, checking the output of varnishlog -n frontend -q "FetchError" I'm seeing POST requests only with the following FetchErrors:

-   FetchError     req.body read error: 11 (Resource temporarily unavailable)
-   FetchError     backend write error: 11 (Resource temporarily unavailable)

or

-   FetchError     Resource temporarily unavailable
-   FetchError     straight insufficient bytes

In both cases Timestamp entries point out to a 60 seconds timeout somewhere:

-   Timestamp      Resp: 1570523668.972151 59.989519 0.000027
-   Timestamp      Resp: 1570523738.104388 59.953222 0.000043

on the ATS side it doesn't look like there is any timeout set to 60 seconds though:

vgutierrez@cp4027:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls config match timeout
NOTE: using command line path as RUNROOT
proxy.config.http.connect_attempts_timeout: 180
proxy.config.http.post_connect_attempts_timeout: 180
proxy.config.http.parent_proxy.connect_attempts_timeout: 180
proxy.config.ssl.session_cache.timeout: 7200
proxy.config.stop.shutdown_timeout: 0
proxy.config.lm.pserver_timeout_secs: 1
proxy.config.lm.pserver_timeout_msecs: 0
proxy.config.process_manager.timeout: 5
proxy.config.http.keep_alive_no_activity_timeout_in: 120
proxy.config.http.keep_alive_no_activity_timeout_out: 120
proxy.config.websocket.no_activity_timeout: 600
proxy.config.websocket.active_timeout: 3600
proxy.config.http.transaction_no_activity_timeout_in: 30
proxy.config.http.transaction_no_activity_timeout_out: 30
proxy.config.http.transaction_active_timeout_in: 900
proxy.config.http.transaction_active_timeout_out: 0
proxy.config.http.accept_no_activity_timeout: 120
proxy.config.http.background_fill_active_timeout: 0
proxy.config.socks.socks_timeout: 100
proxy.config.socks.server_connect_timeout: 10
proxy.config.socks.server_retry_timeout: 300
proxy.config.net.poll_timeout: 10
proxy.config.net.default_inactivity_timeout: 86400
proxy.config.dns.lookup_timeout: 20
proxy.config.hostdb.lookup_timeout: 30
proxy.config.hostdb.timeout: 86400
proxy.config.hostdb.fail.timeout: 0
proxy.config.log.collation_host_timeout: 86390
proxy.config.log.collation_client_timeout: 86400
proxy.config.ssl.handshake_timeout_in: 0
proxy.config.ssl.ocsp.cache_timeout: 3600
proxy.config.ssl.ocsp.request_timeout: 10
proxy.config.http2.accept_no_activity_timeout: 120
proxy.config.http2.no_activity_timeout_in: 120
proxy.config.http2.active_timeout_in: 0
proxy.config.http2.zombie_debug_timeout_in: 0

Also from varnishlog output, it looks like HTTP/1.1 and HTTP/2 clients are affected

Change 541524 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Match HTTP transaction activity timeout and TTFB timeouts

https://gerrit.wikimedia.org/r/541524

Change 541525 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Honour 180 secs timeout on backend instances

https://gerrit.wikimedia.org/r/541525

Change 541525 abandoned by Vgutierrez:
ATS: Honour 180 secs timeout on backend instances

Reason:
Merged into Ife0c346eff5863c7dffcc1625fa84b46efe9772c

https://gerrit.wikimedia.org/r/541525

Change 542317 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Add timing request information to ats-tls log

https://gerrit.wikimedia.org/r/542317

Vgutierrez added a comment.EditedOct 14 2019, 10:28 AM

After *a lot* of debugging, I've found something that could explain this behaviour. Testing locally with HTTP/1.1 and HTTP/2 requests I've observed that transactions marked as aborted by ATS sometimes reach an origin server.
This can be tested with this curl cmd for HTTP/1.1 requests:

timeout 5 curl --http1.1 -X POST -d "{}" -H "Content-Length: 50" "https://en.wikipedia.org/w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1" -v -o /dev/null

and with this one for HTTP/2 requests:

curl -X POST -d "{}" -H "Content-Length: 50" "https://en.wikipedia.org/w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1" -v -o /dev/null

This kind of request is going to be marked as aborted (status: 000) and sometimes will hit varnish-fe (but ATS won't wait for the response) resulting in a FetchError like this one:

*   << BeReq    >> 663286927
-   Begin          bereq 663286926 pass
-   Timestamp      Start: 1571047553.811770 0.000000 0.000000
-   BereqMethod    POST
-   BereqURL       /w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1
-   BereqProtocol  HTTP/1.1
-   BereqHeader    User-Agent: curl/7.54.0
-   BereqHeader    Accept: */*
-   BereqHeader    Content-Length: 50
-   BereqHeader    Content-Type: application/x-www-form-urlencoded
-   BereqHeader    X-Client-IP: 101.99.34.40
-   BereqHeader    X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-CHACHA20-POLY1305; EC=X25519;
-   BereqHeader    X-Forwarded-Proto: https
-   BereqHeader    Host: en.wikipedia.org
-   BereqHeader    X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   BereqHeader    via-nginx: 1
-   BereqHeader    X-WMF-NOCOOKIES: 1
-   BereqHeader    X-CDIS: pass
-   BereqHeader    X-Varnish: 663286927
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    3059 vcl-64e43dc0-1f78-43f1-8f21-11b242b0ef4e.be_cp5011_eqsin_wmnet 10.132.0.111 3128 10.132.0.107 45795
-   BackendStart   10.132.0.111 3128
-   FetchError     req.body read error: 11 (Resource temporarily unavailable)
-   FetchError     backend write error: 11 (Resource temporarily unavailable)
-   Timestamp      Bereq: 1571047558.972651 5.160882 5.160882
-   BackendClose   3059 vcl-64e43dc0-1f78-43f1-8f21-11b242b0ef4e.be_cp5011_eqsin_wmnet
-   Timestamp      Beresp: 1571047613.812024 60.000254 54.839372
-   Timestamp      Error: 1571047613.812029 60.000259 0.000005
-   BerespProtocol HTTP/1.1
-   BerespStatus   503
-   BerespReason   Service Unavailable
-   BerespReason   Backend fetch failed
-   BerespHeader   Date: Mon, 14 Oct 2019 10:06:53 GMT
-   BerespHeader   Server: Varnish
-   VCL_call       BACKEND_ERROR
-   BerespHeader   X-CDIS: int
-   BerespHeader   Content-Type: text/html; charset=utf-8
-   VCL_return     deliver
-   Storage        malloc Transient
-   ObjProtocol    HTTP/1.1
-   ObjStatus      503
-   ObjReason      Backend fetch failed
-   ObjHeader      Date: Mon, 14 Oct 2019 10:06:53 GMT
-   ObjHeader      Server: Varnish
-   ObjHeader      X-CDIS: int
-   ObjHeader      Content-Type: text/html; charset=utf-8
-   Length         1794
-   BereqAcct      478 2 480 0 0 0
-   End

*   << Request  >> 663286926
-   Begin          req 663286925 rxreq
-   Timestamp      Start: 1571047553.811627 0.000000 0.000000
-   Timestamp      Req: 1571047553.811627 0.000000 0.000000
-   ReqStart       10.132.0.107 52257
-   ReqMethod      POST
-   ReqURL         /w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: en.wikipedia.org
-   ReqHeader      User-Agent: curl/7.54.0
-   ReqHeader      Accept: */*
-   ReqHeader      Content-Length: 50
-   ReqHeader      Content-Type: application/x-www-form-urlencoded
-   ReqHeader      X-Forwarded-For: 101.99.34.40
-   ReqHeader      X-Client-IP: 101.99.34.40
-   ReqHeader      X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-CHACHA20-POLY1305; EC=X25519;
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      Connection: close
-   ReqUnset       X-Forwarded-For: 101.99.34.40
-   ReqHeader      X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   VCL_call       RECV
-   ReqUnset       Host: en.wikipedia.org
-   ReqHeader      Host: en.wikipedia.org
-   ReqUnset       Host: en.wikipedia.org
-   ReqHeader      Host: en.wikipedia.org
-   VCL_acl        MATCH wikimedia_trust "10.0.0.0"/8
-   VCL_acl        NO_MATCH local_host
-   VCL_acl        MATCH local_tls_terminator "10.132.0.107"
-   ReqUnset       X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   ReqHeader      X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   ReqUnset       X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   ReqHeader      X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   ReqUnset       X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   ReqHeader      X-Forwarded-For: 101.99.34.40, 10.132.0.107
-   ReqHeader      via-nginx: 1
-   ReqHeader      X-Trusted-Proxy:
-   ReqUnset       X-Trusted-Proxy:
-   VCL_acl        NO_MATCH blocked_nets
-   VCL_acl        NO_MATCH bot_blocked_nets
-   ReqHeader      X-WMF-NOCOOKIES: 1
-   VCL_acl        NO_MATCH abuse_nets
-   ReqHeader      X-CP-HTTP2: 0
-   ReqHeader      X-CP-TLS-Version: TLSv1.2
-   ReqHeader      X-CP-TLS-Session-Reused: 0
-   ReqHeader      X-CP-Key-Exchange: X25519
-   ReqHeader      X-CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305
-   ReqHeader      X-CP-Auth: ECDHE-ECDSA-CHACHA20-POLY1305
-   ReqHeader      X-CP-Cipher: CHACHA20-POLY1305
-   ReqUnset       X-CP-Auth: ECDHE-ECDSA-CHACHA20-POLY1305
-   ReqHeader      X-CP-Auth: ECDSA
-   ReqUnset       X-CP-Cipher: CHACHA20-POLY1305
-   ReqHeader      X-CP-Cipher: CHACHA20-POLY1305-SHA256
-   VCL_Log        CP-HTTP2: 0
-   VCL_Log        CP-TLS-Version: TLSv1.2
-   VCL_Log        CP-TLS-Session-Reused: 0
-   VCL_Log        CP-Key-Exchange: X25519
-   VCL_Log        CP-Auth: ECDSA
-   VCL_Log        CP-Cipher: CHACHA20-POLY1305-SHA256
-   VCL_Log        CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305
-   ReqUnset       X-CP-HTTP2: 0
-   ReqUnset       X-CP-TLS-Version: TLSv1.2
-   ReqUnset       X-CP-TLS-Session-Reused: 0
-   ReqUnset       X-CP-Key-Exchange: X25519
-   ReqUnset       X-CP-Auth: ECDSA
-   ReqUnset       X-CP-Cipher: CHACHA20-POLY1305-SHA256
-   ReqUnset       X-CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305
-   VCL_return     pass
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       PASS
-   ReqHeader      X-CDIS: pass
-   VCL_acl        NO_MATCH wikimedia_nets
-   VCL_return     fetch
-   Link           bereq 663286927 pass
-   Storage        malloc Transient
-   FetchError     Resource temporarily unavailable
-   FetchError     straight insufficient bytes
-   Timestamp      ReqBody: 1571047558.972637 5.161011 5.161011
-   Timestamp      Fetch: 1571047613.812101 60.000474 54.839464
-   RespProtocol   HTTP/1.1
-   RespStatus     503
-   RespReason     Backend fetch failed
-   RespHeader     Date: Mon, 14 Oct 2019 10:06:53 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-CDIS: int
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     X-Varnish: 663286926
-   RespHeader     Age: 0
-   RespHeader     Via: 1.1 varnish (Varnish/5.1)
-   VCL_call       DELIVER
-   ReqUnset       X-CDIS: pass
-   ReqHeader      X-CDIS: int
-   RespUnset      X-CDIS: int
-   RespHeader     X-Cache-Int: cp5007 int
-   RespHeader     X-Cache: cp5007 int
-   RespHeader     X-Cache-Status: int
-   RespUnset      X-Cache-Int: cp5007 int
-   RespUnset      Via: 1.1 varnish (Varnish/5.1)
-   RespUnset      X-Cache-Status: int
-   RespHeader     X-Cache-Status: int-front
-   RespHeader     Server-Timing: cache;desc="int-front"
-   RespHeader     Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
-   RespHeader     X-Analytics:
-   ReqHeader      X-NowDay: 14-Oct-2019
-   RespHeader     Set-Cookie: WMF-Last-Access=14-Oct-2019;Path=/;HttpOnly;secure;Expires=Fri, 15 Nov 2019 00:00:00 GMT
-   RespHeader     Set-Cookie: WMF-Last-Access-Global=14-Oct-2019;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Fri, 15 Nov 2019 00:00:00 GMT
-   RespUnset      X-Analytics:
-   RespHeader     X-Analytics: ;https=1
-   RespUnset      X-Analytics: ;https=1
-   RespHeader     X-Analytics: ;https=1;nocookies=1
-   RespUnset      X-Analytics: ;https=1;nocookies=1
-   RespHeader     X-Analytics: https=1;nocookies=1
-   RespHeader     X-Client-IP: 101.99.34.40
-   RespHeader     Set-Cookie: GeoIP=VN:SG:Ho_Chi_Minh_City:10.81:106.64:v4; Path=/; secure; Domain=.wikipedia.org
-   VCL_return     deliver
-   Timestamp      Process: 1571047613.812179 60.000553 0.000078
-   RespHeader     Content-Length: 1794
-   RespHeader     Connection: close
-   Timestamp      Resp: 1571047613.812220 60.000593 0.000040
-   ReqAcct        413 2 415 754 1794 2548
-   End
ema moved this task from Triage to TLS on the Traffic board.Oct 14 2019, 5:36 PM

Change 543271 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Testing buffer_upload experimental plugin - do not merge

https://gerrit.wikimedia.org/r/543271

Vgutierrez added a subscriber: ema.Oct 16 2019, 9:23 AM

As reported to upstream here, ATS effectively lacks the ability of buffering POST requests on a secure and performant manner.
Right now there are two ways of achieving buffering POST requests with ATS, one using the experimental plugin buffer_upload and the other one with a core feature introduced in 8.x managed by the setting proxy.config.http.request_buffer_enabled

mechanismon memory bufferingon disk bufferingbypass memory buffering if request is too big
buffer_uploadyesyesyes but disk buffering is not used, it just hits the origin server ASAP
request_bufferyesnono, request is dropped with a 403

if on disk buffering is used with buffer_upload, it's always used, so it could result in a performance penalty for small requests.

buffer_upload seems the most feature complete solution right now, but it's been dropped by upstream on ATS >= 9.x, but we could push to get it back
@ema @BBlack, I'd love to hear your feedback on this issue, specially how we could proceed to continue rolling out ATS in the text cluster despite of the current limitations

Change 543271 abandoned by Vgutierrez:
Testing buffer_upload experimental plugin - do not merge

https://gerrit.wikimedia.org/r/543271

jijiki added a subscriber: jijiki.EditedOct 22 2019, 12:45 PM

It appears we are also having fetch errors, possibly due to timeouts as well mostly on two servers where we have enabled ats-be, cp1075 and cp4027

I highly suspect that's related to stricter timeouts on ats-be compared to varnish-be and atls-tls, that would be solved soon when https://gerrit.wikimedia.org/r/c/operations/puppet/+/541524 is merged

Change 541524 merged by Vgutierrez:
[operations/puppet@production] ATS: Adjust timeouts in ats-tls and ats-backend instances

https://gerrit.wikimedia.org/r/541524

Mentioned in SAL (#wikimedia-operations) [2019-10-23T03:51:18Z] <vgutierrez> depool cp5007 - T234887

Mentioned in SAL (#wikimedia-operations) [2019-10-23T04:49:19Z] <vgutierrez> repool cp5007 - T234887

I've depooled cp5007 to conduct some experiments, I've captured the varnish-fe traffic with the following tcpdump filter portrange 3120-3127 and I triggered a failed fetch with the following request:

curl --resolve en.wikipedia.org:443:127.0.0.1 -X POST -d "test" -H "Content-Length: 10" "https://en.wikipedia.org/w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1&host=cp5007&h2=1&random=$RANDOM" -v

we see the following on the wire between ats-tls and varnish-fe:

POST /w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1&host=cp5007&h2=1&random=8112 HTTP/1.1
User-Agent: curl/7.52.1
Accept: */*
Content-Length: 10
Content-Type: application/x-www-form-urlencoded
Host: en.wikipedia.org
X-Forwarded-For: 127.0.0.1
X-Client-IP: 127.0.0.1
X-Connection-Properties: H2=1; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES256-GCM-SHA384; EC=prime256v1;
X-Forwarded-Proto: https
Connection: close

HTTP/1.1 503 Backend fetch failed
Date: Wed, 23 Oct 2019 04:06:55 GMT
Server: Varnish
Content-Type: text/html; charset=utf-8
X-Varnish: 186127383
Age: 0
X-Cache: cp5007 int
X-Cache-Status: int-front
Server-Timing: cache;desc="int-front"
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Set-Cookie: WMF-Last-Access=23-Oct-2019;Path=/;HttpOnly;secure;Expires=Sun, 24 Nov 2019 00:00:00 GMT
Set-Cookie: WMF-Last-Access-Global=23-Oct-2019;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Sun, 24 Nov 2019 00:00:00 GMT
X-Analytics: https=1;nocookies=1
X-Client-IP: 127.0.0.1
Set-Cookie: GeoIP=:::::v4; Path=/; secure; Domain=.wikipedia.org
Content-Length: 1791
Connection: close

[...snip...]
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 127.0.0.1 via cp5007 frontend, Varnish XID 186127384<br>Error: 503, Backend fetch failed at Wed, 23 Oct 2019 04:06:55 GMT</code></p>
</div>
</html>
 75  11.561729 10.132.0.107 → 10.132.0.107 TCP 74 8349 → 3123 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=2735498851 TSecr=0 WS=512
 76  11.561736 10.132.0.107 → 10.132.0.107 TCP 74 3123 → 8349 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=2735498851 TSecr=2735498851 WS=512
 77  11.561744 10.132.0.107 → 10.132.0.107 TCP 66 8349 → 3123 [ACK] Seq=1 Ack=1 Win=44032 Len=0 TSval=2735498851 TSecr=2735498851
 78  11.561997 10.132.0.107 → 10.132.0.107 TCP 506 POST /w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1&host=cp5007&h2=1&random=8112 HTTP/1.1  [TCP segment of a reassembled PDU]
 79  11.562017 10.132.0.107 → 10.132.0.107 TCP 66 3123 → 8349 [ACK] Seq=1 Ack=441 Win=45056 Len=0 TSval=2735498851 TSecr=2735498851
 80  12.942536 10.132.0.107 → 10.132.0.107 TCP 66 8349 → 3123 [FIN, ACK] Seq=441 Ack=1 Win=44032 Len=0 TSval=2735499196 TSecr=2735498851
 81  12.989195 10.132.0.107 → 10.132.0.107 TCP 66 3123 → 8349 [ACK] Seq=1 Ack=442 Win=45056 Len=0 TSval=2735499208 TSecr=2735499196
202  42.009912 10.132.0.107 → 10.132.0.107 HTTP 2577 HTTP/1.1 503 Backend fetch failed  (text/html)
203  42.009921 10.132.0.107 → 10.132.0.107 TCP 54 8349 → 3123 [RST] Seq=442 Win=0 Len=0

See how on packet number 80 ats-tls requests the shutdown of the TCP connection [FIN, ACK] but varnish answers anyways 30 secs later on packet 202 and of course ATS rejects that with a RST on packet 203

Change 545499 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Send "100 continue" responses on the ats-tls instance

https://gerrit.wikimedia.org/r/545499

Change 545499 merged by Vgutierrez:
[operations/puppet@production] ATS: Send "100 continue" responses on the ats-tls instance

https://gerrit.wikimedia.org/r/545499