Page MenuHomePhabricator

noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams
Closed, ResolvedPublic

Description

@CptViraj reported noc.wikimedia.org was down, but it works for me. Looked closer and while other clusters are fine, fetches sometimes fail in esams (~20% of the time), and fail 100% of the time in eqsin.

These are the status codes for a simple fetch of the front page of https://noc.wikimedia.org/ using curl --resolve to re-point at different text-lbs:

eqiad    HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200
codfw    HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200
esams    HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 503  HTTP/2 200  HTTP/2 200  HTTP/2 200
ulsfo    HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200
eqsin    HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503

esams    HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 503  HTTP/2 503
eqsin    HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503

esams    HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 200  HTTP/2 503  HTTP/2 200  HTTP/2 503
eqsin    HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503  HTTP/2 503

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
CDanis triaged this task as Medium priority.Jun 14 2020, 2:12 PM
CDanis updated the task description. (Show Details)

Checking against eqsin with curl --resolve noc.wikimedia.org:443:$(dig +short text-lb.eqsin.wikimedia.org) https://noc.wikimedia.org I do get a 503 and atslog shows the following results on cp5010:

vgutierrez@cp5010:~$ sudo -i atslog-tls noc.wikimedia.org
dt:2020-06-15T08:25:34Z	hostname:cp5010.eqsin.wmnet	time_firstbyte:1402	ip:84.78.247.11	cache_status:int-front	http_status:503	response_size:1795	http_method:GET	uri_host:noc.wikimedia.org	uri_path:/	content_type:text/html; charset=utf-8	referer:-	user_agent:curl/7.64.1	accept_language:-	x_analytics:https=1;nocookies=1	range:-	x_cache:cp5010 int	accept:*/*	backend:Varnish	tls:vers=TLSv1.2;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
^C
vgutierrez@cp5010:~$ sudo -i atslog-backend noc.wikimedia.org
Date:2020-06-15 Time:08:26:38 ConnAttempts:1 ConnReuse:0 TTFetchHeaders:860 ClientTTFB:860 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:860 OriginServer:mwmaint.discovery.wmnet OriginServerTime:0 CacheResultCode:TCP_REFRESH_HIT CacheWriteResult:- ReqMethod:GET RespStatus:200 OriginStatus:304 ReqURL:http://noc.wikimedia.org/ ReqHeader:User-Agent:curl/7.64.1 ReqHeader:Host:noc.wikimedia.org ReqHeader:X-Client-IP:84.78.247.11 ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=60, must-revalidate, max-age=0 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp5010 miss RespHeader:Backend-Timing:D=228 t=1591945783327213

curl --http1.1 -H 'Host: noc.wikimedia.org' https://mwmaint.discovery.wmnet from cp5010 returns a HTTP 200 as expected

varnish-fe also shows a 503:

vgutierrez@cp5010:~$ sudo -i varnishlog -n frontend -q "ReqHeader:Host eq noc.wikimedia.org"
*   << Request  >> 1066411391
-   Begin          req 1058354185 rxreq
-   Timestamp      Start: 1592211439.329728 0.000000 0.000000
-   Timestamp      Req: 1592211439.329728 0.000000 0.000000
-   ReqStart       10.132.0.110 20543
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      User-Agent: curl/7.64.1
-   ReqHeader      Accept: */*
-   ReqHeader      Host: noc.wikimedia.org
-   ReqHeader      X-Forwarded-For: 84.78.247.11
-   ReqHeader      X-Analytics-TLS: vers=TLSv1.2;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
-   ReqHeader      X-Client-IP: 84.78.247.11
-   ReqHeader      X-Connection-Properties: H2=1; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES256-GCM-SHA384; EC=X25519;
-   ReqHeader      X-Forwarded-Proto: https
-   ReqUnset       X-Forwarded-For: 84.78.247.11
-   ReqHeader      X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   VCL_call       RECV
-   ReqUnset       X-Analytics-TLS: vers=TLSv1.2;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
-   VCL_acl        MATCH wikimedia_trust "10.0.0.0"/8
-   VCL_acl        NO_MATCH local_host
-   VCL_acl        MATCH local_tls_terminator "10.132.0.110"
-   ReqUnset       X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   ReqHeader      X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   ReqUnset       X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   ReqHeader      X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   ReqUnset       X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   ReqHeader      X-Forwarded-For: 84.78.247.11, 10.132.0.110
-   ReqHeader      via-nginx: 1
-   ReqHeader      X-Trusted-Proxy:
-   ReqUnset       X-Trusted-Proxy:
-   ReqHeader      x-tls-prot: 1
-   ReqUnset       x-tls-prot: 1
-   ReqHeader      x-tls-prot: h2
-   ReqHeader      x-tls-sess: 0
-   ReqUnset       x-tls-sess: 0
-   ReqHeader      x-tls-sess: new
-   ReqHeader      x-tls-vers: TLSv1.2
-   ReqHeader      x-tls-keyx: X25519
-   ReqHeader      x-tls-auth: ECDHE-ECDSA-AES256-GCM-SHA384
-   ReqHeader      x-tls-ciph: AES256-GCM-SHA384
-   ReqUnset       x-tls-auth: ECDHE-ECDSA-AES256-GCM-SHA384
-   ReqHeader      x-tls-auth: ECDSA
-   ReqUnset       x-tls-ciph: AES256-GCM-SHA384
-   ReqHeader      x-tls-ciph: AES256-GCM-SHA384
-   VCL_Log        tls: vers=TLSv1.2;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
-   ReqUnset       x-tls-prot: h2
-   ReqUnset       x-tls-vers: TLSv1.2
-   ReqUnset       x-tls-sess: new
-   ReqUnset       x-tls-keyx: X25519
-   ReqUnset       x-tls-auth: ECDSA
-   ReqUnset       x-tls-ciph: AES256-GCM-SHA384
-   VCL_acl        NO_MATCH blocked_nets
-   VCL_acl        NO_MATCH bot_blocked_nets
-   ReqUnset       Host: noc.wikimedia.org
-   ReqHeader      Host: noc.wikimedia.org
-   ReqUnset       Host: noc.wikimedia.org
-   ReqHeader      Host: noc.wikimedia.org
-   Debug          "Now using wikimedia_misc VCL"
-   VCL_return     vcl
-   VCL_call       RECV
-   ReqUnset       X-Analytics-TLS: vers=TLSv1.2;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
-   VCL_acl        MATCH wikimedia_trust "10.0.0.0"/8
-   VCL_acl        NO_MATCH local_host
-   VCL_acl        MATCH local_tls_terminator "10.132.0.110"
-   ReqUnset       X-Forwarded-For: 84.78.247.11
-   ReqHeader      X-Forwarded-For: 84.78.247.11
-   ReqUnset       X-Forwarded-For: 84.78.247.11
-   ReqHeader      X-Forwarded-For: 84.78.247.11
-   ReqUnset       X-Forwarded-For: 84.78.247.11
-   ReqHeader      X-Forwarded-For: 84.78.247.11
-   ReqHeader      via-nginx: 1
-   ReqHeader      X-Trusted-Proxy:
-   ReqUnset       X-Trusted-Proxy:
-   ReqHeader      x-tls-prot: 1
-   ReqUnset       x-tls-prot: 1
-   ReqHeader      x-tls-prot: h2
-   ReqHeader      x-tls-sess: 0
-   ReqUnset       x-tls-sess: 0
-   ReqHeader      x-tls-sess: new
-   ReqHeader      x-tls-vers: TLSv1.2
-   ReqHeader      x-tls-keyx: X25519
-   ReqHeader      x-tls-auth: ECDHE-ECDSA-AES256-GCM-SHA384
-   ReqHeader      x-tls-ciph: AES256-GCM-SHA384
-   ReqUnset       x-tls-auth: ECDHE-ECDSA-AES256-GCM-SHA384
-   ReqHeader      x-tls-auth: ECDSA
-   ReqUnset       x-tls-ciph: AES256-GCM-SHA384
-   ReqHeader      x-tls-ciph: AES256-GCM-SHA384
-   VCL_Log        tls: vers=TLSv1.2;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
-   ReqUnset       x-tls-prot: h2
-   ReqUnset       x-tls-vers: TLSv1.2
-   ReqUnset       x-tls-sess: new
-   ReqUnset       x-tls-keyx: X25519
-   ReqUnset       x-tls-auth: ECDSA
-   ReqUnset       x-tls-ciph: AES256-GCM-SHA384
-   VCL_acl        NO_MATCH blocked_nets
-   VCL_acl        NO_MATCH bot_blocked_nets
-   ReqUnset       Host: noc.wikimedia.org
-   ReqHeader      Host: noc.wikimedia.org
-   ReqUnset       Host: noc.wikimedia.org
-   ReqHeader      Host: noc.wikimedia.org
-   ReqHeader      X-WMF-NOCOOKIES: 1
-   VCL_return     pass
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       PASS
-   ReqHeader      X-CDIS: pass
-   VCL_return     fetch
-   Link           bereq 1066411392 pass
-   Timestamp      Fetch: 1592211440.829750 1.500022 1.500022
-   RespProtocol   HTTP/1.1
-   RespStatus     503
-   RespReason     Backend fetch failed
-   RespHeader     Date: Mon, 15 Jun 2020 08:57:20 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-CDIS: int
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     X-Varnish: 1066411391
-   RespHeader     Age: 0
-   RespHeader     Via: 1.1 varnish (Varnish/5.1)
-   VCL_call       DELIVER
-   ReqUnset       X-CDIS: pass
-   ReqHeader      X-CDIS: int
-   RespUnset      X-CDIS: int
-   RespHeader     X-Cache-Int: cp5010 int
-   RespHeader     X-Cache: cp5010 int
-   RespHeader     X-Cache-Status: int
-   RespUnset      X-Cache-Int: cp5010 int
-   RespUnset      Via: 1.1 varnish (Varnish/5.1)
-   RespUnset      X-Cache-Status: int
-   RespHeader     X-Cache-Status: int-front
-   RespHeader     Server-Timing: cache;desc="int-front"
-   RespHeader     Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
-   RespHeader     X-Analytics:
-   ReqHeader      X-NowDay: 15-Jun-2020
-   RespHeader     Set-Cookie: WMF-Last-Access=15-Jun-2020;Path=/;HttpOnly;secure;Expires=Fri, 17 Jul 2020 00:00:00 GMT
-   RespUnset      X-Analytics:
-   RespHeader     X-Analytics: ;https=1
-   RespUnset      X-Analytics: ;https=1
-   RespHeader     X-Analytics: ;https=1;nocookies=1
-   RespUnset      X-Analytics: ;https=1;nocookies=1
-   RespHeader     X-Analytics: https=1;nocookies=1
-   RespHeader     X-Client-IP: 84.78.247.11
-   VCL_return     deliver
-   Timestamp      Process: 1592211440.829807 1.500079 0.000057
-   RespHeader     Content-Length: 1795
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1592211440.829838 1.500110 0.000031
-   ReqAcct        355 0 355 532 1795 2327
-   End

Filtering by BeReqHeader we can see how varnish-fe apparently gets a 200 from ats-be and returns a 503 cause the "body cannot be fetched":

vgutierrez@cp5010:~$ sudo -i varnishlog -n frontend -q "BeReqHeader:Host eq noc.wikimedia.org"
*   << BeReq    >> 1061469865
-   Begin          bereq 1061469864 pass
-   Timestamp      Start: 1592211658.093100 0.000000 0.000000
-   BereqMethod    GET
-   BereqURL       /
-   BereqProtocol  HTTP/1.1
-   BereqHeader    User-Agent: curl/7.64.1
-   BereqHeader    Accept: */*
-   BereqHeader    X-Client-IP: 84.78.247.11
-   BereqHeader    X-Connection-Properties: H2=1; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES256-GCM-SHA384; EC=X25519;
-   BereqHeader    X-Forwarded-Proto: https
-   BereqHeader    X-Forwarded-For: 84.78.247.11
-   BereqHeader    via-nginx: 1
-   BereqHeader    Host: noc.wikimedia.org
-   BereqHeader    X-WMF-NOCOOKIES: 1
-   BereqHeader    X-CDIS: pass
-   BereqHeader    X-Varnish: 1061469865
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    4495 vcl-d9221012-6a32-4c4d-b28f-9081e0352d72.be_cp5007_eqsin_wmnet 10.132.0.107 3128 10.132.0.110 12043
-   BackendStart   10.132.0.107 3128
-   Timestamp      Bereq: 1592211658.093177 0.000077 0.000077
-   Timestamp      Beresp: 1592211659.074205 0.981105 0.981028
-   BerespProtocol HTTP/1.1
-   BerespStatus   200
-   BerespReason   OK
-   BerespHeader   date: Mon, 15 Jun 2020 09:00:58 GMT
-   BerespHeader   server: Apache
-   BerespHeader   last-modified: Tue, 17 Jul 2018 17:07:05 GMT
-   BerespHeader   accept-ranges: bytes
-   BerespHeader   content-length: 3162
-   BerespHeader   vary: Accept-Encoding
-   BerespHeader   backend-timing: D=203 t=1591858360290588
-   BerespHeader   cache-control: s-maxage=60, must-revalidate, max-age=0
-   BerespHeader   content-type: text/html
-   BerespHeader   x-envoy-upstream-service-time: 0
-   BerespHeader   Transfer-Encoding: chunked
-   BerespHeader   Etag: "c5a-57134f9c13cba"
-   BerespHeader   Age: 2
-   BerespHeader   Connection: keep-alive
-   BerespHeader   X-Cache-Int: cp5007 miss
-   BerespHeader   X-ATS-Timestamp: 1592211659
-   BackendClose   4495 vcl-d9221012-6a32-4c4d-b28f-9081e0352d72.be_cp5007_eqsin_wmnet
-   Error          Body cannot be fetched
-   Timestamp      Error: 1592211659.074231 0.981131 0.000026
-   BerespProtocol HTTP/1.1
-   BerespStatus   503
-   BerespReason   Service Unavailable
-   BerespReason   Backend fetch failed
-   BerespHeader   Date: Mon, 15 Jun 2020 09:00:59 GMT
-   BerespHeader   Server: Varnish
-   VCL_call       BACKEND_ERROR
-   VCL_return     retry
-   Timestamp      Retry: 1592211659.074241 0.981141 0.000010
-   Link           bereq 2528473 retry
-   End

*   << BeReq    >> 2528473
-   Begin          bereq 1061469865 retry
-   Timestamp      Start: 1592211659.074241 0.981141 0.000000
-   BereqMethod    GET
-   BereqURL       /
-   BereqProtocol  HTTP/1.1
-   BereqHeader    User-Agent: curl/7.64.1
-   BereqHeader    Accept: */*
-   BereqHeader    X-Client-IP: 84.78.247.11
-   BereqHeader    X-Connection-Properties: H2=1; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES256-GCM-SHA384; EC=X25519;
-   BereqHeader    X-Forwarded-Proto: https
-   BereqHeader    X-Forwarded-For: 84.78.247.11
-   BereqHeader    via-nginx: 1
-   BereqHeader    Host: noc.wikimedia.org
-   BereqHeader    X-WMF-NOCOOKIES: 1
-   BereqHeader    X-CDIS: pass
-   BereqHeader    X-Varnish: 1061469865
-   BereqUnset     X-Varnish: 1061469865
-   BereqHeader    X-Varnish: 2528473
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    4053 vcl-d9221012-6a32-4c4d-b28f-9081e0352d72.be_cp5010_eqsin_wmnet 10.132.0.110 3128 10.132.0.110 9593
-   BackendStart   10.132.0.110 3128
-   Timestamp      Bereq: 1592211659.074289 0.981189 0.000048
-   Timestamp      Beresp: 1592211659.774325 1.681225 0.700036
-   BerespProtocol HTTP/1.1
-   BerespStatus   200
-   BerespReason   OK
-   BerespHeader   date: Mon, 15 Jun 2020 09:00:59 GMT
-   BerespHeader   server: Apache
-   BerespHeader   last-modified: Tue, 17 Jul 2018 17:07:05 GMT
-   BerespHeader   accept-ranges: bytes
-   BerespHeader   content-length: 3162
-   BerespHeader   vary: Accept-Encoding
-   BerespHeader   backend-timing: D=228 t=1591945783327213
-   BerespHeader   cache-control: s-maxage=60, must-revalidate, max-age=0
-   BerespHeader   content-type: text/html
-   BerespHeader   x-envoy-upstream-service-time: 0
-   BerespHeader   Transfer-Encoding: chunked
-   BerespHeader   Etag: "c5a-57134f9c13cba"
-   BerespHeader   Age: 0
-   BerespHeader   Connection: keep-alive
-   BerespHeader   X-Cache-Int: cp5010 miss
-   BerespHeader   X-ATS-Timestamp: 1592211659
-   BackendClose   4053 vcl-d9221012-6a32-4c4d-b28f-9081e0352d72.be_cp5010_eqsin_wmnet
-   Error          Body cannot be fetched
-   Timestamp      Error: 1592211659.774360 1.681260 0.000035
-   BerespProtocol HTTP/1.1
-   BerespStatus   503
-   BerespReason   Service Unavailable
-   BerespReason   Backend fetch failed
-   BerespHeader   Date: Mon, 15 Jun 2020 09:00:59 GMT
-   BerespHeader   Server: Varnish
-   VCL_call       BACKEND_ERROR
-   BerespHeader   X-CDIS: int
-   BerespHeader   Content-Type: text/html; charset=utf-8
-   VCL_return     deliver
-   Storage        malloc Transient
-   ObjProtocol    HTTP/1.1
-   ObjStatus      503
-   ObjReason      Backend fetch failed
-   ObjHeader      Date: Mon, 15 Jun 2020 09:00:59 GMT
-   ObjHeader      Server: Varnish
-   ObjHeader      X-CDIS: int
-   ObjHeader      Content-Type: text/html; charset=utf-8
-   Length         1792
-   BereqAcct      661 0 661 968 0 968
-   End

It seems that something might be going wrong at the ats-tls<->varnish-fe level. Hitting varnish-fe directly on cp5007 I constantly get a 200 response with body and everything:

11:26:01 ema@cp5007.eqsin.wmnet:~
$ for i in 1 2 3; do curl -v -H "X-Forwarded-Proto: https" -H "Host: noc.wikimedia.org" localhost:80/?x=1 2>&1 | egrep '< HTTP|DOCT'; done
< HTTP/1.1 200 OK
<!DOCTYPE html>
< HTTP/1.1 200 OK
<!DOCTYPE html>
< HTTP/1.1 200 OK
<!DOCTYPE html>

It seems that something might be going wrong at the ats-tls<->varnish-fe level. Hitting varnish-fe directly on cp5007 I constantly get a 200 response with body and everything:

11:26:01 ema@cp5007.eqsin.wmnet:~
$ for i in 1 2 3; do curl -v -H "X-Forwarded-Proto: https" -H "Host: noc.wikimedia.org" localhost:80/?x=1 2>&1 | egrep '< HTTP|DOCT'; done
< HTTP/1.1 200 OK

The reason for the test above to always succeed is that those responses are 'pass'.

  • BerespHeader content-length: 3162

[...]

  • BerespHeader Transfer-Encoding: chunked

Responses cannot send both Content-Length and Transfer-Encoding: chunked, this is what is ultimately causing varnish to return a 503. It seems that ats-be is adding TE: chunked to cached responses with explicit CL.

Change 605578 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: stop caching noc.wm.org responses

https://gerrit.wikimedia.org/r/605578

Change 605578 merged by Ema:
[operations/puppet@production] ATS: stop caching noc.wm.org responses

https://gerrit.wikimedia.org/r/605578

Apparently TE:chunked is not added only on cache hits, but occasionally on miss/pass too. https://gerrit.wikimedia.org/r/605578 does make sense and it's good we found out it was missing, but it does not address this issue.

I've tried running curl -v -H 'Host: noc.wikimedia.org' localhost:3128/ 2>&1 | egrep -i 'transfer-encoding|content-length' on all cache-text nodes, and the following ats-be returned both CL and TE:

===== NODE GROUP =====                                                                                   
(6) cp2027.codfw.wmnet,cp[5008,5010,5012].eqsin.wmnet,cp[3050,3056].esams.wmnet                          
----- OUTPUT of 'curl -v  -H 'Hos...|content-length'' -----                                              
< content-length: 3162                                                                                   
< Transfer-Encoding: chunked

Puzzling.

I found out more. This morning the list of affected hosts is shorter than yesterday:

cp2027.codfw.wmnet,cp5008.eqsin.wmnet,cp3050.esams.wmnet

My theory is that this happens because, on other cache backends such as cp3056, cached objects with wrong Transfer-Encoding have expired overnight. This contradicts my observation from yesterday:

Apparently TE:chunked is not added only on cache hits, but occasionally on miss/pass too

The observation is wrong in a subtle way. To make sure that the HTTP client does not get in our way by automagically decoding stuff, I've used the following socket-based client:

import socket
  
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = ('localhost', 3128)
client_socket.connect(server_address)

request_header = 'GET / HTTP/1.1\r\nConnection: close\r\nHost: noc.wikimedia.org\r\n\r\n'
client_socket.send(request_header)

response = ''
while True:
    recv = client_socket.recv(1024)
    if not recv:
        break
    response += recv

print response
client_socket.close()

The interesting parts of the output on cp3050 are:

HTTP/1.1 200 OK
last-modified: Tue, 17 Jul 2018 17:07:05 GMT
content-length: 3162
Transfer-Encoding: chunked
X-Cache-Int: cp3050 miss
Age: 71669

<!DOCTYPE html>
<html dir="ltr">
[...]

The following things are wrong here:

  1. Content-Length and Transfer-Encoding must not be sent together according to rfc2616#section-4.4. If both are sent anyways, the client must ignore Content-Length. This is what Varnish does, and it fails to fetch the body given that the response is not actually chunked
  2. Transfer-Encoding: chunked should not be sent at all given that the output clearly isn't chunked
  3. A cache miss should not have Age ~ 20 hours

The matching log on ats-be shows this:

Date:2020-06-16 Time:08:37:02 ConnAttempts:1 ConnReuse:0 TTFetchHeaders:253 ClientTTFB:253 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:253 OriginServer:mwmaint.discovery.wmnet OriginServerTime:0 CacheResultCode:TCP_REFRESH_HIT CacheWriteResult:- ReqMethod:GET RespStatus:200 OriginStatus:304 ReqURL:http://noc.wikimedia.org/ ReqHeader:User-Agent:- ReqHeader:Host:noc.wikimedia.org ReqHeader:X-Client-IP:- ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=60, must-revalidate, max-age=0 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp3050 miss RespHeader:Backend-Timing:D=259 t=1591642922815611

In particular, OriginStatus:304 shows that ATS performed a conditional request to mwmaint.discovery.wmnet, and got a 304 (not modified) response. What I find surprising here is:

  1. s-maxage is 60 seconds, and yet ATS performed a conditional request and returned an object much older than one minute
  2. CacheResultCode is TCP_REFRESH_HIT and yet we set X-Cache-Status to miss

Part of the confusion is our fault. The way we generate X-Cache-Status when it comes to stale "hits" is wrong:

if status == TS_LUA_CACHE_LOOKUP_HIT_STALE then
    return "miss"
end

That should be considered a HIT instead. So TODO here is:

  • X-Cache-Status for TS_LUA_CACHE_LOOKUP_HIT_STALE should be 'hit' instead of miss
  • we should try to reproduce TE:chunked being added to a stale object on 304 responses from the origin

Change 605877 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: consider TS_LUA_CACHE_LOOKUP_HIT_STALE as "hit"

https://gerrit.wikimedia.org/r/605877

Change 605877 merged by Ema:
[operations/puppet@production] ATS: consider TS_LUA_CACHE_LOOKUP_HIT_STALE as "hit"

https://gerrit.wikimedia.org/r/605877

we should try to reproduce TE:chunked being added to a stale object on 304 responses from the origin

I've written a simple autest to reproduce this issue. When it comes to the origin server response headers, I've used exactly those returned by noc.wm.org: P11538. After a first request resulting in a miss (P11539), the second is a hit-stale without Transfer-Encoding: P11540. This confirms that the basic behavior of ATS is in general sane, and something more subtle must have happened.

Alright I finally understood what's going on. The problem here is that (1) the origin server is adding Transfer-Encoding: chunked to conditional HEAD requests, which if possible makes no sense twice: HEAD responses don't have body, and 304s also don't:

$ curl -I -H "Host: noc.wikimedia.org" -H "If-Modified-Since: Wed, 17 Jun 2020 05:13:37 GMT" https://mwmaint.discovery.wmnet 2>&1
HTTP/1.1 304 Not Modified
date: Wed, 17 Jun 2020 08:24:12 GMT
server: Apache
etag: "c7c-5a83b16373584"
cache-control: s-maxage=60, must-revalidate, max-age=0
x-envoy-upstream-service-time: 0
transfer-encoding: chunked

and (2) upon content revalidation ATS is updating the cached object with the headers returned in the 304 response:

$ curl -v -H Host: noc.wikimedia.org localhost:3128/
[...]
> GET / HTTP/1.1
> Host: noc.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Wed, 17 Jun 2020 11:06:17 GMT
< server: Apache
< last-modified: Tue, 16 Jun 2020 22:30:12 GMT
< etag: "c7c-5a83b16373584"
< accept-ranges: bytes
< content-length: 3196
< vary: Accept-Encoding
< backend-timing: D=269 t=1592391978258588
< cache-control: s-maxage=60, must-revalidate, max-age=0
< content-type: text/html
< x-envoy-upstream-service-time: 0
< Age: 1
< Connection: keep-alive
< X-Cache-Int: cp2027 miss
< X-ATS-Timestamp: 1592391978
< 
[regular response body follows]
$ sleep 62
$ curl -v -I -H Host: noc.wikimedia.org localhost:3128/
[...]
> HEAD / HTTP/1.1
> Host: noc.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Wed, 17 Jun 2020 11:22:03 GMT
< server: Apache
< last-modified: Tue, 16 Jun 2020 22:30:12 GMT
< accept-ranges: bytes
< content-length: 3196
< vary: Accept-Encoding
< backend-timing: D=196 t=1592392861556186
< cache-control: s-maxage=60, must-revalidate, max-age=0
< content-type: text/html
< x-envoy-upstream-service-time: 0
< Etag: "c7c-5a83b16373584"
< Transfer-Encoding: chunked
< Age: 0
< Connection: keep-alive
< X-Cache-Int: cp2027 hit
< X-ATS-Timestamp: 1592392923
< 
[note that TE:chunked is returned in response to the HEAD request]
$ curl -v -H Host: noc.wikimedia.org localhost:3128/
> GET / HTTP/1.1
> Host: noc.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< date: Wed, 17 Jun 2020 11:22:03 GMT
< server: Apache
< last-modified: Tue, 16 Jun 2020 22:30:12 GMT
< accept-ranges: bytes
< content-length: 3196
< vary: Accept-Encoding
< backend-timing: D=196 t=1592392861556186
< cache-control: s-maxage=60, must-revalidate, max-age=0
< content-type: text/html
< x-envoy-upstream-service-time: 0
< Etag: "c7c-5a83b16373584"
< Transfer-Encoding: chunked
< Age: 0
< Connection: keep-alive
< X-Cache-Int: cp2027 hit
< X-ATS-Timestamp: 1592392923
< 
{ [3196 bytes data]
* Illegal or missing hexadecimal sequence in chunked-encoding
[both Content-Length and Transfer-Encoding: chunked are returned]

Quite obviously, what we need to do is (1) ensure the origins don't send Transfer-Encoding on 304 responses and (2) make sure ATS does not add Transfer-Encoding to cached objects when receiving a 304.

Change 606204 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: unset Transfer-Encoding on 304 responses from origins

https://gerrit.wikimedia.org/r/606204

Change 606204 merged by Ema:
[operations/puppet@production] ATS: unset Transfer-Encoding on 304 responses from origins

https://gerrit.wikimedia.org/r/606204

ema claimed this task.

what we need to do is (1) ensure the origins don't send Transfer-Encoding on 304 responses and (2) make sure ATS does not add Transfer-Encoding to cached objects when receiving a 304.

Envoy adds TE: chunked to HEAD responses on purpose, so scratch (1). When it comes to (2), we fixed that with https://gerrit.wikimedia.org/r/606204.

Change 937955 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Drop T255368 workaround

https://gerrit.wikimedia.org/r/937955