Page MenuHomePhabricator

Webrequests live data shows traffic without TLS on varnish for upload.w.o
Open, In Progress, MediumPublic

Description

After the migration of port 80 handling to HAProxy it theory we should not get anymore HTTP traffic at the varnish layer, but we're still getting some traffic, in particular for the upload.wikimedia.org domain, in addition to the healthcheck ones (are those ones expected?).

All the TLS headers related to this traffic are null as you can see here:
https://w.wiki/6rta

And it didn't seem to change in the last 30 days, in the before/after of the port 80 migration:
https://w.wiki/6rte

Event Timeline

Vgutierrez changed the task status from Open to In Progress.Jun 22 2023, 9:14 AM
Vgutierrez triaged this task as Medium priority.
Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.

This seems like a varnish (VCL) bug. Varnish is getting requests with X-Connection-Properties header set but it's failing to issue the expected X-Analytics-TLS one that should be handled by varnishkafka:

vgutierrez@cp3061:~$ sudo -i varnishlog -n frontend -I "ReqHeader:x-connection-properties" -I "VCL_Log:tls" -q "ReqMethod eq GET and VCL_Log:tls"

   << Request  >> 714622587 
-   ReqHeader      x-connection-properties: H2=1; SSR=1; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;
-   VCL_Log        tls: vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=reused

*   << Request  >> 696314755 
-   ReqHeader      x-connection-properties: H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;
-   VCL_Log        tls: vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new
vgutierrez@cp3061:~$ sudo -i varnishlog -n frontend -I "ReqHeader:x-connection-properties" -I "VCL_Log:tls" -q "ReqMethod eq GET and not VCL_Log:tls"
*   << Request  >> 723851798 
-   ReqHeader      x-connection-properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES256-GCM-SHA384; EC=UNKNOWN;

*   << Request  >> 723619905 
-   ReqHeader      x-connection-properties: H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;

Full log of a request showing the misbehaviour:

*   << Request  >> 713485145                                                                                                                                                      
-   Begin          req 713485144 restart                                                                                                                                          
-   Timestamp      Start: 1687427044.614841 0.000217 0.000000                                                                                                                     
-   ReqStart       0.0.0.0 0 a8                                                                                                                                                   
-   ReqMethod      GET                                                                   
-   ReqURL         /wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-Symbol_category_class.svg.png.webp
-   ReqProtocol    HTTP/1.1                                                              
-   ReqHeader      sec-ch-ua: "Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"                                                                                    
-   ReqHeader      sec-ch-ua-mobile: ?0                                                  
-   ReqHeader      user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36                                    
-   ReqHeader      sec-ch-ua-platform: "Windows"                                                                                                                                  
-   ReqHeader      accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
-   ReqHeader      sec-fetch-site: cross-site                                                                                                                                     
-   ReqHeader      sec-fetch-mode: no-cors                                               
-   ReqHeader      sec-fetch-dest: image                                                 
-   ReqHeader      referer: https://en.wikipedia.org/                                    
-   ReqHeader      x-client-ip: REDACTED                                                
-   ReqHeader      x-client-port: 56312                                                  
-   ReqHeader      x-forwarded-proto: https                                              
-   ReqHeader      x-connection-properties: H2=1; SSR=1; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;
-   ReqHeader      X-Forwarded-For: REDACTED, 10.20.0.61
-   ReqHeader      via-nginx: 1
-   ReqHeader      Host: upload.wikimedia.org
-   ReqHeader      Accept-Language: en-gb,en-us;q=0.9,en;q=0.8
-   ReqHeader      X-WMF-NOCOOKIES: 1
-   ReqHeader      Accept-Encoding: gzip
-   ReqHeader      X-Requestctl: 
-   ReqHeader      X-CDIS: hit/70162
-   VCL_call       RECV
-   ReqUnset       X-CDIS: hit/70162
-   ReqURL         /wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-Symbol_category_class.svg.png.webp
-   ReqURL         /wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-Symbol_category_class.svg.png.webp
-   ReqURL         /wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-symbol_category_class.svg.png.webp
-   ReqURL         /wikipedia/en/thumb/9/96/Symbol_category_class.svg/16px-Symbol_category_class.svg.webp
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   Hit            622574333 64339.046947 1200.000000 86400.000000
-   VCL_call       HIT
-   ReqHeader      X-CDIS: hit
-   ReqUnset       X-Requestctl: 
-   ReqHeader      X-Requestctl: 
-   VCL_return     deliver
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     OK
-   RespHeader     Date: Thu, 22 Jun 2023 03:58:32 GMT
-   RespHeader     Etag: 670b66419952fd0027726bc71c7f38aa
-   RespHeader     Cache-Control: public, max-age=86400
-   RespHeader     Server: ATS/9.1.4
-   RespHeader     X-Cache-Int: cp3065 hit
-   RespHeader     X-ATS-Timestamp: 1687406312
-   RespHeader     X-MediaWiki-Original: /wikipedia/en/9/96/Symbol_category_class.svg
-   RespHeader     Content-Type: image/webp
-   RespHeader     Content-Length: 528
-   RespHeader     Content-Disposition: inline;filename*=UTF-8''Symbol_category_class.svg.webp
-   RespHeader     Last-Modified: Wed, 04 Jan 2023 11:50:40 GMT
-   RespHeader     X-Timestamp: 1672833039.74157
-   RespHeader     Access-Control-Allow-Origin: *
-   RespHeader     X-Trans-Id: tx2bfc521ba9fc4d1aaf7fb-0064913507
-   RespHeader     X-Openstack-Request-Id: tx2bfc521ba9fc4d1aaf7fb-0064913507
-   RespHeader     X-Varnish: 713485145 622574333
-   RespHeader     Age: 20732
-   RespHeader     Via: 1.1 varnish (Varnish/6.0)
-   VCL_call       DELIVER
-   ReqUnset       X-CDIS: hit
-   ReqHeader      X-CDIS: hit/28831
-   RespUnset      X-Cache-Int: cp3065 hit
-   RespHeader     X-Cache-Int: cp3065 hit, cp3061 hit/28831
-   RespHeader     X-Cache: cp3065 hit, cp3061 hit/28831
-   RespHeader     X-Cache-Status: hit, hit
-   RespUnset      X-Cache-Int: cp3065 hit, cp3061 hit/28831
-   RespUnset      Via: 1.1 varnish (Varnish/6.0)
-   RespUnset      X-Cache-Status: hit, hit
-   RespHeader     X-Cache-Status: hit-front 
-   RespHeader     Server-Timing: cache;desc="hit-front", host;desc="cp3061"
-   RespHeader     Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
-   RespHeader     Report-To: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error
&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
-   RespHeader     NEL: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
-   RespHeader     X-Analytics: 
-   RespUnset      X-Analytics: 
-   RespHeader     X-Analytics: ;https=1
-   RespUnset      X-Analytics: ;https=1
-   RespHeader     X-Analytics: ;https=1;client_port=56312
-   RespUnset      X-Analytics: ;https=1;client_port=56312
-   RespHeader     X-Analytics: ;https=1;client_port=56312;nocookies=1
-   RespUnset      X-Analytics: ;https=1;client_port=56312;nocookies=1
-   RespHeader     X-Analytics: https=1;client_port=56312;nocookies=1
-   RespHeader     X-Client-IP: REDACTED
-   RespUnset      X-MediaWiki-Original: /wikipedia/en/9/96/Symbol_category_class.svg
-   RespUnset      Cache-Control: public, max-age=86400
-   RespHeader     X-Content-Type-Options: nosniff
-   RespUnset      Access-Control-Allow-Origin: *
-   RespHeader     Access-Control-Allow-Origin: *
-   RespHeader     Access-Control-Expose-Headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
-   RespHeader     Timing-Allow-Origin: *
-   VCL_acl        NO_MATCH fb_nets
-   VCL_return     deliver
-   Timestamp      Process: 1687427044.614933 0.000310 0.000093
-   RespHeader     Accept-Ranges: bytes
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1687427044.614956 0.000333 0.000023
-   ReqAcct        941 0 941 1326 528 1854
-   End

as shown on the full request example, this is happening on request restarts:

Begin          req 713485144 restart

and our VCL logic excludes from TLS data being parsed:

if (req.restarts == 0) { 
    // IP processing is req->req mangling that shouldn't be re-done on restart
    call recv_fe_ip_processing;

    // Parse X-Connection-Properties and log TLS/HTTP2 connection
    // information, which is sent via varnishkafka to Analytics.
    // If we don't do this super-early, some rejected/invalid
    // requests from early checks won't have any TLS analytics,
    // which is very confusing and looks like plain-HTTP.
    if (req.http.X-Connection-Properties) {
        call log_xcps_info;
    }

This happens on upload.wm.o when handling thumbnails:

    # If the client supports WebP and the thumbnail is popular enough, attempt to serve a WebP thumbnail instead of JPG/PNG
    if (obj.hits > <%= @vcl_config.fetch('upload_webp_hits_threshold') %> && req.url ~ "(?i)\.(jpg|jpeg|jpe|png)$" && req.url ~ "(?i)/thumb/" && req.http.Accept ~ "(?i)image/webp" && req.http.User-Agent !~ "(^Opera|Safari\/6)" && req.restarts == 0) { 
        # Append .webp to the request URL 
        set req.url = req.url + ".webp";
        return(restart);
    }   

    # The WebP thumbnail failed, fall back to the JPG/PNG one 
    if (resp.status >= 400 && req.restarts >= 1 && req.url ~ "(?i)\.webp$" && req.url ~ "(?i)/thumb/") {
        # Remove .webp from the request URL 
        set req.url = regsub(req.url, "(?i)\.webp$", "");
        return(restart);
    }   
}

Regarding healthcheck.wikimedia.org those are actually plain text requests being issued by the UDS healthcheck:

*   << Request  >> 621520726 
-   Begin          req 621520725 rxreq
-   Timestamp      Start: 1687428027.211764 0.000000 0.000000
-   Timestamp      Req: 1687428027.211764 0.000000 0.000000
-   ReqStart       0.0.0.0 0 a8
-   ReqMethod      GET
-   ReqURL         /varnish-fe
-   ReqProtocol    HTTP/1.0
-   ReqHeader      Host: healthcheck.wikimedia.org
-   ReqHeader      Connection: close
-   ReqHeader      User-Agent: check_varnish_uds/0.1
-   ReqHeader      X-Forwarded-For: 0.0.0.0
-   VCL_call       RECV
-   VCL_acl        NO_MATCH wikimedia_trust
-   VCL_acl        MATCH local_tls_terminator "0.0.0.0"
-   VCL_acl        NO_MATCH local_host
-   VCL_acl        MATCH local_tls_terminator "0.0.0.0"
-   VCL_acl        MATCH UDS "0.0.0.0"
-   VCL_acl        MATCH UDS "0.0.0.0"
-   VCL_acl        MATCH UDS "0.0.0.0"
-   ReqUnset       X-Forwarded-For: 0.0.0.0
-   ReqHeader      X-Forwarded-For: 10.136.1.7
-   ReqUnset       X-Forwarded-For: 10.136.1.7
-   ReqHeader      X-Forwarded-For: 10.136.1.7
-   ReqUnset       X-Forwarded-For: 10.136.1.7
-   ReqHeader      X-Forwarded-For: 10.136.1.7
-   ReqUnset       X-Forwarded-For: 10.136.1.7
-   ReqHeader      X-Forwarded-For: 10.136.1.7
-   ReqHeader      X-Client-IP: 0.0.0.0
-   ReqHeader      X-Trusted-Proxy: 
-   ReqUnset       X-Trusted-Proxy: 
-   ReqHeader      X-Public-Cloud: 
-   ReqUnset       X-Public-Cloud: 
-   ReqHeader      X-Abuse-Network: 
-   ReqUnset       X-Abuse-Network: 
-   VCL_acl        NO_MATCH blocked_nets
-   VCL_acl        NO_MATCH bot_blocked_nets
-   ReqUnset       Host: healthcheck.wikimedia.org
-   ReqHeader      Host: healthcheck.wikimedia.org
-   ReqUnset       Host: healthcheck.wikimedia.org
-   ReqHeader      Host: healthcheck.wikimedia.org
-   ReqURL         /varnish-fe
-   VCL_return     synth
-   VCL_call       HASH
-   VCL_return     lookup
-   Timestamp      Process: 1687428027.211818 0.000054 0.000054
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     OK
-   RespReason     healthcheck
-   RespHeader     Date: Thu, 22 Jun 2023 10:00:27 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-Varnish: 621520726
-   VCL_call       SYNTH
-   RespHeader     X-CDIS: int
-   ReqHeader      X-CDIS: int
-   RespUnset      X-CDIS: int
-   RespHeader     X-Cache-Int: cp6004 int
-   RespHeader     X-Cache: cp6004 int
-   RespHeader     X-Cache-Status: int
-   RespUnset      X-Cache-Int: cp6004 int
-   RespUnset      X-Cache-Status: int
-   RespHeader     X-Cache-Status: int-front
-   RespHeader     Server-Timing: cache;desc="int-front", host;desc="cp6004"
-   RespHeader     X-Analytics: 
-   ReqHeader      X-NowDay: 22-Jun-2023
-   RespHeader     Set-Cookie: WMF-Last-Access=22-Jun-2023;Path=/;HttpOnly;secure;Expires=Mon, 24 Jul 2023 00:00:00 GMT
-   RespUnset      X-Analytics: 
-   RespHeader     X-Client-IP: 0.0.0.0
-   RespReason     OK
-   VCL_return     deliver
-   RespHeader     Content-Length: 40
-   Storage        malloc Transient
-   RespHeader     Accept-Ranges: bytes
-   RespHeader     Connection: close
-   Timestamp      Resp: 1687428027.211891 0.000127 0.000073
-   ReqAcct        115 0 115 387 40 427
-   End

@Volans even if this is the expected behavior right now we need to clarify the dashboards a little bit. The first scenario (upload.wm.o and req.restarts >= 1) still keeps X-Analytics: https=1 data, could we leverage that on the dashboards to avoid the confusion?

@Vgutierrez I don't see that field in druid so I think we have to check if that's available when benthos parses the stream and set a field for it. This for the live dataset. For the _128 one it's a different pipeline.
My understanding is that it should be done here:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/benthos/instances/webrequest_live.yaml.erb#47

Adding @elukey and @fgiunchedi for comments.

Change 932262 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] benthos: add X-Analytics' https field to the stream

https://gerrit.wikimedia.org/r/932262

Change 932263 had a related patch set uploaded (by Elukey; author: Elukey):

[analytics/refinery@master] druid: update the webrequest live's supervisor

https://gerrit.wikimedia.org/r/932263

Change 932262 merged by Elukey:

[operations/puppet@production] benthos: add X-Analytics' https field to the stream

https://gerrit.wikimedia.org/r/932262

Change 932263 merged by Elukey:

[analytics/refinery@master] druid: update the webrequest live's supervisor

https://gerrit.wikimedia.org/r/932263

Mentioned in SAL (#wikimedia-analytics) [2023-06-22T15:50:12Z] <elukey> update the webrequest_sampled_live druid kafka supervisor to add the https field - T340097

Change 932361 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] turnilo: add https field to webrequest_sampled_live datacube

https://gerrit.wikimedia.org/r/932361

Change 932361 merged by Elukey:

[operations/puppet@production] turnilo: add https field to webrequest_sampled_live datacube

https://gerrit.wikimedia.org/r/932361

Mentioned in SAL (#wikimedia-analytics) [2023-06-23T10:16:22Z] <elukey> restart turnilo to pick up config changes - T340097