
Package and deploy ATS 9.2.1
Open, MediumPublic

Assigned To
None
Authored By
ssingh
Jun 14 2023, 2:20 PM

Description

We are currently running ATS 9.1.4 in production and should upgrade to 9.2.1.

Release notes https://docs.trafficserver.apache.org/release-notes/whats-new.en.html#version-9-2-x-updates

This task tracks the Debian packaging work and deployment of ATS 9.2.1.

  • Debian packaging
    • review packaging
  • 9.2.x config changes
    • review changes
      • Address "Failed to write log to /var/log/trafficserver/notpurge.pipe" errors
  • Testing of new changes
    • currently running in cp4052 (upload)

Event Timeline

Change 930230 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/trafficserver@master] Release 9.2.1-1wm1

https://gerrit.wikimedia.org/r/930230

Change 930230 merged by Ssingh:

[operations/debs/trafficserver@master] Release 9.2.1-1wm1

https://gerrit.wikimedia.org/r/930230

Mentioned in SAL (#wikimedia-operations) [2023-07-24T08:45:02Z] <vgutierrez> testing trafficserver 9.2.1 in cp4052 (upload node) - T339134

Vgutierrez triaged this task as Medium priority.Jul 24 2023, 9:20 AM
Vgutierrez subscribed.

a naive test on cp4052 resulted in a rollback: ATS itself was able to fetch data from its backend servers, but varnish was unable to fetch data as expected, returning the usual 503 Backend fetch failed.

grafana also shows a spike in active and total plugin time... considering it was too low with 9.1.x, it could mean that those metrics have been fixed rather than an actual regression.

more research needed :)

Mentioned in SAL (#wikimedia-operations) [2023-07-24T09:22:37Z] <vgutierrez> rollback to trafficserver 9.1.4 in cp4052 - T339134

Mentioned in SAL (#wikimedia-operations) [2023-07-24T13:03:53Z] <vgutierrez> depooling cp4052 for some ATS 9.2.1 testing - T339134

varnish healthchecks aren't happy with ATS 9.2.1:

$ sudo -i varnishadm -n frontend backend.list
Backend name                   Admin      Probe                Last updated
vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet probe      Sick                0/5 Mon, 24 Jul 2023 13:04:35 GMT

varnish probe (healthcheck) is defined like this:

probe varnish {
    .request =
        "GET /ats-be HTTP/1.1"
        "Host: healthcheck.wikimedia.org"
        "User-agent: Varnish backend check"
        "Connection: close";
    .timeout = 100ms;
    .interval = 100ms;
    .window = 5;
    .threshold = 3;
}

# Generated list of cache backend hosts for director consumption

backend be_cp4052_ulsfo_wmnet {
    .host = "cp4052.ulsfo.wmnet";
    .port = "3128";
    .connect_timeout = 3s;
    .first_byte_timeout = 65s;
    .between_bytes_timeout = 33s;
    .max_connections = 400000;
    .probe = varnish;
}

A naive test using curl shows the following output:

$ curl -H "Host: healthcheck.wikimedia.org" -H "User-Agent: Varnish backend check" -H "Connection: Close" --http1.1 http://cp4052.ulsfo.wmnet:3128/ats-be -v -o /dev/null -s
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
*   Trying 10.128.0.12:3128...
* Connected to cp4052.ulsfo.wmnet (10.128.0.12) port 3128 (#0)
> GET /ats-be HTTP/1.1
> Host: healthcheck.wikimedia.org
> Accept: */*
> User-Agent: Varnish backend check
> Connection: Close
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Cache-Control: no-cache
< Date: Mon, 24 Jul 2023 13:21:00 GMT
< Age: 0
< Connection: close
< Server: ATS/9.2.1
< X-Cache-Int: cp4052 int
< X-ATS-Timestamp: 1690204860
< 
* Closing connection 0

but varnish is unable to get a response from ATS:

vgutierrez@cp4052:/etc/varnish$ sudo -i varnishlog -n frontend -g raw -i backend_health
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still sick 4---Xr-- 0 3 5 0.000000 0.000589 Read error 104 (Connection reset by peer)
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still sick 4---Xr-- 0 3 5 0.000000 0.000589 Read error 104 (Connection reset by peer)

varnish backend_health flags report the following:

  • 4 --> IPv4 connection
  • X --> request transmit succeeded
  • r --> read response failed

a traffic capture shows that ATS puts the response on the wire:

image.png (240×1 px, 68 KB)

on a 9.1.4 instance (cp4051) the healthcheck looks identical in terms of payload, but the TCP connection gets closed gracefully:

image.png (247×1 px, 88 KB)

the healthcheck gets generated by our default.lua, specifically:

function do_global_read_request()
    if ts.client_request.header['Host'] == 'healthcheck.wikimedia.org' and ts.client_request.get_uri() == '/ats-be' then
        ts.http.intercept(function()
            ts.say('HTTP/1.1 200 OK\r\n' ..
                   'Content-Length: 0\r\n' ..
                   'Cache-Control: no-cache\r\n\r\n')
        end)

        return 0
    end
end

staring at the ATS 9.2.1 changelog https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.1 doesn't reveal any obvious culprit for this.

After checking https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.0 I've noticed:

#8784 - Propagate proxy.config.net.sock_option_flag_in to newly accepted connections

we set proxy.config.net.sock_option_flag_in to 0x5, which enables SO_LINGER; disabling it solved the issue:

vgutierrez@cp4052:~$ sudo -i varnishlog -n frontend -g raw -i backend_health
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still healthy 4---X-RH 5 3 5 0.000360 0.000460 HTTP/1.1 200 OK
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still healthy 4---X-RH 5 3 5 0.000653 0.000509 HTTP/1.1 200 OK
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still healthy 4---X-RH 5 3 5 0.000278 0.000451 HTTP/1.1 200 OK
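The RST behavior is easy to reproduce outside ATS: enabling SO_LINGER with a zero timeout makes close() abort the connection with a TCP RST instead of a FIN, which the peer reports as read error 104 (Connection reset by peer), exactly what the varnish probe logged. A minimal loopback sketch (not WMF/ATS code, just an illustration of the socket option):

```python
# Hedged sketch, not WMF/ATS code: shows why SO_LINGER with a zero
# timeout makes close() send a TCP RST, which the peer reports as
# "Connection reset by peer" (the read error the varnish probe logged).
import socket
import struct

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# l_onoff=1, l_linger=0: close() skips the FIN handshake and aborts
# the connection with an RST, discarding any unsent data.
conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
conn.close()

try:
    while cli.recv(4096):  # drains the response, then hits the RST
        pass
    result = "graceful close"
except ConnectionResetError:
    result = "connection reset by peer"
print(result)
cli.close()
srv.close()
```

Removing the option (or setting l_onoff=0) makes close() perform the normal FIN handshake, which matches the 9.1.4 capture above.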

Change 940953 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Disable SO_LINGER on incoming connections

https://gerrit.wikimedia.org/r/940953

Change 940953 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Disable SO_LINGER on incoming connections

https://gerrit.wikimedia.org/r/940953

Mentioned in SAL (#wikimedia-operations) [2023-07-24T14:44:32Z] <vgutierrez> Repooling cp4052 (upload) running ATS 9.2.1 - T339134

grafana shows a regression in lua performance after the update to 9.2.1:

image.png (1×1 px, 47 KB)

I do remember talking to upstream about getting a lot of requests with active and total lua time set to 0, so at this point I don't know if this is an actual regression or not, but we need to double-check it (i.e. attempt to measure lua performance without relying on ATS metrics).

Mentioned in SAL (#wikimedia-operations) [2023-07-24T16:39:00Z] <sukhe> restart ATS to pick up CR 940953: T339134

As per @BCornwall's comment above, it does seem like we have more issues with 9.2.1 that we should look into:

sukhe@cp4052:~$ /usr/bin/traffic_server --version
Traffic Server 9.2.1 Jun 14 2023 18:20:20 localhost

trafficserver.service journal output:

Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep  7 17:58:16.062] [LOG_FLUSH] ERROR: The following message was suppressed 987 times.
Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep  7 17:58:16.062] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 869, wrote 0, Resource temporarily unavailable]
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep  7 17:59:17.033] [LOG_FLUSH] ERROR: The following message was suppressed 1157 times.
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep  7 17:59:17.033] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 806, wrote 0, Resource temporarily unavailable]
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep  7 18:00:18.431] [LOG_FLUSH] ERROR: The following message was suppressed 1030 times.
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep  7 18:00:18.431] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 827, wrote 0, Resource temporarily unavailable]

that's not an issue with ATS 9.2.1; the problem was fixed by restarting fifo-log-demux@notpurge.service:

vgutierrez@cp4052:~$ journalctl -u fifo-log-demux@notpurge.service -f
-- Journal begins at Sat 2023-08-05 22:39:02 UTC. --
Sep 03 20:57:26 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 06 15:47:35 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 07:52:46 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 08:17:20 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 08 09:42:55 cp4052 systemd[1]: Stopping FIFO log demultiplexer (instance notpurge)...
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Succeeded.
Sep 08 09:42:55 cp4052 systemd[1]: Stopped FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Consumed 9min 29.690s CPU time.
Sep 08 09:42:55 cp4052 systemd[1]: Started FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 fifo-log-demux[902329]: Waiting for connections on /run/trafficserver/notpurge.sock

ATS logs to a Unix named pipe, and writes will fail if nothing is draining the other side (fifo-log-demux), i.e. once the pipe buffer fills up.
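The failure mode can be reproduced with a plain FIFO: once the reader stalls and the pipe buffer fills up, non-blocking writes fail with EAGAIN ("Resource temporarily unavailable"), the exact error in the LOG_FLUSH messages above. A sketch using a throwaway path (not the production pipe):

```python
# Hedged sketch with a throwaway path: reproduces the "Resource
# temporarily unavailable" (EAGAIN) write failures from the ATS logs
# once a FIFO's reader stalls and the pipe buffer fills up.
import errno
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.pipe")
os.mkfifo(path)

# Open both ends non-blocking; the reader end exists but never reads,
# mimicking a stuck fifo-log-demux.
rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)

written = 0
failure = None
try:
    while True:
        written += os.write(wfd, b"x" * 4096)  # fills the pipe buffer
except OSError as e:
    failure = e.errno
    print(f"write failed after {written} bytes: {os.strerror(e.errno)}")

os.close(wfd)
os.close(rfd)
os.unlink(path)
```

This is why restarting the stuck fifo-log-demux instance clears the errors: the pipe drains and ATS writes start succeeding again.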

A quick check with cumin shows several servers impacted:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'journalctl -u fifo-log-demux@notpurge.service --since=-1h | grep Error'
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit: 96 
===== NODE GROUP =====                                                                                                                                                            
(1) cp5028.eqsin.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:18:14 cp5028 fifo-log-demux[1607]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
===== NODE GROUP =====                                                                                                                                                            
(1) cp2033.codfw.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:27:36 cp2033 fifo-log-demux[1320]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
===== NODE GROUP =====                                                                                                                                                            
(1) cp4042.ulsfo.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:03:56 cp4042 fifo-log-demux[1831]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
================                                                                                                                                                                  
PASS |████                                                                                                                               |   3% (3/96) [00:04<02:14,  1.45s/hosts]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉    |  97% (93/96) [00:04<00:00, 21.44hosts/s]
96.9% (93/96) of nodes failed to execute command 'journalctl -u fi...-1h | grep Error': cp[2027-2032,2034-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5027,5029-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4041,4043-4052].ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) for command: 'journalctl -u fi...-1h | grep Error'. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet

All of them running ATS 9.1.4:

vgutierrez@cumin1001:~$ sudo -i cumin 'cp4042.ulsfo.wmnet,cp2033.codfw.wmnet,cp5028.eqsin.wmnet' 'apt-cache policy trafficserver|grep Installed'
3 hosts will be targeted:
cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====                                                                                                                                                            
(3) cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet                                                                                                                      
----- OUTPUT of 'apt-cache policy...r|grep Installed' -----                                                                                                                       
  Installed: 9.1.4-1wm1

What's unique to 9.2.1 is the flooding of broken pipe logs, and that seems to be triggered by 0004-logging-broken-pipe-no-spam.patch being removed in our 9.2.1 build: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/930230

Change 956484 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/trafficserver@master] Release 9.2.1-1wm2

https://gerrit.wikimedia.org/r/956484

Change 956484 merged by BCornwall:

[operations/debs/trafficserver@master] Release 9.2.1-1wm2

https://gerrit.wikimedia.org/r/956484