
Package and deploy ATS 9.2.1
Open, MediumPublic

Assigned To
None
Authored By
ssingh
Jun 14 2023, 2:20 PM

Description

We are currently running ATS 9.1.4 in production and should upgrade to 9.2.1.

Release notes https://docs.trafficserver.apache.org/release-notes/whats-new.en.html#version-9-2-x-updates

This task tracks the Debian packaging work and deployment of ATS 9.2.1.

  • Debian packaging
    • review packaging
  • 9.2.x config changes
    • review changes
      • Address "Failed to write log to /var/log/trafficserver/notpurge.pipe" errors
  • Testing of new changes
    • currently running in cp4052 (upload)

Event Timeline

Change 930230 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/trafficserver@master] Release 9.2.1-1wm1

https://gerrit.wikimedia.org/r/930230

Change 930230 merged by Ssingh:

[operations/debs/trafficserver@master] Release 9.2.1-1wm1

https://gerrit.wikimedia.org/r/930230

Mentioned in SAL (#wikimedia-operations) [2023-07-24T08:45:02Z] <vgutierrez> testing trafficserver 9.2.1 in cp4052 (upload node) - T339134

Vgutierrez triaged this task as Medium priority.Jul 24 2023, 9:20 AM
Vgutierrez subscribed.

a naive test on cp4052 resulted in a rollback: ATS itself was able to fetch data from its backend servers, but varnish was unable to fetch data as expected, returning the usual 503 Backend fetch failed.

grafana also shows a spike in active and total plugin time... considering it was too low with 9.1.x, it could mean that those metrics have been fixed rather than an actual regression.

more research needed :)

Mentioned in SAL (#wikimedia-operations) [2023-07-24T09:22:37Z] <vgutierrez> rollback to trafficserver 9.1.4 in cp4052 - T339134

Mentioned in SAL (#wikimedia-operations) [2023-07-24T13:03:53Z] <vgutierrez> depooling cp4052 for some ATS 9.2.1 testing - T339134

varnish healthchecks aren't happy with ATS 9.2.1:

$ sudo -i varnishadm -n frontend backend.list
Backend name                   Admin      Probe                Last updated
vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet probe      Sick                0/5 Mon, 24 Jul 2023 13:04:35 GMT

varnish probe (healthcheck) is defined like this:

probe varnish {
    .request =
        "GET /ats-be HTTP/1.1"
        "Host: healthcheck.wikimedia.org"
        "User-agent: Varnish backend check"
        "Connection: close";
    .timeout = 100ms;
    .interval = 100ms;
    .window = 5;
    .threshold = 3;
}

# Generated list of cache backend hosts for director consumption

backend be_cp4052_ulsfo_wmnet {
    .host = "cp4052.ulsfo.wmnet";
    .port = "3128";
    .connect_timeout = 3s;
    .first_byte_timeout = 65s;
    .between_bytes_timeout = 33s;
    .max_connections = 400000;
    .probe = varnish;
}

A naive test using curl shows the following output:

$ curl -H "Host: healthcheck.wikimedia.org" -H "User-Agent: Varnish backend check" -H "Connection: Close" --http1.1 http://cp4052.ulsfo.wmnet:3128/ats-be -v -o /dev/null -s
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
*   Trying 10.128.0.12:3128...
* Connected to cp4052.ulsfo.wmnet (10.128.0.12) port 3128 (#0)
> GET /ats-be HTTP/1.1
> Host: healthcheck.wikimedia.org
> Accept: */*
> User-Agent: Varnish backend check
> Connection: Close
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Cache-Control: no-cache
< Date: Mon, 24 Jul 2023 13:21:00 GMT
< Age: 0
< Connection: close
< Server: ATS/9.2.1
< X-Cache-Int: cp4052 int
< X-ATS-Timestamp: 1690204860
< 
* Closing connection 0

but varnish is unable to get a response from ATS:

vgutierrez@cp4052:/etc/varnish$ sudo -i varnishlog -n frontend -g raw -i backend_health
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still sick 4---Xr-- 0 3 5 0.000000 0.000589 Read error 104 (Connection reset by peer)
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still sick 4---Xr-- 0 3 5 0.000000 0.000589 Read error 104 (Connection reset by peer)

varnish backend_health flags report the following:

  • 4 --> IPv4 connection
  • X --> request transmit succeeded
  • r --> read response failed

a traffic capture shows that ATS puts the response on the wire:

image.png (240×1 px, 68 KB)

on a 9.1.4 instance (cp4051) the healthcheck looks identical in terms of payload, but the TCP connection gets closed gracefully:

image.png (247×1 px, 88 KB)

the healthcheck gets generated by our default.lua, specifically:

function do_global_read_request()
    if ts.client_request.header['Host'] == 'healthcheck.wikimedia.org' and ts.client_request.get_uri() == '/ats-be' then
        ts.http.intercept(function()
            ts.say('HTTP/1.1 200 OK\r\n' ..
                   'Content-Length: 0\r\n' ..
                   'Cache-Control: no-cache\r\n\r\n')
        end)

        return 0
    end
end

staring at the ATS 9.2.1 changelog https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.1 doesn't reveal any obvious culprit for this.

After checking https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.0 I've noticed:

#8784 - Propagate proxy.config.net.sock_option_flag_in to newly accepted connections

we set proxy.config.net.sock_option_flag_in to 0x5, which enables SO_LINGER; disabling it solved the issue:

vgutierrez@cp4052:~$ sudo -i varnishlog -n frontend -g raw -i backend_health
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still healthy 4---X-RH 5 3 5 0.000360 0.000460 HTTP/1.1 200 OK
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still healthy 4---X-RH 5 3 5 0.000653 0.000509 HTTP/1.1 200 OK
         0 Backend_health - vcl-3bdd70e8-bae1-400e-a03f-fb20c766895c.be_cp4052_ulsfo_wmnet Still healthy 4---X-RH 5 3 5 0.000278 0.000451 HTTP/1.1 200 OK
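The RST behavior is easy to reproduce outside ATS: enabling SO_LINGER with a zero timeout makes close() abort the connection with a TCP RST instead of a FIN, which the peer reports as read error 104 (Connection reset by peer), exactly what the varnish probe logged. A minimal loopback sketch (not WMF/ATS code, just an illustration of the socket option):

```python
# Hedged sketch, not WMF/ATS code: shows why SO_LINGER with a zero
# timeout makes close() send a TCP RST, which the peer reports as
# "Connection reset by peer" (the read error the varnish probe logged).
import socket
import struct

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# l_onoff=1, l_linger=0: close() skips the FIN handshake and aborts
# the connection with an RST, discarding any unsent data.
conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
conn.close()

try:
    while cli.recv(4096):  # drains the response, then hits the RST
        pass
    result = "graceful close"
except ConnectionResetError:
    result = "connection reset by peer"
print(result)
cli.close()
srv.close()
```

Removing the option (or setting l_onoff=0) makes close() perform the normal FIN handshake, which matches the 9.1.4 capture above.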

Change 940953 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Disable SO_LINGER on incoming connections

https://gerrit.wikimedia.org/r/940953

Change 940953 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Disable SO_LINGER on incoming connections

https://gerrit.wikimedia.org/r/940953

Mentioned in SAL (#wikimedia-operations) [2023-07-24T14:44:32Z] <vgutierrez> Repooling cp4052 (upload) running ATS 9.2.1 - T339134

grafana shows a regression in lua performance after the update to 9.2.1:

image.png (1×1 px, 47 KB)

I do remember talking to upstream about getting a lot of requests with active and total lua time set to 0, so at this point I don't know if this is an actual regression or not, but we need to double-check it (i.e. attempt to measure lua performance without relying on ATS metrics).

Mentioned in SAL (#wikimedia-operations) [2023-07-24T16:39:00Z] <sukhe> restart ATS to pick up CR 940953: T339134

As per @BCornwall's comment above, it does seem like we have more issues with 9.2.1 that we should look into:

sukhe@cp4052:~$ /usr/bin/traffic_server --version
Traffic Server 9.2.1 Jun 14 2023 18:20:20 localhost

trafficserver.service journal output:

Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep  7 17:58:16.062] [LOG_FLUSH] ERROR: The following message was suppressed 987 times.
Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep  7 17:58:16.062] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 869, wrote 0, Resource temporarily unavailable]
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep  7 17:59:17.033] [LOG_FLUSH] ERROR: The following message was suppressed 1157 times.
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep  7 17:59:17.033] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 806, wrote 0, Resource temporarily unavailable]
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep  7 18:00:18.431] [LOG_FLUSH] ERROR: The following message was suppressed 1030 times.
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep  7 18:00:18.431] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 827, wrote 0, Resource temporarily unavailable]

that's not an issue with ATS 9.2.1; the problem was fixed by restarting fifo-log-demux@notpurge.service:

vgutierrez@cp4052:~$ journalctl -u fifo-log-demux@notpurge.service -f
-- Journal begins at Sat 2023-08-05 22:39:02 UTC. --
Sep 03 20:57:26 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 06 15:47:35 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 07:52:46 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 08:17:20 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 08 09:42:55 cp4052 systemd[1]: Stopping FIFO log demultiplexer (instance notpurge)...
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Succeeded.
Sep 08 09:42:55 cp4052 systemd[1]: Stopped FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Consumed 9min 29.690s CPU time.
Sep 08 09:42:55 cp4052 systemd[1]: Started FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 fifo-log-demux[902329]: Waiting for connections on /run/trafficserver/notpurge.sock

ATS logs to a Unix named pipe, and writes will fail if nothing is draining the other side (fifo-log-demux), i.e. once the pipe buffer fills up.
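The failure mode can be reproduced with a plain FIFO: once the reader stalls and the pipe buffer fills up, non-blocking writes fail with EAGAIN ("Resource temporarily unavailable"), the exact error in the LOG_FLUSH messages above. A sketch using a throwaway path (not the production pipe):

```python
# Hedged sketch with a throwaway path: reproduces the "Resource
# temporarily unavailable" (EAGAIN) write failures from the ATS logs
# once a FIFO's reader stalls and the pipe buffer fills up.
import errno
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.pipe")
os.mkfifo(path)

# Open both ends non-blocking; the reader end exists but never reads,
# mimicking a stuck fifo-log-demux.
rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)

written = 0
failure = None
try:
    while True:
        written += os.write(wfd, b"x" * 4096)  # fills the pipe buffer
except OSError as e:
    failure = e.errno
    print(f"write failed after {written} bytes: {os.strerror(e.errno)}")

os.close(wfd)
os.close(rfd)
os.unlink(path)
```

This is why restarting the stuck fifo-log-demux instance clears the errors: the pipe drains and ATS writes start succeeding again.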

A quick check with cumin shows several servers impacted:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'journalctl -u fifo-log-demux@notpurge.service --since=-1h | grep Error'
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit: 96 
===== NODE GROUP =====                                                                                                                                                            
(1) cp5028.eqsin.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:18:14 cp5028 fifo-log-demux[1607]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
===== NODE GROUP =====                                                                                                                                                            
(1) cp2033.codfw.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:27:36 cp2033 fifo-log-demux[1320]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
===== NODE GROUP =====                                                                                                                                                            
(1) cp4042.ulsfo.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:03:56 cp4042 fifo-log-demux[1831]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
================                                                                                                                                                                  
PASS |████                                                                                                                               |   3% (3/96) [00:04<02:14,  1.45s/hosts]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉    |  97% (93/96) [00:04<00:00, 21.44hosts/s]
96.9% (93/96) of nodes failed to execute command 'journalctl -u fi...-1h | grep Error': cp[2027-2032,2034-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5027,5029-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4041,4043-4052].ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) for command: 'journalctl -u fi...-1h | grep Error'. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet

All of them running ATS 9.1.4:

vgutierrez@cumin1001:~$ sudo -i cumin 'cp4042.ulsfo.wmnet,cp2033.codfw.wmnet,cp5028.eqsin.wmnet' 'apt-cache policy trafficserver|grep Installed'
3 hosts will be targeted:
cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====                                                                                                                                                            
(3) cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet                                                                                                                      
----- OUTPUT of 'apt-cache policy...r|grep Installed' -----                                                                                                                       
  Installed: 9.1.4-1wm1

What's unique to 9.2.1 is the flooding of broken pipe logs, and that seems to be triggered by 0004-logging-broken-pipe-no-spam.patch being removed in our 9.2.1 build: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/930230

Change 956484 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/trafficserver@master] Release 9.2.1-1wm2

https://gerrit.wikimedia.org/r/956484

Change 956484 merged by BCornwall:

[operations/debs/trafficserver@master] Release 9.2.1-1wm2

https://gerrit.wikimedia.org/r/956484