
ATS cache read p999 metrics show requests taking up to 1 second on cache read operations
Closed, Resolved · Public

Authored by Vgutierrez · Sep 14 2022, 10:31 AM

Description

While working on SLIs for ATS, we've discovered that certain requests spend up to 1 second waiting for a cache read to complete:

Date:2022-09-14 Time:10:28:02 ConnAttempts:0 ConnReuse:7 TTFetchHeaders:1396 ClientTTFB:2007 CacheReadTime:610 CacheWriteTime:0 TotalSMTime:2095 TotalPluginTime:0 ActivePluginTime:0 OriginServer:appservers-ro.discovery.wmnet OriginServerTime:1484 CacheResultCode:TCP_MISS CacheWriteResult:FIN ReqMethod:GET RespStatus:200 OriginStatus:200 ReqURL:http://fr.m.wikipedia.org/wiki/Robert_Klapisch ReqHeader:User-Agent:Mozilla/5.0 (iPhone; CPU iPhone OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/228.0.471065565 Mobile/15E148 Safari/604.1 ReqHeader:Host:fr.wikipedia.org ReqHeader:X-Client-IP:REDACTED ReqHeader:GeoIP=REDACTED; PHP_ENGINE=7.4; WMF-Last-Access=14-Sep-2022; BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=1209600, must-revalidate, max-age=0 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp6016 miss RespHeader:Backend-Timing:D=1220799 t=1663151283314732
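
For reference, a minimal sketch of how log lines in this format can be filtered for slow cache reads. The 500 ms threshold is an arbitrary illustrative cut-off, and only the space-free numeric fields are relied upon (header values containing spaces are simply skipped):

#!/usr/bin/env python3
"""Filter ATS per-request log lines (space-separated Key:Value pairs) for slow cache reads."""
import sys

THRESHOLD_MS = 500  # illustrative cut-off for a "slow" cache read

def fields(line):
    # Build a dict from the space-separated Key:Value tokens; tokens without
    # a colon (e.g. pieces of the User-Agent header) are ignored.
    out = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:
            out[key] = value
    return out

for line in sys.stdin:
    f = fields(line)
    try:
        read_ms = int(f.get("CacheReadTime", "0"))
    except ValueError:
        continue
    if read_ms >= THRESHOLD_MS:
        print(f.get("Date"), f.get("Time"), read_ms, f.get("CacheResultCode"), f.get("ReqURL"))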

Event Timeline

Change 838111 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Log total time spent on a client request

https://gerrit.wikimedia.org/r/838111

Vgutierrez triaged this task as Medium priority. Oct 4 2022, 10:02 AM

This seems to happen every nine minutes on both upload and text nodes:

vgutierrez@cp6016:~$ cut -f1-2 -d' ' readtime.log |sort |uniq -c
    203 Date:2022-10-04 Time:09:20:32
     63 Date:2022-10-04 Time:09:20:33
    164 Date:2022-10-04 Time:09:29:32
     94 Date:2022-10-04 Time:09:29:33
    134 Date:2022-10-04 Time:09:38:32
    125 Date:2022-10-04 Time:09:38:33
    222 Date:2022-10-04 Time:09:47:33
    209 Date:2022-10-04 Time:09:56:33
     56 Date:2022-10-04 Time:09:56:34
vgutierrez@cp6008:~$ cut -f1-2 -d' ' readtime.log |sort |uniq -c
    215 Date:2022-10-04 Time:09:24:52
     65 Date:2022-10-04 Time:09:33:51
    144 Date:2022-10-04 Time:09:33:52
     66 Date:2022-10-04 Time:09:42:51
    147 Date:2022-10-04 Time:09:42:52
    163 Date:2022-10-04 Time:09:51:51
     41 Date:2022-10-04 Time:09:51:52

Sadly, that doesn't match any specific ATS timeout. https://gerrit.wikimedia.org/r/838111 could potentially help identify a timeout, as it will log the total time that a client request spends in ATS.
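
To double-check that cadence, the spacing between bursts can be computed directly from readtime.log. A minimal sketch; only the Date:/Time: prefix from the paste above is assumed, and the one-minute bucketing is an illustrative way to collapse each 1-2 second burst into a single point:

#!/usr/bin/env python3
"""Print the interval between consecutive slow-read bursts in readtime.log."""
from datetime import datetime, timedelta
import sys

bursts = []
for line in sys.stdin:
    tokens = line.split()
    if len(tokens) < 2 or not tokens[0].startswith("Date:"):
        continue
    date = tokens[0].removeprefix("Date:")
    time = tokens[1].removeprefix("Time:")
    ts = datetime.fromisoformat(f"{date} {time}")
    # Keep a timestamp only if it's more than a minute after the last one kept,
    # so that each burst counts once.
    if not bursts or ts - bursts[-1] > timedelta(minutes=1):
        bursts.append(ts)

for prev, cur in zip(bursts, bursts[1:]):
    print(cur, "gap:", cur - prev)

Fed with cut -f1-2 -d' ' readtime.log | sort -u, this should print gaps of roughly nine minutes, matching the pattern above.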

Change 838111 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Log total time spent on a client request

https://gerrit.wikimedia.org/r/838111

ATS is supposed to perform a cache directory sync (cache_dir_sync) every 60 seconds per the undocumented config setting proxy.config.cache.dir.sync_frequency. Logging can be enabled via the debug tag cache_dir_sync.
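
For reproducing that logging, the standard diagnostics knobs in records.config can be used; this is only a sketch, and the sync_frequency line merely shows where the (undocumented) setting would live with its default value:

# records.config excerpt (sketch)
CONFIG proxy.config.diags.debug.enabled INT 1
CONFIG proxy.config.diags.debug.tags STRING cache_dir_sync
# Undocumented; seconds between cache directory syncs, defaults to 60
CONFIG proxy.config.cache.dir.sync_frequency INT 60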

The log shows that the sync takes 8 minutes to complete:

[Oct  5 10:09:15.541] [ET_NET 11] DEBUG: <CacheDir.cc:1099 (mainEvent)> (cache_dir_sync) sync started
[Oct  5 10:09:15.541] [ET_NET 11] DEBUG: <CacheDir.cc:1111 (mainEvent)> (cache_dir_sync) Dir disk.0 294912:195351687: waiting for agg buffer                                                                                                                                                                               
[Oct  5 10:09:15.542] [ET_AIO 5:166] DEBUG: <CacheDir.cc:1099 (mainEvent)> (cache_dir_sync) sync started
[Oct  5 10:09:15.542] [ET_AIO 5:166] DEBUG: <CacheDir.cc:1118 (mainEvent)> (cache_dir_sync) pos: 615841923072 Dir disk.0 294912:195351687 dirty...syncing to disk                                                                                                                                                          
[Oct  5 10:17:15.702] [ET_NET 11] DEBUG: <CacheDir.cc:1054 (mainEvent)> (cache_dir_sync) sync done

and a minute later another sync will be performed:

[Oct  5 10:18:15.728] [ET_NET 11] DEBUG: <CacheDir.cc:1099 (mainEvent)> (cache_dir_sync) sync started

Every time a cache_dir_sync operation starts, cache read time spikes to ~1 second.
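
The correlation itself can be checked mechanically by matching the "sync started" timestamps from the debug output against the burst timestamps in readtime.log. A minimal sketch; the diags.log path and the ±2 second tolerance are assumptions:

#!/usr/bin/env python3
"""Check whether slow-read bursts line up with cache_dir_sync start events."""
from datetime import datetime, timedelta
import re

YEAR = 2022  # the debug log timestamps omit the year

def sync_starts(path):
    # Matches lines like "[Oct  5 10:09:15.541] ... (cache_dir_sync) sync started"
    pattern = re.compile(r"^\[(\w+ +\d+ [\d:]+)\.\d+\].*sync started")
    starts = []
    for line in open(path):
        m = pattern.match(line)
        if m:
            starts.append(datetime.strptime(f"{YEAR} {m.group(1)}", "%Y %b %d %H:%M:%S"))
    return starts

def burst_times(path):
    times = []
    for line in open(path):
        tokens = line.split()
        if len(tokens) < 2 or not tokens[0].startswith("Date:"):
            continue
        date, time = (tok.split(":", 1)[1] for tok in tokens[:2])
        times.append(datetime.fromisoformat(f"{date} {time}"))
    return times

TOLERANCE = timedelta(seconds=2)
syncs = sync_starts("diags.log")  # assumed location of the ATS debug output
for t in burst_times("readtime.log"):
    if any(abs(t - s) <= TOLERANCE for s in syncs):
        print(t, "coincides with a cache_dir_sync start")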

Change 839486 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Allow partitioning the cache storage in several volumes

https://gerrit.wikimedia.org/r/839486

Change 839490 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Enable cache partitioning in cp6008

https://gerrit.wikimedia.org/r/839490

Change 839486 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Allow partitioning the cache storage in several volumes

https://gerrit.wikimedia.org/r/839486

Mentioned in SAL (#wikimedia-operations) [2022-10-06T13:17:55Z] <vgutierrez> partition ats-be cache in cp6008 - T317748

Change 839490 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Enable cache partitioning in cp6008

https://gerrit.wikimedia.org/r/839490

Change 840061 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Enable cache partitioning in cp6016

https://gerrit.wikimedia.org/r/840061

Mentioned in SAL (#wikimedia-operations) [2022-10-07T08:22:14Z] <vgutierrez> partition ats-be cache in cp6016 - T317748

Change 840061 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Enable cache partitioning in cp6016

https://gerrit.wikimedia.org/r/840061

It looks like I've found a mitigation for this issue, tested in cp6008 (upload) and cp6016 (text):

image.png (1×1 px, 209 KB)

image.png (1×1 px, 226 KB)

What's happening?
What we're seeing on the affected nodes is lock contention during the init stage of the cache_dir_sync operation. ATS locks the whole cache volume to take a snapshot of a dirty cache directory before writing it back to disk using AIO. For the 1.6TB NVMe drives, a cache directory is around 2GB of data; mmap()ing 2GB is a relatively fast operation, ~0.020ms according to perf trace:

3022422084.617 ( 0.023 ms): [ET_AIO 2:166]/40605 mmap(len: 1995481088, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS      ) = 0x7fc71f2de000

What's not that fast is the memcpy() needed to copy the dirty cache directory into the buffer newly allocated by mmap(). As a result, during the memcpy() the cache volume is blocked for read and write operations. By default ATS creates a single cache volume, so this effectively blocks the entire ATS cache.
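
A back-of-envelope estimate makes the ~1 second stalls plausible: any read arriving just after the copy starts has to wait for the whole memcpy() to finish. The bandwidth figures below are assumptions for illustration, not measurements from these hosts:

# Rough estimate of how long the volume lock is held during the directory copy.
DIR_SIZE_GB = 2.0                  # ~2 GB dirty cache directory, per the mmap above
for bw_gb_s in (2, 5, 10):         # assumed effective memcpy() bandwidth in GB/s
    stall_ms = DIR_SIZE_GB / bw_gb_s * 1000
    print(f"at {bw_gb_s} GB/s the copy blocks the volume for ~{stall_ms:.0f} ms")
# => ~1000 ms, ~400 ms and ~200 ms respectively; with a single cache volume,
#    every read and write issued during that window is delayed accordingly.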

How is it being mitigated on cp6008 & cp6016?
ATS allows partitioning a cache disk into several volumes. For testing purposes I've split the cp6008/cp6016 cache into 5 volumes:

vgutierrez@cp6008:~$ sudo cat /etc/trafficserver/volume.config 
# https://docs.trafficserver.apache.org/admin-guide/files/volume.config.en.html
# This file is managed by Puppet.

# volume=volume_number  scheme=protocol_type  size=volume_size
volume=1 scheme=http size=20%
volume=2 scheme=http size=20%
volume=3 scheme=http size=20%
volume=4 scheme=http size=20%
volume=5 scheme=http size=20%

How does ATS distribute data between cache volumes?
ATS maps cache keys to cache volumes based on the hostname and the first two 32-bit words of the cache key.
As can be seen on the cache read panel for both cp6008 and cp6016, this achieves an equal distribution of operations per cache volume:

image.png (1×1 px, 196 KB)

image.png (1×1 px, 259 KB)
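
For intuition only, here's a small sketch of a hash-based key-to-volume mapping. It is not ATS's exact algorithm (which, as noted above, works off the hostname and part of the cache key), but it shows why such a mapping spreads operations roughly evenly, and why, with 5 volumes, each directory sync now copies only about a fifth of the previous ~2GB while the other four volumes keep serving reads:

import hashlib
from collections import Counter

NUM_VOLUMES = 5  # matches the volume.config above

def volume_for(hostname, cache_key):
    # Illustrative only: hash the hostname together with the cache key and
    # reduce modulo the number of volumes; ATS's real mapping differs.
    digest = hashlib.sha256(hostname.encode() + cache_key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_VOLUMES + 1

# Simulate 100k cache keys (real ATS cache keys are themselves hashes).
keys = (hashlib.md5(f"/wiki/object-{i}".encode()).digest() for i in range(100_000))
counts = Counter(volume_for("fr.wikipedia.org", k) for k in keys)
print(sorted(counts.items()))  # roughly 20,000 keys per volume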

Change 841036 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster

https://gerrit.wikimedia.org/r/841036

Change 841036 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster

https://gerrit.wikimedia.org/r/841036

Mentioned in SAL (#wikimedia-operations) [2022-10-10T09:26:04Z] <vgutierrez> partitioning the ATS cache in cp1089, cp1090, cp2041, cp2042, cp3064, cp3065, cp4034, cp4036, cp5014, cp5016, cp6007, cp6015 - T317748

Mentioned in SAL (#wikimedia-operations) [2022-10-10T13:31:42Z] <vgutierrez> partitioning the ATS cache in cp1087, cp1088, cp2039, cp2040, cp3062, cp3063, cp4033, cp4035, cp5013, cp5015, cp6006, cp6014 - T317748

Change 841451 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #3

https://gerrit.wikimedia.org/r/841451

Change 841451 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #3

https://gerrit.wikimedia.org/r/841451

Mentioned in SAL (#wikimedia-operations) [2022-10-11T08:53:08Z] <vgutierrez> partitioning the ATS cache in cp1085, cp1086, cp2037, cp2038, cp3060, cp3061, cp4026, cp4030, cp5006, cp5012, cp6005, cp6013 - T317748

Change 841486 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #4

https://gerrit.wikimedia.org/r/841486

Change 841486 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #4

https://gerrit.wikimedia.org/r/841486

Mentioned in SAL (#wikimedia-operations) [2022-10-11T12:46:53Z] <vgutierrez> partitioning the ATS cache in cp[2035-2036], cp[6004,6012], cp[1083-1084], cp[5005,5011], cp[3058-3059], cp[4025,4029] - T317748

Change 841856 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #5

https://gerrit.wikimedia.org/r/841856

Change 841856 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #5

https://gerrit.wikimedia.org/r/841856

Mentioned in SAL (#wikimedia-operations) [2022-10-12T08:52:03Z] <vgutierrez> partitioning the ATS cache in cp[2033-2034], cp[6003,6011], cp[1081-1082], cp[5004,5010], cp[3056-3057], cp[4024,4028] - T317748

Change 841941 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #6

https://gerrit.wikimedia.org/r/841941

Change 841941 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #6

https://gerrit.wikimedia.org/r/841941

Mentioned in SAL (#wikimedia-operations) [2022-10-12T15:45:45Z] <vgutierrez> partitioning the ATS cache in cp[2031-2032], cp[6002,6010], cp[1079-1080], cp[5003,5009], cp[3054-3055], cp[4023,4032] - T317748

Change 842352 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #7

https://gerrit.wikimedia.org/r/842352

Change 842352 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #7

https://gerrit.wikimedia.org/r/842352

Mentioned in SAL (#wikimedia-operations) [2022-10-13T08:48:27Z] <vgutierrez> partitioning the ATS cache in cp[2029-2030], cp[6001,6009], cp[1077-1078], cp[5002,5008], cp[3052-3053], cp4022 - T317748

Change 842486 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #8

https://gerrit.wikimedia.org/r/842486

Change 842486 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Partition cache in one server per DC and cluster #8

https://gerrit.wikimedia.org/r/842486

Mentioned in SAL (#wikimedia-operations) [2022-10-13T15:29:32Z] <vgutierrez> partitioning the ATS cache in cp[2027-2028], cp[1075-1076], cp5007, cp[3050-3051] - T317748

After partitioning the ATS cache across the whole fleet, cache read time p999 dropped as expected:

image.png (379×1 px, 50 KB)

Data for each DC can be found at https://grafana.wikimedia.org/goto/wMLyQFIVk?orgId=1

For future reference, some additional graphs captured over a slightly wider range, now that more time has passed. From:
https://grafana.wikimedia.org/d/1cjNVDM4k/ats-sli-drilldown?orgId=1&var-site=eqiad&var-site=codfw&var-site=esams&var-site=ulsfo&var-site=eqsin&var-site=drmrs&var-cluster=text&var-cluster=upload&var-percentile=999&from=1665014400000&to=1666051200000

It's the difference between night and day. From jumping up and down to sometimes 1000ms (1s), down to a very smooth and comfortable ~1ms in Codfw (0.001s). Scanning through the graph, it's all 0.999ms and 1.000ms as far as I can see. Incredibly stable and consistent now.

Screenshot-1.png (830×2 px, 157 KB)

Screenshot.png (954×2 px, 184 KB)

In Eqiad it's similarly smooth and stable, though at ~5ms instead. This difference continues to today (Grafana):

Screenshot 2022-11-04 at 22.55.27.png (1×2 px, 214 KB)

Perhaps a difference in hardware? Or perhaps a slight difference in distribution, implying that both DCs have at least some ~5ms reads every second near the p999, with Eqiad perhaps consistently over and Codfw usually under.