Page MenuHomePhabricator

ATS Read While Writer feature is wrongly configured
Open, MediumPublic

Description

Our current puppetization attempts to enable ATS Read While Writer feature but it fails to set proxy.config.http.cache.open_write_fail_action to the expected value of 5 that would allow RWW to work

Event Timeline

Change 825390 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Disable origin coalescing in cp601[56]

https://gerrit.wikimedia.org/r/825390

Change 825390 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Disable origin coalescing in cp601[56]

https://gerrit.wikimedia.org/r/825390

Mentioned in SAL (#wikimedia-operations) [2022-08-22T18:12:02Z] <vgutierrez> disable origin coalescing in ats@cp601[56] - T315911

Change 825709 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Fix open_write_fail_action values

https://gerrit.wikimedia.org/r/825709

Current Status

The current settings applied when profile::trafficserver::backend::origin_coalescing is set to true (default value) are:

CONFIG proxy.config.cache.enable_read_while_writer INT 1
CONFIG proxy.config.http.cache.max_open_read_retries INT 50
CONFIG proxy.config.http.cache.max_open_write_retries INT 150
CONFIG proxy.config.cache.enable_read_while_writer INT 0

This doesn't work as expected because RWW requires proxy.config.http.cache.open_write_fail_action set to 5. We also need to take into account that proxy.config.http.cache.max_open_write_retries is ignored (per https://docs.trafficserver.apache.org/en/9.1.x/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-open-write-retries) when proxy.config.http.cache.open_write_fail_action is set to 5. Similar settings are enforced when RWW is enabled:

proxy.config.cache.read_while_writer.max_retries
proxy.config.cache.read_while_writer_retry.delay

Also, proxy.config.cache.read_while_writer_retry.delay is implemented as a progressive delay, so after the second attempt the configured delay will be doubled:

iocore/cache/P_CacheInternal.h:92
if (writer_lock_retry > 2)                                            \
  _t = HRTIME_MSECONDS(cache_read_while_writer_retry_delay) * 2;      \
trigger = mutex->thread_holding->schedule_in_local(this, _t);         \

Current Performance issues

Due to the misconfiguration described above ATS is currently suffering the downfalls of coalescing requests without any of the benefits. Some requests wait until a Write Lock is acquired or max_open_write_retries is reached but they aren't able to benefit from the response fetched by a competing thread. Setting profile::trafficserver::backend::origin_coalescing to false on cp6015 (ATS 8.0.8) and cp6016 (ATS 9.1.3) on 2022-08-22 18:12 improved p99 for cache read times significantly:

cp6015

cp6016

Proposed fix

As described in the commit message for https://gerrit.wikimedia.org/r/c/operations/puppet/+/825709/:
This commit is an attempt to fix our current configuration and make it comply with the different approaches available in ATS to mitigate the thundering herd effect.

Potential issues down the road: when using RWW we could potentially hit https://github.com/apache/trafficserver/issues/7092, since RWW is currently enabled (but not working) cluster wide I'd recommend turning it off globally and then proceed with this commit.

After ATS configuration is fixed, we probably should enable origin_coalescing for upload and keep it disabled on text

Change 825727 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Disable origin coalescing

https://gerrit.wikimedia.org/r/825727

Change 826239 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Disable origin coalescing in cp600[78]

https://gerrit.wikimedia.org/r/826239

Change 826239 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Disable origin coalescing in cp600[78]

https://gerrit.wikimedia.org/r/826239

Mentioned in SAL (#wikimedia-operations) [2022-08-24T10:52:19Z] <vgutierrez> disable origin coalescing in ats@cp600[78] - T315911

Change 825727 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Disable origin coalescing

https://gerrit.wikimedia.org/r/825727

Mentioned in SAL (#wikimedia-operations) [2022-08-25T13:19:50Z] <vgutierrez> disable origin coalescing in ats-be globally - T315911

Change 825709 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Fix open_write_fail_action values

https://gerrit.wikimedia.org/r/825709

Change 826576 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Enable origin coalescing in cp600[78]

https://gerrit.wikimedia.org/r/826576

Change 826576 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Enable origin coalescing in cp600[78]

https://gerrit.wikimedia.org/r/826576

Mentioned in SAL (#wikimedia-operations) [2022-08-25T14:32:23Z] <vgutierrez> enable origin coalescing in ats-be@cp600[78] [expect crashes] - T315911

Change 826586 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Get rid of disable_coalescing() in Lua

https://gerrit.wikimedia.org/r/826586

Change 826586 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Get rid of disable_coalescing() in Lua

https://gerrit.wikimedia.org/r/826586

Change 826616 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] Revert "trafficserver: Enable origin coalescing in cp600[78]"

https://gerrit.wikimedia.org/r/826616

Change 826616 merged by Vgutierrez:

[operations/puppet@production] Revert "trafficserver: Enable origin coalescing in cp600[78]"

https://gerrit.wikimedia.org/r/826616

Mentioned in SAL (#wikimedia-operations) [2022-08-26T09:33:31Z] <vgutierrez> disable origin coalescing in cp6007 and cp6008 - T315911

Change 826866 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Emit X-Varnish-Cluster for misc sites

https://gerrit.wikimedia.org/r/826866

Vgutierrez triaged this task as Medium priority.Sep 6 2022, 1:22 PM