Our current puppetization attempts to enable ATS Read While Writer feature but it fails to set proxy.config.http.cache.open_write_fail_action to the expected value of 5 that would allow RWW to work
Description
Details
Event Timeline
Change 825390 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Disable origin coalescing in cp601[56]
Change 825390 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Disable origin coalescing in cp601[56]
Mentioned in SAL (#wikimedia-operations) [2022-08-22T18:12:02Z] <vgutierrez> disable origin coalescing in ats@cp601[56] - T315911
Change 825709 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Fix open_write_fail_action values
Current Status
The current settings applied when profile::trafficserver::backend::origin_coalescing is set to true (default value) are:
CONFIG proxy.config.cache.enable_read_while_writer INT 1 CONFIG proxy.config.http.cache.max_open_read_retries INT 50 CONFIG proxy.config.http.cache.max_open_write_retries INT 150 CONFIG proxy.config.cache.enable_read_while_writer INT 0
This doesn't work as expected because RWW requires proxy.config.http.cache.open_write_fail_action set to 5. We also need to take into account that proxy.config.http.cache.max_open_write_retries is ignored (per https://docs.trafficserver.apache.org/en/9.1.x/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-open-write-retries) when proxy.config.http.cache.open_write_fail_action is set to 5. Similar settings are enforced when RWW is enabled:
proxy.config.cache.read_while_writer.max_retries proxy.config.cache.read_while_writer_retry.delay
Also, proxy.config.cache.read_while_writer_retry.delay is implemented as a progressive delay, so after the second attempt the configured delay will be doubled:
if (writer_lock_retry > 2) \ _t = HRTIME_MSECONDS(cache_read_while_writer_retry_delay) * 2; \ trigger = mutex->thread_holding->schedule_in_local(this, _t); \
Current Performance issues
Due to the misconfiguration described above ATS is currently suffering the downfalls of coalescing requests without any of the benefits. Some requests wait until a Write Lock is acquired or max_open_write_retries is reached but they aren't able to benefit from the response fetched by a competing thread. Setting profile::trafficserver::backend::origin_coalescing to false on cp6015 (ATS 8.0.8) and cp6016 (ATS 9.1.3) on 2022-08-22 18:12 improved p99 for cache read times significantly:
Proposed fix
As described in the commit message for https://gerrit.wikimedia.org/r/c/operations/puppet/+/825709/:
This commit is an attempt to fix our current configuration and make it comply with the different approaches available in ATS to mitigate the thundering herd effect.
- Read While Writer (built-in ATS core): open_write_fail_action must be set to 5 per https://docs.trafficserver.apache.org/admin-guide/files/records.config.en.html?highlight=read_while_writer#proxy-config-http-cache-open-write-fail-action
- Collapsed Forwarded plugin: open_write_fail_action must be set to 1 per https://docs.trafficserver.apache.org/admin-guide/plugins/collapsed_forwarding.en.html#functionality
- N/A: open_write_fail_action must be set to 0
Potential issues down the road: when using RWW we could potentially hit https://github.com/apache/trafficserver/issues/7092, since RWW is currently enabled (but not working) cluster wide I'd recommend turning it off globally and then proceed with this commit.
After ATS configuration is fixed, we probably should enable origin_coalescing for upload and keep it disabled on text
Change 825727 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Disable origin coalescing
Change 826239 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Disable origin coalescing in cp600[78]
Change 826239 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Disable origin coalescing in cp600[78]
Mentioned in SAL (#wikimedia-operations) [2022-08-24T10:52:19Z] <vgutierrez> disable origin coalescing in ats@cp600[78] - T315911
Change 825727 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Disable origin coalescing
Mentioned in SAL (#wikimedia-operations) [2022-08-25T13:19:50Z] <vgutierrez> disable origin coalescing in ats-be globally - T315911
Change 825709 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Fix open_write_fail_action values
Change 826576 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Enable origin coalescing in cp600[78]
Change 826576 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Enable origin coalescing in cp600[78]
Mentioned in SAL (#wikimedia-operations) [2022-08-25T14:32:23Z] <vgutierrez> enable origin coalescing in ats-be@cp600[78] [expect crashes] - T315911
Change 826586 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Get rid of disable_coalescing() in Lua
Change 826586 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Get rid of disable_coalescing() in Lua
Change 826616 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] Revert "trafficserver: Enable origin coalescing in cp600[78]"
Change 826616 merged by Vgutierrez:
[operations/puppet@production] Revert "trafficserver: Enable origin coalescing in cp600[78]"
Mentioned in SAL (#wikimedia-operations) [2022-08-26T09:33:31Z] <vgutierrez> disable origin coalescing in cp6007 and cp6008 - T315911
Change 826866 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] varnish: Emit X-Varnish-Cluster for misc sites