We're currently using swift-rw (eqiad only) as the origin server for upload cache misses. Thumb traffic, however, can be served active/active by swift-ro. Doing this with remap rules requires the regex remap plugin, which isn't great. Let's do it in Lua instead.
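The routing decision described above can be sketched as follows. This is only an illustration in Python, not the Lua code that would run in ATS. It assumes that thumbnail requests can be recognised by a "thumb" path segment following the project and site segments. The function name pick_origin and the bare origin labels are hypothetical; the real origin hostnames are not given here.

```python
# Hypothetical origin labels; the real hostnames are not given in this task.
ORIGIN_RW = "swift-rw"  # eqiad only, used for everything that is not a thumb
ORIGIN_RO = "swift-ro"  # active/active, assumed safe for thumb traffic


def is_thumb(path: str) -> bool:
    """Return True if the request path looks like a thumbnail request.

    Assumption: thumb URLs have the shape /<project>/<site>/thumb/...
    A plain segment comparison is used instead of a regex.
    """
    parts = path.lstrip("/").split("/")
    return len(parts) > 2 and parts[2] == "thumb"


def pick_origin(path: str) -> str:
    """Choose the origin for a cache miss based on the request path."""
    return ORIGIN_RO if is_thumb(path) else ORIGIN_RW


if __name__ == "__main__":
    for p in (
        "/wikipedia/commons/thumb/a/ab/Foo.jpg/120px-Foo.jpg",
        "/wikipedia/commons/a/ab/Foo.jpg",
    ):
        print(p, "->", pick_origin(p))
```

Comparing one path segment keeps the check cheap and avoids the regex matching that made the remap-rule approach unattractive.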
Mon, Mar 18
RAM cache usage issue submitted upstream: https://github.com/apache/trafficserver/issues/5179
Fri, Mar 15
RAM cache usage seems to be growing non-stop. I have left proxy.config.cache.ram_cache.size at its default value of -1, which according to the docs means that the RAM cache size is determined automatically. That seems to be true, and ATS decided that our RAM caches should be ~970M in size. However, the amount of data in RAM cache has now grown much larger than the limit:
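A minimal sketch of how such an overshoot could be checked from a metrics snapshot. It assumes the snapshot is a plain dict, and that the relevant counters are proxy.process.cache.ram_cache.bytes_used and proxy.process.cache.ram_cache.total_bytes; those metric names are an assumption, not taken from this task. The numbers in the usage example are made up for illustration and are not the values observed on the cache hosts.

```python
# Assumed ATS metric names; verify against the running version before use.
USED_KEY = "proxy.process.cache.ram_cache.bytes_used"
LIMIT_KEY = "proxy.process.cache.ram_cache.total_bytes"


def ram_cache_overshoot(metrics: dict) -> int:
    """Return how many bytes the RAM cache holds beyond its limit.

    Returns 0 when usage is within the limit.
    """
    used = int(metrics[USED_KEY])
    limit = int(metrics[LIMIT_KEY])
    return max(0, used - limit)


if __name__ == "__main__":
    # Hypothetical snapshot, for illustration only.
    snapshot = {USED_KEY: "1500", LIMIT_KEY: "1000"}
    print(ram_cache_overshoot(snapshot))
```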
Thu, Mar 14
Wed, Mar 13
Preliminary testing went well: the cp-ats cluster in codfw served all frontend misses from cp2002 for about one hour between 17:18 and 18:28 without any immediately obvious issue.
Closing this as cp1077 is fine and back in service. For the general Varnish scalability issue, the solution we've identified is moving to ATS (ongoing, see T213263).
Tue, Mar 12
Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked, although cron logged two restarts, one on Mar 08 and one on Mar 12:
At the time of the issue, cp1077 was failing to fetch objects from the origin servers and was affected heavily by mbox lag.
Thu, Mar 7
Those connection resets on the varnish backend layer happen when frontend caches are full and varnish cannot make space for a newly fetched object body:
Varnishlog of the varnish backend instance serving the request in esams reports the following:
Wed, Mar 6
I haven't done any investigation yet, but it sounds similar to T215389.
Can this be closed?
Anything else to be done here?
Is there anything to do here? :-)
Fri, Mar 1
Jan 10 2019
Jan 9 2019
The patch by @Vgutierrez fixed this bug. Closing.
We've added TLS support for maps and fixed the SAN list on swift to ensure proper TLS connections with upload origin servers. This is thus done.
Jan 4 2019
Jan 2 2019
New certificates deployed both in codfw and in eqiad.
Dec 21 2018
Tested the new cert on ms-fe2006, looks good:
Dec 19 2018
Dec 18 2018
Dec 17 2018
I do agree with @Joe: without a proper PKI this is going to be painful. For now, however, I've added TLS support to kartotherian (T211970) as that's part of cache_upload, which is the cluster we're gonna tackle first for the conversion to ATS.
Yes @mmodell, now I can. Thank you!