Partial cache_upload traffic switchover to ATS and switchback to Varnish
Closed, Resolved (Public)

Description

With the ATS clusters ready to serve production traffic (T207048), the next step is to serve a portion of the cache_upload requests via ATS and observe its behavior.

To do that, we need to:

  • add a feature flag to point individual Varnish frontends in core DCs to their dc-local ATS clusters
  • flip the switch on some cache_upload frontends, both in eqiad and codfw
  • observe and document any issues, gather operational experience
  • switch back to Varnish
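
As a rough illustration of the feature-flag idea in the list above (a minimal sketch, not the actual puppet/hiera change): each frontend uses its dc-local Varnish backends unless it has been explicitly flagged to use the dc-local ATS cluster. The `CacheSite` class, the `backends_for` helper, and the backend host names are all hypothetical; only the frontend name cp2002 comes from this task.

```python
from dataclasses import dataclass, field


@dataclass
class CacheSite:
    """Hypothetical description of one core DC's cache backends."""
    varnish_backends: list[str]
    ats_backends: list[str]
    # Frontends for which the flag is flipped; everything else stays on Varnish.
    ats_frontends: set[str] = field(default_factory=set)


def backends_for(site: CacheSite, frontend: str) -> list[str]:
    """Return the dc-local backends a given Varnish frontend should use."""
    if frontend in site.ats_frontends:
        return site.ats_backends
    return site.varnish_backends


# Backend names are placeholders, not real hosts.
codfw = CacheSite(
    varnish_backends=["varnish-be-a.codfw.example", "varnish-be-b.codfw.example"],
    ats_backends=["ats-be-a.codfw.example", "ats-be-b.codfw.example"],
)

# Flip the switch for a single frontend.
codfw.ats_frontends.add("cp2002")
# Switching back to Varnish would be: codfw.ats_frontends.discard("cp2002")
```

Keeping the flag per frontend means the switchback is the exact inverse of the switchover and touches one host at a time.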

Details

Related Gerrit Patches:
[operations/puppet@production] ATS: update cumin aliases
[operations/puppet@production] ATS: remove role trafficserver::backend
[operations/puppet@production] cp-ats: reimage as test nodes
[operations/puppet@production] Revert "cache: define ATS nodes in hiera"
[operations/puppet@production] Revert "cache: hiera flag to use ATS as local backend"
[operations/puppet@production] Revert "cp1076: use ATS backends instead of Varnish"
[operations/puppet@production] Revert "cp2002: use ATS backends instead of Varnish"
[operations/puppet@production] Revert "cp2005: use ATS backends instead of Varnish"
[operations/puppet@production] ATS: add request/response details to error message
[operations/puppet@production] ATS: fix template_sets_dir
[operations/puppet@production] ATS: custom WMF error page
[operations/puppet@production] ATS: add support for custom error messages
[operations/puppet@production] ATS: add ats-backend-restart
[operations/puppet@production] ATS: test unsetting Accept-Encoding
[operations/puppet@production] ATS: strip PKP headers
[operations/puppet@production] ATS: unset Accept-Encoding
[operations/puppet@production] ATS: add ats_transaction_err.stp
[operations/puppet@production] ATS: use pointer trick in SystemTap scripts
[operations/puppet@production] cp1076: use ATS backends instead of Varnish
[operations/puppet@production] cp2005: use ATS backends instead of Varnish
[operations/puppet@production] ATS: SystemTap probe for origin server connections
[operations/puppet@production] ATS: use 'swift-ro' as the origin server for thumb traffic
[operations/puppet@production] ATS: disable max_doc_size
[operations/puppet@production] ATS: set RAM cache size
[operations/debs/superior-cache-analyzer@master] Initial debianization
[integration/config@master] Test operations/debs/superior-cache-analyzer
[operations/puppet@production] ATS: set disk cache size cutoff to 1G
[operations/puppet@production] cp2002: use ATS backends instead of Varnish
[operations/puppet@production] ATS: do not return X-Cache-Status
[operations/puppet@production] role::trafficserver::backend: add conftool client
[operations/puppet@production] profile::trafficserver::backend: install conftool scripts
[operations/puppet@production] cache: hiera flag to use ATS as local backend
[operations/puppet@production] cache: define ATS nodes in hiera
[operations/puppet@production] Add new conftool service "ats-be"

Event Timeline

Change 494761 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp2002: use ATS backends instead of Varnish

https://gerrit.wikimedia.org/r/494761

Change 494770 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: do not return X-Cache-Status

https://gerrit.wikimedia.org/r/494770

Change 494770 merged by Ema:
[operations/puppet@production] ATS: do not return X-Cache-Status

https://gerrit.wikimedia.org/r/494770

Mentioned in SAL (#wikimedia-operations) [2019-03-13T14:27:05Z] <ema> cp2002: depool varnish-fe in preparation of pointing it to ATS T213263

Change 494761 merged by Ema:
[operations/puppet@production] cp2002: use ATS backends instead of Varnish

https://gerrit.wikimedia.org/r/494761

Mentioned in SAL (#wikimedia-operations) [2019-03-13T17:18:56Z] <ema> cp2002: pool varnish-fe for user traffic, routed through ATS backends T213263

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:28:47Z] <ema> cp2002: depool varnish-fe after 1 hour ATS experiment T213263

ema added a comment. Mar 13 2019, 6:34 PM

Preliminary testing went well: the cp-ats cluster in codfw served all frontend misses from cp2002 for about one hour between 17:18 and 18:28 without any immediately obvious issue.

Mentioned in SAL (#wikimedia-operations) [2019-03-14T10:50:02Z] <ema> cp2002: pool varnish-fe to resume ATS testing T213263

Change 496767 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: set disk cache size cutoff to 1G

https://gerrit.wikimedia.org/r/496767

Change 496767 merged by Ema:
[operations/puppet@production] ATS: set disk cache size cutoff to 1G

https://gerrit.wikimedia.org/r/496767

Change 496783 had a related patch set uploaded (by Ema; owner: Ema):
[integration/config@master] Test operations/debs/superior-cache-analyzer

https://gerrit.wikimedia.org/r/496783

Change 496781 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/superior-cache-analyzer@master] Initial debianization

https://gerrit.wikimedia.org/r/496781

Mentioned in SAL (#wikimedia-operations) [2019-03-15T15:04:35Z] <ema> cp2015: test ATS depool T213263

Mentioned in SAL (#wikimedia-operations) [2019-03-15T15:10:36Z] <ema> cp2015: repool ATS with proxy.config.cache.ram_cache.size 1G T213263

ema added a comment. Mar 15 2019, 4:19 PM

RAM cache usage seems to be growing non-stop. I have left proxy.config.cache.ram_cache.size at the default value of -1, which according to the docs means that the RAM cache size is determined automatically. That seems to be true: ATS decided that our RAM caches should be ~970M in size. However, the amount of data in the RAM cache has now grown well past that limit:

$ traffic_ctl metric match .*cache.ram_cache.*bytes
proxy.process.cache.ram_cache.total_bytes 972832768
proxy.process.cache.ram_cache.bytes_used 2565339904

I have disabled puppet on cp2015 and manually set proxy.config.cache.ram_cache.size to 1G to test if, by setting an explicit limit, the value is actually enforced.
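
The comparison above can be made mechanical. The following is a minimal sketch that parses `name value` lines in the form shown in the `traffic_ctl metric match` output and reports how far usage exceeds the limit. The helper names `parse_metrics` and `ram_cache_overshoot` are made up for illustration, and the sketch assumes one whitespace-separated metric per line, as in the sample.

```python
def parse_metrics(output: str) -> dict[str, int]:
    """Parse 'metric.name value' lines into a dict, skipping anything else."""
    metrics: dict[str, int] = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            metrics[name] = int(value)
        except ValueError:
            # Not an integer metric; ignore it in this sketch.
            continue
    return metrics


def ram_cache_overshoot(metrics: dict[str, int]) -> float:
    """Ratio of RAM cache bytes in use to the configured/auto-sized total."""
    used = metrics["proxy.process.cache.ram_cache.bytes_used"]
    total = metrics["proxy.process.cache.ram_cache.total_bytes"]
    return used / total


# The two values reported in the comment above.
SAMPLE = """\
proxy.process.cache.ram_cache.total_bytes 972832768
proxy.process.cache.ram_cache.bytes_used 2565339904
"""

ratio = ram_cache_overshoot(parse_metrics(SAMPLE))
print(f"RAM cache usage is {ratio:.2f}x the limit")
```

With the sample values, usage comes out at roughly 2.6 times the automatically determined size.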

Although we have more than 250G of memory per host, which is plenty for days to come at the current RAM cache growth rate, I'll depool cp2002's varnish-fe over the weekend just to be on the safe side.

Change 496829 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: set RAM cache size

https://gerrit.wikimedia.org/r/496829

Mentioned in SAL (#wikimedia-operations) [2019-03-15T17:53:13Z] <ema> depool cp2002's varnish-fe for the weekend T213263#5027366

Mentioned in SAL (#wikimedia-operations) [2019-03-18T08:31:55Z] <ema> cp2002: repool varnish-fe to resume ATS testing T213263

Change 496783 merged by jenkins-bot:
[integration/config@master] Test operations/debs/superior-cache-analyzer

https://gerrit.wikimedia.org/r/496783

Change 496781 merged by Ema:
[operations/debs/superior-cache-analyzer@master] Initial debianization

https://gerrit.wikimedia.org/r/496781

Mentioned in SAL (#wikimedia-operations) [2019-03-18T09:40:02Z] <ema> superior-cache-analyzer_3.3.7 uploaded to stretch-wikimedia T213263

Change 497277 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable max_doc_size

https://gerrit.wikimedia.org/r/497277

ema added a comment. Edited Mar 18 2019, 1:32 PM

Following up on the earlier test: I had disabled puppet on cp2015 and manually set proxy.config.cache.ram_cache.size to 1G to check whether an explicit limit is actually enforced.

It is: RAM cache usage is not going above the explicit limit.

Setting proxy.config.cache.ram_cache.size on all nodes.

Change 496829 merged by Ema:
[operations/puppet@production] ATS: set RAM cache size

https://gerrit.wikimedia.org/r/496829

Mentioned in SAL (#wikimedia-operations) [2019-03-18T13:41:59Z] <ema> cp-ats rolling restart to apply proxy.config.cache.ram_cache.size config change T213263

Change 497277 merged by Ema:
[operations/puppet@production] ATS: disable max_doc_size

https://gerrit.wikimedia.org/r/497277

ema added a comment. Mar 21 2019, 9:25 AM

We're currently using swift-rw (eqiad only) as the origin server for upload cache misses. Thumb traffic can, however, be served active/active by swift-ro. Doing this with remap rules requires the regex remap plugin, which isn't great. Let's do it in Lua instead.
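
The routing decision itself is small. Here is a sketch of the logic in Python (the production change does this in Lua inside ATS; this is not that code). It assumes thumbnail requests can be recognised by a `thumb` path segment, and the origin host names and the `pick_origin` helper are placeholders.

```python
# Placeholder origin names; the real swift-ro / swift-rw endpoints are not shown here.
SWIFT_RO = "swift-ro.example"
SWIFT_RW = "swift-rw.example"


def pick_origin(path: str) -> str:
    """Choose the origin server for a cache miss based on the request path.

    Assumption: thumbnail URLs contain a 'thumb' path segment, e.g.
    /wikipedia/commons/thumb/a/ab/Example.jpg/120px-Example.jpg
    """
    # Drop any query string before inspecting path segments.
    segments = path.split("?", 1)[0].split("/")
    if "thumb" in segments:
        # Thumbs can be served active/active, so use the read-only service.
        return SWIFT_RO
    # Everything else keeps going to the read-write (eqiad only) service.
    return SWIFT_RW


print(pick_origin("/wikipedia/commons/thumb/a/ab/Example.jpg/120px-Example.jpg"))
print(pick_origin("/wikipedia/commons/a/ab/Example.jpg"))
```

A plain segment check like this avoids regex matching on every request, which is the main reason for not using the regex remap plugin.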

Change 498028 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: use 'swift-ro' as the origin server for thumb traffic

https://gerrit.wikimedia.org/r/498028

Change 498031 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: SystemTap probe for origin server connections

https://gerrit.wikimedia.org/r/498031

Change 498028 merged by Ema:
[operations/puppet@production] ATS: use 'swift-ro' as the origin server for thumb traffic

https://gerrit.wikimedia.org/r/498028

Change 498031 merged by Ema:
[operations/puppet@production] ATS: SystemTap probe for origin server connections

https://gerrit.wikimedia.org/r/498031

Change 498330 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp2005: use ATS backends instead of Varnish

https://gerrit.wikimedia.org/r/498330

Mentioned in SAL (#wikimedia-operations) [2019-03-22T09:47:53Z] <ema> cp2005: depool varnish-fe in preparation of traffic switch to ATS T213263

Change 498330 merged by Ema:
[operations/puppet@production] cp2005: use ATS backends instead of Varnish

https://gerrit.wikimedia.org/r/498330

Mentioned in SAL (#wikimedia-operations) [2019-03-22T10:05:34Z] <ema> cp2005: repooled, serving traffic via ATS T213263

ema updated the task description. Mar 22 2019, 10:12 AM

Change 498849 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp1076: use ATS backends instead of Varnish

https://gerrit.wikimedia.org/r/498849

Mentioned in SAL (#wikimedia-operations) [2019-03-25T13:41:12Z] <ema> cp1076: depool varnish-fe and point it to cp-ats T213263

Change 498849 merged by Ema:
[operations/puppet@production] cp1076: use ATS backends instead of Varnish

https://gerrit.wikimedia.org/r/498849

Mentioned in SAL (#wikimedia-operations) [2019-03-25T13:52:15Z] <ema> cp1076: repool varnish-fe, frontend misses served by cp-ats T213263

Change 499762 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: use pointer trick in SystemTap scripts

https://gerrit.wikimedia.org/r/499762

Change 499763 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add ats_transaction_err.stp

https://gerrit.wikimedia.org/r/499763

Change 499762 merged by Ema:
[operations/puppet@production] ATS: use pointer trick in SystemTap scripts

https://gerrit.wikimedia.org/r/499762

Change 499763 merged by Ema:
[operations/puppet@production] ATS: add ats_transaction_err.stp

https://gerrit.wikimedia.org/r/499763

Change 500011 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: unset AE:gzip

https://gerrit.wikimedia.org/r/500011

Change 500011 merged by Ema:
[operations/puppet@production] ATS: unset Accept-Encoding

https://gerrit.wikimedia.org/r/500011

Mentioned in SAL (#wikimedia-operations) [2019-03-29T13:05:55Z] <ema> cp2002/cp2005: repool varnish-fe for user traffic T213263

Change 500656 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: test unsetting Accept-Encoding

https://gerrit.wikimedia.org/r/500656

Change 500655 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: strip PKP headers

https://gerrit.wikimedia.org/r/500655

Change 500655 merged by Ema:
[operations/puppet@production] ATS: strip PKP headers

https://gerrit.wikimedia.org/r/500655

Change 500656 merged by Ema:
[operations/puppet@production] ATS: test unsetting Accept-Encoding

https://gerrit.wikimedia.org/r/500656

Change 500675 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add ats-backend-restart

https://gerrit.wikimedia.org/r/500675

Change 500675 merged by Ema:
[operations/puppet@production] ATS: add ats-backend-restart

https://gerrit.wikimedia.org/r/500675

Mentioned in SAL (#wikimedia-operations) [2019-04-03T09:29:22Z] <ema> cp-ats-codfw: test ATS rolling restart T213263

Change 501160 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add support for custom error messages

https://gerrit.wikimedia.org/r/501160

Change 501168 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: custom WMF error page

https://gerrit.wikimedia.org/r/501168

Change 501160 merged by Ema:
[operations/puppet@production] ATS: add support for custom error messages

https://gerrit.wikimedia.org/r/501160

Change 501168 merged by Ema:
[operations/puppet@production] ATS: custom WMF error page

https://gerrit.wikimedia.org/r/501168

Change 501195 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: fix template_sets_dir

https://gerrit.wikimedia.org/r/501195

Change 501195 merged by Ema:
[operations/puppet@production] ATS: fix template_sets_dir

https://gerrit.wikimedia.org/r/501195

Change 501198 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add request/response details to error message

https://gerrit.wikimedia.org/r/501198

ema added a comment. Apr 4 2019, 12:32 PM

Here is what our custom ATS errors look like.

Change 501198 merged by Ema:
[operations/puppet@production] ATS: add request/response details to error message

https://gerrit.wikimedia.org/r/501198

Change 504036 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cp1076: use ATS backends instead of Varnish"

https://gerrit.wikimedia.org/r/504036

Change 504037 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cp2002: use ATS backends instead of Varnish"

https://gerrit.wikimedia.org/r/504037

Change 504038 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cp2005: use ATS backends instead of Varnish"

https://gerrit.wikimedia.org/r/504038

Change 504039 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: hiera flag to use ATS as local backend"

https://gerrit.wikimedia.org/r/504039

Mentioned in SAL (#wikimedia-operations) [2019-04-16T07:25:26Z] <ema> cp2005: depool varnish-fe in preparation of traffic switchback to Varnish T213263

Change 504038 merged by Ema:
[operations/puppet@production] Revert "cp2005: use ATS backends instead of Varnish"

https://gerrit.wikimedia.org/r/504038

Mentioned in SAL (#wikimedia-operations) [2019-04-16T07:32:53Z] <ema> cp2005: repool varnish-fe pointing to Varnish T213263

Mentioned in SAL (#wikimedia-operations) [2019-04-16T07:45:00Z] <ema> cp2002: depool varnish-fe in preparation of traffic switchback to Varnish T213263

Change 504037 merged by Ema:
[operations/puppet@production] Revert "cp2002: use ATS backends instead of Varnish"

https://gerrit.wikimedia.org/r/504037

Mentioned in SAL (#wikimedia-operations) [2019-04-16T07:50:38Z] <ema> cp2002: repool varnish-fe pointing to Varnish T213263

Mentioned in SAL (#wikimedia-operations) [2019-04-16T08:57:33Z] <ema> cp1076: depool varnish-fe in preparation of traffic switchback to Varnish T213263

Change 504036 merged by Ema:
[operations/puppet@production] Revert "cp1076: use ATS backends instead of Varnish"

https://gerrit.wikimedia.org/r/504036

Mentioned in SAL (#wikimedia-operations) [2019-04-16T09:05:15Z] <ema> cp1076: repool varnish-fe pointing to Varnish T213263

Change 504039 merged by Ema:
[operations/puppet@production] Revert "cache: hiera flag to use ATS as local backend"

https://gerrit.wikimedia.org/r/504039

ema closed this task as Resolved. Apr 16 2019, 9:44 AM
ema claimed this task.

Switchback completed, closing.

Change 506141 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: define ATS nodes in hiera"

https://gerrit.wikimedia.org/r/506141

Change 506141 merged by Ema:
[operations/puppet@production] Revert "cache: define ATS nodes in hiera"

https://gerrit.wikimedia.org/r/506141

Change 508567 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp-ats: reimage as test nodes

https://gerrit.wikimedia.org/r/508567

Change 508567 merged by Ema:
[operations/puppet@production] cp-ats: reimage as test nodes

https://gerrit.wikimedia.org/r/508567

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet', 'cp1073.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905071331_ema_248362.log.

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp1074.eqiad.wmnet', 'cp2003.codfw.wmnet', 'cp2015.codfw.wmnet', 'cp2021.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905071406_ema_260929.log.

Change 508587 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: remove role trafficserver::backend

https://gerrit.wikimedia.org/r/508587

Change 508588 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: update cumin aliases

https://gerrit.wikimedia.org/r/508588

Completed auto-reimage of hosts:

['cp1072.eqiad.wmnet', 'cp1073.eqiad.wmnet', 'cp1071.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cp1074.eqiad.wmnet', 'cp2021.codfw.wmnet', 'cp2015.codfw.wmnet', 'cp2003.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp2009.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905071506_ema_20589.log.

Completed auto-reimage of hosts:

['cp2009.codfw.wmnet']

and were ALL successful.

Change 508587 merged by Ema:
[operations/puppet@production] ATS: remove role trafficserver::backend

https://gerrit.wikimedia.org/r/508587

Change 508588 merged by Ema:
[operations/puppet@production] ATS: update cumin aliases

https://gerrit.wikimedia.org/r/508588