Merge cache_misc into cache_text functionally
Closed, Resolved, Public

Description

Starting with Varnish 5.0, it is possible to load multiple VCL files and switch to the desired one from within VCL based on certain conditions (e.g. vhost-style matching on the Host header). We should take advantage of this feature to merge the functionality of our current cache_misc cluster into cache_text.
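A minimal sketch of the vhost-style dispatch described above, written in Python rather than VCL so it can be run standalone. In VCL itself the switch is expressed with return (vcl(label)) in vcl_recv. The hostnames in MISC_HOSTS and the function name select_vcl are illustrative assumptions; the label wikimedia_misc is the one named in the related VTC patches.

```python
# Hypothetical subset of the hostnames handled by cache_misc (assumption).
MISC_HOSTS = frozenset({
    "grafana.wikimedia.org",
    "phabricator.wikimedia.org",
    "config-master.wikimedia.org",
})

# Label under which the misc VCL is loaded, per the related patches.
MISC_LABEL = "wikimedia_misc"


def select_vcl(host_header):
    """Return the VCL label to switch to, or None to stay on the main VCL."""
    if not host_header:
        return None
    # Hostnames are case-insensitive and the header may carry a port.
    # IPv6 literals are not handled in this sketch.
    host = host_header.strip().lower().split(":", 1)[0]
    if host in MISC_HOSTS:
        return MISC_LABEL
    return None
```

For example, select_vcl("Grafana.wikimedia.org:443") yields the misc label, while a Host that is not in the set yields None and the request keeps being handled by the main cache_text VCL.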

Steps to complete the transition:

  • Make puppet deploy misc VCL files on cache::text hosts
  • Add the ability to load multiple VCL files to our custom reload-vcl script
  • Make sure puppet triggers a reload of the additional VCL files upon modification. The reload procedure is a bit different compared to the one used to reload the main VCL program: vcl.load ; vcl.label instead of vcl.load ; vcl.use
  • Matching the host header value with the hostnames currently handled by cache_misc (~90 at the time of this writing) should be done in an efficient and possibly clean way. The set functionality of libvmod-re2 seems apt, see https://phabricator.wikimedia.org/P7201 for what cache_text VCL might end up looking like if we choose to follow that approach
  • Add misc cache::app_directors and cache::req_handling entries to text
  • Some configuration settings differ between misc and text (eg: cache::websocket_support and cache::app_def_be_opts). See if/how they can be applied to text
  • Test by pointing misc services to text IPs (eg: curl --resolve, resolv.conf)
  • Add the misc multicast purge IP, 239.128.0.115, to text's profile::cache::base::purge_multicasts
  • DNS changes (geoip!misc-addrs -> geoip!text-addrs)
  • Reimage misc hosts as spares
  • Remove misc LVS services
  • Cleanup
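
The reload ordering mentioned in the steps above (vcl.load ; vcl.label for the additional VCL files, vcl.load ; vcl.use for the main one) can be sketched as follows. This is not the actual reload-vcl code: the function name, config names and file paths are hypothetical, and passing warm to vcl.load is only one possible way to keep the separate VCLs warm. The separate VCLs are labelled first because, per the related patch "label separate VCLs before compiling the main one", the main VCL is compiled after the labels exist.

```python
def reload_commands(main_name, main_path, separate_vcls):
    """Build the ordered Varnish CLI commands for one reload.

    separate_vcls maps a label to a (config_name, path) tuple.
    """
    commands = []
    # Load and label every separate VCL before touching the main one.
    for label, (name, path) in separate_vcls.items():
        commands.append(f"vcl.load {name} {path} warm")
        commands.append(f"vcl.label {label} {name}")
    # Only now compile the main VCL and switch to it.
    commands.append(f"vcl.load {main_name} {main_path}")
    commands.append(f"vcl.use {main_name}")
    return commands


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    for cmd in reload_commands(
        "vcl-main-1",
        "/etc/varnish/main.vcl",
        {"wikimedia_misc": ("vcl-misc-1", "/etc/varnish/misc.vcl")},
    ):
        print(cmd)
```

Each returned string is a command that could be passed to varnishadm; the sketch only builds the sequence and does not talk to a running Varnish instance.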

Details

Related Gerrit Patches:
[operations/puppet@production] cache_misc: cleanup leftovers
[operations/dns@master] Remove misc-web
[operations/puppet@production] Remove cache_misc definitions
[operations/puppet@production] varnishkafka/icinga: remove check for misc-web webrequests
[operations/puppet@production] Amend the phabricator icinga check
[operations/puppet@production] lvs: remove misc_web and misc_web-https
[operations/puppet@production] site: convert cache::misc hosts to spares
[operations/puppet@production] phabricator/otrs: use cache::text::nodes for mod_remoteip
[operations/puppet@production] profile::docker::registry: whitelist cache_text nodes
[operations/dns@master] Revert TTLs back to 600 for misc->text moves
[operations/dns@master] Use cache_text instead of cache_misc for all misc sites
[operations/dns@master] Lower TTL of cache_misc sites to 30s
[operations/dns@master] Route 10 cache_misc sites to cache_text
[operations/dns@master] Route puppetboard & debmonitor through cache_text
[operations/dns@master] Route grafana through cache_text
[operations/puppet@production] cache_text: listen for cache_misc PURGE multicasts
[operations/puppet@production] cache::text: enable websocket_support
[operations/puppet@production] cache::canary: enable websocket_support
[operations/puppet@production] vcl: add cluster_{fe,be}_vcl_switch hooks
[operations/puppet@production] cache_text: re-enable alternate domains
[operations/puppet@production] cache_text: disable all alternate domains
[operations/puppet@production] cache_canary: add phabricator for testing purposes
[operations/puppet@production] cache_text: add misc-specific VTC tests
[operations/puppet@production] cache_text: load misc VCL as wikimedia_misc in VTC files
[operations/puppet@production] Revert "Revert "cache_text: add support for alternate_domains""
[operations/puppet@production] network::constants: define all caches, not only cache_misc
[operations/puppet@production] reload-vcl: manually set separate VCL files as warm
[operations/puppet@production] cache_canary: add config-master for testing purposes
[operations/puppet@production] varnish: startup process for multiple VCL files
[operations/puppet@production] cache_text: add support for alternate_domains
[operations/puppet@production] cache_text: add misc directors and alternate_domains
[operations/puppet@production] reload-vcl: label separate VCLs before compiling the main one
[operations/puppet@production] varnish: load separate VCL files on service startup
[operations/puppet@production] reload-vcl: do not include layer information in additional VCL labels
[operations/puppet@production] cache::text: ship cache_misc VCL
[operations/puppet@production] cache_text: test switching to cache_misc VCL
[operations/puppet@production] varnish::instance: separate VCLs support
[operations/puppet@production] reload-vcl: add --separate-vcls
[operations/puppet@production] cache: allow installing separate VCL files
[operations/puppet@production] varnish: install libvmod-re2

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 445081 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: start without specifying a VCL file

https://gerrit.wikimedia.org/r/445081

Change 445081 merged by Ema:
[operations/puppet@production] varnish: startup process for multiple VCL files

https://gerrit.wikimedia.org/r/445081

Change 445126 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] network::constants: define and use cache_text

https://gerrit.wikimedia.org/r/445126

Change 445133 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_canary: add config-master testing purposes

https://gerrit.wikimedia.org/r/445133

Change 445133 merged by Ema:
[operations/puppet@production] cache_canary: add config-master for testing purposes

https://gerrit.wikimedia.org/r/445133

Change 445357 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] reload-vcl: manually set separate VCL files as warm

https://gerrit.wikimedia.org/r/445357

Change 445357 merged by Ema:
[operations/puppet@production] reload-vcl: manually set separate VCL files as warm

https://gerrit.wikimedia.org/r/445357

Change 445126 merged by Ema:
[operations/puppet@production] network::constants: define all caches, not only cache_misc

https://gerrit.wikimedia.org/r/445126

Change 447593 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "Revert "cache_text: add support for alternate_domains""

https://gerrit.wikimedia.org/r/447593

Mentioned in SAL (#wikimedia-operations) [2018-07-24T11:35:26Z] <ema> disable puppet on cp-text hosts to merge alternate domains patch T164609

Change 447593 merged by Ema:
[operations/puppet@production] Revert "Revert "cache_text: add support for alternate_domains""

https://gerrit.wikimedia.org/r/447593

Mentioned in SAL (#wikimedia-operations) [2018-07-24T12:07:53Z] <ema> depool cp1067 to test alternate domains patch T164609

Mentioned in SAL (#wikimedia-operations) [2018-07-24T13:07:40Z] <ema> repool cp1067 with alternate domains support T164609

Change 443930 merged by Ema:
[operations/puppet@production] cache_text: load misc VCL as wikimedia_misc in VTC files

https://gerrit.wikimedia.org/r/443930

Change 443974 merged by Ema:
[operations/puppet@production] cache_text: add misc-specific VTC tests

https://gerrit.wikimedia.org/r/443974

Mentioned in SAL (#wikimedia-operations) [2018-07-24T13:45:51Z] <ema> apply alternate domains patch to text-eqiad T164609

Change 447643 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_canary: add phabricator for testing purposes

https://gerrit.wikimedia.org/r/447643

Change 447643 merged by Ema:
[operations/puppet@production] cache_canary: add phabricator for testing purposes

https://gerrit.wikimedia.org/r/447643

Change 447646 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable all alternate_domains but grafana

https://gerrit.wikimedia.org/r/447646

Change 447646 merged by Ema:
[operations/puppet@production] cache_text: disable all alternate domains

https://gerrit.wikimedia.org/r/447646

Mentioned in SAL (#wikimedia-operations) [2018-07-24T17:18:18Z] <ema> restart varnish-fe on cp1068 to clear "child restarted" alert T164609

Mentioned in SAL (#wikimedia-operations) [2018-07-24T17:40:19Z] <ema> re-enable puppet on all cache nodes with alternate domains disabled T164609

Mentioned in SAL (#wikimedia-operations) [2018-07-25T09:31:25Z] <ema> upload varnish 5.1.3-1wm9 to apt.w.o (fixing POST requests w/ separate VCL) T164609

Mentioned in SAL (#wikimedia-operations) [2018-07-25T10:05:08Z] <ema> upgrade varnish to 5.1.3-1wm9 on text-eqiad T164609

Change 447776 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: re-enable alternate domains

https://gerrit.wikimedia.org/r/447776

Change 447776 merged by Ema:
[operations/puppet@production] cache_text: re-enable alternate domains

https://gerrit.wikimedia.org/r/447776

Mentioned in SAL (#wikimedia-operations) [2018-07-25T13:26:03Z] <ema> depool cp1067 and test alternate domains patch with varnish 5.1.3-1wm9 T164609

Mentioned in SAL (#wikimedia-operations) [2018-07-25T13:34:06Z] <ema> repool cp1067 w/ alternate domains patch and varnish 5.1.3-1wm9 T164609

Mentioned in SAL (#wikimedia-operations) [2018-07-25T13:48:49Z] <ema> text-eqiad: test alternate domains patch with varnish 5.1.3-1wm9 T164609

Change 447836 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: add cluster_{fe,be}_vcl_switch hooks

https://gerrit.wikimedia.org/r/447836

Change 447836 merged by Ema:
[operations/puppet@production] vcl: add cluster_{fe,be}_vcl_switch hooks

https://gerrit.wikimedia.org/r/447836

Mentioned in SAL (#wikimedia-operations) [2018-07-26T10:24:31Z] <ema> cache-text: upgrade to varnish 5.1.3-1wm9 and apply alternate domains patch T164609

Change 449739 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache::canary: enable websocket_support

https://gerrit.wikimedia.org/r/449739

Change 449739 merged by Ema:
[operations/puppet@production] cache::canary: enable websocket_support

https://gerrit.wikimedia.org/r/449739

Change 449752 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache::text: enable websocket_support

https://gerrit.wikimedia.org/r/449752

Change 449752 merged by Ema:
[operations/puppet@production] cache::text: enable websocket_support

https://gerrit.wikimedia.org/r/449752

Change 449945 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: listen for cache_misc PURGE multicasts

https://gerrit.wikimedia.org/r/449945

Change 449945 merged by Ema:
[operations/puppet@production] cache_text: listen for cache_misc PURGE multicasts

https://gerrit.wikimedia.org/r/449945

ema updated the task description (Aug 2 2018, 7:47 AM).

Change 450513 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Route grafana through cache_text

https://gerrit.wikimedia.org/r/450513

Change 450513 merged by Ema:
[operations/dns@master] Route grafana through cache_text

https://gerrit.wikimedia.org/r/450513

Change 450909 had a related patch set uploaded (by Volans; owner: Volans):
[operations/dns@master] Route puppetboard & debmonitor through cache_text

https://gerrit.wikimedia.org/r/450909

Joe added a subscriber: Joe (Aug 7 2018, 7:16 AM).

Sometimes we get 503 peaks from a cache_misc application like phabricator or gerrit; knowing the origin of the 5xxs in broad categories ("public traffic for the sites" vs "miscellanea") was very useful IMHO; do we have a way to preserve such information?

Change 450909 merged by Volans:
[operations/dns@master] Route puppetboard & debmonitor through cache_text

https://gerrit.wikimedia.org/r/450909

Mentioned in SAL (#wikimedia-operations) [2018-08-07T07:29:49Z] <volans> migrated puppetboard and debmonitor to cache_text - T164609

Change 451585 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Route 10 cache_misc sites to cache_text

https://gerrit.wikimedia.org/r/451585

Change 451585 merged by Ema:
[operations/dns@master] Route 10 cache_misc sites to cache_text

https://gerrit.wikimedia.org/r/451585

Change 451656 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Lower TTL of cache_misc sites to 30s

https://gerrit.wikimedia.org/r/451656

Change 451656 merged by Ema:
[operations/dns@master] Lower TTL of cache_misc sites to 30s

https://gerrit.wikimedia.org/r/451656

Change 451659 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Use cache_text instead of cache_misc for all sites

https://gerrit.wikimedia.org/r/451659

ema updated the task description (Aug 9 2018, 4:55 PM).

Change 451659 merged by BBlack:
[operations/dns@master] Use cache_text instead of cache_misc for all misc sites

https://gerrit.wikimedia.org/r/451659

Change 451695 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Revert TTLs back to 600 for misc->text moves

https://gerrit.wikimedia.org/r/451695

ema updated the task description (Aug 10 2018, 7:10 AM).

Change 451695 merged by BBlack:
[operations/dns@master] Revert TTLs back to 600 for misc->text moves

https://gerrit.wikimedia.org/r/451695

Change 452182 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] profile::docker::registry: whitelist cache_text nodes

https://gerrit.wikimedia.org/r/452182

Change 452182 merged by Ema:
[operations/puppet@production] profile::docker::registry: whitelist cache_text nodes

https://gerrit.wikimedia.org/r/452182

Change 452321 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] phabricator/otrs: use cache::text::nodes for mod_remoteip

https://gerrit.wikimedia.org/r/452321

> Sometimes we get 503 peaks from a cache_misc application like phabricator or gerrit; knowing the origin of the 5xxs in broad categories ("public traffic for the sites" vs "miscellanea") was very useful IMHO; do we have a way to preserve such information?

We could surely distinguish between "public" vs "not-really" at the prometheus level, with a predictable amount of bikeshedding involved [is phabricator public? wdqs? Note BTW that both examples were on cache_misc before :)]. Perhaps even adding the Host header as a label for certain metrics, such as job_method_status:varnish_requests:rate5m, would be interesting. @fgiunchedi probably knows whether that is doable prometheus-wise.

Change 452321 merged by Ema:
[operations/puppet@production] phabricator/otrs: use cache::text::nodes for mod_remoteip

https://gerrit.wikimedia.org/r/452321

  • I think the "text" vs "misc" distinction was becoming problematic anyways, both for alerting and for rolling out changes. It carries with it some old assumptions that everything on misc was a handful of less-important services.
  • We do have per-backend-service 5xx metrics, I believe, although maybe not the best graphing for them at present. They could be usefully categorized, I think (maybe with some metadata rather than by hand?). I think it's worth thinking a bit about which categories are useful. Rather than thinking in Traffic terms, we should probably come up with a 3-tier system that's useful for all general monitoring, and then place the traffic-layer alerting about backends into those same categories as appropriate. Something that reflects the notion of "high/prod" being critical to live service of primary wiki projects (probably including APIs and side-services people care about, e.g. ores?), "medium/meta" for ancillary services that are still pretty uptime-critical and important to users and/or developers even if not in the direct line of fire (e.g. phab?), and "low" for services that aren't an emergency if they fail for a little while (debmonitor?).
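
To make the 3-tier idea a bit more concrete, here is a rough sketch of rolling per-backend 5xx counts up into tiers driven by metadata rather than by hand. The tier assignments just mirror the tentative examples above (ores, phab, debmonitor) and are assumptions, as are the function name, the sample format and the "unclassified" bucket for services without metadata.

```python
from collections import Counter

# Hypothetical service -> tier metadata, following the tentative examples above.
SERVICE_TIERS = {
    "ores": "high",
    "phabricator": "medium",
    "debmonitor": "low",
}

# Assumption: services without metadata are reported separately.
UNCLASSIFIED = "unclassified"


def errors_by_tier(samples):
    """Sum 5xx counts per tier from (service, status, count) samples."""
    totals = Counter()
    for service, status, count in samples:
        if 500 <= status <= 599:
            totals[SERVICE_TIERS.get(service, UNCLASSIFIED)] += count
    return totals
```

Alerting thresholds could then be defined per tier instead of per cache cluster, so a 503 peak from a "medium" service stays distinguishable from one that affects the primary wiki projects.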

Change 460217 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] site: convert cache::misc hosts to spares

https://gerrit.wikimedia.org/r/460217

Change 460218 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] lvs: remove misc_web and misc_web-https

https://gerrit.wikimedia.org/r/460218

Change 460219 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Remove cache_misc definitions

https://gerrit.wikimedia.org/r/460219

Change 460275 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Remove misc-web

https://gerrit.wikimedia.org/r/460275

Change 460217 merged by Ema:
[operations/puppet@production] site: convert cache::misc hosts to spares

https://gerrit.wikimedia.org/r/460217

ema updated the task description (Sep 14 2018, 8:59 AM).

Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts:

cp3007.esams.wmnet

The log can be found in /var/log/wmf-auto-reimage/201809140903_ema_28457_cp3007_esams_wmnet.log.

Completed auto-reimage of hosts:

['cp3007.esams.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts:

cp1045.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201809141036_ema_18700_cp1045_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cp1045.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts:

['cp1051.eqiad.wmnet', 'cp2006.codfw.wmnet', 'cp2012.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201809141259_ema_14401.log.

Completed auto-reimage of hosts:

['cp1051.eqiad.wmnet', 'cp2006.codfw.wmnet', 'cp2012.codfw.wmnet']

and were ALL successful.

Change 460218 merged by Ema:
[operations/puppet@production] lvs: remove misc_web and misc_web-https

https://gerrit.wikimedia.org/r/460218

Mentioned in SAL (#wikimedia-operations) [2018-09-14T14:23:18Z] <ema> lvs1002: restart pybal to remove misc-web T164609

Mentioned in SAL (#wikimedia-operations) [2018-09-14T14:28:44Z] <ema> lvs2002: restart pybal to remove misc-web T164609

Change 460540 had a related patch set uploaded (by Ema; owner: Alexandros Kosiaris):
[operations/puppet@production] Amend the phabricator icinga check

https://gerrit.wikimedia.org/r/460540

Mentioned in SAL (#wikimedia-operations) [2018-09-14T14:42:13Z] <ema> lvs3002: restart pybal to remove misc-web T164609

Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts:

['cp3008.esams.wmnet', 'cp2018.codfw.wmnet', 'cp1058.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201809141445_ema_5224.log.

Completed auto-reimage of hosts:

['cp1058.eqiad.wmnet', 'cp2018.codfw.wmnet', 'cp3008.esams.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts:

['cp1061.eqiad.wmnet', 'cp2025.codfw.wmnet', 'cp3010.esams.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201809141518_ema_14670.log.

Completed auto-reimage of hosts:

['cp1061.eqiad.wmnet', 'cp2025.codfw.wmnet', 'cp3010.esams.wmnet']

and were ALL successful.

ema updated the task description (Sep 14 2018, 3:50 PM).

Change 460540 merged by Dzahn:
[operations/puppet@production] Amend the phabricator icinga check

https://gerrit.wikimedia.org/r/460540

Change 460562 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] varnishkafka/icinga: remove check for misc-web webrequests

https://gerrit.wikimedia.org/r/460562

Change 460562 merged by Dzahn:
[operations/puppet@production] varnishkafka/icinga: remove check for misc-web webrequests

https://gerrit.wikimedia.org/r/460562

Change 460219 merged by Ema:
[operations/puppet@production] Remove cache_misc definitions

https://gerrit.wikimedia.org/r/460219

Change 460275 merged by Ema:
[operations/dns@master] Remove misc-web

https://gerrit.wikimedia.org/r/460275

ema closed this task as Resolved (Sep 17 2018, 11:58 AM).
ema updated the task description.

Change 460922 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_misc: cleanup leftovers

https://gerrit.wikimedia.org/r/460922

Change 460922 merged by Ema:
[operations/puppet@production] cache_misc: cleanup leftovers

https://gerrit.wikimedia.org/r/460922