
Test envoyproxy as a WMF's CDN TLS terminator with real traffic
Closed, Resolved (Public)

Description

Envoy is one of the candidates to replace ats-tls as the TLS terminator used in the WMF caching infrastructure. To fully validate its performance and stability, a test with real traffic is needed.
Several requirements need to be fulfilled before this test can be performed:

  • Adapt the current envoy puppetization to be able to meet Traffic requirements -> currently blocked by T265880
  • Test envoyproxy in Traffic WCMS environment
  • Test envoyproxy in ulsfo
    • Currently running on:
      • cp4025 (upload)
      • cp4031 (text)
  • Test envoyproxy in eqsin
    • Currently running on:
      • cp5005 (upload)
      • cp5011 (text)
  • Test envoyproxy in esams
    • Currently running on:
      • cp3063 (upload)
      • cp3062 (text)
  • Test envoyproxy in codfw
    • Currently running on:
      • cp2040 (upload)
      • cp2039 (text)
  • Test envoyproxy in eqiad
    • Currently running on:
      • cp1088 (upload)
      • cp1087 (text)

Details

Subject | Repo | Branch | Lines +/-
 | operations/puppet | production | +10 -8
 | operations/puppet | production | +11 -9
 | operations/puppet | production | +10 -8
 | operations/puppet | production | +10 -8
 | operations/puppet | production | +10 -8
 | operations/puppet | production | +10 -8
 | operations/puppet | production | +10 -8
 | operations/puppet | production | +10 -13
 | operations/puppet | production | +20 -0
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +15 -0
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +292 -0
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +3 -0
 | operations/puppet | production | +18 -0
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +3 -0
 | operations/puppet | production | +1 -0
 | operations/puppet | production | +19 -0
 | operations/puppet | production | +1 -0
 | operations/puppet | production | +10 -1
 | operations/puppet | production | +54 -0
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +13 -0
 | operations/puppet | production | +29 -6
 | operations/puppet | production | +1 -0
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +13 -0
 | operations/puppet | production | +15 -0
 | operations/puppet | production | +17 -6
 | operations/puppet | production | +1 -0
 | operations/puppet | production | +16 -0
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +1 -1
 | operations/puppet | production | +95 -0
 | operations/puppet | production | +9 -1
 | operations/puppet | production | +5 -0
 | operations/puppet | production | +245 -0
 | operations/puppet | production | +32 -0
 | operations/puppet | production | +19 -0
 | operations/puppet | production | +10 -0
 | operations/puppet | production | +9 -0
 | operations/puppet | production | +35 -18
 | operations/puppet | production | +73 -0
 | operations/puppet | production | +25 -2
 | operations/puppet | production | +28 -2
 | operations/puppet | production | +26 -7
 | operations/puppet | production | +30 -0
 | operations/puppet | production | +5 -3
 | operations/puppet | production | +74 -26
 | operations/puppet | production | +111 -0
 | operations/puppet | production | +31 -0
 | operations/puppet | production | +39 -13
 | operations/puppet | production | +21 -0
 | operations/puppet | production | +81 -57
 | operations/puppet | production | +41 -1
 | operations/puppet | production | +8 -0
 | operations/puppet | production | +8 -4

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 756541 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp1088 as cache::upload_envoy

https://gerrit.wikimedia.org/r/756541

Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster completed:

  • cp1088 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201241103_vgutierrez_20056_cp1088.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-24T11:50:46Z] <vgutierrez> pool cp1088 using envoy as TLS termination layer - T271421

Change 757415 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache: Provide a text_envoy role

https://gerrit.wikimedia.org/r/757415

Change 757415 merged by Vgutierrez:

[operations/puppet@production] cache: Provide a text_envoy role

https://gerrit.wikimedia.org/r/757415

Change 757627 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp4031 as cache::text_envoy

https://gerrit.wikimedia.org/r/757627

Mentioned in SAL (#wikimedia-operations) [2022-01-27T11:12:00Z] <vgutierrez> depool cp4031 to be reimaged as cache::text_envoy - T271421

Change 757627 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp4031 as cache::text_envoy

https://gerrit.wikimedia.org/r/757627

Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4031.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4031.ulsfo.wmnet with OS buster completed:

  • cp4031 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201271129_vgutierrez_30656_cp4031.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-28T15:41:55Z] <vgutierrez> pool cp4031 using envoy as TLS termination layer - T271421

Change 757917 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] prometheus::ops: Gather varnish mtail metrics on text_envoy

https://gerrit.wikimedia.org/r/757917

Change 757917 merged by Vgutierrez:

[operations/puppet@production] prometheus::ops: Gather varnish mtail metrics on text_envoy

https://gerrit.wikimedia.org/r/757917

Change 758430 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp5011 as cache::text_envoy

https://gerrit.wikimedia.org/r/758430

Mentioned in SAL (#wikimedia-operations) [2022-01-31T09:53:24Z] <vgutierrez> depool cp5011 to be reimaged as cache::text_envoy - T271421

Change 758430 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp5011 as cache::text_envoy

https://gerrit.wikimedia.org/r/758430

Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster completed:

  • cp5011 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201310955_vgutierrez_8847_cp5011.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-31T10:58:13Z] <vgutierrez> pool cp5011 running envoy as TLS terminator - T271421

Change 758512 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp3062 as cache::text_envoy

https://gerrit.wikimedia.org/r/758512

Mentioned in SAL (#wikimedia-operations) [2022-02-01T09:01:31Z] <vgutierrez> depool cp3062 to be reimaged as cache::text_envoy - T271421

Change 758512 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp3062 as cache::text_envoy

https://gerrit.wikimedia.org/r/758512

Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster completed:

  • cp3062 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202010903_vgutierrez_28355_cp3062.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-01T10:14:19Z] <vgutierrez> pool cp3062 running envoy as TLS terminator - T271421

Change 758879 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp2039 as cache::text_envoy

https://gerrit.wikimedia.org/r/758879

Mentioned in SAL (#wikimedia-operations) [2022-02-01T16:10:43Z] <vgutierrez> depool cp2039 to be reimaged as cache::text_envoy - T271421

Change 758879 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp2039 as cache::text_envoy

https://gerrit.wikimedia.org/r/758879

Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster completed:

  • cp2039 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202011612_vgutierrez_14398_cp2039.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-01T17:21:48Z] <vgutierrez> pool cp2039 running envoy as TLS terminator - T271421

Change 759224 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::envoy: Reduce downstream_idle_timeout

https://gerrit.wikimedia.org/r/759224

Change 759224 merged by Vgutierrez:

[operations/puppet@production] cache::envoy: Reduce downstream_idle_timeout

https://gerrit.wikimedia.org/r/759224

Change 759231 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp1087 as cache::text_envoy

https://gerrit.wikimedia.org/r/759231

Mentioned in SAL (#wikimedia-operations) [2022-02-02T11:48:40Z] <vgutierrez> depool cp1087 to be reimaged as cache::text_envoy - T271421

Change 759231 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp1087 as cache::text_envoy

https://gerrit.wikimedia.org/r/759231

Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster completed:

  • cp1087 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202021150_vgutierrez_28641_cp1087.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-02T14:13:17Z] <vgutierrez> pool cp1087 running envoy as TLS terminator - T271421

There are some user reports / IRC chatter and this ticket T300366 that seem to be related to this ("426 issues").

Summarizing the ticket linked above: apparently we do have HTTP/1.0 clients, and they work with our other TLS terminators, but not with envoy.
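
For anyone wanting to reproduce this by hand, here is a minimal sketch of what such a client sends and how the rejection could be recognized. It assumes the failure shows up as an HTTP 426 status (going by the "426 issues" wording above). The helper names are made up for illustration, and nothing here talks to the network.

```python
from typing import Optional, Tuple


def build_http10_request(path: str = "/", host: Optional[str] = None) -> bytes:
    """Build a raw HTTP/1.0 GET request; the Host header is optional in 1.0."""
    lines = [f"GET {path} HTTP/1.0"]
    if host is not None:
        lines.append(f"Host: {host}")
    # A blank line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_status_line(response: bytes) -> Tuple[str, int, str]:
    """Split the first response line into (version, status code, reason)."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    version, code, *reason = first_line.split(" ", 2)
    return version, int(code), reason[0] if reason else ""


def looks_like_http10_rejection(response: bytes) -> bool:
    """Assumption: the terminator answers unsupported HTTP/1.0 requests with 426."""
    _, code, _ = parse_status_line(response)
    return code == 426
```

The request bytes could be written to a TLS socket opened against a test host, and the first bytes read back passed to `looks_like_http10_rejection`; which host to probe is left out on purpose.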

Envoy does have some config for this (cf https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http1protocoloptions ), but it's not on by default, was added later and not part of the original design, and it has some scary language about being "not fully standards compliant". It also requires a default hostname for when the Host header is lacking. This probably needs some further investigation to determine the path forward here. I can imagine three loosely defined buckets of where we could end up:

  1. Maybe we can turn this on safely, with a reasonable (perhaps invalid) default hostname.
  2. Maybe we don't actually need to support HTTP/1.0 and can justify turning it off (seems unlikely, but possible)
  3. Maybe neither of the above is true, which gives us a useful data point in making decisions about using envoy :)
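
For reference, the knobs in question are the `accept_http_10` and `default_host_for_http_10` fields of `Http1ProtocolOptions`. Below is a minimal sketch of what option 1 could look like, generating the fragment as JSON from Python. The default hostname is a placeholder rather than a proposal, and how this would be wired into the puppetized envoy config is not shown.

```python
import json


def http1_protocol_options(default_host: str) -> dict:
    """Return an Http1ProtocolOptions fragment that accepts HTTP/1.0."""
    if not default_host:
        raise ValueError("a default host is needed for Host-less requests")
    return {
        "accept_http_10": True,
        "default_host_for_http_10": default_host,
    }


# Placeholder hostname under the reserved .invalid TLD, purely illustrative.
fragment = {"http_protocol_options": http1_protocol_options("http10.invalid")}
print(json.dumps(fragment, indent=2))
```

Using a name under `.invalid` would match the "perhaps invalid" default hostname idea: Host-less HTTP/1.0 requests would get a well-defined but unroutable host instead of silently matching a real site.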
This comment was removed by Ahecht.

Envoy does have some config for this (cf https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http1protocoloptions ), but it's not on by default, was added later and not part of the original design, and it has some scary language about being "not fully standards compliant". It also requires a default hostname for when the Host header is lacking. This probably needs some further investigation to determine the path forward here. I can imagine three loosely defined buckets of where we could end up:

This has the original discussion on this feature I think: https://github.com/envoyproxy/envoy/issues/170
Here is the paper on differences between 1.0 and 1.1 which might help with risk assessment: https://web.archive.org/web/20180302204914/http://www.research.att.com/people/Krishnamurthy_Balachander/papers/h0vh1.html

Support was written by an experienced envoy developer and a Google dev. Maybe we can ask Google about their experience?

Another report of user-facing impact in #mediawiki from someone using the w3m browser: https://wm-bot.wmcloud.org/logs/%23mediawiki/20220203.txt

Change 759474 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cumin: Add cache::text_envoy to cp-text alias

https://gerrit.wikimedia.org/r/759474

Change 759474 merged by Vgutierrez:

[operations/puppet@production] cumin: Add cache::text_envoy to cp-text alias

https://gerrit.wikimedia.org/r/759474

Change 762802 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::envoy: Bound envoy to the same NUMA node as the main NIC

https://gerrit.wikimedia.org/r/762802

Change 762802 merged by Vgutierrez:

[operations/puppet@production] cache::envoy: Bound envoy to the same NUMA node as the main NIC

https://gerrit.wikimedia.org/r/762802
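
As an illustration of what this change is about (not the code in the patch itself): on Linux, the NUMA node a PCI NIC is attached to can be read from sysfs at `/sys/class/net/<iface>/device/numa_node`. A minimal sketch follows; the function name and the `sysfs_root` parameter are assumptions made for the example.

```python
from pathlib import Path


def nic_numa_node(iface: str, sysfs_root: str = "/sys") -> int:
    """Return the NUMA node of a NIC, or -1 if the kernel reports none."""
    node_file = Path(sysfs_root) / "class" / "net" / iface / "device" / "numa_node"
    return int(node_file.read_text().strip())
```

One way to use the result would be to start the process under `numactl --cpunodebind=<node> --membind=<node>`; whether the puppet change does it that way is not stated here.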

Change 766073 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp4025 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766073

Change 766073 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp4025 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766073

The envoy instances are currently being reimaged as HAProxy ones. We're cleaning up and pausing the envoyproxy experiment.

Change 766078 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp2040 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766078

Change 766078 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp2040 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766078

Mentioned in SAL (#wikimedia-operations) [2022-02-25T12:32:45Z] <vgutierrez> pool cp2040 running HAProxy as TLS termination layer - T290005 T271421

Change 766102 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp5005 as cache::haproxy_upload

https://gerrit.wikimedia.org/r/766102

Change 766102 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp5005 as cache::haproxy_upload

https://gerrit.wikimedia.org/r/766102

Mentioned in SAL (#wikimedia-operations) [2022-02-25T15:25:19Z] <vgutierrez> pool cp5005 running HAProxy as TLS termination layer - T290005 T271421

Change 766119 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp3063 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766119

Change 766119 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp3063 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766119

Mentioned in SAL (#wikimedia-operations) [2022-02-25T16:35:06Z] <vgutierrez> pool cp3063 running HAProxy as TLS termination layer - T290005 T271421

Change 766586 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp1088 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766586

Change 766586 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp1088 as cache::upload_haproxy

https://gerrit.wikimedia.org/r/766586

Mentioned in SAL (#wikimedia-operations) [2022-02-28T11:09:40Z] <vgutierrez> pool cp1088 running HAProxy as TLS termination layer - T290005 T271421

Change 766597 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp5011 as cache::text_haproxy

https://gerrit.wikimedia.org/r/766597

Change 766597 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp5011 as cache::text_haproxy

https://gerrit.wikimedia.org/r/766597

Mentioned in SAL (#wikimedia-operations) [2022-02-28T12:24:02Z] <vgutierrez> pool cp5011 running HAProxy as TLS termination layer - T290005 T271421

Change 767051 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp2039 as cache::text_haproxy

https://gerrit.wikimedia.org/r/767051

Change 767051 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp2039 as cache::text_haproxy

https://gerrit.wikimedia.org/r/767051

Mentioned in SAL (#wikimedia-operations) [2022-03-01T10:05:44Z] <vgutierrez> pool cp2039 running HAProxy as TLS termination layer - T290005 T271421

Change 767065 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] site: Reimage cp3062 as cache::text_haproxy

https://gerrit.wikimedia.org/r/767065

Change 767065 merged by Vgutierrez:

[operations/puppet@production] site: Reimage cp3062 as cache::text_haproxy

https://gerrit.wikimedia.org/r/767065

Mentioned in SAL (#wikimedia-operations) [2022-03-01T12:49:58Z] <vgutierrez> pool cp3062 running HAProxy as TLS termination layer - T290005 T271421

Mentioned in SAL (#wikimedia-operations) [2022-03-01T14:36:30Z] <vgutierrez> pool cp1087 running HAProxy as TLS termination layer - T290005 T271421

Mentioned in SAL (#wikimedia-operations) [2022-03-02T15:47:06Z] <vgutierrez> pool cp5014 running HAProxy as TLS termination layer - T290005 T271421

Mentioned in SAL (#wikimedia-operations) [2022-03-02T16:51:42Z] <vgutierrez> pool cp3061 running HAProxy as TLS termination layer - T290005 T271421