
Consider migrating our Elastic TLS termination from nginx to envoy
Closed, Declined (Public)

Description

Our Elastic Puppet roles are the only set of roles besides maps that still use nginx to terminate TLS, as illustrated by this CR. As seen in T360439 and other tasks, this is slowly building up technical debt, since almost no one else uses the same Puppet code for TLS termination. Moving to envoy as the TLS terminator will align us with the rest of WMF.

Creating this ticket to:

  • Discuss proposed changes with stakeholders
  • Assess our current nginx config and see what can be migrated to envoy (besides TLS, it may also insert headers, perform rewrites, etc.)
  • Migrate our TLS termination to envoy.
  • If possible, completely remove nginx in favor of envoy.

Event Timeline

Gehel triaged this task as Medium priority. Jul 1 2024, 6:31 PM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

I reviewed the current Elastic nginx config and, as @Gehel said last week, there is nothing Elastic-specific in it: nginx is purely terminating TLS.

As such, it should be possible to treat the envoy TLS terminator (and its puppet role) as a drop-in replacement. We'll start writing the puppet code and testing this out in relforge.

Change #1052819 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: test envoy TLS terminator in relforge

https://gerrit.wikimedia.org/r/1052819

bking changed the task status from Open to In Progress. Jul 9 2024, 12:58 PM
bking claimed this task.
bking lowered the priority of this task from Medium to Low.
bking updated Other Assignee, added: RKemper.
bking raised the priority of this task from Low to Medium. Jul 9 2024, 1:04 PM
bking updated the task description.

Change #1052819 merged by Bking:

[operations/puppet@production] elastic: test envoy TLS terminator in relforge

https://gerrit.wikimedia.org/r/1052819

Mentioned in SAL (#wikimedia-operations) [2024-07-09T17:03:40Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950

Mentioned in SAL (#wikimedia-operations) [2024-07-09T17:03:57Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950

Change #1053041 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: test envoyproxy

https://gerrit.wikimedia.org/r/1053041

Mentioned in SAL (#wikimedia-operations) [2024-07-11T13:03:50Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950

Mentioned in SAL (#wikimedia-operations) [2024-07-11T13:04:07Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950

Per conversation with @EBernhardson, we need to verify that Envoy doesn't time out on or reject large payloads, as data sent to the Elastic bulk API can be around 100MB in a single request.
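One way to run that check could look like the Python sketch below. It builds a synthetic NDJSON body of a chosen size for the `_bulk` endpoint and posts it through the TLS terminator. The index name `envoy_payload_test`, the function names, and the timeout are illustrative assumptions and are not taken from our Puppet code or from the patches on this task.

```python
import json
import urllib.request


def build_bulk_payload(target_bytes: int,
                       index: str = "envoy_payload_test",  # hypothetical index name
                       doc_bytes: int = 1024) -> bytes:
    """Return an NDJSON _bulk body that is at least target_bytes long."""
    filler = "x" * doc_bytes
    chunks = []
    size = 0
    doc_id = 0
    while size < target_bytes:
        # Each bulk item is an action line followed by a source line.
        action = json.dumps({"index": {"_index": index, "_id": str(doc_id)}})
        source = json.dumps({"filler": filler})
        chunk = action + "\n" + source + "\n"
        chunks.append(chunk)
        size += len(chunk)  # content is pure ASCII, so characters == bytes
        doc_id += 1
    return "".join(chunks).encode("ascii")


def send_bulk(url: str, payload: bytes, timeout_s: float = 300.0) -> int:
    """POST the payload and return the HTTP status code.

    A proxy-side rejection (for example 413) surfaces as
    urllib.error.HTTPError, and a timeout as a socket timeout or URLError.
    """
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Calling `send_bulk(url, build_bulk_payload(100 * 1024 * 1024))` against a relforge test endpoint, once through nginx and once through envoy, would show whether either proxy rejects the request or times out. The URL is deliberately left to the caller, and the test index should be deleted afterwards.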

Change #1053041 merged by Bking:

[operations/puppet@production] relforge: test envoyproxy

https://gerrit.wikimedia.org/r/1053041

Change #1053751 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: test envoyproxy with multiple instances

https://gerrit.wikimedia.org/r/1053751

Change #1053751 merged by Bking:

[operations/puppet@production] relforge: test envoyproxy

https://gerrit.wikimedia.org/r/1053751

Change #1053789 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: Attempt to use envoyproxy instead of nginx for TLS

https://gerrit.wikimedia.org/r/1053789

Change #1054578 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: remove non-functional TLS termination changes

https://gerrit.wikimedia.org/r/1054578

Change #1054578 merged by Bking:

[operations/puppet@production] relforge: remove non-functional TLS termination changes

https://gerrit.wikimedia.org/r/1054578

Upon further review, I think this is a tough climb for a couple of reasons:

  • The defined type envoyproxy::tls_terminator doesn't generate valid envoy config when running multiple Elastic clusters on the same hosts (at least, I failed with several different approaches; see the CRs attached to this ticket). If we decide to use Envoy, we'd likely have to start with envoy.pp and work from there.
  • nginx is used to enforce read-only access for Cloudelastic via the nginx tlsproxy module, so we'd have to find or implement something similar in Envoy.
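To make the second point concrete, here is a minimal Python sketch of the kind of decision a read-only front end has to make for each request. It is an assumption for illustration only: the allowed methods and the list of POST endpoints treated as reads are hypothetical and do not reflect the actual rules in the nginx tlsproxy module.

```python
# Hypothetical policy: these sets are assumptions, not the real nginx rules.
READ_METHODS = {"GET", "HEAD"}
READ_POST_ENDPOINTS = {"_search", "_msearch", "_count", "_mget"}


def is_read_only(method: str, path: str) -> bool:
    """Return True if the request should pass a read-only proxy."""
    method = method.upper()
    if method in READ_METHODS:
        return True
    if method == "POST":
        # Drop the query string, then look at the last path segment.
        path_only = path.split("?", 1)[0]
        last_segment = path_only.rstrip("/").rsplit("/", 1)[-1]
        return last_segment in READ_POST_ENDPOINTS
    # PUT, DELETE, PATCH and anything else are treated as writes.
    return False
```

The sketch is deliberately conservative: any POST whose final path segment is not on the allow list is rejected, so some legitimate read requests would need to be added before a rule like this could be used. An Envoy equivalent would have to express the same decision in its own routing or filter configuration.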

Despite this, I'm going to book some time with @Gehel to discuss this before we completely give up on the idea.

As we're already planning on migrating to OpenSearch, and OpenSearch supports TLS connections natively, I think we should forget about this migration for now and address it as part of the OpenSearch migration.