Switch port 80 to nginx on primary clusters
Closed, Declined · Public

Description

We'll eventually want to switch the cache clusters' port 80 to being served directly by nginx (ideally with no proxy backend: just 301/403/etc. logic to force all other traffic off to HTTPS). It's more efficient, and it simplifies request routing because there's no conditional on where the traffic first enters; all initial termination of HTTP(S) happens in one piece of software.

We're not really ready to make this transition yet, as we're still allowing HTTP traffic to flow through the varnish instances in various corner cases. In the meantime, we can take a few preparatory steps to ease the transition in the future.

  • Configure varnish frontend listeners to listen on an alternate port in addition to port 80. The backends use 3128; port 3127 is available and makes sense.
  • Restart varnishes to get port 3127 working.
  • Configure the nginx HTTPS proxies to proxy traffic into port 3127 instead of port 80. This should have the side benefit of making it easier to analyze the remaining non-HTTPS traffic hitting varnish's port 80 directly (negative regexes against ReqStart or XFP are annoyingly difficult to mix with other varnishlog filters).
  • Configure vhtcpd to use 3127 as well.
  • Set up new 301/403 code in our nginx config that can terminate all HTTP traffic directly, but on a temporary alternate port such as 8080, and test/vet it (see the config sketch after this list).
  • Wait until we've resolved all the other issues and effectively killed off all non-redirect/forbidden traffic on port 80.
  • Go through a complicated two-stage process of depooling nodes one by one, switching off the varnish port 80 listener, then turning on the nginx port 80 listener.
  • Cleanup: varnish code can now assume all requests are HTTPS, so we can kill some related logic there.
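
For concreteness, here's a minimal sketch of what the two nginx pieces above might look like. This is hypothetical illustration only (the server name, certificate paths, and IPs are placeholders); the real config is puppetized in tlsproxy / r::c::ssl and will differ:

    # HTTPS terminator: proxy into the varnish frontend on 3127 instead of 80.
    server {
        listen 443 ssl;
        server_name www.wikimedia.org;              # placeholder
        ssl_certificate     /etc/ssl/example.crt;   # placeholder
        ssl_certificate_key /etc/ssl/example.key;   # placeholder
        location / {
            proxy_pass http://127.0.0.1:3127;
            proxy_set_header X-Forwarded-Proto https;
        }
    }

    # Redirect-only HTTP service, vetted on 8080 before it takes over port 80.
    # It never proxies: GET/HEAD get a 301 to HTTPS, everything else a 403.
    server {
        listen 8080 default_server;
        location / {
            if ($request_method !~ ^(GET|HEAD)$) {
                return 403;
            }
            return 301 https://$host$request_uri;
        }
    }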


Event Timeline

BBlack claimed this task.
BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description.
BBlack added projects: acl*sre-team, Traffic.
BBlack added subscribers: BBlack, faidon.
BBlack triaged this task as Medium priority. Jul 29 2015, 12:05 AM

Coincident with and related to this: there might be future plans afoot (still ill-defined) to blend the cache clusters in general (as in: text + upload on same machines, without any initial change to how things work from the user or mediawiki POV, then later coalescing the IPs for them at the nginx level and splitting on Host header).

So, the first step there should probably take that into account and assign a new set of alternate (eventually primary) ports for all of the varnish fe and be services, using unique port numbers per cluster so that they don't conflict during later work (sketched below).
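
As a rough illustration of the eventual coalescing idea, the nginx side of a Host-header split could look something like this. The port numbers and hostname pattern are hypothetical, not an agreed allocation:

    # One shared TLS endpoint, routed to per-cluster varnish frontends on
    # distinct local ports by Host header. Ports here are illustrative only.
    map $host $varnish_fe_port {
        default                    3127;   # e.g. text cluster
        ~^upload\.wikimedia\.org$  3125;   # e.g. upload cluster (hypothetical)
    }

    server {
        listen 443 ssl;
        ssl_certificate     /etc/ssl/example.crt;   # placeholder
        ssl_certificate_key /etc/ssl/example.key;   # placeholder
        location / {
            # nginx permits a variable port here with a literal upstream IP
            proxy_pass http://127.0.0.1:$varnish_fe_port;
        }
    }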

Change 294532 had a related patch set uploaded (by BBlack):
r::c::instances: frontends also listen on :3127

https://gerrit.wikimedia.org/r/294532

Change 294532 merged by BBlack:
r::c::instances: frontends also listen on :3127

https://gerrit.wikimedia.org/r/294532

BBlack set Security to None.

Mentioned in SAL [2016-06-15T19:11:44Z] <bblack> rolling restart of global varnish frontends (salt -b 1: depool -> sleep 15 -> restart -> repool) - estimated ~30 mins to completion - T107236

Change 294537 had a related patch set uploaded (by BBlack):
varnish: systemd unit varnish4.1 compat for "-a"

https://gerrit.wikimedia.org/r/294537

Change 294537 merged by BBlack:
varnish: systemd unit varnish4.1 compat for "-a"

https://gerrit.wikimedia.org/r/294537

Mentioned in SAL [2016-06-15T19:25:13Z] <bblack> rolling restart of global varnish frontends (salt -b 1: depool -> sleep 15 -> restart -> repool) - estimated ~35 mins to completion - T107236

Change 294703 had a related patch set uploaded (by BBlack):
r::c::ssl::unified: set explicit server name www.wikimedia.org

https://gerrit.wikimedia.org/r/294703

Change 294704 had a related patch set uploaded (by BBlack):
r::c::ssl: use 3127 for upstream_port

https://gerrit.wikimedia.org/r/294704

Change 294705 had a related patch set uploaded (by BBlack):
vhtcpd: use port 3127 for fe

https://gerrit.wikimedia.org/r/294705

Change 294706 had a related patch set uploaded (by BBlack):
tlsproxy: redirect-only service on 8080

https://gerrit.wikimedia.org/r/294706

Change 294703 merged by BBlack:
r::c::ssl::unified: set explicit server name www.wikimedia.org

https://gerrit.wikimedia.org/r/294703

Change 294704 merged by BBlack:
r::c::ssl: use 3127 for upstream_port

https://gerrit.wikimedia.org/r/294704

Change 294705 merged by BBlack:
vhtcpd: use port 3127 for fe

https://gerrit.wikimedia.org/r/294705

Change 294706 merged by BBlack:
tlsproxy: redirect-only service on 8080

https://gerrit.wikimedia.org/r/294706

Perhaps we should wait until Varnish 5.0, supposedly due out in a couple of months with HTTP/2.0 support, before we proceed with this change?

It will be another 6 months before we're even settled into a full Varnish 4 world. We have several major followup projects pending on that (e.g. xkey, finally sorting out TTL/grace issues, etc.), as well as other unrelated projects ongoing. Even if all of that were out of the way, we can't be confident of the V5 release timeline yet, and I'm not sure how difficult a transition Varnish 5 will be or how long it will take us once we start on it.

Even once we're past the prerequisite timeline issues, V5 probably still won't support inbound TLS well enough (if at all!) to let us drop the TLS proxy in front of it, so the bulk of the traffic will still be TCP-terminated separately in nginx or similar.

By the time we get around to seriously considering V5, we can also re-evaluate whether we want to pursue ATS instead.

So basically, on net I see V5 as too far off to plan anything around in the short to medium term, and unrelated to this anyway, since it still won't be able to terminate TLS connections on its own for us.

The upside of switching port 80 to nginx is that it removes complexity and confusion around a security-critical issue. Currently there are two paths for traffic into varnish: through the TLS terminator, or directly into varnish's port 80. The VCL that does initial frontend processing of things like HTTPS enforcement, X-Client-IP, X-Forwarded-For, etc. has to deal with two very distinct cases, and must carefully redirect rather than proxy any traffic that didn't legitimately pass through TLS termination (sketched below). With port 80 moved to nginx, and that nginx virtual server configured to only return 301/403 and never proxy, the VCL can assume all traffic is properly TLS-terminated from the get-go, and we can know more concretely that there's no logical hole allowing insecure traffic into the caches and beyond.
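
For illustration, here's a minimal VCL sketch (not our production config; the backend definition and the X-Forwarded-Proto header name are assumptions) of the dual-path conditional that this change would let us delete:

    vcl 4.0;

    backend default {
        .host = "127.0.0.1";    # placeholder
        .port = "3128";
    }

    sub vcl_recv {
        # The check that exists only because varnish still accepts direct
        # port-80 traffic. X-Forwarded-Proto is assumed to be set by the
        # nginx TLS terminator.
        if (req.http.X-Forwarded-Proto != "https") {
            # Bypassed TLS termination: redirect, never proxy.
            return (synth(301));
        }
        # Everything past this point can be treated as TLS-terminated.
    }

    sub vcl_synth {
        if (resp.status == 301) {
            set resp.http.Location = "https://" + req.http.Host + req.url;
            return (deliver);
        }
    }

With port 80 answered by nginx instead, the whole vcl_recv conditional goes away.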

@BBlack: Hi! This task was assigned to you a while ago. Do you still plan to work on it?
If this task has been resolved in the meantime, please update the task status (via Add Action...Change Status in the dropdown menu).
If this task is not resolved, and only if you no longer plan to work on it, please consider removing yourself as assignee (via Add Action...Assign / Claim in the dropdown menu). That would allow others to work on this (in theory), as otherwise people may assume someone is already working on it. Thanks! :)

We're not using nginx for this functionality anymore, and everything else related to these parts of the software stack has changed and is still evolving, so this task no longer makes much sense as it stands.