Put lists.wikimedia.org web interface behind LVS
Open, Medium, Public

Description

As discussed in T278495: Figure out plan for mailman IP situation, we should put lists.wikimedia.org's web interface behind LVS. Exim/mail is excluded since we might go a different route for that: T232343#7059925.

Currently, we get a TLS cert from acme-chief, and Apache redirects nearly all HTTP traffic to HTTPS, where we have a bunch of routing and redirects.

We probably want to end up with Apache just serving over HTTP, and envoy doing HTTPS in between Apache<-->LVS/caches.



Event Timeline

Legoktm raised the priority of this task from Low to Medium. Nov 3 2021, 3:05 AM

Recent events mean we should probably do this sooner rather than later. The one catch is that mail delivery depends on the web server being up. Because of issues serving it over localhost, Mailman currently connects to itself over https://lists.wikimedia.org/. That is not terrible while it just loops back to the same host, but we probably don't want to keep doing that once we're going through LVS.

These "issues" are T190111: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners).

Setting up envoy as a tlsproxy should be straightforward. The one thing I'm not sure about is how to have it talk to Apache over HTTP, since we currently have Apache enforcing the HTTPS redirect; see https://gerrit.wikimedia.org/g/operations/puppet/+/61f20b6b6c2478f782b53fb31ce95756441f8bdc/modules/profile/templates/lists/apache.conf.erb#7

Or should we have envoy talk to Apache over HTTPS during the migration period? Or...

@Joe and I discussed how to do this today. To recap, the goals are to work towards T278495: Figure out plan for mailman IP situation (eliminating the special-case IP) and to be behind the caching layer so we can take advantage of its facilities for rate limiting, IP/UA blocking, etc. We are also constrained by the fact that Mailman is not HA, and it would be more convenient to keep the mailman3 and mailman3-web services on the same host.

Joe said that there's not much value in going behind LVS then, and instead we could just have ATS route lists.wikimedia.org directly to lists1001 (likely Apache would keep doing TLS, but probably needs a different cert?). We'd change DNS to have A/AAAA records point to dyna, while MX would still point directly to lists1001.

@BBlack we'd like your input/feedback on this, especially the DNS parts.

@Legoktm it looks like the easiest approach would be adding lists1001 as a backend server on ATS and setting the caching policy to pass. Under this scenario, the lists.wikimedia.org TLS certificate should be a private one handled by our PKI rather than an acme-chief/LE one. After that, we should drop the A/AAAA records and just add a DYNA record like this:

lists      600 IN DYNA geoip!text-addrs

The usual approach would be adding a CNAME to dyna, but that isn't feasible here because you still need the MX records, which should point directly to lists1001.wikimedia.org rather than the current lists.wikimedia.org.
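The CNAME limitation comes from DNS itself: a name that owns a CNAME record cannot own other record types such as MX. Here is a minimal Python sketch of that rule. The `cname_conflicts` helper and the `(name, type)` tuples are hypothetical simplifications for illustration, not the operations/dns format.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return names that hold a CNAME alongside any other record type.

    `records` is an iterable of (name, rtype) tuples: a simplified,
    hypothetical model of a zone, used only for illustration.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        types_by_name[name].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# A CNAME next to an MX on the same name is flagged...
with_cname = [("lists", "CNAME"), ("lists", "MX")]
# ...whereas the DYNA record proposed above is not a CNAME, so it is not.
with_dyna = [("lists", "DYNA"), ("lists", "MX")]

print(cname_conflicts(with_cname))  # ['lists']
print(cname_conflicts(with_dyna))   # []
```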

I've read through the backlog of this task and followed T411895: gerrit behind CDN to try to figure out how I could move mailman's web interface behind our CDN. Please let me know if I've missed something.


 ATS → Envoy

Regarding the conversation in this thread on Gerrit highlighted by @Dzahn: from what I've seen, I don't think we'll face the same issue here:

arnaudb@lists1004:~ $ openssl s_client -connect localhost:8443   2> /dev/null </dev/null  | openssl x509 -noout -text | grep -i -A1 "subject alternative"
            X509v3 Subject Alternative Name: 
                DNS:lists1004.wikimedia.org, DNS:lists.wikimedia.org
arnaudb@lists2001:~ $ openssl s_client -connect localhost:8443   2> /dev/null </dev/null  | openssl x509 -noout -text | grep -i -A1 "subject alternative"
            X509v3 Subject Alternative Name: 
                DNS:lists2001.wikimedia.org, DNS:lists.wikimedia.org
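The same check can be scripted. This is a small Python sketch that parses the Subject Alternative Name line printed by openssl and confirms the service name is covered. The `parse_dns_sans` helper is hypothetical, and the input is the lists1004 output shown above.

```python
def parse_dns_sans(san_line):
    """Extract DNS names from an openssl 'Subject Alternative Name' value line."""
    names = []
    for entry in san_line.split(","):
        kind, _, value = entry.strip().partition(":")
        if kind == "DNS" and value:
            names.append(value)
    return names


# Value line copied from the lists1004 output above.
san_line = "DNS:lists1004.wikimedia.org, DNS:lists.wikimedia.org"
sans = parse_dns_sans(san_line)

print("lists.wikimedia.org" in sans)  # True
```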

The mapping has already been done via this change: operations/puppet/+/1072247 and seems to work properly, up to a certain point:

curl -k https://lists1004.wikimedia.org:8443/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />
 Instead use the HTTPS scheme to access this URL, please.<br />
</p>
</body></html>

To fix this, I'll have to set the Django setting SECURE_PROXY_SSL_HEADER in mailman so that it accepts traffic from envoy, and also ensure that envoy properly sends X-Forwarded-Proto: https.
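As a rough illustration, the Django side could look like the sketch below. `SECURE_PROXY_SSL_HEADER` is a standard Django setting. The `is_secure` helper is hypothetical and only mimics how such a setting is evaluated; it is not Django's actual code. Django's documentation warns that this setting is only safe when the proxy strips or overwrites any client-supplied X-Forwarded-Proto header.

```python
# Hypothetical excerpt for the mailman3-web Django settings file.
# Treat a request as HTTPS when the TLS-terminating proxy sets
# "X-Forwarded-Proto: https" (seen by Django as HTTP_X_FORWARDED_PROTO).
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")


def is_secure(meta, header_setting=SECURE_PROXY_SSL_HEADER):
    """Simplified illustration of how the setting is applied to request metadata."""
    header, secure_value = header_setting
    return meta.get(header) == secure_value


print(is_secure({"HTTP_X_FORWARDED_PROTO": "https"}))  # True
print(is_secure({"HTTP_X_FORWARDED_PROTO": "http"}))   # False
print(is_secure({}))                                   # False
```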

 DNS

Joe said that there's not much value in going behind LVS then, and instead we could just have ATS route lists.wikimedia.org directly to lists1001 (likely Apache would keep doing TLS, but probably needs a different cert?). We'd change DNS to have A/AAAA records point to dyna, while MX would still point directly to lists1001.

I think decoupling email and web could indeed be the way to go. From what I've been able to read from the most recent server switch, IP reputation was no trouble. I think this answers some of the concerns raised in T278495: Figure out plan for mailman IP situation.


Remaining Steps

Adapting from T411895: gerrit behind CDN, I think the remaining steps could be:

  • Adapt Django configuration to accept traffic from envoy
  • Adapt the existing record to tie lists.wm.o to the hosting server
    • Assign the new public IPs: a v4 and a v6 in each of the DC-specific public service address ranges
      • create DNS records lists-http-lb.$DC.wikimedia.org
      • Prepare for geodns with the new public IP
      • Add a lists-addrs resource to operations/dns // geo-resources
      • Update dns.admin cookbook to reflect lists-addrs
      • Update geodns schema in conftool-data/geodns/services.yaml to add lists-addrs for admin_state
  • Prepare tcpproxy VMs for accepting traffic on the new public IPs
      • Create a new conftool service for tcp-proxy, add to it the realserver VMs in each DC
      • Create two new service catalog entries both sharing those same public IPs, and using LVS class high-traffic1 - should this be the same class as Gerrit?
      • add LVS profiles to tcpproxy puppet role
    • Add lists to ATS cache_text as a backend
    • Ensure Varnish VCL includes lists.wm.o in any relevant instances of its many hostname regex patterns ...?
    • Prepare cache_text servers for accepting traffic on the new public IP (mark them as profile::lvs::realservers for the new service catalog entry lists-https)
  • Reconfigure Liberica/Katran hightraffic1 to accept traffic on the new public IP, and to route it to the appropriate-for-the-dstport realservers - should this be the same class as Gerrit?
    • First, try in only one CDN site: magru or drmrs perhaps?
      • Add lists-https to profile::liberica::include_services on the secondary Liberica host in the CDN site (e.g. lvs7003)
      • In the catalog, set both services to state: lvs_setup in that one DC
      • On the secondary Liberica host, verify happy healthchecking for both ssh and https services
      • Repeat on the primary hightraffic1 host
    • Verify functionality, soak-test for a few hours at least; then continue rollout globally. Keeping the services in lvs_setup state, repeat the above procedure, but instead adding new sites to the list in the catalog

At this point the new IP (& new data path) are accessible externally. Thus we should proceed to:

  • Opt-in SRE & developer testing
    • Write instructions and/or ship tunnelencabulator feature: modify /etc/hosts to point lists.wikimedia.org to the new, CDN-fronted public IP
    • One full business day of testing with several volunteers?

So, from this I think 2 questions are open:

  • Is hightraffic1 relevant for mailman?
  • Am I missing any step?

In the change merged back in 2024: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072247/9/hieradata/common/profile/trafficserver/backend.yaml

The map in trafficserver/backend.yaml should point to a discovery record, lists.discovery.wmnet, and not directly to the host name lists1004.

Suggesting we create that DNS record first, point it to lists1004 and update the backend.yaml as a next step.

You can remove the "Prepare tcpproxy VMs for accepting traffic on the new public IPs" and general tcpproxy part from the list above. That should only be relevant for the SSH part of Gerrit.

Whether this is "hightraffic" or not is a question to ask the traffic team. Maybe sukhe can answer that.

Change #1219061 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/dns@master] mailman: record update for lists.wm.o

https://gerrit.wikimedia.org/r/1219061

Change #1219062 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mailman: update lists.wm.o backend mapping

https://gerrit.wikimedia.org/r/1219062

You can remove the "Prepare tcpproxy VMs for accepting traffic on the new public IPs" and general tcpproxy part from the list above. That should only be relevant for the SSH part of Gerrit.

Indeed, thanks for the clarification. I've highlighted the steps that I think could be optional/irrelevant to the process in the task description.

Whether this is "hightraffic" or not is a question to ask the traffic team. Maybe sukhe can answer that.

From what I've seen in services.yaml, it looks like mailman would fit in class: low-traffic. @ssingh does that seem reasonable to you?

Change #1219151 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mailman: add lists to service catalog

https://gerrit.wikimedia.org/r/1219151

I found a couple of IPs on Netbox that don't look attached to anything despite lists.wm.o being mentioned in the comment. Unless they've been kept for historical reasons, I think they could be recycled.

Indeed, thanks, IPs reclaimed.

Change #1219770 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mailman: add UpstreamTlsContext on tlsproxy::envoy

https://gerrit.wikimedia.org/r/1219770

@CDanis explained to me that a private IP is not required to move mailman's frontend behind the CDN, so we'll skip this part for now.

I've debugged the Envoy → httpd interface. Envoy was trying to talk plaintext to httpd, so I added an UpstreamTlsContext to make it speak TLS to the upstream:

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    sni: "lists1004.wikimedia.org"
    common_tls_context:
      validation_context:
        trusted_ca:
          filename: /etc/ssl/certs/ca-certificates.crt

Which gave:

curl -k https://lists1004.wikimedia.org:8443/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://lists.wikimedia.org/postorius/lists/">here</a>.</p>
</body></html>

I've started the implementation in tlsproxy in operations/puppet/+/1219770

Change #1219770 merged by Arnaudb:

[operations/puppet@production] mailman: add UpstreamTlsContext on tlsproxy::envoy

https://gerrit.wikimedia.org/r/1219770

Change #1226247 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mailman: add lvs::realserver to puppet role

https://gerrit.wikimedia.org/r/1226247