
Webproxies are a SPOF
Open, Medium, Public

Description

Reading T242602 made me realize that the web-proxies are a SPOF.

It's not possible to specify more than one proxy URI per client server, and the current FQDNs are only CNAMEs to single servers, so a server failure requires a DNS change. https://wikitech.wikimedia.org/wiki/HTTP_proxy

I'm not sure how critical those proxies are, and what level of availability is expected from them, but they look like good candidates for anycast.

It would mean simplified configuration (all hosts use webproxy.anycast.wmnet) and automatic (cross-DC) failover if a node fails, similar to rec-dns.

I haven't looked at the current webproxy Puppet stanzas, but configuration should be straightforward: https://wikitech.wikimedia.org/wiki/Anycast#How_to_deploy_a_new_service?

Event Timeline

ayounsi triaged this task as Medium priority. Jan 14 2020, 8:29 AM
ayounsi created this task.

One problem I see with this: proxy IPs regularly get banned by third-party services by accident. So having multiple *external* IPs, and being able to switch between them, is a plus.

I think you're right that the proxies are a SPOF, but I see more controlled ways we could use to make them redundant, like having more than one of them per DC, and load-balance between them.

I think increasing the availability and resilience of this service is an excellent idea! However, adding more servers per site feels like a requirement, and a standard Pybal/IPVS setup sounds much more appropriate than anycast for this use case.

More broadly: Pybal/IPVS and anycast have some overlap in the feature set that they provide to services, but given both how we use and have architected these two solutions, I think we should favor the former over the latter for the vast majority of the use cases and limit the anycast deployments only to special cases. Ideally these two offerings would eventually converge, but I feel like we're far from that right now.

+1 for improving their HA, and I agree that the LVS approach seems the saner one.
If we don't plan to do this anytime soon, though, maybe we could take an intermediate step with geodns.
We could have webproxy.discovery.wmnet resolve to the local proxy by default, but in case of maintenance or similar we could depool one and have the clients use the proxy in another DC.
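
To illustrate the geodns idea above, here is a minimal sketch of the selection logic: a client gets the proxy in its own DC unless that proxy is depooled, in which case it falls back to one in another DC. This is not how the actual DNS discovery records are implemented. The site names, hostnames, and the `resolve_webproxy` function are hypothetical placeholders for illustration only.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping of site -> proxy hostname (placeholder names, not real hosts).
PROXIES: Dict[str, str] = {
    "dc-a": "webproxy-a.example.invalid",
    "dc-b": "webproxy-b.example.invalid",
}


def resolve_webproxy(
    client_site: str,
    proxies: Dict[str, str],
    depooled: Set[str],
) -> Optional[str]:
    """Return the proxy a client in `client_site` should use.

    Prefer the local proxy; if it is depooled (or the site has none),
    fall back to the first pooled proxy in another site. Return None
    when every proxy is depooled.
    """
    if client_site in proxies and client_site not in depooled:
        return proxies[client_site]
    # Deterministic fallback order so repeated lookups agree.
    for site in sorted(proxies):
        if site not in depooled:
            return proxies[site]
    return None
```

For example, with `depooled={"dc-a"}`, a client in `dc-a` would be handed the `dc-b` proxy until `dc-a` is repooled. Note that this only covers planned depools; an unplanned failure would still need someone (or some health check) to depool the failed host.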

ayounsi renamed this task from Anycast for webproxies to Webproxies are a SPOF. Feb 18 2020, 10:43 PM