Script to point SRE local machine traffic to another LB
Closed, ResolvedPublic

Description

Action item from an incident.

If a specific site/LB is having an outage and depooling the site is not an option, it would be useful to have a script that points resources behind an LB to a different site/LB.
E.g. an SRE in Europe needs to be able to reach Grafana through eqiad if esams' LB is having issues.

One option, for example, is to add static records to /etc/hosts.

Event Timeline

ayounsi created this task.

+1 to /etc/hosts, I've done similar in the past and it has worked as expected. As a side note, the script could even take the form of a puppet manifest that we can then puppet apply locally.

If we go the /etc/hosts route, it seems to me that a quick script should do it. It's sufficient to pass it a parameter with the DC name and then have the script resolve text-lb.$DC.wikimedia.org and use it as the IP for a predefined static list of services behind the CDN that we're interested in during an outage. The same parameter could have a special value like reset to clear/comment out the same records.

I'm saying a statically curated list because we just need a few of them and not all the 350+ records defined in the DNS repo.

My 2 cents :)

+1, I was also thinking simple script, static list of hostnames:

  • grafana
  • turnilo
  • logstash
  • etherpad
  • wikitech
  • phabricator
  • {www.,}mediawiki.org
  • a handful of Wikipedias for test traffic: {en,it,el,de,nl,es,fr}.wikipedia.org (covers most languages on our team)
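A minimal sketch of the idea discussed so far: resolve text-lb.$DC.wikimedia.org once and reuse that IP for every hostname in the curated list. The fully qualified names (e.g. grafana.wikimedia.org) and the function names are assumptions made for illustration; the thread only gives short names. The resolver is injectable so the logic can be tried without touching DNS.

```python
import socket

# Assumption: each short name in the list maps to <name>.wikimedia.org.
SERVICES = [
    "grafana.wikimedia.org",
    "turnilo.wikimedia.org",
    "logstash.wikimedia.org",
    "etherpad.wikimedia.org",
    "wikitech.wikimedia.org",
    "phabricator.wikimedia.org",
    "www.mediawiki.org",
    "mediawiki.org",
] + [f"{lang}.wikipedia.org" for lang in ("en", "it", "el", "de", "nl", "es", "fr")]


def lb_hostname(site, lb="text-lb"):
    """Name of the load balancer record for a site, e.g. text-lb.eqiad.wikimedia.org."""
    return f"{lb}.{site}.wikimedia.org"


def hosts_lines(site, resolve=socket.gethostbyname, services=SERVICES):
    """Return /etc/hosts-style lines pointing every service at the site's LB IP."""
    ip = resolve(lb_hostname(site))
    return [f"{ip} {name}" for name in services]
```

Calling hosts_lines("eqiad") would perform one real DNS lookup; passing a stub as resolve keeps it offline.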

(An aside: there are other tools not fronted by LVS where it'd be nice for it to be easy for SREs to route their traffic over our own cross-site transport links, instead of using peering/transit in eqiad. Partial list: gerrit, librenms, icinga. The best I can come up with for an 'easy' option is to ssh -D to a host like icinga2001.)

Each line inserted in /etc/hosts should have its own magic comment from the script, making removal trivial.
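As a rough illustration of the magic-comment idea, the sketch below tags each inserted line with a trailing marker and removes exactly those lines on reset. The marker text and function names are hypothetical; any string unlikely to appear in a hand-written /etc/hosts entry would do.

```python
# Hypothetical marker; the thread does not specify the comment text.
MARKER = "# managed-by-lb-override"


def tag(line):
    """Append the marker comment to a hosts line."""
    return f"{line}  {MARKER}"


def strip_managed(hosts_text):
    """Return hosts_text without any line that ends with the marker."""
    kept = [
        line
        for line in hosts_text.splitlines()
        if not line.rstrip().endswith(MARKER)
    ]
    return "\n".join(kept) + "\n"
```

Because only lines ending with the marker are dropped, entries that were in the file before the script ran are left untouched.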

I think there should be a reset command and an override (?) / insert (?) command, and that there should be a mandatory site argument, as well as an optional argument to use test-lb instead of text-lb.
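One possible shape for that command line, sketched with argparse. The program name, the subcommand name override, and the --lb flag are placeholders, since the naming is still open in this thread; here the site argument is mandatory for override only, which is one reading of the proposal.

```python
import argparse


def build_parser():
    """Build a CLI with 'override <site> [--lb NAME]' and 'reset' subcommands."""
    parser = argparse.ArgumentParser(prog="lb-override")  # placeholder name
    sub = parser.add_subparsers(dest="command", required=True)

    override = sub.add_parser("override", help="point the curated hostnames at a site's LB")
    override.add_argument("site", help="site to route through, e.g. eqiad")
    override.add_argument(
        "--lb",
        default="text-lb",
        help="LB record prefix to resolve instead of the default text-lb",
    )

    sub.add_parser("reset", help="remove the entries added by override")
    return parser
```

For example, build_parser().parse_args(["override", "eqiad"]) yields a namespace with site set to "eqiad" and lb left at "text-lb".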

Anyone have a good idea on a name for such a thing?

closest_pop_on_fire.sh

Maybe separate from a script, is there any way we can do this via DNS? Something like grafana.cp-ulsfo.wikimedia.org to specify a site rather than accepting geoip routing.

@CDanis points out this also requires some work at the caching layer to rewrite the Host header, so that each service doesn't have to be configured for five new domains. But the advantages are that you don't need anything installed ahead of time, and you don't need to remember to reset anything when you're done.
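To make the Host header rewrite concrete, here is a small sketch of the mapping the caching layer would need, assuming the naming pattern from the grafana.cp-ulsfo.wikimedia.org example. It is plain Python for illustration only, not configuration for any actual cache software, and the function name is hypothetical.

```python
import re

# Assumed pattern: <service>.cp-<site>.wikimedia.org, as in the example above.
_SITE_HOST = re.compile(
    r"^(?P<service>[a-z0-9-]+)\.cp-(?P<site>[a-z0-9]+)\.wikimedia\.org$"
)


def split_site_host(host):
    """Map a site-pinned hostname to (canonical Host header, site).

    Hostnames that do not match the pattern are returned unchanged
    with a site of None.
    """
    match = _SITE_HOST.match(host.lower())
    if match is None:
        return host, None
    return f"{match['service']}.wikimedia.org", match["site"]
```

With this mapping, a request for grafana.cp-ulsfo.wikimedia.org would reach the backend with Host set to grafana.wikimedia.org, so the service itself needs no knowledge of the per-site names.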

For a number of years now, work has been proceeding in order to bring to perfection the crudely-conceived idea of a machine that would not only supply the easy re-routing of traffic for load-balanced services, but would also be capable of automatically synchronizing single-homed LibreNMSes and icingas. Such an instrument is the tunnel-encabulator.

Now, basically, the only new principle involved is that instead of hostnames being resolved by the relative motion of recursive and authoritative nameservers, they are resolved instead by the modial interaction of gethostbyname and /etc/hosts.

The tunnelencabulator has now reached a high level of development, and is being used successfully in the operation of wikitrunnions. Moreover, whenever a forescent skor motion is required towards non-CDN'd services, it may also be employed in conjunction with a loopback interface reciprocation dingle arm.