
'skip_first' feature flag for gdnsd GeoIP plugin
Closed, Resolved · Public

Description

As part of T257527: automatically collect network error reports from users' browsers (Network Error Logging API), it'd be nice to have a way to serve a DNS record that is backed by an edge datacenter other than the "usual" datacenter for any given user. This is because if a user needs to send a Network Error report, it very likely means something is wrong with their route to the usual edge DC.

This isn't a strict requirement; browsers are supposed to buffer these reports and try again later in the event that sending them immediately fails. But it would help us receive reports in a more timely fashion, which would let us react and try to fix things.

It seems likely that a small patch could add a skip_first flag at the resource level in gdnsd, which would save us from defining and maintaining multiple flavors of country-to-datacenter maps in the dns repo.

An alternative approach might be to take an entire mapping and reverse it, but that's not as straightforward given the codebase.
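To make the skip_first idea concrete, here is a minimal Python model of the selection logic, not gdnsd's actual implementation. The function name `pick_datacenter`, the datacenter names `dc-a`/`dc-b`/`dc-c`, and the health dictionary are all hypothetical. The fallback to the only entry when the list has a single datacenter is an assumption of this sketch and may not match what gdnsd does.

```python
from typing import Dict, List, Optional


def pick_datacenter(
    dclist: List[str],
    healthy: Dict[str, bool],
    skip_first: bool = False,
) -> Optional[str]:
    """Return the first healthy datacenter from an ordered preference list.

    dclist is the per-client ordered list that a geo map would produce
    (closest first). With skip_first=True the closest entry is dropped
    before selection, so the answer is the "second-best" datacenter.
    """
    # Assumption of this sketch: with only one entry, keep it rather than
    # returning nothing. Real gdnsd behaviour may differ.
    if skip_first and len(dclist) > 1:
        candidates = dclist[1:]
    else:
        candidates = dclist

    for dc in candidates:
        # A datacenter missing from the health table is treated as down.
        if healthy.get(dc, False):
            return dc
    return None


if __name__ == "__main__":
    order = ["dc-a", "dc-b", "dc-c"]  # hypothetical names, closest first
    health = {"dc-a": True, "dc-b": True, "dc-c": True}
    print(pick_datacenter(order, health))                   # dc-a
    print(pick_datacenter(order, health, skip_first=True))  # dc-b
```

The point of putting the flag on the resource rather than on the map is visible here: the same ordered list serves both the normal name and the skip_first name, so only one country-to-datacenter map has to be maintained.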

Event Timeline

Restricted Application added a subscriber: Aklapper.
jijiki triaged this task as Medium priority. (Aug 27 2020, 7:53 PM)

I just had an alternate idea, which wouldn't require any change to gdnsd.

The Reporting API allows you to specify a whole group of endpoint URLs. Successful delivery to any one of them counts as delivery to the whole group, while a failed delivery to one endpoint in the group is retried on the other endpoints*.

*: I think. At least to my quick reading, the spec doesn't cleanly differentiate between endpoints and endpoint groups in the key sections.

So as an alternative, we could export a per-datacenter domain name for each DC and make them all part of one endpoint group. We could even do this in combination with providing the special skip_first geoDNS name; that might help us catch certain kinds of failure scenarios. (And since no retries are needed when the skip_first name does work, it would also get us faster delivery in most cases.)
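A rough sketch of what such an endpoint group could look like, assuming the `Report-To` / `NEL` response header format from the Reporting API and Network Error Logging drafts. The helper name `build_reporting_headers`, the group name, and every URL below are made up for illustration. The sketch puts the skip_first geoDNS name at priority 1 and the per-datacenter names at priority 2, on the understanding that lower priority values are tried first and higher ones act as failover.

```python
import json
from typing import Dict, List


def build_reporting_headers(
    group: str,
    primary_url: str,
    fallback_urls: List[str],
    max_age: int,
) -> Dict[str, str]:
    """Build Report-To and NEL header values for one endpoint group."""
    # Priority 1: the geoDNS name that resolves to the second-best DC.
    endpoints = [{"url": primary_url, "priority": 1}]
    # Priority 2: per-datacenter names, used only if priority 1 fails.
    endpoints += [{"url": url, "priority": 2} for url in fallback_urls]

    report_to = {"group": group, "max_age": max_age, "endpoints": endpoints}
    # The NEL policy refers to the endpoint group by name.
    nel = {"report_to": group, "max_age": max_age}
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


if __name__ == "__main__":
    headers = build_reporting_headers(
        group="network-errors",  # hypothetical group name
        primary_url="https://intake-next.example.org/report",  # hypothetical
        fallback_urls=[
            "https://intake-dc-a.example.org/report",  # hypothetical
            "https://intake-dc-b.example.org/report",  # hypothetical
        ],
        max_age=86400,
    )
    for name, value in headers.items():
        print(f"{name}: {value}")
```

Whether browsers actually fail over between endpoints in the way the footnote above hopes is the open question; the header structure itself does not settle it.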

This is implemented "upstream" in https://github.com/gdnsd/gdnsd/commit/b17bb0b073b4a9c6e2a65d2ddee2e5bc39f1b717, which is released with v3.3.0, so we're over halfway there. Now to figure out our local Debian packaging issues with it again :)

Mentioned in SAL (#wikimedia-operations) [2020-09-10T16:04:24Z] <bblack> reprepro: uploaded gdnsd-3.3.0-1~wmf1 - T261340

Change 626656 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] geo-resources: create text-next for NEL

https://gerrit.wikimedia.org/r/626656

Change 628935 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] point intake-logging.wikimedia.org to text-next (second-best DC)

https://gerrit.wikimedia.org/r/628935

Change 626656 merged by BBlack:
[operations/dns@master] geo-resources: create text-next for NEL

https://gerrit.wikimedia.org/r/626656

Change 628935 merged by CDanis:
[operations/dns@master] point intake-logging.wikimedia.org to text-next (second-best DC)

https://gerrit.wikimedia.org/r/628935

CDanis claimed this task.

Deployed at 13:20 UTC. The original TTL of the intake-logging CNAME was 1 day, so it will take that long for all clients to migrate.