
Maxmind data update issues for DNS (and others?)
Closed, Resolved · Public

Description

@wkandek noticed an oddity where a client machine known to exist in Portugal was being geodns-directed to our eqiad edge ever since May 8, but had been more-correctly connecting to esams back on May 7. This led to a bit of a rabbit-hole investigation:

The client machine is using Cloudflare's 1.1.1.1 for DNS resolution. Lookups on reflect.wikimedia.org showed the Cloudflare-owned DNS cache exit IP in question. Our live MaxMind GeoIP2-City database showed this IP as being in the US, and thus gdnsd_geoip_test confirmed an eqiad mapping for it as well. However, checking https://www.maxmind.com/en/geoip-demo showed that the data had already been corrected (upstream at MaxMind) to give the correct (Portugal) result. Our last sync from MaxMind was on May 10 (today is May 12), and we do that weekly from a cron on the puppetmasters (who then distribute the updates via agent runs to clients like the authdns servers).

I've corrected this specific case by manually executing the update process (by copying the cronjob to a cumin execution on the puppetmasters, then running the puppet agent on the authdns), but I think there's a couple of things worth noting and/or fixing here:

  1. We can infer (not that it matters much if it's wrong) that the timeline of this whole update sequence was very short: Cloudflare probably put a new IP range into live use in Portugal on the 8th, we pulled a weekly update on the 10th that didn't yet account for it, yet by the 12th MaxMind's data was updated (presumably because it was reported by Cloudflare, or other users). Our next natural update would've been on the 17th. I know that historically MaxMind has claimed they update the data roughly on a weekly basis, and maybe in this case it was a normal weekly update and we're just misaligned with their weeks? In any case, the current geoipupdate seems to be smart enough to checksum the existing databases and not re-download pointless duplicates, so we could probably run it more often on the puppetmasters.
  2. gdnsd didn't automatically reload the new file either. It has code to do so, but we're currently abstracting the path via symlink as /etc/gdnsd/geoip/GeoIP2-City.mmdb -> /usr/share/GeoIP/GeoIP2-City.mmdb, and apparently the file watcher code from the eventloop is predictably not smart enough to follow the symlink, so we're only effectively getting geodns updates when the daemon is reloaded for some other reason (e.g. config changes, datacenter depools, etc), when they should at least be hitting weekly (not that it would've necessarily helped in this case!).

So:

  • Update the puppetmaster geoipupdate script to daily instead of weekly, since we have live examples of faster-than-weekly real-world changes with upstream data updates, and the updater appears to handle it efficiently (confirm?)
  • Fix the symlink issue so that geodns uses the fresh data in a timely fashion (easy version would be to stop using the symlink abstraction, but I suspect that has CI implications. Or we could copy the databases we care about instead of symlinking them, or we could look at this as an upstream gdnsd issue and make it check for symlinks, or something?).
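As a rough illustration of the "copy instead of symlink" option, here is a minimal Python sketch. It is not the merged Puppet change; `copy_if_changed` and `file_digest` are hypothetical names, and the SHA-256 comparison is only an assumption about how one might skip pointless copies. The copy lands in a temporary file and is then renamed into place, so a watcher that calls lstat() on the destination sees a new regular file rather than an unchanged symlink.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_if_changed(src, dst):
    """Copy src over dst only when contents differ; return True if dst was replaced.

    Hypothetical helper for illustration only.
    """
    same = (
        os.path.exists(dst)
        and not os.path.islink(dst)  # always replace a leftover symlink
        and file_digest(src) == file_digest(dst)
    )
    if same:
        return False
    # Write next to dst so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dst) or ".")
    os.close(fd)
    try:
        shutil.copy2(src, tmp)  # copies data plus mode and timestamps
        os.replace(tmp, dst)    # atomic rename; replaces a symlink at dst itself
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

Calling `copy_if_changed("/usr/share/GeoIP/GeoIP2-City.mmdb", "/etc/gdnsd/geoip/GeoIP2-City.mmdb")` after each update run would be one way to use it, assuming the caller has write access to the destination directory.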

Event Timeline

BBlack triaged this task as Medium priority. May 12 2020, 6:16 PM
BBlack created this task.
Restricted Application added a subscriber: Aklapper.

Diving a little deeper on the symlink issue:

  1. gdnsd uses libev's ev_stat watcher for this and other similar cases, as documented here: http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_stat_code_did_the_file_attri
  2. Looking at the source code for ev_stat_stat() in e.g. https://salsa.debian.org/debian/libev/-/blob/master/ev.c#L4973 , we can see that it calls lstat() rather than stat(), and thus will only pay attention to the symlink itself, not the final target file (and, perhaps, the watcher type is poorly named?).
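The lstat() versus stat() difference is easy to reproduce outside of gdnsd. The sketch below (Python on a POSIX system, with made-up file names in a temporary directory) rewrites the target of a symlink and compares what each call reports for the link path. It says nothing about libev internals beyond the lstat() behaviour described above.

```python
import os
import tempfile


def mtimes(path):
    """Return (lstat mtime, stat mtime) in nanoseconds for path."""
    return os.lstat(path).st_mtime_ns, os.stat(path).st_mtime_ns


def demo():
    """Refresh a symlink's target and report the link's mtimes before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "GeoIP2-City.mmdb")  # stand-in for the real database
        link = os.path.join(tmp, "watched.mmdb")        # stand-in for the watched symlink
        with open(target, "wb") as f:
            f.write(b"old data")
        os.symlink(target, link)
        before = mtimes(link)

        # Simulate a database refresh and push the target's mtime forward
        # explicitly, so the result does not depend on clock resolution.
        with open(target, "wb") as f:
            f.write(b"new data")
        future = os.stat(target).st_mtime + 3600
        os.utime(target, (future, future))

        after = mtimes(link)
        return before, after
```

In the pair returned by `demo()`, the lstat() value for the link is the same before and after the refresh, while the stat() value moves forward. A watcher that only polls lstat() on the link therefore never sees the update.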

We could abstract a real stat() watcher over the top of ev_stat() by chasing the symlink chain and building libev watchers for all involved paths, and that might be the best upstream solution. Probably copying the data would be simpler for now in the short term, though!
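To give a feel for the "chase the symlink chain" idea, here is a small Python sketch that lists every path a watcher would need to cover. It is not gdnsd code, and `symlink_chain` is a hypothetical name. It also only follows symlinks in the final path component and normalizes paths lexically, which is a simplification when intermediate directories are themselves symlinks.

```python
import os


def symlink_chain(path, max_hops=40):
    """Return every path visited while resolving `path`, ending at the final target.

    Hypothetical helper: each returned path would get its own watcher.
    """
    chain = [path]
    current = path
    for _ in range(max_hops):
        if not os.path.islink(current):
            return chain
        nxt = os.readlink(current)
        # Relative link targets are resolved against the link's own directory;
        # os.path.join keeps nxt unchanged when it is already absolute.
        current = os.path.normpath(os.path.join(os.path.dirname(current), nxt))
        chain.append(current)
    raise OSError("too many levels of symbolic links: " + path)
```

For the layout described in this task, the expected result would be the /etc/gdnsd/geoip path followed by the /usr/share/GeoIP path, assuming the latter is a regular file.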

"I know that historically MaxMind has claimed they update the data roughly on a weekly basis, and maybe in this case it was a normal weekly update and we're just misaligned with their weeks? In any case, the current geoipupdate seems to be smart enough to checksum the existing databases and not re-download pointless duplicates, so we could probably run it more often on the puppetmasters."

MaxMind says that "The GeoIP2 Country, City, ISP, Connection Type, and Enterprise databases are updated weekly, every Tuesday". May 10th was a Sunday, so if we update weekly on Sunday… then it sounds like there is a Tuesday -> Sunday lag in the freshness of our data.
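The lag can be checked with a few lines of date arithmetic. This Python sketch assumes releases land exactly on Tuesdays, as the MaxMind quote states. The helper names are made up for illustration.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def last_release_on_or_before(day):
    """Most recent Tuesday on or before `day` (assumed MaxMind release day)."""
    return day - timedelta(days=(day.weekday() - TUESDAY) % 7)


def lag_days(sync_day):
    """Days between the latest assumed release and the day we sync."""
    return (sync_day - last_release_on_or_before(sync_day)).days


sync = date(2020, 5, 10)                    # our last weekly sync, a Sunday
release = last_release_on_or_before(sync)   # Tuesday, May 5
print(sync.strftime("%A"), release.isoformat(), lag_days(sync))
```

Under that assumption, the data fetched on Sunday, May 10 was already five days old, and May 12 was itself a Tuesday, which fits the corrected data showing up upstream that day. With a daily fetch the same calculation gives a lag of at most about one day.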

Just FYI: my machine is being served from esams again.

I was bitten by this again today - ping!

Change 641747 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] authdns: copy geoip data rather than symlink

https://gerrit.wikimedia.org/r/641747

Change 641748 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] fetch maxmind geoip daily instead of weekly

https://gerrit.wikimedia.org/r/641748

Change 641747 merged by BBlack:
[operations/puppet@production] authdns: copy geoip data rather than symlink

https://gerrit.wikimedia.org/r/641747

Change 641748 merged by BBlack:
[operations/puppet@production] fetch maxmind geoip daily instead of weekly

https://gerrit.wikimedia.org/r/641748

BBlack claimed this task.

This should be fixed now!