@wkandek noticed an oddity where a client machine known to exist in Portugal was being geodns-directed to our eqiad edge ever since May 8, but had been more-correctly connecting to esams back on May 7th. This led to a bit of a rabbit-hole investigation:
The client machine is using Cloudflare's 220.127.116.11 for DNS resolution. Lookups on reflect.wikimedia.org showed the Cloudflare-owned DNS cache exit IP in question. Our live MaxMind GeoIP2-City database showed this IP as being in the US, and thus gdnsd_geoip_test confirmed an eqiad mapping for it as well. However, checking https://www.maxmind.com/en/geoip-demo showed that the data had already been corrected (upstream at MaxMind) to give the correct (Portugal) result. Our last sync from MaxMind was on May 10 (today is May 12), and we do that weekly from a cron on the puppetmasters (which then distribute the updates via agent runs to clients like the authdns servers).
I've corrected this specific case by manually executing the update process (by copying the cronjob to a cumin execution on the puppetmasters, then running the puppet agent on the authdns servers), but I think there are a couple of things worth noting and/or fixing here:
- We can infer (not that it matters much if it's wrong) that the timeline of this whole update sequence was very short: Cloudflare probably put a new IP range into live use in Portugal on the 8th, we pulled a weekly update on the 10th that didn't yet account for it, yet by the 12th MaxMind's data was updated (presumably because it was reported by Cloudflare, or other users). Our next natural update would've been on the 15th. I know that historically MaxMind has claimed they update the data roughly on a weekly basis, and maybe in this case it was a normal weekly update and we're just misaligned with their weeks? In any case, the current geoipupdate seems to be smart enough to checksum the existing databases and not re-download pointless duplicates, so we could probably run it more often on the puppetmasters.
- gdnsd didn't automatically reload the new file either. It has code to do so, but we're currently abstracting the path via symlink as /etc/gdnsd/geoip/GeoIP2-City.mmdb -> /usr/share/GeoIP/GeoIP2-City.mmdb, and apparently the file watcher code from the eventloop is predictably not smart enough to follow the symlink, so we're only effectively getting geodns updates when the daemon is reloaded for some other reason (e.g. config changes, datacenter depools, etc), when they should at least be hitting weekly (not that it would've necessarily helped in this case!).
- Run the puppetmaster geoipupdate cron daily instead of weekly, since we have live examples of faster-than-weekly real-world changes with upstream data updates, and the updater appears to handle it efficiently (confirm?).
- Fix the symlink issue so that geodns uses the fresh data in a timely fashion (easy version would be to stop using the symlink abstraction, but I suspect that has CI implications. Or we could copy the databases we care about instead of symlinking them, or we could look at this as an upstream gdnsd issue and make it check for symlinks, or something?).
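
To illustrate the symlink point and the "copy instead of symlink" option, here is a rough Python sketch, not gdnsd's actual code. It assumes the watcher compares `lstat()`-style metadata of the watched path between polls (I haven't confirmed that is exactly what gdnsd does), and the file names, the `fingerprint` and `sync_copy` helpers, and the SHA-256 comparison are all made up for the example; geoipupdate's own checksum logic may differ.

```python
import hashlib
import os
import shutil
import tempfile


def fingerprint(st):
    """Fields a stat-based watcher might compare between polls (assumption)."""
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def sha256_of(path):
    """Hash a file in chunks so large databases are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_copy(src, dst):
    """Copy src over dst only if contents differ; return True if dst changed."""
    if (os.path.isfile(dst) and not os.path.islink(dst)
            and sha256_of(src) == sha256_of(dst)):
        return False
    tmp = dst + ".tmp"
    shutil.copyfile(src, tmp)
    os.replace(tmp, dst)  # atomic rename; would also replace a symlink at dst
    return True


with tempfile.TemporaryDirectory() as d:
    source = os.path.join(d, "source.mmdb")  # stands in for the /usr/share/GeoIP file
    link = os.path.join(d, "link.mmdb")      # stands in for the /etc/gdnsd/geoip symlink
    copy = os.path.join(d, "copy.mmdb")      # the "copy instead of symlink" alternative

    with open(source, "wb") as f:
        f.write(b"old")
    os.symlink(source, link)
    first_copy = sync_copy(source, copy)

    link_before = fingerprint(os.lstat(link))
    copy_before = fingerprint(os.lstat(copy))

    # Simulate the updater replacing the database via write-then-rename.
    new_file = source + ".new"
    with open(new_file, "wb") as f:
        f.write(b"new data")
    os.replace(new_file, source)

    second_copy = sync_copy(source, copy)  # contents differ, so it copies
    third_copy = sync_copy(source, copy)   # identical now, so it is a no-op

    # The symlink's own metadata is untouched by the update; the copy's is not.
    link_changed = fingerprint(os.lstat(link)) != link_before
    copy_changed = fingerprint(os.lstat(copy)) != copy_before
```

Under those assumptions, a watcher looking at the symlink itself never sees the update, while a watcher on a regular-file copy does, and repeated syncs of unchanged data leave the copy alone so they would not trigger spurious reloads.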