
Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures
Closed, Resolved. Public. 1 Estimated Story Points

Description

From @faidon on IRC:

Firefox is probably doing SPDY coalescing, and is buggy in that it tries the lookup over the existing IPv6 connection despite geoiplookup.wikimedia.org having no AAAA record. To reproduce, you first have to open a SPDY session, e.g. with en.wikipedia.org. Starting a browser and opening geoiplookup.wikimedia.org directly works, but opening the main site first and then quickly opening a tab with geoiplookup fails (it returns Geo { IPv6:true })
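
For illustration only, here is a minimal Python sketch of the DNS half of the check a client would be expected to make before reusing an existing connection for a second hostname: the connection's remote IP should be among the addresses the second hostname resolves to. The function names and the injectable `resolver` parameter are hypothetical, the certificate check is left out, and this is not Firefox's actual logic. The hostnames from this task may also no longer resolve the way they did in 2015.

```python
import ipaddress
import socket


def resolved_addresses(host, resolver=socket.getaddrinfo):
    """Return the set of IP addresses (A and AAAA) that `host` resolves to."""
    try:
        results = resolver(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Treat a failed lookup as "no addresses".
        return set()
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    return {ipaddress.ip_address(sockaddr[0]) for *_, sockaddr in results}


def may_reuse_connection(connection_ip, other_host, resolver=socket.getaddrinfo):
    """True only if `other_host` resolves to the IP the connection is using.

    A host with no AAAA record can never match an IPv6 connection.
    """
    return ipaddress.ip_address(connection_ip) in resolved_addresses(other_host, resolver)
```

Under this check, an IPv6 connection opened for en.wikipedia.org would not be reused for a hostname that only has A records, which is the behavior the description says Firefox fails to honor.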

Event Timeline

AndyRussG raised the priority of this task from to Needs Triage.
AndyRussG updated the task description.
AndyRussG added subscribers: AndyRussG, faidon, Ejegg and 2 others.
faidon triaged this task as High priority. Dec 18 2015, 10:47 PM
faidon updated the task description.
faidon added projects: SRE, Traffic.
faidon added a subscriber: BBlack.

This looks like a Firefox bug and it seems to affect fundraising right now. It doesn't appear to be a huge problem so it's not a UBN issue. Numbers:

faidon@oxygen:~$ wc -l ~/geoiplookup-20151218
4541 /home/faidon/geoiplookup-20151218
faidon@oxygen:~$ jq '.ip'  ~/geoiplookup-20151218 | grep -c :
5
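
The same count can be sketched in Python. This assumes, as the `jq '.ip'` call suggests, that the log holds one JSON object per line with an `ip` field; the function name is hypothetical. Unlike `grep -c :`, it parses each address instead of matching on a colon, and unlike `wc -l` it skips blank lines.

```python
import ipaddress
import json


def count_ipv6(lines, field="ip"):
    """Return (total_records, ipv6_records) for an iterable of JSON lines."""
    total = v6 = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        try:
            addr = ipaddress.ip_address(json.loads(line)[field])
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing field, or unparsable address:
            # counted in the total but not as IPv6.
            continue
        if addr.version == 6:
            v6 += 1
    return total, v6
```

With the figures above, 5 out of 4541 is about 0.11% of requests.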

The only fix I see for now (besides the more risky T99226) is to move geoiplookup.wikimedia.org away from text-lb's IPv4 and into an entirely new service IP (say, geoiplookup-lb) across all four sites.

This wouldn't be all that difficult to do (a few steps, but none are very risky, if we add the IP to text's set in LVS + caches). The downside is that for the IPv4 majority of clients, we might have a perf drop from the lack of SPDY coalescing for geoiplookup.wm.o fetches. Given that only about 0.1% of geoiplookup requests appear to be affected, this probably isn't worth it, since this will be fixed in the long run by FF's bugfix and/or our move to libmaxminddb.

On IPv4-only clients, geoiplookup.wm.org isn't used at all (the GeoIP cookie is, and that outperforms any separate request).

On dual-stack clients, if SPDY coalescing is used for geoiplookup.wm.org (which is a bug, present in Firefox but not Chrome), we don't get the intended behavior: there is no geolocation at all. IOW, there is absolutely no benefit in having a working SPDY coalescing for this endpoint -- it actually hurts us.

While this is definitely a real bug, it appears that its effect was greatly exaggerated, as the JavaScript code that falls back to geoiplookup.wikimedia.org for IPv6 users was broken (cf. T121938 and the corresponding fix, 41f18414c3b82457640b6a6f0a2f3f146ee7315c).

That would explain the very low numbers above. The numbers above should be discarded for the purposes of analyzing the impact of this bug and re-run after a full 24hr period has passed since the above commit.

Some stats for a full day post-fallback-fix:

15,539,986 hits to geoiplookup
800,141 (5.1%) from v6 client IP addresses

Firefox was 1.25M of those hits, 62% of which were v6
Firefox mobile was another 57k with 33% v6.

Looks worth fixing if it's not a big risk.
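
As a rough cross-check of the stats above, this short Python sketch estimates how much of the v6 traffic Firefox accounts for. The inputs are the rounded figures quoted in this comment, so the result is only approximate.

```python
# Figures quoted above (rounded in the original comment).
hits_v6 = 800_141                    # geoiplookup hits from v6 client IPs
firefox_v6 = 1_250_000 * 0.62        # Firefox desktop hits that were v6
firefox_mobile_v6 = 57_000 * 0.33    # Firefox mobile hits that were v6

# Fraction of all v6 hits attributable to Firefox (desktop + mobile).
firefox_share_of_v6 = (firefox_v6 + firefox_mobile_v6) / hits_v6
print(f"Firefox share of v6 hits: {firefox_share_of_v6:.1%}")
```

By this estimate, nearly all of the v6 hits come from Firefox.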

Just to complete the numbers above: only 2 out of 743 1:1000 sampled requests hitting geoiplookup over IPv6 are not Firefox and these may even be bogus/spoofed UAs. This is definitely limited to Firefox, and as such I've filed this as a bug in Firefox's bug tracker (BZ #1235068).

Furthermore, to clarify the numbers, this may be affecting 5.1% of all the requests to geoiplookup, but geoiplookup is a fallback mechanism already, and requests to geoiplookup are only made by dual-stacked clients in the first place. These are less than 10% of all clients on average globally and ~23% in the US according to Google (which has usually matched our own statistics).

In other words, this bug affects about 0.5-1% of all clients.
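
The estimate above appears to come from multiplying the affected share of geoiplookup requests by the share of clients that are dual-stacked. A small Python sketch of that arithmetic follows. The function name is hypothetical, and the 10% figure is used as an upper bound since the comment says "less than 10%".

```python
hits_total = 15_539_986   # geoiplookup hits in the full-day sample
hits_v6 = 800_141         # of which from v6 client IP addresses
v6_share = hits_v6 / hits_total   # about 5.1%


def affected_fraction(v6_share, dual_stack_share):
    """Share of all clients affected, assuming only dual-stacked clients
    ever reach geoiplookup and v6_share of their requests are affected."""
    return v6_share * dual_stack_share


global_estimate = affected_fraction(v6_share, 0.10)   # "<10%" globally
us_estimate = affected_fraction(v6_share, 0.23)       # "~23%" in the US
print(f"global: {global_estimate:.2%}, US: {us_estimate:.2%}")
```

This lands at roughly 0.5% globally and a little over 1% for the US, consistent with the range given above.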

The change to a separate IP isn't super complicated, but it's definitely above my threshold for this particular day and the holiday season, especially considering the low overall impact this has and the fact that dual-stacked clients (this ~10-23%) had been broken from August until last week because of T121938.

Finally, changing the IP would probably be a very short-term fix, as we've been essentially waiting for the end of the fundraiser to deploy GeoIP2 and IPv6 GeoIP lookups (T99226), which in turn would allow us to entirely deprecate geoiplookup.wikimedia.org (T100902). I'm not sure if it's even worth spending our time on an almost dead service this close to the end of the fundraising period.

faidon renamed this task from Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users to Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures. Dec 26 2015, 9:37 PM

Change 264111 had a related patch set uploaded (by BBlack):
Add new geoiplookup IPs to DNS, start using them

https://gerrit.wikimedia.org/r/264111

Change 264112 had a related patch set uploaded (by BBlack):
text LVS: add new IPv4-only for geoiplookup

https://gerrit.wikimedia.org/r/264112

Change 264112 merged by BBlack:
text LVS: add new IPv4-only for geoiplookup

https://gerrit.wikimedia.org/r/264112

Change 264111 merged by BBlack:
Add new geoiplookup IPs to DNS, start using them

https://gerrit.wikimedia.org/r/264111

I've implemented the "separate IP" fix for now so we can get past this issue without blocking on geoip2 work. The TTLs were ~10 minutes, but some caches could take longer to expire the record. After that, Firefox shouldn't be coalescing this anymore (although I wouldn't be surprised if it doesn't take effect until FF clients restart...).

BBlack claimed this task.

Assuming this is no longer an issue; re-open if it still is!