LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping
Closed, Resolved · Public

Description

We've been having the occasional alert flap on the LVS HTTPS IPv6 on mobile-lb.eqiad alert, which has even caused some opsens to completely ignore it. This is something that we should fix ASAP, as a) it's highly probable it's a real problem, and b) it conditions us to ignore pages.

I realized that despite this happening often, I have never been awake and/or present when it was happening. This made me go look at my IRC logs, which show this:

--- Day changed Tue Sep 08 2015
04:06 < icinga-wm> PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
04:10 < icinga-wm> RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 0.010 second response time
--- Day changed Wed Sep 09 2015
--- Day changed Thu Sep 10 2015
--- Day changed Fri Sep 11 2015
05:07 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
05:09 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10770 bytes in 0.137 second response time
--- Day changed Sat Sep 12 2015
--- Day changed Sun Sep 13 2015
--- Day changed Mon Sep 14 2015
00:08 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
00:10 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10772 bytes in 0.334 second response time
--- Day changed Tue Sep 15 2015
--- Day changed Wed Sep 16 2015
04:38 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
04:40 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 0.114 second response time
04:48 < icinga-wm> PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
04:55 < icinga-wm> RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 1.008 second response time
05:16 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
05:19 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 0.103 second response time
--- Day changed Thu Sep 17 2015
05:28 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
05:30 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 1.079 second response time
05:50 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
05:51 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 0.079 second response time
--- Day changed Fri Sep 18 2015
04:45 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
04:46 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10514 bytes in 1.105 second response time
06:04 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
06:06 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10514 bytes in 0.096 second response time
--- Day changed Sat Sep 19 2015
04:02 < icinga-wm> PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
04:04 < icinga-wm> RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10512 bytes in 0.121 second response time
04:19 < icinga-wm> PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
04:20 < icinga-wm> RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 503 bytes in 1.003 second response time
04:42 < icinga-wm> PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out

(hours are UTC+3)

Apparently there is some correlation with the time of day; this could be related to traffic levels or some other periodic task (IPsec session renewal?).

Event Timeline

faidon raised the priority of this task from to Unbreak Now!.
faidon updated the task description. (Show Details)
faidon added projects: acl*sre-team, Traffic.
faidon subscribed.

FWIW, I think this pre-dated IPsec and probably isn't related to it. In earlier investigations it looked like a monitoring failure of some kind rather than a real outage. A later theory was that it was real and related to the issues with short RA TTLs for the IPv6 default gateways, but it has persisted since fixing that as well...

This has always hit the text and upload LBs as well (again, IPv6, in eqiad), but they're usually less likely than mobile to reach 3/3 and actually send an alert. I've downtimed all three temporarily just to avoid the excess paging, and we can look at this in more detail Monday.

So, this evening the same flaps hit (as they do most evenings), but they hit codfw ipv6 service IPs rather than eqiad. Both DCs have been active and monitored all along, so this means the alerts follow the large-scale traffic more than anything else, and they don't seem to be eqiad-specific. They've never hit esams or ulsfo before, though.

Digging around a bit and thinking, I stumbled on a new theory: this might be because of the small fixed default size of /proc/sys/net/ipv6/route/max_size. It's dynamically adjustable with the normal sysctl, but it always defaults to a fixed 4096, whereas the IPv4 equivalent ends up being 2147483647 (2^31-1) and is supposedly based on system memory constraints and such. The big cache clusters (upload, mobile, text) in codfw were all showing values just under 4K for wc -l /proc/net/ipv6_route this evening, and presumably the same holds in eqiad when it has full traffic. For whatever reason, the same tables in ulsfo and esams are much smaller and nowhere near the 4K value (ulsfo closer to 2K-ish, esams only a few hundred), which would explain why this doesn't hit those DCs.
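
For reference, the check above amounts to comparing the cap against the live table size; both are standard sysctl/procfs reads (nothing Wikimedia-specific assumed):

    # current cap on the IPv6 routing table / route cache
    sysctl net.ipv6.route.max_size
    # number of entries currently in the IPv6 route table
    wc -l /proc/net/ipv6_route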

I've manually adjusted the sysctl on all of the cp* machines to 131072 via salt for now, and I've been tailing the icinga log since then (nearly an hour now) and haven't seen any 1/3 soft-fails for v6 service endpoints, or the usual v6 IPsec flaps either. Will leave the tail running overnight. Once the max_size was lifted, the codfw wc -l /proc/net/ipv6_route values jumped from just under 4K to the 5-7K range on those machines.
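
The manual adjustment itself was just a plain sysctl pushed out over salt, roughly along these lines (the exact salt target expression is an assumption; the key and value are as described above):

    # one-off, non-persistent bump on the cache hosts
    salt 'cp*' cmd.run 'sysctl -w net.ipv6.route.max_size=131072'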

Will continue to observe, and tomorrow will dig deeper into exactly how that table works and what a sane way to size it would be, and also into why esams seems so much less affected when it has at least as much total traffic (is v6 adoption really that much lower in the EU than in the US? there are other possible factors too, such as more cache boxes per cluster).

Update: log tail caught a few 1/3 soft fails on ipsec ipv6 still, but those could be due to legitimate timing and/or packet loss. The rate of them is much lower than before (more than an order of magnitude), and no soft-fails on ipv6 service IPs yet. Will keep watching through the next expected flap window.

That's a very good find and catch. A few years ago we were having frequent issues with the IPv4 route cache filling up, in particular in cases of traffic surges. We had to work around these until upstream Linux ditched the IPv4 route cache entirely.

This is the knob for the IPv6 routing table, but IIRC it's also the same for the route cache table. 4096 would definitely not be enough and would cause this effect. Ironically, our switch to HTTPS, for which we use the sh (source hashing) scheduler, probably delayed the occurrence of this issue considerably!

IPv6 connectivity is on the rise (exponentially, even), which could explain the sudden occurrence of this issue. It's currently at ~8.5% globally and ~21.5% in the US, according to Google's IPv6 statistics, the most complete analysis on the subject. The iOS 9 release probably also gave a boost to the number of IPv6 users globally, because of a changed behavior in the OS.

To answer your other question on US vs. EU IPv6: as you can see from the Google stats above, Europe's IPv6 connectivity varies greatly from country to country due to different operators, but it's clear that on average it cannot top the US numbers.

Moreover, I ran a quick log analysis over our own (sampled 1:1000) logs for two dates, yesterday and Aug 8th. The findings are:

  • IPv6 globally is at 8.45%, was 7.64% back in August.
  • 46.49% of our (IPv4/IPv6) traffic hits esams (42.85% in August); ulsfo is at 20.56% (14.11%), eqiad+codfw is at 32.95% (43.04% in August). We've reshuffled half of the US traffic during this time period, so the difference between Aug/Sep makes sense.
  • 11.47% (11.87%) of all requests ending up at eqiad+codfw are IPv6. For ulsfo it's 10.6% (3.79%) and for esams it's 5.35% (4.67%).
  • IPv6 requests hitting eqiad+codfw+ulsfo are 70.55% of all IPv6 requests (73.79%).
  • IPv6 requests hitting just eqiad+codfw are now 44.70% (66.78%) of all IPv6 requests, i.e. ulsfo handles 25.85% (7.01%) of all IPv6 requests.

All of the above seem to support that a) eqiad/codfw get a lot more IPv6 traffic, both proportionally and in absolute numbers, than esams/ulsfo (without factoring in number of servers/requests per server), and b) there is a strong correlation with the US-West move to ulsfo earlier this month changing the IPv6 ratio for ulsfo, which in turn suggests that the percentage of IPv6 users in the US is much higher than in Asia (or the rest of the world, for that matter).

Change 242122 had a related patch set uploaded (by BBlack):
Bump v6 route max_size to 131072 for all

https://gerrit.wikimedia.org/r/242122

Change 242122 merged by Faidon Liambotis:
Bump IPv6 route max_size to 131072 for all

https://gerrit.wikimedia.org/r/242122
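
The merged change persists the same setting across the fleet; the actual Puppet change is in the Gerrit link above, but in generic sysctl.d terms it amounts to something like this (the file name is illustrative):

    # /etc/sysctl.d/60-ipv6-route-max-size.conf (illustrative name)
    net.ipv6.route.max_size = 131072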

FWIW, I did some additional searching and ended up... in Facebook's IPv6 work, which explains the symptoms we're seeing well. Their work has been merged into Linux 4.2-rc1 from what I can see, so things should be considerably improved there. In the meantime, the max_size increase that was just pushed should alleviate them at our scale.

Let's monitor this for another 24 hours and resolve.

BBlack claimed this task.

No flap paged in the usual timeframe last night. icinga logs are clear of the usual raft of 1/3 soft fails too. We still get some ipsec ipv6 flaps, but again at a much lower rate than before, so the remaining ones are probably from something unrelated to this.