
Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two
Closed, Resolved · Public

Description

There's a fair amount of traffic that crosses via redirects, plus login/meta fetches to desktop from mobile, etc. With the merge of the two caches, the only blocker for this is ensuring it's OK with Zero. I don't think they'll take issue, as I believe partners are only paying attention to the whitelist block difference for text-vs-multimedia, not desktop-vs-mobile, but it's best to check/coordinate first.
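As background for the coalescing goal in the title: an HTTP/2 client may reuse one connection for a second hostname only if, among other conditions, both names resolve to a shared IP and the server certificate is valid for both. A minimal sketch of the IP-overlap part of that check follows. The function names are illustrative, not part of any Wikimedia tooling, and the sketch does not check the certificate.

```python
import socket


def resolved_ips(hostname, port=443):
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def shares_ip(ips_a, ips_b):
    """True if the two address sets overlap (one precondition for coalescing)."""
    return bool(set(ips_a) & set(ips_b))
```

For example, `shares_ip(resolved_ips("en.wikipedia.org"), resolved_ips("en.m.wikipedia.org"))` would compare a desktop and an m-dot hostname from whatever resolver the script runs against.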

Event Timeline

BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description.
BBlack added projects: Traffic, Zero.
BBlack subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
BBlack triaged this task as Low priority. Feb 9 2016, 9:00 PM

Updates:

  1. We're still trying to get to the bottom of historical and present mysteries about Zero-rated whitelist subnets, which holds up a decision on whether it's OK to move the m-dot and/or zero-dot hostnames to the text IP.
  2. Over in T125979 we're experimenting with disabling SPDY altogether on the text caches for the foreseeable future, because the heavy performance loss on slow devices/networks may not be worth the modest gains on fast ones. This is all interrelated with the fact that our primary HTML output is heavy (article content not split to a separate fetch from the page/UI bits) and our main CSS isn't inlined, as discussed in T125208. If we end up sticking with the SPDY-disable on cache_text, there's less reason to worry about this IP change in the first place (although it would still be nice to get it done just to clean up unnecessary IPs and LVS services, and prepare for future SPDY and/or H/2).
BBlack raised the priority of this task from Low to Medium. Apr 6 2016, 1:33 PM
BBlack added subscribers: DFoy, Yurik, dr0ptp4kt.

We didn't end up keeping SPDY disabled, and HTTP/2 is coming. From our end, this is a relatively simple change now, but there are still open questions about the effect on Zero which we need help resolving. Past email threads petered out with no common understanding between ops + zero on how the IP blocks work today...

The Zero picture is clearer now from some email threads with @DFoy and @dr0ptp4kt. We're clear for this change on the Zero front already, just not the multimedia one in T116132.

Change 283364 had a related patch set uploaded (by BBlack):
Switch mobile hostnames to text IP

https://gerrit.wikimedia.org/r/283364

https://gerrit.wikimedia.org/r/283364 above does the functional user-facing change. If it's successful without issue, there will eventually be a number of followup commits to clean up the leftover bits of the mobile addrs and eventually decom them from use completely at the DNS/LVS/etc levels, after we've confirmed traffic dropoff on the old IPs down to an acceptable level.

Holding on merging the above until after the codfw-switchover week, so as not to create too many overlapping effects when comparing graphs and such.

Change 283364 merged by BBlack:
Switch mobile hostnames to text IP

https://gerrit.wikimedia.org/r/283364

Change 285227 had a related patch set uploaded (by BBlack):
mobile IP DNS decom

https://gerrit.wikimedia.org/r/285227

Change 285229 had a related patch set uploaded (by BBlack):
decom mobile IPs from LVS/caches

https://gerrit.wikimedia.org/r/285229

This was merged around 2016-04-25 18:40 UTC, and legit caches that honor TTLs correctly should have all stopped handing out the old IPs by ~19:00.
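The timing above can be checked with simple arithmetic: a cache that honors TTLs can keep serving the old record for at most one TTL after the change. A minimal sketch, assuming the 10-minute TTL mentioned later in this task and treating the approximate merge time as exact:

```python
from datetime import datetime, timedelta, timezone


def latest_cache_expiry(change_time, ttl):
    """Latest moment a TTL-honoring cache could still hand out the old record."""
    return change_time + ttl


# Merge time from the comment above (approximate); TTL is an assumption
# taken from the later "10-minute TTL" remark in this task.
merged = datetime(2016, 4, 25, 18, 40, tzinfo=timezone.utc)
expiry = latest_cache_expiry(merged, timedelta(minutes=10))
print(expiry.isoformat())  # 2016-04-25T18:50:00+00:00
```

That lands at 18:50 UTC, consistent with the rounded "~19:00" estimate.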

Next step here is auditing the trailing traffic, in case somewhere these old IPs or LB hostnames are hardcoded, or there's significantly broken DNS cache/client stuff, before we can finish decomming the IPs.

It's been 8.8 days since the 10-minute TTL expired, and the rates are low enough that we definitely don't have any kind of systemic issue with e.g. hardcoded IPs in our own apps or server-side code.

LVSes still show a handful of connections to the mobile IPs, but the count is very small. These are expected, from sources such as:

  1. One-off instances of 3rd parties hardcoding our IPs for debugging or something
  2. Broken DNS caches
  3. Random HTTP[S] hits to random IPs (scanning and probing)
  4. Probably our own healthchecks in e.g. catchpoint/watchmouse-type services?

I don't expect the rate will ever reach zero, but it's close enough to kill it, IMHO. Will look into catchpoint/watchmouse first and see if I can eliminate anything there before it alerts on us.
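The kind of audit described above can be sketched as a tally of connections per decommissioned destination IP. This is only an illustration: the log format (`<timestamp> <client_ip> <dest_ip>`, whitespace-separated) and the function name are hypothetical, and the addresses used are documentation-range placeholders, not the real mobile IPs. The actual numbers in this task came from the LVSes.

```python
from collections import Counter


def count_hits_to_old_ips(log_lines, old_ips):
    """Count log lines per destination IP, keeping only the old IPs.

    Assumes a hypothetical whitespace-separated line format:
    <timestamp> <client_ip> <dest_ip>
    """
    old = set(old_ips)
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        dest = fields[2]
        if dest in old:
            hits[dest] += 1
    return hits
```

Grouping by `client_ip` instead of `dest_ip` would be the natural next step for telling scanners apart from clients stuck behind a broken DNS cache.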

Checked watchmouse + catchpoint, didn't find any hardcoded IP refs there (but did find a bits.wm.o ref to kill in watchmouse!)

Mentioned in SAL [2016-05-04T15:34:48Z] <bblack> removing old mobile IPs from actual production config (no longer in use) - T124482

Change 285229 merged by BBlack:
decom mobile IPs from LVS/caches

https://gerrit.wikimedia.org/r/285229

Mentioned in SAL [2016-05-04T16:00:45Z] <bblack> REALLY (from active LVS) removing old mobile IPs from actual production config (no longer in use) - T124482

BBlack claimed this task.

old IPs decommed, done here