Merge mobile cache into text cache
Closed, Resolved, Public

Description

[See also T89177 and T102524 - may eventually merge those in as duplicates]

The driving factors here are:

  1. We want to align the two on the various analytics and functional headers: X-Forwarded-For, X-Forwarded-By, X-CS, X-CS2, X-Analytics, etc. Those headers and X-Subdomain are the key functional differences between the two clusters' VCL code today (the other is Cookie handling).
  2. One of the key reasons to leave mobile segregated in the past was its high potential for cache fragmentation due to Vary-ing on X-CS for many different carriers in support of Zero. However, that was addressed long ago by the "X-CS: ON"-related work. Only a handful of special URLs now fully vary on a real carrier-id.
  3. Mobile is currently under-utilizing a 4-node cache cluster at each site, when its traffic could easily now be merged into the text cluster in terms of load and hot dataset size, etc. Reducing the count of distinct cache clusters with independent complex configurations is a big complexity/maintenance win at various levels, and helps to further reduce the required minimum machine counts at cache datacenters to support all traffic reliably.
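
To make the fragmentation point in item 2 concrete, here is a minimal sketch of how many cached copies one URL ends up with when the response varies on X-CS. The carrier IDs below are made up, and the model assumes that the "X-CS: ON" work means every Zero carrier shares the single value ON for Vary purposes; the real VCL may differ in detail.

```python
def cache_variants(carriers, collapse_to_on):
    """Count distinct cached copies of one URL when the response varies on X-CS.

    carriers: one entry per request; None means non-Zero traffic (no X-CS),
    otherwise a carrier-id string. All IDs used here are hypothetical.
    """
    seen = set()
    for carrier in carriers:
        if carrier is None:
            x_cs = None        # no X-CS header at all
        elif collapse_to_on:
            x_cs = "ON"        # assumption: all carriers share one Vary value
        else:
            x_cs = carrier     # full Vary on the real carrier-id
        seen.add(x_cs)
    return len(seen)


# Hypothetical request mix for a single URL.
carriers_seen = [None, "123-45", "250-99", "123-45", "404-01", None]
full_vary = cache_variants(carriers_seen, collapse_to_on=False)
collapsed = cache_variants(carriers_seen, collapse_to_on=True)
```

With full Vary the object count grows with the number of carriers; with the collapsed value it stays at two (Zero and non-Zero), which is why fragmentation is no longer a reason to keep mobile separate.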

Some of the key steps and/or issues to address here (many can be done as slow refactor work leading up to the bigger switches):

  • Align the current text and mobile VCL code better where they differ for trivial or non-existent reasons, or one (usually text!) is simply better-configured than the other.
  • Get netmapper and analytics synced up between the clusters
  • Fix mobile cookie handling and merge code-wise with text cookie handling
  • Turn on the key mobile code in the text-caches with X-Subdomain regexes protecting exclusive parts in both directions (e.g. current text's mobile redirect code, and mobile's X-Subdomain code).
  • Ensure no mobile-vs-desktop cache pollution on the text cluster (fixed via Vary changes or vcl_hash)
  • Coordinate with Analytics on the shift of mobile request logs to the text data sources, and make any changes needed there to support it
  • Move the mobile IPs to the text-cluster. The two would still differ on IP address, but would both come through the same nginx proxy and varnish cluster.
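
The pollution concern in the steps above can be sketched as follows. This is an illustration only, not the production VCL: it assumes that an m-dot hostname is rewritten to its desktop equivalent before the backend fetch and that an X-Subdomain value of "M" records the difference, so the cache key has to include that value. The regex, the "M" value, and the function names are assumptions.

```python
import re

# Assumption: mobile hostnames carry an "m" label after the first label,
# e.g. en.m.wikipedia.org. Hostnames without a leading label are not handled.
MOBILE_HOST = re.compile(r"^([a-z0-9-]+)\.m\.(.+)$")


def normalize(host):
    """Return (backend_host, x_subdomain) for a request hostname."""
    host = host.lower()
    match = MOBILE_HOST.match(host)
    if match:
        # Rewrite to the desktop hostname and remember that it was mobile.
        return f"{match.group(1)}.{match.group(2)}", "M"
    return host, ""


def hash_key(host, url):
    """Cache key that keeps mobile and desktop objects apart."""
    backend_host, x_subdomain = normalize(host)
    return (backend_host, url, x_subdomain)


mobile_key = hash_key("en.m.wikipedia.org", "/wiki/Foo")
desktop_key = hash_key("en.wikipedia.org", "/wiki/Foo")
```

If the third element were left out of the key, both requests would map to the same object, which is the cross-pollution the Vary or vcl_hash changes are meant to prevent.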

[... at this stage, we can decom the actual mobile cache cluster and reuse it for other purposes ...]

Details

Repo | Branch | Lines +/-
operations/puppet | production | +3 -55
operations/puppet | production | +11 -367
operations/puppet | production | +0 -41
operations/puppet | production | +0 -58
operations/puppet | production | +2 -1
operations/puppet | production | +9 -72
operations/puppet | production | +1 -1
operations/puppet | production | +4 -4
operations/puppet | production | +8 -0
operations/puppet | production | +8 -0
operations/puppet | production | +4 -4
operations/puppet | production | +16 -0
operations/puppet | production | +4 -4
operations/puppet | production | +6 -0
operations/puppet | production | +2 -2
operations/puppet | production | +4 -4
operations/puppet | production | +8 -0
operations/puppet | production | +4 -1
operations/puppet | production | +9 -0
operations/puppet | production | +3 -9
operations/puppet | production | +82 -81
operations/puppet | production | +108 -314
operations/puppet | production | +218 -246
operations/puppet | production | +11 -12
operations/puppet | production | +16 -39
operations/puppet | production | +0 -5
operations/puppet | production | +16 -54
operations/puppet | production | +29 -0
operations/puppet | production | +26 -2
operations/puppet | production | +35 -8
operations/puppet | production | +7 -0
operations/puppet | production | +5 -0
operations/puppet | production | +7 -7
operations/puppet | production | +1 -1
operations/puppet | production | +29 -28
operations/puppet | production | +14 -8
operations/puppet | production | +167 -145
operations/puppet | production | +211 -243
operations/puppet | production | +167 -145
operations/puppet | production | +6 -10
operations/puppet | production | +0 -5
operations/puppet | production | +1 -1
operations/puppet | production | +1 -0
operations/puppet | production | +10 -5
operations/puppet | production | +22 -22
operations/puppet | production | +14 -14
operations/puppet | production | +7 -12
operations/puppet | production | +44 -0
operations/puppet | production | +49 -50
operations/puppet | production | +91 -91
operations/puppet | production | +80 -17
operations/puppet | production | +58 -16
operations/puppet | production | +5 -3
operations/puppet | production | +5 -4
operations/puppet | production | +2 -10
operations/puppet | production | +8 -10
operations/puppet | production | +38 -14
operations/puppet | production | +11 -12
operations/puppet | production | +33 -31
operations/puppet | production | +3 -3

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 244204 merged by BBlack:
X-Client-IP 4/12 - move XFF-setter out of recv_fe_ip_processing

https://gerrit.wikimedia.org/r/244204

Change 244205 merged by BBlack:
X-Client-IP 5/12 - recv_fe_ip_proc frontend-only

https://gerrit.wikimedia.org/r/244205

Change 244206 merged by BBlack:
X-Client-IP 6/12 - unset the 4x new headers

https://gerrit.wikimedia.org/r/244206

Change 244207 merged by BBlack:
X-Client-IP 7/12 - Set X-T-P

https://gerrit.wikimedia.org/r/244207

Change 244208 merged by BBlack:
X-Client-IP 8/12 - Set X-CIP

https://gerrit.wikimedia.org/r/244208

Change 244209 merged by BBlack:
X-Client-IP 9/12 - Set X-C X-C-M

https://gerrit.wikimedia.org/r/244209

Change 244210 merged by BBlack:
X-Client-IP 10/12 - switch zero.inc to using XC XCM

https://gerrit.wikimedia.org/r/244210

Change 244211 merged by BBlack:
X-Client-IP 11/12 - remove outdated 404-01b zero case

https://gerrit.wikimedia.org/r/244211

Change 244212 merged by BBlack:
X-Client-IP 12/12 - switch zero analytics to use XC/XCM

https://gerrit.wikimedia.org/r/244212

Change 244442 had a related patch set uploaded (by BBlack):
X-Client-IP: get rid of temp var, update commentary

https://gerrit.wikimedia.org/r/244442

Change 244442 merged by BBlack:
X-Client-IP: get rid of temp var, update commentary

https://gerrit.wikimedia.org/r/244442

Change 243977 merged by BBlack:
Move all X-Analytics code to analytics.inc, include in common VCL

https://gerrit.wikimedia.org/r/243977

Change 257699 had a related patch set uploaded (by BBlack):
varnish: use same VCL files for text mobile

https://gerrit.wikimedia.org/r/257699

Change 257774 had a related patch set uploaded (by BBlack):
text VCL: remove hiera mobile/text conditionals

https://gerrit.wikimedia.org/r/257774

Change 257699 merged by BBlack:
varnish: use same VCL files for text mobile

https://gerrit.wikimedia.org/r/257699

Change 257774 merged by BBlack:
text VCL: remove hiera mobile/text conditionals

https://gerrit.wikimedia.org/r/257774

Change 258208 had a related patch set uploaded (by BBlack):
Text VCL: same no-article-cache for mobile as desktop

https://gerrit.wikimedia.org/r/258208

Change 258208 merged by BBlack:
Text VCL: same no-article-cache for mobile as desktop

https://gerrit.wikimedia.org/r/258208

I've successfully tested loading mobile content through the text cache (local DNS hack pointing the m-dot hostname at the text cluster), requested the same URLs through both, and confirmed that they don't pollute each other. So in functional user-facing terms, I think we're ready to make the switch here. However, I think we'll need to coordinate with Analytics about how this affects their stats streams (e.g. what used to be requests in webrequest_mobile will now be mixed in with desktop in webrequest_text, differentiated by the m-dot request hostnames).
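
As a rough illustration of the Analytics side of this, here is a sketch of splitting a mixed webrequest_text stream back into mobile and desktop by request hostname. The record layout (a dict with a `uri_host` key), the function names, and the rule "any `m` label ahead of a two-label registered domain means mobile" are assumptions made for the example, not a description of the actual Analytics jobs.

```python
from collections import Counter


def access_method(uri_host):
    """Classify a request hostname as 'mobile' or 'desktop'.

    Assumption: the registered domain is the last two labels, and a mobile
    host has an 'm' label somewhere before it (en.m.wikipedia.org).
    """
    labels = uri_host.lower().split(".")
    return "mobile" if "m" in labels[:-2] else "desktop"


def split_counts(records):
    """Count requests per access method in a mixed stream of log records."""
    return Counter(access_method(record["uri_host"]) for record in records)


# Hypothetical sample of records as they might appear once the streams merge.
sample = [
    {"uri_host": "en.wikipedia.org"},
    {"uri_host": "en.m.wikipedia.org"},
    {"uri_host": "commons.wikimedia.org"},
]
counts = split_counts(sample)
```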

@Ottomata?

Change 258458 had a related patch set uploaded (by BBlack):
cache_text: add mobile IPs to loopback

https://gerrit.wikimedia.org/r/258458

Change 258459 had a related patch set uploaded (by BBlack):
mobile-lb: use text caches as LVS backends

https://gerrit.wikimedia.org/r/258459

Change 258648 had a related patch set uploaded (by BBlack):
text VCL: protect mobile cache from text pollution

https://gerrit.wikimedia.org/r/258648

Change 258648 merged by BBlack:
text VCL: protect mobile cache from text pollution

https://gerrit.wikimedia.org/r/258648

Ok, I think we are ready on the Analytics side. We'll need to do some things right after this change is made, so some planning is in order over in https://phabricator.wikimedia.org/T122651. Let's set a date for making this change, and we'll make sure we are ready to do our part.

The switch of traffic off of the mobile cluster is tentatively scheduled to begin on Tuesday, Jan 19th and take at least several hours. Will update here when it's complete.

Change 258458 merged by Ema:
cache_text: add mobile IPs to loopback

https://gerrit.wikimedia.org/r/258458

The traffic move from mobile->text is now on hold (we did convert codfw, then we rolled back) due to purge-related issues that need to be addressed first, in blocking task T124165.

Change 265710 had a related patch set uploaded (by Ema):
codfw: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/265710

Change 265710 merged by Ema:
codfw: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/265710

Change 265742 had a related patch set uploaded (by Ema):
codfw: remove varnish-fe,nginx services from mobile cluster

https://gerrit.wikimedia.org/r/265742

Change 265742 merged by Ema:
codfw: remove varnish-fe,nginx services from mobile cluster

https://gerrit.wikimedia.org/r/265742

Change 258459 abandoned by BBlack:
mobile-lb: use text caches as LVS backends

https://gerrit.wikimedia.org/r/258459

Change 266230 had a related patch set uploaded (by Ema):
ulsfo: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/266230

Change 266230 merged by Ema:
ulsfo: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/266230

Change 266253 had a related patch set uploaded (by Ema):
ulsfo: remove varnish-fe,nginx services from mobile cluster

https://gerrit.wikimedia.org/r/266253

Change 266253 merged by Ema:
ulsfo: remove varnish-fe,nginx services from mobile cluster

https://gerrit.wikimedia.org/r/266253

Change 266475 had a related patch set uploaded (by Ema):
esams: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/266475

Change 266475 merged by Ema:
esams: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/266475

Change 266499 had a related patch set uploaded (by Ema):
esams: remove varnish-fe,nginx services from mobile cluster

https://gerrit.wikimedia.org/r/266499

Change 266499 merged by Ema:
esams: remove varnish-fe,nginx services from mobile cluster

https://gerrit.wikimedia.org/r/266499

Change 266503 had a related patch set uploaded (by Ema):
eqiad: add text nodes to mobile cluster

https://gerrit.wikimedia.org/r/266503

Change 267159 had a related patch set uploaded (by BBlack):
eqiad: add text nodes to cache_mobile frontends

https://gerrit.wikimedia.org/r/267159

Change 267160 had a related patch set uploaded (by BBlack):
eqiad: remove mobile frontends from cache_mobile

https://gerrit.wikimedia.org/r/267160

Change 267159 merged by BBlack:
eqiad: add text nodes to cache_mobile frontends

https://gerrit.wikimedia.org/r/267159

Change 266503 abandoned by Ema:
eqiad: add text nodes to mobile cluster

Reason:
Already done in https://gerrit.wikimedia.org/r/#/c/267159/.

https://gerrit.wikimedia.org/r/266503

Change 267230 had a related patch set uploaded (by BBlack):
eqiad: remove last cache_mobile frontend

https://gerrit.wikimedia.org/r/267230

Change 267160 merged by BBlack:
eqiad: remove most mobile frontends from cache_mobile

https://gerrit.wikimedia.org/r/267160

Status update: We're pretty much done with the cache traffic migration, but one eqiad mobile cache (cp1060) is still pooled with low weight to keep mobile webrequest analytics data flowing until Analytics is ready on their end, which will probably be on Monday. After that we'll probably hold off on any further related work for a week in case latent issues or complaints crop up, so that it's easier to revert. Assuming no issues arise and comfort levels are high, we can then proceed with removing the puppet definitions for the cache_mobile cluster in LVS/varnish terms and moving the mobile IPs into cache_text's list of IPs.
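
For a sense of what "pooled with low weight" means for cp1060, here is a back-of-the-envelope sketch. It assumes each pooled host receives traffic roughly in proportion to its weight, which is a simplification of how the load balancer actually schedules connections. The weights and the `text-*` host names are invented for the example; only cp1060 comes from this task.

```python
def traffic_share(weights):
    """Return each host's approximate share of traffic given pool weights."""
    total = sum(weights.values())
    return {host: weight / total for host, weight in weights.items()}


# Hypothetical pool: one low-weight mobile cache alongside three text caches.
pool = {"cp1060": 1, "text-a": 10, "text-b": 10, "text-c": 10}
shares = traffic_share(pool)
```

Under these made-up weights cp1060 would see about 3% of requests (1 out of 31): enough to keep data flowing into webrequest_mobile while nearly all users are already served by the text caches.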

Ok great! We’re having some issues with jobs right now due to some Kafka problems, and we’ll want to make sure everything is fine before we try to move on this. Hopefully everything will be fine by Monday and we can proceed.

I just talked to @BBlack, and also looked at requests in the webrequest_mobile topic in Kafka. There are still real user requests from cp1060, but most of the requests in that topic are internal monitoring requests. Analytics jobs will not block as long as at least some data continues to flow into webrequest_mobile. I told @BBlack that they can go ahead and drain traffic from cp1060, and that we will proceed with Analytics changes after that is done, but before monitoring traffic is also turned off.

cp1060 is depooled for users now. Once Analytics is done with their oozie thing, we can proceed on the next steps for actually stopping the cache_mobile cluster itself (which is devoid of real users now).

Change 267230 merged by BBlack:
eqiad: remove last cache_mobile frontend

https://gerrit.wikimedia.org/r/267230

Change 268226 had a related patch set uploaded (by BBlack):
cache_mobile LVS decom: 1/2 remove LVS service

https://gerrit.wikimedia.org/r/268226

Change 268227 had a related patch set uploaded (by BBlack):
cache_mobile LVS decom: 2/2 remove conftool data

https://gerrit.wikimedia.org/r/268227

Change 268228 had a related patch set uploaded (by BBlack):
cache_mobile decom: 1/2 remove realserver IPs

https://gerrit.wikimedia.org/r/268228

Change 268229 had a related patch set uploaded (by BBlack):
cache_mobile decom: 2/2 Remove most cache config

https://gerrit.wikimedia.org/r/268229

Change 268226 merged by BBlack:
cache_mobile LVS decom: 1/2 remove LVS service

https://gerrit.wikimedia.org/r/268226

Change 268228 merged by BBlack:
cache_mobile decom: 1/2 remove realserver IPs

https://gerrit.wikimedia.org/r/268228

Change 269127 had a related patch set uploaded (by BBlack):
cache_mobile LVS decom: 3/3 remove conftool service data

https://gerrit.wikimedia.org/r/269127

Change 268227 merged by BBlack:
cache_mobile LVS decom: 2/3 remove conftool node data

https://gerrit.wikimedia.org/r/268227

Change 269127 merged by BBlack:
cache_mobile LVS decom: 3/3 remove conftool service data

https://gerrit.wikimedia.org/r/269127

Change 268229 merged by BBlack:
cache_mobile decom: 2/2 Remove most cache config

https://gerrit.wikimedia.org/r/268229

BBlack updated the task description.

Change 269141 had a related patch set uploaded (by BBlack):
torrus: remove cache_mobile stuff

https://gerrit.wikimedia.org/r/269141

Change 269141 merged by BBlack:
torrus: remove cache_mobile stuff

https://gerrit.wikimedia.org/r/269141