
[Clean up] Redirect m-dot URLs to canonical domains
Closed, Resolved, Public

Assigned To
Authored By
Krinkle
Sep 29 2025, 4:12 PM
Referenced Files
F70576169: 2025_varnishpurgereq_dcall_total_per_day_edit2.png
Mon, Nov 24, 12:19 PM
F70575798: 2025_varnishpurgereq_dcall_total_per_day_plot2_anno.png
Mon, Nov 24, 12:12 PM
F69819680: T405931-purge-drop.png
Nov 3 2025, 6:59 PM
F67609356: cap.png
Oct 27 2025, 7:37 AM
F67549276: 2025_varnishpurgereq_dcall.png
Oct 27 2025, 12:28 AM
F66887821: purges-total-annoted.png
Oct 24 2025, 8:53 AM
F66887823: purges-total.jpg
Oct 24 2025, 8:53 AM
F66886784: Screenshot 2025-10-24 at 00.53.22.png
Oct 24 2025, 8:53 AM

Description

Status quo

See also RFC: Mobile domain sunsetting § Infra cost on mediawiki.org.

  • The CDN (Varnish/ATS) responds to requests on mobile subdomains with a mobile version.
  • It stores this response under that URL as-is.
  • This means the standard copy and mobile copy are cached under different URLs, which has historically required MediaWiki to emit twice as many purges to the CDN (one for each URL) after every edit and links update.
  • Historically, the reason for this was that MobileFrontend did not emit Vary:X-Subdomain from its backend responses. This was fixed in T390929.

The way this works today:

  • Varnish handles requests to the mobile subdomain by:
    • computing the standard domain (e.g. stripping the "m") and storing it in the temporary x-dt-host header. This is ignored by Varnish itself, and the request hostname (e.g. "en.m.wikipedia.org") is left unchanged, as was historically needed to ensure mobile versions have a separate cache instead of poisoning the desktop cache.
    • After cache miss, Varnish frontend fetches from ATS backend.
  • ATS leaves the request unchanged during cache lookup, for the same reason (lack of Vary:X-Subdomain).
    • After cache miss, ATS fetches from MediaWiki.
    • Before performing that fetch, it sets Host = x-dt-host via our rb-mw-mangling.lua plugin.
  • MediaWiki has a hook for MobileFrontend, which modifies the purge list to add a duplicate of each purge command with the hostname swapped for the mobile subdomain.
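
The host handling described in this list can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual VCL or Lua code: the function names are hypothetical, and the sketch only covers hosts where "m" is the second label (like "en.m.wikipedia.org"). Bare forms such as "m.wikisource.org" are deliberately left out, since their mapping is not described here.

```python
from typing import Optional


def standard_host(host: str) -> Optional[str]:
    """Return the standard domain for an m-dot host, or None if it is not m-dot.

    Hypothetical helper: only handles "<lang>.m.<project>.<tld>" style hosts.
    """
    labels = host.lower().split(".")
    # Require "m" as the second of at least four labels, e.g. en.m.wikipedia.org
    if len(labels) >= 4 and labels[1] == "m":
        return ".".join([labels[0]] + labels[2:])
    return None


def headers_for_cache_lookup(host: str) -> dict:
    """Mimic the status quo: Host stays unchanged, x-dt-host carries the standard domain."""
    headers = {"Host": host}
    dt_host = standard_host(host)
    if dt_host is not None:
        headers["x-dt-host"] = dt_host
    return headers
```

Because the cache lookup keys on the unchanged Host, the mobile and standard copies end up as separate cache objects, which is what forces the duplicate purges.
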
Technical diagram from Mobile domain sunsetting on mediawiki.org:

WMF Unified mobile routing 2025.png (2×2 px, 408 KB)

The mobile handling code in Varnish is defined at https://w.wiki/FG4S.
The mobile domain code in Varnish and ATS is defined at https://w.wiki/FG4Y and https://w.wiki/FG4a.

Acceptance criteria

  • One cache: Varnish and ATS should not maintain separate caches for mobile domain and unified mobile routing.
  • One purge: MediaWiki doesn't send duplicate purges for the mobile domain.

Option A: Share CDN cache now, redirect later

  1. Copy Host = x-dt-host assignment from ATS to Varnish, behind a feature flag.
  2. Move MobileFrontend purging hook, currently in wmf-config, behind a feature flag.
  3. Rollout to Beta Cluster and testwikis
  4. Rollout to production on the same wikis that enable unified mobile routing:
    • Turn on Varnish feature flag. This will essentially clear the m-dot cache and make it use the unified cache instead. At this point purges from MediaWiki for the mobile domain are redundant.
    • Turn on the MW feature flag to stop MobileFrontend sending purges to the mobile domain.

Option B: Redirect now

I think a reasonable question may be: Why not redirect m-dot to standard and forget about the m-dot cache?

The original RFC suggested we redirect at the same time as (or immediately after) enabling the mobile version routing on the standard domain for a given wiki. This was revised at T405429#11208655 to instead follow the drop-off of traffic, acknowledging that we may need to wait several months for external traffic sources to adapt.

But we don't need to wait for that in order to start reaping the benefits of reduced cache variants and of cutting the purge load. We can have the m-dot domain and unified URLs share the same cache now, and do the redirect later.

On the other hand, we now have data (T405429#11221223) that shows for the 300+ wikis we have enabled so far, 98.6% of mobile traffic moves to the standard domain within 24 hours, leaving only 1% on the mobile subdomain (i.e. direct links). […]

So maybe we don't need two separate steps and we can instead redirect now, which naturally eliminates the need for a duplicate cache and purge:

  • Add Varnish logic for redirecting m-dot to canonical, behind a feature flag.
  • Enable that one day after the main feature flag, for a given set of wikis.
  • Once both rollouts are complete, we remove the duplicate purging from MediaWiki.
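
A minimal sketch of what such a flagged redirect could look like, written in Python rather than VCL. Everything here is an assumption for illustration: the function name, the `enabled` and `permanent` parameters, and the choice of 302/301 for GET and HEAD versus 307/308 for other methods (so that the method and body are preserved) are not taken from the actual Varnish change.

```python
from typing import Optional, Tuple


def m_dot_redirect(method: str, host: str, path: str,
                   enabled: bool, permanent: bool = False) -> Optional[Tuple[int, str]]:
    """Return (status, Location) for an m-dot request, or None to serve it normally."""
    if not enabled:
        return None  # feature flag off: keep the old behaviour
    labels = host.lower().split(".")
    if len(labels) < 4 or labels[1] != "m":
        return None  # not a "<lang>.m.<project>..." host
    canonical = ".".join([labels[0]] + labels[2:])
    if method in ("GET", "HEAD"):
        status = 301 if permanent else 302
    else:
        # 307/308 tell the client to repeat the same method and body
        status = 308 if permanent else 307
    return status, f"https://{canonical}{path}"
```

With the flag on, `m_dot_redirect("GET", "en.m.wikipedia.org", "/wiki/Banana", enabled=True)` yields a temporary redirect to the same path on "en.wikipedia.org"; requests to the standard domain fall through untouched.
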

Event Timeline

@BBlack @ssingh @BCornwall

I think a reasonable question may be: Why not redirect m-dot to standard and forget about the m-dot cache?

The original RFC suggested we redirect at the same time as (or immediately after) enabling the mobile version routing on the standard domain for a given wiki. This was revised at T405429#11208655 to instead follow the drop-off of traffic, acknowledging that we may need to wait several months for external traffic sources to adapt.

But we don't need to wait for that in order to start reaping the benefits of reduced cache variants and of cutting the purge load. We can have the m-dot domain and unified URLs share the same cache now, and do the redirect later.

On the other hand, we now have data (T405429#11221223) that shows for the 300+ wikis we have enabled so far, 98.6% of mobile traffic moves to the standard domain within 24 hours, leaving only 1% on the mobile subdomain (i.e. direct links).

[…]
Screenshot 2025-09-24 at 03.30.15.png (1×1 px, 282 KB) Screenshot 2025-09-26 at 21.19.59.png (728×1 px, 62 KB)

So maybe we don't need two separate steps and we can instead redirect now, which naturally eliminates the need for a duplicate cache and purge:

  • Add Varnish logic for redirecting m-dot to canonical, behind a feature flag.
  • Enable that one day after the main feature flag, for a given set of wikis.
  • Once both rollouts are complete, we remove the duplicate purging from MediaWiki.
Krinkle renamed this task from Share CDN cache between m-dot and unified mobile routing to [Clean up] Redirect m-dot URLs to canonical domains.Oct 2 2025, 6:52 AM
Krinkle updated the task description. (Show Details)

Change #1194558 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Remove unused "Mobile Redirect" logic

https://gerrit.wikimedia.org/r/1194558

Change #1194558 merged by BCornwall:

[operations/puppet@production] varnish: Remove unused "Mobile Redirect" logic

https://gerrit.wikimedia.org/r/1194558

Change #1197341 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Add test for m.wikisource.org x-dt-host rewrite

https://gerrit.wikimedia.org/r/1197341

Change #1197343 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug

https://gerrit.wikimedia.org/r/1197343

Change #1197351 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Implement enable_m_redir and enable in Beta Cluster

https://gerrit.wikimedia.org/r/1197351

Change #1197693 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Enable enable_m_redir in Beta Cluster for all wikis

https://gerrit.wikimedia.org/r/1197693

Change #1197694 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Enable enable_m_redir in esams and drmrs

https://gerrit.wikimedia.org/r/1197694

Change #1197695 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Enable enable_m_redir everywhere

https://gerrit.wikimedia.org/r/1197695

Change #1197341 merged by BCornwall:

[operations/puppet@production] varnish: Add test for m.wikisource.org x-dt-host rewrite and POST

https://gerrit.wikimedia.org/r/1197341

Change #1197343 merged by BCornwall:

[operations/puppet@production] varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug

https://gerrit.wikimedia.org/r/1197343

Change #1197730 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Remove unreachable optin=beta code

https://gerrit.wikimedia.org/r/1197730

Change #1197351 merged by BCornwall:

[operations/puppet@production] varnish: Implement enable_m_redir and enable on test wikis

https://gerrit.wikimedia.org/r/1197351

Change #1197693 merged by BCornwall:

[operations/puppet@production] varnish: Enable enable_m_redir in Beta Cluster for all wikis

https://gerrit.wikimedia.org/r/1197693

Change #1197730 merged by BCornwall:

[operations/puppet@production] varnish: Remove unreachable optin=beta code

https://gerrit.wikimedia.org/r/1197730

Change #1197694 merged by BCornwall:

[operations/puppet@production] varnish: Enable enable_m_redir in esams and drmrs

https://gerrit.wikimedia.org/r/1197694

Change #1198412 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] wmf-config: Stop sending HTTP purges for mobile domains

https://gerrit.wikimedia.org/r/1198412

Change #1197695 merged by BCornwall:

[operations/puppet@production] varnish: Enable enable_m_redir everywhere

https://gerrit.wikimedia.org/r/1197695

Tested Varnish purge change with tcpdump in Beta.

Change #1198412 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] wmf-config: Stop sending HTTP purges for mobile domains

https://gerrit.wikimedia.org/r/1198412

Tab A
krinkle@deployment-cache-text08:~$ sudo tcpdump -i lo -A 'tcp port 3128' | grep -A3 PURGE
Tab B
krinkle@deployment-mediawiki14:~$ curl -X POST -i 'http://en.wikipedia.beta.wmcloud.org/w/api.php?action=purge&format=json&titles=Sandbox' --connect-to ::deployment-mediawiki14
Before
..}r..}qPURGE /wiki/Sandbox HTTP/1.1
Host: en.wikipedia.beta.wmcloud.org
User-Agent: purged

--
..}r..}qPURGE /w/index.php?action=history&title=Sandbox HTTP/1.1
Host: en.wikipedia.beta.wmcloud.org
User-Agent: purged

--
..}s..}sPURGE /wiki/Sandbox HTTP/1.1
Host: en.m.wikipedia.beta.wmcloud.org
User-Agent: purged

--
..}s..}sPURGE /w/index.php?action=history&title=Sandbox HTTP/1.1
Host: en.m.wikipedia.beta.wmcloud.org
User-Agent: purged
Cherry-pick
jenkins-deploy@deployment-deploy04:/srv/mediawiki-staging$ git fetch https://gerrit.wikimedia.org/r/operations/mediawiki-config refs/changes/12/1198412/1 && git cherry-pick FETCH_HEAD

…

krinkle@deployment-mediawiki14:~$ scap pull

…
After
...0..iZPURGE /w/index.php?action=history&title=Main_Page HTTP/1.1
Host: en.wikipedia.beta.wmcloud.org
User-Agent: purged

--
...1..iZPURGE /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.beta.wmcloud.org
User-Agent: purged

Change #1198412 merged by jenkins-bot:

[operations/mediawiki-config@master] wmf-config: Stop sending HTTP purges for mobile domains

https://gerrit.wikimedia.org/r/1198412

Mentioned in SAL (#wikimedia-operations) [2025-10-24T06:53:15Z] <krinkle@deploy2002> Started scap sync-world: Backport for [[gerrit:1198412|wmf-config: Stop sending HTTP purges for mobile domains (T405931)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-24T06:57:46Z] <krinkle@deploy2002> krinkle: Backport for [[gerrit:1198412|wmf-config: Stop sending HTTP purges for mobile domains (T405931)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Change #1198429 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Promote new m-dot redirect from 302/307 to 301/308

https://gerrit.wikimedia.org/r/1198429

Change #1198430 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Remove temporary enable_m_redir flag

https://gerrit.wikimedia.org/r/1198430

Change #1198412 merged by jenkins-bot:

[operations/mediawiki-config@master] wmf-config: Stop sending HTTP purges for mobile domains

https://gerrit.wikimedia.org/r/1198412

Production test, from Logstash type:mediawiki channel:squid

Before
MediaWiki\Deferred\CdnCacheUpdate::purge: https://test.wikipedia.org/wiki/Gingival_enlargement https://test.wikipedia.org/w/index.php?title=Gingival_enlargement&action=history https://test.m.wikipedia.org/wiki/Gingival_enlargement https://test.m.wikipedia.org/w/index.php?title=Gingival_enlargement&action=history
After
MediaWiki\Deferred\CdnCacheUpdate::purge: https://test.wikipedia.org/wiki/Sandbox https://test.wikipedia.org/w/index.php?title=Sandbox&action=history
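
The difference between the two log lines can be reproduced with a small sketch. This is not the MobileFrontend hook or the wmf-config code; `mobile_variant` and `purge_urls` are hypothetical names, and the sketch assumes the mobile host is formed by inserting an "m" label after the first host label, as in the URLs logged above.

```python
from urllib.parse import urlsplit, urlunsplit


def mobile_variant(url: str) -> str:
    """Insert an "m" label after the first host label of the URL."""
    parts = urlsplit(url)
    first, rest = parts.netloc.split(".", 1)
    return urlunsplit(parts._replace(netloc=f"{first}.m.{rest}"))


def purge_urls(canonical_urls, include_mobile):
    """Build the purge list, optionally appending a mobile duplicate of each URL."""
    urls = list(canonical_urls)
    if include_mobile:
        urls += [mobile_variant(u) for u in canonical_urls]
    return urls


canonical = [
    "https://test.wikipedia.org/wiki/Sandbox",
    "https://test.wikipedia.org/w/index.php?title=Sandbox&action=history",
]
before = purge_urls(canonical, include_mobile=True)   # four URLs
after = purge_urls(canonical, include_mobile=False)   # two URLs
```
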

Mentioned in SAL (#wikimedia-operations) [2025-10-24T07:06:51Z] <krinkle@deploy2002> Finished scap sync-world: Backport for [[gerrit:1198412|wmf-config: Stop sending HTTP purges for mobile domains (T405931)]] (duration: 13m 35s)

Looking at the impact on Varnish, after removing half the purge load from MediaWiki.

Change #1198412 merged by jenkins-bot:

[operations/mediawiki-config] Stop sending HTTP purges for mobile domains

https://gerrit.wikimedia.org/r/1198412

Grafana dashboard: Varnish HTTP Requests / Text PURGE per DC

Screenshot 2025-10-24 at 00.53.22.png (546×2 px, 193 KB)

These are subtotals per DC. Each DC usually handles around 12.5K purge requests/second. Elsewhere that translates to ~88K/sec total across DCs. The rate is quite spiky, so the impact isn't clear-cut at any given moment. I'll look at this again in the UTC evening to see how it compares for the day as a whole.

Grafana dashboard: Varnish Aggregate Client Status Code / PURGE

purges-total.jpg (1×2 px, 281 KB) purges-total-annoted.png (1×2 px, 214 KB)

It is at this point that I realized the installation of the m-dot redirect earlier today also affects the purged.go client issuing HTTP purge requests to Varnish. Purges need a response code, and whether that's HTTP 204 or HTTP 307 doesn't matter here, but it's a funny thing to see. The benefit, however, is that it gives us ~8 hours of perfect telemetry on exactly what infrastructure cost we're cutting, because HTTP 307 on PURGE is used exclusively by these mobile-domain purges. The totals will go down and that should suffice, but it's very useful in retrospect to have these explicitly earmarked for a few hours before turning them off.

Using the 00:04 UTC peak as an example, the load breaks down as follows:

  • 100% = Total purges on varnish-text: 365K/s
  • Per-DC: 52.7K/s across 7 data centers
  • 40% = 149K/s mobile purges by MediaWiki
  • 60% = 216K/s other purges
    • 40% = 149K/s desktop purges by MediaWiki (presumed)
    • 20% = 69K/s other services (inferred)

Using a median point instead, such as 03:53 UTC:

  • 100% = Total purges on varnish-text: 108K/s
  • Per-DC: 15.4K/s across 7 data centers
  • 20% = 21.2K/s mobile purges by MediaWiki
  • 80% = 86.8K/s other purges
    • 20% = 21.2K/s desktop purges by MediaWiki (inferred)
    • 60% = 65.6K/s other services (inferred)

The baseline rate of purges from outside MediaWiki seems both surprisingly high and surprisingly constant at ~65K/s. During a "typical" minute, the drop can be hard to spot. But during peaks, this cuts load by 40%!

RESTBase sends purges at a fairly constant rate of 65K/s, whereas MediaWiki purges vary from 40K/s to 300K/s. The overall saving is thus between 20% and 40% ((40/2)/105 and (300/2)/365).
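
The two fractions in parentheses above can be checked with a few lines of Python. This is only the arithmetic from this comment; the names are made up for the sketch, and all rates are in thousands of purges per second.

```python
RESTBASE_RATE = 65  # K/s, roughly constant


def mobile_share(mediawiki_rate: float) -> float:
    """Fraction of all purges saved when half of MediaWiki's purges go away."""
    total = RESTBASE_RATE + mediawiki_rate
    return (mediawiki_rate / 2) / total


low = mobile_share(40)    # quiet period: (40/2)/105
high = mobile_share(300)  # peak: (300/2)/365
print(f"{low:.0%} to {high:.0%}")  # prints "19% to 41%"
```
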

cap.png (1×1 px, 96 KB)

For the highlighted 6h period of this ad-hoc instrumentation, we saw 691M purges for the mobile domain from MediaWiki, and 2,130M other purges. That's roughly 25% of the total (691M of 2,821M).

I peeked at what these RESTBase purges are. Unscientific sample, taken after mobile purges were stopped.

$ kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -o -10000 -t codfw.resource-purge | head -n10000 | grep -Eo '"uri":".*\org/[^"?]*' | grep -Eo '(/wiki|/w/index.php|/api/rest_v1/[^ ?/]*/[^ ?/]*)' | sort | uniq -c | sort -rn
   6188 /wiki
   1087 /api/rest_v1/page/summary
   1040 /api/rest_v1/page/definition
    833 /api/rest_v1/page/mobile-html
    650 /api/rest_v1/page/media-list
    202 /w/index.php
      1 /api/rest_v1/media/math

$ kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -o -10000 -t eqiad.resource-purge | head -n10000 | grep -Eo '"uri":".*\org/[^"?]*' | grep -Eo '(/wiki|/w/index.php|/api/rest_v1/[^ ?/]*/[^ ?/]*)' | sort | uniq -c | sort -rn
   9116 /wiki
   8198 /api/rest_v1/media/math
    897 /w/index.php
      2 /api/rest_v1/page/summary

MediaWiki purges two URLs after an edit: /wiki/Banana and /w/index.php?title=Banana&action=history.

MediaWiki purges only one URL after a links update: /wiki/Banana, hence /wiki is more prominent.

The rest are RESTBase purges, presumably from change-prop.
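
For anyone who wants to repeat the bucketing without the grep pipeline, here is a rough Python equivalent. It assumes the events arrive as one JSON string per line with a `"uri":"..."` field, as the first grep above implies; the sample lines are hypothetical and reduced to that one field, and the function name is made up.

```python
import re
from collections import Counter

# Same two-step match as the grep pipeline: isolate the URI, then bucket its path prefix.
URI_RE = re.compile(r'"uri":".*org/[^"?]*')
BUCKET_RE = re.compile(r"/wiki|/w/index\.php|/api/rest_v1/[^ ?/]*/[^ ?/]*")


def count_buckets(lines):
    """Count path buckets across raw event lines, like `sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        uri_match = URI_RE.search(line)
        if uri_match is None:
            continue  # no URI on this line
        counts.update(BUCKET_RE.findall(uri_match.group(0)))
    return counts


# Hypothetical sample: one edit (two URLs), one links update, one RESTBase purge.
sample = [
    '{"uri":"https://en.wikipedia.org/wiki/Banana"}',
    '{"uri":"https://en.wikipedia.org/w/index.php?title=Banana&action=history"}',
    '{"uri":"https://en.wikipedia.org/wiki/Apple"}',
    '{"uri":"https://en.wikipedia.org/api/rest_v1/page/summary/Banana"}',
]
for bucket, n in count_buckets(sample).most_common():
    print(f"{n:7d} {bucket}")
```
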

Because the CDN purge rate is so variable, quantifying the impact requires looking at a longer period of time (either range totals, or an averaged rate). I've made a variation of the chart that looks at the 12-hour rate, the 24-hour rate, and a week-over-week average:

Adapted from Grafana dashboard: Varnish Aggregate Client Status Code

2025_varnishpurgereq_dcall.png (1×2 px, 208 KB)

Prior to the change, the baseline is around 100K/sec (65K RESTBase + 40K MediaWiki) with spikes to 300K/sec (65K RESTBase + 240K MediaWiki).

Given that we cut MediaWiki by half, we should see the baseline drop 20% from 100K/s to 80K/s, and spikes drop 40% from 300K/s to 180K/s.

What we actually see is the base rate drop to around 80-90K/sec, which matches what we expected. Over the past day, we've seen a few small spikes rise to 110K. I don't think we've experienced a comparable major spike event yet, so we'll have to see how major spikes turn out.

EDIT: As of 31 Oct: Comparing each day to the same day the week before, the difference in percentages is -32% (26 Oct), -57% (27 Oct), -42% (28 Oct), -52% (29 Oct), and -39% (30 Oct).

T405931-purge-drop.png (1×2 px, 202 KB)

EDIT: As of 15 Nov, we can see the drop more clearly. This plots the daily purge total and a 7-day average, which drops from ~12 billion purges a day in the previous weeks to ~8 billion a day in the following weeks, saving an average of 4 billion purges a day.

2025_varnishpurgereq_dcall_total_per_day_edit2.png (1×2 px, 292 KB)

2025_varnishpurgereq_dcall_total_per_day_plot2_anno.png (743×2 px, 115 KB)
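
To relate the daily totals quoted above to the per-second rates used earlier in this thread, a quick conversion helps. The figures are the approximate ones from the comment (12 billion and 8 billion purges a day); the helper name is made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(purges_per_day: float) -> float:
    """Average purge rate over a day, in purges per second."""
    return purges_per_day / SECONDS_PER_DAY


before, after = 12e9, 8e9
saving = (before - after) / before
print(f"{per_second(before) / 1000:.0f}K/s -> {per_second(after) / 1000:.0f}K/s, "
      f"saving {saving:.0%}")  # prints "139K/s -> 93K/s, saving 33%"
```

A drop of about a third in the daily average sits between the 20% (baseline) and 40% (peak) estimates made earlier.
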

Change #1198429 merged by BCornwall:

[operations/puppet@production] varnish: Promote new m-dot redirect from 302/307 to 301/308

https://gerrit.wikimedia.org/r/1198429

Change #1198430 merged by BCornwall:

[operations/puppet@production] varnish: Remove temporary enable_m_redir flag

https://gerrit.wikimedia.org/r/1198430