
Mean MediaWiki backend latency increased by 60% between October and December 2025
Open, Needs Triage, Public

Assigned To
None
Authored By
ori
Wed, Feb 4, 11:37 PM

Description

This graph indicates that mean MediaWiki backend latency has increased by ~60% between October and December 2025:

https://grafana.wikimedia.org/goto/AOTG3FNDg?orgId=1

image.png (1×2 px, 371 KB)

The one-year view suggests this is not due to seasonality.

As mentioned in T302623, looking at the Grafana: Backend Pageview Timing dashboard:

image.png (1×2 px, 1022 KB)

And the mw-web RED dashboard:

image.png (572×1 px, 317 KB)

Potential Factors (to be investigated)

  • parsoid rollout
  • PHP 8.3 rollout
  • scraper activity
  • other ideas?

Event Timeline

FYI, the spikes in the graphs are potentially due to T416567.

For mw-web latency, the first jump seems to be around October 30.

Parsoid rollout has been posted by @Jgiannelos here T416567#11587120

SAL Log: https://sal.toolforge.org/production?p=1&q=&d=2025-10-30

(Posting here for transparency, but mainly intended for @ArielGlenn.)

The first thing I tend to do after a change in reported latency is to translate it to something absolute. It's easy to assume that something got slower, and that may be the case (a change in code, config, a shared resource like a database, the web servers through reduced CPU capacity, edits to a popular template that increased the cost of parsing most pages, etc). But this is not a given.

For example, if we have fewer "fast" requests now, and other latency buckets contain the same number of requests, there is no latency regression, but most likely a change in traffic. Some requests no longer take place, thus shifting the average up, without changing anyone's experience and without increasing server load. It may have been the result of an optimisation that removed redundant web requests from a workflow.

Likewise, if we have more "slow" requests than before, but the same number in other buckets, there is also no latency regression. Requests did not move from faster to slower; rather, new traffic was added of a type that was likely already slow before, and we now have more of that type. E.g. a slow endpoint that is called more often, thus shifting the average up, without regressing any specific experience or workflow. (Maybe. The new feature might be enabled within an existing experience, where its addition might regress overall performance. But it could also be a standalone new thing.)
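The first scenario can be sketched with made-up bucket counts: dropping fast requests raises the mean even though no individual request got slower.

```python
# Hypothetical latency buckets: (request count, representative latency in ms).
# All numbers are illustrative, not measured.
before = [(1000, 50), (500, 200), (100, 1000)]   # fast, medium, slow
after  = [(400, 50), (500, 200), (100, 1000)]    # 600 fast requests disappeared

def mean_latency(buckets):
    """Weighted mean latency across buckets."""
    total = sum(count for count, _ in buckets)
    return sum(count * ms for count, ms in buckets) / total

print(mean_latency(before))  # 156.25
print(mean_latency(after))   # 220.0 -- ~41% higher mean, zero requests slower
```

The medium and slow buckets are untouched; only the traffic mix changed, yet the reported mean jumps.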

You can answer this through the Rate, Histogram, and Heatmap graphs, such as on the Grafana: Appserver RED k8s dashboard (select variables and a date range first).

You can then use the same dashboard to start narrowing down what the affected traffic has in common (i.e. something more specific than the "mw-web" servergroup): look at subsets (DC, HTTP method) and related server groups (mw-api-ext, mw-api-int), compare their average latencies over time, and try to rule out subsets that differ, or rule in other servergroups that share the same change in latency.

There seems to be some amount of temporal causality with a more general pattern of a drop in hit rates for ParserCache - see https://grafana.wikimedia.org/goto/SBEpS9NDR?orgId=1 for details, and the attached graph for the TL;DR of what I mean:

image.png (386×786 px, 52 KB)

It's worse for Parsoid (but also present on legacy), and it doesn't seem to match a (much) larger increase in overall access:
image.png (343×791 px, 58 KB)

I'm particularly puzzled by the fact that it seems to miss metadata completely. One of my fears is/was that it also kind of matches the timeline of "starting to push patches to introduce the post-processing cache", but a bug there would typically (normally!) have impacted the "unsafe" rate or the "miss" rate, not the "miss metadata" rate.

The drop in hit rates is explained as follows:

  • The nightly ~20% drop is T416567: Cache expirations (?) every UTC midnight, with the cause described in T416616: Create new cache-friendly lua/parser function for "is today before X date" and "is today after X date".
  • The decline on Parsoid since 11/25 is due to Fix cache expiration time for parsoid usage (1211138) -- Parsoid was completely ignoring wikitext-specified cache time reductions before that patch (T408741).

This is a larger problem (partly described in T416567): wikitext-initiated cache time reductions are probably having a disproportionate impact on cache hit rates, and we should probably figure out better ways to do whatever folks are trying to do in the wikitext, possibly including selective update (T373258) of the portions of the page which are date-sensitive.

Some general decline of hit rate is due to T320668: Reduce TTL lowering effect and raise cache accuracy of CURRENT*/LOCAL* magic words deployed in March 2023 (parser: improve cache TTL accuracy for CURRENT*/LOCAL* magic words (569628)) for wikitext, but with followups on Nov 6 that affected Scribunto modules: T51803: Calculated Age of persons can become outdated until next cache purge/T389085: Frame::getTTL() appears to be completely unused, Remove use of Frame::setTTL/getTTL (1201829).
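The Parsoid bug from T408741 can be sketched as follows. This is an illustrative model, not actual MediaWiki code; the default TTL value and function names are assumed.

```python
DEFAULT_TTL = 86400  # assumed default cache lifetime of one day, in seconds

def merged_ttl(fragment_ttls, default=DEFAULT_TTL):
    """Correct behaviour: the final output expires at the soonest deadline
    requested by any wikitext/Lua fragment encountered during the parse."""
    return min([default, *fragment_ttls])

def buggy_ttl(fragment_ttls, default=DEFAULT_TTL):
    """Pre-fix Parsoid behaviour (per T408741): fragment TTL reductions
    were never merged into the final ParserOutput, so they were ignored."""
    return default

# A template computing "is today before X" lowers its fragment's TTL to 1800 s:
print(merged_ttl([1800, 7200]))  # 1800 -- honours the reduction
print(buggy_ttl([1800, 7200]))   # 86400 -- caches a date-sensitive page all day
```

Fixing the bug makes caching *correct* but necessarily lowers the hit rate, which matches the post-Nov-25 decline described above.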

I'm not convinced that a 60% slowdown in backend request latency can be completely attributed to the decline in hit rates, but the reasons why hit rates declined are known (more or less).

To copy some context from a conversation elsewhere: one of the key things to notice in the cache hit rate graphs is that the cache hit rate starts getting sharper daily spikes due to T416567: Cache expirations (?) every UTC midnight/T416616 at the same time the hit rate drops by ~20%:

image.png (549×931 px, 144 KB)

This points to the root cause being wikitext- and lua-derived reduced cache lifetimes being honored. A similar pattern is observed in the Parsoid-only cache, tied to the slightly later (~Nov 25) patch that fixed a bug in Parsoid where wikitext- and lua-derived reduced cache lifetimes were being ignored (not merged into the final ParserOutput):
image.png (549×931 px, 140 KB)

Same pattern, but later -- and sharper, because in Parsoid it was *all* reduced cache lifetimes that were being ignored (before Nov 25), while on the legacy parser side it was only cache lifetimes reduced *by Scribunto* (i.e. Scribunto modules using datetime functions) that were being ignored (before Nov 5). But in both cases the general decline in cache hit rates corresponds to a simultaneous increase in daily spikiness.

This is suggestive, but doesn't actually answer the original question posed by this task. In order to make the math work out correctly, we'd need to know the percentage of overall backend requests that are parsing related, and the relative latencies of cache hits and cache misses. It is plausible that these numbers work out to explain how a 20% decrease in cache hits creates an overall 60% latency increase... but that case hasn't actually been closed yet.
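A quick sanity check of that math, with assumed (not measured) costs: if a ParserCache hit costs ~100 ms and a miss (full reparse) ~500 ms, a 20-point hit-rate drop alone already produces a ~57% mean increase.

```python
# Back-of-the-envelope mixture model. hit_ms and miss_ms are assumptions;
# the real hit/miss latencies and the parsing share of traffic are unknown.
def mean_latency(hits, misses, hit_ms=100, miss_ms=500):
    """Mean backend latency for a mix of cache hits and misses (in ms)."""
    return (hits * hit_ms + misses * miss_ms) / (hits + misses)

before = mean_latency(90, 10)  # 90% hit rate
after  = mean_latency(70, 30)  # 70% hit rate, i.e. a 20-point drop

print(before)               # 140.0
print(after)                # 220.0
print(after / before - 1)   # ~0.571 -- a ~57% increase in the mean
```

So the arithmetic is plausible if misses are roughly 5x the cost of hits and most backend requests are parse-related; closing the case requires measuring both numbers.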

This diagram might also be interesting:
https://grafana.wikimedia.org/goto/Aqf_YrHvg?orgId=1

From what I understand, it looks like the sampled requests for profiling (and thus the number of requests) for article views increased around the time we see the increase in latency.
This means we are getting more requests that bypass the caches.

Comparing one day in November (the first for which we have webrequest metrics) with the same day in January, uncached views increased by:

  • ~2x on desktop
  • ~1.7x on mobileweb
  • ~1.2x on mobileapps

I only used webrequest queries for a single day because I don't know if we have stats that break this down in e.g. Grafana. Ideally it's better to show how things evolve over time.

Parsoid looks to be consuming basically all the CPU that mw-jobrunners have to offer:

image.png (933×2 px, 695 KB)

image.png (925×2 px, 628 KB)

The units of those graphs are seconds/second -- so every second, we're burning about 3-4 minutes of CPU time.

I'm not sure if a backlog on the jobrunners could cause a latency regression of this severity, though. But the CPU cost for mw-api-int and mw-api-ext has increased a fair bit in a relative sense as well.

I don't think Parsoid is the whole story though -- there are at least two other compounding factors here.

  • Much greater than usual load from API scrapers in December
  • Some much-longer-running but weird/concerning SELECT query & read load on the database masters (T416171) -- although I am still digging on that in between writing a KR proposal.

Those graphs are very interesting, because you're looking at enwiki, and Parsoid read views aren't deployed on enwiki yet. So these aren't read views. This is either refreshlinksjobs / linter, or apps/PCS. The increase also happens sharply on 11/27. We deployed a caching fix on 11/25, but I'm pretty sure that would be distinguishable from this spike on 11/27.

@Jgiannelos also noticed a gradual increase in index_view requests to the backend starting at the beginning of November.

> Parsoid looks to be consuming basically all the CPU that mw-jobrunners have to offer:

This may or may not be useful: a flame graph of CPU time on job runners:

The latency trend for this past ~week is even worse:

image.png (1×2 px, 665 KB)

> This diagram might also be interesting:
> https://grafana.wikimedia.org/goto/Aqf_YrHvg?orgId=1
>
> From what I understand, it looks like the sampled requests for profiling (thus the number of requests) for article views increased around the time we see the increase in latency. This means that we get more requests bypassing caches. […]

Screenshot 2026-02-09 at 15.49.32.png (1×2 px, 526 KB)

More requests could imply more purges or higher average latency, but they do not have to.

If the added requests do not specifically target unpopular or uncached pages, then CDN caching, request coalescing, and ParserCache should leave average latency unchanged, or even lower it if the excess traffic has a narrower distribution than the baseline (hitting warm ParserCache/Memc/APCu/DB caches). For example, a small-scale CDN-cachebusting flood on a specific article would lower average latency. (Of course, at higher scales such a flood may saturate k8s/MySQL capacity and slow things down, but this period doesn't show that, so I'm assuming relatively small-scale changes for now: something where we're slower on the CPU rather than slower due to I/O saturation.)
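To illustrate with made-up numbers how added warm-cache traffic can lower the mean rather than raise it:

```python
# (count, mean latency in ms) per traffic class; all numbers are invented.
baseline = [(1000, 250)]   # existing organic traffic
flood    = [(500, 30)]     # repeated hits on one warm, well-cached article

def mean(buckets):
    """Weighted mean latency across traffic classes."""
    total = sum(count for count, _ in buckets)
    return sum(count * ms for count, ms in buckets) / total

print(mean(baseline))           # 250.0
print(mean(baseline + flood))   # ~176.7 -- more traffic, *lower* mean
```

Total volume rose 50%, yet the reported mean dropped, because the extra requests are cheaper than the baseline mix.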

I'm looking mainly at Oct 10 - Dec 12 as the range of the main rise (per the graphs in the task description).

I don't see any increase in time spent in Parsoid. Looking at the new "One component" dashboard, which zooms in on the mediawiki-entrypoint-profiling dashboard (which aggregates flame graphs), we actually see a slight drop in the percentage of time spent in index.php?action=view:

Screenshot 2026-02-09 at 16.03.50.png (1×2 px, 401 KB)

What I see instead is a 10-percentage-point increase in time spent in the legacy Parser:

Screenshot 2026-02-09 at 16.13.57.png (974×2 px, 209 KB)

Screenshot 2026-02-09 at 16.13.14.png (1×2 px, 322 KB)

This was a bit harder to plot, because the folders and namespaces were renamed around Oct 30 (e.g. change 1200040 and related changes), and the dropdown menu only lists current values, not historical ones. The above was ad-hoc edited to include the previous name.

In my opinion, more than one contributing factor may have led to this, each in different ways. Moreover, as we've found out the hard way, traffic patterns have been changing rapidly since last year, the most notable change being the increase in uncached requests.

Strictly speaking, from a capacity perspective (not taking into account information such as the percentage of uncached requests or overall traffic increase), our mw-on-k8s deployments were resized last October, based on some very good math @Scott_French put together.

Running estimate_mw_replicas week by week in an attempt to see the trend:

| Week | mw-web eqiad | mw-web codfw |
| --- | --- | --- |
| 2025-10-06 → 2025-10-13 | 287 | 267 |
| 2025-10-13 → 2025-10-20 | 269 | 275 |
| 2025-10-20 → 2025-10-27 | 256 | 289 |
| 2025-10-27 → 2025-11-03 | 260 | 282 |
| 2025-11-03 → 2025-11-10 | 294 | 287 |
| 2025-11-10 → 2025-11-17 | 315 | 349 |
| 2025-11-17 → 2025-11-24 | 332 | 327 |
| 2025-11-24 → 2025-12-01 | 296 | 326 |
| 2025-12-01 → 2025-12-08 | 307 | 374 |
| 2025-12-08 → 2025-12-15 | 355 | 399 |
| 2025-12-15 → 2025-12-22 | 290 | 389 |
| 2025-12-22 → 2025-12-29 | 322 | 406 |
| 2025-12-29 → 2026-01-05 | 356 | 382 |
| 2026-01-05 → 2026-01-12 | 315 | 387 |
| 2026-01-12 → 2026-01-19 | 314 | 382 |
| 2026-01-19 → 2026-01-26 | 305 | 382 |
| 2026-01-26 → 2026-02-02 | 315 | 388 |
| 2026-02-02 → 2026-02-09 | 313 | 407 |

One of the reasons we have ended up here could be that we need more capacity. I would suggest we consider adding some, until we figure out what else might not be right or has room for improvement.

I have a scenario that might be interesting, where a combination of Parsoid, the REST API, and CDN purges has a cascading effect; I am trying to reproduce it. I believe it's worth checking what happens when:

  • We get cache misses because of the TTL enforced by magic words
  • The requests don't come from read views but from API calls that eventually hit Parsoid, regardless of the rollout of read views in production
    • even though read views on enwiki use the legacy parser, API calls use Parsoid
  • CDN entries for those requests are purged/evicted at a faster rate
  • Backend load creates backpressure on read views

What is unclear in this scenario is whether reparses occurring in the cluster can impact web backend latency. I am not entirely sure in which part of the infrastructure we run reparses, and whether that's a job in the job queue or part of the request/response cycle.

A slightly off-topic question: when I see Cache expiry: 1807, is it possible to identify the reason for it? I suspect it comes from the MIN_DEADLINE_TTL (30*60=1800) in CoreMagicVariables.php. The only method I have found so far is bisecting a page's source code and inspecting the NewPP report by previewing.

An instance I found on zhwiki (template) is caused by an invocation of {{#time:}}, resulting from gerrit #1201838.
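One plausible (unverified) reading of a value like 1807: a deadline-based TTL that caches until the next boundary (e.g. UTC midnight) but never for less than MIN_DEADLINE_TTL, observed when the deadline happens to be just over 30 minutes away. A sketch of that model; the clamp behaviour is an assumption, not the actual CoreMagicVariables.php logic.

```python
import datetime

MIN_DEADLINE_TTL = 30 * 60  # 1800 s, mirroring the constant in CoreMagicVariables.php

def deadline_ttl(now, deadline, min_ttl=MIN_DEADLINE_TTL):
    """Assumed model: cache until the deadline, but never less than min_ttl."""
    remaining = int((deadline - now).total_seconds())
    return max(remaining, min_ttl)

utc = datetime.timezone.utc
midnight = datetime.datetime(2026, 2, 19, 0, 0, 0, tzinfo=utc)

# Parse happening 30 min 7 s before UTC midnight:
now = datetime.datetime(2026, 2, 18, 23, 29, 53, tzinfo=utc)
print(deadline_ttl(now, midnight))  # 1807 -- just above the 1800 floor

# Parse happening 10 min before midnight: clamped up to the floor
print(deadline_ttl(datetime.datetime(2026, 2, 18, 23, 50, 0, tzinfo=utc), midnight))  # 1800
```

Under this model, values exactly at 1800 are clamped, while values slightly above it (like 1807) are genuine time-to-deadline readings.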

Mentioned in SAL (#wikimedia-operations) [2026-02-18T00:58:21Z] <Krinkle> Edit Module:Date on various wikis in attempt to mitigate T416616, T416540. Details at https://phabricator.wikimedia.org/T416616#11625838.