
Increased response times linked to Module:Citation/CS1 changes
Closed, Resolved · Public

Assigned To
None
Authored By
jijiki
Sep 3 2025, 10:32 AM
Referenced Files
Restricted File
Sep 3 2025, 10:32 AM
F65948862: image.png
Sep 3 2025, 10:32 AM
F65948860: image.png
Sep 3 2025, 10:32 AM

Description

On 2nd Sept ~13:15 UTC, we observed the following anomalies:

  • Increase in memcached traffic:
      • ~40% in write traffic: image.png (548×1 px, 97 KB)
      • ~30% in read traffic: image.png (534×1 px, 100 KB)

{F65948996}

I am guessing the above led to an increase in mw p50 latency, especially on mw-web (MediaWiki vs. the database).

The relevant keygroups that follow this pattern are SqlBlobStore_blob, revision_row_1_29, and page:
https://grafana.wikimedia.org/goto/jclfNH9HR?orgId=1

What happened?

We originally believed this was related to the deployment of https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181#11138927). After working with @Dreamy_Jazz, @kostajh, @Ladsgroup and @matmarex on Slack (thank you folks!), we were quite convinced that the change was unrelated and that the overlap was simply poor timing.

It turns out that the cause was a change in https://en.wikipedia.org/wiki/Module:Citation/CS1, a module used in over 6.2 million pages (thanks to @Izno for pointing it out).

Other observations (copy-pasted from T402181)

@hnowlan

One notable change that lines up exactly with these increases is a significant jump in mw-web 200s:

image.png (1×2 px, 147 KB)

Notably, this does not seem to coincide with a significant increase in external requests that we can easily discern.

Digging into this, we can also see requests that previously returned 304s now returning 200s (which, in theory, would in turn lead to increased load and page generation):

image.png (602×1 px, 72 KB)
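To illustrate why a 304 can flip to a 200 without any new external traffic, here is a minimal sketch of standard HTTP conditional-request handling (`If-Modified-Since`). The function name, the `page_touched` parameter, and all timestamps are hypothetical and for illustration only. The idea that an edit to a widely used module bumps a per-page "last touched" time is an assumption here, not something this task confirms.

```python
from datetime import datetime, timezone

def response_status(if_modified_since, page_touched):
    """Decide between 304 and 200 for a conditional GET.

    if_modified_since: timestamp of the client's cached copy, or None
    page_touched: hypothetical "last invalidated" timestamp of the page
    """
    if if_modified_since is None:
        # Unconditional request: always send the full response.
        return 200
    # The cached copy is still valid only if the page has not been
    # touched since the client fetched it.
    return 304 if page_touched <= if_modified_since else 200

# Hypothetical timestamps, chosen only to illustrate the flip.
client_copy = datetime(2025, 9, 2, 12, 0, tzinfo=timezone.utc)
touched_before_edit = datetime(2025, 9, 1, 8, 0, tzinfo=timezone.utc)
touched_after_edit = datetime(2025, 9, 2, 13, 10, tzinfo=timezone.utc)

print(response_status(client_copy, touched_before_edit))  # 304
print(response_status(client_copy, touched_after_edit))   # 200
```

Under that assumption, the same set of clients sending the same conditional requests would start receiving full 200 responses, which matches the shape of the inversion seen here.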

Possibly a side-effect: A notable increase in parsoidCachePrewarm jobs in both datacentres:

image.png (457×2 px, 91 KB)

@Scott_French

Following up, as @hnowlan notes, there's a clear inversion between 304 and 200 responses from mw-web starting shortly after 13:10. Further, we don't see a clear source of new additional external traffic that correlates.

However, if we specifically focus on cache misses at the CDN, we can see a clear correlated increase, seemingly entirely focused on enwiki.

So, something changed in enwiki, causing a visible reduction in CDN cache hit rate (change to a common template?). This too would be consistent with the influx of parsoidCachePrewarm jobs.

Unless there's something very subtle here that I'm missing, given the limited scope of wikis to which https://gerrit.wikimedia.org/r/1180532 is applicable (which does not include enwiki), I'm fairly confident the temporal correlation is an unfortunate coincidence.
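The per-wiki comparison of CDN cache misses described above can be sketched roughly as follows. This is only an illustration: the record format (`(wiki, cache_status)` tuples), the function name, and the sample data are all made up, and do not reflect the actual CDN logs or dashboards used in the investigation.

```python
from collections import defaultdict

def miss_rate_by_wiki(requests):
    """Return {wiki: fraction of requests that were cache misses}.

    requests: iterable of (wiki, cache_status) tuples, where
    cache_status is "hit" or "miss" (hypothetical record format).
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for wiki, cache_status in requests:
        totals[wiki] += 1
        if cache_status == "miss":
            misses[wiki] += 1
    return {wiki: misses[wiki] / totals[wiki] for wiki in totals}

# Synthetic sample windows before and after the change (not real data).
before = [("enwiki", "hit")] * 3 + [("enwiki", "miss")] \
       + [("dewiki", "hit")] * 3 + [("dewiki", "miss")]
after = [("enwiki", "hit")] * 2 + [("enwiki", "miss")] * 2 \
      + [("dewiki", "hit")] * 3 + [("dewiki", "miss")]

rates_before = miss_rate_by_wiki(before)
rates_after = miss_rate_by_wiki(after)

# Change in miss rate per wiki; a wiki-specific jump stands out.
delta = {wiki: rates_after[wiki] - rates_before[wiki] for wiki in rates_before}
print(delta)  # {'enwiki': 0.25, 'dewiki': 0.0}
```

A jump confined to one wiki, as in this synthetic example, points at an on-wiki change rather than a deployment that affects many wikis at once.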

Event Timeline

jijiki renamed this task from "Increased latency on mw-web, mw-api-int, mw-ext-api" to "Increased response times linked to Module:Citation/CS1 changes". Sep 3 2025, 10:34 AM
jijiki closed this task as Resolved.
jijiki triaged this task as High priority.
jijiki updated the task description.