On 2nd Sept ~13:15 UTC, we observed the following anomalies:
- Increase in memcached traffic
  - ~40% in write traffic
  - ~30% in read traffic
- 50% increase in mysql connections https://grafana.wikimedia.org/goto/Q30KNN9Ng?orgId=1
{F65948996}
I am guessing the above have led to an increase in mw p50 latency, especially for mw-web (MediaWiki vs the Database).
The relevant keygroups following the pattern are SqlBlobStore_blob, revision_row_1_29, page
https://grafana.wikimedia.org/goto/jclfNH9HR?orgId=1
What happened?
We originally believed this was related to the deployment of https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181#11138927). After working with @Dreamy_Jazz, @kostajh, @Ladsgroup and @matmarex on Slack (thank you folks!), we were quite convinced that the change was unrelated and that it was simply poor timing.
It turns out that the cause was a change in https://en.wikipedia.org/wiki/Module:Citation/CS1, a module used in over 6.2 million pages (thanks to @Izno for pointing it out).
Other observations (copy pasted from T402181)
Some notable jumps that line up exactly with these increases are a significant jump in mw-web 200s:
Notably, this does not seem to coincide with a significant increase in external requests that we can easily discern.
Digging into this, we can also see a shift in requests that were previously emitting 304s becoming 200s (which, in theory, would in turn lead to increased load and page generation).
Possibly a side-effect: A notable increase in parsoidCachePrewarm jobs in both datacentres:
Following up, as @hnowlan notes, there's a clear inversion between 304 and 200 responses from mw-web starting shortly after 13:10. Further, we don't see a clear source of new additional external traffic that correlates.
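As a minimal sketch of how the 304/200 inversion could be quantified: given per-interval counts of 200 and 304 responses, compare the share of 304s before and after a cutoff. The sample counts and the `share_304` helper below are hypothetical and made up for illustration; they are not taken from the dashboards.

```python
from datetime import time

def share_304(samples, start, end):
    """Fraction of 304s among 200+304 responses with start <= ts < end."""
    n200 = sum(c200 for ts, c200, c304 in samples if start <= ts < end)
    n304 = sum(c304 for ts, c200, c304 in samples if start <= ts < end)
    total = n200 + n304
    return n304 / total if total else 0.0

# Hypothetical (UTC time, 200 count, 304 count) samples, NOT real data.
samples = [
    (time(13, 0), 400, 600),
    (time(13, 5), 400, 600),
    (time(13, 15), 650, 350),
    (time(13, 20), 650, 350),
]

cutoff = time(13, 10)  # the inversion starts shortly after 13:10
before = share_304(samples, time(0, 0), cutoff)
after = share_304(samples, cutoff, time(23, 59))
print(f"304 share before: {before:.2f}, after: {after:.2f}")
```

A drop in the 304 share without a matching rise in total requests would point at conditional requests no longer validating, rather than at new traffic.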
However, if we specifically focus on cache misses at the CDN, we can see a clear correlated increase, seemingly entirely focused on enwiki.
So, something changed in enwiki, causing a visible reduction in CDN cache hit rate (change to a common template?). This too would be consistent with the influx of parsoidCachePrewarm jobs.
Unless there's something very subtle here that I'm missing, given the limited scope of wikis to which https://gerrit.wikimedia.org/r/1180532 is applicable (which does not include enwiki), I'm fairly confident the temporal correlation is an unfortunate coincidence.
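The per-wiki view of CDN cache misses described above can be sketched roughly as follows: take miss counts per wiki for a window before and a window after the change, then rank wikis by the increase. The `miss_delta_by_wiki` helper and all counts are hypothetical, chosen only to illustrate the shape of the comparison.

```python
from collections import Counter

def miss_delta_by_wiki(before, after):
    """Rank wikis by the change in cache misses between two windows."""
    delta = Counter(after)
    delta.subtract(before)  # keeps negative values, unlike the - operator
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical miss counts per wiki, NOT real data.
misses_before = {"enwiki": 1000, "dewiki": 300, "commonswiki": 500}
misses_after = {"enwiki": 1800, "dewiki": 310, "commonswiki": 495}

ranked = miss_delta_by_wiki(misses_before, misses_after)
for wiki, change in ranked:
    print(f"{wiki}: {change:+d}")
```

If a single wiki dominates the top of the ranking, as enwiki does in this made-up sample, that supports looking for a wiki-local change such as an edit to a widely used template or module.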




