
1.41.0-wmf.7 increases Memcached call rate by +100%
Closed, Resolved · Public · PRODUCTION ERROR

Description

https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=now-8d&to=now

Memcached cache miss rate increases by > 100%:

  • SqlBlobStore_blob
  • page_content_model
  • revision_row_1_29

Screenshot 2023-05-05 at 02.59.38.png · Screenshot 2023-05-05 at 03.08.49.png · Screenshot 2023-05-05 at 03.09.47.png

Memcached cache hit rate increases:

  • SqlBlobStore_blob + 50%
  • page_content_model + > 100%

Screenshot 2023-05-05 at 03.12.02.png · Screenshot 2023-05-05 at 03.11.10.png · Screenshot 2023-05-05 at 02.59.41.png

Those are huge numbers to hold and sustain throughout (not a short burst). This warrants investigation, as it is likely the result of something having broken how pages are cached, or of calls accidentally being added in a critical path somewhere.

My current guess is that the rise lines up with this week's first group1+group2 deploys, and then with the revert for T336008 back to "just" group1.

Event Timeline

  • revision_row_1_29 is generated by RevisionStore::getRevisionRowCacheKey and that is called by RevisionStore::getKnownCurrentRevision
  • SqlBlobStore_blob is generated by SqlBlobStore::getCacheKey, called by SqlBlobStore::getBlob and SqlBlobStore::expandBlob
  • page_content_model is used in WikiPage::getContentModel

I did not find anything directly related to that.
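For context, here is a minimal sketch of the caching pattern behind the key groups listed above (an assumed shape, not the exact core code; the key component and fetchBlobFromDb() are illustrative only). The callback only runs on a cache miss, so anything that bypasses or multiplies these lookups shows up directly in the WANObjectCache hit/miss rates on the dashboard:

```
use MediaWiki\MediaWikiServices;

// Simplified sketch of the getWithSetCallback pattern behind key groups such
// as SqlBlobStore_blob: the first key component becomes the key group that
// Grafana aggregates on, and the callback only runs on a cache miss.
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$blobAddress = 'tt:12345'; // hypothetical blob address for illustration

$blob = $cache->getWithSetCallback(
	$cache->makeGlobalKey( 'SqlBlobStore-blob', $blobAddress ),
	$cache::TTL_WEEK,
	static function () use ( $blobAddress ) {
		// On a miss, fall through to storage; fetchBlobFromDb() is a
		// hypothetical stand-in for the real lookup.
		return fetchBlobFromDb( $blobAddress );
	}
);
```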

@Krinkle, am I wrong in thinking this ought to block wmf.7? (Edit: Sorry, I see it's already a child task.)

brennen triaged this task as Unbreak Now! priority. May 5 2023, 3:46 PM

Raising to UBN! as blocker.

As far as I can tell, the API module is not using RevisionStore::getKnownCurrentRevision or WikiPage::getContentModel.

There is also a change directly to the caches - https://gerrit.wikimedia.org/r/c/mediawiki/core/+/830706 - but that seems to have been tested in production already.

It should be possible to revert the api change.
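For reference, a rough sketch of the conceptual difference that the to-be-reverted change introduced (not the actual API module code; $res stands in for the revision query result, and which per-row path was used before is not asserted here):

```
use MediaWiki\MediaWikiServices;

$revStore = MediaWikiServices::getInstance()->getRevisionStore();

// Per-row construction (one typical alternative): one RevisionRecord per row.
$revisions = [];
foreach ( $res as $row ) {
	$revisions[] = $revStore->newRevisionFromRow( $row );
}

// Batch construction, as introduced by the change being reverted: build all
// RevisionRecords from the result set in one pass, batching the slot and
// content metadata lookups internally.
$status = $revStore->newRevisionsFromBatch( $res );
$revisions = $status->isOK() ? $status->getValue() : [];
```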

Change 915719 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@wmf/1.41.0-wmf.7] Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records"

https://gerrit.wikimedia.org/r/915719

Change 915719 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.7] Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records"

https://gerrit.wikimedia.org/r/915719

Mentioned in SAL (#wikimedia-operations) [2023-05-05T18:42:44Z] <brennen@deploy1002> Started scap: Backport for [[gerrit:915719|Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" (T336008 T336022)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-05T18:44:17Z] <brennen@deploy1002> umherirrender and brennen: Backport for [[gerrit:915719|Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" (T336008 T336022)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-05T18:57:05Z] <brennen@deploy1002> Finished scap: Backport for [[gerrit:915719|Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" (T336008 T336022)]] (duration: 14m 21s)

Narrowed it down a bit.
Here are flamegraphs from when wmf.6 was the majority and from when wmf.7 was the majority:

Went with wall time for obvious reasons.

API quota goes from 34% to 40%, and basically all of that is due to an increase in ApiParse. ApiParse also has a drastically different flamegraph now:

wmf.6:

grafik.png

wmf.7:

grafik.png

Now I need to figure out what caused the shift, but it looks like a change in parsing.

Can't say for sure but https://gerrit.wikimedia.org/r/c/mediawiki/core/+/901690 might have had some unintended consequences.

Either parsing got significantly slower, or the ParserCache started to become fragmented due to TOC flags.

cc. @cscott @matmarex
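To illustrate the fragmentation hypothesis raised here, a toy sketch (not MediaWiki's actual ParserCache code): if a new flag starts being folded into the options hash that forms the cache key, previously stored entries stop matching and effectively become misses until the cache is repopulated under the new keys.

```
// Toy illustration of cache-key fragmentation (not the real ParserCache).
function parserCacheKey( int $pageId, array $options ): string {
	ksort( $options );
	return 'pcache:' . $pageId . ':' . md5( json_encode( $options ) );
}

$before = parserCacheKey( 42, [ 'userlang' => 'en' ] );
$after  = parserCacheKey( 42, [ 'userlang' => 'en', 'toc' => true ] );

var_dump( $before === $after ); // bool(false) — same page, different key
```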

We've got Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" out to wmf.7; train is still at group1. It seems possible this has made a difference?

2023-05-05-13:16:42.png (804×744 px, 258 KB)

...but doesn't seem like root cause?

Looking at the webrequest sample graph, there doesn't seem to be a major spike in calls to action=parse: https://w.wiki/6fcE (remember to multiply the numbers by 128). So calls to action=parse are definitely getting a lot slower in wmf.7.
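(For clarity on the sampling note: that graph is based on a 1:128 sampled request stream, so a rate read off it has to be scaled back up. The value below is hypothetical, just to show the arithmetic.)

```
$sampledPerMinute = 250; // hypothetical value read off the sampled graph
$estimatedPerMinute = $sampledPerMinute * 128; // ≈ 32,000 action=parse requests/min
echo $estimatedPerMinute . "\n";
```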

I can't debug this any further right now; it's quite late here. If anyone wants to take over, feel free. If not, I'll pick it up early Monday.

> I can't debug this any further right now; it's quite late here. If anyone wants to take over, feel free. If not, I'll pick it up early Monday.

Thanks for the assist!

> Can't say for sure but https://gerrit.wikimedia.org/r/c/mediawiki/core/+/901690 might have had some unintended consequences.
>
> Either parsing got significantly slower, or the ParserCache started to become fragmented due to TOC flags.

I don't think that change can cause cache fragmentation. I don't mind if you try reverting it to see whether that fixes the issue, but it seems unlikely to me.

Krinkle lowered the priority of this task from Unbreak Now! to High.

There are rolling reboots of the Codfw memcached hosts this week as part of T335846.
https://sal.toolforge.org/production?p=0&q=%22sre.hosts.reboot%22&d=2023-05-04

These reboots naturally wipe the cache memory of each host as it reboots, one at a time.

This explains the sudden regression in cache misses (which coincidentally overlapped with the train).
It also explains the very slow recovery over the past 48 hours (not correlated with the train or the revert), as the values are slowly being recomputed or backfilled in memcached.
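A toy sketch of why rolling reboots show up this way (assumed key distribution; the real setup with mcrouter differs in detail but not in effect, and the shard count here is hypothetical): keys are spread across the memcached hosts, so each reboot invalidates roughly its share of all keys at once, and those values only come back as traffic recomputes them.

```
// Toy illustration: keys hash to one of N memcached shards, so rebooting a
// single host wipes roughly 1/N of all cached values in one go.
$hosts = 18; // hypothetical number of memcached shards
$keys = [ 'SqlBlobStore_blob:tt:123', 'page_content_model:456', 'revision_row_1_29:789' ];

foreach ( $keys as $key ) {
	$shard = crc32( $key ) % $hosts;
	echo "$key -> shard $shard\n"; // everything on the rebooted shard is now a miss
}
```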

This might also explain the regression in ApiParse latency that @Ladsgroup showed, as parsing would naturally suffer when there is a decrease in overall cache effectiveness for the various internal values used throughout the process of parsing a page, its templates, its thumbnails, etc.

But this does not explain an increase in total WAN cache calls (it only explains a shift from hits to misses, not an added bulk of new hits or new misses). That remains a mystery, for example the huge increase for key groups such as page_content_model, which is still more than double even now, without any apparent decrease in cache hits.

https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=page_content_model&orgId=1&from=now-7d&to=now

Screenshot 2023-05-05 at 23.14.57.png
Screenshot 2023-05-05 at 23.14.56.png

I suspect this may be a case of layered caching, where a higher layer normally prevents this key from being reached at all, not even as a cache hit. However, we're talking hundreds of thousands of calls per minute, not tens of millions, at a layer that can handle that (Memcached).
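A sketch of the kind of layering meant here, using WANObjectCache's in-process cache option (illustrative only; it is not asserted that page_content_model actually uses pcTTL, and the key shape and loadContentModelFromDb() are hypothetical): while a value sits in the per-process layer, Memcached is not contacted at all and records neither a hit nor a miss, so losing that upper layer shows up as extra Memcached calls rather than just extra misses.

```
use MediaWiki\MediaWikiServices;

// Sketch of layered caching with WANObjectCache's in-process cache (pcTTL).
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$pageId = 42; // hypothetical page ID

$model = $cache->getWithSetCallback(
	$cache->makeKey( 'page-content-model', $pageId ),
	$cache::TTL_MONTH,
	static function () use ( $pageId ) {
		// Hypothetical fallback to the database on a true miss.
		return loadContentModelFromDb( $pageId );
	},
	[ 'pcTTL' => $cache::TTL_PROC_LONG ] // in-process layer in front of Memcached
);
```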

And in terms of ground truth, latency appears unaffected on both the appservers and api_appservers clusters, per https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red.

I'll keep this task open to confirm the ApiParse recovery with @Ladsgroup, and to further understand the page_content_model cache behaviour, as that's currently not well understood (at least not by me). No longer a train blocker.

This comment was removed by Ladsgroup.

That explains it, thanks for the work. I think it's fine to close this given that reads have mostly recovered.