
Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages
Closed, Resolved · Public

Description

Problem

MessageCache::load uses the WANObjectCache (Memcached for WMF) via MessageCache::saveToCaches() to save several different cache keys (the hash, the check key, and key:message blob) on a per-language basis. There is $wgMaxMsgCacheEntrySize to prevent the value from becoming too large, but that only focuses on large messages within the blob, not on the total blob size. Even with gzip, we have blobs approaching 800KB (metawiki,en). Although most hits should come from APC anyway, if the blob ends up too big (larger than the 1MB memc limit, thus un-settable), then edits to MW: pages would cause various problems.

Disaster scenario

As of writing, this problem mainly affects Meta-Wiki. The other two consumers of MediaWiki namespace pages for content that is not "interface message overrides" (site scripts, and gadgets), are important to think about, but are minor in comparison to the years of building up CentralNotice banners, variations, and translations.

The below is what would happen if the 1MB size were to be exceeded.

  • All servers will block on a global lock to de-duplicate regeneration effort for a value only storable locally in APC. They will log 'global cache is presumed expired' around purge time and 'global cache is empty' afterwards. Blocking on getReentrantScopedLock() will be a waste unless that thread was from the server in question itself. If not, there will be another iteration in the loop, which will either block again or reach loadFromDB(). In the former case, $failedAttempts is spent and the $staleValue (from APC) is used.
  • If there is no APC value at all (not just expired), then there would be some slow requests doing regeneration, as well as many more requests failing to load anything for the MessageCache instance, logged as 'waited for other thread to complete'. This is due to the stampede protection from the non-blocking getReentrantScopedLock() call (combined with global key failure).
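The fallback order in the two bullets above can be illustrated with a heavily simplified sketch. This is not the MessageCache code: it is written in Python rather than PHP, it leaves out the blocking wait and the retry loop from the first bullet, and load_messages, stale_value and regenerate are hypothetical names chosen for illustration.

```python
import threading


def load_messages(lock, stale_value, regenerate):
    """Return (messages, how) following a simplified fallback order.

    Hypothetical stand-in for the behaviour described in the task,
    not the real MessageCache::load().
    """
    # Non-blocking attempt: only one caller regenerates at a time.
    if lock.acquire(blocking=False):
        try:
            return regenerate(), "regenerated"
        finally:
            lock.release()
    # Someone else holds the lock: prefer a stale local copy if there is one.
    if stale_value is not None:
        return stale_value, "stale"
    # No lock and no local copy: this request gets no messages at all.
    return None, "failed"


if __name__ == "__main__":
    lock = threading.Lock()
    print(load_messages(lock, None, lambda: {"mainpage": "Home"}))
```

The last branch corresponds to the second bullet: with no local copy and the lock held elsewhere, the request ends up without any messages.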

Solutions

  • Automatic shrink: It might be useful to check the whole size of the message name/text map and if it's too big, then some items (largest first) would use the individual key logic.
  • Limit to localisation overrides: It is also worth considering whether a message key appearing in the title of a MW: page is defined in i18n code and overridden, whether the message is arbitrary, or whether the name is dynamic and used by some extension (e.g. messages with a magic prefix). Those could perhaps be cached differently, always or at least when the combined blob can't fit everything.
  • Reduce time to rebuild blob: It would also be nice if unchanged keys could avoid all of the ExternalStore fetch logic by having page_latest in the cache (or an integer version of page_touched).
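As a rough illustration of the "automatic shrink" idea, the sketch below evicts the largest messages from the map until the compressed blob fits under a limit. It is an assumption-laden stand-in, not a proposed patch: it is Python rather than PHP, it uses JSON plus gzip in place of MediaWiki's own serialisation, and blob_size and shrink_blob are hypothetical names. The evicted keys are the ones that would fall back to the individual key logic.

```python
import gzip
import json

# The 1MB memcached value limit mentioned in the task description.
MEMCACHED_LIMIT_BYTES = 1024 * 1024


def blob_size(messages):
    """Size in bytes of the gzip-compressed, serialised message map.

    JSON is only a stand-in for the real serialisation format.
    """
    raw = json.dumps(messages, sort_keys=True).encode("utf-8")
    return len(gzip.compress(raw))


def shrink_blob(messages, limit=MEMCACHED_LIMIT_BYTES):
    """Evict the largest messages until the blob fits under the limit.

    Returns (kept, evicted): the map that stays in the blob and the list
    of keys that would be cached individually instead.
    """
    kept = dict(messages)
    evicted = []
    # Largest text first, as suggested in the task.
    for key in sorted(messages, key=lambda k: len(messages[k]), reverse=True):
        if blob_size(kept) <= limit:
            break
        del kept[key]
        evicted.append(key)
    return kept, evicted
```

Re-compressing the whole map after every eviction is wasteful. A real implementation would more likely track an approximate running size.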

Event Timeline

aaron created this task. Apr 27 2018, 8:11 PM
Restricted Application added a subscriber: Aklapper. Apr 27 2018, 8:11 PM
Krinkle updated the task description. Apr 27 2018, 8:25 PM
Krinkle moved this task from Untriaged to MessageCache on the MediaWiki-Cache board.
aaron updated the task description. Apr 27 2018, 8:35 PM

In particular, CentralNotice messages being stored in the MediaWiki namespace have the potential to overrun the size of a value.

Vvjjkkii renamed this task from Handle large MessageCache key values in memcached to o2daaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
1339861mzb renamed this task from o2daaaaaaa to Handle large MessageCache key values in memcached. Jul 1 2018, 5:33 PM
1339861mzb updated the task description.

I'm re-summarising this with the perspective of recent tasks T203925 and T205563 (restricted).

Problem: MessageCache currently has no limit on how many messages it will add to the messages blob. It will create a serialised blob containing the revision text of all pages that exist in the "MediaWiki:" namespace on the local wiki. The only thing we exclude is individual pages beyond a certain size. But it is still growing uncontrollably due to lots of small/medium-size pages adding up.

Requirements: The MessageCache was modelled after LocalisationCache: All messages together in a single blob. For the interface, this is important because we cannot afford a round-trip to anything external for every interface message used by the skin. This would otherwise incur an unacceptable time cost on the fetching of messages (with Memcached, that would be 300 messages x 1ms = 300ms delay just for fetching text, never mind wikitext expansion of each, and the actual content the user asked for).

  • Bad news: "All pages in the MediaWiki-namespace" and "Local overrides for interface messages" are not the same thing. Three notable exceptions: Site scripts (Common.js, Vector.css, Group-sysop.js etc.), Gadgets (Gadget-example.js), and CentralNotice banner partials (Centralnotice-example-subexample).
  • Good news: For the use case of interface messages, the total size of 1MB suffices on all our wikis. We may need to shard it in the future, but we don't currently have a need for that, and I expect we'd still be years away from having a wiki have so many overrides for that many messages (if that does ever happen, we should also seriously consider why because that seems like it might be a side-effect of a defect elsewhere).

Proposal: The main problem here is that the wildcard query MessageCache performs is proactively fetching and caching pages that don't need to be, and don't want to be, cached. As such, I propose we stop using MessageCache for this, and instead restrict MessageCache only to interface message overrides. In other words, keys that exist in LocalisationCache.

  • When ResourceLoader creates HTTP responses, it already does batch queries against Revision storage, and doesn't need MessageCache. It also only needs a small number of pages for each HTTP request. Not every page of every gadget ever created.
  • When CentralNotice creates an individual HTTP response for a banner, it too only involves a small number of messages. We don't need every translation of every variation of every banner we've had in the last 15 years preloaded into memory. The revision text storage already has a cache layer in Memcached, which we can use directly.
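The proposed restriction can be sketched as a filter over MediaWiki-namespace page titles. The code below is an illustration in Python, not the patch that was eventually merged. The helper names, the language-suffix pattern, and the idea of passing the LocalisationCache keys in as a plain set are all assumptions made for the example.

```python
import re

# Assumed shape of a "/<language code>" suffix, e.g. "Mainpage/de".
_LANG_SUFFIX = re.compile(r"/[a-z0-9-]{2,}$")


def message_key_from_title(title):
    """Map a page title such as 'Mainpage/de' to a message key such as 'mainpage'."""
    name = title[:1].lower() + title[1:]
    return _LANG_SUFFIX.sub("", name)


def split_cacheable(titles, known_keys):
    """Split titles into interface message overrides and everything else.

    known_keys stands in for the set of keys defined in LocalisationCache.
    """
    cacheable, other = [], []
    for title in titles:
        if message_key_from_title(title) in known_keys:
            cacheable.append(title)
        else:
            other.append(title)
    return cacheable, other


if __name__ == "__main__":
    titles = ["Mainpage", "Mainpage/de", "Common.js", "Gadget-example.js",
              "Centralnotice-example-subexample"]
    print(split_cacheable(titles, {"mainpage"}))
```

Under this sketch, site scripts, gadgets and CentralNotice banner partials land in the second list and stay out of the blob, which is the outcome the proposal asks for.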
Krinkle renamed this task from Handle large MessageCache key values in memcached to Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages. Sep 27 2018, 8:34 PM
Krinkle updated the task description.

Change 463376 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] Document some understanding of MessageCache in RawAction/EditPage

https://gerrit.wikimedia.org/r/463376

Change 463377 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] EditPage: Remove fake "Edit" label when creating a message override

https://gerrit.wikimedia.org/r/463377

Change 463378 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] Skin: Remove 'usemsgcache' and counter-productive getDynamicStylesheetQuery

https://gerrit.wikimedia.org/r/463378

Change 463376 merged by jenkins-bot:
[mediawiki/core@master] Document some understanding of MessageCache in RawAction/EditPage

https://gerrit.wikimedia.org/r/463376

Change 463378 merged by jenkins-bot:
[mediawiki/core@master] skins: Remove 'usemsgcache' and deprecate getDynamicStylesheetQuery

https://gerrit.wikimedia.org/r/463378

Agabi10 added a subscriber: Agabi10. Oct 1 2018, 8:25 AM
Ejegg added a subscriber: Ejegg. Oct 2 2018, 2:50 PM

Change 464045 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] messagecache: avoid caching message pages that do not override anything

https://gerrit.wikimedia.org/r/464045

aaron claimed this task. Oct 3 2018, 8:53 PM

Change 464398 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] MessageCache: do not store the EXCESSIVE array as it is only needed for HASH

https://gerrit.wikimedia.org/r/464398

Change 464398 merged by jenkins-bot:
[mediawiki/core@master] MessageCache: do not store the EXCESSIVE array as it is only needed for HASH

https://gerrit.wikimedia.org/r/464398

Change 464713 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] messagecache: use MergeableUpdate for the deferred replace() update

https://gerrit.wikimedia.org/r/464713

Change 464713 merged by jenkins-bot:
[mediawiki/core@master] messagecache: use MergeableUpdate for the deferred replace() update

https://gerrit.wikimedia.org/r/464713

Change 464045 merged by jenkins-bot:
[mediawiki/core@master] messagecache: avoid caching message pages that do not override

https://gerrit.wikimedia.org/r/464045

I'm posting here instead of at a separate task, because I can't decipher whether this is a regression, side effect, bug or anything else. As a result of a3d6c1411dad, a lot of interface messages (as in, messages actually used in the interface) are no longer cached. This results in a whopping amount of 172 database queries for interface messages on Special:Version on MediaWiki-Vagrant using MW master a3d6c1411dad or newer. Compared to the 10 there were before, this is a 1620% increase. As every query ends up in the debug log, both the Query overview and debug log tab of the debug toolbar have become rather difficult to use.

aaron added a comment. Oct 15 2018, 2:53 PM

I'm posting here instead of at a separate task, because I can't decipher whether this is a regression, side effect, bug or anything else. As a result of a3d6c1411dad, a lot of interface messages (as in, messages actually used in the interface) are no longer cached. This results in a whopping amount of 172 database queries for interface messages on Special:Version on MediaWiki-Vagrant using MW master a3d6c1411dad or newer. Compared to the 10 there were before, this is a 1620% increase. As every query ends up in the debug log, both the Query overview and debug log tab of the debug toolbar have become rather difficult to use.

Yeah, that should happen. The second loadCachedMessagePageEntry() call is not properly optimized.
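One way to read the problem is that a lookup for a software-defined key with no local override page should not need its own database query, because the blob already holds every such override. The sketch below illustrates that short-circuit. It is a guess at the general shape in Python, not the code from the follow-up patch, and MessageLookup, overridable_keys and fetch_page are hypothetical names. It also ignores the separate handling of oversized entries.

```python
class MessageLookup:
    """Toy model of an override lookup with a per-key fallback.

    Hypothetical illustration only, not MessageCache::getMsgFromNamespace().
    """

    def __init__(self, overridable_keys, blob, fetch_page):
        self.overridable = set(overridable_keys)  # keys defined by the software
        self.blob = dict(blob)                    # cached overrides of those keys
        self.fetch_page = fetch_page              # slow per-page lookup (DB or cache)
        self.individual_lookups = 0

    def get_override(self, key):
        if key in self.blob:
            return self.blob[key]
        if key in self.overridable:
            # The blob contains every override of a software-defined key,
            # so a miss here means no override exists: skip the slow path.
            return None
        # Arbitrary pages (banners, gadgets, ...) are looked up individually.
        self.individual_lookups += 1
        return self.fetch_page(key)
```

With a check like this, only names outside the software-defined set reach the per-page lookup, so a page that uses nothing but standard interface messages triggers no individual queries.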

Change 467402 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] messagecache: check overridable message array in getMsgFromNamespace()

https://gerrit.wikimedia.org/r/467402

Change 467818 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.26] messagecache: check overridable message array in getMsgFromNamespace()

https://gerrit.wikimedia.org/r/467818

Change 467402 merged by jenkins-bot:
[mediawiki/core@master] messagecache: check overridable message array in getMsgFromNamespace()

https://gerrit.wikimedia.org/r/467402

Change 467818 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.26] messagecache: check overridable message array in getMsgFromNamespace()

https://gerrit.wikimedia.org/r/467818

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:00:18Z] <krinkle@deploy1001> Synchronized php-1.32.0-wmf.26/includes/cache/: T193271 - I25aa0e27200a0 (duration: 01m 01s)

aaron closed this task as Resolved. Oct 19 2018, 3:33 AM

Change 463377 abandoned by Krinkle:
EditPage: Remove fake "Edit" label when creating a message override

Reason:
The idea of using "remote content" principles from the UI perspective makes a ton of sense, but I don't have time to implement it. It was an itch I wanted to scratch, but will leave for someone else or my future self to pick up instead.

https://gerrit.wikimedia.org/r/463377