
Deprecate and remove $wgCachePrefix
Open, Needs Triage, Public

Description

Background

It is unclear to me what behaviour admins who set this variable expect from it, and I suspect that those who do have expectations for it are in fact not getting them. (Which hopefully means they avoid it, but more likely means they're experiencing issues without realising it or its cause.)

I guess at a basic level, this (optional) configuration variable overrides the keyspace used by BagOStuff instances for cache keys, which otherwise defaults to the Wiki ID (usually identical to the string form of the DB Domain).
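
For illustration, a minimal LocalSettings.php sketch of what that override looks like (the wiki ID and the key format shown in the comments are assumptions for the example):

```php
// LocalSettings.php sketch, assuming a wiki whose Wiki ID is "examplewiki".

// Default: $wgCachePrefix is false, so local cache keys are prefixed with the
// Wiki ID, e.g. "examplewiki:messages:en".
// $wgCachePrefix = false;

// Override: local BagOStuff keys for this instance now start with "example-dev"
// instead of the Wiki ID, e.g. "example-dev:messages:en".
$wgCachePrefix = 'example-dev';
```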

The original justification for this comes from r97468 and r105523 (cc @tstarling, @aaron).

@Nikerabbit wrote in September 2011:

I'm overriding wfWikiID to run multiple instances on the same database
but with different settings, and I don't want them to mess with each other's caches.

Problem

The expectation to separate caches in this manner falls apart when you consider global cache keys (previously known as "shared" or "foreign" cache keys). These use a constant global prefix, and generally include the Wiki ID as one of the key segments if the information varies by wiki.
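
A sketch of the distinction using BagOStuff's key helpers (key layout details here are assumptions; the point is only that makeGlobalKey() does not use the per-wiki keyspace):

```php
$cache = ObjectCache::getLocalClusterInstance();
$pageId = 123; // example value

// Local key: prefixed with this wiki's keyspace, i.e. the Wiki ID or
// $wgCachePrefix if set. Two instances with different prefixes would get
// separate copies of this entry.
$localKey = $cache->makeKey( 'example-feature', $pageId );

// Global ("shared"/"foreign") key: uses a constant prefix rather than the
// per-wiki keyspace, so $wgCachePrefix does not separate it between the two
// instances. Callers that need per-wiki variance add the Wiki ID as a key
// segment themselves:
$globalKey = $cache->makeGlobalKey( 'example-feature', WikiMap::getCurrentWikiId(), $pageId );
```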

At a fundamental level, I believe MediaWiki does not support this kind of separation, with cache keys perhaps being merely the most noticeable area where it breaks down. In general, I think, we assume the Wiki ID to be globally unique within the scope of things that MediaWiki internally interacts with (db, caches, file system, etc.).

As such, to facilitate something like this, the two wikis should be given distinct wiki IDs. Or, if one really wants to have identical clusters operating independently, where two wikis have the same db name (and thus presumably have dedicated and separate MySQL servers), then they should also have separate Memc instances. For example, one could invoke memcached twice on the same server with different ports.
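
As a rough sketch of that last option (ports and memory sizes are made up for illustration):

```php
// Start a second memcached daemon on its own port, e.g.:
//   memcached -d -p 11211 -m 400   # instance used by wiki A
//   memcached -d -p 11212 -m 100   # instance used by wiki B

// Both wikis use memcached as their main cache:
$wgMainCacheType = CACHE_MEMCACHED;

// Wiki A's LocalSettings.php:
$wgMemCachedServers = [ '127.0.0.1:11211' ];

// Wiki B's LocalSettings.php:
$wgMemCachedServers = [ '127.0.0.1:11212' ];
```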

Some additional background in T269226 (currently restricted).

Event Timeline

FWIW, I long ago stopped the practice of running two wikis on the same database but with different code (it was useful for development and debugging).

Something like this could be useful to avoid split-brain issues with our deployment canary setup (we only have a single server), but we are not currently using this variable or other means of separation. Does WMF's canary run a separate memcached?

For development/testing, WMF mainly has the Beta Cluster, which is indeed a logically separate farm (and thus can re-use the same db names, as the backends are entirely separate).

The canary servers (mwdebug for manual canary testing, and a handful of other mw/app servers for automated promotion via Scap) are fully participating production servers with the only difference being the version of the MW software. They don't have a copy of, or use, separate caching infrastructure. This is not a choice or trade-off but a logical requirement. If a server that isn't read-only can talk to one part of production, it must talk to all parts of production, as otherwise e.g. one could save an edit to the real database from a canary but then fail to queue, purge, invalidate, or otherwise update secondary data stores that other app servers expect to have changed.

Fully separate with the same db name, or fully joined with different db names, are both supported ways of running MW, and fairly easy to set up if the scale is small. It doesn't necessarily require more hardware or virtual resources. I suppose running a second memcached will need a bit of RAM, but then again, if one were to otherwise share the same instance, one can also take a bit of RAM from one instance (e.g. 400M and 100M rather than 2x 500M).

The canary servers are fully participating production servers with the only difference being the version of the MW software.

Or the configuration when doing SWATs. That is the case for the translatewiki.net canary as well. This can still lead to issues if cache contents are not compatible across both.

The canary servers are fully participating production servers with the only difference being the version of the MW software.

Or the configuration when doing SWATs.

s/version of the MW software/the MediaWiki deployment directory/; it does indeed include configuration files and other docroot files.

This can still lead to issues if cache contents are not compatible across both.

I don't follow. What kind of compatibility scenario are you thinking about? Any code run on a production app server, canary, debug or regular naturally interacts with production databases and memcached. This is expected and required (it's not a choice, since the requirement goes in both directions from and to the rest of prod). If something is not forward or backward compatible in this manner, it must not be deployed to a production server by any means. Not via SWAT, not via train, and not on a canary.

This is expected and required

But not always easy to notice before being tested on the canary (or even then, if you don't hit affected code paths). I have seen this multiple times with Translate, especially when a cache value contains serialized references to a class we have moved or changed. We are migrating away from PHP serialization, but there are still a few left.
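
A minimal illustration of that failure mode (the class, key, and values are hypothetical): PHP's serialize() embeds the class name in the cached blob, so code where the class has since been moved or renamed cannot turn it back into a usable object.

```php
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'example-group-stats', 'en' );

// Old code (still running on the non-canary servers) cached an object:
//   $cache->set( $key, new ExampleGroupStats( /* ... */ ) );  // stored via PHP serialize()

// New code (running on the canary), where ExampleGroupStats has been renamed,
// reads it back as a __PHP_Incomplete_Class placeholder; any property access
// on it then fails.
$value = $cache->get( $key );

// Storing plain arrays (or otherwise JSON-serializable data) avoids coupling
// cache contents to class names:
//   $cache->set( $key, [ 'translated' => 10, 'total' => 25 ] );
```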

Another issue is that our message group configuration changes directly affect what is stored in the cache, so we have the canary and non-canary competing to rebuild the cache entry. I haven't yet figured out how to avoid that.

OK, that is no longer related to this task. But since you're asking, I would have two recommendations for that.

  1. Test it out in a beta-like environment, not connected to live. You can freely issue a full Memc wipe there, or a more fine-grained purge if the feature in question supports that from web or CLI in some way. By "beta-like" I mean that the database tables, memc, apcu etc. are all separate, not just one or some of them.
  2. Accept that the rebuild will affect other production servers not yet having the code (shared memc), and then test the feature in a way that your canary request is doing the rebuild. For example, when doing last-mile verification of parser changes in production, I perform a null edit, purge, or regular edit on a page using the canary, which bypasses the cache. Likewise, other features may also have a way to bypass the cache.

Alternatively, if you liked how changing wgCachePrefix behaved previously and are confident that this unsupported scenario with a split-brain Memcached will not cause issues in your particular canary test, then that can be accomplished by starting a second memcached daemon on another port and varying wgMemCachedServers instead of wgCachePrefix. As said, this is not supported, and can cause problems if you do any kind of non-idempotent request on the canary, or are interacting with any kind of shared memc keys, since those updates would not go to the other server, and likewise the canary would not receive updates from the other server.
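
A sketch of what that could look like (hostname, port, and memory size are made up; and again, this split-brain setup is unsupported):

```php
// LocalSettings.php sketch: the canary host talks to its own memcached daemon,
// e.g. one started with `memcached -d -p 11212 -m 100`, while all other
// servers keep using the shared instance.
if ( gethostname() === 'canary-host-01' ) {
    $wgMemCachedServers = [ '127.0.0.1:11212' ];
} else {
    $wgMemCachedServers = [ 'memc-shared.example.internal:11211' ];
}
```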

To clarify, I'm not currently using wgCachePrefix, and I am not opposed to removing it. I was just thinking out loud about the problem space.