Page MenuHomePhabricator

Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc)
Closed, ResolvedPublic

Description

Background

During the edit process, Extension:ConfirmEdit temporarily stores a small captcha token. The retention is set to 30min, but in practice we delete it proactively after a few seconds once the captcha has been completed.

In 2019, as part of the MainStash audit for Multi-DC, we determined that Captcha data can remain in the MainStash interface and move along with the rest from MainStash Redis to MainStash DB.

In 2022, we added a new conceptual interface to MediaWiki in the WMF config, named mcrouter-primary-dc. This is currently adopted for:

  • MediaWiki core rate limiter (PingLimiter, via WRStatsFactory and wgStatsCacheType)
  • AbuseFilter profiling stats (via WRStatsFactory and wgStatsCacheType)
  • CentralAuth tokens (via wgCentralAuthTokenCacheType)

Captcha data resides in MainStash DB today, and that's okay from a functional perspective and in terms of load/storage on MySQL. It works well enough in practice, but the mcrouter-primary-dc store may be a better fit for this type of highly ephemeral data nowadays.

Rationale

  • Each of the three use cases currently has a novel configuration setting that sysadmins must discover and set "correctly". This contrary to the ethos of "good defaults" and means they are not part of standard explanations, diagrams and overviews such as at https://www.mediawiki.org/wiki/Object_cache. Recognising these three use cases as a genuine need will make these easier to understand and configure, resulting in a better default experience.
  • Captcha data is sometimes read from the secondary DC, with an implicit assumption that data was written just a moment ago in the primary DC and already replicated, or vice versa. There is no guruantee in place that ensures this, making this effectively an undocumented dependency, contrary to the MainStash contract as documented at https://www.mediawiki.org/wiki/Object_cache#Main_stash which states that "reads can potentially be stale".
  • Captcha data is highly ephemeral. The MainStash however is intended to provide "strong persistence", and produces a fair bit of churn on MainStash DB and its binlogs.

Proposal

Introduce a fourth cache interface in MediaWiki, to be documented on https://www.mediawiki.org/wiki/Object_cache and recognised through a services getter:

  1. Local server cache: php-apcu, local server. Optional, defaults to empty.
  2. WAN cache: memcached, local data center. Optional, defaults to empty. (Wraps internal "main cache".)
  3. (To be named): memcached, primary data center. Required, default to main cache, fallback to CACHE_DB.
  4. Main stash: mysql, local data center + replication. Required, default to CACHE_DB.
Work
  • Introduce a new setting in wmf-config to control the new cache service interface.
  • Add a new cache service interface in MediaWiki core, similar to LocalServerCache and MainStash. This will have name (TBD), a configuration variable, a documented set of contract expectations for developers, and must have a non-empty default that meets the contract (probably MainStash, since ClusterCache is not replicated).
  • Document it on https://www.mediawiki.org/wiki/Object_cache
  • Migrate ConfirmEdit tokens to the new service interface.
  • Migrate CentralAuth tokens to the new service interface.
  • Remove workarounds in favour of the new standard interface (wgStatsCacheType ✅, WRStatsFactory ✅, wgCentralAuthTokenCacheType ✅)

Event Timeline

Brainstorming with @aaron and @tstarling.

Purely descriptive names:

  • LightweightPrimaryStash (too long?)
  • PrimaryStash (ambiguous with MainStash?)
  • CanonicalStash (too heavy/long-term?)
  • CentralStash (doesn't represent small/short lived purpose)
  • MemCacheStash (too confusing?)
  • QuickStash (speed isn't part of the contract, it is allowed to be slow)
  • MemoryStash
  • ShortStash
  • TinyStash

Alliteration, with the help of synonyms for ephemeral, temporary, stash, and shelf.

  • ShortStash
  • ShakyShack
  • ActiveAmbry
  • PerishablePlane
  • PortaPot
  • CapriciousCrib

More imaginative, possibly more memorable, less easily confused, and allows us to attach our own meaning to it:

  • LightningStash
  • MagicStash
  • ActiveStash
Krinkle triaged this task as Medium priority.
Krinkle added subscribers: aaron, tstarling.

Change 927763 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/extensions/ConfirmEdit@master] Use WRITE_BACKGROUND in CaptchaCacheStore and rename "cache" to "store"

https://gerrit.wikimedia.org/r/927763

Change 927763 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] Use WRITE_BACKGROUND in CaptchaCacheStore and rename "cache" to "store"

https://gerrit.wikimedia.org/r/927763

Krinkle renamed this task from Switch ConfirmEdit extension from MainStash DB to mcrouter-primary-dc to Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc).Nov 7 2023, 4:57 AM
Krinkle updated the task description. (Show Details)

Change 972379 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] objectcache: Introduce WANObjectStash for mcrouter-primary-dc

https://gerrit.wikimedia.org/r/972379

I feel like the current MainStash could equally or more so be called a WANObjectStash that a one that routes to a single datacenter. Essentially, the main stash now is one that must perform in multiple DCs, sacrificing more consistency in order to do so. It's tricky to come up with a good name though. Maybe:

  • ColocaleStash
  • OneRegionStash
  • LinearStash (e.g. best-effort tries to be linearizable for single-key operations)

It reminds me of the idea in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/524663/50/includes/libs/LightweightObjectStore/OneRegionKeyValueStore.php .

I noticed the patch doesn't mention "Main" in getWANObjectStash. Maybe the existing getMainWANObjectCache() could also be renamed to getWANObjectCache() as another task?

I feel like the current MainStash could equally or more so be called a WANObjectStash that a one that routes to a single datacenter. Essentially, the main stash now is one that must perform in multiple DCs, sacrificing more consistency in order to do so. It's tricky to come up with a good name though. Maybe:

  • ColocaleStash
  • OneRegionStash
  • LinearStash (e.g. best-effort tries to be linearizable for single-key operations)

Yeah, it's hard coming up with a good name for this one without using a name that reveals meaning too much like OneRegionStash etc. I think LinearStash makes sense but we need this store to be at least multi-DC aware. I tried to come up with a name that relates to the currently existing naming we have for other services. Also, something that also comes into mind as a possible name for this service is: LightObjectStash.

It reminds me of the idea in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/524663/50/includes/libs/LightweightObjectStore/OneRegionKeyValueStore.php .

Noted, will have a look at this patch.

I noticed the patch doesn't mention "Main" in getWANObjectStash. Maybe the existing getMainWANObjectCache() could also be renamed to getWANObjectCache() as another task?

Yes, this is something that Timo and I touched on in one of our meetings. Maybe it's something we can file a task for a treat separately.

Maybe SyncObjectStash is an OK name?

Noted, will have a look at this patch.

Note that I think the code comments/concepts would be the main takeaway from that older patch for now. You probably don't want to get hung up in interface hierarchies in your current patch.

Change 974506 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] wmf-config: Introduce setting for "mcrouter-primary-dc"

https://gerrit.wikimedia.org/r/974506

Change 974507 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] MainConfigSchema: Remvove support for StatsCacheType setting

https://gerrit.wikimedia.org/r/974507

Change 974508 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] wmf-config: Remove StatsCacheType (unused)

https://gerrit.wikimedia.org/r/974508

Change 974507 abandoned by D3r1ck01:

[mediawiki/core@master] MainConfigSchema: Remvove support for StatsCacheType setting

Reason:

Squashed into: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972379

https://gerrit.wikimedia.org/r/974507

DAlangi_WMF changed the task status from Open to In Progress.Nov 17 2023, 1:14 PM

Change 974506 merged by jenkins-bot:

[operations/mediawiki-config@master] Set new $wgMicroStashType setting to "mcrouter-primary-dc"

https://gerrit.wikimedia.org/r/974506

Mentioned in SAL (#wikimedia-operations) [2023-11-20T14:05:39Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:974506|Set new $wgMicroStashType setting to "mcrouter-primary-dc" (T336004)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-20T14:06:57Z] <urbanecm@deploy2002> urbanecm and d3r1ck01: Backport for [[gerrit:974506|Set new $wgMicroStashType setting to "mcrouter-primary-dc" (T336004)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-20T14:12:45Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:974506|Set new $wgMicroStashType setting to "mcrouter-primary-dc" (T336004)]] (duration: 07m 06s)

Change 972379 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Introduce MicroStash service for mcrouter-primary-dc

https://gerrit.wikimedia.org/r/972379

Change 975868 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/ConfirmEdit@master] Store: Enable ConfirmEdit to begin writing to MicroStash p.1

https://gerrit.wikimedia.org/r/975868

Change 975874 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/ConfirmEdit@master] Store: Begin using data from the new backend p.2

https://gerrit.wikimedia.org/r/975874

Change 975875 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/ConfirmEdit@master] Store: Delete logic relating to MainStash store

https://gerrit.wikimedia.org/r/975875

Change 975875 abandoned by D3r1ck01:

[mediawiki/extensions/ConfirmEdit@master] Store: Delete logic relating to MainStash store p.3

Reason:

Already done in step 2.

https://gerrit.wikimedia.org/r/975875

Change 975868 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] Store: Enable ConfirmEdit to use MicroStash for captcha storage

https://gerrit.wikimedia.org/r/975868

@Krinkle is it possible to measure the perf impact of this improvement? I was chatting with Derick, but I'm not sure we actually have observability around this. I'm only aware of this dashboard for Bagostuff, but I wonder if there are others

@larissagaulia I don't expect a notable end-user latency improvement as part of this change. Rather, the main impact is expected to take place on the server-side. Specifically, a reduction in read-write database load on the "x2" DB cluster, and a reduction in binlog size by no longer churning so many "set" and "delete" writes on this database.

It is true that captcha storage will be notably faster when moved from MainStash DB via the new memcached-backed MicroStash. The ConfirmEdit extension signals that captchas may be stored using the WRITE_BACKGROUND option, which MainStash DB does not support, but MicroStash does. This means the latency of the captcha writes will effectively be hidden from the response time, and for captcha reads memcached will be faster than sql.

However, there is at most one captcha during create account actions (backend latency avg ~200ms), or edit submissions (frontend latency p75 1,450ms), and the DB write currently take about ~1ms. The perf change there might be hard to notice, but it's worth confirming that. Especially in the event of an unforeseen regression.

Measuring "x2" DB cluster load: https://grafana.wikimedia.org/d/000000278/mysql-aggregated

Measuring Special:CreateAccount latency: https://grafana.wikimedia.org/d/rLMucuf4k/mediawiki-entrypoint-profiling?var-entrypoint=index_CreateAccount

Excimer profile before the change (to compare against later): https://performance.wikimedia.org/excimer/profile/740e641af1f21036

Change 975874 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] Store: Use the MicroStash store only and drop dead code

https://gerrit.wikimedia.org/r/975874

Unblocked per the above commit having been deployed by the train.

Change 974508 merged by jenkins-bot:

[operations/mediawiki-config@master] wmf-config: Remove unused wgStatsCacheType setting

https://gerrit.wikimedia.org/r/974508

Mentioned in SAL (#wikimedia-operations) [2024-01-08T08:17:13Z] <derick@deploy2002> Started scap: Backport for [[gerrit:974508|wmf-config: Remove unused wgStatsCacheType setting (T336004)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-08T08:18:44Z] <derick@deploy2002> derick and d3r1ck01: Backport for [[gerrit:974508|wmf-config: Remove unused wgStatsCacheType setting (T336004)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-08T08:26:24Z] <derick@deploy2002> Finished scap: Backport for [[gerrit:974508|wmf-config: Remove unused wgStatsCacheType setting (T336004)]] (duration: 09m 11s)

DAlangi_WMF updated the task description. (Show Details)

Deployed the config cleanup patch this morning. This can now be resolved and thanks a lot @Krinkle & @aaron for code reviews, guidance and discussions. <3

There is still wgCentralAuthTokenCacheType in CA, needs cleanup as well and removal from wmf-config repo.

Change 988401 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/CentralAuth@master] CentralAuthSessionManager: Use MicroStashType config setting

https://gerrit.wikimedia.org/r/988401

Change 988403 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] wmf-config: Remove unused wgCentralAuthTokenCacheType

https://gerrit.wikimedia.org/r/988403

Change 988401 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] CentralAuthSessionManager: Use MicroStash service for CA tokens

https://gerrit.wikimedia.org/r/988401

Change 988403 merged by jenkins-bot:

[operations/mediawiki-config@master] wmf-config: Remove unused wgCentralAuthTokenCacheType

https://gerrit.wikimedia.org/r/988403

Mentioned in SAL (#wikimedia-operations) [2024-01-22T08:15:54Z] <derick@deploy2002> Started scap: Backport for [[gerrit:988403|wmf-config: Remove unused wgCentralAuthTokenCacheType (T336004)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-22T08:26:56Z] <derick@deploy2002> d3r1ck01 and derick: Backport for [[gerrit:988403|wmf-config: Remove unused wgCentralAuthTokenCacheType (T336004)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-22T08:34:09Z] <derick@deploy2002> Finished scap: Backport for [[gerrit:988403|wmf-config: Remove unused wgCentralAuthTokenCacheType (T336004)]] (duration: 18m 15s)

DAlangi_WMF updated the task description. (Show Details)

Finally 🎉