
CentralAuthUser returning outdated data after user creation
Closed, Resolved · Public · BUG REPORT

Description

In some cases, right after a user is created or autocreated, code like CentralAuthUser::getInstance( $user )->getId() will return 0 (i.e. see the user as unattached on that wiki, even though it's definitely attached).

Two examples are:
T379909: Define where to add code that needs to run after a new central user has been created
T380042: RuntimeException: Global user does not have ID '0'.

There's a separate getPrimaryInstance method, but it's not very clear when it is needed - CentralAuthUser::loadState() tries to load from the primary anyway when there have been recent DB changes. And sometimes using the primary is not an option - in the case of T379909 the lookup happens on a GET request (in the GET-after-POST pattern that finishes user signup).

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

CentralAuthUser has three persistence layers: DB, WAN cache, in-process cache. Since the issue affects subsequent requests, it can't be (only) the in-process cache. It can't be replag because (in theory) CentralAuthUser::loadState() will always force using the primary. So it seems like the problem is with the WAN cache.
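The layering described above can be modeled with a minimal Python sketch (the class and method names here are invented for illustration, not CentralAuth's actual API). It shows how a stale shared-cache entry masks fresh DB data even across requests, matching the observed symptom:

```python
class CentralUserStore:
    """Toy model of CentralAuthUser's three persistence layers."""

    def __init__(self, db):
        self.db = db              # source of truth
        self.wan_cache = {}       # shared across requests
        self.process_cache = {}   # per-request only

    def get_id(self, name):
        # 1. in-process cache (cleared between requests)
        if name in self.process_cache:
            return self.process_cache[name]
        # 2. WAN cache -- a stale entry here hides fresh DB rows
        if name in self.wan_cache:
            self.process_cache[name] = self.wan_cache[name]
            return self.wan_cache[name]
        # 3. database
        value = self.db.get(name, 0)
        self.wan_cache[name] = value
        self.process_cache[name] = value
        return value


db = {}
store = CentralUserStore(db)
store.get_id('Alice')           # caches 0 ("unattached")
db['Alice'] = 123               # user gets created/attached
store.process_cache.clear()     # simulate a fresh request
stale = store.get_id('Alice')   # still 0: the WAN entry wins
```

In this toy model, only deleting the WAN entry lets the DB value through, which is why the discussion below focuses on when exactly the purge happens.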

During user creation, either CentralAuthUser::register() or CentralAuthUser::attach() is called; both call CentralAuthUser::invalidateCache(), which doesn't seem to take effect. Can the WAN cache lag?

The other suspicious thing: CentralAuthUser::invalidateCache() calls CentralAuthUser::quickInvalidateCache() which calls WANObjectCache::delete(), but only as an onTransactionPreCommitOrIdle hook. Seems like that might be too late?

WAN cache is expected to be in line with a replica database, meaning it is subject to lag that is generally imperceptible but can in edge cases be noticed. For example, purges/deletes/touches are broadcast across DCs; while the delay is usually <100ms, up to 5 seconds is tolerated (e.g. when a memc server is down, a gutter pool may be active for ~5 seconds). This is the same delay we tolerate for DB replication (we auto-depool replicas with more lag than this).

https://wikitech.wikimedia.org/wiki/Memcached_for_MediaWiki#WANObjectCache

"Like a replica database."

In practice at our scale, nothing truly happens "immediately". Requests can overlap and may continue to act on older information. However, WANObjectCache is designed with a limited interface that allows it to prevent, solve, or hide these kinds of issues automatically and transparently. More details/examples at https://techblog.wikimedia.org/2022/12/08/perf-matters-at-wikipedia-2016/#one-step-closer-to-multi-dc

One potential source of lag is replication lag:

After performing DB writes (assuming they're pre-send), we emit a cookie that pins the user to the same DC for the next few seconds. This way, no cross-DC delay or memcached purge-broadcasting delay factors in. The only thing to worry about on subsequent GET requests is DB replication within the same DC, and ChronologyProtector makes it so that you always see your own writes: the cookie we set after a POST request that writes DB data 1) pins you to the same DC for the next few seconds, and 2) stores some data in local memcached that tells Rdbms to pick or wait for a DB replica that has caught up to your own previous writes.
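The "pick or wait for a caught-up replica" part can be sketched as follows (a hedged Python toy model; the real ChronologyProtector uses replication positions and a memcached-backed store, and these names are invented):

```python
import time


class Replica:
    """Stand-in for a replica DB that applies writes over time."""
    def __init__(self, pos=0):
        self.pos = pos

    def catch_up(self):
        self.pos += 1


class ChronologyProtectorSketch:
    """Toy model: remember each client's last write position and make
    later reads wait for a replica that has caught up to it."""

    def __init__(self):
        self.positions = {}   # stands in for the memcached-backed store

    def record_write(self, client_id, primary_pos):
        # Called after a request that wrote to the primary DB;
        # client_id stands in for the cookie value.
        self.positions[client_id] = primary_pos

    def wait_for_replica(self, client_id, replica, timeout=5.0):
        target = self.positions.get(client_id, 0)
        deadline = time.monotonic() + timeout
        while replica.pos < target:
            if time.monotonic() >= deadline:
                return False  # replica too lagged; caller gives up
            replica.catch_up()
        return True
```

The key property: a client that wrote at position 5 never reads from a replica still at position 2; the read blocks (briefly) until the replica catches up.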

If the second request is not routed to the primary DC, however, then cross-dc delay is of course a possibility.

One potential thing to explore is what happens with auto-creation on GET pageviews. Those pageviews may very well have gone to the secondary DC. I don't recall exactly what we guarantee with regards to "seeing your own writes" if you perform writes from within a GET request on the secondary DC. Secondary DCs are technically able to write to the primary DC's database and to the primary DC's memcached (i.e. the MicroStash / ChronologyProtector store). Whether we do in this case, I don't know.

If the write and the read are on different domains, then the cookie mechanism is not sufficient. This is among the reasons why login.wikimedia.org is pinned to the primary DC, so that it can read from the primary DB and/or find an up-to-date replica based on your CP positions. Any other cross-domain scenarios are taken care of by MediaWikiEntryPoint.php, which injects cpPosIndex as a query parameter on any cross-domain OutputPage redirect.

If I understand correctly, this task is about a same-domain scenario (i.e. you locally created and ended that interaction on the local domain; even if sso/loginwiki is involved, it ends on the local domain), and then a subsequent request/pageview is not seeing the data? That generally can't happen in terms of multi-DC and Rdbms, short of freak accidents where writes get lost or replication lag was unusually high. I don't see anything in WANObjectCache that would explain it either; however, that assumes it is used "correctly" by CentralAuth. There is no "lag" in WANObjectCache within the same data center, since memcached data is spread over a pool of hosts where each value is hosted in only one place. Once it is deleted, it is gone.

onTransactionPreCommitOrIdle runs before the commit, not after. Why do you think that might be too late?

In any event, I agree it is suspicious. Does it need to be deferred at all? WANCache deletes include a 5-second hold-off period. And since it is only a cache, it is expected to be safe to delete data even if we're not 100% sure that the source is being changed. Usually, the reason people use onTransactionPreCommitOrIdle is to perform a second step conditionally on whether the writes succeeded (e.g. no DB rollback, no PHP fatal exception). Cache purges do not need to be conditional in that way.

The way WANCache hold-off works is that deletes internally store an "empty" value under the memcached key, which rejects any WANObjectCache writes for the next few seconds (forced cache miss). This ensures that no overlapping or near-future request populates it with a potentially lagged value from a replica database. Not something you should have to worry about, but FYI.
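The tombstone mechanism described above can be sketched like this (a minimal Python toy model, not WANObjectCache's real implementation; the class name and the exact hold-off constant are invented for illustration):

```python
HOLDOFF = 11  # seconds of forced cache miss after a delete (illustrative)


class WanCacheSketch:
    """Toy model of delete() writing a tombstone that rejects set()s."""

    def __init__(self):
        self.store = {}   # key -> (value, tombstone_expiry)

    def delete(self, key, now):
        # Store an "empty" sentinel that blocks writes for HOLDOFF secs.
        self.store[key] = (None, now + HOLDOFF)

    def set(self, key, value, now):
        entry = self.store.get(key)
        if entry is not None and now < entry[1]:
            return False  # within hold-off: write rejected, miss persists
        self.store[key] = (value, 0)
        return True

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is None or entry[0] is None:
            return None   # cache miss (absent or tombstoned)
        return entry[0]
```

So a concurrent request that read a stale value from a lagged replica cannot re-populate the key during the hold-off window; once the window passes, normal writes succeed again.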

After spending some time on this, I'm not sure I've been able to determine the root cause, but I'm pretty confident that it has very little to do with WANObjectCache and a lot to do with edge login and potentially a race condition with background HTTP requests (plus changes to the session backend).

Will investigate this some more after SUL3 rollout is complete.

@DAlangi_WMF Thanks for looking into this. Is it possible to know when the SUL3 rollout will be complete? The outdated data from CentralAuth is causing incorrect information to be sent to our Grafana charts displaying A/B experiment KPIs, because we can't track the control group correctly due to '0' being sent as the user_id.

Our proposed workaround would entail breaking our enwiki experiment, which isn't ideal for our use case.

Is it possible to know when the SUL3 rollout will be complete?

We are targeting end of Q3: https://www.mediawiki.org/wiki/MediaWiki_Platform_Team/SUL3#Phased_rollout

What would be needed to reproduce T379682: Growth KPI Grafana dashboard claims no A/B testing happens at pilot wikis when using global assignment (which AIUI reliably produces this issue)? It seems the associated logging was only active for a few hours, so I assume some Growth feature would need to be re-deployed?

What would be needed to reproduce T379682: Growth KPI Grafana dashboard claims no A/B testing happens at pilot wikis when using global assignment (which AIUI reliably produces this issue)? It seems the associated logging was only active for a few hours, so I assume some Growth feature would need to be re-deployed?

The feature is still deployed, so the issue still reproduces in production (see example); we lowered the log level to debug to avoid noise. It can also be observed in Growth's Grafana dashboard: variants assigned on account creation don't capture any data (dashboard, Account creation section). The relevant code path is HomepageHooks::onLocalUserCreated requesting the variant via user options and calling ExperimentUserDefaultsManager::shouldAssignGlobalBucket, which requests an ID from CentralIdLookup.

One potential thing to explore is, what happens with auto-creation on GET pageviews.

The issues we have seen so far which might be caused by CentralAuthUser seeing outdated cached data:

  • T379682 / T379909: GrowthExperiments uses CentralIdLookup in a LocalUserCreated hook after account creation, and sees the freshly created user as unattached.
  • T382292: global_user_editcount AbuseFilter variable is (sometimes?) 0 for the MediaWiki message delivery user; the likely code path here is CentralAuthUser seeing the account as unattached
  • T385311: sometimes central autologin fails with an error saying the user's central ID is 0
  • T384549: after SUL3 account creation we should be setting a global preference, but we don't and there's a Couldn't find a global ID for user <username> warning in the logs

The first and last happen during account creation (so a POST request), the second happens during a job (so always the primary DC, I think?), and the third happens on the local leg of central autologin (so a GET request, not forced to the primary DC, but with no auto-creation / writes involved). While it's entirely possible that some or all of these are unrelated issues, this is not really suggestive of write-on-GET being the problem.

onTransactionPreCommitOrIdle is before the commit not after. Why do you think that might be too late?

User::addToDatabase() (used for both signup and autocreation) wraps its DB operations in an atomic section. AIUI since web requests use implicit transaction rounds, that transaction won't commit until normal request processing ends and MediaWikiEntryPoint tears things down. So all the way until then, other code in the same request (such as LocalUserCreated handlers) will still see the un-invalidated cache.
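The timing problem described above can be modeled in a few lines of Python (a hedged toy model; the class and key names are invented, and the real mechanism involves Rdbms transaction rounds and DeferredUpdates):

```python
class TransactionRoundSketch:
    """Toy model: pre-commit callbacks only run when the implicit
    transaction round commits at the end of the request, not when
    they are queued."""

    def __init__(self):
        self.pre_commit = []

    def on_transaction_pre_commit_or_idle(self, fn):
        self.pre_commit.append(fn)  # deferred: a round is still open

    def commit(self):
        for fn in self.pre_commit:
            fn()
        self.pre_commit.clear()


wan_cache = {'centraluser:Alice': 0}   # stale "unattached" entry


def handle_request():
    txn = TransactionRoundSketch()
    # attach() queues the purge instead of deleting immediately:
    txn.on_transaction_pre_commit_or_idle(
        lambda: wan_cache.pop('centraluser:Alice', None))
    # A LocalUserCreated handler running later in the SAME request
    # still sees the stale cached 0:
    seen_by_hook = wan_cache.get('centraluser:Alice')
    txn.commit()   # the purge only happens at request teardown
    return seen_by_hook
```

In this model, `handle_request()` returns the stale 0 even though the purge was "already queued", and the cache is only clean for later requests, which is exactly the symptom the hook handlers hit.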

Does it need to be defered at all?

That was my question as well, the obvious thing to me would be to just delete immediately.

(FWIW, the deferral was introduced in rECAUdbd2c03403c4: Use a pre-commit hook for quickInvalidateCache() purges. That was in 2015, which makes it hard to believe that patch caused the erroneous behavior on its own - that's a long time to not notice a cache issue that could interfere with all kinds of things.)

Change #1121784 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@master] CentralAuthIdLookup: Reuse cached object on single-value lookup

https://gerrit.wikimedia.org/r/1121784

Change #1121789 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@master] [WIP] CentralAuthIdLookup: Use primary DB after writes

https://gerrit.wikimedia.org/r/1121789

Change #1121784 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] CentralAuthIdLookup: Reuse cached object on single-value lookup

https://gerrit.wikimedia.org/r/1121784

Change #1121789 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] CentralAuthIdLookup: Use primary DB after writes

https://gerrit.wikimedia.org/r/1121789

Change #1124781 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.18] CentralAuthIdLookup: Reuse cached object on single-value lookup

https://gerrit.wikimedia.org/r/1124781

Change #1124782 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.18] CentralAuthIdLookup: Use primary DB after writes

https://gerrit.wikimedia.org/r/1124782

Change #1124781 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.18] CentralAuthIdLookup: Reuse cached object on single-value lookup

https://gerrit.wikimedia.org/r/1124781

Change #1124782 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.44.0-wmf.18] CentralAuthIdLookup: Use primary DB after writes

https://gerrit.wikimedia.org/r/1124782

Mentioned in SAL (#wikimedia-operations) [2025-03-05T16:19:50Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Ad

Mentioned in SAL (#wikimedia-operations) [2025-03-05T16:22:50Z] <tgr@deploy2002> tgr: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Add passive central do

Mentioned in SAL (#wikimedia-operations) [2025-03-05T16:54:47Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|A

The issue with T379682 / T379909 / T384549 was CentralIdLookup using a replica read and ignoring the in-process cache, while being called in various hooks immediately after central account creation; that is now fixed. But the code responsible for T382292 and T385311 uses CentralAuthUser directly in both cases, so that seems like a different bug.

Do we think that this still happens? As far as I can tell, all of the recent reports were either resolved, or determined to be caused by a mistake in the code.

matmarex reassigned this task from matmarex to Tgr.

All of the related tasks are resolved or have been determined to have a different root cause. We believe that this was fixed by @Tgr's patches in March.