
CentralAuthUser returning outdated data after user creation
Open, Needs Triage · Public · BUG REPORT

Description

In some cases, right after a user gets created or autocreated, code like CentralAuthUser::getInstance( $user )->getId() will return 0 (i.e. see the user as unattached on that wiki, even though it's definitely attached).
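
A minimal sketch of the failing read (the surrounding call site is hypothetical; the actual call sites are in the tasks linked below):

```
// Right after the user was (auto)created and attached on this wiki,
// the cached central user still looks unattached:
$centralUser = CentralAuthUser::getInstance( $user );
$centralUser->getId(); // returns 0
```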

Two examples are:
T379909: Define where to add code that needs to run after a new central user has been created
T380042: RuntimeException: Global user does not have ID '0'.

There's a separate getPrimaryInstance() method, but it's not very clear when it's needed - CentralAuthUser::loadState() tries to load from the primary anyway when there have been recent DB changes. And sometimes using the primary is not an option: in the case of T379909 the lookup happens on a GET request (in the GET-after-POST pattern used to finish user signup).
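
For comparison, the primary lookup path - shown only to illustrate the trade-off, assuming the current method name:

```
// Reads state from the primary database instead of the (possibly stale)
// cached copy, but that's not an option in e.g. the GET-after-POST
// signup flow of T379909, which runs on a GET request.
$centralUser = CentralAuthUser::getPrimaryInstance( $user );
```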

Event Timeline


CentralAuthUser has three persistence layers: DB, WAN cache, in-process cache. Since the issue affects subsequent requests, it can't be (only) the in-process cache. It can't be replag because (in theory) CentralAuthUser::loadState() will always force using the primary. So it seems like the problem is with the WAN cache.

During user creation, either CentralAuthUser::register() or CentralAuthUser::attach() is called; both of those call CentralAuthUser::invalidateCache(), which doesn't seem to take effect. Can the WAN cache lag?

The other suspicious thing: CentralAuthUser::invalidateCache() calls CentralAuthUser::quickInvalidateCache(), which calls WANObjectCache::delete(), but only in an onTransactionPreCommitOrIdle callback. It seems like that might be too late?
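
Roughly, that purge path has this shape (simplified sketch with placeholder accessors, not the exact CentralAuth code):

```
public function quickInvalidateCache() {
	$dbw = $this->getCentralPrimaryDB();  // placeholder: central primary DB handle
	$cache = $this->wanCache;             // placeholder: a WANObjectCache instance
	$key = $this->getCacheKey( $cache );  // placeholder: the per-user cache key

	// The actual WANObjectCache::delete() only runs once the transaction
	// round is about to commit (or right away, if no transaction is open).
	$dbw->onTransactionPreCommitOrIdle( static function () use ( $cache, $key ) {
		$cache->delete( $key );
	}, __METHOD__ );
}
```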

The WAN cache is expected to be in line with a replica database, meaning it is subject to lag that is generally imperceptible but can be noticed in edge cases. For example, purges/deletes/touches are broadcast across DCs, and while they usually take <100ms, up to 5 seconds is tolerated (e.g. when a memcached server is down, a gutter pool may be active for ~5 seconds). This is the same delay we tolerate for DB replication (we auto-depool replicas with more lag than this).

https://wikitech.wikimedia.org/wiki/Memcached_for_MediaWiki#WANObjectCache

"Like a replica database."

In practice at our scale, nothing truly happens "immediately". Requests can overlap and may continue to act on older information. However, WANObjectCache is designed with a limited interface that allows it to prevent, solve, or hide these kinds of issues automatically and transparently. More details/examples at https://techblog.wikimedia.org/2022/12/08/perf-matters-at-wikipedia-2016/#one-step-closer-to-multi-dc

One potential source of lag is replication lag:

After performing DB writes (assuming they're pre-send), we emit a cookie that pins the user to the same DC for the next few seconds. That way, no cross-DC delay or memcached purge-broadcast delay factors in; the only thing to worry about on subsequent GET requests is DB replication within the same DC. ChronologyProtector makes it so that you always see your own writes: the cookie we set after a POST request that writes DB data 1) pins you to the same DC for the next few seconds, and 2) stores some data in local memcached that tells Rdbms to pick, or wait for, a DB replica that has caught up to your own previous writes.

If the second request is not routed to the primary DC, however, then cross-DC delay is of course a possibility.

One potential thing to explore is what happens with auto-creation on GET pageviews. Those pageviews may very well have gone to the secondary DC. I don't recall exactly what we guarantee with regards to "seeing your own writes" if you perform writes from within a GET request on the secondary DC. Secondary DCs are technically able to write to the primary DC's database and to the primary DC's memcached (i.e. the MicroStash / ChronologyProtector store). Whether we do in this case, I don't know.

If the write and the read are on different domains, then the cookie mechanism is not sufficient. This is among the reasons why login.wikimedia.org is pinned to the primary DC, so that it can read from the primary DB and/or find an up-to-date replica based on your CP positions. Any other cross-domain scenario is taken care of by MediaWikiEntryPoint.php, which injects cpPosIndex as a query parameter on any cross-domain OutputPage redirect.

If I understand correctly, this task is about a same-domain scenario (i.e. the account creation started and ended on the local domain - even if SSO/loginwiki is involved, the interaction ends on the local domain), and then a subsequent request/pageview is not seeing the data? That generally can't happen in terms of multi-DC and Rdbms, short of freak accidents where writes get lost or replication lag was unusually high. I don't see anything in WANObjectCache that would explain it either; however, that assumes it is used "correctly" by CentralAuth. There is no "lag" in WANObjectCache within the same data center, since memcached data is spread over a pool of hosts where each value is only hosted in one place. Once that value is deleted, it is gone.

onTransactionPreCommitOrIdle runs before the commit, not after. Why do you think that might be too late?

In any event, I agree it is suspicious. Does it need to be deferred at all? WAN cache deletes include a 5-second hold-off period. And, since it is only a cache, it is expected to be safe to delete data even if we're not 100% sure the source is being changed. Usually, the reason people use onTransactionPreCommitOrIdle is to perform a second step conditionally on whether the writes were successful, e.g. there was no DB rollback and no PHP fatal exception. Cache purges do not need to be conditional in that way.
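
In other words, if the deferral isn't needed, the purge could simply run inline (sketch, reusing the placeholder names from the earlier comment):

```
// Current shape: purge deferred until just before commit.
$dbw->onTransactionPreCommitOrIdle( static function () use ( $cache, $key ) {
	$cache->delete( $key );
}, __METHOD__ );

// Alternative: purge immediately. delete() applies a hold-off, so even a
// purge issued before the DB write commits is safe - overlapping requests
// can't re-cache a stale value in the meantime.
$cache->delete( $key );
```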

The way the WAN cache hold-off works is that deletes internally store an "empty" value under the memcached key, which rejects any WANObjectCache writes for the next few seconds (forced cache miss). This ensures that no overlapping or near-future requests populate it with a potentially lagged value from a replica database. Not something you should have to worry about, but FYI.
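
For illustration, this is how the hold-off interacts with a typical read path (sketch; the key shape and loader are assumptions, not CentralAuth's actual code):

```
$value = $cache->getWithSetCallback(
	$cache->makeGlobalKey( 'centralauth-user', $name ), // assumed key shape
	$cache::TTL_DAY,
	function () use ( $name ) {
		// Placeholder loader; may read from a lagged replica.
		return loadStateFromDatabase( $name );
	}
);
// While the hold-off tombstone from delete() is in place, the callback
// still runs, but its result is not written back under the main key, so
// a stale replica read cannot stick for the next few seconds.
```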