
SkinTest::testBuildSidebarCache is flaky in CI on sqlite
Closed, ResolvedPublic

Description

https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-sqlite-php74-docker/3911/console

There was 1 failure:

1) SkinTest::testBuildSidebarCache
cache hit
Failed asserting that array contains the expected submap.
--- Expected
+++ Actual
@@ @@
       'id' => 't-info',
     ),
   ),
-  'myhook' => 'foo 1',
+  'myhook' => 'foo 2',
   'navigation' => 
   array (
     0 =>

/workspace/src/tests/phpunit/MediaWikiTestCaseTrait.php:234
/workspace/src/tests/phpunit/includes/skins/SkinTest.php:523
/workspace/src/tests/phpunit/MediaWikiIntegrationTestCase.php:547
phpvfscomposer:///workspace/src/vendor/phpunit/phpunit/phpunit:97

Event Timeline

I first noticed this flaky test on gate-and-submit of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/942756, but it passed on the third try.
My local tests show that the changed code of that patch set is not called during the test, so I hope it is not caused by this patch set, but I have no idea which one is responsible.

Skin::buildSidebar can set WANObjectCache::TTL_UNCACHEABLE, which could result in this failure. That is related to a disabled MessageCache.

It seems the entry wikidb-unittest_:sidebar:en:foo is not in the cache (the log shows "miss, new value computed" vs. "volatile hit"), rather than the uncacheable result I mentioned above.

Good
[PHPUnit] Start test SkinTest::testBuildSidebarCache
[objectcache] MainWANObjectCache using store HashBagOStuff
[localisation] LocalisationCache using store LCStoreNull
[objectcache] MainWANObjectCache using store HashBagOStuff
[localisation] LocalisationCache::isExpired(en): cache missing, need to make one
[MessageCache] MessageCache using store HashBagOStuff
[rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 0/wikidb-unittest_
[rdbms] MessageCache::loadFromDB(en)-big [0s] localhost: SELECT  page_title,page_latest  FROM "unittest_page"    WHERE page_is_redirect = 0 AND page_namespace = 8 AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_len > 10000)  
[rdbms] MessageCache::loadFromDB(en)-small [0s] localhost: SELECT  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS "rev_comment_text",comment_rev_comment.comment_data AS "rev_comment_data",comment_rev_comment.comment_id AS "rev_comment_cid",actor_rev_user.actor_user AS "rev_user",actor_rev_user.actor_name AS "rev_user_text",rev_actor,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len  FROM "unittest_page" JOIN "unittest_revision" ON ((page_id = rev_page)) JOIN "unittest_comment" "comment_rev_comment" ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN "unittest_actor" "actor_rev_user" ON ((actor_rev_user.actor_id = rev_actor))   WHERE page_is_redirect = 0 AND page_namespace = 8 AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_len <= 10000) AND (page_latest = rev_id)  
[MessageCache] MessageCache::loadUnguarded: Loading en... local cache is empty, global cache is empty, loading from DB
[objectcache] fetchOrRegenerate(wikidb-unittest_:sidebar:en:foo): miss, new value computed
[session] SessionBackend "vn5ne7qd4d37e5iopp61iahe94hbl11l" is unsaved, marking dirty in constructor
[session] SessionBackend "vn5ne7qd4d37e5iopp61iahe94hbl11l" save: dataDirty=1 metaDirty=1 forcePersist=0
[session] SessionBackend "vn5ne7qd4d37e5iopp61iahe94hbl11l" data dirty due to dirty(): MediaWiki\Session\SessionManager->getSessionForRequest/MediaWiki\Session\SessionManager->getInitialSession/MediaWiki\Session\Session->getToken/MediaWiki\Session\Session->set/MediaWiki\Session\SessionBackend->dirty
[session] SessionBackend "vn5ne7qd4d37e5iopp61iahe94hbl11l" save: dataDirty=1 metaDirty=0 forcePersist=0
[objectcache] fetchOrRegenerate(wikidb-unittest_:sidebar:en:foo): volatile hit
[objectcache] fetchOrRegenerate(wikidb-unittest_:sidebar:en:bar): miss, new value computed
[PHPUnit] Successful test SkinTest::testBuildSidebarCache, completed in 0.003277 seconds
Failure
[PHPUnit] Start test SkinTest::testBuildSidebarCache
[objectcache] MainWANObjectCache using store HashBagOStuff
[localisation] LocalisationCache using store LCStoreNull
[objectcache] MainWANObjectCache using store HashBagOStuff
[localisation] LocalisationCache::isExpired(en): cache missing, need to make one
[MessageCache] MessageCache using store HashBagOStuff
[rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 0/wikidb-unittest_
[rdbms] MessageCache::loadFromDB(en)-big [0s] localhost: SELECT  page_title,page_latest  FROM "unittest_page"    WHERE page_is_redirect = 0 AND page_namespace = 8 AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_len > 10000)  
[rdbms] MessageCache::loadFromDB(en)-small [0s] localhost: SELECT  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS "rev_comment_text",comment_rev_comment.comment_data AS "rev_comment_data",comment_rev_comment.comment_id AS "rev_comment_cid",actor_rev_user.actor_user AS "rev_user",actor_rev_user.actor_name AS "rev_user_text",rev_actor,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len  FROM "unittest_page" JOIN "unittest_revision" ON ((page_id = rev_page)) JOIN "unittest_comment" "comment_rev_comment" ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN "unittest_actor" "actor_rev_user" ON ((actor_rev_user.actor_id = rev_actor))   WHERE page_is_redirect = 0 AND page_namespace = 8 AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_len <= 10000) AND (page_latest = rev_id)  
[MessageCache] MessageCache::loadUnguarded: Loading en... local cache is empty, global cache is empty, loading from DB
[objectcache] fetchOrRegenerate(wikidb-unittest_:sidebar:en:foo): miss, new value computed
[session] SessionBackend "9esdsj0cq7e2ddgbgr5ft73sg3n9khhv" is unsaved, marking dirty in constructor
[session] SessionBackend "9esdsj0cq7e2ddgbgr5ft73sg3n9khhv" save: dataDirty=1 metaDirty=1 forcePersist=0
[session] SessionBackend "9esdsj0cq7e2ddgbgr5ft73sg3n9khhv" data dirty due to dirty(): MediaWiki\Session\SessionManager->getSessionForRequest/MediaWiki\Session\SessionManager->getInitialSession/MediaWiki\Session\Session->getToken/MediaWiki\Session\Session->set/MediaWiki\Session\SessionBackend->dirty
[session] SessionBackend "9esdsj0cq7e2ddgbgr5ft73sg3n9khhv" save: dataDirty=1 metaDirty=0 forcePersist=0
[objectcache] fetchOrRegenerate(wikidb-unittest_:sidebar:en:foo): miss, new value computed
[PHPUnit] FAILURE in test SkinTest::testBuildSidebarCache: cache hit

Reducing WANObjectCache::RECENT_SET_LOW_MS and WANObjectCache::RECENT_SET_HIGH_MS sufficiently makes the failure reproducible.

Alternatively, you can put usleep( 100000 ) between the first and second assertions; that also triggers it.

Skin::buildSidebar() uses a check key, which comes from MessageCache::getCheckKey(). The check key is absent from the cache on the first run of buildSidebar(), so WANObjectCache::processCheckKeys() adds it to the cache, with the purge time set to the current time, rounded down to the nearest whole second.

On the second call to buildSidebar(), $res[self::RES_CUR_TTL] is set to $ckPurge[self::PURGE_TIME] - $now, which is a random negative number between about -1 and 0, because it is the difference between the current time rounded down to the nearest second and the current time. So WANObjectCache::isAcceptablyFreshValue() always returns false.
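The arithmetic above can be sketched as follows (a minimal Python model of the described behaviour; the function name and structure are illustrative, not the real WANObjectCache internals):

```python
import math
import time

def cur_ttl_for_missing_check_key(now: float) -> float:
    """Toy model: a missing check key is initialised with a purge time of
    floor(now), so the computed "current TTL" of any value guarded by it
    becomes floor(now) - now, a fraction in (-1, 0]."""
    purge_time = math.floor(now)  # purge timestamp rounded down to a whole second
    return purge_time - now

now = time.time()
ttl = cur_ttl_for_missing_check_key(now)
# Almost always strictly negative, so the value looks stale to the
# freshness check even though it was generated a moment ago.
assert -1 < ttl <= 0
```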

Thus we head into the murky waters of stampede protection. The test needs to complete within a randomized window of 50-100ms, or else isExtremelyNewValue() will return false and a cache miss will occur.
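A rough model of that window (the constants mirror the 50-100ms range described above; the function is a hypothetical sketch, not the real isExtremelyNewValue() implementation):

```python
import random

# Constants mirror the 50-100ms range described above (illustrative names).
RECENT_SET_LOW_MS = 50
RECENT_SET_HIGH_MS = 100

def is_extremely_new_value(set_time: float, now: float) -> bool:
    """A freshly regenerated value counts as a "volatile hit" only if it
    was stored within a randomised 50-100ms window before now."""
    age_ms = (now - set_time) * 1000
    threshold_ms = random.uniform(RECENT_SET_LOW_MS, RECENT_SET_HIGH_MS)
    return age_ms < threshold_ms

# The passing run re-read the key ~1ms after regeneration ("volatile hit"):
assert is_extremely_new_value(set_time=1000.000, now=1000.001)
# After usleep(100000), i.e. 100ms+, the window has always elapsed ("miss"):
assert not is_extremely_new_value(set_time=1000.000, now=1000.200)
```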

I don't understand why processCheckKeys() is adding missing keys to the cache, with an expiry time in the past. That seems like a bug.

This feature is called the hold-off period, and it is discussed in a few places in the source. There is likely some production impact. For 11 seconds after a purge, every request is a miss; OK, reasons are given for this. But also for 11 seconds following regeneration due to an empty cache, every request is a miss, except in the first 50-100ms after each regeneration. This does not seem to be justified. It dates back to the 2015 commit a0cce5e4b63d9607f4561298d346695bd4d1c31c. This detail was not discussed in Gerrit.

Change 948683 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] skins: Mock clock in SkinTest to fix flaky test

https://gerrit.wikimedia.org/r/948683

I'm working on a commit that avoids doing purges when the cache is empty.

There are a bunch of explicit tests for this behaviour, but still nothing explaining why you would want it. An empty cache does not imply a recently changed database.

Change 948683 merged by jenkins-bot:

[mediawiki/core@master] skins: Mock clock in SkinTest to fix flaky test

https://gerrit.wikimedia.org/r/948683

Change 948714 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] WANObjectCache: don't purge an empty cache

https://gerrit.wikimedia.org/r/948714

$res[self::RES_CUR_TTL] is set to $ckPurge[self::PURGE_TIME] - $now which is a random negative number between about -1 and 0, because it is the time between the current time and the current time rounded down to the nearest second. So WANObjectCache::isAcceptablyFreshValue() always returns false.

That's likely a regression from a patch of mine that aimed to reduce complexity by removing unrealistically precise purge timestamps: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/719148

The test needs to complete in a random time between 50ms and 100ms or else isExtremelyNewValue() will return false and a cache miss will occur.

I wonder if we really need randomization here.

I don't understand why processCheckKeys() is adding missing keys to the cache, with an expiry time in the past. That seems like a bug.

Ack.

[…] But also for 11 seconds following regeneration due to an empty cache, every request is a miss except in the first 50-100ms after each regeneration. This does not seem to be justified. It dates back to 2015 a0cce5e4b63d9607f4561298d346695bd4d1c31c. This detail was not discussed in Gerrit.

The first 50-100ms is referred to in the code as "volatile cache-hit" and was added in 2019 with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/493148. This happens regardless of whether the check-key or hold-off features are at play.

When a getWithSet callback has both its own key and a shared check key, and the check key is absent, the code assumes the check key may have been touched recently. This design has existed since Aaron's first version of WANObjectCache in 2015 (https://gerrit.wikimedia.org/r/187074). I've heard this part of the design explained numerous times in code review comments by Aaron as relating to multi-DC race conditions.

I assume the following design intentions, as documented by me based on recent discussions with SRE around multi-DC planning at https://wikitech.wikimedia.org/wiki/Memcached_for_MediaWiki#High_level:

  • We generally aim for changes to propagate with caches being eventually consistent. Some propagations are initiated over the span of seconds or hours via the job queue, but most are initiated directly in the same web request. Database writes and cache purges are generally expected to replicate/broadcast within a few seconds. Any object caches and HTTP caches populated during that window need to self-disable or self-correct.
  • Local memcached should generally be perceived as a database replica (i.e. a caller may not rely on purges applying instantly and atomically everywhere at once, in the same sense that DB replication isn't instantaneous and that there can always be ongoing requests that fully overlap a concurrent edit).

In broad strokes, the way this is realized between memcached and mysql is that MediaWiki is aware of how lagged a queried mysql database replica is, and if it is more than a few seconds lagged (I believe today our threshold is 6 seconds?) it will signal to the Wikimedia CDN (i.e. Varnish) to not cache its web response for longer than a few seconds. This way, when the CDN purges articles after edits, and new requests come in to re-create the cache entry, if that cache entry is known to likely still be the old version, we store it only for a few seconds. Almost as if to virtually purge it again once the mysql replication has completed. In the same way, WANObjectCache also coordinates with Rdbms behind the scenes and automatically shortens the TTL of memcached keys that queried a lagged database to TTL_LAGGED (30 seconds) rather than e.g. for several hours or days as it might otherwise.
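The TTL shortening for lagged replicas can be sketched like this (TTL_LAGGED = 30 seconds is from the text; the 6-second threshold is the guessed value mentioned above, and the function itself is an assumption, not the actual Rdbms/WANObjectCache coordination API):

```python
TTL_LAGGED = 30          # seconds, as stated above
MAX_REPLICA_LAG = 6      # seconds; threshold is the guess from the text

def effective_ttl(requested_ttl: int, replica_lag: float) -> int:
    """Toy model of the described TTL shortening: a value generated from
    a replica lagged beyond the threshold is stored for at most
    TTL_LAGGED seconds instead of the (possibly long) requested TTL."""
    if replica_lag > MAX_REPLICA_LAG:
        return min(requested_ttl, TTL_LAGGED)
    return requested_ttl

assert effective_ttl(86400, replica_lag=0.5) == 86400  # healthy replica: full TTL
assert effective_ttl(86400, replica_lag=10) == 30      # lagged replica: shortened
```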

Furthermore, there is a concept of "tombstones". These serve two purposes, as I understand it:

  1. By implementing purges as a SET value rather than a DELETE we avoid race conditions where another web request may repopulate an old value from a database replica back under the same key. When WANObjectCache stores the result of a getWithSet callback, it does so by sending the ADD command if there was a cache miss (or CAS to replace the known-previous memcached value, but not blindly replace something else that arrived there meanwhile).
  2. By having the purge occupy the key for a few seconds, we allow both local-dc and cross-dc database replicas to finish the normally tolerated amount of replication lag without actually having to coordinate anything explicit about database positions or otherwise tightly coupling the two. In particular, with the knowledge of the above broad approach in which we trust that if a database replica is unusually lagged for any data (regardless whether recently purged or not) memcache and CDN cache are naturally shortened to allow for quick self-correction.
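The purge-as-SET idea from point 1 can be illustrated with a toy model (this is not the memcached or WANObjectCache API, just a sketch of the ADD-vs-tombstone interaction):

```python
class CacheModel:
    """Toy model of purge-as-SET: a purge writes a tombstone value, and
    repopulation uses ADD, which only succeeds if the key is absent."""
    TOMBSTONE = object()

    def __init__(self):
        self.store = {}

    def purge(self, key):
        # SET a tombstone rather than DELETE, so the key stays occupied.
        self.store[key] = self.TOMBSTONE

    def add(self, key, value) -> bool:
        # ADD fails if anything (including a tombstone) occupies the key.
        if key in self.store:
            return False
        self.store[key] = value
        return True

cache = CacheModel()
cache.purge("key-foo-bar")
# A lagging reader that computed the old value cannot clobber the purge:
assert cache.add("key-foo-bar", "X") is False
# Once the tombstone expires (modelled here as deletion), ADD works again:
del cache.store["key-foo-bar"]
assert cache.add("key-foo-bar", "Y") is True
```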

But — none of this explains why WANObjectCache would have to impose a tombstone on itself when simply generating a cache value for the first time after it expires from storage, or why that is specific to check keys. The best I can think of is that this was implemented in 2015 because at the time, cross-dc broadcasting of purges was imagined to work through a secondary service, whereas we ended up going with mcrouter and its memcached-native use of SET/DELETE.

I'll try to think through several scenarios that I think are relevant to this feature. In these scenarios, I'm considering all of the above described behaviour, but not the behaviour that Tim is considering removing (namely interpreting an absent check key as if there was a recent purge, along with the hold-off).

  1. Remote DC stores previous value between primary DC "set" and primary DC "broadcast".
    • Codfw: empty.
    • Eqiad: Receives read request for "Bar" containing "X". Populates local cache with ADD key-foo-bar=X and ADD checkkey-shared-foo=1.
    • Eqiad: Receives edit for "Bar", changing to "Y". Broadcasts purge as SET key-foo-bar=tombstone, received locally and on its way to Codfw.
    • Codfw: Receives read request for "Bar". Populates local cache with ADD key-foo-bar=X and ADD checkkey-shared-foo=1.
    • Codfw: Receives broadcast SET key-foo-bar=tombstone
    • ✅ Future reads in Codfw see the purge and populate as normal after hold-off.
  2. Remote DC race with a check key.
    • Codfw: empty.
    • Eqiad: Read "Bar", containing "X" from Foo v1. Populate local cache with ADD key-foo-bar=X, ADD checkkey-shared-foo=1.
    • Eqiad: Edit shared resource "Foo" to "Y". Broadcast purge as SET checkkey-shared-foo=2.
    • Codfw: Read "Bar". Populates local cache with ADD key-foo-bar=X, ADD checkkey-shared-foo=1.
    • Codfw: Receive broadcast, SET checkkey-shared-foo=2
    • ✅ Future reads in Codfw see Foo's new check key and interpret Bar cache as stale/purged, populate as normal after hold-off.
  3. Remote DC reads before edit, writes after broadcast.
    • Codfw: empty.
    • Eqiad: Read "Bar", containing "X". Populate local cache with ADD key-foo-bar=X, ADD checkkey-shared-foo=1.
    • Eqiad: Edit "Bar" to "Y". Broadcast purge as SET key-foo-bar=tombstone, SET checkkey-shared-foo=2.
    • Codfw: Start reading of "Bar", local cache miss.
    • Codfw: Receive broadcast SET key-foo-bar=tombstone, SET checkkey-shared-foo=2.
    • Codfw: Finish reading of "Bar", attempt to populate local cache with ADD key-foo-bar=X (denied), ADD checkkey-shared-foo=1 (denied)
    • ✅ Future reads in Codfw see the purge and populate as normal after hold-off.
  4. Remote DC fully overlaps edit.
    • Codfw: empty.
    • Eqiad: Read "Bar", containing "X". Populate local cache with ADD key-foo-bar=X, ADD checkkey-shared-foo=1.
    • Eqiad: Edit "Bar" to "Y". Broadcast purge as SET key-foo-bar=tombstone, SET checkkey-shared-foo=2.
    • Codfw: Start reading of "Bar", local cache miss.
    • Codfw: Receive broadcast SET key-foo-bar=tombstone, SET checkkey-shared-foo=2.
    • Codfw: Finish reading of "Bar", attempt to populate local cache with ADD key-foo-bar=X (denied), ADD checkkey-shared-foo=1 (denied)
    • ✅ Future reads in Codfw see the purge and populate as normal after hold-off.
  5. Remote DC fully overlaps edit, with check key.
    • Codfw: empty.
    • Eqiad: Read "Bar", containing "X" from Foo v1. Populate local cache with ADD key-foo-bar=X, ADD checkkey-shared-foo=1.
    • Eqiad: Edit shared resource "Foo" to "Y". Broadcast purge as SET checkkey-shared-foo=2.
    • Codfw: Start reading of "Bar", local cache miss.
    • Codfw: Receive broadcast SET checkkey-shared-foo=2.
    • Codfw: Finish reading of "Bar", attempt to populate local cache with ADD key-foo-bar=X (works), ADD checkkey-shared-foo=1 (denied)
    • ✅ Future reads in Codfw see Foo's new check key and interpret Bar cache as stale/purged, populate as normal after hold-off. Unlike the previous scenarios, this one will respond for ~1 request after the broadcast arrives with the old value "X" served without any indication of staleness. This seems acceptable given that the only way Codfw would have value "X" is if its database hasn't caught up yet. If this is within a second or so from the broadcast, then that value is just as good as it would have been without any caching, so despite technically knowing that the check key was denied, it is still fully upholding its contract of responding from a fresh-enough local replica. Re-ordering these operations, which may be mere nanoseconds apart, would lead to the same outcome without a way to detect it (given we can't see the future).
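The check-key logic these scenarios rely on can be reduced to a one-line staleness test (a toy model; integer versions stand in for the purge timestamps the real implementation compares):

```python
def is_stale(value_check_version: int, current_check_version: int) -> bool:
    """Toy model of check-key invalidation used in the scenarios above:
    a cached value generated under an older check-key version is treated
    as purged once the broadcast bumps the shared check key."""
    return value_check_version < current_check_version

# Scenarios 2 and 5: Codfw cached "X" under checkkey version 1; after the
# broadcast sets version 2, future reads treat the entry as stale:
assert not is_stale(1, 1)
assert is_stale(1, 2)
```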

All in all, I don't see a way in which local absence of check keys can result in stale data being served in a way that isn't immediately corrected by broadcasts, or already handled by general cache shortening for lagged replicas.

Krinkle triaged this task as Medium priority.

Change 948714 merged by jenkins-bot:

[mediawiki/core@master] WANObjectCache: don't set a hold-off when the cache is empty

https://gerrit.wikimedia.org/r/948714