Page MenuHomePhabricator

Fix flaky ContributionsCount.js test
Open, HighPublic

Description

As seen in https://integration.wikimedia.org/ci/job/integration-quibble-fullrun-extensions/4/console:

17:46:36   1) GET /contributions/count
17:46:36        GET /me/contributions/count
17:46:36          Returns edits filtered by tag:
17:46:36 
17:46:36       AssertionError: expected 0 to deeply equal 1
17:46:36       + expected - actual
17:46:36 
17:46:36       -0
17:46:36       +1
17:46:36       
17:46:36       at testGetContributionsCountByTag (tests/api-testing/REST/ContributionsCount.js:60:10)
17:46:36       at processTicksAndRejections (internal/process/task_queues.js:95:5)
17:46:36       at async Context.<anonymous> (tests/api-testing/REST/ContributionsCount.js:117:4)
17:46:36 
17:46:36   2) GET /contributions/count
17:46:36        GET /user/{user}/contributions/count
17:46:36          Returns edits filtered by tag:
17:46:36 
17:46:36       AssertionError: expected 0 to deeply equal 1
17:46:36       + expected - actual
17:46:36 
17:46:36       -0
17:46:36       +1
17:46:36       
17:46:36       at testGetContributionsCountByTag (tests/api-testing/REST/ContributionsCount.js:60:10)
17:46:36       at processTicksAndRejections (internal/process/task_queues.js:95:5)
17:46:36       at async Context.<anonymous> (tests/api-testing/REST/ContributionsCount.js:151:4)
17:46:36

It seems this is a case of the API request for contributions count being handled before the data is committed in MySQL. There are already existing uses of sleep() in the api-testing codebase that work around similar problems but it would be better to have a try/retry loop for n seconds, like wdio's waitUntil() method for browser tests.

In the meantime, I'll submit a patch to disable this test.

Event Timeline

Change 760519 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] api-testing: Skip flaky test

https://gerrit.wikimedia.org/r/760519

Change 760519 merged by jenkins-bot:

[mediawiki/core@master] api-testing: Skip flaky test

https://gerrit.wikimedia.org/r/760519

Change 760952 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] api-testing: Skip flaky contributions count test

https://gerrit.wikimedia.org/r/760952

Change 760952 merged by jenkins-bot:

[mediawiki/core@master] api-testing: Skip flaky contributions count test

https://gerrit.wikimedia.org/r/760952

From a comment by Timo on T230211:

If it does need to work with php-fpm, then I recommend setting postSendStrategy = DEFER_SET_LENGTH_AND_FLUSH instead of postSendStrategy = DEFER_FASTCGI_FINISH_REQUEST, which will give you the same treatment. We could make this configurable and perhaps even the default for CI and DevelopmentSettings.php.

daniel triaged this task as High priority.Feb 9 2022, 12:26 PM

I'm sorry, it's possible that these tests aren't the problem, because now I see that the Quibble extensions fullrun build is failing in the PHPUnit integration test phase. Let me see if I can figure out what is wrong with the Quibble build, e.g. https://integration.wikimedia.org/ci/job/integration-quibble-fullrun-extensions/10/console#console-section-13

Change 761320 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] DeferredUpdates: configure update strategy.

https://gerrit.wikimedia.org/r/761320

Change 761320 merged by jenkins-bot:

[mediawiki/core@master] DeferredUpdates: Introduce $wgForceDeferredUpdatesPreSend

https://gerrit.wikimedia.org/r/761320

Change 763262 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] api-testing: Disable failing contributions count tests

https://gerrit.wikimedia.org/r/763262

So my patch for forcing deferred updates to happen immediately didn't fix the issue with ContributionsCount.js.

Looking at the code some more, this seems plausible: user contributions are computed from the revision table(*), which is always updated immediately by an edit. Updates to links tables and such are deferred, and updating the search index may be done via the job queue, but the revision table should always have the latest data immediately.

This leaves replication lag as a possible cause - but we are not using a DB replica on the text instance.

So we must be hitting a cache somewhere. But where?

(*) use_editcount exists and it's updated deferred, by UserEditCountUpdate. But ContributionsCountHandler uses ContributionsLookup::getContributionCount() which does a COUNT(*) on the revision table.

Change 763262 merged by jenkins-bot:

[mediawiki/core@master] api-testing: Disable failing contributions count tests

https://gerrit.wikimedia.org/r/763262

So my patch for forcing deferred updates to happen immediately didn't fix the issue with ContributionsCount.js.

Looking at the code some more, this seems plausible: user contributions are computed from the revision table(*), which is always updated immediately by an edit. Updates to links tables and such are deferred, and updating the search index may be done via the job queue, but the revision table should always have the latest data immediately.

This leaves replication lag as a possible cause - but we are not using a DB replica on the text instance.

So we must be hitting a cache somewhere. But where?

(*) use_editcount exists and it's updated deferred, by UserEditCountUpdate. But ContributionsCountHandler uses ContributionsLookup::getContributionCount() which does a COUNT(*) on the revision table.

Is ChronologyProtector somehow involved?

Is ChronologyProtector somehow involved?

I don't see how. ChronologyProtector prevents a user seeing inconsistent state because of replication lag. But we don't have DB replication on the CI system. Besides, ChronologyProtector would be insufficient to prevent inconsistencies between multiple users, which would be needed for consistent testing.