
Fix flaky unit test "TextPassDumperTest::testCheckpointGzip"
Open, Medium, Public

Description

The computer in question was definitely not too fast. What's actually going on?

  1. TextPassDumperTest::testCheckpointPlain

expected more than 1 checkpoint to have been created. Checkpoint interval is 0.5 seconds, maybe your computer is too fast?
Failed asserting that 1 is greater than 1.

/srv/vagrant/mediawiki/tests/phpunit/maintenance/backupTextPassTest.php:397
/srv/vagrant/mediawiki/tests/phpunit/maintenance/backupTextPassTest.php:406
/srv/vagrant/mediawiki/tests/phpunit/MediaWikiTestCase.php:133
/srv/vagrant/mediawiki/tests/phpunit/MediaWikiPHPUnitCommand.php:42
/srv/vagrant/mediawiki/tests/phpunit/phpunit.php:160

  1. TextPassDumperTest::testCheckpointGzip

expected more than 1 checkpoint to have been created. Checkpoint interval is 0.5 seconds, maybe your computer is too fast?
Failed asserting that 1 is greater than 1.

/srv/vagrant/mediawiki/tests/phpunit/maintenance/backupTextPassTest.php:397
/srv/vagrant/mediawiki/tests/phpunit/maintenance/backupTextPassTest.php:423
/srv/vagrant/mediawiki/tests/phpunit/MediaWikiTestCase.php:133
/srv/vagrant/mediawiki/tests/phpunit/MediaWikiPHPUnitCommand.php:42
/srv/vagrant/mediawiki/tests/phpunit/phpunit.php:160


Also seen at https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/37/console
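
For context on the failure message, here is a minimal sketch of why a wall-clock checkpoint interval makes the number of checkpoints depend on how long the run takes. This is not MediaWiki's TextPassDumper code: `count_checkpoints`, its parameters, and the simulated per-page durations are hypothetical and chosen only for illustration. The report above says the "computer too fast" explanation does not apply here, so the sketch only shows the situation the assertion is meant to guard against, not the actual cause.

```python
def count_checkpoints(page_durations, interval=0.5):
    """Simplified model of a dumper that rolls over to a new checkpoint
    file whenever `interval` seconds have passed since the last rollover.

    page_durations: simulated seconds spent on each page (hypothetical).
    Returns the number of checkpoint files that would be created.
    """
    checkpoints = 1           # the first checkpoint file is opened at start
    elapsed_since_last = 0.0
    for duration in page_durations:
        elapsed_since_last += duration
        if elapsed_since_last >= interval:
            checkpoints += 1  # interval elapsed: start a new checkpoint file
            elapsed_since_last = 0.0
    return checkpoints


slow_run = [0.2] * 5    # 1.0 s in total, crosses the 0.5 s interval once
fast_run = [0.05] * 5   # 0.25 s in total, never reaches the interval

print(count_checkpoints(slow_run))  # 2
print(count_checkpoints(fast_run))  # 1
```

In this model, a run that finishes within the interval yields a single checkpoint, which is what "Failed asserting that 1 is greater than 1" reports.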


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 3:39 AM
bzimport set Reference to bz68653.
bzimport added a subscriber: Unknown Object (MLST).

Seems to be a standalone issue, hence not blocking Bug 67216 - Have unit tests of all wmf deployed extensions pass when installed together.

Nemo_bis set Security to None.
Nemo_bis added a subscriber: awight.
gerritbot added a subscriber: gerritbot.

Change 190173 had a related patch set uploaded (by Krinkle):
backupTextPassTest: Disable checkpointHelper test

https://gerrit.wikimedia.org/r/190173

Patch-For-Review

Change 190173 merged by jenkins-bot:
backupTextPassTest: Disable checkpointHelper test

https://gerrit.wikimedia.org/r/190173

Krinkle renamed this task from TextPassDumperTest::testCheckpointGzip expected more than 1 checkpoint to have been created to Fix flaky unit test "TextPassDumperTest::testCheckpointGzip". Feb 13 2015, 2:12 AM
Krinkle lowered the priority of this task from High to Medium.
Krinkle removed subscribers: Krinkle, Unknown Object (MLST).
Krinkle added a subscriber: Krinkle.

The test has been disabled. The Continuous-Integration-Infrastructure blocker has been resolved. Retriaged as a MediaWiki core issue: fixing this unit test.

Reassigning per QChris:

I did not even remember I did :-D I'll only find time to look at it over the weekend.

Please unassign if you don't have time for it. :)

Thanks @Krinkle and @QChris. I was rather conservative; it turns out the test has been failing quite often, so it is indeed safer to disable it for now.

Change 190953 had a related patch set uploaded (by QChris):
Fix and re-enable Dumps' checkpoint tests

https://gerrit.wikimedia.org/r/190953

Patch-For-Review

Change 191814 had a related patch set uploaded (by QChris):
Allow to set stub read buffer size for TextPassDumper

https://gerrit.wikimedia.org/r/191814

Patch-For-Review

Change 191814 merged by jenkins-bot:
Allow to set stub read buffer size for TextPassDumper

https://gerrit.wikimedia.org/r/191814

Change 190953 merged by jenkins-bot:
Fix and re-enable Dumps' checkpoint tests

https://gerrit.wikimedia.org/r/190953

Failing again.

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/4787/console
For unrelated patch set https://gerrit.wikimedia.org/r/#/c/204426/

02:55:39 There was 1 failure:
02:55:39 
02:55:39 1) TextPassDumperDatabaseTest::testCheckpointGzip
02:55:39 Skipping past end of siteinfo
02:55:39 Failed asserting that false is true.
02:55:39 
02:55:39 /mnt/jenkins-workspace/workspace/mediawiki-phpunit-zend/src/tests/phpunit/maintenance/DumpTestCase.php:178
02:55:39 /mnt/jenkins-workspace/workspace/mediawiki-phpunit-zend/src/tests/phpunit/maintenance/backupTextPassTest.php:331
02:55:39 /mnt/jenkins-workspace/workspace/mediawiki-phpunit-zend/src/tests/phpunit/maintenance/backupTextPassTest.php:441
02:55:39 /mnt/jenkins-workspace/workspace/mediawiki-phpunit-zend/src/tests/phpunit/MediaWikiTestCase.php:131

Change 204447 had a related patch set uploaded (by Krinkle):
backupTextPassTest: Disable testCheckpointGzip test

https://gerrit.wikimedia.org/r/204447

Change 204447 merged by jenkins-bot:
backupTextPassTest: Disable testCheckpointGzip test

https://gerrit.wikimedia.org/r/204447

QChris added a subscriber: QChris.

Failing again.

The "Skipping past end of siteinfo" log message means that no </siteinfo> could be found in the dump file.
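
If a failing run's dump file can be kept for inspection, a small check like the following could confirm whether the closing tag is really missing. This is only a sketch: `has_siteinfo_end` is a hypothetical helper, not part of the MediaWiki test suite, and it assumes that gzip-compressed dumps can be recognised by a `.gz` suffix.

```python
import gzip


def has_siteinfo_end(path):
    """Return True if the dump file at `path` contains a closing </siteinfo> tag.

    Assumption: files ending in ".gz" are gzip-compressed, anything else is plain text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as dump:
        # Scan line by line so that large dumps are not loaded into memory at once.
        return any("</siteinfo>" in line for line in dump)
```

A result of False for a checkpoint file would match the condition the log message describes, for example a file that was read before its header had been fully written.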

Did this happen more often recently or was this just a one-off fluke?

(I tried to reproduce it locally, but couldn't. I also tried to reproduce it through WMF's Jenkins in https://gerrit.wikimedia.org/r/#/c/204645/ but couldn't either.)

I saw it happen on two commits in one day. Two isn't much, but I'm applying zero tolerance. A few sources of flakiness recently: MediaWiki core, HHVM, Wikimedia Labs networking and disk I/O, Jenkins deadlocks, Zuul deadlocks, Gerrit event stream, npm/composer cache corruption, Xvfb conflicts, corrupt DNS resolution. This makes for an unstable platform that is impossible to maintain (for CI admins) and unpleasant to work with (for MediaWiki developers and SWAT deployers).

I am happy to help put in place additional debugging for when it fails again. Race conditions are hard. But it's only one of dozens of race conditions we have dealt with lately. I'm working on making the wider platform stable, as we have yet to see a 24-hour period without a CI interruption of some sort (since early 2014).

Here's another one:

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm/10886/consoleFull

1) TextPassDumperDatabaseTest::testCheckpointPlain
Skipping past end of siteinfo
Failed asserting that false is true.

I'm going to ping @QChris again since he's around and active.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:12 PM
Krinkle changed the subtype of this task from "Production Error" to "Task".