Wikibase REL1_44 gate-and-submit always times out
Closed, Resolved (Public)

Description

I’ve tried to merge this backport four times, and each time two jobs in the gate-and-submit build timed out (quibble-composer-mysql-php81-noselenium and quibble-composer-mysql-php82-noselenium; the php83 version just barely finished in time when it wasn’t success-cached already). The failing builds reach between 81% and 98% of the “PHPUnit extensions suite (with database)” stage. During the last two builds, Zuul was practically empty (completely empty for the last build), so it’s not due to high load on CI in general.

We need to find a solution for this – we can’t have a release branch that we’re unable to backport fixes to without force-merging them.

Event Timeline

As far as I can tell, these tests don’t run in parallel (presumably we didn’t enable parallel test runs on non-master branches yet), which leaves me with the horrible suspicion that our tests got so much slower (or we added so many more tests?) that they’re now unable to complete when not parallelized – which sounds like a problem in general, because it implies we can no longer safely turn off parallel tests on the master branch either…

(From the PHP 8.3 tests that passed)

01:26:03 You should really speed up these slow tests (>100ms)...
[…]
01:26:03  2. 6298ms to run Wikibase\\Repo\\Tests\\Api\\SetClaimTest::testAddClaim
[…]
01:26:03  4. 4187ms to run Wikibase\\Repo\\Tests\\Api\\RemoveQualifiersTest::testRequests

These tests run much faster on my machine (which is probably slower than the CI VMs and I'm on PHP 8.1 only):

1. 1733ms to run Wikibase\\Repo\\Tests\\Api\\SetClaimTest::testAddClaim
1. 1330ms to run Wikibase\\Repo\\Tests\\Api\\RemoveQualifiersTest::testRequests
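
To put a number on that gap, the two pairs of timings can be compared directly. This is only a rough sketch: the dictionary and function names are made up for illustration, and the values are copied from the CI log and my local run above.

```python
# Timings in milliseconds, copied from the CI log (PHP 8.3) and the local run.
ci_ms = {
    "SetClaimTest::testAddClaim": 6298,
    "RemoveQualifiersTest::testRequests": 4187,
}
local_ms = {
    "SetClaimTest::testAddClaim": 1733,
    "RemoveQualifiersTest::testRequests": 1330,
}


def slowdown(ci, local):
    """Return how many times slower each test is on CI than locally."""
    return {name: ci[name] / local[name] for name in ci}


ratios = slowdown(ci_ms, local_ms)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x slower on CI")
```

Both tests come out a bit more than three times slower on CI than locally.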

Given that these tests are integration tests that do page edits, I suspect that one of the extensions enabled during the test run considerably slows down editing (via a hook or in deferred updates).

The pulled in extensions for Wikibase are extensive:

["mediawiki/core", "mediawiki/extensions/AbuseFilter", "mediawiki/extensions/AntiSpoof", "mediawiki/extensions/ArticlePlaceholder", "mediawiki/extensions/BetaFeatures", "mediawiki/extensions/CentralAuth", "mediawiki/extensions/CheckUser", "mediawiki/extensions/CirrusSearch", "mediawiki/extensions/Cite", "mediawiki/extensions/CodeEditor", "mediawiki/extensions/CommunityConfiguration", "mediawiki/extensions/CommunityConfigurationExample", "mediawiki/extensions/ConfirmEdit", "mediawiki/extensions/DiscussionTools", "mediawiki/extensions/Echo", "mediawiki/extensions/Elastica", "mediawiki/extensions/EmailAuth", "mediawiki/extensions/EventBus", "mediawiki/extensions/EventLogging", "mediawiki/extensions/EventStreamConfig", "mediawiki/extensions/FlaggedRevs", "mediawiki/extensions/Flow", "mediawiki/extensions/Gadgets", "mediawiki/extensions/GeoData", "mediawiki/extensions/GlobalBlocking", "mediawiki/extensions/GlobalPreferences", "mediawiki/extensions/Graph", "mediawiki/extensions/GrowthExperiments", "mediawiki/extensions/GuidedTour", "mediawiki/extensions/IPInfo", "mediawiki/extensions/IPReputation", "mediawiki/extensions/JsonConfig", "mediawiki/extensions/Kartographer", "mediawiki/extensions/Linter", "mediawiki/extensions/LoginNotify", "mediawiki/extensions/MobileApp", "mediawiki/extensions/MobileFrontend", "mediawiki/extensions/OATHAuth", "mediawiki/extensions/PageImages", "mediawiki/extensions/PageViewInfo", "mediawiki/extensions/ParserFunctions", "mediawiki/extensions/PdfHandler", "mediawiki/extensions/Popups", "mediawiki/extensions/PropertySuggester", "mediawiki/extensions/Renameuser", "mediawiki/extensions/Scribunto", "mediawiki/extensions/SecurePoll", "mediawiki/extensions/SiteMatrix", "mediawiki/extensions/SpamBlacklist", "mediawiki/extensions/SyntaxHighlight_GeSHi", "mediawiki/extensions/TemplateData", "mediawiki/extensions/TextExtracts", "mediawiki/extensions/Thanks", "mediawiki/extensions/TimedMediaHandler", "mediawiki/extensions/TorBlock", 
"mediawiki/extensions/UniversalLanguageSelector", "mediawiki/extensions/VisualEditor", "mediawiki/extensions/WikiEditor", "mediawiki/extensions/Wikibase", "mediawiki/extensions/WikibaseCirrusSearch", "mediawiki/extensions/WikibaseLexeme", "mediawiki/extensions/WikibaseLexemeCirrusSearch", "mediawiki/extensions/WikibaseMediaInfo", "mediawiki/extensions/WikibaseQualityConstraints", "mediawiki/extensions/WikimediaBadges", "mediawiki/extensions/WikimediaEvents", "mediawiki/extensions/WikimediaMessages", "mediawiki/extensions/cldr", "mediawiki/skins/MinervaNeue", "mediawiki/skins/Vector"]

Off the top of my head, I'd be suspicious of the AbuseFilter > CheckUser > GrowthExperiments > CirrusSearch etc. chain; it feels like Wikibase really doesn't care about most of those.

> As far as I can tell, these tests don’t run in parallel (presumably we didn’t enable parallel test runs on non-master branches yet),

That is correct! The PHPUnit parallelism is not enabled for release branches. From integration/config:

# Enable parallel PHPUnit runs for MW ecosystem, except:
if (
    # ... temporarily exclude extensions that have issues
    # with parallel tests
    params["ZUUL_PROJECT"] not in [
        # DonationInterface uses a different branching model. Its master
        # branch is tested with mediawiki/core fundraising/REL1_43 branch
        # which does not have the parallel work.
        "mediawiki/extensions/DonationInterface",
    ]
    # ... exclude on REL_ branches (not yet tested/patched),
    and "ZUUL_BRANCH" in params

    # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    and not params["ZUUL_BRANCH"].startswith("REL1")  # <------ this rejects REL1_44
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Exclude fundraising branches and specific jobs
    and not params["ZUUL_BRANCH"].startswith("fundraising") 
    and not job.name.startswith("quibble-fundraising")
):
    params['QUIBBLE_PHPUNIT_PARALLEL'] = '1'
    params['MW_RESULTS_CACHE_SERVER_BASE_URL'] = \
        'https://phpunit-results-cache.toolforge.org/results'

The parallelism code is in MediaWiki core and has not been backported to the older release branches, but maybe it should be. That could be handled in a separate task (cc @kostajh and @ArthurTaylor).

The code is however available in REL1_44; let's just enable it there for now?
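
To make the effect of that `REL1` check concrete, here is a small standalone sketch of the same condition as a plain function. It is only an illustration under assumptions: the function name and the `extra_rel_branches` allow-list are hypothetical and are not taken from integration/config, and the actual change may well be implemented differently.

```python
def parallel_phpunit_enabled(params, job_name, extra_rel_branches=()):
    """Mirror the condition quoted above from integration/config.

    extra_rel_branches is a hypothetical allow-list of REL1_* branches,
    added here only to show what "enable it for REL1_44" could mean.
    """
    excluded_projects = ["mediawiki/extensions/DonationInterface"]
    if params["ZUUL_PROJECT"] in excluded_projects:
        return False
    branch = params.get("ZUUL_BRANCH")
    if branch is None:
        return False
    # Release branches are rejected unless explicitly allow-listed.
    if branch.startswith("REL1") and branch not in extra_rel_branches:
        return False
    # Fundraising branches and jobs are always excluded.
    if branch.startswith("fundraising"):
        return False
    if job_name.startswith("quibble-fundraising"):
        return False
    return True


params = {
    "ZUUL_PROJECT": "mediawiki/extensions/Wikibase",
    "ZUUL_BRANCH": "REL1_44",
}
job_name = "quibble-composer-mysql-php81-noselenium"

current = parallel_phpunit_enabled(params, job_name)
allowed = parallel_phpunit_enabled(
    params, job_name, extra_rel_branches=("REL1_44",)
)
print(current, allowed)
```

With the condition as quoted, the Wikibase REL1_44 job is rejected; with REL1_44 allow-listed, the same job qualifies for parallel runs, while other release branches such as REL1_43 stay excluded.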

Change #1146012 had a related patch set uploaded (by Hashar; author: Jforrester):

[integration/config@master] Zuul: Enable parallel PHPUnit runs for MW ecosystem REL1_44

https://gerrit.wikimedia.org/r/1146012

Change #1146012 merged by jenkins-bot:

[integration/config@master] Zuul: Enable parallel PHPUnit runs for MW ecosystem REL1_44

https://gerrit.wikimedia.org/r/1146012

OK, with parallel PHPUnit enabled they now pass CI and will merge.

However, I'm a little worried about what this suggests — the point of parallel PHPUnit was to run the same tests faster so people wait less for CI to finish, not to allow teams to balloon the number of tests still further without triggering the timeout. REL1_43 didn't have this parallel feature, and merges fine still. Is this a sign that too many big/slow tests have been written/adjusted in the last six months' development?

There are more tests, and more extension dependencies are being added as well. Timo wrote a problem statement at T389998, which would ultimately remove the recursion when processing extension dependencies.

Some progress was made; I even wrote a script ( https://gerrit.wikimedia.org/r/c/integration/config/+/1132644 ) to help mass-run jobs and verify that they still pass after removing the recursion. I have kind of lost track of it since the initial push in early April.

I think T389998 is a good mid-term solution.

For this task, T393869, enabling parallel testing for REL1_44 is the immediate fix, and I am inclined to mark it resolved now that CI passes :)

A_smart_kitten subscribed.

> For this task, T393869, enabling parallel testing for REL1_44 is the immediate fix, and I am inclined to mark it resolved now that CI passes :)

I agree - this task reports timeouts on Wikibase's REL1_44's gate-and-submit, and (from looking at the results of this Gerrit search) I (thankfully) haven't noticed any further instances of jobs in that repo+branch timing out after 1hr.
Any follow-up tasks can be filed as desired :)