
CI performance issues
Closed, Duplicate · Public

Description

https://integration.wikimedia.org/zuul/ is currently showing puppet CRs queued for up to 18 minutes on the test-prio queue.
This performance issue is blocking the normal workflow of the SRE team.

Details

Related Gerrit Patches:
mediawiki/core : master : Make LocalisationCache a service
mediawiki/core : REL1_34 : Make LocalisationCache a service
integration/config : master : Move puppet jobs to dedicated small node
mediawiki/core : master : Revert "Make LocalisationCache a service"

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 26 2019, 1:44 PM
Joe triaged this task as Unbreak Now! priority. · Aug 26 2019, 1:46 PM
Joe added a subscriber: Joe.

For context, the actual time to run the tests for operations/puppet is under one minute for most patches.

Either Zuul or Jenkins is broken, and this has been a constant pain over the last few weeks for everyone involved.

Triaging to UBN! because this is effectively an outage of the service.

Restricted Application added a subscriber: Liuxinyu970226. · Aug 26 2019, 1:46 PM

Change 532399 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Revert "Make LocalisationCache a service"

https://gerrit.wikimedia.org/r/532399

Change 532437 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Move puppet jobs to dedicated small node

https://gerrit.wikimedia.org/r/532437

Change 532399 merged by jenkins-bot:
[mediawiki/core@master] Revert "Make LocalisationCache a service"

https://gerrit.wikimedia.org/r/532399

Restricted Application removed a subscriber: Liuxinyu970226. · Aug 27 2019, 7:22 AM
hashar added a subscriber: hashar.

The root cause was a faulty patch merged into mediawiki/core on Friday. It roughly doubled the time it takes to run the tests, so that, e.g., Wikibase changes occupied an execution slot for up to an hour. That in turn starved the very thin pool of executors we currently have, which delayed execution of jobs for non-MediaWiki repositories.

Anyway, that has been fixed by reverting the faulty code.
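
To illustrate the starvation effect described above, here is a minimal sketch of a shared executor pool. Everything in it is an assumption made for illustration: the function name `queue_waits`, the strict first-come-first-served scheduling, the pool size of two, and all arrival times and durations. It is not how Zuul or Jenkins actually schedule jobs, and the numbers are not measurements from this incident.

```python
import heapq


def queue_waits(jobs, executors):
    """Return how long each job waits for a slot under FIFO scheduling.

    jobs: list of (arrival_minute, duration_minutes), sorted by arrival.
    executors: number of execution slots in the shared pool (hypothetical).
    """
    # Min-heap holding the time at which each slot next becomes free.
    free_at = [0] * executors
    heapq.heapify(free_at)
    waits = []
    for arrival, duration in jobs:
        slot_free = heapq.heappop(free_at)   # earliest available slot
        start = max(arrival, slot_free)      # cannot start before arriving
        waits.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return waits


def puppet_wait(long_job_minutes, executors=2):
    """Wait of a 1-minute job that arrives behind four long-running jobs."""
    # Hypothetical workload: long jobs every 20 minutes, then one short job.
    jobs = [(0, long_job_minutes), (20, long_job_minutes),
            (40, long_job_minutes), (60, long_job_minutes),
            (70, 1)]
    return queue_waits(jobs, executors)[-1]
```

With these made-up numbers, `puppet_wait(30)` is 0 minutes, while doubling the long jobs to `puppet_wait(60)` makes the same one-minute job wait 50 minutes. A short job on its own dedicated slot, `queue_waits([(70, 1)], executors=1)`, does not wait at all, which is the intuition behind moving puppet jobs to a dedicated node.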

Change 532679 had a related patch set uploaded (by simetrical; owner: simetrical):
[mediawiki/core@master] Make LocalisationCache a service

https://gerrit.wikimedia.org/r/532679

Change 532437 merged by jenkins-bot:
[integration/config@master] Move puppet jobs to dedicated small node

https://gerrit.wikimedia.org/r/532437

Change 532679 merged by jenkins-bot:
[mediawiki/core@master] Make LocalisationCache a service

https://gerrit.wikimedia.org/r/532679

Change 541624 had a related patch set uploaded (by Jforrester; owner: simetrical):
[mediawiki/core@REL1_34] Make LocalisationCache a service

https://gerrit.wikimedia.org/r/541624

Change 541624 merged by jenkins-bot:
[mediawiki/core@REL1_34] Make LocalisationCache a service

https://gerrit.wikimedia.org/r/541624