MediaWiki gate takes 20 minutes for extension tests and at least 1.5 hours for a patch
Closed, ResolvedPublic

Description

Over 1.5h to merge this patch -- https://gerrit.wikimedia.org/r/#/c/269083/

mediawiki-extensions-php53 SUCCESS in 20m 55s
mediawiki-extensions-hhvm SUCCESS in 13m 17s

From our graph that tracks the amount of time a change spends in gate-and-submit ( https://grafana.wikimedia.org/dashboard/db/releng-kpis?panelId=2&fullscreen ):

hashar created this task.Feb 8 2016, 9:52 PM
hashar updated the task description.
hashar raised the priority of this task to Needs Triage.
hashar added subscribers: hashar, bd808.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 8 2016, 9:52 PM
hashar added a comment.Feb 8 2016, 9:58 PM

Part of the slowness can be explained by the addition of Scribunto to the shared job. The change https://gerrit.wikimedia.org/r/267458 has been merged on Sat Feb 6 18:04:45 2016 UTC (was T125050).

Another reason, which Timo mentioned, is that we have only four Precise slaves that run the php53 jobs, and those jobs are throttled to one build per node.

The low-hanging fruit is to add more Precise slaves. We should also look at whether Scribunto really needs to be part of the shared job.

Later ideas:

  • only trigger the shared job for changes to mediawiki itself
  • for changes to extensions, teach MediaWiki the ability to run only the structure tests and that specific extension's tests (i.e. not the tests from dependent extensions)

Pooled four more Precise instances with 2 CPUs each; they will have two executors apiece, letting Jenkins spread some jobs across 8 slots instead of 4.

Nodes:

integration-slave-precise-1011
integration-slave-precise-1012
integration-slave-precise-1013
integration-slave-precise-1014

It is going to take time for Puppet to fully provision them.

I have applied role::ci::slave::labs and Puppet is running on all four instances.

I have added the instances as Jenkins slaves and put them offline.

Once Puppet is done, we can mark them online in Jenkins and then monitor that the jobs running on them work properly.

Change 269327 had a related patch set uploaded (by Legoktm):
Revert "[Scribunto] Add template extension-gate to Scribunto"

https://gerrit.wikimedia.org/r/269327

Not sure if this is related: https://gerrit.wikimedia.org/r/#/c/269171/ "only" took 43 minutes to go through the zuul queue, but was V-1ed because https://integration.wikimedia.org/ci/job/mediawiki-extensions-php53/738/console timed out (failed to finish within 30 minutes).

Change 269327 merged by jenkins-bot:
Revert "[Scribunto] Add template extension-gate to Scribunto"

https://gerrit.wikimedia.org/r/269327

hashar added a comment.Feb 9 2016, 1:17 AM

So in short:

  • 4 new 2-CPU Precise slaves have been added to help process the php53 jobs
  • Scribunto has been removed from the mediawiki-extensions-* jobs; its tests just took too long

We really want Wikibase tests to run when a patch is proposed on mediawiki/core, so that dependency is left in place.

A bunch of patches got force-merged, which deadlocked Zuul entirely for 5 minutes per merged commit ( T93812 ). I have live-hacked Zuul on gallium by editing /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py:

- replication_timeout = 300
+ replication_timeout = 10

The immediate issue is fixed, but the snowball effect of doom will happen again eventually.
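For reference, the live hack above amounts to a one-line substitution in the installed module. A hedged sketch, written as a function so the target file is a parameter instead of sudo-editing the deployed copy directly (the function name is mine, not from the task):

```shell
# Sketch of the live hack described above: lower Zuul's Gerrit replication
# timeout from 300s to 10s by patching gerrit.py in place.
lower_replication_timeout() {
  # $1: path to zuul/trigger/gerrit.py (or a scratch copy of it)
  sed -i 's/replication_timeout = 300/replication_timeout = 10/' "$1"
}
```

On gallium the path would be /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py, and Zuul would need a restart to pick up the change.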

greg added a subscriber: greg.Feb 9 2016, 6:16 AM

Updated graph:

Later ideas:

  • only trigger the shared job for changes to mediawiki itself

The nature of MediaWiki extensions is that a change to most extensions can make the PHP tests of core or another extension fail. With appropriate hook API design and accompanying integration tests this could be avoided, but I assume we are not anywhere close.

A good way to decouple code is to move it to a component.

An easy way to speed up the tests is to parallelize them: run multiple PHPUnit processes, each covering a different part of the test suite. This can also be spread over multiple machines if a deterministic sharding function is used.
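The deterministic sharding idea can be sketched in a few lines of shell (the function name and shard count are illustrative): hash each test file's name and take it modulo the shard count, so every machine independently agrees on who runs what.

```shell
# shard_of <name> <nshards>: print which shard a test file lands on.
# The same name always maps to the same shard, so separate machines
# need no coordination to split the suite between them.
shard_of() {
  name="$1"; nshards="$2"
  # cksum produces a stable CRC of the name; take it modulo the shard count.
  sum=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
  echo $(( sum % nshards ))
}

# Worker i (0..N-1) would then run only its slice, roughly:
#   for f in tests/phpunit/includes/*Test.php; do
#     [ "$(shard_of "$f" 4)" = "$i" ] && php tests/phpunit/phpunit.php "$f"
#   done
```

Hash-based sharding is simple but can produce uneven shards; recording per-file timings and bin-packing would balance better at the cost of shared state.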

Perhaps our tests can be optimized to be faster.

Anomie added a subscriber: Anomie.Feb 9 2016, 4:32 PM
  • Scribunto has been removed from the mediawiki-extensions-* jobs. That just took too long

Well, you did add 20% more tests (it looks like the total went from 10679 to 12980 for mediawiki-extensions-hhvm).

Another thing you might try is excluding the LuaStandalone group. That skips about half of Scribunto's tests, but that half seems to take about 75% of the time.[1] Unless the core change being tested somehow impacts shelling out to external processes, it's unlikely to affect LuaStandalone but not LuaSandbox.

It also wouldn't hurt to have the CI infrastructure configured to use $wgScribuntoDefaultEngine = 'luasandbox'; if it's not already, to speed up any tests (e.g. parser tests) that use Scribunto without explicitly specifying an engine.

[1]: When I ran Scribunto's tests locally just now, it took 2.26 minutes for all tests but only 34.67 seconds with --exclude-group LuaStandalone and 1.69 minutes with --group LuaStandalone.
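The three timed runs in [1] correspond to invocations along these lines, using MediaWiki's PHPUnit wrapper and its standard --group/--exclude-group flags (the exact paths are assumptions about the local checkout):

```shell
# Wrapper for the three Scribunto runs compared above; $1 passes extra
# PHPUnit arguments through to MediaWiki's phpunit.php entry point.
run_scribunto_tests() {
  php tests/phpunit/phpunit.php $1 extensions/Scribunto/tests/phpunit
}

# run_scribunto_tests ""                               # all tests
# run_scribunto_tests "--exclude-group LuaStandalone"  # skip standalone Lua
# run_scribunto_tests "--group LuaStandalone"          # only standalone Lua
```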

greg triaged this task as High priority.Feb 17 2016, 5:17 PM
greg lowered the priority of this task from High to Normal.

Setting to Normal as we fixed the problem for now, but we're working to get Scribunto back into the pipeline (see blockers).

hashar closed this task as Resolved.Nov 4 2016, 9:02 AM
hashar claimed this task.

A lot of the issue is fixed by:

modules/zuul/manifests/server.pp:    $gerrit_event_delay = '5',

For Scribunto there is T126670, and we have other open tasks about the PHPUnit tests being painfully slow.