
mwext-mw-selenium is slowing down gate-and-submit pipeline
Closed, Resolved · Public

Description

The value that CI brings to developers is the fast feedback loop. If the slowdown of selenium is acceptable to the mobile team, that is their choice. Due to the test pipeline being independent, this does not affect other repositories.

However the gate-and-submit pipeline is blocking. The way mwext-mw-selenium is currently configured makes it participate in the global blocking "mediawiki" queue for the gate-and-submit pipeline.

This job is blocking all merges in all branches of all mediawiki-* repositories.

This is in my opinion unacceptable and unrealistic when compared to the reality of limited time in a day. We can't be spending 20-30 minutes to merge a single change.

I'm recommending immediate disabling of this job in the gate pipeline (test and postmerge are fine). We should also establish a time-based performance budget for how long the blocking queue is allowed to take. We cannot be endlessly adding jobs and filing technical debt to improve it at some point. I propose zero-tolerance and revert of jobs that exceed this budget. We have to draw a line somewhere.

It is then up to the stakeholders of mwext-mw-selenium to iterate further so it abides by this budget. A few different ideas for how to do this:

  1. Make it faster (< 8 minutes on average). Though unlikely in the short term.
  2. Make it enforced socially instead of technically. This can be done by having it only in the "test" and "postmerge" pipelines (which are asynchronous). This way its result is still commented on Gerrit and results in V-2 (if the reviewer waits for it), but it is not in the gate-and-submit pipeline, so it doesn't block global merges.
  3. Disable the "Dependent Pipeline" functionality of the gate pipeline. This functionality has very little added value at a very high cost (one repo's merges blocking merges in others). It's nice in theory, but should not have been enabled until our workflow and job configurations are compatible with that paradigm. (T94322)
  4. Leave the dependent gate pipeline in place, but exclude mediawiki-extensions-MobileFrontend from it (by prefixing jobs, thus creating a separate global queue).
  5. ... something else that keeps mobile merges from blocking the global mediawiki gate queue for over 10 minutes per commit.

Event Timeline

Krinkle created this task. Sep 23 2015, 5:58 PM
Krinkle raised the priority of this task to Unbreak Now!.
Krinkle updated the task description.
Krinkle added subscribers: Krinkle, Jdforrester-WMF, greg and 2 others.
Restricted Application added a subscriber: Aklapper. Sep 23 2015, 5:58 PM
Legoktm updated the task description. Sep 23 2015, 6:53 PM
Legoktm set Security to None.
Legoktm added a subscriber: Legoktm.

(Changed bullets to numbers to make them easier to reference)

#2 seems like the best solution for now, though I'm still in favor of #3.

hashar renamed this task from "[Regression] mwext-mw-selenium is slowing down developer productivity" to "mwext-mw-selenium is slowing down developer productivity". Sep 23 2015, 7:01 PM
hashar lowered the priority of this task from Unbreak Now! to Medium.
dduvall renamed this task from "mwext-mw-selenium is slowing down developer productivity" to "mwext-mw-selenium is slowing down gate-and-submit pipeline". Sep 23 2015, 7:19 PM

The value that CI brings to developers is the fast feedback loop. If the slowdown of selenium is acceptable to the mobile team, that is their choice. Due to the test pipeline being independent, this does not affect other repositories.
However the gate-and-submit pipeline is blocking. The way mwext-mw-selenium is currently configured makes it participate in the global blocking "mediawiki" queue for the gate-and-submit pipeline.

That is the whole purpose of the Zuul gate system: making sure that a change in one repository does not have weird side effects on another repo. What we envision eventually is to run the whole set of integration browser tests we have on each change that is a candidate for merge. As an example, a change to the EventLogging extension would have to honor the contract defined by the browser tests of all repositories. Zuul was made for that, and since it tests changes in parallel, the delay is largely reduced.
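To illustrate why testing queued changes in parallel reduces the delay, here is a toy model. It is only a sketch under simplifying assumptions, not Zuul's actual implementation: the function names are made up for illustration, every change is assumed to take the same `job_minutes`, and whether a change passes is assumed not to depend on the changes ahead of it.

```python
def serial_gate_minutes(outcomes, job_minutes):
    """Each change is tested only after the previous one has finished."""
    return len(outcomes) * job_minutes


def speculative_gate_minutes(outcomes, job_minutes):
    """All queued changes are tested at once, each on top of those ahead.

    When a change fails it is dropped and everything behind it is retested,
    which costs one more round of `job_minutes`.
    """
    queue = list(outcomes)
    elapsed = 0
    while queue:
        elapsed += job_minutes
        if False in queue:
            # Changes ahead of the first failure merge; the rest restart.
            queue = queue[queue.index(False) + 1:]
        else:
            queue = []
    return elapsed


# Six changes approved in a row, with a hypothetical 10-minute job.
all_pass = [True] * 6
one_failure = [True, True, False, True, True, True]

print(serial_gate_minutes(all_pass, 10))          # 60
print(speculative_gate_minutes(all_pass, 10))     # 10
print(speculative_gate_minutes(one_failure, 10))  # 20
```

In this model the parallel gate only pays an extra round per failing change, which is why the delay grows with failures rather than with queue length.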

This job is blocking all merges in all branches of all mediawiki-* repositories.
This is in my opinion unacceptable and unrealistic when compared to the reality of limited time in a day. We can't be spending 20-30 minutes to merge a single change.

I am not sure there are many reasons to need a change merged ASAP. At worst, a developer can cherry-pick the needed changes locally. If it is an urgent production issue, the standard is to cherry-pick directly on tin and deploy, i.e. bypass CI entirely.

Nonetheless, you are right that 15-20 minutes is too long in our current situation. In the various discussions we had, the idea was for the job to take less than the 10 minutes of the mediawiki-phpunit-zend job, so that it would have no impact on the current delay for a change to merge.
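The reasoning here is that the jobs for a single change run side by side, so the slowest one determines how long the change sits in the gate. A minimal sketch of that arithmetic, assuming enough executors for the jobs to really run in parallel: the 10 minutes is the mediawiki-phpunit-zend figure above, 22 minutes is one mwext-mw-selenium build cited further down, and 8 minutes is the target from the task description. The function name is made up for illustration.

```python
def gate_minutes(job_minutes):
    """Jobs for one change run in parallel, so the slowest one sets the pace."""
    return max(job_minutes.values())


current = {"mediawiki-phpunit-zend": 10, "mwext-mw-selenium": 22}
target = {"mediawiki-phpunit-zend": 10, "mwext-mw-selenium": 8}

print(gate_minutes(current))  # 22
print(gate_minutes(target))   # 10
```

Once the selenium job is below 10 minutes, shaving it further does not shorten the gate; the phpunit job becomes the limit.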

I'm recommending immediate disabling of this job in the gate pipeline (test and postmerge are fine). We should also establish a time-based performance budget for how long the blocking queue is allowed to take. We cannot be endlessly adding jobs and filing technical debt to improve it at some point. I propose zero-tolerance and revert of jobs that exceed this budget. We have to draw a line somewhere.

We have the KPI T108750: Implement "Jenkins/Zuul queue wait" KPI to cover the CI delay, and the technical details are in T70114: Track and graph mean time to merge. A preliminary result is the dashboard at https://grafana.wikimedia.org/#/dashboard/db/releng-zuul ; one of the graphs shows the time a mediawiki/core change spends in the Zuul queue.

Example for the last two days:

The green line shows the time a mediawiki/core change spends in gate-and-submit. You can see the overload from the last few hours, related to a lot of changes being proposed (I noticed 6 REL1_25 changes being +2ed in a row).

The idea is to get the duration in the yellow band (6mins - 10mins) and keep that stable while still adding more tests such as browser tests.

The new job is quite recent, let's give it time to improve. Looking at one of the recent long runs, there is most definitely low-hanging fruit to speed it up (such as a scenario taking a good minute before executing the first feature).

As for the budget/priority: we are moving to Nodepool / disposable instances. It is in production and we are now starting to migrate jobs. That is going to give us a dedicated instance per job, which would reduce the I/O and CPU contention we currently experience when several jobs run on the same instance. Using the labs infrastructure, and provided we have enough hardware, it would be fairly easy to scale it up to hundreds of nodes.


  1. Make it faster (< 8 minutes on average). Though unlikely in the short term.
  2. Make it enforced socially instead of technically. This can be done by having it only in the "test" and "postmerge" pipelines (which are asynchronous). This way its result is still commented on Gerrit and results in V-2 (if the reviewer waits for it), but it is not in the gate-and-submit pipeline, so it doesn't block global merges.

We tried already. Almost nobody cares about the postmerge jobs, which can be left broken for ages. And when one is broken, nobody has a clue who actually broke it, so it tends to be ignored. The only way we have is to prevent the change from being merged.

And the changes are not blocked. They are merely delayed until the gate processes them.

  3. Disable the "Dependent Pipeline" functionality of the gate pipeline. This functionality has very little added value at a very high cost (one repo's merges blocking merges in others). It's nice in theory, but should not have been enabled until our workflow and job configurations are compatible with that paradigm. (T94322)

I really wish I had noticed the change that made all repos share a common set of jobs and thus end up sharing the same queue. That broke Zuul's assumption that two repos have different queues if they share no jobs in common.

But MediaWiki and extensions all share mediawiki/core, although the different branches we have cause a bit of trouble. There is little reason for changes on master and REL1_25 to share the same queue. Then again, we don't send that many patches to release branches.

We already have the testextensions jobs that match the paradigm. We have to further improve them to test more repositories. We haven't worked on that since February as we are rethinking scap / staging|beta cluster and working on CI scaling.

I am vetoing the removal of the dependent pipeline as I already stated on T94322.

  4. Leave the dependent gate pipeline in place, but exclude mediawiki-extensions-MobileFrontend from it (by prefixing jobs, thus creating a separate global queue).

We talked about that. The problem is that extensions not sharing the same queue would cause trouble for the mwext-mw-selenium jobs, which is rather unfair.

  5. ... something else that keeps mobile merges from blocking the global mediawiki gate queue for over 10 minutes per commit.

For the non ranting part:

Let's make the mwext-mw-selenium job faster, i.e. less than mediawiki-phpunit-zend, which is 10 minutes.

We have the build time history at: https://integration.wikimedia.org/ci/job/mwext-mw-selenium/buildTimeTrend

I have picked one that took 22 minutes, https://integration.wikimedia.org/ci/job/mwext-mw-selenium/1370/console , and some scenarios take a minute to start (time relative to start of build in HH:mm:ss.msec):

00:18:43.124   @smoke @integration
00:18:43.125   Scenario: Check existence of important UI components on other pages. # features/ui_links.feature:19
00:19:51.902     Given the page "Selenium UI test" exists                           # features/step_definitions/create_page_api_steps.rb:50

More than a minute.
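Finding such gaps by eye in a long console log is tedious, so here is a small sketch that does it automatically. It assumes every relevant console line starts with an `HH:mm:ss.msec` timestamp as in the excerpt above; the function names and the 30-second threshold are arbitrary choices for illustration.

```python
import re
from datetime import timedelta

# Matches a leading "HH:mm:ss.msec" timestamp followed by the line text.
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+(.*)$")


def parse(line):
    """Return (elapsed, text) for a timestamped line, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in match.groups()[:4])
    elapsed = timedelta(hours=hours, minutes=minutes,
                        seconds=seconds, milliseconds=millis)
    return elapsed, match.group(5)


def slow_gaps(lines, threshold_seconds=30.0):
    """Return (gap_seconds, text) for lines printed long after the previous one."""
    gaps = []
    previous = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, text = parsed
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap >= threshold_seconds:
                gaps.append((gap, text))
        previous = stamp
    return gaps


console = [
    '00:18:43.124   @smoke @integration',
    '00:18:43.125   Scenario: Check existence of important UI components on other pages. # features/ui_links.feature:19',
    '00:19:51.902     Given the page "Selenium UI test" exists # features/step_definitions/create_page_api_steps.rb:50',
]

for gap, text in slow_gaps(console):
    print(f"{gap:.3f}s before: {text}")
```

On the three lines quoted above this reports a single gap of 68.777 seconds, in front of the `Given the page "Selenium UI test" exists` step.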

I think @dduvall, @zeljkofilipin and @Jdlrobson would be the interested parties. We would need to:

  • remove the @integration tag from the slowest scenarios
  • identify the cause of the slowdown in the context of mwext-mw-selenium

Should I file sub-tasks for each?

greg added a comment. Sep 23 2015, 8:33 PM

Should I file sub-tasks for each?

Yes please, let's keep things going forward positively.

Related:
T101908: gate-and-submit should not block mediawiki-config changes on mediawiki changes | https://gerrit.wikimedia.org/r/#/c/217188/

There is another one to get integration/config out of the main queue. Could not find it though.

Change 240595 had a related patch set uploaded (by Dduvall):
Remove mwext-mw-selenium from gate-and-submit

https://gerrit.wikimedia.org/r/240595

Change 240595 merged by jenkins-bot:
Remove mwext-mw-selenium from gate-and-submit

https://gerrit.wikimedia.org/r/240595

hashar closed this task as Resolved. Oct 2 2015, 1:17 PM
hashar claimed this task.

The mwext-mw-selenium job is no longer in gate-and-submit, thanks to @dduvall.

dduvall added a comment. (Edited) Oct 5 2015, 7:11 PM

It's worth pointing out that we're still seeing sharp peaks in max and mean gating, and there doesn't seem to be a correlation between these peaks and MobileFrontend + Gather gating, before or after we disabled mwext-mw-selenium. Rather, max and mean gating are tied to general gate-and-submit activity and likely compounded by the logic of the Zuul dependent pipeline that calculates a transitive dependency based on jobs in common[1][2]. We're possibly seeing the effects of the latter more prominently now that we're running more generalized jobs.
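To make the "transitive dependency based on jobs in common" concrete, here is a rough sketch of how projects could end up grouped into shared queues. It is not Zuul's code: the function name is made up, and the project-to-job mapping (including `shared-generic-job`, `config-lint` and `mobile-mw-selenium`) is hypothetical and does not reflect the real configuration.

```python
def shared_queues(project_jobs):
    """Group projects into queues.

    Projects that share a job, directly or through a chain of other
    projects, end up in the same queue.
    """
    queues = []  # list of (projects, jobs) pairs with pairwise disjoint jobs
    for project, jobs in project_jobs.items():
        merged_projects, merged_jobs = {project}, set(jobs)
        remaining = []
        for projects, queue_jobs in queues:
            if queue_jobs & merged_jobs:
                # A job in common: fold the existing queue into this one.
                merged_projects |= projects
                merged_jobs |= queue_jobs
            else:
                remaining.append((projects, queue_jobs))
        queues = remaining + [(merged_projects, merged_jobs)]
    return sorted(sorted(projects) for projects, _ in queues)


# Hypothetical job assignments, for illustration only.
generalized = {
    "mediawiki/core": {"mediawiki-phpunit-zend"},
    "MobileFrontend": {"mwext-mw-selenium", "shared-generic-job"},
    "EventLogging": {"shared-generic-job", "mediawiki-phpunit-zend"},
    "integration/config": {"config-lint"},
}
print(shared_queues(generalized))
# [['EventLogging', 'MobileFrontend', 'mediawiki/core'], ['integration/config']]

# Giving one project only uniquely named (prefixed) jobs splits it off.
prefixed = dict(generalized, MobileFrontend={"mobile-mw-selenium"})
print(shared_queues(prefixed))
# [['EventLogging', 'mediawiki/core'], ['MobileFrontend'], ['integration/config']]
```

In the first mapping MobileFrontend and mediawiki/core share no job directly, yet they land in the same queue through EventLogging, which is the kind of effect that becomes more visible as jobs get more generalized.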

What was observed in this particular case was very likely the aforementioned gating behavior, combined with the fact that Gerrit Cleanup Day on Sep 23 (the day this task was created) increased the number of concurrent merges to such a large degree.

[1] http://docs.openstack.org/infra/zuul/zuul.html (see DependentPipelineManager)
[2] T94322: Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit