
Consider moving browser based tests (Selenium and QUnit) to a non-voting pipeline
Open, Normal, Public

Description

To improve developer productivity, this is a proposal to move browser-based tests (Selenium and QUnit) to a non-voting pipeline, so that flaky browser tests cannot block merge or verification +2 of patches.

If this step is taken, then it is up to the reviewing developer or the developer who gave a +2 to inspect the console output and judge for themselves whether a possible regression is involved with the patch. While not ideal, this seems much more preferable to the current status where patches are regularly blocked on merge (after 30-60 minutes of waiting) only for a "recheck" comment to allow them to succeed (most of the time).

related tasks:

Related Objects

Mentioned In
T233143: Quibble should fatal out on clone/fetch failure "ERROR:zuul.Repo:Unable to initialize repo for npm-test.git"
T220212: Wikimedia Technical Conference 2019: Discussion
T225940: [Discussion] Disable whitelisted-user automated testing, unless requested.
T225871: Selenium and PHPUnit: Stop execution on failure
T225730: Reduce runtime of MW shared gate Jenkins jobs to 5 min
Mentioned Here
T233143: Quibble should fatal out on clone/fetch failure "ERROR:zuul.Repo:Unable to initialize repo for npm-test.git"
T232759: Move CI selenium/qunit tests of mediawiki repository to a standalone job
T118212: Take heat off day before the weekly branch-cut?
T215217: deployment-prep: Code stewardship request
T225871: Selenium and PHPUnit: Stop execution on failure
T204125: Run less tests when a patch is in WIP mode in Gerrit
T182412: Investigate if WebdriverIO `sync: false` would be useful to us and document how to use it
T225496: Improve caching in CI tests
T87781: Split mediawiki tests into unit and integration tests
T225184: CirrusSearch\SearcherTest::testSearchText PHPUnit tests take a while and runs for everyone
T196960: Quibble should have a way for extensions to opt out of core selenium browser tests
T67773: Auto retry failed browser tests to reduce false negatives
T94212: Accommodate flaky tests flapping
T225218: Consider httpd for quibble instead of php built-in server
T225162: Fix flapping tests in gated extensions as of 2019-06-06
T225193: CI monitoring to detect flapping tests, especially in unrelated gated extensions
T212521: RFC: Let's stop running QUnit unit tests as integration tests
T224673: Automate identifying flaky tests

Event Timeline

kostajh created this task. Jun 6 2019, 8:11 PM
Restricted Application added a subscriber: Aklapper. Jun 6 2019, 8:11 PM
kostajh updated the task description. Jun 6 2019, 8:14 PM

In the last week, Kosta counted about 200 uses of "recheck" on patches. While "recheck" can be issued for many reasons, 100% of my own usages are because of unreliable browser tests.

A quick back-of-the-napkin estimate: 200 rechecks * 2 people waiting on average * 20 minutes per build comes to about 133 hours... in one week. That's the equivalent of 3.3 full-time employees at 40 hours per week.

Some rechecks may be done for other reasons, and people are not staring at their screen doing nothing while Jenkins re-runs the jobs. However, some tests have affected a large number of patches: they have had to be disabled, and Zuul has been backed up to the point where patches only start running two hours later.

It's hard to quantify the exact impact on productivity, but the frustration and the perception of wasted time are real.

I like this proposal, but I'm concerned that non-voting tests will become meaningless, wasting CI resources for little value and discouraging developers from writing browser tests in the future. Here are a few alternative approaches we might consider:

  • Make it easier to run browser tests locally, especially the full gate-and-submit suite, which helps find unwanted interactions between extensions. Just as we shouldn't be pushing patches without linting and running unit tests locally, we should also be able to run these more complex tests. Quibble is a big step towards this goal; we might be just a small amount of extra glue away from a simple, local "test everything".
  • Run CI browser tests in parallel. (Task already filed?)
  • Better social conventions around flapping tests. I'm pretty sure most of us were being inconvenienced by the same flaky tests. Rather than recheck, maybe the first response should be to mark the test as broken. This would be aided by a hit list of the top offending tests, or other monitoring. (T224673, T225193, T225162)
  • Hourly or daily regression suite which is capable of bisecting or something, to identify which specific patches broke the build even after merge. This is less than ideal since it lets us deploy broken code.

QUnit tests are not browser tests; those are Selenium tests.

If tests are flaky, then CI maintainers should be pulling them from the shared test stack or disabling them. Unfortunately we haven't been as aggressive as we should be.

Any proposal to disable tests has to adequately cover what these tests are protecting against. There's obviously a limit to how much developer time is being wasted, but if that is protecting against end-user regressions, then so be it.

I am in favor of leaving them as voting, especially if the primary driver to changing that is their flaky nature. Although I do understand the issue at hand, I'd like to work to find a different solution to the problem of flaky tests. Monitoring and acting upon flaky tests hasn't been a priority in the past, but we are moving to bring more focus to that in the near future.

I think this proposal is a good catalyst to solve a real problem. I'm just hesitant to do so in this manner at this juncture. I'm open to being convinced otherwise, but to Legoktm's point, the intent of these tests being part of a voting pipeline is to provide some degree of protection.

I'll take a closer look into this problem to see what other things we could do to lessen the impact of flaky tests.

Thank you everyone for the discussion!

that non-voting tests will become meaningless, wasting CI resources for little value, and discouraging developers from writing browser tests in the future

For me personally, based on my experience with the frequency of flaky Selenium tests, I'm already discouraged from writing any. And it's not that I don't see the potential usefulness of these types of tests; it's just that I don't want to add to the already long build times with something that might potentially fail for reasons unrelated to the code.

Better social conventions around flapping tests. I'm pretty sure most of us were being inconvenienced by the same flaky tests. Rather than recheck, maybe the first response should be to mark the test as broken.

Yes, strongly agree

Run CI browser tests in parallel. (Task already filed?)

I've tried this type of thing before and it's not great, because the site under test is running on an already non-optimized web stack – we're running PHP's (single-threaded?) built-in webserver with SQLite – and it starts to get bogged down under the heavier request load, and then there tend to be even more timeout problems. I do think using Apache would speed things up a bit (T225218), but I'm not sure it would help with the reliability issue.

QUnit tests are not browser tests; those are Selenium tests.

I included QUnit because as I understand it, we currently run them with a browser, but I am mostly talking about Selenium tests, and we could focus the discussion on that.

If tests are flaky, then CI maintainers should be pulling them from the shared test stack or disabling them. Unfortunately we haven't been as aggressive as we should be.

Agreed. And I have been guilty of this too. Unfortunately it's just a lot easier to type "recheck" and get on with your day than to file a task, clone the repo, scan the console logs to find the failing test name, disable the test, submit the patch, get a +2.

Any proposal to disable tests has to adequately cover what these tests are protecting against. There's obviously a limit to how much developer time is being wasted, but if that is protecting against end-user regressions, then so be it.

Anecdotally, in the last year I think I've encountered one instance of a regression caught by a failing Selenium test. Everything else, from what I can tell, has been a flaky test that "recheck" has caused to pass. That said, it seems like the problem has been worse in the last few months.

Maybe a better starting point than my subjective complaints is to try to collect some data and trends of how many legitimate regressions have been prevented by the Selenium tests in comparison to the number of flaky tests. :) I will try to see how feasible it is to do that

I think this proposal is a good catalyst to solve a real problem. I'm just hesitant to do so in this manner at this juncture. I'm open to being convinced otherwise, but to Legoktm's point, the intent of these tests being part of a voting pipeline is to provide some degree of protection.

Perhaps when a branch is cut for the train, there could be a step in CI that involves running all Selenium tests for that new branch? Or we could run the Selenium tests on a beta cluster type site every few hours?

kostajh renamed this task from "Proposal to move browser based tests (Selenium and QUnit) to a non-voting pipeline" to "Consider moving browser based tests (Selenium and QUnit) to a non-voting pipeline". Jun 7 2019, 2:02 AM
kostajh updated the task description. Jun 7 2019, 2:35 AM
mmodell added a subscriber: mmodell. Jun 7 2019, 3:18 AM

Maybe a better starting point than my subjective complaints is to try to collect some data and trends of how many legitimate regressions have been prevented by the Selenium tests in comparison to the number of flaky tests. :) I will try to see how feasible it is to do that

Would you like to join us to discuss this at the next Data³ meeting?

I think this proposal is a good catalyst to solve a real problem. I'm just hesitant to do so in this manner at this juncture. I'm open to being convinced otherwise, but to Legoktm's point, the intent of these tests being part of a voting pipeline is to provide some degree of protection.

Perhaps when a branch is cut for the train, there could be a step in CI that involves running all Selenium tests for that new branch? Or we could run the Selenium tests on a beta cluster type site every few hours?

I think this has been proposed in the past. It seems like a good compromise to me, FWIW.

I thought I had written this down in a task somewhere, but I can't find it now. Anyway, I think we should split the Selenium browser tests into a similar "structure" and "core" set like we do for PHPUnit.

Most of the MediaWiki core PHPUnit tests run only on core patches, but there's a small subset that verifies integrity and basic things (tests/phpunit/structure) and runs during extension patches too. We should move most of the browser tests to run only on core patches (e.g. the rollback test that's somewhat flaky), and run just a very small core subset on extension patches (plus the extension's own tests!).

This would probably also let us undo workarounds like T196960: Quibble should have a way for extensions to opt out of core selenium browser tests.
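
As a rough illustration of what that split could look like (not the actual MediaWiki configuration: the MW_CORE_PATCH variable and the specs/structure/ directory below are hypothetical; `specs` and `framework` are standard WebdriverIO config options):

```js
// wdio.conf.js (sketch): pick the spec set based on whether the patch
// under test is a mediawiki/core change or an extension change.
// Assumption: CI would export something like MW_CORE_PATCH=1 for core patches.
const isCorePatch = process.env.MW_CORE_PATCH === '1';

exports.config = {
    // Core patches: run the full Selenium suite.
    // Extension patches: run only a small "structure"-style smoke subset,
    // mirroring what tests/phpunit/structure does for PHPUnit.
    specs: isCorePatch
        ? [ './tests/selenium/specs/**/*.js' ]
        : [ './tests/selenium/specs/structure/*.js' ],
    framework: 'mocha'
    // ...the rest of the existing configuration would stay unchanged.
};
```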

Change 515702 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[integration/jenkins@master] Set cache for mediawiki to APC for faster browser tests

https://gerrit.wikimedia.org/r/515702

Ladsgroup added a comment (edited). Jun 8 2019, 4:32 PM

(Warning: long comment ahead.)
This week, I have been so frustrated with slow Jenkins that I spent the last two days trying to make it faster. I actually learned a lot about MediaWiki's caching, which was surprising to me. While I share the same concern, I think making browser tests non-voting defies the point of having them in the first place. (We did a similar thing with Wikibase: we moved most of the browser tests to a daily Jenkins job. Guess what happened: the tests have been red for around two years now.)
So what should we do?

  • Find and fix slow PHPUnit tests. We have a report of test run times, which is pretty useful for finding weird cases.
  • Move browser tests to another parallel job.
  • Separate unit tests from integration tests (T87781: Split mediawiki tests into unit and integration tests). This doesn't just make the tests faster; it has another fun benefit too. Quibble is really smart: at first it runs PHPUnit excluding the database group (we should start honoring the "database" tag in our codebase; I don't think we do), then it runs the browser tests, and at last it runs the database PHPUnit tests. This is a smart way to stop early when the build is going to fail anyway: run the fast tests first and abort when they fail. By separating unit tests, we can put them at the front of the Quibble queue, making it abort jobs that fail in unit tests before running lots of other tests.
  • Improve caching in our tests. I thought PHPUnit uses the default caching (database), but it actually sets all of the caches to 'hash' (opcache) in all types of tests. If we do this for browser tests too, it makes them blazingly fast. The number of database queries if you set caching to the database is unbelievable (try it). I set the browser tests' caching to APCu, but it hasn't been merged yet: https://gerrit.wikimedia.org/r/515702 That will improve things drastically. But we might still want to consider using a persistent cache in our unit tests, because, for example, making a Language object is pretty expensive, but they are cached.
  • Speaking of the Language object, can this be made more mockable, or split? Most slow PHPUnit tests are slow because they try to make that object. On my (rather slow) computer it takes around 50 ms just to construct and do a basic use of a Language object (you can see the difference by changing the language so that it reuses the cache).

That's what I've got from the last two days :) I think TechCom should be involved in this discussion. Not super sure, though, and I leave the decision to the author.

I'll comment more on Amir's note above on Monday, but for now wanted to add two other ideas:

  • using --stop-on-failure for PHPUnit (and whatever the equivalent is for QUnit and Selenium) for gate-and-submit might be interesting as a means to speed up the entire cycle when a flaky build happens
  • Maybe we could consider running one set of Selenium tests in the test and gate-and-submit pipelines, and a second set of tests in a non-voting pipeline. The idea would be to have a Selenium critical-test "budget" for each extension / subsystem of core. These critical tests (which would be annotated with @critical or live in a subdirectory such as tests/selenium/critical), limited to a maximum runtime of N (two?) minutes, would run as voting tests in the test and gate-and-submit pipelines, while the non-critical tests could run in a non-voting pipeline.
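
A minimal sketch of how that could be wired up, assuming the @critical annotation lives in the Mocha test titles (the SELENIUM_CRITICAL_ONLY variable and the numbers are illustrative; `mochaOpts.grep` is a standard WebdriverIO/Mocha option for filtering tests by title):

```js
// wdio.conf.js (sketch): the voting test/gate-and-submit jobs would set
// SELENIUM_CRITICAL_ONLY=1 and only run tests tagged "@critical" in their
// titles; the non-voting job would leave it unset and run everything.
const criticalOnly = process.env.SELENIUM_CRITICAL_ONLY === '1';

const mochaOpts = {
    ui: 'bdd',
    // Rough per-test budget, so the critical set stays within the
    // overall ~2 minute target mentioned above.
    timeout: 20 * 1000
};
if ( criticalOnly ) {
    mochaOpts.grep = '@critical';
}

exports.config = {
    specs: [ './tests/selenium/specs/**/*.js' ],
    framework: 'mocha',
    mochaOpts: mochaOpts
};

// In a spec file, a test would opt into the voting set via its title:
// it( 'creates an account @critical', function () { /* ... */ } );
```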

Change 515702 abandoned by Hashar:
Set cache for mediawiki to APC for faster browser tests

Reason:
The MediaWiki configuration snippets in integration/jenkins.git are almost no longer used. They have been partly migrated to Quibble for some backward compatibility, but their canonical location is now includes/DevelopmentSettings.php in mediawiki/core.git :-]

https://gerrit.wikimedia.org/r/515702

Would you like to join us to discuss this at the next Data³ meeting?

@mmodell If I'm not too late, sure, I'd be interested to join the next one!

QUnit tests are not browser tests; those are Selenium tests.

I included QUnit because as I understand it, we currently run them with a browser, but I am mostly talking about Selenium tests, and we could focus the discussion on that.

We've been using QUnit in MediaWiki since 2010, and as part of on-commit CI since ~ 2013 with little to no flakiness. I don't recall any specific issues, but over the whole decade we've maybe had a handful of freak issues that weren't deterministic ("flaky"). And I suspect most (if not all) had more to do with the CI hardware than the tests (e.g. resource exhaustion and things like that).

We spawn a single browser process (Headless Chromium with Karma; back then, PhantomJS with Grunt) and run the tests to completion. It's similar to PHPUnit in terms of registration, hierarchy, and process overhead. They generally run to completion orders of magnitude faster than the Selenium tests, despite having a hundred times more tests to run.

The Selenium stack on the other hand is still very young. We have only ~ 1 year of experience with all of the following:

  • Using WebdriverIO on Node.js. (previously Ruby and the selenium-webdriver gem)
  • Using node-fibers. (previously ES6 promises or jQuery deferred)
  • Writing tests with Mocha. (previously QUnit)
  • Running browser tests for each commit. (previously daily)
  • Running browser tests against a temporary and basic install of MW within Jenkins. (previously only against the prod-like Beta Cluster)
  • Having browser tests be "voting" jobs in the merge pipeline. (previously post-merge, once daily)
  • Running browser tests with multiple extensions installed, and having the same tests work both plainly against core, with the subject extension, and with unrelated extensions installed. (previously tests only had to work against Beta Cluster and/or production, which are similar enough for most purposes)

In addition, we've started doing all of these at the same time. This has naturally led to many mistakes being made while we get used to all this.

I would recommend we don't disable them or make them non-voting; however, I do think we need to be very strict about disabling tests that aren't deterministic, and we should start documenting best practices for how (not) to write browser tests.

In addition to adopting some best practices, we should also look into why they are slow (when they are working). I used to think this was due to limited hardware resources in CI (CPU, disk availability, etc.), but there are two reasons I no longer believe this.

  1. Running them locally, they are equally slow.
  2. Other jobs in CI that interact with MediaWiki from a Chromium process are not slow. (Fresnel, QUnit)

In working on Fresnel, one can see that our WDIO tests have no excuse for performing so badly. The Fresnel jobs use the same CI hardware, the same Docker images, the same method of installing MediaWiki (Quibble), and the same browser (Chromium). Yet, where a Fresnel job takes less than 30 seconds to launch the browser dozens of times and fully load several different web pages from the MediaWiki application, the WDIO tests have only just started thinking about what to do.

A couple of thoughts for how to make the tests faster and more reliable:

  • Test framework:
    • I suspect there is something wrong with how we authenticate to MediaWiki from our tests. The MWBot class is authenticating way more often over the API than I'd expect. Within a test scenario where two things are done with the API, it should not need to authenticate twice.
    • There might be something about the password hashing algo used by default in MW that is perhaps unsuitable for the virtualised environment we use in Jenkins. This might explain why the login request is so extremely slow. In chatting on IRC, some people suggested using a much weaker algo here (like md5) given these are already sandboxed by other means, and password security is not the point of these tests.
    • I would recommend getting rid of the "node-fibers" setup we use for WDIO right now (it comes from wdio-mocha/wdio-sync). Instead, we can manage asynchronous code the normal way we already do for our production JavaScript code: with promises, or with async-await. The Fibers project exists to make it "easier" to write async code. However, at least in the way it is utilised by WDIO, it has proven to be highly counter-intuitive. A testament to this is the fact that every single WDIO test I've seen written over the past 12 months that uses asynchronicity (including my own tests) gets this wrong at least once. This alone has been a major source of test flakiness. (See the sketch after this list.)
  • Test conventions:
    • The tests currently rely almost completely on random values not conflicting. This is a departure from how we've written all other tests over the past 10-15 years and is imho a mistake. It makes tests harder to reason about, debug, and understand. It also leaves behind a mess on any wiki it is run against. I'd recommend using constant values here, like we do for other tests already. For CI this would work as-is, given each run has a fresh database. For running against Beta Cluster and locally, there are a number of ways we can accommodate this. One would be to have a setup and teardown step run a maintenance script that clears all known fixtures from the database. Another could be to have the setup step create a temporary clone of the database, which can be triggered in LocalSettings via a cookie, a header, a user-agent suffix, or something else. There are other ways this could work as well, but these are just two rough ideas.
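
To make the node-fibers point above concrete, here is a hedged sketch of the same test written in the current sync (fibers) style and then with plain async/await under `sync: false`, also using a constant fixture title instead of a random one (the page title and selector are illustrative; `browser.url()` and `$()` are standard WebdriverIO commands, though the exact API differs slightly between WDIO versions):

```js
// Current sync (fibers) style, roughly:
//
//   it( 'shows the edit link on a known page', function () {
//       browser.url( '/wiki/Selenium_fixture_page' );
//       $( '#ca-edit' ).waitForDisplayed();
//   } );
//
// The same test with sync: false, using explicit async/await:
describe( 'Fixture page', function () {
    it( 'shows the edit link on a known page', async function () {
        // Constant, well-known fixture title rather than a random value;
        // in CI each run has a fresh database, so this cannot conflict.
        await browser.url( '/wiki/Selenium_fixture_page' );

        const editLink = await $( '#ca-edit' );
        await editLink.waitForDisplayed();
    } );
} );
```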

Another note and observation: running tests on HHVM is much slower than on PHP 7. PHPUnit excluding the database group takes 2 minutes on HHVM and 40 seconds on PHP 7; Selenium takes 4 minutes on HHVM and 1 minute on PHP 7. These jobs will hopefully be dropped soon.

BTW, given that T225496: Improve caching in CI tests is merged, Selenium is considerably faster now.

  • I would recommend getting rid of the "node-fibers" setup we use for WDIO right now (it comes from wdio-mocha/wdio-sync). Instead, we can manage asynchronous code the normal way we already do for our production JavaScript code: with promises, or with async-await. The Fibers project exists to make it "easier" to write async code. However, at least in the way it is utilised by WDIO, it has proven to be highly counter-intuitive. A testament to this is the fact that every single WDIO test I've seen written over the past 12 months that uses asynchronicity (including my own tests) gets this wrong at least once. This alone has been a major source of test flakiness.

I've considered that in T182412: Investigate if WebdriverIO `sync: false` would be useful to us and document how to use it.

  • using --stop-on-failure for PHPUnit (and whatever the equivalent is for QUnit and Selenium) for gate-and-submit might be interesting as a means to speed up the entire cycle when a flaky build happens

I want to +1 this as a great idea. Maybe there's a good reason to show as many errors as possible, but I think fail-fast is consistent with our CI infrastructure. As soon as your code fails one test, we can be certain that it's not mergeable, and any further testing is a waste of resources.

Another thought: I often have intermediate patchsets that I want to upload just to make a rebase more obvious, or WIP patches that are pushed mostly for human input or to keep my work safe when I leave for the night, etc. (T204125). If there were some way to disable tests when the summary line includes "WIP", or when the patch is marked WIP in Gerrit, I might use that often. Relying on developers to do special things to prevent tests is not a complete solution, but it might win us back some spare resources.

I want to +1 this as a great idea. Maybe there's a good reason to show as many errors as possible, but I think fail-fast is consistent with our CI infrastructure. As soon as your code fails one test, we can be certain that it's not mergeable, and any further testing is a waste of resources.

@awight I've created T225871: Selenium and PHPUnit: Stop execution on failure to follow up on this one.
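
For the record, a rough sketch of what fail-fast could look like on the JavaScript side (PHPUnit already has --stop-on-failure; `bail` is an existing WebdriverIO config option and Mocha supports a `bail` flag too, but how this would be wired into Quibble/Zuul is an open question):

```js
// wdio.conf.js (sketch): abort the Selenium run as soon as something fails,
// instead of spending another ten minutes on tests that cannot rescue the build.
exports.config = {
    specs: [ './tests/selenium/specs/**/*.js' ],
    framework: 'mocha',
    // WebdriverIO: stop the test run once this many tests have failed.
    bail: 1,
    mochaOpts: {
        ui: 'bdd',
        // Mocha: stop running further tests in the current spec file
        // after the first failure.
        bail: true
    }
};
```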

MaxSem added a subscriber: MaxSem. Jun 16 2019, 5:17 AM

Run CI browser tests in parallel

with SQLite

The problem here is that SQLite is not designed for parallelism. Best case, one thread will hang until another thread writing to the database unlocks it. Worst case, it will time out and MW will get an SQL exception. We'll have to use MySQL for parallelism.

T204125: disable tests if the summary line includes "WIP"

Proposal: we go one step further, and make the default behavior for all patchsets follow un-whitelisted author rules, roughly speaking:

  • When pushing a new patchset, no tests are run regardless of who the author is.
  • When a trusted user gives CR+1 and/or leaves a Gerrit comment ("check all", "check lint", etc.), run those tests.
  • When a patchset gets CR+2, run gate-and-submit.

T204125: disable tests if the summary line includes "WIP"

Proposal: we go one step further, and make the default behavior for all patchsets follow un-whitelisted author rules, roughly speaking:

  • When pushing a new patchset, no tests are run regardless of who the author is.
  • When a trusted user gives CR+1 and/or leaves a Gerrit comment ("check all", "check lint", etc.), run those tests.
  • When a patchset gets CR+2, run gate-and-submit.

This kinda misses the point: this ticket is about our CI being needlessly annoying. I don't see how forcing everyone to make extra steps just to run tests is going to make it less annoying.

Wondering if anyone else has thoughts on this. Maybe it could be a topic at techconf.

Today I'm reviewing a patch for Echo, but the Selenium tests fail, so I need to disable the test, merge that, then rebase the patch I'm reviewing.

For Echo, each extension cloned as part of the CI build will have its Selenium tests run. The end result is that https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/32205/consoleFull takes 19 minutes, 10 minutes of which are Selenium.

In practice this looks like:

  • 20 minutes with "recheck" to see if the test was flaky or not
  • a few seconds to disable the check
  • 20 minutes for Jenkins to give Verified +2 to the patch that disabled the test (it got a CR +2 in the meantime)
  • Another 20 minutes to wait for the patch to merge
  • Next I'll rebase, and wait 20 minutes for the patch I'm reviewing to get +2 Verified
  • Finally I'll merge the Echo patch I'm reviewing

If Selenium tests were in their own pipeline (voting or not) at least we'd get some earlier feedback in Gerrit. Maybe Selenium tests for dependencies could be separated out into another pipeline. Other thoughts?

I don't (yet) have opinions on the proposal, but do have some history bits:

  1. The different kinds of tests were actually separate, concurrent, jobs before (phpunit db-ed, phpunit dbless, qunit, selenium). They were merged only "recently" (2017?) as part of the migration to Quibble.
    • I don't recall exactly why. I believe the decision was made based on work from Antoine and/or Dan Duvall, due to the limited number of VMs allocated to CI right now. In other words, a trade-off between 1) handling more patches concurrently, with possibly better average turnaround for devs and better minimum turnaround for SWAT and for non-MW repos; or 2) handling fewer patches at once but with better minimum turnaround when it's not busy in Gerrit land. The first one was chosen.
  2. Something about consuming less CPU time (albeit more human time) if combined, because we'd no longer re-do common steps as often such as fetching from Git and installing stuff. It might be time to re-evaluate this, especially because various people have already sliced up the way Quibble runs in CI ("composertest" was separated). We could split "QUnit" and Selenium as well, and they don't need to run for multiple PHP flavours.
  3. Selenium jobs used to be separate and non-voting. Before that, they were not on-patch at all but rather run daily. Before that, they were even maintained in a separate repo and ran only against Beta/prod. They have been made voting and integrated (for the few actively maintained extensions that partake in the shared extension gate) so as to reduce regressions that we otherwise only catch through manual testing in production. This is something we should aim to systemically avoid at all costs.

I think, therefore, that given our highly fragile, uncoordinated and undisciplined quality practices, we must not regress further from this relatively low standard by making Selenium tests "non-voting" (and we already established QUnit isn't an issue here). These Selenium tests must be required to pass at some point between "merging into master" and "deploying to production". For new branches, and also for cherry-picks.

Today, we do that towards the beginning of that spectrum, mainly because we don't have any other workflows that involve CI elsewhere on this spectrum. But perhaps such a workflow could be created? How would it work? What would it look like?

See also:

  • Deployment Pipeline™.
  • Practically zero QA, peer, or product review for merges on weekends, Mondays or Tuesdays (basically, only merges on Wednesday-Friday enjoy meaningful exposure) – see T118212.
  • Beta Cluster is up for grabs – T215217.
  • Teams don't sign off on train deployment groups (e.g. changes in train schedules, or team absence/vacations/forgetfulness means changes go live untested by their QA).

Current work happening in the Quibble codebase will make it possible to run jobs in parallel again, but with better control over dependencies and with less redundant work than if Jenkins were responsible for scheduling each stage. For example, we don't want to run browser or PHPUnit tests until a patch has passed basic lint checks, which is easy to implement in Quibble.

My +1 is to keep the browser tests voting, because they serve an important purpose of preventing user-facing regressions, and to keep running gated extension tests during the merge stage, because we do have interactions between extensions (and core), and because the alternative is that we find ourselves fixing multiple, silently broken tests in the future.

There are two motivations behind this task, one is that some tests are flaky, which has the simple solution of skipping flaky tests. The other is that the tests are slow, which can hopefully be addressed with optimizations. We should separate those two concerns into their own investigations, and maybe set performance goals for "how long is too long?"

Krinkle added a comment (edited). Wed, Sep 11, 6:23 PM

[..] For example, we don't want to run browser or PHPUnit tests until a patch has passed basic lint checks, which is easy to implement in Quibble.

I think we should reconsider this. Most static checks for MW commits already run as independent jobs (phan, node10-npmtest, composer-test). Should Quibble repeat these in its own job before starting PHPUnit, QUnit or WDIO tests?

It is unclear to me what value that adds, except that it might avoid a confusing PHPUnit error when there is a syntax error; instead, the developer gets the same (nice) php-lint error twice, once from composertest-php72 and a second time from quibble-php72. The cost of that convenience is delaying the jobs by several minutes, plus consuming more CI infra capacity as all PHP variants do the same thing again as well. We can also look at the usability of the Gerrit comments that Jenkins submits, to see if we can more clearly indicate in which order they are best looked at, e.g. to look at "composer-test FAILED" before looking at "quibble-php72 FAILED".

On second thought, it seems this is already not the case. The quibble-php7x jobs explicitly have --skip composer-test set and they've never included Phan (too slow). The only step they duplicate currently is "npm-test" (same as the node10-npmtest job). Which we might want to remove as well?

Another avenue we could explore is short-circuiting the Zuul pipeline in case certain jobs have failed. For example, we could say that if composer-test or node10-npmtest fails, to abort the other jobs. Right now, when the composer-test job has failed (usually within 1-2 minutes) the Gerrit patch is not informed about this until after all the php variant jobs have been able to get an executor slot and failed the same way, which can take much longer.


EDIT: Looks like it was already agreed in April 2019 to skip npm-test from Quibble. But, the skipping of it stopped working due to T233143.

The QUnit tests should definitely keep running on a per-patch basis. I would prefer to keep the Selenium tests per patch as well, since experience has shown that otherwise they quickly fail and end up being mostly useless.

For Selenium/QUnit, we should be able to extract them into a standalone job: T232759