Page MenuHomePhabricator

Make MediaWiki Wdio tests less slow (Sept 2019)
Open, Needs TriagePublic

Description

Task to track a couple of things I want to try out (and revert/abandon if not compatible and faster):

Test runner:

  • Upgrade from Wdio v4 to v5. T213268
  • Use native async-await instead of the wdio-sync/Fibers hack.
  • Let Wdio use the devtools protocol instead of the webdriver protocol to connect with browsers locally and in CI. This is a new feature in Wdio 5 and should be significantly faster for test execution. It also removes the need for the chromedriver executable.

Test configuration:

  • Determine what overhead video recording is having on test execution.
  • If more than a little bit, consider disabling by default and opt-ing in via a "check <something>" command in Gerrit instead (which Quibble could detect via an environment variable like ZUUL_PIPELINE to decide to run something like npm run selenium-video, which we'd create as variant to npm run selenium).

Test execution:

  • Figure out why the API requests we run with MWBot from Wdio tests are slow (5 seconds is not unusual). This makes no sense to me because MediaWiki in CI has been hyper-optimised, as demonstrated by Fresnel and QUnit tasks running MediaWiki with super low latencies (almost everything in pure RAM, full MW page loads in Fresnel with Chrome take under 150ms, incl load.php etc.).
  • Talked with @kostajh and @Reedy about this, they suggested that maybe 1) the login session keeps getting lost between test cases and 2) login might be very slow because of too strong password crypto requirements by default which are intentionally CPU intensive.

Event Timeline

Change 539441 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] build: Upgrade from webdriver v4 to v5

https://gerrit.wikimedia.org/r/539441

Change 539691 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] selenium: Remove "RunJobs" wait from specialrecentchanges test

https://gerrit.wikimedia.org/r/539691

Change 539865 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] selenium: Use async-await for wdio-mediawiki/Api internals

https://gerrit.wikimedia.org/r/539865

Change 539866 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] selenium: Fix inefficient use of MWBot in specs/page.js

https://gerrit.wikimedia.org/r/539866

Change 539441 merged by jenkins-bot:
[mediawiki/core@master] selenium: Upgrade from webdriver v4 to v5

https://gerrit.wikimedia.org/r/539441

Change 539691 merged by jenkins-bot:
[mediawiki/core@master] selenium: Remove "RunJobs" wait from specialrecentchanges test

https://gerrit.wikimedia.org/r/539691

Change 539931 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/core@master] [DNM] package.json: Drop @wdio/sync

https://gerrit.wikimedia.org/r/539931

Change 539865 merged by jenkins-bot:
[mediawiki/core@master] selenium: Use async-await for wdio-mediawiki/Api internals

https://gerrit.wikimedia.org/r/539865

Change 539866 merged by jenkins-bot:
[mediawiki/core@master] selenium: Fix inefficient use of MWBot in specs/page.js

https://gerrit.wikimedia.org/r/539866

Change 539935 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] Selenium: Remove browser.setLocalStorage call

https://gerrit.wikimedia.org/r/539935

Change 539942 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] Selenium: Hide VE welcome dialog

https://gerrit.wikimedia.org/r/539942

Change 539935 abandoned by Kosta Harlan:
Selenium: Remove browser.setLocalStorage call

Reason:
in favor of https://gerrit.wikimedia.org/r/c/mediawiki/core/ /539942

https://gerrit.wikimedia.org/r/539935

Change 539942 merged by jenkins-bot:
[mediawiki/core@master] Selenium: Hide VE welcome and beta dialogs

https://gerrit.wikimedia.org/r/539942

Change 539965 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] WDIO: Use pageLoadStrategy=eager with Chromedriver 77

https://gerrit.wikimedia.org/r/539965

Change 540004 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] selenium: Fix more inefficient MWBot use and simplify wdio-mediawiki Api

https://gerrit.wikimedia.org/r/540004

Change 539965 abandoned by Kosta Harlan:
WDIO: Use pageLoadStrategy=eager with Chromedriver 77

Reason:
This needs to wait for Chrome/Chromium 77 in Debian

https://gerrit.wikimedia.org/r/539965

Have folks considered using child_process.exec() to replace usages of bot.edit()? Seems like that would be considerably faster and accomplish the same goal of generating content for the tests to make assertions against.

Change 540004 merged by jenkins-bot:
[mediawiki/core@master] selenium: Fix more inefficient MWBot use and simplify wdio-mediawiki Api

https://gerrit.wikimedia.org/r/540004

Have folks considered using child_process.exec() to replace usages of bot.edit()? Seems like that would be considerably faster and accomplish the same goal of generating content for the tests to make assertions against.

@Krinkle I meant to write, "using child_process.exec() with php maintenance/edit.php to replace usages of bot.edit()".

Some preliminary tests locally didn't show a performance improvement so I abandoned exploring this.

Change 583481 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] Experiment with a fast wgPasswordDefault for CI/Selenium.

https://gerrit.wikimedia.org/r/583481

Change 583481 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] Experiment with a fast wgPasswordDefault for CI/Selenium.

https://gerrit.wikimedia.org/r/583481

Status quo ($wgPasswordDefault = 'pbkdf2', slow):

  • wmf-quibble-selenium-php72-docker SUCCESS in 10m 54s
  • mediawiki-quibble-selenium-vendor-docker SUCCESS in 2m 37s

Alternate ($wgPasswordDefault = 'A, uses md5, fast):

  • wmf-quibble-selenium-php72-docker SUCCESS in 10m 22s
  • mediawiki-quibble-selenium-vendor-docker SUCCESS in 2m 20s

This isn't a major compounding difference, as I would expect if the pwd hashing was a major time sink during the dozens of login requests. Both ways can be repeated and are within the regular swings it seems.

Does anyone have any further ideas on addressing this? It's by far the biggest issue around CI responsiveness.

I can reproduce it locally as well. I don't understand why it is so slow.

It definitely isn't MediaWiki, or PHP, or Docker, or Chromium. This because those are also involved in Fresnel performance jobs, using the exact same quibble-provisioned MW setup, inheriting the same docker image and software packages, and this runs several orders of maginitude faster. Loading dozens of MW articles and special pages, perform user actions, take screenehots, record debug timelines, and calling through devtools protocol to observe/assert results.

That only leaves me to think it's one or more of these:

  • Wdio overhead.
  • The specific cli frags Wdio passes to Chromium.
  • Bad code in our test cases.
  • Bad code in other methods we call in-between browser methods, such as the API calls we make (via node-mwbot).

The main point I've been suspecting so far is the API calls. Those definitely do stand out when debugging things locally, but haven't very strongly verified exactly where the time is spent (e.g. the time could be spend in our own code before or after the API call is logged). In particular, the fact that we are doing a huge amount of logging-in over and over again. I have reduced the number of needless logins and consolidated several to re-use the Bot instance in Core and Vector tests. But, while that has reduced the number of logins and API requests, they are still incredibly slow.

I'm doing some investigation on this and hope I can present something good but in the meantime, there's a huge low hanging fruit here. One third of all the time of browser tests is being spent on running TwoColConflict browser tests. It's consistently between 6-7 minutes and that's crazy. Wikibase with huge browser test suite (including client, data-bridge, tainted ref) is 4 minutes and core is 1.5 minutes. The situation is so bad that I suggest we temporarily remove TCC from gated extensions.

I'm not saying TCC browser tests are bad and should be removed. I feel there's a huge bug there that no one has noticed before (maybe something like waiting ten seconds between each api query or something). I try to look at it.

Specially picking from https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/94431/consoleFull

09:08:32 INFO:backend.PhpWebserver:[Thu May 13 08:08:32 2021] 127.0.0.1:56722 [200]: //api.php?format=json
10:07 AM 
09:08:44 INFO:backend.PhpWebserver:[Thu May 13 08:08:32 2021] 127.0.0.1:56724 [200]: //api.php?format=json

There is 12 seconds of dead space between these API requests for example.
This is consistent with what is seen in T281122, but it looks like the grafana dashboard showing memory compaction events has vanished, so I can't back this up currently.

Change 690887 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[integration/config@master] Temporary remove TwoColConflict from gated extensions

https://gerrit.wikimedia.org/r/690887

Change 690887 merged by jenkins-bot:

[integration/config@master] Zuul: Temporarily remove TwoColConflict from gated extensions

https://gerrit.wikimedia.org/r/690887

Mentioned in SAL (#wikimedia-releng) [2021-05-15T18:38:55Z] <James_F> Zuul: Temporarily remove TwoColConflict from gated extensions T234002 T282935.

zeljkofilipin raised the priority of this task from Medium to Needs Triage.Fri, Sep 17, 9:27 AM