Page MenuHomePhabricator

Optimize speed of selenium tests, December 2024
Open, Needs TriagePublic

Description

This task aims to collect possible performance improvements that would also speed up our selenium suite. Extracted from T225730 to keep it more focused.

I have generated updated profile for the gate jobs, using Amir's patch above and uploaded them on people.wm.org. See T225730#9033209 for how to visualize them, e.g. https://performance.wikimedia.org/excimer/speedscope/#profileURL=https://people.wikimedia.org/~daimona/T225730-profile-2024-12-09/selenium-wmf-cpu-profile.log.

I am particularly interested in the selenium jobs, now that PHPUnit is run in parallel in most places. I haven't analyzed them in depth for now, but at first glance it does look like we're spending a decent amount of time in load.php, and particularly parsing LESS files. I'm not sure if that is expected, but 10% of the total time seems a lot. I'm also somewhat surprised to see APCUBagOStuff::doGet amongst the functions with the larger self times. At first I thought it might just be that we call it a lot (no way of knowing from a sampled profile), but then it shows that there seems to be a single caller responsible for the majority of the spent time, and that is ExtensionRegistry::loadFromQueue. I'm not sure what to make of it, for the time being.

Thanks for the fresh profiles!

  • I already spend some time in our (forked) Less.php library and managed to improve it quite a bit, most notably via https://gerrit.wikimedia.org/r/939717. I'm not surprised it's still one of the most expensive things we do at startup. I might have another look.

I think it makes sense that parsing LESS would take some time, but I'm surprised by this being 10% of the overall time spent in tests. My main (implicit) question is whether we are parsing the same things over and over again, or if there are otherwise missed caching opportunities. If not, and parsing LESS is just this slow, then I do agree that it would be worth improving its performance.

  • I also noticed that Wikimedia-Minify shows up relatively high. This is not surprising as it's also an actual parser that processes every bit of JavaScript we have one character at a time. While it's heavily optimized already there is certainly still something to improve.

Yup, here too I mostly wanted to confirm that we're making good use of caches.

  • Why is ConvertibleTimestamp::setTimestamp so expensive? This doesn't seem right.

I suppose it might be called a lot of times; we can't know with Excimer. It's not a trivial setter either, so I could see it taking some time if it's called a lot. Worth looking into this though, to see what the number of calls really is, and if anything can be done about it.

Event Timeline

Change #1101905 had a related patch set uploaded (by Daimona Eaytoy; author: Thiemo Kreuz (WMDE)):

[mediawiki/libs/less.php@master] Optimize Less_Visitor::visitObj for performance

https://gerrit.wikimedia.org/r/1101905

Change #1101905 merged by jenkins-bot:

[mediawiki/libs/less.php@master] Optimize Less_Visitor::visitObj for performance

https://gerrit.wikimedia.org/r/1101905

I wanted more data to look at, especially for selenium. Now, the problem with selenium is that some tests are brittle, and in general it can be hard to run tests consistently. So, I ran tests for the CampaignEvents extension only: they're complex enough but not too much, I know that they pass on my machine, and I know what to do if they fail.

The first thing that came out of it is actually super simple, and something I've always suspected to be the case, but surprisingly never verified: selenium is much slower in CI! For comparison:

Local
» /tests/selenium/specs/editEventRegistration.js
4 passing (14.6s)
Running: Chrome (v131.0.0.0) on linux
» /tests/selenium/specs/enableEventRegistration.js
3 passing (8.6s)
Running: Chrome (v131.0.0.0) on linux
» /tests/selenium/specs/eventPage.js
3 passing (13.4s)
Running: Chrome (v131.0.0.0) on linux
» /tests/selenium/specs/myEvents.js
1 passing (3.9s)
Spec Files:	 4 passed, 4 total (100% completed) in 00:00:50
CI
» /tests/selenium/specs/editEventRegistration.js
4 passing (22.1s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/enableEventRegistration.js
3 passing (14s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/eventPage.js
3 passing (19.1s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/myEvents.js
1 passing (5.1s)
Spec Files:	 4 passed, 4 total (100% completed) in 00:01:07

Note that there's no concurrency here, and excimer was enabled both locally and in CI. So the next thing I did was reproduce a CI run locally. I chose https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74/44175, that ran in gate-and-submit for r1101886 today. This is easy enough by following the instructions in https://doc.wikimedia.org/quibble/.

The results are consistent with the run above, and they confirm that CI is much slower than local:

Local
» /tests/selenium/specs/editEventRegistration.js
4 passing (18.7s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/enableEventRegistration.js
3 passing (10.6s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/eventPage.js
3 passing (14.6s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/myEvents.js
1 passing (3.4s)
Spec Files:	 4 passed, 4 total (100% completed) in 00:00:52
CI
» /tests/selenium/specs/editEventRegistration.js
4 passing (25s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/enableEventRegistration.js
3 passing (14.2s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/eventPage.js
3 passing (20.1s)
Running: Chrome (v90.0.4430.212) on linux
» /tests/selenium/specs/myEvents.js
1 passing (4.8s)
Spec Files:	 4 passed, 4 total (100% completed) in 00:01:12

So, maybe the first thing should be figuring out why this is the case.


With that out of the way, I ran the CampaignEvents selenium suite a few times, directly on my machine without quibble interfering. The slowest functions (both self and total) are different than those seen in CI. This is partly explained by me running tests for just an extension, and not all of them. But there are some similarities still:

Local, self > 0.5%Local, total
image.png (1×1 px, 238 KB)
image.png (1×1 px, 402 KB)

I still have to analyze these though.

Change #1102252 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/core@master] Optimize expensive StaticArrayWriter for speed

https://gerrit.wikimedia.org/r/1102252

Thanks for the fresh profiles! [… ResourceLoader load.php requests …]

Beware that these excimer profiles, afaik, display all samples linearly in a way that implies that the total is a sampled timeline of how the CI run spent its wall-clock time. However, even when using wall-clock timers, this is not the case because:

  1. nearly all load.php requests run in parallel with at least 2 or 3 other index.php or load.php requests. And,
  2. load.php requests are generally not in the critical path for HTML rendering and thus would not be blocking Selenium test progression.

Change #1102252 merged by jenkins-bot:

[mediawiki/core@master] StaticArrayWriter: Optimize expensive logic for speed

https://gerrit.wikimedia.org/r/1102252

Change #1113541 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] resourceloader: Use hashtable to check recursion in StartUpModule

https://gerrit.wikimedia.org/r/1113541

Timing breakdown for a random CheckUser patch in gate-and-submit (from https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/12352/consoleFull), rounding to integer seconds.

Chronological/hierarchical breakdown:

  • Total time: 21m 36s
    • Env setup (until zuul clone excluded): 16s
      • 10s spent in rsync
    • Zuul clone: 33s
    • Git submodule update: 9s
    • Check success cache: 2s
    • Composer install: 10s
    • Start MySQL: 2s
    • npm install: 13s (in parallel with MW install which takes 8s)
    • Actually running tests: 17m 58s
      • CheckUser: 50s
        • npm install: 9s
        • selenium: 40s (5 parallel)
      • core: 1m 7s
        • npm install: 10s
        • selenium: 55s (10 parallel)
      • AbuseFilter: 40s
        • npm install: 5s
        • selenium: 34s (3 parallel)
      • CampaignEvents: 1m 29s
        • npm install: 6s
        • selenium: 1m 22s (4 specs, NOT parallel)
      • Cite: 1m 36s
        • npm install: 15s
        • cypress: 1m 20s
      • FileImporter: 16s
        • npm install: 8s
        • selenium: 7s (1 spec)
      • GrowthExperiments: 2m 53s
        • npm install: 20s
        • cypress: 2m 32s
      • GuidedTour: 33s
        • npm install: 15s
        • cypress: 18s
      • IPInfo: 41s
        • npm install: 5s
        • selenium: 35s (2 parallel)
      • Math: 12s
        • npm install: 6s
        • selenium: 6s (1 spec)
      • PageTriage: 40s
        • npm install: 9s
        • selenium: 30s (2 parallel)
      • ProofreadPage: 40s
        • npm install: 6s
        • selenium: 32s (3 specs NOT parallel)
      • VisualEditor: 48s
        • npm install: 16s
        • selenium: 31s (2 parallel)
      • Wikibase: 4m 1s
        • npm install: 1m 37s
        • selenium repo: 2m 13s (9 specs NOT parallel)
        • selenium client: 6s (1 spec)
      • MinervaNeue: 26s
        • npm install: 11s
        • selenium: 12s (2 parallel)
    • Save success cache: 2s
    • PostBuild: 2m 10s

Grouped breakdown:

  • Total time: 21m 36
    • CI setup: 1m 24s (of which 13s of npm install)
    • Actually running tests: 17m 58s
      • npm install: 3m 58s
      • selenium/cypress: 12m 33s
    • CI teardown: 2m 13s

Potential action items / things to look into from the above:

  • Parallelizing the remaining tests (T226869) could help. Considering the timing of the slowest spec and a concurrency of 4, we would save ~50s in CampaignEvents, ~15s in ProofreadPage, and ~75s in Wikibase (keeping in mind that this is for just the repos that are tested together with CheckUser). I don't know about Wikibase, but my experience in CampaignEvents is that parallel tests seem more flaky, and debugging failures in CI is basically impossible due to T344754: Browser tests video capture is shared between tests.
  • We spend a lot of time in npm install. Are we actually caching npm packages? Are we only caching them for core, but not individual extensions? Is there a way in npm to print whether the cache was used? A quick local test with GrowthExperiments shows that npm install takes 19s uncached, and 7s from cache. So, either CI is not using the npm cache, or it's ~3x slower than local; either way, this is bad. Filed T393218.
  • Both npm install and the actual tests seem slower when cypress is used instead of webdriverio. It would be amazing if we used the same test framework everywhere but oh well. Maybe look into parallelizing cypress? The docs have something about it: https://docs.cypress.io/cloud/features/smart-orchestration/parallelization. Certainly it would be easier if we used the same test framework everywhere. That being said, maybe cypress isn't slower per se, and it just so happens that repos using cypress have more tests; maybe it would be worth measuring. Oh and did I mention that it would be easier if we used the same framework everywhere? I don't care if it's wdio, cypress, oak, or whatever, as long as it's the same everywhere and it's reasonably fast.
  • The entire PostBuild section is slow, and this is especially problematic because these are all things that happen after the patch has been tested. We spend 31s verifying the npm cache, 35s in castor-save-workspace-cache rsyncing the cache, and 54s cleaning up the workspace [1]. Is there anything we can do about it? Ideally, I imagine we'd do these asynchronously without blocking the main job, but I suppose that isn't possible. Still, anything else to reduce impact? Filed T393220.
  • Obviously, we can also improve the tests themselves, for example at the MW level, which was previously the focus in this task.
  • Likely something at the infrastructure level. As noted in T381895#10395616, local runs are much faster than in CI. I just tried again with PHP 8.3 on MediaWiki-Docker and it seems to be roughly twice as fast as CI.

(As an aside, another action item would be to improve the sections in CI output, as currently some sections are missing and others are nested while they shouldn't be. I guess that's T303056.)

[1] - That is, exec docker run --entrypoint=/usr/bin/find --user=root --volume /srv/jenkins/workspace/wmf-quibble-selenium-php81:/workspace --security-opt seccomp=unconfined --init --rm --label jenkins.job=wmf-quibble-selenium-php81 --label jenkins.build=12352 --env-file /dev/fd/63 docker-registry.wikimedia.org/buster:latest /workspace -mindepth 1 -delete

Both npm install and the actual tests seem slower when cypress is used instead of webdriverio. It would be amazing if we used the same test framework everywhere but oh well. Maybe look into parallelizing cypress? The docs have something about it: https://docs.cypress.io/cloud/features/smart-orchestration/parallelization. Certainly it would be easier if we used the same test framework everywhere. That being said, maybe cypress isn't slower per se, and it just so happens that repos using cypress have more tests; maybe it would be worth measuring. Oh and did I mention that it would be easier if we used the same framework everywhere? I don't care if it's wdio, cypress, oak, or whatever, as long as it's the same everywhere and it's reasonably fast.

I agree that it'd be good to standardize around one framework instead of two. Having multiple frameworks, languages, etc. in general is a source of complexity and technical debt. The idea of having MediaWiki devs know a couple of core languages and frameworks and then be able to quickly and easily work on other MediaWiki repos is appealing. I believe Selenium/wdio is the one that's standard in Wikimedia technical spaces?