
Bring up two copies of the CirrusSearch browser integration env in cloud
Closed, Resolved | Public | 5 Estimated Story Points

Description

The existing integration testing instance for CirrusSearch fell over a month or three ago. This infrastructure is critical for the upcoming upgrade to 7.10. The existing instance should be either restored or replaced, and a second instance should be stood up running elastic 7.10, with its checking/voting gated to the es7 branch.

Instructions: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/tests/integration/README.wmv-wmcs.md

Event Timeline

In the past, when rebuilding the integration env, I have often forgotten to save the ssh keys from the instance and then had to spend extra time getting that all reset. This time around I will try to remember to keep those.

Gehel set the point value for this task to 5. (Nov 22 2021, 4:20 PM)

So far I've set up a new instance, cirrus-integ02.search.eqiad1.wikimedia.cloud, and gotten it passing the test suite, with a few consecutive V+1 votes for the repair patch. I've moved those changes over to the existing instance, cirrus-integ.eqiad.wmflabs, and have also gotten one V+1 from it. I will let it loop a few more times.

Issues resolved so far:

  • I forgot most of how to work with the browser automation, but it turns out remote debugging from the chrome inspector to chrome-driver on the browser-bot has greatly improved since we built this and helped things along.
  • The development environment has error_reporting turned on by default, but deprecations caused Special:Random to return warnings instead of a redirect, causing tests that start on a random page (most of them) to fail. Disabled error_reporting in settings.d.
  • barrybot.py was reading the child's stdout before its stderr, which allowed a deadlock when the stderr buffer filled up. Redirected stderr to stdout to avoid having to select() or use threads to read both in parallel. This unfortunately took days to track down, as it wasn't clear why the node process would finish the test suite but not exit.
  • Parallelization fails with errors related to the browser not becoming available, or browser windows unexpectedly closing. Looked into it but found no answer; instead disabled parallelization, which increased runtime from ~8min to ~13min.
  • The default automation browser window is 800x600, which triggers css media queries related to the search box. This caused difficulty with the automation trying to interact with elements that aren't currently visible. Tests could be reworked to focus the correct elements, but for now increased the browser window to 1200x800 to get the traditional search box that is always present.
  • The search box now features a submit button that comes in and out of display, and the automation was intermittently trying to click the button when it wasn't displayed. Avoided by sending the enter key to the form input instead of clicking a button that may or may not be there.
  • File uploads weren't showing up in cirrus within the 30s we typically wait. Increased the timeout to 45s, still didn't get there. Increased jobrunner parallelism from 1 to 4, and increased the container memory limit. Seems to have worked.
  • vagrant provision on the new instance didn't detect it as a labs instance, likely because it's been renamed to .wikimedia.cloud. Hacked the appropriate domains into place so the external proxies for debugging work.
  • The test that searches for África only manages to input frica into the search box, but only on cirrus-integ.eqiad.wmflabs and not on the new instance. Unclear what caused it; fixed by destroying and rebuilding the vagrant instance.
  • Intermittent error, seen once and not solved yet, where Template:Template Test got the rendered content (not the revision text) intended for Catapult/adsf (the following edit). Liberal application of action=purge fixed the results, not clear what this is. action=purge fixing the problem suggests the mediawiki parser cached/returned the wrong content somehow.
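The stderr deadlock described above is a general hazard when a parent drains one pipe to completion before touching the other. A minimal sketch of the fix, assuming a Python parent similar in spirit to barrybot.py (the function name run_merged and the demo child command are hypothetical, not taken from the actual script):

```python
import subprocess
import sys


def run_merged(cmd):
    """Run cmd with the child's stderr folded into its stdout pipe.

    With only one pipe to drain there is no read ordering to get wrong,
    so the child cannot block on a full stderr buffer while the parent
    is still waiting for stdout to close.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # equivalent of 2>&1
        text=True,
    )
    lines = []
    for line in proc.stdout:  # iterates until the child closes the pipe
        lines.append(line.rstrip("\n"))
    proc.stdout.close()
    return proc.wait(), lines


if __name__ == "__main__":
    # Hypothetical child: writes far more to stderr than a typical pipe
    # buffer holds, then a final line to stdout.
    child = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('e' * 200000 + '\\n'); print('done')",
    ]
    returncode, output = run_merged(child)
    print(returncode, len(output))
```

The trade-off is that the two streams can no longer be told apart afterwards. If they need to stay separate, subprocess.Popen.communicate() reads both pipes concurrently and avoids the same deadlock.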

Future thoughts:

  • Our team doesn't own a UI, so can we remove the browser from the test suite? In the past we moved most of the testing in this suite to use the api instead, and I suspect some of the remaining cases can be converted (go tests could use the nearmatch api, etc.). The various problems with the environment that accumulated since we last used it amount to changes to parts of the UI. This will continue to break in the future, as the teams that make UI changes will not be fixing this test suite.
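As a rough illustration of what an api-based replacement for a "go" test might look like, the sketch below builds a search API request using srwhat=nearmatch and picks the top title out of a response. The helper names and the api.php host are hypothetical, and the existing test suite is not written in Python, so this only shows the shape of the request, not how the suite would implement it:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def nearmatch_url(api_base, query):
    """Build a list=search request that asks only for a near match."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srwhat": "nearmatch",
        "srlimit": 1,
        "format": "json",
        "formatversion": 2,
    }
    return api_base + "?" + urlencode(params)


def top_title(response):
    """Return the title of the first hit in a decoded JSON response, or None."""
    hits = response.get("query", {}).get("search", [])
    return hits[0]["title"] if hits else None


if __name__ == "__main__":
    # Hypothetical host; substitute the wiki under test.
    url = nearmatch_url("http://cirrustest.example/w/api.php", "África")
    print(url)
    print(parse_qs(urlsplit(url).query)["srsearch"])
```

A test would fetch the URL, decode the JSON, and compare top_title() against the page that the go feature is expected to land on, with no browser involved.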

Change 753189 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Repair browser bot integration

https://gerrit.wikimedia.org/r/753189

Change 753189 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Repair browser bot integration

https://gerrit.wikimedia.org/r/753189

Change 756581 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@es68] Repair browser bot integration

https://gerrit.wikimedia.org/r/756581

Change 756581 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@es68] Repair browser bot integration

https://gerrit.wikimedia.org/r/756581

Packages are now available for es68. Currently deployed and voting:

host                                     branch
cirrus-integ.eqiad.wmflabs               master
cirrus-integ02.eqiad1.wikimedia.cloud    es68

When ready, the es68 instance can instead vote against master by changing the --branch es68 line in run-cindy.sh.

With the es68 branch now in prod, I've created an es710 branch and upgraded cirrus-integ.eqiad.wmflabs to run against it.