
Move Selenium tests away from beta cluster: Selenium builds regularly failing due to failed API calls and timeouts
Open, Needs Triage, Public

Description

Background

We have had false positives in the Slack #performance-alerts channel recently, flagged by failing builds in https://integration.wikimedia.org/ci/view/Readers%20Teams/.
They seem to happen at times when the beta cluster is unstable (e.g. down, read-only, or having API issues, often due to bot scraping).

A recent error looked like this:

Error: invalidjson: No valid JSON response
    at /src/node_modules/mwbot/src/index.js:254:31
    at tryCatcher (/src/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/src/node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (/src/node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (/src/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/src/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/src/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/src/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/src/node_modules/bluebird/js/release/async.js:102:5)
    at Async.drainQueues [as _onImmediate] (/src/node_modules/bluebird/js/release/async.js:15:14)
    at process.processImmediate (node:internal/timers:483:21)

and

Unhandled Rejection at: Promise {
12:49:51 [0-0]   <rejected> WebDriverRequestError: WebDriverError: Request failed with error code UND_ERR_HEADERS_TIMEOUT when running "url" with method "GET"
12:49:51 [0-0]       at FetchRequest._libRequest (file:///src/node_modules/webdriver/build/node.js:1941:13)
12:49:51 [0-0]       at async FetchRequest._request (file:///src/node_modules/webdriver/build/node.js:1967:20)
12:49:51 [0-0]       at async Browser.wrapCommandFn (/src/node_modules/@wdio/utils/build/index.js:982:23) {
12:49:51 [0-0]     url: URL {

The API calls in particular seem to be problematic, as they need to be made at the beginning of each run (and often are unnecessary).
Could these jobs run against a more stable, dedicated environment?

User story

  • Add user story in the format: “As a [persona], I want to [X], so that [Y]”

Requirements

  • Add task requirements. Requirements should be user-centric, well-defined, unambiguous, implementable, testable, consistent, and comprehensive

BDD

  • For QA engineer to fill out

Test Steps

  • For QA engineer to fill out

Design

  • Add mockups and design requirements

Acceptance criteria

  • Add acceptance criteria

Communication criteria - does this need an announcement or discussion?

  • Add communication criteria

Rollback plan

  • What is the rollback plan in production for this task if something goes wrong?

This task was created by Version 1.2.0 of the Web team task template using phabulous

Event Timeline

Discussed this in refinement for QS-Automation today:

  • Catalyst needs to support building extensions, similar to how Wikifunctions is handled, to allow more ephemeral environments to be used

@thcipriani Is this something your team could assist with?

Once that work is done, QS-Automation can help with adjusting the job to target Catalyst environments and engage with Readers to determine what test data needs to be piped in for the tests to work.

This error (from the task description) occurs because mwbot is deprecated; the wdio-mediawiki version needs to be upgraded from 4.1.3 to 6+ for Related articles and Popups, since mwbot was removed in newer versions of wdio-mediawiki. I will create tasks and make those upgrades.

Error: invalidjson: No valid JSON response
    at /src/node_modules/mwbot/src/index.js:254:31
    at tryCatcher (/src/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/src/node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (/src/node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (/src/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/src/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/src/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/src/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/src/node_modules/bluebird/js/release/async.js:102:5)
    at Async.drainQueues [as _onImmediate] (/src/node_modules/bluebird/js/release/async.js:15:14)
    at process.processImmediate (node:internal/timers:483:21)
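Since newer wdio-mediawiki no longer pulls in mwbot, the fix described above is largely a devDependencies bump in each affected repo. A sketch of the package.json change, assuming the 4.1.3 → 6+ jump from the comment above (the exact version range per repo may differ):

```
{
    "devDependencies": {
        "wdio-mediawiki": "^6.0.0"
    }
}
```

After the bump, mwbot should disappear from the dependency tree, and any direct mwbot usage in the tests will need porting to whatever the newer wdio-mediawiki API provides.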

Still, though, I agree that we do need to migrate from the beta cluster. What are the top contenders for an initial migration of daily runs from the beta cluster to a Catalyst environment, @Jdlrobson-WMF @SLong-WMF? Maybe Related articles and Popups? A sortable list of automated tests that run daily jobs (with their wdio-mediawiki versions) is available here

I think we are missing a strategy for the tests we run as @daily, or it isn't clear to me. In your list, @vaughnwalters (I love that list!), we have extensions that run the exact same tests daily even though those tests are in the gated job (Math, for example). For Echo we run all tests as gated, and then a subset of those as daily. For the tests that run as gated for an extension, I would say: "We run these tests to make sure our core functionality isn't broken by changes in the extensions that run in gated." What would we say for the tests that we run against beta?

What are the top contenders for initial migration for daily runs from betacluster to a catalyst env

From my perspective, I'd say Popups and RelatedArticles, in that order. Of these, Popups may be the trickiest, since it needs the page transform service and so has hidden dependencies beyond the MediaWiki setup.

FWIW, the tests in Minerva were recently noted by the maintainers as redundant, so those are likely to be removed at some point in the future.

I'd agree. Popups -> RelatedArticles. It's also good to know the Minerva tests are going to be removed. Does the list you created automatically remove things that are no longer showing up, @vaughnwalters?

When tests are removed, they will automatically stop showing up in the test list, and the repo will display as having zero tests. Then I will have to manually remove it from the repos.txt file that chooses which repos to scan.
