
Lexeme daily browser tests (against beta) flaky [timebox 8h]
Open, Needs Triage (Public)

Description

The daily lexeme browser tests (against beta) appear to be rather flaky. One recent failure gave an interesting error message which may help to make this more stable.

Could not save due to an error. The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. The System administrator who locked it offered this explanation: $1

as seen in this screenshot

I created a separate ticket despite the recent T222449 because we are no longer dealing with a consistent failure, and we now have a concrete error message (it does not appear consistently, but it could be a hint).

Time box 8 hours: Investigate the culprit behind these failures. The fix should be estimated separately once the cause is better understood.

Event Timeline

Flakiness is strong with this one recently. Maybe it is a coincidence, but the manually triggered (and thus differently timed) reruns seem to fail at a much lower rate than the scheduled runs. Changing the schedule for these tests (to avoid whatever may be happening at the same time) could help reduce how often this goes red.

Looking at the last 5 failures as of now, it seems to be covered by just two errors:

3x Can't call click on element with selector "#wpLoginAttempt" because element wasn't found
2x element (".representation-widget_representation-value-input") still existing after 10000ms (this is the error mentioned in the ticket description)

A potential way forward is to start by investigating Beta Logstash, as this is assumed to be caused by an unstable environment rather than by a problem with the tests.

I recently had a discussion with @Tarrow on a similar and seemingly related ticket / issue (it was about CI, not beta): T266861: Wikibase browser tests are flaky

One of the hypotheses there (yet to be tested, I believe) was that network requests to load.php for loading JS may be failing, which results in various elements not loading properly and not being usable.
This could also cover failing XHR requests for edits etc.
At the time we didn't believe that such failed requests within the browser would expose themselves when the browser tests were running.
This could potentially be an underlying issue that we have failed to spot throughout our years of flaky browser tests and have never investigated (or perhaps things like this do surface, and this is not the problem?).
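
If we wanted to check this hypothesis, one option might be to collect the browser's console log at the end of each test and look for failed load.php requests. Below is a minimal sketch of the filtering step only. It assumes log entries shaped like { level, message }, as Selenium-style browser logs usually are; the helper name findFailedRequests and the sample entries are made up for illustration and do not come from a real run.

```javascript
'use strict';

// Hypothetical helper: keep only severe log entries that mention a URL fragment.
// Assumes each entry looks like { level: string, message: string }.
function findFailedRequests( entries, urlFragment ) {
	return entries.filter( ( entry ) =>
		entry.level === 'SEVERE' && entry.message.includes( urlFragment )
	);
}

// Made-up sample entries, for illustration only (not taken from a real run).
const sampleEntries = [
	{ level: 'INFO', message: 'page loaded' },
	{ level: 'SEVERE', message: 'https://example.org/w/load.php?modules=example - Failed to load resource' },
	{ level: 'SEVERE', message: 'https://example.org/favicon.ico - Failed to load resource' }
];

const failed = findFailedRequests( sampleEntries, 'load.php' );
console.log( failed.map( ( entry ) => entry.message ) );
```

Whether the test runner actually exposes such entries on beta is exactly the open question, so this only shows what we would do with them if it does.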

Indeed, I made an attempt to investigate the equivalent strategy for CI (the debug logs) but after spending some time on it I didn't discover anything useful in there.

I’m not sure if now is a good time to investigate this… Beta has been generally unstable in the past few days (T267561), so the most recent failures are most likely due to that. (And the latest build succeeded.)

Seems like the last three failures have all been that login error:

Can't call click on element with selector "#wpLoginAttempt" because element wasn't found

It also always seems to happen on one specific test:

[Chrome 73.0.3683.75 linux #0-7] Lexeme:Lemma
[Chrome 73.0.3683.75 linux #0-7] ✓ can be edited
[Chrome 73.0.3683.75 linux #0-7] ✓ can be edited multiple times
[Chrome 73.0.3683.75 linux #0-7] ? can not save lemmas with redundant languages
[Chrome 73.0.3683.75 linux #0-7] ✖ "before each" hook for "can not save lemmas with redundant languages"

Curious – why should that test not be able to log in when other tests work fine? The stack trace indicates the login logic is in a LoginPage page object from wdio-mediawiki, not specific to this test.

Lucas_Werkmeister_WMDE renamed this task from Lexeme daily browser tests (against beta) flaky to Lexeme daily browser tests (against beta) flaky [timebox 8k].Nov 17 2020, 2:54 PM
Lucas_Werkmeister_WMDE renamed this task from Lexeme daily browser tests (against beta) flaky [timebox 8k] to Lexeme daily browser tests (against beta) flaky [timebox 8h].

I notice there’s a slight difference between the first two tests in the file:

can be edited
browser.call( () => LexemeApi.get( id )
    .then( ( lexeme ) => {
        assert.equal( 1, Object.keys( lexeme.lemmas ).length, 'No lemma added' );
        // eslint-disable-next-line dot-notation
        assert.equal( 'test lemma', lexeme.lemmas[ 'en' ].value, 'Lemma changed' );
    } ).catch( assert.fail )
);
can be edited multiple times
browser.call( () => LexemeApi.get( id ).then( ( lexeme ) => {
    assert.equal( 1, Object.keys( lexeme.lemmas ).length, 'No lemma added' );
    assert.equal( 'another lemma', lexeme.lemmas[ 'en-gb' ].value, 'Lemma changed' );
} ) );

The second test doesn’t have the .catch( assert.fail ) inside the browser.call(), and I don’t know if browser.call() does anything with failed promises by default. Maybe this means that the “can be edited multiple times” test is actually failing, but we don’t notice it due to the missing .catch( assert.fail ), and that’s what makes the setup of the next test (“redundant languages”) break unexpectedly?

One way to test this theory would be to write something like

browser.call( () => LexemeApi.get( id ).then( ( lexeme ) => {
    assert.equal( 1, 2 );
} ) );

and see how the browser tests behave then. Unfortunately, I’m currently unable to run the browser tests locally… maybe someone else can try this out?

Change 641732 had a related patch set uploaded (by Noa wmde; owner: Noa wmde):
[mediawiki/extensions/WikibaseLexeme@master] Move 'beforeEach' hook into individual tests

https://gerrit.wikimedia.org/r/641732

(Side note about my earlier comments: @noarave determined that assertion failures in a browser.call() without .catch( assert.fail ) are still properly reported as failures, so that’s probably not the cause of the problem, and we could most likely remove the .catch( assert.fail ) snippets.)

Change 641732 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Move 'beforeEach' hook into individual tests

https://gerrit.wikimedia.org/r/641732

I'm moving this to stalled instead of closing it, since there is a high probability that this issue is not solved.

As this is still occurring and the time box has been exhausted with no actionable outcome, I'm moving this to test (verification) to consider what the best way forward might be.

Addshore edited projects, added wdwb-tech (legacy-backlog); removed wdwb-tech.
Addshore moved this task from Incoming to Active on the wdwb-tech (legacy-backlog) board.