Lexeme daily browser tests (against beta) flaky [timebox 8h]
Closed, DuplicatePublic

Assigned To

None

Authored By

	• Pablo-WMDE
	Oct 20 2020, 8:09 AM

Description

The daily lexeme browser tests (against beta) appear to be rather flaky. One recent failure gave an interesting error message which may help to make this more stable.

Could not save due to an error. The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. The System administrator who locked it offered this explanation: $1

as seen in this screenshot

I created a separate ticket despite the recent T222449 because we are no longer dealing with a consistent failure, and we now have a concrete error message (it does not appear consistently, but it could be a hint).

Time box 8 hours: Investigate what is the culprit that leads to those failures. Fix should be estimated separately once the cause is better understood.

Details

	Subject	Repo	Branch	Lines +/-
	Move 'beforeEach' hook into individual tests	mediawiki/extensions/WikibaseLexeme	master	+9 -2


Related Objects

Mentioned Here: T277862: selenium-daily-beta-WikibaseLexeme broken since mid feb 2021
T267561: Beta needs to be upgraded to Varnish 6
T266861: Wikibase browser tests are flaky
T222449: Make existing daily selenium nodejs tests for WikibaseLexeme green again

Event Timeline

• Pablo-WMDE created this task.Oct 20 2020, 8:09 AM

Restricted Application added a project: Wikidata.Oct 20 2020, 8:09 AM

Flakiness is strong with this one recently. Maybe it is a coincidence, but the manually triggered (and thus differently timed) reruns seem to fail at a much lower rate than the scheduled runs. Changing the schedule for when those tests run (to avoid whatever may be happening at the same time) could help reduce how often this goes red.

Screenshot_2020-10-26 selenium-daily-beta-WikibaseLexeme [Jenkins].png (770×369 px, 73 KB)

Michael subscribed.Oct 26 2020, 12:08 PM

Looking at the last 5 failures as of now, it seems to be covered by just two errors:

3x Can't call click on element with selector "#wpLoginAttempt" because element wasn't found
2x element (".representation-widget_representation-value-input") still existing after 10000ms (this is the error mentioned in the ticket description)

WMDE-leszek added a project: Wikidata-Campsite.Nov 3 2020, 1:59 PM

WMDE-leszek moved this task from Incoming to Prioritized Wikidata Tech Backlog (prioritised from top to bottom) on the Wikidata-Campsite board.

WMDE-leszek moved this task from Prioritized Wikidata Tech Backlog (prioritised from top to bottom) to Wikidata-Campsite-Iteration-∞ (On Hold) on the Wikidata-Campsite board.Nov 3 2020, 2:47 PM

WMDE-leszek edited projects, added Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)); removed Wikidata-Campsite.

WMDE-leszek updated the task description.

A potential way forward is to start by investigating Beta Logstash as this is assumed to be about an unstable environment more than a problem with the tests.

I recently had a discussion with @Tarrow on a similar, seemingly related issue (it was about CI, not beta): T266861: Wikibase browser tests are flaky

One of the hypotheses there (yet to be tested, I believe) was that network requests to load.php for loading JS may be failing, which would leave various elements not properly loaded and unusable.
This could also cover failing XHR requests for edits etc.
At the time we didn't believe that such failed requests within the browser would expose themselves when the browser tests were running.
This could be an underlying issue that we have failed to spot in our years of flaky browser tests and have never investigated (or perhaps failures like this do surface, and this is not the problem?).

Maintenance_bot moved this task from incoming to in progress on the Wikidata board.Nov 6 2020, 1:15 PM

Indeed, I made an attempt to investigate the equivalent strategy for CI (the debug logs) but after spending some time on it I didn't discover anything useful in there.

I’m not sure if now is a good time to investigate this… Beta has been generally unstable in the past few days (T267561), so the most recent failures are most likely due to that. (And the latest build succeeded.)

Seems like the last three failures have all been that login error:

Can't call click on element with selector "#wpLoginAttempt" because element wasn't found

It also always seems to happen on one specific test:

[Chrome 73.0.3683.75 linux #0-7] Lexeme:Lemma
[Chrome 73.0.3683.75 linux #0-7] ✓ can be edited
[Chrome 73.0.3683.75 linux #0-7] ✓ can be edited multiple times
[Chrome 73.0.3683.75 linux #0-7] ? can not save lemmas with redundant languages
[Chrome 73.0.3683.75 linux #0-7] ✖ "before each" hook for "can not save lemmas with redundant languages"

Curious – why should that test not be able to log in when other tests work fine? The stack trace indicates the login logic is in a LoginPage page object from wdio-mediawiki, not specific to this test.

Lucas_Werkmeister_WMDE renamed this task from Lexeme daily browser tests (against beta) flaky to Lexeme daily browser tests (against beta) flaky [timebox 8k].Nov 17 2020, 2:54 PM

Lucas_Werkmeister_WMDE renamed this task from Lexeme daily browser tests (against beta) flaky [timebox 8k] to Lexeme daily browser tests (against beta) flaky [timebox 8h].

I notice there’s a slight difference between the first two tests in the file:

can be edited

browser.call( () => LexemeApi.get( id )
    .then( ( lexeme ) => {
        assert.equal( 1, Object.keys( lexeme.lemmas ).length, 'No lemma added' );
        // eslint-disable-next-line dot-notation
        assert.equal( 'test lemma', lexeme.lemmas[ 'en' ].value, 'Lemma changed' );
    } ).catch( assert.fail )
);

can be edited multiple times

browser.call( () => LexemeApi.get( id ).then( ( lexeme ) => {
    assert.equal( 1, Object.keys( lexeme.lemmas ).length, 'No lemma added' );
    assert.equal( 'another lemma', lexeme.lemmas[ 'en-gb' ].value, 'Lemma changed' );
} ) );

The second test doesn’t have the .catch( assert.fail ) inside the browser.call(), and I don’t know if browser.call() does anything with failed promises by default. Maybe this means that the “can be edited multiple times” test is actually failing, but we don’t notice it due to the missing .catch( assert.fail ), and that’s what makes the setup of the next test (“redundant languages”) break unexpectedly?

One way to test this theory would be to write something like

browser.call( () => LexemeApi.get( id ).then( ( lexeme ) => {
    assert.equal( 1, 2 );
} ) );

and see how the browser tests behave then. Unfortunately, I’m currently unable to run the browser tests locally… maybe someone else can try this out?

Change 641732 had a related patch set uploaded (by Noa wmde; owner: Noa wmde):
[mediawiki/extensions/WikibaseLexeme@master] Move 'beforeEach' hook into individual tests

https://gerrit.wikimedia.org/r/641732

gerritbot added a project: Patch-For-Review.Nov 18 2020, 2:17 PM

noarave moved this task from To Do (prioritised from top to bottom) to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.Nov 18 2020, 2:24 PM

zeljkofilipin subscribed.Nov 19 2020, 11:14 AM

(Side note about my earlier comments: @noarave determined that assertion failures in a browser.call() without .catch( assert.fail ) are still properly reported as failures, so that’s probably not the cause of the problem, and we could most likely remove the .catch( assert.fail ) snippets.)

Change 641732 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Move 'beforeEach' hook into individual tests

https://gerrit.wikimedia.org/r/641732

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.20; 2020-12-01).Nov 19 2020, 2:00 PM

Maintenance_bot removed a project: Patch-For-Review.Nov 19 2020, 2:10 PM

I'm moving this to stalled instead of closing it, since it is quite likely that this issue has not actually been solved.

As this is still occurring and the time box has been exhausted with no actionable outcome, moving this to test (verification) to consider what the best way forward might be.

Addshore moved this task from Inbox to legacy-backlog on the [DEPRECATED] wdwb-tech board.Jan 23 2021, 12:00 AM

Addshore edited projects, added [DEPRECATED] wdwb-tech (legacy-backlog); removed [DEPRECATED] wdwb-tech.

Addshore moved this task from Incoming to Active on the [DEPRECATED] wdwb-tech (legacy-backlog) board.

Addshore edited projects, added [DEPRECATED] wdwb-tech; removed [DEPRECATED] wdwb-tech (legacy-backlog), MW-1.36-notes (1.36.0-wmf.20; 2020-12-01), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)).Feb 1 2021, 12:28 PM

Addshore moved this task from Inbox to Research on the [DEPRECATED] wdwb-tech board.

Addshore moved this task from Research to Investigate & Discuss on the [DEPRECATED] wdwb-tech board.Jul 16 2021, 8:39 AM

Closed in favour of the target task as T277862#7212305

	F32414179: Screenshot_2020-10-26 selenium-daily-beta-WikibaseLexeme [Jenkins].png
	Oct 26 2020, 7:54 AM
