
selenium-daily-beta-WikibaseLexeme broken since mid feb 2021
Open · Needs Triage · Public · 13 Estimated Story Points

Description

Initially spotted as T277859; that was then fixed, but some issues remained.

Most recent failure can be seen here https://integration.wikimedia.org/ci/job/selenium-daily-beta-WikibaseLexeme/

Acceptance criteria:

  • The daily test should be green
  • A trimmed down suite for running against beta has been created (see this comment for an example)

Details

Project | Branch | Lines +/- | Subject
mediawiki/extensions/WikibaseLexeme | REL1_36 | +2 -0 |
mediawiki/extensions/WikibaseLexeme | master | +13 -2 |
mediawiki/extensions/WikibaseLexeme | master | +12 -11 |
mediawiki/extensions/Wikibase | master | +109 -41 |
mediawiki/extensions/WikibaseLexeme | master | +5 -3 |
mediawiki/extensions/WikibaseLexeme | master | +3 -3 |
mediawiki/extensions/WikibaseLexeme | master | +29 -22 |
mediawiki/extensions/WikibaseLexeme | master | +27 -1 |
mediawiki/extensions/WikibaseLexeme | master | +26 -0 |
mediawiki/extensions/WikibaseLexeme | master | +9 -5 |
mediawiki/extensions/WikibaseLexeme | REL1_36 | +6 -78 |
mediawiki/extensions/WikibaseLexeme | REL1_36 | +35 -29 |
mediawiki/extensions/WikibaseLexeme | master | +35 -29 |
integration/config | master | +1 -1 |
mediawiki/extensions/WikibaseLexeme | master | +6 -78 |
mediawiki/extensions/WikibaseLexeme | master | +84 -37 |
mediawiki/extensions/WikibaseLexeme | master | +57 -180 |
mediawiki/core | master | +32 -24 |
mediawiki/extensions/WikibaseLexeme | master | +2 -0 |
mediawiki/extensions/WikibaseLexeme | master | +50 -3 |
mediawiki/extensions/WikibaseLexeme | master | +38 -0 |

Event Timeline

There are a very large number of changes, so older changes are hidden.

Still waiting on T274579: Upgrade WebdriverIO to v7 in all repositories
Thought we could bring this upgrade into our backlog to see if it helps.
Thoughts @Michael ?

Maybe. After looking at it, it seems the two tests that are currently persistently failing in the patch to upgrade Webdriver in Lexeme to v7 are the same ones that were fixed(?) by upgrading Webdriver in Lexeme to v6 on REL1_35. It is interesting that another major upgrade breaks them again, but I do not know what to make of that.

In an attempt to un-stall this, I've spent some time today trying to run this locally against beta, and it's reminding me a bit of playing the lottery.

With or without the new 7.4.6 version, it still fails randomly on every run; the full suite did not pass a single time. I don't think an upgrade will be the magic bullet here.

I tried bumping the timeouts, because they are currently set to 5000 ms for waitforTimeout and 60 seconds for a single test, which seems a bit optimistic, at least for the waitforTimeout:

	mochaOpts: {
		timeout: 120 * 1000
	},
	waitforTimeout: 20 * 1000,

This resolved some of the flakiness, but other tests were still failing left and right. On two occasions the site didn't even respond, which could just be my connection being crap, but it doesn't instill a lot of confidence.

Like @Michael said, these two:

15:38:30 [Chrome 90.0.4430.93 linux #0-1]    ✖ FormId counter is not decremented when addForm is undone
15:38:30 [Chrome 90.0.4430.93 linux #0-1]    ✖ FormId counter is not decremented when old revision is restored

Switching to the spec reporter also helps to make any sense out of the logs.

On another note: looking at this today and comparing with other repositories, the second AC of this task is looking more and more tasty. Why are there so many browser tests in Lexeme? Can't we just get rid of some of them (or at least mark them as skipped when running against beta, and keep them if they serve a purpose in CI)?

I know in the old ruby browser tests we could mark different browser tests to be for beta or CI only. I think it makes sense here if possible. I haven't looked into it yet.

Why are there so many browser tests in Lexeme? Can't we just get rid of some of them (or at least mark them as skipped when running against beta, and keep them if they serve a purpose in CI)?

Having too many Selenium tests is an anti-pattern. If anything can be tested with a different kind of test (unit, API, ...), it should not be a Selenium test.

Having a subset of stable tests running daily and passing is better than having all tests running and failing. There are examples in core on how to do it.

In package.json update selenium-daily to only run tests marked @daily.

Example:

"selenium-daily": "npm run selenium-test -- --mochaOpts.grep @daily"

If you want to include only some files, you can create a suite in wdio.conf.js.

Examples:

"selenium-daily": "npm run selenium-test -- --suite daily --mochaOpts.grep @daily"
	suites: {
		daily: [
			'tests/selenium/wdio-mediawiki/specs/*.js',
			'tests/selenium/specs/page.js',
			'tests/selenium/specs/user.js'
		]
	}
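To illustrate how the grep-based selection above behaves, here is a small, self-contained sketch (the test titles are made up for illustration): Mocha matches the grep pattern against the full test title, so adding "@daily" to a describe or it title is enough to opt it into the daily run.

```javascript
// Mocha's --grep matches against the full test title. Filtering a list of
// titles with the same regex shows which tests a "@daily" grep would select.
const titles = [
	'Lexeme Header can edit the lexical category @daily',
	'Lexeme Forms FormId counter is not decremented when addForm is undone'
];
const grep = /@daily/;
const selected = titles.filter( ( title ) => grep.test( title ) );
console.log( selected.length ); // 1
```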

I'm available for reviews or pairing.

toan updated the task description.

Moving this back to the todo column as we wanna make an attempt at running a trimmed down suite against beta instead of running all the specs.

I looked at what tests are failing. This is the list made from the last ten failures:

chrome.90_0_4430_212.linux.Lexeme_Header.can_edit_the_lexical_category_of_a_Lexeme
chrome.90_0_4430_212.linux.Lexeme_Header.can_edit_the_lexical_category_of_a_Lexeme
chrome.90_0_4430_212.linux.Lexeme_Header.before_each_hook_for_Lexeme_Header
chrome.90_0_4430_212.linux.Lexeme_Header.before_each_hook_for_Lexeme_Header
chrome.90_0_4430_212.linux.Lexeme_Header.before_each_hook_for_Lexeme_Header
chrome.90_0_4430_212.linux.Form_Header.before_all_hook_for_Form_Header


chrome.90_0_4430_212.linux.Special_Watchlist.before_each_hook_for_Special_Watchlist

chrome.90_0_4430_212.linux.with_default_LexemeLanguageCodePropertyId_configured.can_create_lexeme_with_language_item_bearing_language_code_statement
chrome.90_0_4430_212.linux.for_different_lemma_languages.can_create_lexeme_with_language_item_not_bearing_language_code_statement_and_en_lemma_language

chrome.90_0_4430_212.linux.Lexeme_Forms.before_all_hook_for_Lexeme_Forms

chrome.90_0_4430_212.linux.Lexeme_Lemma.can_be_edited
chrome.90_0_4430_212.linux.Lexeme_Lemma.can_not_save_lemmas_with_redundant_languages

chrome.90_0_4430_212.linux.Lexeme_Senses.before_all_hook_for_Lexeme_Senses
chrome.90_0_4430_212.linux.Lexeme_Senses.before_all_hook_for_Lexeme_Senses
chrome.90_0_4430_212.linux.Lexeme_Senses.before_all_hook_for_Lexeme_Senses
chrome.90_0_4430_212.linux.Lexeme_Senses.before_all_hook_for_Lexeme_Senses

chrome.90_0_4430_212.linux.Lexeme_Senses.can_not_save_senses_with_redundant_languages
chrome.90_0_4430_212.linux.Lexeme_Senses.Removes_Gloss

chrome.90_0_4430_212.linux.Lexeme_Statements.before_each_hook_for_Lexeme_Statements
chrome.90_0_4430_212.linux.Lexeme_Statements.before_each_hook_for_Lexeme_Statements

The "before" ones are basically login attempts.

Change 709989 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseLexeme@master] Stop trying to login in each browser test

https://gerrit.wikimedia.org/r/709989

Change 709989 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Stop trying to login in each browser test

https://gerrit.wikimedia.org/r/709989

Mh, the last builds seem to have timed out :/

Change 713623 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[integration/config@master] jjb: Increase timeout of daily selenium jobs to an hour

https://gerrit.wikimedia.org/r/713623

Change 713623 merged by jenkins-bot:

[integration/config@master] jjb: Increase timeout of daily selenium jobs to an hour

https://gerrit.wikimedia.org/r/713623

I increased the timeout time, let's wait to see if it's fixed now.

Well, it looks like the tests aren’t timing out anymore, but they’re still failing. In the latest run, it looks like a core test failed to log in:

Error Message
Can't call click on element with selector "#wpLoginAttempt" because element wasn't found
Standard Error

Error: Can't call click on element with selector "#wpLoginAttempt" because element wasn't found
    at LoginPage.login (/src/node_modules/wdio-mediawiki/LoginPage.js:19:20)
    at Context.<anonymous> (/src/tests/selenium/specs/special/watchlist.js:24:13)

In https://integration.wikimedia.org/ci/job/selenium-daily-beta-WikibaseLexeme/1127/console there were 2 failures.


19:35:49 [0-15] RUNNING in chrome - /tests/selenium/specs/special/watchlist.js
19:36:12 [0-15] FAILED in chrome - /tests/selenium/specs/special/watchlist.js

This matches up with the comment by @Lucas_Werkmeister_WMDE a couple of days ago T277862#7301860

https://github.com/wikimedia/mediawiki-extensions-WikibaseLexeme/blob/master/tests/selenium/specs/special/watchlist.js

But there is no artefact to allow us to see what is happening?


The second is:

19:12:05 [0-1] Error in "Lexeme:Forms FormId counter is not decremented when addForm is undone"
19:12:05 Error: element (".representation-widget_representation-value-input") still existing after 5000ms
19:12:05     at LexemePage.addForm (/src/tests/selenium/pageobjects/lexeme.page.js:289:5)
19:12:05     at Context.<anonymous> (/src/tests/selenium/specs/form.edit.js:224:14)

These tests show up regularly in the central browser test monitoring that I do in T277205: Centrally look for flakey browser tests

image.png (164×2 px, 45 KB)

It generally looks like everything is going just fine in this test, only that more time is needed for the API calls to be made and the UI to update.
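The general pattern behind "give the API calls and the UI more time" is a polling wait with a configurable deadline. A minimal standalone sketch (illustrative only; in real tests WebdriverIO's own waitUntil/waitFor* helpers provide this):

```javascript
// Re-check a condition until it holds or the deadline passes.
async function waitFor( condition, { timeout = 20000, interval = 250 } = {} ) {
	const deadline = Date.now() + timeout;
	for ( ;; ) {
		if ( await condition() ) {
			return;
		}
		if ( Date.now() >= deadline ) {
			throw new Error( `Condition not met within ${ timeout }ms` );
		}
		await new Promise( ( resolve ) => setTimeout( resolve, interval ) );
	}
}

// Usage: resolve once a (simulated) slow API call has taken effect.
let apiDone = false;
setTimeout( () => { apiDone = true; }, 100 );
waitFor( () => apiDone, { timeout: 2000, interval: 20 } )
	.then( () => console.log( 'ok' ) );
```

Raising the timeout shifts where on this curve a flaky environment lands; it does not remove the race, which is why the varied timeouts only reduced, not eliminated, the failures.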

Change 715767 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/WikibaseLexeme@master] selenium: Introduce longer and varied browser test timeouts

https://gerrit.wikimedia.org/r/715767

But there is no artefact to allow us to see what is happening?

Looks like that one failed in before each hook. At the moment we don't take screenshots or videos in the hooks. Tracked as T285078.

See https://integration.wikimedia.org/ci/job/selenium-daily-beta-WikibaseLexeme/1127/testReport/

Change 715767 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] selenium: Introduce longer and varied browser test timeouts

https://gerrit.wikimedia.org/r/715767

Change 719081 had a related patch set uploaded (by Addshore; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseLexeme@REL1_36] Stop trying to login in each browser test

https://gerrit.wikimedia.org/r/719081

Change 719082 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/WikibaseLexeme@REL1_36] selenium: Introduce longer and varied browser test timeouts

https://gerrit.wikimedia.org/r/719082

20:29:36 TypeError: Cannot read property '$' of undefined
20:29:36     at SensePage.startEditingNthSense (/src/tests/selenium/pageobjects/sense.page.js:50:24)
20:29:36     at SensePage.addGlossToNthSense (/src/tests/selenium/pageobjects/sense.page.js:120:8)
20:29:36     at Context.<anonymous> (/src/tests/selenium/specs/sense.edit.js:127:13)

hmmmm

This has been failing consistently every day for the last two+ weeks.

Seems like we still have

  • 60 second timeouts happening
  • Cannot read property '$' of undefined
  • Cannot read property 'waitForClickable' of undefined

The latter two sound like something we should not fail on. Would it make sense to un-stall this back into todo and have a closer look at why we are seemingly using things that aren't available?
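The "Cannot read property '$' of undefined" failures typically mean an indexed element lookup returned undefined, i.e. the page had fewer elements than the test expected. A defensive sketch with illustrative names (not the actual sense.page.js code):

```javascript
// Fail with a descriptive message instead of a TypeError when an
// element-collection lookup comes back short.
function getNthElement( elements, n, label ) {
	const element = elements[ n ];
	if ( element === undefined ) {
		throw new Error(
			`Expected at least ${ n + 1 } ${ label }, found ${ elements.length }`
		);
	}
	return element;
}

try {
	getNthElement( [ 'sense-1' ], 3, 'senses' );
} catch ( e ) {
	console.log( e.message ); // Expected at least 4 senses, found 1
}
```

The error message then points at the real problem (the page not being ready, or an earlier step silently failing) rather than at an incidental property access.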

I don't know if it's related or not but I actually couldn't save an edit due to this error:

Could not save due to an error.

The tag "wikidata-ui" is not allowed to be manually applied.

Let me create the tags, it might fix it.

Change 719082 abandoned by Addshore:

[mediawiki/extensions/WikibaseLexeme@REL1_36] selenium: Introduce longer and varied browser test timeouts

Reason:

This was only a maybe, and CI says no, so not going to invest time in this

https://gerrit.wikimedia.org/r/719082

Change 719081 abandoned by Addshore:

[mediawiki/extensions/WikibaseLexeme@REL1_36] Stop trying to login in each browser test

Reason:

This was only a maybe, and CI says no, so not going to invest time in this

https://gerrit.wikimedia.org/r/719081

I don't know if it's related or not but I actually couldn't save an edit due to this error:

Could not save due to an error.

The tag "wikidata-ui" is not allowed to be manually applied.

Let me create the tags, it might fix it.

Seems to have done something; we are now down to 1 failure as of yesterday: https://integration.wikimedia.org/ci/job/selenium-daily-beta-WikibaseLexeme/1143/

Many of them are still errors about something being undefined; however, the one in the latest build is interesting:

readonly: The database has been automatically locked while the replica database servers catch up to the primary

(One of the errors on the 18th was the same.) This seems like something we should detect and ignore, but in a central way, somehow…
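One way to detect this centrally would be to classify the API error code before failing the test. A sketch (the field names follow the MediaWiki action API error format, { error: { code, info } }; the retry policy itself is an assumption, not something from this task):

```javascript
// Classify a MediaWiki API failure as a transient read-only error, so a
// runner could retry instead of marking the build red.
function isTransientReadOnly( response ) {
	return Boolean( response && response.error && response.error.code === 'readonly' );
}

const failure = {
	error: {
		code: 'readonly',
		info: 'The database has been automatically locked while the replica database servers catch up to the primary'
	}
};
console.log( isTransientReadOnly( failure ) ); // true
console.log( isTransientReadOnly( { error: { code: 'badtoken' } } ) ); // false
```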

Maybe we should invest some time to see what's making beta cluster go readonly this much. I have a feeling something in update.php is at fault here.

update.php on the beta cluster took 17 seconds, which basically locks the whole beta cluster for that time:

  • It purges object cache. Maybe we can reduce that.
  • It tries to remove the default value of lots of fields on each run, because the check for a default value tests whether the result is false, which it never is (it's an empty string if the default value is null, an empty string, or even zero):
ladsgroup@deployment-deploy01:~$ mwscript eval.php --wiki=enwiki
> $db = wfGetDB( DB_PRIMARY );

> var_dump( $db->fieldInfo( 'revision', 'rev_sha1')->defaultValue() );
string(0) ""

> var_dump( $db->fieldInfo( 'revision', 'rev_parent_id')->defaultValue() );
string(0) ""

> var_dump( $db->fieldInfo( 'revision', 'rev_deleted')->defaultValue() );
string(0) ""

I'm going to check other bits too
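The comparison bug described above can be sketched as follows (JavaScript for illustration; the real check lives in PHP schema-update code, and the function names here are made up):

```javascript
// Checking "field still has a default" with a strict comparison against
// false never matches when the driver reports an empty string, so the
// updater re-applies the same change on every run.
function needsDefaultRemoved_buggy( reportedDefault ) {
	return reportedDefault !== false; // '' !== false, so this is always true
}

function needsDefaultRemoved_fixed( reportedDefault ) {
	return reportedDefault !== false &&
		reportedDefault !== '' &&
		reportedDefault !== null;
}

console.log( needsDefaultRemoved_buggy( '' ) ); // true  (runs every time)
console.log( needsDefaultRemoved_fixed( '' ) ); // false (correctly skipped)
```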

I took the binlog of the primary DB and ordered the entries by number of invocations (it's not a great metric, but better than nothing):

MediaWiki\Extension\OATHAuth\Hook\LoadExtensionSchemaUpdates\UpdateTables::switchTOTPScratchTokensToArray www-data@deploy... 14784
GrowthExperiments\Maintenance\UpdateMenteeData::execute www-data@deploy... 6933
Wikimedia\DependencyStore\SqlModuleDependencyStore::storeMulti 5733
DatabaseUpdater::purgeCache www-data@deploy... 2160
User::saveSettings 2032

I try to see what I can fix there to reduce the load.

The first one is fixed by nerd-sniping a friend: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/723279

Currently working on nerd-sniping someone else for GE

With the first one merged, the time to run update.php is reduced to 7 seconds ^^

Moving to stalled to see if the changes had any effect on reducing the readonly errors.

The read-only part is mostly cleaned up but the rest is not fixed.

Exciting news: on 2021-10-09, the WikibaseLexeme build actually finished successfully (#1167).

Screenshot from 2021-10-11 11-38-46.png (196×1 px, 73 KB)

Presumably this was sheer luck, but I guess it’s still nice to know that it’s possible for these tests to all succeed at the same time.

Change 736486 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: speed up test by not creating new category and language

https://gerrit.wikimedia.org/r/736486

Change 736496 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: add performance logging

https://gerrit.wikimedia.org/r/736496

Change 736486 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: speed up test by not creating new category and language

https://gerrit.wikimedia.org/r/736486

Change 736496 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: add performance logging

https://gerrit.wikimedia.org/r/736496

Ok, so the patch from yesterday that added some primitive performance logging did provide some insights:

Search for "0: test started" in yesterday's logs to find the output for the test we wanted to look at more closely.

Some observations:

  • Creating a new Item via the API takes around 10 seconds
  • Creating a new Lexeme via the API takes around 20 seconds

The latter point I confirmed via the API Sandbox on beta wikidata:

image.png (911×1 px, 115 KB)

It is unclear why it takes such an unreasonable amount of time. Interestingly, creating the Lexeme via Special:NewLexeme is much faster! It is no surprise now, though, that we are seeing a lot of timeout errors in our daily Lexeme browser tests.

I'm not yet sure how to debug this further. Is it possible to somehow profile requests on beta?


There is also a failing test around login problems that is unrelated to timeouts. I'm still looking into that one.

xhgui for beta sounds lovely, I don't know how it can be implemented. It might be quite a bit of work. You can look at server-side logs in beta cluster https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Log_Files

And if it's an issue in databases or replication, you can look at binlogs. If arclamp works in beta cluster, we can possibly look at flamegraphs as well.

xhgui for beta sounds lovely, I don't know how it can be implemented. It might be quite a bit of work. You can look at server-side logs in beta cluster https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Log_Files

Thanks, I looked at the logs and couldn't find any entries that would seem related.

And if it's an issue in databases or replication, you can look at binlogs.

Mh, creating a Lexeme via the API is much slower than creating it through the special page. I think, if the DB were the cause, I would expect them to be equally slow?

If arclamp works in beta cluster, we can possibly look at flamegraphs as well.

Sure, that would certainly be helpful. If that data were to exist, where would I find it? It doesn't seem to be at https://performance.wikimedia.org/arclamp/logs/daily/

My current favorite speculation is that on beta something like AbuseFilter sits in front of the API requests via some hook and takes a long time, but doesn't affect special pages.

There are a bunch of extensions that are enabled on beta, but not on test wikidata:

  • ChessBrowser
  • PageAssessments
  • StopForumSpam
  • RelatedArticles
  • Dashiki
  • IPInfo
  • Newsletter
  • ORES
  • Popups
  • TheWikipediaLibrary

Anything that rings a bell as a possible cause? I would speculate that maybe PageAssessments or StopForumSpam could have some hooks that look at page creations via the API or something?

Change 737344 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: make lougout work with CentralAuth

https://gerrit.wikimedia.org/r/737344

Change 737344 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: make lougout work with CentralAuth

https://gerrit.wikimedia.org/r/737344

Change 737394 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: extract waiting for replication and add waits

https://gerrit.wikimedia.org/r/737394

Change 737394 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: extract waiting for replication and add waits

https://gerrit.wikimedia.org/r/737394

Change 737640 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: work around "badtoken: Invalid CSRF token" error

https://gerrit.wikimedia.org/r/737640

Change 737640 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: work around "badtoken: Invalid CSRF token" error

https://gerrit.wikimedia.org/r/737640

Change 737900 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: consolidate test and guard against replication lag

https://gerrit.wikimedia.org/r/737900

Change 737900 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: consolidate test and guard against replication lag

https://gerrit.wikimedia.org/r/737900

Change 738378 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Selenium: initialize WikibaseApi with ChronologyProtector

https://gerrit.wikimedia.org/r/738378

Change 738421 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/Wikibase@master] Use new wdio-wikibase

https://gerrit.wikimedia.org/r/738421

Change 738421 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] bridge: Use new wdio-wikibase

https://gerrit.wikimedia.org/r/738421

Change 738378 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Selenium: initialize WikibaseApi with ChronologyProtector

https://gerrit.wikimedia.org/r/738378

Change 739539 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikibaseLexeme@master] Verify Lexeme creation preconditions on primary

https://gerrit.wikimedia.org/r/739539

Change 739539 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Verify Lexeme creation preconditions on primary

https://gerrit.wikimedia.org/r/739539

Change 759957 had a related patch set uploaded (by Reedy; author: Ladsgroup):

[mediawiki/extensions/WikibaseLexeme@REL1_36] Wait for two seconds before opening RC so replication catches up

https://gerrit.wikimedia.org/r/759957

Change 759957 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@REL1_36] Wait for two seconds before opening RC so replication catches up

https://gerrit.wikimedia.org/r/759957

Michael removed a project: User-Michael.

The most recent builds seem to be green by default, though I do not know what caused this. They are also much faster:

image.png (584×296 px, 51 KB)

I don't intend to work on this in the near future, so I'm unassigning myself.