mwselenium-quibble-docker test broken for WikibaseLexeme
Closed, ResolvedPublicPRODUCTION ERROR

Description

See for example:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/496994
and
https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/9960/

The failures seem to be:

22:43:38 Failing Scenarios:
22:43:38 cucumber features/forms.feature:26 # Scenario: Add grammatical feature
22:43:38 cucumber features/forms.feature:36 # Scenario: Remove grammatical feature
22:43:38 cucumber features/forms.feature:73 # Scenario: Edit statements on new Form
22:43:38 cucumber features/lexeme_header.feature:10 # Scenario: Update lexeme language
22:43:38 cucumber features/lexeme_header.feature:19 # Scenario: Update lexical category

Event Timeline

Smalyshev created this task.

Looking at some of the screenshots of https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/10025/ it looks like the API is returning no results, as the UI says no match was found (e.g. https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/10025/artifact/log/Lexeme%20Page%20Header%3A%20Update%20lexical%20category.png)

In the www logs I can see the item that is later searched for being inserted into the blob store as Q24:

[DBQuery] INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES ('120','Q24','','0','1','0.514590107990','20190318130858','0','0')
[DBQuery] SELECT  page_latest  FROM `page`    WHERE page_id = '46'  LIMIT 1  
[DBQuery] startAtomic: entering level 1 (MediaWiki\Revision\RevisionStore::insertRevisionOn)
[DBQuery] SELECT  comment_id  FROM `comment`    WHERE comment_hash = '-1188182935' AND comment_text = '/* wbeditentity-create-item:0| */' AND comment_data IS NULL  LIMIT 1  
[DBQuery] INSERT  INTO `revision` (rev_page,rev_parent_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_content_model,rev_content_format,rev_user,rev_user_text) VALUES ('46','0','0','20190318130858','0','298','1au848lswv85oa13edpkutac3q8o4oj',NULL,NULL,'0','127.0.0.1')
[DBQuery] SELECT  MAX(ar_rev_id)  FROM `archive`     LIMIT 1  
[DBQuery] SELECT  MAX(slot_revision_id)  FROM `slots`     LIMIT 1  
[DBQuery] INSERT  INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('60','2')
[DBQuery] INSERT  INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'{\"type\":\"item\",\"id\":\"Q24\",\"labels\":{\"en\":{\"language\":\"en\",\"value\":\"bZexKwAp\"}},\"descriptions\":{\"en\":{\"language\":\"en\",\"value\":\"TggDdmtmSKQAPWDFvgeq\"}},\"aliases\":[],\"claims\":[],\"sitelinks\":[]}','utf-8')
[DBQuery] UPDATE  `revision` SET rev_text_id = '59' WHERE rev_id = '60'

The next occurrence of this string in the logs is the start of the wbsearchentities request for the item:

Start request GET /api.php?action=wbsearchentities&search=bZexKwAp&format=json&language=en&uselang=en&type=item

Looking through the rest of the logs I do not see the data for this item being added to the wb_terms table...

I do see a couple of other calls to saveTermsOfEntity logged however:

[Wikibase] Wikibase\Lib\Store\Sql\TermSqlIndex::saveTermsOfEntity: 2 terms to insert.
[Wikibase] Wikibase\Lib\Store\Sql\TermSqlIndex::insertTerms: inserting terms for Q1
[Wikibase] Wikibase\Lib\Store\Sql\TermSqlIndex::saveTermsOfEntity: 2 terms to insert.
[Wikibase] Wikibase\Lib\Store\Sql\TermSqlIndex::insertTerms: inserting terms for Q4

Heading for a meeting now, but I will continue digging afterwards.

Change 497324 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] Revert "Split out new RefreshSecondaryDataUpdate class"

https://gerrit.wikimedia.org/r/497324

The patch in question should only change the behavior of *failing* DataUpdates. If no DataUpdates fail, the new behavior should be exactly the same as the old. If this is not the case, it's a bug I overlooked during review.

I guess that failing update is:

[CirrusSearch] Exception thrown while running DataSender::sendData in cluster default: Couldn't connect to host, Elasticsearch down?
#0 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Request.php(193): Elastica\Transport\Http->exec(Elastica\Request, array)
#1 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Client.php(688): Elastica\Request->send()
#2 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Index.php(559): Elastica\Client->request(string, string, array, array)
#3 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Type.php(512): Elastica\Index->request(string, string, array, array)
#4 /workspace/src/extensions/CirrusSearch/includes/DataSender.php(107): Elastica\Type->request(string, string, array, array)
#5 /workspace/src/extensions/CirrusSearch/includes/DataSender.php(128): CirrusSearch\DataSender->isAvailableForWrites()
#6 /workspace/src/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(129): CirrusSearch\DataSender->sendData(string, array)
#7 /workspace/src/extensions/CirrusSearch/includes/Job/Job.php(100): CirrusSearch\Job\ElasticaWrite->doJob()
#8 /workspace/src/includes/jobqueue/JobRunner.php(290): CirrusSearch\Job\Job->run()
#9 /workspace/src/includes/jobqueue/JobRunner.php(192): JobRunner->executeJob(CirrusSearch\Job\ElasticaWrite, Wikimedia\Rdbms\LBFactorySimple, BufferingStatsdDataFactory, integer)
#10 /workspace/src/includes/MediaWiki.php(1007): JobRunner->run(array)
#11 /workspace/src/includes/MediaWiki.php(990): MediaWiki->triggerSyncJobs(integer, MediaWiki\Logger\LegacyLogger)
#12 /workspace/src/includes/MediaWiki.php(914): MediaWiki->triggerJobs()
#13 /workspace/src/includes/MediaWiki.php(733): MediaWiki->restInPeace(string, boolean)
#14 /workspace/src/includes/MediaWiki.php(756): MediaWiki->{closure}()
#15 /workspace/src/includes/MediaWiki.php(550): MediaWiki->doPostOutputShutdown(string)
#16 /workspace/src/index.php(42): MediaWiki->run()
#17 {main}
[CirrusSearch] ElasticaWrite job reported failure on cluster default. Requeueing job with delay of 64.

But I haven't finished digging yet; I was going to see if the revert solves the immediate issue first.

That raises the question why that job fails.

But it also raises the question how the new RefreshSecondaryDataUpdate could have caused that. The intent behind the new code is to retry all updates that failed during saving.

Is it possible that you are seeing something fail during the retry, but the cause for the retry was something else?

Yes, that job probably shouldn't run when not linked up to a real ES service, ping @Smalyshev

@daniel I guess that is actually https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/466981/ for the retry strategy? Which doesn't appear to be merged yet.

The full logs can be dug through @ https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/10025/artifact/log/mw-debug-www.log

> Is it possible that you are seeing something fail during the retry, but the cause for the retry was something else?

Perhaps?

> @daniel I guess that is actually https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/466981/ for the retry strategy? Which doesn't appear to be merged yet.

Hm, I guess you are right. So the one thing RefreshSecondaryDataUpdate does differently is: it doesn't do error handling between updates. If one update fails, it aborts. This makes sense if all of them are retried, but that does not seem to be the case right now. So instead of making things more robust, this patch by itself has made things more brittle.
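The behavioral difference described above can be sketched as follows. This is a minimal illustration in Python, not the actual MediaWiki PHP; all function and variable names here are hypothetical stand-ins for the DataUpdate machinery:

```python
def run_updates_isolated(updates, log):
    """Old DeferredUpdates behavior: each update runs in its own
    try/except, so one failure does not stop the others."""
    for name, update in updates:
        try:
            update()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")


def run_updates_abort_on_failure(updates, log):
    """New RefreshSecondaryDataUpdate behavior as described above:
    no error handling between updates, so the first failure aborts
    the rest of the batch (sound only if the whole batch is retried,
    which was not yet the case)."""
    for name, update in updates:
        update()  # an exception propagates and skips later updates


def broken_cirrus_update():
    # Stands in for the ElasticaWrite job failing because no
    # Elasticsearch service is reachable in CI.
    raise ConnectionError("Couldn't connect to host, Elasticsearch down?")


terms_written = []

def terms_update():
    # Stands in for the terms-table write whose outcome the
    # wbsearchentities-based Selenium tests indirectly assert.
    terms_written.append("Q24")


updates = [("cirrus", broken_cirrus_update), ("terms", terms_update)]

# Old behavior: the broken update is logged, the terms update still runs.
log = []
run_updates_isolated(updates, log)
assert terms_written == ["Q24"]

# New behavior: the terms update is silently skipped.
terms_written.clear()
try:
    run_updates_abort_on_failure(updates, [])
except ConnectionError:
    pass
assert terms_written == []
```

With the known-broken update ordered before the working one, the working update never runs, which would match the missing wb_terms writes observed in the logs.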

Want to have a poke at the second patch? It already has 2x +1 :)

Addshore lowered the priority of this task from Unbreak Now! to High.

I hit +2 on the revert just now, which should solve all of the CI problems for now.
But I guess this will need further investigation before the change can be re-merged.

Change 497438 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make RefreshSecondaryDataUpdate skip failing updates in doUpdate()

https://gerrit.wikimedia.org/r/497438

Before (passes)

From 4 days ago, the mwselenium-quibble job for Wikibase was passing:

Its debug log file already contained the CirrusSearch error. It is, as such, a pre-existing non-fatal error; it is caught and logged.

mw-debug-www.log
[CirrusSearchRequests] counting links to 1 pages against {index} took 7 millis. Requested via web for cadea08ea322087e4c6121bf08d0ac19 by executor 1002236292
[CirrusSearch] Search backend error during counting links to 1 pages after 7: unknown: Couldn't connect to host, Elasticsearch down?
[CirrusSearchChangeFailed] Links for page ids: 4
[CirrusSearch] Running sendData on cluster default 0s after insertion
[CirrusSearch] Exception thrown while running DataSender::sendData in cluster default: Couldn't connect to host, Elasticsearch down?
#0 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Request.php(193): Elastica\Transport\Http->exec(Elastica\Request, array)
#1 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Client.php(688): Elastica\Request->send()
#2 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Index.php(559): Elastica\Client->request(string, string, array, array)
#3 /workspace/src/vendor/ruflin/elastica/lib/Elastica/Type.php(512): Elastica\Index->request(string, string, array, array)
#4 /workspace/src/extensions/CirrusSearch/includes/DataSender.php(107): Elastica\Type->request(string, string, array, array)
#5 /workspace/src/extensions/CirrusSearch/includes/DataSender.php(128): CirrusSearch\DataSender->isAvailableForWrites()
#6 /workspace/src/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(129): CirrusSearch\DataSender->sendData(string, array)
#7 /workspace/src/extensions/CirrusSearch/includes/Job/Job.php(100): CirrusSearch\Job\ElasticaWrite->doJob()
#8 /workspace/src/extensions/CirrusSearch/includes/Updater.php(228): CirrusSearch\Job\Job->run()
#9 /workspace/src/extensions/CirrusSearch/includes/Updater.php(85): CirrusSearch\Updater->updatePages(array, integer)
#10 /workspace/src/extensions/CirrusSearch/includes/Job/LinksUpdate.php(52): CirrusSearch\Updater->updateFromTitle(Title)
#11 /workspace/src/extensions/CirrusSearch/includes/Job/Job.php(100): CirrusSearch\Job\LinksUpdate->doJob()
#12 /workspace/src/includes/jobqueue/JobRunner.php(290): CirrusSearch\Job\Job->run()
#13 /workspace/src/includes/jobqueue/JobRunner.php(192): JobRunner->executeJob(CirrusSearch\Job\LinksUpdate, Wikimedia\Rdbms\LBFactorySimple, BufferingStatsdDataFactory, integer)
#14 /workspace/src/includes/MediaWiki.php(1007): JobRunner->run(array)
#15 /workspace/src/includes/MediaWiki.php(990): MediaWiki->triggerSyncJobs(integer, MediaWiki\Logger\LegacyLogger)
#16 /workspace/src/includes/MediaWiki.php(914): MediaWiki->triggerJobs()
#17 /workspace/src/includes/MediaWiki.php(733): MediaWiki->restInPeace(string, boolean)
#18 /workspace/src/includes/MediaWiki.php(756): MediaWiki->{closure}()
#19 /workspace/src/includes/MediaWiki.php(550): MediaWiki->doPostOutputShutdown(string)
#20 /workspace/src/index.php(42): MediaWiki->run()
#21 {main}
[CirrusSearch] ElasticaWrite job reported failure on cluster default. Requeueing job with delay of 64.
[DBQuery] BEGIN
[DBQuery] startAtomic: entering level 0 (Wikimedia\Rdbms\Database::onTransactionPreCommitOrIdle)
[DBQuery] SELECT  job_sha1  FROM `job`    WHERE job_sh

Currently (fails)

From today's and yesterday's commits, we see the mwselenium-quibble job for Wikibase is failing, such as on:

The failure is:

@integration @modify_entity @save_aliases
Scenario: Edit multiple aliases                              # features/aliases.feature:130
  When I click the header edit button                        # features/step_definitions/header_steps.rb:10
  And I enter "alias1", "alias2" as new aliases              # features/step_definitions/aliases_steps.rb:24
    unable to locate element, using {:xpath=>"//div[contains(@class, 'wikibase-aliasesview')]//li[contains(@class, 'tagadata-choice-empty')]//input", :tag_name=>"input or textarea", :type=>"(any text type)"} (Watir::Exception::UnknownObjectException)
    (eval):1:in `process_watir_call'
    ./features/support/modules/alias_module.rb:41:in `block in populate_aliases'
    ./features/support/modules/alias_module.rb:40:in `each'
    ./features/support/modules/alias_module.rb:40:in `/^I enter (.+) as new aliases$/'
    features/aliases.feature:132:in `And I enter "alias1", "alias2" as new aliases'
  And I click the header save button                         # features/step_definitions/header_steps.rb:18
  And I click the header edit button                         # features/step_definitions/header_steps.rb:10
  And I enter "alias3", "alias4" as new aliases              # features/step_definitions/aliases_steps.rb:24
  And I empty the first alias                                # features/step_definitions/aliases_steps.rb:9
  And I change the first alias to alias5                     # features/step_definitions/aliases_steps.rb:28
  And I click the header save button                         # features/step_definitions/header_steps.rb:18
  Then There should be 3 aliases in the list                 # features/step_definitions/aliases_steps.rb:76
  And List of aliases should be "alias5", "alias3", "alias4" # features/step_definitions/aliases_steps.rb:80

I do not know this feature well enough to know what this error really signifies in the user interface.

The previous comments from @Addshore and @WMDE-leszek suggest the following theory:

  • Wikibase is doing an edit, and then queueing several data updates.
  • Among these is a known-to-be-broken update for CirrusSearch, and another update that is meant to work that does something that this Selenium test is asserting the outcome of.
  • The theory is that they are in the order: broken update, then the working update.
  • We suspect this stopped working because a1f7fd3adaa380b27 in mediawiki/core had the side effect that updates after a broken update no longer run.

Aaron has restored this behaviour in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/497438/, but the Wikibase mwselenium job is still failing.

When comparing the before and after jobs more closely, I notice only one significant difference in its output and build artefacts: The "after" state has a non-empty mw-exception.log file. It contains over 1,000 instances of the following exception:

2019-03-18 23:24:30 18d8ed2703fb wikidb: [dbec6bb8575a6781d2ea952d] /load.php?debug=false&lang=en&modules=…jquery.wikibase.…   
MWException from line 870 of /workspace/src/includes/resourceloader/ResourceLoaderFileModule.php: ResourceLoaderFileModule::readScriptFiles: script file not found: "/workspace/src/extensions/Wikibase/view/lib/wikibase-data-values/src/dataValues.js"

This suggests the problem is that somewhere between now and 4 days ago, this file has disappeared from the CI build's perspective: either due to a commit in Wikibase, a commit in another repo checked out during the build, or some kind of CI infra regression causing the submodule not to be cloned, or to be cloned incorrectly.

Build output
…
fatal: could not read Username for 'https://gerrit.wikimedia.org': No such device or address
fatal: clone of 'https://gerrit.wikimedia.org/r/wikibase/javascript-api' into submodule path '/workspace/src/extensions/Wikibase/view/lib/wikibase-api' failed
Failed to clone 'view/lib/wikibase-api' a second time, aborting
extensions/VisualEditor/.gitmodules
+ git submodule foreach git clean -xdff -q
+ git submodule update --init --recursive
…

The build should have stopped here instead of allowing the Selenium tests to start with confusing failures. We saw this before; I filed it at T215689: When git-submodule clone fails in Quibble, the error should not be ignored. It is intermittent, and after a recheck it went away, but the Selenium tests were still failing the same way as before. So, back to figuring out why this is failing.

I've updated @Addshore's revert at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/497324/ to now be a revert of all three patches to mediawiki/core relating to RefreshSecondaryDataUpdate, and triggered a re-test of an unmerged Wikibase commit with this reverting patch applied to it through Depends-On, to help determine whether it is the cause or not. Looks like it is.

I've landed the revert to unblock Wikibase.

Next steps

For Wikidata (@Addshore, @WMDE-leszek):

  • Relying on this broken CirrusSearch job not breaking the build seems fragile. In the short-to-mid term, I think we might prefer not to have to support that. So it would be nice to either make that update work, or disable it in cases where it is not meant to work.

For Performance (@aaron, myself):

  • When re-submitting this refactor, be sure to submit a dummy patch to Wikibase with a Depends-On for the re-submission patch. Then we'll know before landing anything in core whether it is going to break the build. We normally get this for free for wmf-deployed extensions via the shared "gate" job, but for various reasons Wikibase is not yet part of that.
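For reference, such a dummy patch needs no code change at all; its commit message just carries a Depends-On footer pointing at the core change, so CI checks out the two together (the subject and body text here are illustrative, the change Id is the one used below):

```
Dummy commit to test Wikibase against the RefreshSecondaryDataUpdate change

Testing only, do not merge.

Depends-On: I9bbea3d13460ed44755d77fc61ff23fb906cf71e
```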

Setting Depends-On: I9bbea3d13460ed44755d77fc61ff23fb906cf71e seems to cure all my failing patches.

Change 497324 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] Revert "Split out new RefreshSecondaryDataUpdate class"

https://gerrit.wikimedia.org/r/497324

Change 497324 merged by jenkins-bot:
[mediawiki/core@master] Revert "Split out new RefreshSecondaryDataUpdate class"

https://gerrit.wikimedia.org/r/497324

Addshore claimed this task.

Marking as resolved, as the test failure itself is solved.
Other issues will be solved separately from this? Do we need some more tickets?

Change 497438 abandoned by Aaron Schulz:
Make RefreshSecondaryDataUpdate skip failing updates in doUpdate()

Reason:
This will be squashed into a single non-broken commit

https://gerrit.wikimedia.org/r/497438

Change 497537 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make DeferredUpdates enqueue on error and improve DataUpdate handling

https://gerrit.wikimedia.org/r/497537

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:07 PM

Change 497537 merged by jenkins-bot:
[mediawiki/core@master] Add RefreshSecondaryDataUpdate and use it in DerivedPageDataUpdater

https://gerrit.wikimedia.org/r/497537