Page MenuHomePhabricator

MediaWiki core's selenium tests flaky when run as part of mwext-mw-selenium-node-composer-jessie job
Closed, ResolvedPublic

Description

Background: WikibaseLexeme extension adds node selenium tests, and needs them to be part of extension's CI build. As not all of the extension's dependencies are part of mediawiki vendor yet, running the tests require install dependencies "locally" with composer.

In order to achieve the above, we've created a dedicated job (there was no node selenium + composer combo before): mwext-mw-selenium-node-composer-jessie.
With the patch the should be green, i.e. https://gerrit.wikimedia.org/r/#/c/422931/, we're experiencing quite some flakiness of MW core tests (our tests are consistently passing).
Example of a failing job: https://integration.wikimedia.org/ci/job/mwext-mw-selenium-node-composer-jessie/11/console.

As core's node selenium job (mediawiki-core-qunit-selenium-jessie) seem to be rather reliable, it could be there there is some difference in config between mwext-mw-selenium-node-composer-jessie and mediawiki-core-qunit-selenium-jessie. I failed to find the missing bits though. Maybe @zeljkofilipin has an idea where to look at?

Core test failures have also not been reproduced when running tests locally.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 5 2018, 4:12 PM

Apologies for being slow on this, I was busy with other urgent tasks. I should have the time to take a look at this today.

zeljkofilipin triaged this task as Normal priority.Apr 11 2018, 11:09 AM

@zeljkofilipin: any luck finding out what might be going wrong there? One of my guesses has been that I stuffed up something in the config of mwext-mw-selenium-node-composer-jessie job. I couldn't really find what could have that been though.

Sorry, this week was really busy. I am looking at this. Since this is more CI/Jenkins related, @hashar might know more.

It is probably not needed for this task, but it would be of great if I could run the tests locally. I don't know which mediawiki-vagrant role to enable to get WikibaseLexeme running. wikibase_repo, wikidata...? I have took a look at the README.md file, but the required role is not mentioned.

zeljkofilipin added a comment.EditedApr 13 2018, 6:20 PM

When one job is working and another is not, it's useful to compare their outputs. That works great when it's the same job, only different runs, but it should work reasonably well for two similar jobs, like in this case.

curl https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-selenium-jessie/19157/consoleText > mediawiki-core-qunit-selenium-jessie-19157
...

curl https://integration.wikimedia.org/ci/job/mwext-mw-selenium-node-composer-jessie/11/consoleText > mwext-mw-selenium-node-composer-jessie-11
...

diff mediawiki-core-qunit-selenium-jessie-19157 mwext-mw-selenium-node-composer-jessie-11
...
< INFO:zuul.CloneMapper:  mediawiki/core -> src
< INFO:zuul.CloneMapper:  mediawiki/vendor -> src/vendor
---
> INFO:zuul.CloneMapper:  mediawiki/core -> src
> INFO:zuul.CloneMapper:  mediawiki/extensions/WikibaseLexeme -> src/extensions/WikibaseLexeme
...

zeljkofilipin added a comment.EditedApr 13 2018, 6:28 PM

It's also useful to compare configuration for two jobs:

jenkins-jobs --conf etc/jenkins_jobs.ini test config/jjb/ -o output
...

cd output

diff mediawiki-core-qunit-selenium-jessie mwext-mw-selenium-node-composer-jessie
...
<     mediawiki/core mediawiki/vendor $(cat deps.txt) $(cat deps_skins.txt)
---
>     mediawiki/core $(cat deps.txt) $(cat deps_skins.txt)
...


Looking at the few recent runs of the job:

  • 104 Page should be deletable
  • 103 Echo alerts and notices are visible after logging in
  • 102 No failures
  • 101 Page should be deletable, Echo alerts and notices are visible after logging in
  • 100 Echo alerts and notices are visible after logging in

Having videos of test runs would help here. Feel free to get involved in T179188: Video recording for Selenium tests in Node.js. Reviews of 422933 are welcome.

zeljkofilipin renamed this task from MediaWiki core's node selenium tests flaky when run as part of mwext-mw-selenium-node-composer-jessie job to MediaWiki core's selenium tests flaky when run as part of mwext-mw-selenium-node-composer-jessie job.Apr 13 2018, 6:53 PM

It is probably not needed for this task, but it would be of great if I could run the tests locally. I don't know which mediawiki-vagrant role to enable to get WikibaseLexeme running. wikibase_repo, wikidata...?

sadly we didn't get to adding any vagrant role yet.
I believe what would work would be installing wikibase_repo with vagrant and than cloning the WikibaseLexeme git and enabling it with wfLoadExtension. This should be enough for setup at this point (no crazy stuff like in the case of "normal" Wikibase).

116 passed (425268/4), 118 failed (425268/5).

curl https://integration.wikimedia.org/ci/job/mwext-mw-selenium-node-composer-jessie/116/consoleText > 116

curl https://integration.wikimedia.org/ci/job/mwext-mw-selenium-node-composer-jessie/118/consoleText > 118

diff 116 118
...

I don't see any significant differences. :/

I've made https://gerrit.wikimedia.org/r/#/c/426857/ to get some videos using the great patch you're working on @zeljkofilipin.
https://integration.wikimedia.org/ci/job/mwext-mw-selenium-node-composer-jessie/138/artifact/log/User-should-be-able-to-change-preferences.mp4 is the recording of the failing test. Something seems off with logging in. If I see correctly, what happens is:

  1. User logs out
  2. User enters its credentials
  3. User is redirected to Main Page (is she was logged in?), but apparently is not logged in (e.g. Log in, instead of Log out link on the top of the pagE)
  4. User is redirected to Login Page when trying to access their user preferences.

So I believe something is not right between step 2 and 3.

The one thing that looks odd to me is this request:

Start request POST /jenkins-mwext-mw-selenium-node-composer-jessie-178/index.php/Special:UserLogin
HTTP HEADERS:
HOST: localhost:9412
CONNECTION: keep-alive
CONTENT-LENGTH: 198
CACHE-CONTROL: max-age=0
ORIGIN: http://localhost:9412
UPGRADE-INSECURE-REQUESTS: 1
USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
CONTENT-TYPE: application/x-www-form-urlencoded
ACCEPT: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
REFERER: http://localhost:9412/jenkins-mwext-mw-selenium-node-composer-jessie-178/index.php?title=Special:UserLogin
ACCEPT-ENCODING: gzip, deflate, br
ACCEPT-LANGUAGE: en-US,en;q=0.8
COOKIE: jenkins_u0_mw_session=aat4e7f32ah724timnuqcp59v65mgfsv
...
[authentication] Primary login with MediaWiki\Auth\LocalPasswordPrimaryAuthenticationProvider succeeded
[authentication] Login for WikiAdmin succeeded
[CryptRand] 0 bytes of randomness leftover in the buffer.
[session] SessionBackend "f8g9d0teu1t9k3t62kksknhlkmdpgvdo" metadata dirty due to ID reset (formerly "aat4e7f32ah724timnuqcp59v65mgfsv")
[session] SessionBackend "f8g9d0teu1t9k3t62kksknhlkmdpgvdo" data dirty due to dirty(): MediaWiki\Auth\AuthManager->continueAuthentication/MediaWiki\Auth\AuthManager->setSessionDataForUser/MediaWiki\Session\Session->resetAllTokens/MediaWiki\Session\Session->remove/MediaWiki\Session\SessionBackend->dirty
[session] SessionBackend "f8g9d0teu1t9k3t62kksknhlkmdpgvdo" metadata dirty due to user change
[session] SessionBackend "f8g9d0teu1t9k3t62kksknhlkmdpgvdo" data dirty due to dirty(): MediaWiki\Auth\AuthManager->beginAuthentication/MediaWiki\Auth\AuthManager->continueAuthentication/MediaWiki\Auth\AuthManager->setSessionDataForUser/MediaWiki\Session\Session->set/MediaWiki\Session\SessionBackend->dirty
[session] SessionBackend "f8g9d0teu1t9k3t62kksknhlkmdpgvdo" data dirty due to dirty(): MediaWiki\Auth\AuthManager->beginAuthentication/MediaWiki\Auth\AuthManager->continueAuthentication/MediaWiki\Auth\AuthManager->setSessionDataForUser/MediaWiki\Session\Session->set/MediaWiki\Session\SessionBackend->dirty
[session] SessionBackend "f8g9d0teu1t9k3t62kksknhlkmdpgvdo" save: dataDirty=1 metaDirty=1 forcePersist=0
[cookie] setcookie: "jenkins_u0_mw_session", "f8g9d0teu1t9k3t62kksknhlkmdpgvdo", "0", "/", "", "", "1"
[cookie] setcookie: "jenkins_u0_mwUserID", "1", "1526714909", "/", "", "", "1"
[cookie] setcookie: "jenkins_u0_mwUserName", "WikiAdmin", "1526714909", "/", "", "", "1"
[cookie] already deleted setcookie: "jenkins_u0_mwToken", "", "1492586909", "/", "", "", "1"
[cookie] already deleted setcookie: "forceHTTPS", "", "1492586909", "/", "", "", "1"
...
MediaWiki::preOutputCommit: primary transaction round committed
MediaWiki::preOutputCommit: pre-send deferred updates completed
[DBReplication] Wikimedia\Rdbms\ChronologyProtector::shutdownLB: DB '127.0.0.1:3306' touched

MediaWiki::preOutputCommit: LBFactory shutdown completed
[cookie] setcookie: "UseDC", "master", "1524122919", "/", "", "", "1"
[cookie] setcookie: "UseCDNCache", "false", "1524122919", "/", "", "", "1"
OutputPage::sendCacheControl: private caching;  **
[runJobs] refreshLinksPrioritized 0.6544586852278953-öäü-♠♣♥♦ rootJobTimestamp=20180419072820 useRecursiveLinksUpdate=1 triggeringUser={"userId":0,"userName":"127.0.0.1"} triggeringRevisionId=4 causeAction=edit-page causeAgent=127.0.0.1 requestId=11cd8aefe6a35966b4539bf5 (id=15,timestamp=20180419072820) STARTING
Title::getRestrictionTypes: applicable restrictions to [[0.6544586852278953-öäü-♠♣♥♦]] are {edit,move}
[ContentHandler] Created handler for wikitext: WikitextContentHandler
[caches] parser: SqlBagOStuff
Parser cache options found.
[session] Session "aat4e7f32ah724timnuqcp59v65mgfsv" requested without UserID cookie
[session] SessionBackend "aat4e7f32ah724timnuqcp59v65mgfsv" is unsaved, marking dirty in constructor
[CryptRand] 0 bytes of randomness leftover in the buffer.
[session] SessionBackend "eovsojkij5dg8fsg2o2iej46u7392jse" metadata dirty due to ID reset (formerly "aat4e7f32ah724timnuqcp59v65mgfsv")
[session] SessionBackend "eovsojkij5dg8fsg2o2iej46u7392jse" save: dataDirty=1 metaDirty=1 forcePersist=0
[cookie] setcookie: "jenkins_u0_mw_session", "eovsojkij5dg8fsg2o2iej46u7392jse", "0", "/", "", "", "1"
[cookie] setcookie: "jenkins_u0_mwUserID", "", "1492586909", "/", "", "", "1"
[cookie] already deleted setcookie: "jenkins_u0_mwToken", "", "1492586909", "/", "", "", "1"
[cookie] already deleted setcookie: "forceHTTPS", "", "1492586909", "/", "", "", "1"
ParserOutput cache found.
[DBQuery] RefreshLinksJob::runForTitle: committing on behalf of RefreshLinksJob::run.
BacklinkCache::queryLinks: got results from DB
[DBQuery] RefreshLinksJob::runForTitle: committing on behalf of RefreshLinksJob::run.
[runJobs] refreshLinksPrioritized 0.6544586852278953-öäü-♠♣♥♦ rootJobTimestamp=20180419072820 useRecursiveLinksUpdate=1 triggeringUser={"userId":0,"userName":"127.0.0.1"} triggeringRevisionId=4 causeAction=edit-page causeAgent=127.0.0.1 requestId=11cd8aefe6a35966b4539bf5 (id=15,timestamp=20180419072820) t=146 good
Request ended normally

It looks like the login changed the session, then post-send (except not really post because it apparently was able to re-set the cookies) it ran a job that wound up trying to load the pre-login session and stomped on the post-login session's cookies. In this case, that probably was ParserCache→ParserOptions→User::loadFromSession() trying to get the session.

Normally the "global" WebRequest (from $wgRequest or RequestContext) should have already had the updated SessionId cached, so something somewhere must have replaced it. I don't know what might have done so.

Change 434011 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[integration/config@master] Do not run composer in WikibaseLexeme's node selenium job

https://gerrit.wikimedia.org/r/434011

zeljkofilipin awarded a token.

Change 434917 had a related patch set uploaded (by Zfilipin; owner: Zfilipin):
[integration/config@master] Run qunit-selenium job for WikibaseLexeme in experimental pipeline

https://gerrit.wikimedia.org/r/434917

Change 434917 merged by jenkins-bot:
[integration/config@master] Run qunit-selenium job for WikibaseLexeme in experimental pipeline

https://gerrit.wikimedia.org/r/434917

mediawiki-core-qunit-selenium-jessie/20108 failed 🤔

00:08:05.665 [chrome #0-2] Spec: /home/jenkins/workspace/mediawiki-core-qunit-selenium-jessie/src/tests/selenium/specs/user.js
00:08:05.666 [chrome #0-2] Running: chrome
00:08:05.666 [chrome #0-2]
00:08:05.666 [chrome #0-2]   User
00:08:05.666 [chrome #0-2]
00:08:05.666 [chrome #0-2]   User
00:08:05.666 [chrome #0-2]       ✓ should be able to create account
00:08:05.667 [chrome #0-2]       1) should be able to log in
00:08:05.667 [chrome #0-2]       ✓ should be able to change preferences
00:08:05.668 [chrome #0-2]
00:08:05.668 [chrome #0-2]
00:08:05.668 [chrome #0-2] 2 passing (28s)
00:08:05.669 [chrome #0-2] 1 failing
00:08:05.669 [chrome #0-2]
00:08:05.669 [chrome #0-2] 1) User should be able to log in:
00:08:05.669 [chrome #0-2] An element could not be located on the page using the given search parameters ("#pt-userpage").
00:08:05.670 [chrome #0-2] Error: An element could not be located on the page using the given search parameters ("#pt-userpage").
00:08:05.671 [chrome #0-2]     at Promise.F (node_modules/core-js/library/modules/_export.js:36:28)
00:08:05.671 [chrome #0-2]     at getText() - at Context.<anonymous> (tests/selenium/specs/user.js:41:40)

meh, looks like the similar we've been getting with the job using composer.

zeljkofilipin added a comment.EditedMay 24 2018, 2:41 PM

mediawiki-core-qunit-selenium-jessie/20109 failed different test 😱

00:05:12.053 [chrome #0-1] Spec: /home/jenkins/workspace/mediawiki-core-qunit-selenium-jessie/src/tests/selenium/specs/page.js
00:05:12.054 [chrome #0-1] Running: chrome
00:05:12.054 [chrome #0-1]
00:05:12.054 [chrome #0-1]   Page
00:05:12.054 [chrome #0-1]
00:05:12.054 [chrome #0-1]   Page
00:05:12.054 [chrome #0-1]       ✓ should be creatable
00:05:12.055 [chrome #0-1]       ✓ should be re-creatable
00:05:12.055 [chrome #0-1]       ✓ should be editable
00:05:12.055 [chrome #0-1]       ✓ should have history
00:05:12.056 [chrome #0-1]       ✓ should be deletable
00:05:12.056 [chrome #0-1]       1) should be restorable
00:05:12.057 [chrome #0-1]
00:05:12.057 [chrome #0-1]
00:05:12.057 [chrome #0-1] 5 passing (32s)
00:05:12.057 [chrome #0-1] 1 failing
00:05:12.058 [chrome #0-1]
00:05:12.058 [chrome #0-1] 1) Page should be restorable:
00:05:12.058 [chrome #0-1] An element could not be located on the page using the given search parameters ("#wpComment").
00:05:12.059 [chrome #0-1] Error: An element could not be located on the page using the given search parameters ("#wpComment").
00:05:12.059 [chrome #0-1]     at Context.<anonymous> (tests/selenium/specs/page.js:118:15)
00:05:12.060 [chrome #0-1]     at Promise.F (node_modules/core-js/library/modules/_export.js:36:28)
00:05:12.060 [chrome #0-1]     at setValue("0.5591162731345116-öäü-♠♣♥♦-restorereason") - at RestorePage.restore (tests/selenium/pageobjects/restore.page.js:14:15)
00:05:12.061 [chrome #0-1]
00:05:12.061 
00:05:20.326 [14:36:26] [S] [MWBOT] Login successful: WikiAdmin@http://localhost:9412/jenkins-mediawiki-core-qunit-selenium-jessie-20109
00:05:25.952 [14:36:32] [S] [MWBOT] Login successful: WikiAdmin@http://localhost:9412/jenkins-mediawiki-core-qunit-selenium-jessie-20109
zeljkofilipin renamed this task from MediaWiki core's selenium tests flaky when run as part of mwext-mw-selenium-node-composer-jessie job to mediawiki-core-qunit-selenium-jessie sometimes fails for extensions.May 24 2018, 3:35 PM
zeljkofilipin renamed this task from mediawiki-core-qunit-selenium-jessie sometimes fails for extensions to MediaWiki core's selenium tests flaky when run as part of mwext-mw-selenium-node-composer-jessie job.
zeljkofilipin raised the priority of this task from Normal to High.
zeljkofilipin lowered the priority of this task from High to Normal.
zeljkofilipin edited projects, added Math; removed Browser-Tests.
hashar added a comment.Jun 7 2018, 9:49 AM

Looking at mediawiki-core-qunit-selenium-jessie #20109 triggered for WikibaseLexeme https://gerrit.wikimedia.org/r/#/c/434907/ :

A webdriver.io test from mediawiki/core fails when trying to restore a page. The test is supposed to be logged in as admin to do the action, however on the screenshot one can see the user is not logged in:

it( 'should be restorable', function () {
    UserLoginPage.loginAdmin();
    browser.call( function () {
        return Api.edit( name, content );
    } );
    browser.call( function () {
        return Api.delete( name, content + '-deletereason' );
    } );
    RestorePage.restore( name, content + '-restorereason' );

So that is done over the MediaWiki api hitting api.php and at some point it seems to loose the user session.

Special:Undelete is invoked with:
COOKIE: COOKIE: jenkins_u0_mwUserName=WikiAdmin;jenkins_u0_mw_session=9k2noab5k9e1k1ocn0ot87k1m620flkj

And it is then marked as dirty (whatever it means):

[session] SessionBackend "9k2noab5k9e1k1ocn0ot87k1m620flkj" metadata dirty due to ID reset (formerly "hqom17n9i7njja6n97i44vt2bo4ff844")
[session] SessionBackend "9k2noab5k9e1k1ocn0ot87k1m620flkj" save: dataDirty=1 metaDirty=1 forcePersist=0
[CentralAuthVerbose] Loading state for global user 127.0.0.1 from DB
[CentralAuthVerbose] Loading attached wiki list for global user 127.0.0.1 from DB
[CentralAuthVerbose] Loading groups for global user 127.0.0.1
[CentralAuthVerbose] Loading CentralAuthUser for user 127.0.0.1 from cache object
[cookie] setcookie: "jenkins_u0_mw_session", "9k2noab5k9e1k1ocn0ot87k1m620flkj", "0", "/", "", "", "1"
[cookie] setcookie: "jenkins_u0_mwUserID", "", "1495636574", "/", "", "", "1"
[cookie] already deleted setcookie: "jenkins_u0_mwToken", "", "1495636574", "/", "", "", "1"
[cookie] already deleted setcookie: "forceHTTPS", "", "1495636574", "/", "", "", "1"

I also noticed a backtrace when invoking session_start:

[DBPerformance] Expectation (writes <= 0) by MediaWiki::main not met (actual: 5):
query-m: REPLACE INTO `objectcache` (keyname,value,exptime) VALUES ('X')
#0 includes/libs/rdbms/TransactionProfiler.php(219): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated('writes', 'query-m: REPLAC...', 4)
#1 includes/libs/rdbms/database/Database.php(1252): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion('query-m: REPLAC...', 1527172580.0906, true, 1)
#2 includes/libs/rdbms/database/Database.php(1147): Wikimedia\Rdbms\Database->doProfiledQuery('REPLACE INTO `o...', 'REPLACE /* SqlB...', true, 'SqlBagOStuff::s...')
#3 includes/libs/rdbms/database/Database.php(2795): Wikimedia\Rdbms\Database->query('REPLACE INTO `o...', 'SqlBagOStuff::s...')
#4 includes/libs/rdbms/database/DatabaseMysqlBase.php(522): Wikimedia\Rdbms\Database->nativeReplace('`objectcache`', Array, 'SqlBagOStuff::s...')
#5 includes/objectcache/SqlBagOStuff.php(361): Wikimedia\Rdbms\DatabaseMysqlBase->replace('objectcache', Array, Array, 'SqlBagOStuff::s...')
#6 includes/objectcache/SqlBagOStuff.php(376): SqlBagOStuff->setMulti(Array, 1527176180)
#7 includes/libs/objectcache/CachedBagOStuff.php(65): SqlBagOStuff->set('jenkins_u0_mw:M...', Array, 1527176180, 1)
#8 includes/session/SessionBackend.php(738): CachedBagOStuff->set('jenkins_u0_mw:M...', Array, 1527176180, 1) 
#9 includes/session/SessionBackend.php(596): MediaWiki\Session\SessionBackend->save()
#10 [internal function]: MediaWiki\Session\SessionBackend->MediaWiki\Session\{closure}()
#11 vendor/wikimedia/scoped-callback/src/ScopedCallback.php(76): call_user_func_array(Object(Closure), Array)
#12 vendor/wikimedia/scoped-callback/src/ScopedCallback.php(56): Wikimedia\ScopedCallback->__destruct()
#13 includes/session/SessionManager.php(886): Wikimedia\ScopedCallback::consume(NULL)
#14 includes/session/SessionManager.php(214): MediaWiki\Session\SessionManager->getSessionFromInfo(Object(MediaWiki\Session\SessionInfo), Object(FauxRequest))
#15 includes/session/PHPSessionHandler.php(236): MediaWiki\Session\SessionManager->getSessionById('kg7g5klnphs6e5k...', false)
#16 [internal function]: MediaWiki\Session\PHPSessionHandler->read('kg7g5klnphs6e5k...')
#17 [internal function]: session_start()
#18 vendor/wikimedia/at-ease/src/Wikimedia/Functions.php(71): call_user_func_array('session_start', Array)
#19 includes/session/SessionBackend.php(767): Wikimedia\quietCall('session_start')
#20 includes/session/SessionBackend.php(690): MediaWiki\Session\SessionBackend->checkPHPSession()
#21 includes/session/SessionBackend.php(607): MediaWiki\Session\SessionBackend->save()
#22 includes/session/SessionBackend.php(291): MediaWiki\Session\SessionBackend->autosave()
#23 includes/session/Session.php(127): MediaWiki\Session\SessionBackend->persist()
#24 includes/specialpage/LoginSignupSpecialPage.php(231): MediaWiki\Session\Session->persist()
#25 includes/specialpage/SpecialPage.php(522): LoginSignupSpecialPage->execute(NULL)
#26 includes/specialpage/SpecialPageFactory.php(569): SpecialPage->run(NULL)
#27 includes/MediaWiki.php(288): SpecialPageFactory::executePath(Object(Title), Object(RequestContext))
#28 includes/MediaWiki.php(863): MediaWiki->performRequest()
#29 includes/MediaWiki.php(524): MediaWiki->main()
#30 index.php(42): MediaWiki->run()
#31 {main}

No clue what they are for.

I have completed fixes for Quibble to let it run Wikibase tests ( T196013 ). The test job ended up failing with the exact same issue: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php70-docker/5410/

Dummy change: https://gerrit.wikimedia.org/r/#/c/70395/

So I guess I can try reproducing the Quibble environment locally (with all ext dependencies added), then run repeatedly the should be restorable wdio test and see what happens.

hashar added a comment.EditedJun 7 2018, 1:33 PM

I have tried locally with ./node_modules/.bin/wdio --mochaOpts.grep 'should be re-creatable' but I could not reproduce the issue. That seems to be related to the session management.

I thought about some objects not being written to the objectcache:

[caches] ... session: SqlBagOStuff

Which potentially might be subject to a race condition when multiple web queries try to write to the same table.

Another I noticed in the mw-ratelimit.log:

2018-06-07 11:54:23 c209d5d7870a wikidb-unittest_: User 'UserForTestAttemptSaveRateLimit' (IP 127.0.0.1) tripped wikidb-unittest_:limiter:edit:user:36 at 1 (limit 1 in 60s)
2018-06-07 11:54:23 c209d5d7870a wikidb-unittest_: User 'UserForTestAttemptSaveRateLimit' (IP 127.0.0.1) tripped wikidb-unittest_:limiter:edit:user:36 at 1 (limit 1 in 60s)
2018-06-07 11:54:23 c209d5d7870a wikidb-unittest_: User 'UserForTestAttemptSaveRateLimit' (IP 127.0.0.1) tripped wikidb-unittest_:limiter:edit:user:36 at 2 (limit 1 in 60s)
2018-06-07 11:54:23 c209d5d7870a wikidb-unittest_: User 'UserForTestAttemptSaveRateLimit' (IP 127.0.0.1) tripped wikidb-unittest_:limiter:create:user:36 at 1 (limit 1 in 60s)

But they come from the PHPUnit test suite.

Anomie added a comment.Jun 7 2018, 1:54 PM

And it is then marked as dirty (whatever it means):

The code tries to only save when something actually changed, instead of saving every time the session goes out of scope. "Marked as dirty" happens when something changed.

I also noticed a backtrace when invoking session_start:

[DBPerformance] Expectation (writes <= 0) by MediaWiki::main not met (actual: 5):
query-m: REPLACE INTO `objectcache` (keyname,value,exptime) VALUES ('X')
#0 includes/libs/rdbms/TransactionProfiler.php(219): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated('writes', 'query-m: REPLAC...', 4)
#1 includes/libs/rdbms/database/Database.php(1252): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion('query-m: REPLAC...', 1527172580.0906, true, 1)
#2 includes/libs/rdbms/database/Database.php(1147): Wikimedia\Rdbms\Database->doProfiledQuery('REPLACE INTO `o...', 'REPLACE /* SqlB...', true, 'SqlBagOStuff::s...')
#3 includes/libs/rdbms/database/Database.php(2795): Wikimedia\Rdbms\Database->query('REPLACE INTO `o...', 'SqlBagOStuff::s...')
#4 includes/libs/rdbms/database/DatabaseMysqlBase.php(522): Wikimedia\Rdbms\Database->nativeReplace('`objectcache`', Array, 'SqlBagOStuff::s...')
#5 includes/objectcache/SqlBagOStuff.php(361): Wikimedia\Rdbms\DatabaseMysqlBase->replace('objectcache', Array, Array, 'SqlBagOStuff::s...')

That's T154424: TransactionProfiler should not apply to SqlBagOStuff.

I have tried locally with ./node_modules/.bin/wdio --mochaOpts.grep 'should be re-creatable' but I could not reproduce the issue. That seems to be related to the session management.

From T191537#4143301, it seems to me that the issue may involve a job run post-send (due to $wgJobRunRate) during the login request trying to load the wrong session for some unknown reason and stomping on the cookies. If the "wrong" job doesn't run during that particular request the problem wouldn't be reproduced.

hashar added a comment.EditedJun 8 2018, 12:57 PM

Thank you @Anomie

I hacked mediawiki core to present the webrequest id to all debug log ( Gerrit 438232 ). That might help sorting out the logs and tracking what is going on a per web request basis.

I then made a Wikibase change to depend on that patch. A Selenium test for Echo fails due to the user being logged out. The build (kept forever): https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-selenium-jessie/20305/ ( debug log, 1.81MBytes ).

The tests seems to fail rather constantly on two jobs:

mediawiki-core-qunit-selenium-jessiebuild pageNodepool / Jessie / PHP7
quibble-vendor-mysql-php70-dockerbuild pageDocker / PHP7

But quibble-vendor-mysql-hhvm-docker (build page) seems to always pass. That is essentially the same as the php70 docker container but using HHVM. Maybe it is slower and thus the race condition is not triggering the issue.

hashar added a comment.Jun 8 2018, 1:30 PM

From https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php70-docker/5475/artifact/log/mw-debug-www.log The lines come from the debug log and are prefixed with `\WebRequest::getRequestId().

A request to [Special:Preferences] made with a normal user:

9ff423f9588856afc173a0f3 Start request POST /index.php/Special:Preferences
COOKIE: wikidb_session=0js640qarj8b7c710hboopll3d99jo9l; wikidbUserID=4; wikidbUserName=User-0.8234154921968517; UseDC=master; UseCDNCache=false

The Echo selenium tests login as WikiAdmin and get wikidb_session=aj3k23n:

5fcb626ee9b60ab1fee4d5aa Start request POST /index.php/Special:UserLogin
COOKIE: wikidb_session=aj3k23nvas9kr06rc5tkvdjmfhhmcc3o
5fcb626ee9b60ab1fee4d5aa [session] SessionBackend "aj3k23nvas9kr06rc5tkvdjmfhhmcc3o" save: dataDirty=1 metaDirty=0 forcePersist=0
5fcb626ee9b60ab1fee4d5aa [authentication] Login for WikiAdmin succeeded
5fcb626ee9b60ab1fee4d5aa [session] SessionBackend "9pigdd14olusqg2e7l87t1im0hked385" save: dataDirty=1 metaDirty=1 forcePersist=0
5fcb626ee9b60ab1fee4d5aa [cookie] setcookie: "wikidb_session", "9pigdd14olusqg2e7l87t1im0hked385", "0", "/", "", "", "1"

After that the Special:Preferences request is not completed yet and run a delayed cirrusSearchLinksUpdatePrioritized job:

9ff423f9588856afc173a0f3 [runJobs] cirrusSearchLinksUpdatePrioritized 0.09196885017085155-öäü-♠♣♥♦
9ff423f9588856afc173a0f3 [session] Session "aj3k23nvas9kr06rc5tkvdjmfhhmcc3o" requested without UserID cookie
9ff423f9588856afc173a0f3 [session] SessionBackend "aj3k23nvas9kr06rc5tkvdjmfhhmcc3o" is unsaved, marking dirty in constructor
9ff423f9588856afc173a0f3 [session] SessionBackend "tj1mk3ilbvisjlh5hbblvrkhnpu7vq2m" metadata dirty due to ID reset (formerly "aj3k23nvas9kr06rc5tkvdjmfhhmcc3o")
9ff423f9588856afc173a0f3 [session] SessionBackend "tj1mk3ilbvisjlh5hbblvrkhnpu7vq2m" save: dataDirty=1 metaDirty=1 forcePersist=0
9ff423f9588856afc173a0f3 [cookie] setcookie: "wikidb_session", "tj1mk3ilbvisjlh5hbblvrkhnpu7vq2m", "0", "/", "", "", "1"

Note that:

  • the request was for a normal user but the browser meanwhile logged in as Wikiadmin
  • the session from UserLogin is marked as unsaved
  • a new cookie being set: wikidb_session=tj1mk3il.

The Echo test managed to login successfully and get redirected to the Main_Page. The browser then attempt to use the cookie tj1mk3 from Special:Preferences.

53d637019fc7ce5de29ea4e4 Start request GET /index.php/Main_Page
COOKIE: wikidbUserName=WikiAdmin; UseDC=master; UseCDNCache=false; wikidb_session=tj1mk3ilbvisjlh5hbblvrkhnpu7vq2m
53d637019fc7ce5de29ea4e4 [session] CentralAuthSessionProvider::provideSessionInfo: Not enabled, falling back to core sessions
53d637019fc7ce5de29ea4e4 [session] Session "tj1mk3ilbvisjlh5hbblvrkhnpu7vq2m" requested without UserID cookie

It is not logged in.

hashar added a comment.Jun 8 2018, 1:47 PM

For the HHVM job that passed ( https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/4849/ ), there is no other requests happening during the redirect:

$ egrep '(wikidb_session|COOKIE|Start request)' hhvm.log

5ab822f040f89e5522c663b8 Start request POST /index.php/Special:UserLogin
COOKIE: wikidb_session=r08f7bm627jjqngvvkugqjplcff1vh1i
5ab822f040f89e5522c663b8 [cookie] setcookie: "wikidb_session", "ro1ts3u1jq6muodvjtp6b102an34og87", "0", "/", "", "", "1"
5ab822f040f89e5522c663b8 [cookie] already set setcookie: "wikidb_session", "ro1ts3u1jq6muodvjtp6b102an34og87", "0", "/", "", "", "1"
5ab822f040f89e5522c663b8 [cookie] already set setcookie: "wikidb_session", "ro1ts3u1jq6muodvjtp6b102an34og87", "0", "/", "", "", "1"
5ab822f040f89e5522c663b8 [cookie] already set setcookie: "wikidb_session", "ro1ts3u1jq6muodvjtp6b102an34og87", "0", "/", "", "", "1"

a72bc18a9b72cbd823482cc5 Start request GET /index.php/Main_Page
COOKIE: wikidb_session=ro1ts3u1jq6muodvjtp6b102an34og87; wikidbUserID=1; wikidbUserName=WikiAdmin; UseDC=master; UseCDNCache=false
hashar added a comment.Jun 8 2018, 2:19 PM

So I think the POST /index.php/Special:UserLogin completes properly and then a job is processed. That happens to be a cirrusSearchLinksUpdatePrioritized that got enqueued by a previous request/test, it then mess up with the session on the server side and the user ends up logged out.

I can't even find any idea as to how to fix it :(

hashar added a comment.Jun 8 2018, 2:54 PM

Maybe I will try tonight to have all the jobs issued by a request to be run immediately. $wgJobRunRate = PHP_INT_MAX; would do it.

Though there are some sleep() calls in the MediaWiki jobrunner which might well delay the job run time :(

Anomie added a comment.EditedJun 8 2018, 4:29 PM

So I think the POST /index.php/Special:UserLogin completes properly and then a job is processed. That happens to be a cirrusSearchLinksUpdatePrioritized that got enqueued by a previous request/test, it then mess up with the session on the server side and the user ends up logged out.

That's what I think too. I'm pretty sure the actual saved session is valid, it's just the job's session cookie is overriding the session cookie from the actual login.

If PHP had a send_headers() function, the fix might be as simple as calling that near the start of MediaWiki::doPostOutputShutdown() or MediaWiki::triggerSyncJobs() so cookies from jobs can't override the real request. I suppose we could do mostly the same thing by adding a static flag to WebResponse disabling WebResponse::setCookie().

Or we might be able to fix the tests by setting $wgRunJobsAsync = true, although the underlying problem would still be there for anyone with that setting off.

hashar added a comment.Jun 8 2018, 4:54 PM

Yup, I think the flow goes like:

POST /index.php/Special:Preferences
user=User-0.8234
COOKIE: wikidb_session=0js640qarj8b7c710hboopll3d99jo9l

Several jobs get enqueued and one is left behind. That got triggered by a run of the mediawiki/core Selenium tests. The suite then close the browser entirely

In a new browser instance, the Echo suite is run and login as Wikiadmin:

POST /index.php/Special:UserLogin
user=Wikiadmin
COOKIE: wikidb_session=aj3k23nvas9kr06rc5tkvdjmfhhmcc3o

A delayed job from the mediawiki/core suite is run and triggers the setcookie for a previous and now invalid session.

The browser get the new (invalid) session, redirect to Main_Page and present the invalid session. Seems to be the jobs should never set a cookie to cover the case of them being run by a web request.

Anomie added a comment.Jun 8 2018, 5:09 PM

Seems to be the jobs should never set a cookie to cover the case of them being run by a web request.

Unfortunately it's probably something down the call stack that's defaulting to some property of the session user where the job itself doesn't actually care about that default. It probably wouldn't hurt to track those down and avoid it picking up those defaults, if possible, but that's probably not an insignificant amount of work.

Change 439289 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Disable WebResponse::setCookie() for post-send processing

https://gerrit.wikimedia.org/r/439289

From https://gerrit.wikimedia.org/r/c/mediawiki/core/+/439289

Wikibase was failing the Selenium tests which are run by the 'quibble' job at least for the php7. It might be faster and thus ended up hitting the race condition.
I have crafted a dummy change for Wikibase which depends on the mediawiki/core patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/439847 One build passed and a second is being run. On another Wikibase change the selenium test was failing constantly.
So I think it is a proper fix :]

Change 439289 merged by jenkins-bot:
[mediawiki/core@master] Disable WebResponse setters for post-send processing

https://gerrit.wikimedia.org/r/439289

hashar closed this task as Resolved.Jun 14 2018, 8:23 AM
hashar reassigned this task from zeljkofilipin to Anomie.

We have good reason to think that preventing jobs from sendinig cookie/headers is the definitive fix. https://gerrit.wikimedia.org/r/439289

I would like to publicly praise @Anomie who immediately identified the root cause on this task, stepped in to craft a patch and went the extra step to also address headers() being call in post-send jobs. Thank you Brad.

Change 445222 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/core@master] In development, run all jobs on each requests

https://gerrit.wikimedia.org/r/445222

Change 445222 abandoned by Hashar:
In development, run all jobs on each requests

Reason:
That was a dirty / quick win patch to test out the failures we had on PageTriage / recentchange job not running.

I definitely love the suggestion Timo made to have the test suite run the job / ensure all jobs get run ( T199446#4429139 ).

So I am abandoning this change in favor of https://gerrit.wikimedia.org/r/#/c/mediawiki/core/ /446543/ "selenium: invoke jobs to enforce eventual consistency"

https://gerrit.wikimedia.org/r/445222

Change 434011 abandoned by Zfilipin:
Do not run composer in WikibaseLexeme's node selenium job

Reason:
no activity in months

https://gerrit.wikimedia.org/r/434011