Page MenuHomePhabricator

Jenkins Browser tests for Wikibase/Popups etc are failing: Invalid CSRF token in Selenium browser
Closed, ResolvedPublic

Description

Browser tests in job mwext-mw-selenium-composer-jessie are failing for (probably) unrelated reason "Invalid CSRF token" in https://gerrit.wikimedia.org/r/#/c/342808/ (and others).
Failed to reproduce on local machine.

Given I am logged in                # mediawiki_selenium-1.7.3/lib/mediawiki_selenium/step_definitions/login_steps.rb:1
Invalid CSRF token. (badtoken) (MediawikiApi::ApiError)

Event Timeline

Discussion in wikidata IRC channel:

[12:58:32] <Aleksey_WMDE> hashar: Hi! We currntly have browser tests failing for unknown reason. Do you have any clue why? Here is the log https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer-jessie/2083/consoleFull
[13:00:54] <Aleksey_WMDE> hashar: I mean the reason is "Invalid CSRF token" but we have no Idea why it happens
[13:07:40] <hashar> Aleksey_WMDE: that is the MediaWiki API complaining that the given token is invalid
[13:07:57] <hashar> and really I have no clue what could cause that
[13:08:43] <hashar> supposedly mediawiki_selenium is supposed to retrieve a token with api.php?action=query&format=json&meta=tokens&type=csrf
[13:09:57] <hashar> the www debug log has some clue about that https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer-jessie/2083/artifact/log/mw-debug-www.log
[13:10:46] <hashar> either a regression in mediawiki/core or a new requirement
[13:11:47] <hashar> and https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer-jessie/ shows the last working build (yesterday around 19:00 )
[13:11:56] <hashar> vs the first failing this morning
[13:12:04] <hashar> so that is potentially a large chunk of changes
[13:14:16] <hashar> Aleksey_WMDE: pretty sure that is "Default to CACHE_ACCEL if user skips advanced settings" 1fec847c6b366ab21bb215b94837a186545096f4 https://gerrit.wikimedia.org/r/#/c/342793/
[13:14:28] <hashar> in the failing jobs we got: session: EmptyBagOStuff
[13:14:40] <hashar> so the session/token is discard between requests I guess

Zppix renamed this task from Jenkins fails to run browser tests for Wikibase to Jenkins Browser tests for Wikibase are failing.Mar 15 2017, 1:27 PM
hashar added a subscriber: Bawolff.

Specially the debug log of HTTP requests seems to indicate the session cache is EmptyBagOStuff:

[caches] cluster: EmptyBagOStuff, WAN: mediawiki-main-default, stash: db-replicated, message: EmptyBagOStuff, parser: EmptyBagOStuff, session: EmptyBagOStuff

Eg when doing /api.php?action=query&format=json&meta=tokens&type=csrf

I guess the session ends up dismissed between requests and the token is no more valid on the second request.

Might be related to https://gerrit.wikimedia.org/r/#/c/342793/ 1fec847c6b366ab21bb215b94837a186545096f4 ?

I went on the Jenkins master that hold the debug log archives ( contint1001 under /srv/jenkins/builds). I looked at another job that has a MediaWiki exposed over Apache and being hit by http requests: mediawiki-core-qunit-jessie

https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-jessie/13802/ started at 05:55:07 UTC
[caches] cluster: EmptyBagOStuff, WAN: mediawiki-main-default, stash: db-replicated, message: SqlBagOStuff, parser: SqlBagOStuff, session: SqlBagOStuff

https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-jessie/13803/ started at 06:40:11 UTC
[caches] cluster: EmptyBagOStuff, WAN: mediawiki-main-default, stash: db-replicated, message: EmptyBagOStuff, parser: EmptyBagOStuff, session: EmptyBagOStuff

The jobs run on ci-jessie-wikimeda images which are only updated once per day (at 14:14UTC). So the underlying infrastructure has not changed to the best of my knowledge.

That mediawiki-core-qunit-jessie only depends on mediawiki/core and mediawiki/vendor. So I am pretty confident https://gerrit.wikimedia.org/r/#/c/342793/ is the cause.

One way to test that on CI:

(replace I1234590909090909 with the Change-Id of the revert commit). That will test the patch against mediawiki/core + the revert commit.

Specially the debug log of HTTP requests seems to indicate the session cache is EmptyBagOStuff:

[caches] cluster: EmptyBagOStuff, WAN: mediawiki-main-default, stash: db-replicated, message: EmptyBagOStuff, parser: EmptyBagOStuff, session: EmptyBagOStuff

Eg when doing /api.php?action=query&format=json&meta=tokens&type=csrf

I guess the session ends up dismissed between requests and the token is no more valid on the second request.

Might be related to https://gerrit.wikimedia.org/r/#/c/342793/ 1fec847c6b366ab21bb215b94837a186545096f4 ?

What is $wgMainCacheType and $wgSessionCacheType actually set to?

Maybe some issue with how CACHE_ANYTHING is handled (?)

I guess the lamest way is to craft a patch that output the value via wfDebug() (captured by Jenkins as mwdebug-www.log iirc) or a plain old var_dump() which should show up in the console log.

// sleep time sorry

Looking at the code more closely, what might be happening.

Installer detects APC is present, chooses CACHE_ACCEL. APC really isn't present, CACHE_ACCEL becomes CACHE_NONE. Then when resolving $wgSessionCacheType CACHE_ANYTHING checks $wgMainCacheType which is CACHE_ACCEL, so chooses that, even though its really CACHE_NONE. (No idea why installer might think that CACHE_ACCEL is available when it really isn't)

Specially the debug log of HTTP requests seems to indicate the session cache is EmptyBagOStuff:

[caches] cluster: EmptyBagOStuff, WAN: mediawiki-main-default, stash: db-replicated, message: EmptyBagOStuff, parser: EmptyBagOStuff, session: EmptyBagOStuff

Eg when doing /api.php?action=query&format=json&meta=tokens&type=csrf

I guess the session ends up dismissed between requests and the token is no more valid on the second request.

Might be related to https://gerrit.wikimedia.org/r/#/c/342793/ 1fec847c6b366ab21bb215b94837a186545096f4 ?

What is $wgMainCacheType and $wgSessionCacheType actually set to?

Maybe some issue with how CACHE_ANYTHING is handled (?)

I assume $wgMainCacheType is CACHE_ACCEL or CACHE_NONE (from the installer-generated LocalSettings), and $wgSessionCacheType CACHE_ANYTHING (per the default setting).

Per https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer-jessie/2100/consoleFull#console-section-0 it seems APC is indeed not being detected.

19:56:45 PHP 5.6.30-0+deb8u1 is installed.
19:56:45 Warning: Could not find APCu, XCache or WinCache.
19:56:45 Object caching is not enabled.

So $wgMainCacheType remains CACHE_NONE, just like before https://gerrit.wikimedia.org/r/#/c/342793/ changed the installer.

However I don't know how that can end up with sesssion cache effectively being "none" (EmptyBagOStuff).

CACHE_ANYTHING just looks at the various custom cache groups (main, messages, parser, etc.) for one that is not "none" or "anything", and uses it. With fallback to CACHE_DB.

It can't end up as NONE...

So the console output does say: https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer-jessie/2083/consoleFull

Warning: Could not find APCu, XCache or WinCache.
11:18:58 Object caching is not enabled.

So the installer should be selecting CACHE_NONE. If its selecting CACHE_ACCEL, we could see this behaviour as CACHE_ANYTHING doesn't handle that case properly.

This may help figuring out T147161: Installer suggests $wgMainCacheType = CACHE_ACCEL, potentially breaking sessions, in case it's related, that is: installer detects CACHE_ACCEL when it's not available

Change 342959 had a related patch set uploaded (by Brian Wolff):
[mediawiki/core] Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/342959

So the console output does say: https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer-jessie/2083/consoleFull

Warning: Could not find APCu, XCache or WinCache.
11:18:58 Object caching is not enabled.

So the installer should be selecting CACHE_NONE. If its selecting CACHE_ACCEL, we could see this behaviour as CACHE_ANYTHING doesn't handle that case properly.

I hacked the job config to also store LocalSettings.php as a build artefact and re-ran from https://gerrit.wikimedia.org/r/342808:

## Shared memory settings
$wgMainCacheType = CACHE_ACCEL;

So it seems the installer ends up configuring CACHE_ACCEL for wgMainCacheType but APC isn't actually available. From there, SessionCacheType (CACHE_ANYTHING) selects the same store. But when at run time we construct it (ServiceWiring.php for LocalServerObjectCache) it'll end up selecting the CACHE_NONE fallback.

Change 342959 had a related patch set uploaded (by Brian Wolff):
[mediawiki/core] Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/342959

This is just the cache_anything part, still need to figure out why CACHE_ACCEL is being selected in the first place.

Change 342966 had a related patch set uploaded (by Brian Wolff):
[mediawiki/core] Properly detect if CACHE_ACCEL is available in the installer

https://gerrit.wikimedia.org/r/342966

Change 342885 had a related patch set uploaded (by Nikerabbit; owner: Aleksey Bekh-Ivanov (WMDE)):
[mediawiki/core] Revert "Default to CACHE_ACCEL if user skips advanced settings"

https://gerrit.wikimedia.org/r/342885

I eventually approved the revert patch to unblock development hitting that nasty corner case.

hashar renamed this task from Jenkins Browser tests for Wikibase are failing to Jenkins Browser tests for Wikibase/Popups etc are failing: Invalid CSRF token in Selenium browser.Mar 16 2017, 7:55 AM
hashar claimed this task.
hashar updated the task description. (Show Details)

Change 342885 merged by jenkins-bot:
[mediawiki/core] Revert "Default to CACHE_ACCEL if user skips advanced settings"

https://gerrit.wikimedia.org/r/342885

EddieGP triaged this task as High priority.Mar 16 2017, 4:15 PM

https://gerrit.wikimedia.org/r/#/c/342864/ is still failing, sadly (unless that's yet another issue, but I believed it was this one?)

https://gerrit.wikimedia.org/r/#/c/342864/ is still failing, sadly (unless that's yet another issue, but I believed it was this one?)

Seems unrelated. This task is about mwext-mw-selenium-composer-jessie consistently failing for Wikibase projects. The test does web browser interaction and its tokens are never valid because the session storage in MediaWiki wasn't configured.

Since https://gerrit.wikimedia.org/r/342885 reverted the bug in core, Wikibase is passing again (original patch was unblocked and merged https://gerrit.wikimedia.org/r/342808).

https://gerrit.wikimedia.org/r/#/c/342864/ fails because of a PHPUnit test failure (mediawiki-extensions-hhvm-jessie). Some string produced for a Wikibase special page isn't matching as expected. However this commit only triggers the shared MediaWiki pipeline, which seems to be passing on most other gerrit changes so probably not related.

Thanks all! We're merging again in reading land so at last that's some kind of progress :)

Change 342959 merged by jenkins-bot:
[mediawiki/core@master] objectcache: Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/342959

Change 345959 had a related patch set uploaded (by Martineznovo; owner: Brian Wolff):
[mediawiki/core@REL1_28] objectcache: Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/345959

Change 345960 had a related patch set uploaded (by Martineznovo; owner: Brian Wolff):
[mediawiki/core@REL1_27] objectcache: Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/345960

Change 345959 merged by jenkins-bot:
[mediawiki/core@REL1_28] objectcache: Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/345959

Change 345960 merged by jenkins-bot:
[mediawiki/core@REL1_27] objectcache: Never use CACHE_NONE for CACHE_ANYTHING

https://gerrit.wikimedia.org/r/345960

Should this be backported to REL1_23 too? The code is pretty much the same...

MediaWiki 1.27 onward doesn't use PHP builtin session storage, and having session cache set to CACHE_NONE breaks session silently (well, you know they're broken but it's very difficult to know why)

MediaWiki 1.23 doesn't have this problem with sessions.

I've tried setting $wgMainCacheType = CACHE_ACCEL; on 1.23 and I got a blank page with this error when trying to load any page, which is pretty verbose :)

[7573ec1d] / Exception from line 133 of /home/www/lib/mediawiki-1.23.11/includes/objectcache/ObjectCache.php: CACHE_ACCEL requested but no suitable object cache is present. You may want to install APC.

Backtrace:

#0 [internal function]: ObjectCache::newAccelerator(array)
#1 /home/www/lib/mediawiki-1.23.11/includes/objectcache/ObjectCache.php(85): call_user_func(string, array)
#2 /home/www/lib/mediawiki-1.23.11/includes/objectcache/ObjectCache.php(72): ObjectCache::newFromParams(array)
#3 /home/www/lib/mediawiki-1.23.11/includes/objectcache/ObjectCache.php(44): ObjectCache::newFromId(integer)
#4 /home/www/lib/mediawiki-1.23.11/includes/GlobalFunctions.php(4018): ObjectCache::getInstance(integer)
#5 /home/www/lib/mediawiki-1.23.11/includes/Setup.php(519): wfGetMainCache()
#6 /home/www/lib/mediawiki-1.23.11/includes/WebStart.php(159): require_once(string)
#7 /home/www/lib/mediawiki-1.23.11/index.php(43): require(string)
#8 {main}

Change 342966 merged by jenkins-bot:
[mediawiki/core@master] Properly detect if CACHE_ACCEL is available in the installer

https://gerrit.wikimedia.org/r/342966

Change 352270 had a related patch set uploaded (by Hashar; owner: Brian Wolff):
[mediawiki/core@REL1_29] Properly detect if CACHE_ACCEL is available in the installer

https://gerrit.wikimedia.org/r/352270

Change 352270 merged by jenkins-bot:
[mediawiki/core@REL1_29] Properly detect if CACHE_ACCEL is available in the installer

https://gerrit.wikimedia.org/r/352270