Page MenuHomePhabricator

MediaWiki core Selenium tests fail when targeting Vagrant
Closed, ResolvedPublic

Description

Video:

Steps to reproduce:

  • change real name
  • save
  • page refreshes
  • instead of displaying new real name, old real name is displayed
  • new real name is displayed after manually refreshing the page (391260)

Also reproducible by running Selenium tests:

$ npm run selenium

> @ selenium /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki
> killall -0 chromedriver 2>/dev/null || chromedriver --url-base=/wd/hub --port=4444 & grunt webdriver:test; killall chromedriver

Starting ChromeDriver 2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2) on port 4444
Only local connections are allowed.
Running "webdriver:test" (webdriver) task
pattern /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/extensions/VisualEditor/modules/ve-mw/tests/selenium/specs/**/*.js did not match any file
pattern /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/skins/*/tests/selenium/specs/**/*.js did not match any file
GVA info: Successfully connected to the Intel plugin, offline Gen7 

	Screenshot location: ./log/should-be-able-to-change-preferences.png 

------------------------------------------------------------------
[chrome #0-0] Session ID: 40972fa926e8334caf38ee9b5644e3cd
[chrome #0-0] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js
[chrome #0-0] Running: chrome
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]       โœ“ should be able to create account
[chrome #0-0]       โœ“ should be able to log in
[chrome #0-0]       1) should be able to change preferences
[chrome #0-0]
[chrome #0-0]
[chrome #0-0] 2 passing (34s)
[chrome #0-0] 1 failing
[chrome #0-0]
[chrome #0-0] 1) User should be able to change preferences:
[chrome #0-0] '' == '0.3776885056399568'
[chrome #0-0] AssertionError [ERR_ASSERTION]: '' == '0.3776885056399568'
[chrome #0-0]     at Context.<anonymous> (/Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js:65:10)
[chrome #0-0]     at new Promise (<anonymous>)
[chrome #0-0]

GVA info: Successfully connected to the Intel plugin, offline Gen7 
------------------------------------------------------------------
[chrome #0-1] Session ID: 66d31af199cf902d5a8cc5ed596e6a6c
[chrome #0-1] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/page.js
[chrome #0-1] Running: chrome
[chrome #0-1]
[chrome #0-1]   Page
[chrome #0-1]
[chrome #0-1]   Page
[chrome #0-1]       โœ“ should be creatable
[chrome #0-1]       โœ“ should be editable
[chrome #0-1]       โœ“ should have history
[chrome #0-1]
[chrome #0-1]
[chrome #0-1] 3 passing (20s)
[chrome #0-1]

Warning: Task "webdriver:test" failed. Use --force to continue.

Aborted due to warnings.
sh: line 1:  6238 Terminated: 15          chromedriver --url-base=/wd/hub --port=4444

should-be-able-to-change-preferences.png (1ร—1 px, 185 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. ยท View Herald Transcript

Change 391260 had a related patch set uploaded (by Zfilipin; owner: Zfilipin):
[mediawiki/core@master] WIP MediaWiki core Selenium tests fail when targeting Vagrant

https://gerrit.wikimedia.org/r/391260

To reproduce:

$ composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000
> ComposerHookHandler::onPreUpdate
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 0 installs, 1 update, 0 removals
Generating optimized autoload files
> ComposerVendorHtaccessCreator::onEvent
up to date in 3.973s
GVA info: Successfully connected to the Intel plugin, offline Gen7 

	Screenshot location: ./log/should-be-able-to-change-preferences.png 

------------------------------------------------------------------
[chrome #0-0] Session ID: 3def56a7e8a89a64b2ec4a7256dddecf
[chrome #0-0] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js
[chrome #0-0] Running: chrome
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]       1) should be able to change preferences
[chrome #0-0]
[chrome #0-0]
[chrome #0-0] 1 failing (22s)
[chrome #0-0]
[chrome #0-0] 1) User should be able to change preferences:
[chrome #0-0] '' == '0.7742726235072035'
[chrome #0-0] AssertionError [ERR_ASSERTION]: '' == '0.7742726235072035'
[chrome #0-0]     at Context.<anonymous> (/Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js:65:10)
[chrome #0-0]     at new Promise (<anonymous>)
[chrome #0-0]

Last known good commit (d1439a3e6746):

$ git checkout d1439a3e6746

$ composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000
> ComposerHookHandler::onPreUpdate
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 0 installs, 6 updates, 0 removals
  - Updating oojs/oojs-ui (v0.24.2 => v0.23.2): Loading from cache
  - Updating squizlabs/php_codesniffer (3.1.1 => 3.0.2): Loading from cache
  - Updating mediawiki/mediawiki-codesniffer (v14.1.0 => v13.0.0): Loading from cache
  - Updating phpunit/phpunit (4.8.36 => 4.8.35): Loading from cache
  - Updating psy/psysh (v0.8.11 => v0.8.5): Loading from cache
Writing lock file
Generating optimized autoload files
added 104 packages, removed 51 packages, updated 38 packages and moved 14 packages in 12.392s
GVA info: Successfully connected to the Intel plugin, offline Gen7 
------------------------------------------------------------------
[chrome #0-0] Session ID: b9000694a368f70241605be5b71b936e
[chrome #0-0] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js
[chrome #0-0] Running: chrome
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]       โœ“ should be able to change preferences
[chrome #0-0]
[chrome #0-0]
[chrome #0-0] 1 passing (42s)
[chrome #0-0]

Looks like 6b2f13b055c8 (385127) introduced the problem. (Thanks @hashar for pairing on this.)

$ git bisect start
$ git bisect bad master
$ git bisect good d1439a3e6746
Bisecting: 394 revisions left to test after this (roughly 9 steps)
[84890fea50524562350d3038a046ce1b92e4f9c4] Merge "Refactor some Installer code into ExecutableFinder"
$ git bisect run bash -c 'composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000'
...
6b2f13b055c84d7672652f54deee096762c525cb is the first bad commit
commit 6b2f13b055c84d7672652f54deee096762c525cb
Author: Aaron Schulz <aschulz@wikimedia.org>
Date:   Wed Oct 18 21:03:26 2017 -0700

    objectcache: Always use interim values on WAN cache tombstones
    
    This stores values for very short times while the main
    value is a tombstone in case of particularly high traffic.
    
    Also make mutex keys expire immediately on unlock.
    
    Change-Id: I4ec5cf7f8b49239fdd2518e5d955534877a0f7ee

:040000 040000 6bfe7c2054f98c857e22d26624236876dbb2d26f b4cc5d156e5894ac6c6b2dde246b4912301d9f1d M	includes
:040000 040000 ee08962c990d3dd3204b2381b15f0c4c3f1375b4 e1759b6ab70639bcea194f38370249645b9b04b9 M	tests
bisect run success

Change 391260 abandoned by Zfilipin:
Fix for Selenium test failure when targeting Vagrant

https://gerrit.wikimedia.org/r/391260

31f87d19e56a is just before the problem was introduced.

$ git checkout 31f87d19e56a 
Note: checking out '31f87d19e56a'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 31f87d19e5... objectcache: Add more hotTTR comments to WANObjectCache
$ composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000
> ComposerHookHandler::onPreUpdate
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 0 installs, 3 updates, 0 removals
  - Updating oojs/oojs-ui (v0.24.2 => v0.24.0): Loading from cache
  - Updating phpunit/phpunit (4.8.36 => 4.8.35): Loading from cache
Writing lock file
Generating optimized autoload files
added 104 packages, removed 51 packages, updated 38 packages and moved 14 packages in 10.624s
GVA info: Successfully connected to the Intel plugin, offline Gen7 
------------------------------------------------------------------
[chrome #0-0] Session ID: 66df3f8a064ce51da1df5b222d3acfbd
[chrome #0-0] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js
[chrome #0-0] Running: chrome
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]       โœ“ should be able to change preferences
[chrome #0-0]
[chrome #0-0]
[chrome #0-0] 1 passing (38s)
[chrome #0-0]
$ git checkout 6b2f13b055c8
Previous HEAD position was 31f87d19e5... objectcache: Add more hotTTR comments to WANObjectCache
HEAD is now at 6b2f13b055... objectcache: Always use interim values on WAN cache tombstones
$ composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000
> ComposerHookHandler::onPreUpdate
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 0 installs, 3 updates, 0 removals
  - Updating squizlabs/php_codesniffer (3.1.1 => 3.0.2): Loading from cache
  - Updating mediawiki/mediawiki-codesniffer (v14.1.0 => v13.0.0): Loading from cache
Writing lock file
Generating optimized autoload files
up to date in 4.021s
GVA info: Successfully connected to the Intel plugin, offline Gen7 

	Screenshot location: ./log/should-be-able-to-change-preferences.png 

------------------------------------------------------------------
[chrome #0-0] Session ID: 04fda2be88928de500780d8ab71fb7df
[chrome #0-0] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js
[chrome #0-0] Running: chrome
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]       1) should be able to change preferences
[chrome #0-0]
[chrome #0-0]
[chrome #0-0] 1 failing (20s)
[chrome #0-0]
[chrome #0-0] 1) User should be able to change preferences:
[chrome #0-0] '' == '0.5402323026172136'
[chrome #0-0] AssertionError [ERR_ASSERTION]: '' == '0.5402323026172136'
[chrome #0-0]     at Context.<anonymous> (/Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js:65:10)
[chrome #0-0]     at new Promise (<anonymous>)
[chrome #0-0]

When saving the user preferences, the return page is generated with a cached and outdated User. I haven't dig into it beside hinting at git bisecting it.

Some thoughts:

  • the CI job mediawiki-core-qunit-selenium-jessie did not catch it on https://gerrit.wikimedia.org/r/#/c/385127/
  • locally with $wgMainCacheType = CACHE_NONE; I can not reproduce
  • Zeljko reproduces it 100% when using MediaWiki Vagrant

This looks like an integration issue with ChronologyProtector vs WANObjectCache.

Within a request, during and even after the master DB transaction, the web request will not see the DB changes it made (committed or not) unless it uses the DB_MASTER. Cached values always use DB_REPLICA, so they remain stale during the request. This is fine, as calling code is expected to stick to user DB_MASTER loaded data when using reads to determine writes or show things to the user on POST requests that do *not* send redirects in response. This didn't change with 6b2f13b055c8.

Via ChronologyProtector, the next HTTP requests by the user after their last one that changed stuff go through the trouble of waiting for the replica DB to catch up to where the master was right after those changes were committed. MediaWiki therefor sees the changes from that prior request in the next one. Previously, WAN cache totally disallowed* any caching for tombstoned (recently deleted) keys for HOLDOFF_TTL seconds. That meant that the keys the user causes to purge (due to DB writes) in the prior request always ignored the cache and used the replica DB itself.

However, after 6b2f13b055c8, for cached values, they are still stale due to interim keys on the next request since the up-to-date (via the above logic) replica DB is not used if there is a tombstone and interim key.

So I see what is likely happening here:

  • HTTP POST: real name updated in DB, cache purged, getWithSetCallback() for the user is called again and a stale DB_REPLICA value is cache, redirect response header sent
  • HTTP GET from redirect: stale value is seen from cache
  • Some callers use "lockTSE" to cache anyway, but that is not relevant here

The simple thing is to not set INTERIM keys in the same request that purged them. The duration of that rule would be HOLDOF_TTL so that the array holding the purged keys doesn't get too big for long-running maintenance scripts. This can be done with a HashBagOStuff nested in the WAN cache object easily enough.

That would leave open race conditions though, where it's *another* request by another user that saves back stale values.

An even simpler, more correct, way would be to disable use of INTERIM keys on any request that still has to wait for DB replicas to catch up. ChronologyProtector only keeps master positions around for a user for POSITION_TTL seconds since the last write by that user (it assumes that lag is gone by then). After that, there is no replication lag waiting anymore. The INTERIM key TTL is 1 second, so it is unlikely that a stale value would blip back for that user and then go away around this time. That said, such dependencies should be visible in the code somehow.

The question is how to do that in a way that is not super specific for being in WAN cache, in terms of separation of concern.

Change 393983 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make ChronologyProtector actually use cpPosTime cookies

https://gerrit.wikimedia.org/r/393983

I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.

Change 393983 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make ChronologyProtector actually use cpPosTime cookies

https://gerrit.wikimedia.org/r/393983

Change 393991 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make WANObjectCache interim caching not interfere with ChronologyProtector

https://gerrit.wikimedia.org/r/393991

I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.

Change 393983 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make ChronologyProtector actually use cpPosTime cookies

https://gerrit.wikimedia.org/r/393983

I've untagged this task from that commit, since that really only matters in multi-DC or replicated "main stash" stores, which does not apply to us atm.

Change 393991 merged by jenkins-bot:
[mediawiki/core@master] objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector

https://gerrit.wikimedia.org/r/393991

it's in our respective done columns, so is it resolved? :) :) Things left to do?

@greg the only thing I did was report the problem. It's hard for me to tell if it's resolved. As far as I am concerned, it can be removed from Release-Engineering-Team (Kanban).

Does this still occur?

Good point. ๐Ÿ˜ Tests are fixed.

This is happening again. I can reproduce it when targeting mediawiki-vagrant manually or by running selenium tests.

To reproduce:

  • Start mediawiki-vagrant.
  • In one one terminal tab/window:
chromedriver --url-base=/wd/hub --port=4444
  • In another terminal tab/window:
composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000
git bisect start

git bisect bad 38ab0b0c4cb041d9c4ff0ca4b62656d2768df79a

git bisect good e90eafdf6107d4e49c9c20f6310d1d6ee5f5c47f
Bisecting: 379 revisions left to test after this (roughly 9 steps)
[2c16290785cb2b2acee9ae8f4af64fb2b943b07e] Add @covers tags to page tests

...

git bisect run bash -c 'composer update && npm i && npx wdio tests/selenium/wdio.conf.js --spec tests/selenium/specs/user.js --mochaOpts.grep preferences --mochaOpts.timeout 50000'

...

[chrome #0-0] Session ID: 722f501e981cd006b2e667110bf5179b
[chrome #0-0] Spec: /Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js
[chrome #0-0] Running: chrome
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]
[chrome #0-0]   User
[chrome #0-0]       1) should be able to change preferences
[chrome #0-0]
[chrome #0-0]
[chrome #0-0] 1 failing (11s)
[chrome #0-0]
[chrome #0-0] 1) User should be able to change preferences:
[chrome #0-0] '' == '0.47688065784075495'
[chrome #0-0] AssertionError [ERR_ASSERTION]: '' == '0.47688065784075495'
[chrome #0-0]     at Context.<anonymous> (/Users/z/Documents/gerrit/mediawiki/vagrant/mediawiki/tests/selenium/specs/user.js:65:10)
[chrome #0-0]     at new Promise (<anonymous>)
[chrome #0-0]

d5aa846d84e33967f086ef90a532243823b7b4a3 is the first bad commit
commit d5aa846d84e33967f086ef90a532243823b7b4a3
Author: Aaron Schulz <aschulz@wikimedia.org>
Date:   Fri Dec 8 13:36:25 2017 -0800

    Replace cpPosTime cookie/parameter with cpPosIndex
    
    This handles multi-DB transactions properly, instead of causing wait
    timeouts in the WaitConditionLoop. It also is more correct in using
    a counter instead of relying on wall clocks.
    
    In addition:
    * Refactor related code in MediaWiki.php to be comprehensible.
    * Always send the cookie even the "remote wiki redirect" case.
    * Renamed ChronologyProtector field and constant to avoid any
      confusion of "wait for server X to reach Y" with "wait for Y
      to show up in position store".
    * Add an "asOfTime" field to the position keys for debugging.
    
    Bug: T182322
    Change-Id: I5c73cd07eaf664f02ba00c38fab9f49b609f4284

:040000 040000 bde198cb44c7689101741f67a1cb72139ed066fd 016e4c7600dbd00d1c73e3cbc7b95a1df4438468 M	includes
:040000 040000 bd1ea0ba5a2e2e6bcad004606f59ffdda2035f0e a2b561d689e9137efb6db2b56f23a6ce5a514368 M	tests
bisect run success

@zeljkofilipin given the cause is different, we should file separate bugs for these failures. Otherwise we would have long living "Some test is failing" tasks that are always re-opened. Also, re-opened tasks can be lost in the noise. Looks like it's still the same team/person though, at least ;)