Page MenuHomePhabricator

Fix CheckUser selenium tests so they can run concurrently
Closed, ResolvedPublic

Description

The CheckUser selenium tests are execute with each gate wmf-quibble-selenium job. Due to their current setup with 3 different specs they have the potential to reduce the job's runtime when run in parallel. I poke that first with

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/934333

but it it seems that checkuser.js spec has some issues with the concurrent setup and fails at the first execution with the other specs running in parallel.

15:41:57 [0-0] AssertionError in "CheckUser.With CheckUser user group.Verify checkuser can make checks:.Should be able to run 'Get IPs' check"
15:41:57 AssertionError [ERR_ASSERTION]: false == true
15:41:57     at Context.<anonymous> (/workspace/src/extensions/CheckUser/tests/selenium/specs/checkuser.js:56:5)
15:41:57     at runMicrotasks (<anonymous>)
15:41:57     at processTicksAndRejections (internal/process/task_queues.js:95:5)
15:42:03 [0-0] Error in "CheckUser.With CheckUser user group.Verify checkuser can make checks:.Should be able to run 'Get edits' check"
15:42:03 Error: Can't call click on element with selector "#checkuserradios input[value="subedits"]" because element wasn't found
15:42:03     at implicitWait (/workspace/src/extensions/CheckUser/node_modules/webdriverio/build/utils/implicitWait.js:34:19)
15:42:03     at async Element.elementErrorHandlerCallbackFn (/workspace/src/extensions/CheckUser/node_modules/webdriverio/build/middlewares.js:20:29)
15:42:03     at async Element.wrapCommandFn (/workspace/src/extensions/CheckUser/node_modules/@wdio/utils/build/shim.js:131:29)
15:42:03     at async Context.<anonymous> (/workspace/src/extensions/CheckUser/tests/selenium/specs/checkuser.js:63:5)
15:42:08 [0-0] Error in "CheckUser.With CheckUser user group.Verify checkuser can make checks:.Should be able to run 'Get users' check"
15:42:08 Error: Can't call click on element with selector "#checkuserradios input[value="subipusers"]" because element wasn't found
15:42:08     at implicitWait (/workspace/src/extensions/CheckUser/node_modules/webdriverio/build/utils/implicitWait.js:34:19)
15:42:08     at async Element.elementErrorHandlerCallbackFn (/workspace/src/extensions/CheckUser/node_modules/webdriverio/build/middlewares.js:20:29)
15:42:08     at async Element.wrapCommandFn (/workspace/src/extensions/CheckUser/node_modules/@wdio/utils/build/shim.js:131:29)
15:42:08     at async Context.<anonymous> (/workspace/src/extensions/CheckUser/tests/selenium/specs/checkuser.js:76:5)

Because we currently re-run failing specs at least once, the whole test job still succeeds most of the times. But there's still an issue that leads to unstable behavior.

Event Timeline

This is likely because the CheckUser each testing spec does the following:

  • Tests that the Special page cannot be accessed when using an account that does not have the checkuser group
  • Grants the checkuser group from the user
  • Tests that the Special page works as expected
  • Removes the checkuser group from the user

Some thoughts as to why this is inconsistent:

  • By running the specs in parallel, one may finish early and therefore remove the "checkuser" group before the others have had a chance to complete tests that rely on the user having said group. As such they fail because the user group was removed.
  • If the tests running in parallel all start and complete at the same time, they will succeed as no test had the wrong configuration relating to the user group assignment.
  • When the tests fail and are re-run, the code which grants the "checkuser" group is run again which fixes the issue behind why the tests failed.
  • If multiple specs have failing tests, this process repeats and could collide in the same way again. If this happens too many times, the tests fail outright.

Solutions for this are as follows:
Solution 1:

  1. Re-group the tests into two specs containg tests:
    1. Tests that rely on the current user having the "checkuser" group
    2. Tests that rely on the current user not having the "checkuser" group
  2. Do not run the specs concurrently, but if not already done make each test run concurrently with all other tests in the spec

Solution 2:

  1. Re-group the tests into six specs where each spec is split by separating tests that rely on the "checkuser" group being granted and those which do not.
  2. Allow specs that rely on the same state of the user groups assigned to run concurrently.

Solution 3:

  1. Create a new test user that is granted the "checkuser" rights
  2. Use this user to run the tests that rely on the "checkuser" group being granted

Solution 1 and 2 attempt to group the tests in a better way which makes partial concurrency possible.

Solution 3 is likely the best option. However, I am unsure if the tests would be able to be logged into two different user accounts at the same time. If this is not the case, then other tests that are logged into the normal account could fail because they are now logged into a different user.

To user Solution 3, a method similar to core's user.js spec where the cookies are cleared before each test and the login is performed in the test should fix this.

I intend to get to this soon. I start a contracting role on Monday and may be able to do this as part of my contracting.

To user Solution 3, a method similar to core's user.js spec where the cookies are cleared before each test and the login is performed in the test should fix this.

It should be sufficient to use different accounts for each spec since only the specs are run concurrently. So instead of logging in as admin in every spec it could be a new user for each spec that then get's checkuser rights by the admin account. There should be no need to clear cookies. See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CodeMirror/+/934321 where I did the same to make sure the one of the specs is run by a different user.

Thanks for doing this.

Change 935160 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Use separate user for selenium tests that require checkuser group

https://gerrit.wikimedia.org/r/935160

Run into an issue which I can't seem to solve with this patch when concurrency is enabled.

Each spec gets a unique username to have as their "checkuser" account, however, only one of the three will pass in one try. The other two fail with something like:
internal_api_error_DBQueryError: [XXXXXXXXXXXX] Exception caught: A database query error has occurred. This may indicate a bug in the software.

The traceback for this is as follows:

2023-07-03 21:43:54 a293bc8207a9 wikidb: [1b501de966a1fdad47b2f592] /api.php?format=json   Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction
Function: User::addToDatabase
Query: INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('User-0.3615606896732173-Iñtërnâtiônàlizætiøn','','','',NULL,'','00c36179bb3fe34b2da103d85fdd5aa5','20230703214354',0,'20230703214354')

#0 /workspace/src/includes/libs/rdbms/database/Database.php(1284): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /workspace/src/includes/libs/rdbms/database/Database.php(1258): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /workspace/src/includes/libs/rdbms/database/Database.php(743): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /workspace/src/includes/libs/rdbms/database/Database.php(1564): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#4 /workspace/src/includes/user/User.php(2769): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#5 /workspace/src/includes/libs/rdbms/database/Database.php(2444): User->{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
#6 /workspace/src/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure)
#7 /workspace/src/includes/libs/rdbms/database/DBConnRef.php(658): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#8 /workspace/src/includes/user/User.php(2798): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure)
#9 /workspace/src/includes/auth/AuthManager.php(1567): User->addToDatabase()
#10 /workspace/src/includes/auth/AuthManager.php(1290): MediaWiki\Auth\AuthManager->continueAccountCreation(array)
#11 /workspace/src/includes/api/ApiAMCreateAccount.php(108): MediaWiki\Auth\AuthManager->beginAccountCreation(User, array, string)
#12 /workspace/src/includes/api/ApiMain.php(1915): ApiAMCreateAccount->execute()
#13 /workspace/src/includes/api/ApiMain.php(892): ApiMain->executeAction()
#14 /workspace/src/includes/api/ApiMain.php(863): ApiMain->executeActionWithErrorHandling()
#15 /workspace/src/api.php(95): ApiMain->execute()
#16 /workspace/src/api.php(48): wfApiMain()
#17 {main}

The retry allows another spec to complete without issue, but the third always fails. I presume that the createaccount API can't handle the accounts being created all at the same time? It seems like a bug with the accountcreation API, but I may be missing something.

Based on the descriptions in those tasks, it's not possible to create multiple accounts in quick succession using the API. I will look into getting the tests to use an account with a common account that has the checkuser group. Otherwise, this will have to be dependent on a fix in T199393.

Change 935160 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Use separate user for selenium tests that require checkuser group

https://gerrit.wikimedia.org/r/935160

I misunderstood how the gap lock works in MariaDB/Innodb. I thought it locked fields based on a common prefix, but the lock is for the whole gap between existing fields. So if two users get inserted concurrently in the same gap, the later get a deadlock.

The way to workaround it is to retry the user creation, or maybe using a separate user works around it as well :)

The way to workaround it is to retry the user creation, or maybe using a separate user works around it as well :)

Unfortunately, it seems that because the deadlock causes a entry to be added to the mw-error.log, the selenium tests will still just fail even if the attempt at creation is re-tried after catching the error.

The alternatives I see are:

  1. Use the standard account (i.e. Admin) and give it checkuser rights. This is flawed as other selenium tests could rely on the information that a user has no user groups, which means that order in which the tests are run becomes important.
  2. Find some way to create a new user account and assign it checkuser rights before the specs are run in parallel.

Based on reading https://webdriver.io/docs/configurationfile/, might it be possible to define the before function in the config? If so, it might be useful to achieve the second suggestion above as the docs say:

Gets executed before test execution begins. At this point you can access to all global variables like browser. It is the perfect place to define custom commands.

If everything that is needed could be accessed, then the user creation could happen at this point and therefore not be subject to simultaneous attempts to create a user.

Find some way to create a new user account and assign it checkuser rights before the specs are run in parallel.

That would also be my favorite way forward here. It should be possible to define an extra spec that just creates the accounts sequentially and then the test runner executes that spec first and the the others afterwards in parallel. See https://webdriver.io/docs/organizingsuites/#grouping-test-specs-in-suites - I might find the time to do a POC for that these days. But feel free to play with that yourself @Dreamy_Jazz :-)

This comment was removed by Dreamy_Jazz.

@WMDE-Fisch I should have a working fix for the issue. I've made the changes in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/935162. I think that patch is changed enough that it can be considered my own work enough that you could review if desired (even though you had originally made a patch to modify maxInstances). The method followed is:

  • Create a new spec config file named wdio.pre-specs.conf.js that runs one new "pre-spec" file.
    • This "pre-spec" file is a spec file that is run before all others by being run on a separate wdio execution.
    • This pre-spec creates the CheckUser account and grants it the checkuser group
  • Update the selenium-test command to run the pre-specs first and then the normal specs.

This method means that the account is fully created and given the checkuser group before the other specs run. This also means that the other specs do not have to check and wait to see if the account exists (as it must exist). This method seems to be the fastest I've found so far. It takes around 13 to 16 seconds compared to 45+ seconds on the master branch.


Find some way to create a new user account and assign it checkuser rights before the specs are run in parallel.

That would also be my favorite way forward here. It should be possible to define an extra spec that just creates the accounts sequentially and then the test runner executes that spec first and the the others afterwards in parallel. See https://webdriver.io/docs/organizingsuites/#grouping-test-specs-in-suites - I might find the time to do a POC for that these days. But feel free to play with that yourself @Dreamy_Jazz :-)

My reading of the documentation you linked seemed to imply that by grouping the spec files it would ... be run sequentially in a single instance. This implies to me that doing this would make the specs non-concurrent and as such the patch would not improve the speed as the specs would still be run one-by-one. The method I've followed does what you suggested. If I've misread the documentation, I can try this method if you would prefer. Thanks for the pointer.

  1. Find some way to create a new user account and assign it checkuser rights before the specs are run in parallel.

That is an excellent idea! I think you can create the account over the API using an instance of MWBot, something such as:

const Api = require( 'wdio-mediawiki/Api' );

bot = await Api.bot();
Api.createAccount( bot, username, password ); // Returns a Promise resolving to the JSON response from action=createaccount

And I guess that is done in the beforeEach or before hooks which thus run concurrently. But the user creation could be added to the onPrepare hook which fires before the workers are started concurrently, the issue is the data/state from onPrepare is not shared with the workers which run in a standalone process :-\


Maybe the deadlock caused by Innodb gap lock from T199393 can be worked around by adding a prefix to the usernames, that would reduce the race. So that if there is already:

User0.11111
User0.9999

If the two specs concurrently tries to insert User-0.3333 and User-0.7777 they fall in the same gap and one is rejected:

ExistingRequest 1Request 2
User0.11111
DeadlockUser-0.33333User-0.7777
User0.9999

But if one suite uses a different prefix such as MySuiteUser-0.5555, that comes before User0.11111 and would not run in a race condition with a User-* name. There is still a race condition though if the created User comes before any existing User:

ExistingRequest 1Request 2
OKMySuiteUser-0.5555
User0.11111
OKUser-0.33333
User0.9999

Of course there is still a race condition if the request 2 in the example above insert a User coming before the first entry (User0.0000 would cause a lock with MySuiteUser-0.5555) :-(

Change 935162 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] selenium: run tests concurrently

https://gerrit.wikimedia.org/r/935162

  1. Find some way to create a new user account and assign it checkuser rights before the specs are run in parallel.

That is an excellent idea! I think you can create the account over the API using an instance of MWBot, something such as:

Thanks. I've implemented something like this in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/935162. The only problem with this approach I've not solved so far is that the username and password are now hardcoded again.

Thanks. I've implemented something like this in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/935162. The only problem with this approach I've not solved so far is that the username and password are now hardcoded again.

An alternative I guess is to create the user in the onPrepare hook, save the generated username/password on disk in a json file then have the spec files to source that json file to retrieve credentials.

I had a similar issue (creating some states to be shared between spec suites) which led me to discover https://webdriver.io/docs/shared-store-service/ . My exact use case requires to create a list of resources which can't be shared between sources (ResourcePool) but that requires webdriver.io 8.7.0+ which is why I haven't mentioned it previously.

Then, you only need to share a single user/password and it can be shared between tests as I understand it. Thus with webdriver v7 you can use that shared storage (v7 doc is at https://v7.webdriver.io/docs/shared-store-service/ ):

package.json
{
  "devDependencies": {
    "@wdio/shared-store-service": "7.<whatever>"
 }
}

And apparently it can be used as:

wdio.conf.js
export.config = {
    services: ['shared-store'],

   // Runs before worker are started
    onPrepare: async () => {
        // Set values in the store
        await browser.setValue('CheckUserUsername', 'bar');
        await browser.setValue('CheckUserPassword', 'secret');
    }],

And in the spec:

somespec.js
beforeSuite: async () => {
    // Retrieve values from the store
    const username = await browser.getValue('CheckUserUsername');
    const password = await browser.getValue('CheckUserPassword');
}

Well maybe? I don't know really, but that sounds to be the official path to share state between the main process and the worker (the alternative is a flat json file to share state).

I ended up using a JS file that loads the specs using a way that is suggested by the documentation to run the specs programmatically. This method means that the wrapper file is loaded that generates the shared random username and password and then provides this to the specs.

I ended up using a JS file that loads the specs using a way that is suggested by the documentation to run the specs programmatically.

@Dreamy_Jazz do you have a link to that, and could you include it in the commit message please?

Added it to the commit message (for the avoidance of doubt, I was doing this on my own time but happy to do this in contracting time if this is seen as useful).

Change 935162 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] selenium: run tests concurrently

https://gerrit.wikimedia.org/r/935162