Page MenuHomePhabricator

Flaky Special:RecentChanges selenium test
Closed, ResolvedPublic

Description

mediawiki/extensions/PageTriage/+/440018
quibble-vendor-mysql-hhvm-docker/6069

quibble-vendor-mysql-hhvm-docker/6069
11 passing (59.40s)
1 failing

1) Special:RecentChanges shows page creation:
'BeforeEach-name-0.9564462824203286-Iñtërnâtiônàlizætiøn☃' === '0.01864498162646111-Iñtërnâtiônàlizætiøn☃'
running chrome
AssertionError: 'BeforeEach-name-0.9564462824203286-Iñtërnâtiônàlizætiøn☃' === '0.01864498162646111-Iñtërnâtiônàlizætiøn☃'
    at Context.<anonymous> (/workspace/src/tests/selenium/specs/specialrecentchanges.js:23:10)
    at Promise.F (/workspace/src/node_modules/core-js/library/modules/_export.js:36:28)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Failing job 1
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/7180/artifact/log/mw-debug-www.log

Page title "0.8787629835863413-Iñtërnâtiônàlizætiøn"

L9120 API POST
L9153 Special:RecentChanges GET
L9187 Request ended normally
L9289 [runJobs] recentChangesUpdate (not sure what this really means)

Failing job 2
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/7165/artifact/log/mw-debug-www.log

Page title "0.6715652655280258-Iñtërnâtiônàlizætiøn"

L9180 API POST
L9217 Special:RecentChanges GET
L9148 Request ended normally
L9350 [runJobs] recentChangesUpdate

Successful job
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php70-docker/6323/artifact/log/mw-debug-www.log

Page title "0.7994121944889043-Iñtërnâtiônàlizætiøn"

L8614 API POST
L8644 Request ended normally
L8649 Special:RecentChanges GET
L8798 [runJobs] recentChangesUpdate

Interesting config variables:
$wgJobBackoffThrottling, $wgJobRunRate, $wgRunJobsAsync

It is hard to say which request "Request ended normally" belongs to, or if the delayed appearance in case of the failing jobs proves causation of the problem but it appears that for the failing jobs the GET request for Special:RecentChanges is really done before the API POST is fully digested by the system.
Assuming - if multiple processes were involved - that all processes write to the same log file and given it does not contain PIDs it is really hard to judge what is really going on. Can we be certain that $wgRunJobsAsync is false (per includes/DefaultSettings.php)?

@zeljkofilipin Subtask to not have (possibly) unrelated things intertwined.

MediaWiki::triggerJobs() is not run when MediaWiki::restInPeace() is called with $mode "fast" as it is the case for api.php - so no matter the settings (e.g. $wgJobRunRate), they will not be taken into consideration. The browser tests do the page creation as a setup step through the API. (see debug output here)

So that is similar to T191537 A previous test enqueue a job for a user Foo, then in the next test for the user Bar, the job for Foo was running and emitting a cookie which logged out the Bar user. That got solved by preventing the jobs from emitting headers / cookies.

To track it down, I have made a patch to mediawiki/core which prefix debug lines with the request id https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/438231/

I have made the PageTriage dummy change https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PageTriage/+/445214/4 to depend on that one. That makes it possible to split the debug lines on a per request basis.

That patch also set $wgJobRunRate = PHP_INT_MAX

@Pablo-WMDE I have just read your comment about api.php not running the jobs on edit. Seems in this case we should create the page as a user would create it and that would solve it.

Pablo, I will not be around this Friday. But it seems switching from API edit to a regular edit might force the job to run.

I did a few recheck / check php on the https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PageTriage/+/445214/4 PageTriage change that depends on a change in core that set $wgJobRunRate = PHP_INT_MAX. It is working fine apparently.

Possibly one could then edit https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/438231/ , stop setting wgJobRunRate and recheck a few times the PageTriage test change. It should eventually fail. Then the debug log attached to the build would have each line prefixed with the request id (which should be made an improvement in the mediawiki log for CI).

Next up: investigate if the updates to recentchanges are really performed by triggerJobs() or by DeferredUpdates::doUpdates() (which at first sight does not have a guard) - the differentiation between the two is still unknown to me.

DeferredUpdates only tries a Pingback for me, the updates to RecentChanges are (or would be) run through triggerJobs() in JobRunner::executeJob() after being added to the stack of todos in RecentChange::save(). Interestingly the job does not seem to be responsible for persisting information but facilitates the "cacheUpdate" for RecentChanges.

Change 445630 had a related patch set uploaded (by Pablo Grass (WMDE); owner: Pablo Grass (WMDE)):
[mediawiki/core@master] selenium: create edit for RecentChanges throgh GUI

https://gerrit.wikimedia.org/r/445630

Change 445630 merged by jenkins-bot:
[mediawiki/core@master] selenium: create edit for RecentChanges through GUI

https://gerrit.wikimedia.org/r/445630

Change 446001 had a related patch set uploaded (by Pablo Grass (WMDE); owner: Pablo Grass (WMDE)):
[mediawiki/core@master] DNM selenium: create edit for RecentChanges through GUI

https://gerrit.wikimedia.org/r/446001

Change 445222 had a related patch set uploaded (by Krinkle; owner: Hashar):
[mediawiki/core@master] In development, run all jobs on each requests

https://gerrit.wikimedia.org/r/445222

[mediawiki/core@master] In development, run all jobs on each requests
https://gerrit.wikimedia.org/r/445222

I'm worried about changing overall MediaWiki behaviour for one test. This may make tests less representative, cause confusion, or we may get conflicting preferences/requirements from multiple tests in different extensions wanting different behaviour.

Let's zoom out, back to the original issue:

We have one failing Selenium test. That test is failing because it is asserting that making an edit will result in recent changes showing the made edit, on top. We realised it fails because the insertion to the recentchanges table post-edit happens from a job - which do not run on API requests, and even on normal web requests, they run asynchronously.

Thus, the assertion is not compatible with what MediaWiki guarantee. This is expected.

The guarantee is that, when an edit is made, and jobs have run, then it will show on the recent changes page. Perhaps we can consider the option to keep default behaviour as-is (which would also make tests faster, we could even disable job processing entirely), and instead modify the test to trigger the jobs as part of its steps to gain the desired outcome.

The test could go back to using the API to edit, but more importantly, it could trigger the jobs from within that particular tests. There are various different ways we could trigger the jobs:

  1. Shell out to maintenance script. (php $MW_INSTALL_PATH/maintenance/runJobs.php --wiki $MW_DB)
  2. Or; Viewing Special:RunJobs. (http.request( env.MW_SCRIPT_PATH + '/index.php', { method: POST } ... async=0) – Would need SecretKey via ENV.

The problem with setting it unconditionally from DeveloperSettings.php is that it conflicts with local development, e.g. where you might have a JobQueue runner, or might not want all jobs to run synchronously every day when working locally.

The problem with setting it in Jenkins is that it will create different behaviour for tests than locally, causing the tests to fail when running them locally.

Are jobs only run for POST requests?

For a similar purpose, addwiki/mediawiki-api runs jobs in tests by GETing the main page repeatedly and inbetween each it checks meta=siteinfo&siprop=statistics to see if there are any jobs still to run. It seems like a similar thing could be a useful addition to MediaWikiTestCase.

Are jobs only run for POST requests?

They generally run for all requests to the index.php entry point, which includes the bulk non-POST requests (e.g. GET views). The exception is (maybe) requests ending with a fatal errors or other early exists.

It seems like a similar thing could be a useful addition to MediaWikiTestCase.

For PHPUnit test suites, we already have an in-memory JobQueue that is discarded and reset between each test case (unless explicitly run a test).

Oh right, that makes sense; and Selenium tests are run against a DB that persists between tests?

Indeed. In Jenkins we do create a fresh wiki in each job build, but between the different Selenium test suites and their test cases there is no reset. We also use the same browser-test specs to confirm functionality against an existing wiki, e.g. during development against your local wiki, and (in some cases) against Beta Cluster.

Change 446543 had a related patch set uploaded (by Pablo Grass (WMDE); owner: Pablo Grass (WMDE)):
[mediawiki/core@master] selenium: invoke jobs to enforce eventual consistency

https://gerrit.wikimedia.org/r/446543

Change 445222 abandoned by Hashar:
In development, run all jobs on each requests

Reason:
That was a dirty / quick win patch to test out the failures we had on PageTriage / recentchange job not running.

I definitely love the suggestion Timo made to have the test suite run the job / ensure all jobs get run ( T199446#4429139 ).

So I am abandoning this change in favor of https://gerrit.wikimedia.org/r/#/c/mediawiki/core/ /446543/ "selenium: invoke jobs to enforce eventual consistency"

https://gerrit.wikimedia.org/r/445222

@Krinkle Thank you for thinking out loud and explaining your point of view. I agree we should be very careful not to create an idealized test environment which does not resemble reality any more. At the same time, dealing with eventual consistency, we need a way of reliably asserting the state of the application once these promises are met (maybe, possibly also the ones in between in addition). The question is probably only about the "how" we reach this goal - at the moment I can not think of a solution that would not attract criticism one way or another.
https://gerrit.wikimedia.org/r/446543 implements (naively) what @Krinkle described. Exposing wgSecretKey (and if this is wise to begin with) to node is still unsolved.

https://gerrit.wikimedia.org/r/446543 now implements (wgSecretKey-dependent) job execution through Special:RunJobs as well as a fallback strategy which repeatedly opens the front page.
As apparent from the logs (ctrl+f "SpecialRunJobs") the fallback strategy could do with a higher wgJobRunRate (seems to work a stable 2 at the moment) to reduce the amount of requests needed and to reliably clean up all jobs before the (arbitrary) limit of 10 requests is reached.
Even though I did it I'm not a huge fan, think that avoiding the jobs to pile up would be the preferable - and still sufficiently realistic - way to go.

Hi @hashar & @Krinkle,
could you please take another look at https://gerrit.wikimedia.org/r/446543 ? I think I have nothing else to add, and we should either plunge forward with this (maybe attracting more ideas) or let it die now. I do not know which levers to pull to make this work though the special page in CI so ATM it always uses the fallback strategy (could be made less horrible by increasing $wgJobRunRate) or someone would need to jump in to set MEDIAWIKI_SECRET_KEY.

Thanks

Change 446001 abandoned by Pablo Grass (WMDE):
DNM selenium: create edit for RecentChanges through GUI

https://gerrit.wikimedia.org/r/446001

This test remains incredibly flaky. Can we disable it for now?

Change 449525 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/core@master] Disable "...able to change preferences" browser test

https://gerrit.wikimedia.org/r/449525

This test remains incredibly flaky. Can we disable it for now?

yes and thank you.

Change 449525 merged by jenkins-bot:
[mediawiki/core@master] Disable "...able to change preferences" browser test

https://gerrit.wikimedia.org/r/449525

Sorry, are we using this ticket now as sink for all kinds of selenium-related issues?
I would greatly appreciate if someone looked at https://gerrit.wikimedia.org/r/446543. Thanks

I've commented +1 on the ticket. I think we need to encourage some more review from someone either on the the ticket as a reviewer or find someone else who can help us more this forwards.

@Tarrow I took a look at the patch, but I am really not familiar with way jobs are processed. It seems to me that @Krinkle had the best insights so far. He is probably the one to decide if the patch is ready to be merged or if it needs further work.

Pablo-WMDE changed the task status from Open to Stalled.Aug 29 2018, 1:04 PM

Change 459800 had a related patch set uploaded (by Pablo Grass (WMDE); owner: Pablo Grass (WMDE)):
[mediawiki/core@master] selenium: invoke jobs to enforce eventual consistency

https://gerrit.wikimedia.org/r/459800

Addshore changed the task status from Stalled to Open.Sep 21 2018, 7:38 AM

Change 459800 merged by jenkins-bot:
[mediawiki/core@master] selenium: invoke jobs to enforce eventual consistency

https://gerrit.wikimedia.org/r/459800

Change 459800 merged by jenkins-bot:
[mediawiki/core@master] selenium: invoke jobs to enforce eventual consistency

https://gerrit.wikimedia.org/r/459800

I'm confused. We added something a bit hacky that will be slower and more difficult to debug than a simple runJobs.php shell exec. And we did that because we want to be compatible with Beta Cluster?

That sounds fair. Except, this isn't compatible with Beta Cluster. Beta uses Varnish, and has wgJobRunRate disabled, and Beta has actual users and unrelated jobs being triggered. As such, the tests that would use RunJobs probably aren't the kind of test we can or should run against Beta in that way.

Perhaps this problem can be avoided entirely?

For tests that need jobs to run, one could use a waitFor() principle. E.g. poll the RecentChanges page upto 5 times over 5 seconds until the expectation matches or the test is marked as failing. This does the same thing as the current patch, except:

  • It doesn't hard code details about RunJobs, which makes it a better blind/integration test.
  • It won't fail in Jenkins due to race conditions. By leaving the time in between each poll, it allows jobs to actually complete first. The current patch hits the main page and checks the API quick enough, it might fail before the job even starts/completes.
  • It will actually work in Beta, and might even work for production. The time delay would give jobrunners time to do their work.

For tests that need jobs to run, one could use a waitFor() principle. E.g. poll the RecentChanges page upto 5 times over 5 seconds until the expectation matches or the test is marked as failing. This does the same thing as the current patch, except:

  • It doesn't hard code details about RunJobs, which makes it a better blind/integration test.
  • It won't fail in Jenkins due to race conditions. By leaving the time in between each poll, it allows jobs to actually complete first. The current patch hits the main page and checks the API quick enough, it might fail before the job even starts/completes.
  • It will actually work in Beta, and might even work for production. The time delay would give jobrunners time to do their work.

I quite like the sound of that

Change 446543 abandoned by Pablo Grass (WMDE):
selenium: invoke jobs to enforce eventual consistency

Reason:
Ifc49bc5c06eef71da8aff1b1837c32debc2c45dd became a minimal solution to this.

https://gerrit.wikimedia.org/r/446543

@Pablo-WMDE would you be interested in trying an approach that does something similar to what is described in T199446#4609222?

@Addshore I'm suprised to find this ticket still open actually. So unless anyone pushes to patch the current RunJobs implementation out of core, which would cause some zugzwang on our side, I'm pro closing this. What do you think?

[..]

L9120 API POST
L9153 Special:RecentChanges GET
L9187 Request ended normally
L9289 [runJobs] recentChangesUpdate (not sure what this really means)

This job deals with removing rows older than 30 days, and (sometimes) updating Special:ActiveUsers. It has nothing to do with edits appearing on Special:RecentChanges.

The data for Special:RecentChanges is saved during the edit (either EditPage or API, both ways). The Revision and RecentChange are created and saved before the HTTP request gets a response back.

Change 539691 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] selenium: Remove "RunJobs" wait from specialrecentchanges test

https://gerrit.wikimedia.org/r/539691

Change 539691 merged by jenkins-bot:
[mediawiki/core@master] selenium: Remove "RunJobs" wait from specialrecentchanges test

https://gerrit.wikimedia.org/r/539691

Change 579432 had a related patch set uploaded (by Reedy; owner: Legoktm):
[mediawiki/core@REL1_31] Disable "...able to change preferences" browser test

https://gerrit.wikimedia.org/r/579432

Change 579432 merged by jenkins-bot:
[mediawiki/core@REL1_31] Disable "...able to change preferences" browser test

https://gerrit.wikimedia.org/r/579432