Page MenuHomePhabricator

Several failing tests in Wikibase CI (CentralAuthApiSessionProviderTest, CentralAuthHeaderSessionProviderTest, EditEntityActionTest, ViewEntityActionTest, HtmlPageLinkRendererEndHookHandlerTest)
Closed, ResolvedPublic

Description

Problem:
Several tests have started failing in Wikibase CI across multiple repositories (at least Wikibase, WikibaseCirrusSearch, WikibaseLexemeCirrusSearch, WikibaseQualityConstraints)

Example:
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/50856/console

Screenshots/mockups:

BDD
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

  • CI passes again

Open questions:

Event Timeline

Lucas_Werkmeister_WMDE renamed this task from Several failing tests in Wikibase CI (CentralAuthApiSessionProviderTest, CentralAuthHeaderSessionProviderTest, EditEntityActionTest, ViewEntityActionTest, HtmlPageLinkRendererEndHookHandlerTest to Several failing tests in Wikibase CI (CentralAuthApiSessionProviderTest, CentralAuthHeaderSessionProviderTest, EditEntityActionTest, ViewEntityActionTest, HtmlPageLinkRendererEndHookHandlerTest).Dec 7 2020, 5:22 PM
Lucas_Werkmeister_WMDE triaged this task as High priority.

I’m hoping that these are all due to the same underlying cause and don’t have several tests that suddenly started failing at the same time for unrelated reasons…

I can reproduce the EditEntityActionTest errors locally if I enable FlaggedRevs, but the failures don’t seem to have been caused by any recent changes in FlaggedRevs, MediaWiki core, or Wikibase: I can go some commits back in all of those repositories and still get the “Call to a member function isReviewable() on null” error.

If the error isn’t random (and so far I see no reason to assume it is), then it must’ve started to happen between ca. 13:00 UTC (successfully merged) and 14:20 UTC (main test build failed) today. But I can reproduce the error locally if I install FlaggedRevs, and I believe I haven’t pulled any Git repositories other than MediaWiki core, Wikibase, and FlaggedRevs all day. But if I revert any or all of these repositories back to an earlier commit, then the error persists locally.

So that looks like the error was there for a while – but I don’t know why it would then only start happening in CI today. The integration-config repo hasn’t had any significant changes either, nor is there anything that looks relevant in the releng SAL. I don’t understand it.

Hm, the production SAL suggests that @hashar upgraded Jenkins:

13:12 <hashar> Upgrading Jenkins 2.252 > 2.263.1 on contint2001 / contint1001 [production]

Is it at all possible that this caused the error? That seems unlikely, but it’s the only possibility I have so far.


At the PHP level, the main error seems to be:

FlaggablePageView has no context article!

This happens in FlaggablePageView::load():

	private function load() {
		if ( !$this->loaded ) {
			$this->loaded = true;
			$this->article = self::globalArticleInstance();
			if ( $this->article == null ) {
				throw new Exception( 'FlaggablePageView has no context article!' );
			}
			$this->out = $this->getOutput(); // convenience
		}
	}

Where self::globalArticleInstance() looks like this:

	public static function globalArticleInstance() {
		$title = RequestContext::getMain()->getTitle();
		if ( $title ) {
			return FlaggableWikiPage::getTitleInstance( $title );
		}
		return null;
	}

And RequestContext::getMain()->getTitle() tries to get the title from the global variable:

	public function getTitle() {
		if ( $this->title === null ) {
			global $wgTitle; # fallback to $wg till we can improve this
			$this->title = $wgTitle;
			$logger = LoggerFactory::getInstance( 'GlobalTitleFail' );
			$logger->info(
				__METHOD__ . ' called with no title set.',
				[ 'exception' => new Exception ]
			);
		}

		return $this->title;
	}

but in the debugger I can see that $wgTitle is null when the test runs.

Hm, the production SAL suggests that @hashar upgraded Jenkins:

13:12 <hashar> Upgrading Jenkins 2.252 > 2.263.1 on contint2001 / contint1001 [production]

Is it at all possible that this caused the error? That seems unlikely, but it’s the only possibility I have so far.

I would be surprised if this were the cause. Most tests don't run on the jenkins primary node. The jenkins logic is pretty low-level, I'd expect to see something failing before tests even started or in reporting but not in a specific test. Also, most tests run in docker containers that are somewhat isolated from these kinds of changes.

Jdforrester-WMF raised the priority of this task from High to Unbreak Now!.Dec 7 2020, 6:13 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Unable to merge in a prod. repo -> UBN.

Change 646790 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/FlaggedRevs@master] Skip ArticleViewHeader hook in unit tests

https://gerrit.wikimedia.org/r/646790

Change 646790 abandoned by Lucas Werkmeister (WMDE):
[mediawiki/extensions/FlaggedRevs@master] Skip ArticleViewHeader hook in unit tests

Reason:
doesn’t work very well

https://gerrit.wikimedia.org/r/646790

Change 646795 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Set wgTitle when setting wgRequest in test

https://gerrit.wikimedia.org/r/646795

There are some cache updates from today, but they seem to align with the mediawiki-codesniffer update which seems correct.

Here's all the changes between 13:00UTC and 15:00 UTC for every repo that might affect that job:

thcipriani@capella:~/MediaWiki$ for dir in ./*; do echo "$dir" && git -C "$dir" log --format='%h - %s' --after="2020-12-07T13:00:00-00:00" --before="2020-12-07T15:00:00-00:00"; done
./mediawiki
49e7981cc2 - Fixed mixed escaping in Language::translateBlockExpiry
147f8b21d0 - Use Xml::element in SpecialUserrights for sanity
e9cfc988c3 - Use Html::element in ChangeListSpecialPage for sanity
66e41027b1 - Merge "Revert "Hard-deprecate all public property access on CacheTime and ParserOutput.""
2579ca623a - build: Updating mediawiki/mediawiki-codesniffer to 34.0.0

./mediawiki-extensions-CentralAuth
f49d936e - Merge "Create CentralAuthHeaderSessionProvider"
4b51976c - Merge "Add tests for CentralAuthTokenSessionProvider"

./mediawiki-extensions-TemplateData
d08a644 - Track when (relevant) changes are made to <templatedata> tags
93b7bc0 - Drop unnecessary top-level closures

./mediawiki-extensions-UniversalLanguageSelector
4eec58bf - Update jquery.ime from upstream
./mediawiki-extensions-VisualEditor
eb9ab46d1 - Merge "ve.dm.MWExternalLinkAnnotation: Alienate malformed links"
d52097cfd - ve.dm.MWExternalLinkAnnotation: Alienate malformed links

./mediawiki-extensions-Wikibase
eb6a227378 - Merge "bridge: test appropriate MediaWiki branch in mwlibs"

It is very unlikely the Jenkins upgrade is any related. The jobs are running in Docker containers and Jenkins is really just running docker run commands over ssh to WMCS instances. I am not entirely ruling it out though, but it has a very low chance.

If the defect is related to one of the repository listed by Tyler, that would be CentralAuth. All the other repositories are running the same shared jobs wmf-quibble.* which clones them all (and others) and run every single tests. CentralAuth is the only repository which is not included.

The two builds Lucas found are a good base for the investigation. When such issue happens I do some diff between the console logs / mwdebug logs etc to try to get the culprit. I have marked both to be kept forever in Jenkins.

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/50816/
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/50850/

Next trick, download the Jenkins console log, filter for lines matching Prepared to match the informational messages emitted by zuul.cloner. That is the bit in Quibble which takes care of cloning the repositories, fetch the patches / branch. Doing a diff yields:

@@ -6 +6 @@
-INFO:zuul.Cloner.mediawiki/extensions/CentralAuth:Prepared mediawiki/extensions/CentralAuth repo with branch master at commit 75294290869b233c5e352968c40243f5d434d7d1
+INFO:zuul.Cloner.mediawiki/extensions/CentralAuth:Prepared mediawiki/extensions/CentralAuth repo with branch master at commit f49d936e0d6c88f85ccb90a8a430fa302abe9446
@@ -31 +31 @@
-INFO:zuul.Cloner.mediawiki/extensions/UniversalLanguageSelector:Prepared mediawiki/extensions/UniversalLanguageSelector repo with branch master at commit f17391c7f3a9814ee4bf42205497192fdc7c9c9d
+INFO:zuul.Cloner.mediawiki/extensions/UniversalLanguageSelector:Prepared mediawiki/extensions/UniversalLanguageSelector repo with branch master at commit 01f9a1453ded8128ea87328ab66556c2ee227f52
@@ -36 +36 @@
-INFO:zuul.Cloner.mediawiki/extensions/Wikibase:Prepared mediawiki/extensions/Wikibase repo with commit 6d06836d5133c100874e4ac9bc35da0fccc13234
+INFO:zuul.Cloner.mediawiki/extensions/Wikibase:Prepared mediawiki/extensions/Wikibase repo with commit bbeb42483fd020b58b9aa1369dba019ff51a84ef
@@ -44 +44 @@
-INFO:zuul.Cloner:Prepared mediawiki/core repo with branch master at commit 38ee9bc7bfb5c3d54010397d0e5277e313cdcedf
+INFO:zuul.Cloner:Prepared mediawiki/core repo with branch master at commit 66e41027b121b0fd04fe2bd91ce69d9609e7ab70

Or:

That is essentially the same list Tyler gave, but limited to just the commit that actually changed between the two CI builds.

The Wikibase commits are not in Gerrit, they come from the CI Zuul merger which merged the triggering change against the tip of the target branch. They correspond to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/645098 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/646681

The UniversalLanguageSelector change 01f9a1453ded8128ea87328ab66556c2ee227f52 updates jquery.uls, sounds unlikely to be related.

The mediawiki/core change 48172f794df98919249472682ffd1da961c0575b is a rollback of a ParserOutput refactoring which we had to make to unblock the train. That caused some side effect when unserializing cached ParserOutput from a previous version of mediawiki.

Which leaves us with two CentralAuth patches:

dbd3d680092fd5d99b1201dd52c5c2c0e3d25df3Add tests for CentralAuthTokenSessionProvider
6c0eb6c92f368acca15dbcb09088afd476006193Create CentralAuthHeaderSessionProvider

We can try by sending a revert commit of 6c0eb6c92 then send a dummy change against Wikibase that depends on it and see whether that fixes it.

Else one has to setup the whole stack locally and git bisect the phpunit test failure.

Change 646824 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/extensions/CentralAuth@master] Revert "Create CentralAuthHeaderSessionProvider"

https://gerrit.wikimedia.org/r/646824

Change 646826 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/extensions/Wikibase@master] (DO NOT SUBMIT) test against a CentralAuth revert

https://gerrit.wikimedia.org/r/646826

Else one has to setup the whole stack locally and git bisect the phpunit test failure.

As I said – I can still reproduce some of the test failures locally even when rolling back MediaWiki master and other repositories in Git. (CentralAuth I don’t even have enabled in LocalSettings.php. Apparently I have it cloned under extensions/, but I doubt it has an effect. That clone also hasn’t been updated since January.)

Sorry Lucas I have misread your earlier comment about the failure occurring with just Wikibase / FlaggedRevs. The test I have done reverting a CentralAuth patch doesn't help indeed, that still fails :-\

Change 646824 abandoned by Hashar:
[mediawiki/extensions/CentralAuth@master] Revert "Create CentralAuthHeaderSessionProvider"

Reason:

https://gerrit.wikimedia.org/r/646824

Maybe we’re dealing with several separate problems after all? I’ll look into some of the Wikibase errors in more detail tomorrow, but at a glance they don’t look like they should be related to the other things…

I don't have any lead myself beside the CentralAuth one (which does not seem to change anything since my proposal fails still). It is late so yeah lets check what we can do tomorrow and maybe loop other people. Thank you Luca!

I would be very unsurprised if CentralAuth was the source of at least some of the problems; it's a recurring area of concern given the complexity, tightness of integration, and impact of its results.

Change 646826 abandoned by Hashar:
[mediawiki/extensions/Wikibase@master] (DO NOT SUBMIT) test against a CentralAuth revert

Reason:

https://gerrit.wikimedia.org/r/646826

Maybe we’re dealing with several separate problems after all? I’ll look into some of the Wikibase errors in more detail tomorrow, but at a glance they don’t look like they should be related to the other things…

I wanted to look into the HtmlPageLinkRendererEndHookHandlerTest failures, but those tests all pass for me locally.

That said, the failures look like the hook decides to not run itself, and that decision is generally based on some title / request / context things, which seems to be the general area of brokenness here, so it’s plausible that this is still the same underlying cause (though I still have no clue what that cause would be).

I uploaded some tests improvements in a chain ending at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/646975, but that probably won’t have any effect on CI at the moment.

Change 646795 abandoned by Lucas Werkmeister (WMDE):
[mediawiki/extensions/Wikibase@master] Set wgTitle when setting wgRequest in test

Reason:

https://gerrit.wikimedia.org/r/646795

Okay, to answer at least one question: as far as I can tell, it is normal for $wgTitle to be null during MediaWiki integration tests. I put var_dump( $wgTitle ) into several tests and it printed NULL, even in an old MediaWiki checkout that I haven’t updated in a month.

This suggests to me that some fix should be in FlaggedRevs – it should deal with a null $wgTitle better. I’ll try to see if this gets me anywhere.

Change 646986 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/FlaggedRevs@master] Guard more singleton() calls with globalArticleInstance() checks

https://gerrit.wikimedia.org/r/646986

The above FlaggedRevs change seems to fix all the EditEntityActionTest/ViewEntityActionTest failures. The HtmlPageLinkRendererEndHookHandlerTest failures are still there (edit: see build result of this change), but my suggestion is to disable those for the time being, to unblock CI. (I haven’t been able to reproduce those failures locally yet, unfortunately.)

Change 647000 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Skip HtmlPageLinkRendererEndHookHandlerTest

https://gerrit.wikimedia.org/r/647000

Change 646986 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Guard more singleton() calls with globalArticleInstance() checks

https://gerrit.wikimedia.org/r/646986

Change 647000 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Skip HtmlPageLinkRendererEndHookHandlerTest

https://gerrit.wikimedia.org/r/647000

Lucas_Werkmeister_WMDE lowered the priority of this task from Unbreak Now! to High.

Alright, CI seems to be unbroken for the time being. Leaving the task open to find a proper fix for at least HtmlPageLinkRendererEndHookHandlerTest.

Also, do we know how this broke so badly and unexpectedly? Would be nice to avoid it recurring.

Also, do we know how this broke so badly and unexpectedly? Would be nice to avoid it recurring.

I am confused about this as well and I don't know that anyone has found any clear root cause. That is, if the code changes between the merged and failing patches were not the cause of the failures, and CI infra didn't change: what caused this to start failing and if it's correct that these tests should have been failing what caused CI to not notice?

Change 647638 had a related patch set uploaded (by Catrope; owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/FlaggedRevs@wmf/1.36.0-wmf.21] Guard more singleton() calls with globalArticleInstance() checks

https://gerrit.wikimedia.org/r/647638

Change 647638 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@wmf/1.36.0-wmf.21] Guard more singleton() calls with globalArticleInstance() checks

https://gerrit.wikimedia.org/r/647638

Mentioned in SAL (#wikimedia-operations) [2020-12-10T20:06:24Z] <catrope@deploy1001> Synchronized php-1.36.0-wmf.21/extensions/FlaggedRevs/: Guard more singleton() calls with globalArticleInstance() checks (T269608, to unbreak CI in wmf.21) (duration: 01m 04s)

Change 649317 had a related patch set uploaded (by Michael Große; owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@wmf/1.36.0-wmf.21] Skip HtmlPageLinkRendererEndHookHandlerTest

https://gerrit.wikimedia.org/r/649317

Change 649317 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.36.0-wmf.21] Skip HtmlPageLinkRendererEndHookHandlerTest

https://gerrit.wikimedia.org/r/649317

Also, do we know how this broke so badly and unexpectedly? Would be nice to avoid it recurring.

I am confused about this as well and I don't know that anyone has found any clear root cause. That is, if the code changes between the merged and failing patches were not the cause of the failures, and CI infra didn't change: what caused this to start failing and if it's correct that these tests should have been failing what caused CI to not notice?

No clue. I think the FlaggedRevs fix makes sense either way, but for HtmlPageLinkRendererEndHookHandlerTest we still need to find a proper solution. So far my only idea is to push WIP changes with more and more debug logging to Gerrit and see what happens in CI, because I still haven’t been able to reproduce this locally. (So far I uploaded a revert, just to see if it still fails – it does – but without adding any logging yet. This could also be taken over by someone else on the Wikidata team.)

With some more debug output (see Gerrit change), the cause of the HtmlPageLinkRendererEndHookHandlerTest failures seems to be that the MW_API constant is defined now, and the hook handler skips itself when running in the API. I assume that the MW_API constant was previously not defined; this CentralAuth change, merged December 7th (the first of the two CentralAuth changes listed in T269608#6674229), adds define( 'MW_API', 'TEST' ); to some CentralAuth test, and since PHP constants cannot be undefined once defined, all other tests that happen to run after that test will now think they’re running from the API, with whatever changes that entails.

Possible solutions: make that test not define MW_API, or make our test check the value of the MW_API constant and ignore the value 'TEST'. The former seems nicer, but I think it’s better to at least start with the latter, since I’m not familiar with CentralAuth at all.

(Side note: MediaWiki core’s BotPasswordSessionProviderTest has been defining that constant since this 2015 change; I can only assume that it never broke the Wikibase test because it happened to run after it.)

Change 650152 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Fix HtmlPageLinkRendererEndHookHandler and reenable test

https://gerrit.wikimedia.org/r/650152

Change 650152 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix HtmlPageLinkRendererEndHookHandler and reenable test

https://gerrit.wikimedia.org/r/650152

@Lucas_Werkmeister_WMDE congratulations on finding MW_API as being the root cause!