16:33:32 1) ForeignResourceStructureTest::testVerifyIntegrity
16:33:32 LogicException: Failed to download resource at https://codeload.github.com/wikimedia/jquery.i18n/tar.gz/70b5ee20a638cb8fe36baef8d51ac2eb577ce012
16:33:32
16:33:32 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:300
16:33:32 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:381
16:33:32 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:186
16:33:32 /workspace/src/tests/phpunit/integration/includes/ResourceLoader/ForeignResourceStructureTest.php:41
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Declined | | None | T362426 CI depending on GitHub results in numerous failures outside our control
Resolved | | hashar | T362425 ForeignResourceStructureTest flaky in CI due to "Failed to download resource at https://codeload.github.com"
Resolved | | hashar | T368550 castor does not restore caches?
Event Timeline
Got this failure again, this time with jQuery:
00:03:33.091 1) ForeignResourceStructureTest::testVerifyIntegrity
00:03:33.091 LogicException: Failed to download resource at https://code.jquery.com/qunit/qunit-2.20.0.js
00:03:33.091
00:03:33.091 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:292
00:03:33.091 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:350
00:03:33.091 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:204
00:03:33.091 /workspace/src/tests/phpunit/integration/includes/ResourceLoader/ForeignResourceStructureTest.php:41
See the console log. This is for a pure js change: Object.assign's first argument must never be null/undefined
I just filed the jQuery version at T368385 as well; not sure if it makes sense to track separately or should be considered a duplicate, TBH.
BTW, I also remember occasionally getting this failure (for GitHub)… maybe ForeignResourceManager should retry the download once or twice if it fails? (AFAICT it’s never called during normal requests, so the potential extra runtime shouldn’t be a production concern, I think.)
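To make the retry suggestion concrete, here is a minimal sketch in Python rather than MediaWiki's PHP. It is not the ForeignResourceManager code: `fetch_with_retries`, its parameters and the injected `fetch` callable are all hypothetical names chosen for illustration, and the backoff delays are an assumption.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on OSError up to `attempts` times in total.

    `fetch` is injected (hypothetical) so that the retry policy stays
    independent of the HTTP client. Timeouts and connection errors in the
    Python standard library are subclasses of OSError.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With `attempts=3` this is "retry twice". Retrying only on network-level errors means an integrity mismatch would still fail immediately instead of being retried away.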
I guess that depends on what the goal of the test is…
In a change where the foreign resources are touched, I think specifying additional sources would effectively mean that we trust all of the listed sources equally? Since we would let the change pass CI (and check the new resources into Git) if the file matched any of the sources.
In a change where the foreign resources aren’t touched, I think the test serves almost no purpose and might as well be disabled, except that it’s tricky to implement that? (It could still detect if upstream clandestinely changes the resource at the same URL, but I don’t know that it’s our responsibility to detect that, to be honest.)
Change #1049584 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):
[mediawiki/core@master] ForeignResourceManager: Show details about error
Seen in another build:
LogicException: Failed to download resource at https://registry.npmjs.org/oojs/-/oojs-7.0.1.tgz
So either codeload.github.com, code.jquery.com and registry.npmjs.org are all having infrastructure problems (admittedly, GitHub and npm are both owned by Microsoft, so it’s not completely impossible)… or the issue is on our end?
Change #1049594 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):
[mediawiki/core@master] Skip failing ForeignResourceStructureTest
Add cloudflare to the list of seemingly affected upstreams (build):
LogicException: Failed to download resource at https://cdnjs.cloudflare.com/ajax/libs/chosen/1.8.2/chosen-sprite%402x.png: HTTP request timed out.
At this point I think it’s pretty likely that the issue is on our end… but unfortunately “HTTP request timed out” isn’t as much detail as I’d hoped for :S
Change #1049606 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):
[mediawiki/core@master] Revert "Skip failing ForeignResourceStructureTest"
Change #1049594 merged by jenkins-bot:
[mediawiki/core@master] Skip failing ForeignResourceStructureTest
This is odd, on the integration VMs I am not seeing any connection problems.
thcipriani@integration-agent-docker-1045:~$ nc -vz cdnjs.cloudflare.com -w 1 443
Connection to cdnjs.cloudflare.com (104.17.24.14) 443 port [tcp/https] succeeded!
thcipriani@integration-agent-docker-1045:~$ nc -vz github.com -w 1 443
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
I note that there are extra layers: docker container, package manager that still should be tested, but I think network connectivity seems fine at the moment.
Change #1049584 merged by jenkins-bot:
[mediawiki/core@master] ForeignResourceManager: Show details about error
Hrm, looks like there are basic TCP connection errors on the VM. I also note the IP changing a surprising amount during this short test:
thcipriani@integration-agent-docker-1045:~$ while : ; do nc -vz github.com -w 6 443; sleep 30; done
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.3) 443 port [tcp/https] succeeded!
nc: connect to github.com (140.82.113.4) port 443 (tcp) timed out: Operation now in progress
Connection to github.com (140.82.113.4) 443 port [tcp/https] succeeded!
trying this locally: seems par for github, failing outbound is something else though.
I think @Andrew moved the contint Cloud VPS nodes to the OVS agent hypervisors today. Could that be related?
One potential issue is that ForeignResourceStructureTest::testVerifyIntegrity is triggered from each of the Jenkins jobs running for mediawiki/core (and thus under PHP 7.4, 8.1, 8.2 and 8.3), and for every patch sent.
Note that it is downloading the tarballs to verify their integrity. The download could potentially be skipped if we instead kept a checksum of each of the files contained in the tarball.
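As a rough illustration of the per-file checksum idea, the following Python sketch builds a manifest of SHA-256 digests from a tarball once, then checks local files against that manifest with no download. It is only a sketch under assumptions: the function names and the manifest shape (a dict of member name to hex digest) are hypothetical and not something ForeignResourceManager is known to provide.

```python
import hashlib
import io
import tarfile


def tar_member_digests(tar_bytes: bytes) -> dict[str, str]:
    """Map each regular file in a (possibly compressed) tarball to its SHA-256 hex digest."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, symlinks and other special entries
            extracted = archive.extractfile(member)
            digests[member.name] = hashlib.sha256(extracted.read()).hexdigest()
    return digests


def mismatched_files(manifest: dict[str, str], local_files: dict[str, bytes]) -> list[str]:
    """Return the sorted names that are missing locally or whose content differs."""
    bad = []
    for name, expected in manifest.items():
        content = local_files.get(name)
        if content is None or hashlib.sha256(content).hexdigest() != expected:
            bad.append(name)
    return sorted(bad)
```

The manifest would be generated when a resource is added or updated (the one time a download is needed) and checked into Git; the routine CI check then only hashes files that are already in the working copy.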
After a quick look at includes/ResourceLoader/ForeignResourceManager.php, it supports XDG_CACHE_HOME and uses mw-foreign beneath it. If I look at the CI cache, there is a single job having such a directory:
$ ls -l /srv/castor/mediawiki-core/master/mediawiki-quibble-vendor-mysql-php80/mw-foreign
total 5708
drwxr-sr-x 2 jenkins-deploy wikidev 4096 May 7 13:21 .
drwxrwsrwx 7 jenkins-deploy wikidev 4096 May 7 13:21 ..
-rw-r--r-- 1 jenkins-deploy wikidev 12718 May 7 13:21 CLDRPluralRuleParser_35271498_328afeab_CLDRPluralRuleParser_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 511630 May 7 13:21 codex_09902517_c9fd17c2_codex_1_5_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 67311 May 7 13:21 codex_design_tokens_945744ae_48a6385a_codex_design_tokens_1_5_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 187300 May 7 13:21 codex_icons_b05e0ae9_00df03a6_codex_icons_1_5_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 12959 May 7 13:21 fetch_polyfill_84f8f065_573ed6ed_whatwg_fetch_3_6_2_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 22342 May 7 13:21 intersection_observer_dc899fec_e740e0f2_intersection_observer_0_12_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 1215 May 7 13:21 jquery_chosen_056947c8_a922d849_LICENSE_md.data
-rw-r--r-- 1 jenkins-deploy wikidev 1904 May 7 13:21 jquery_chosen_4bc4fa80_f000651a_README_md.data
-rw-r--r-- 1 jenkins-deploy wikidev 47205 May 7 13:21 jquery_chosen_6ad86030_ccf582d1_chosen_jquery_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 11978 May 7 13:21 jquery_chosen_83917de4_61c23e25_chosen_css.data
-rw-r--r-- 1 jenkins-deploy wikidev 538 May 7 13:21 jquery_chosen_b5bfabcd_0a20953f_chosen_sprite_png.data
-rw-r--r-- 1 jenkins-deploy wikidev 738 May 7 13:21 jquery_chosen_ebd43fb5_2007fde4_chosen_sprite%402x_png.data
-rw-r--r-- 1 jenkins-deploy wikidev 6010 May 7 13:21 jquery_client_fad01184_d7a52564_jquery_client_3_0_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 285314 May 7 13:21 jquery_e5d63f9e_c357ee36_jquery_3_7_1_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 112819 May 7 13:21 jquery_i18n_a5136c3c_9483ae39_70b5ee20a638cb8fe36baef8d51ac2eb577ce012.data
-rw-r--r-- 1 jenkins-deploy wikidev 1459212 May 7 13:21 moment_cc10245f_fe546a6a_2_25_2.data
-rw-r--r-- 1 jenkins-deploy wikidev 34584 May 7 13:21 mustache_700834f3_1b259b17_mustache_4_2_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 24215 May 7 13:21 oojs_e785e93d_32a85675_oojs_7_0_1_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 1601863 May 7 13:21 ooui_d516a5cd_35888a57_oojs_ui_0_49_1_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 142467 May 7 13:21 pako_1a96f6ff_cdb1394f_pako_deflate_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 5183 May 7 13:21 pako_3cda6da3_cba999cf_README_md.data
-rw-r--r-- 1 jenkins-deploy wikidev 27876 May 7 13:21 pako_53b54ed7_12e6bae1_pako_deflate_min_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 1104 May 7 13:21 pako_f23deea8_f82a78b9_LICENSE.data
-rw-r--r-- 1 jenkins-deploy wikidev 84760 May 7 13:21 pinia_0987ebc5_a1f75ba8_pinia_2_0_16_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 9706 May 7 13:21 qunitjs_674dbb25_3a1a650e_qunit_2_20_0_css.data
-rw-r--r-- 1 jenkins-deploy wikidev 261991 May 7 13:21 qunitjs_c71e1571_f31cb574_qunit_2_20_0_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 215019 May 7 13:21 sinonjs_5439e32e_313e716d_sinon_1_17_7_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 1082 May 7 13:21 url_7b040ba9_b66de763_LICENSE_md.data
-rw-r--r-- 1 jenkins-deploy wikidev 148 May 7 13:21 url_8675b312_28ba3621_polyfill_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 18263 May 7 13:21 url_ae97ebf9_6b4ac0f0_polyfill_js.data
-rw-r--r-- 1 jenkins-deploy wikidev 531501 May 7 13:21 vue_3418ace6_2ddc00f2_vue_3_3_9_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 65770 May 7 13:21 vuex_0a04d0b3_7636816b_vuex_4_0_2_tgz.data
That job last ran on May 7; we no longer run the php8.0 job. That makes me wonder whether the test actually runs on CI, or maybe the cache is not working somehow? :/
I believe this was solved in 2019 with T203694: Run ForeignResourceManager verification on MediaWiki core commits.
If you've run manageForeignResources.php verify once in the past, re-running doesn't download anything and completes near-instantly. The same is true for composer phpunit -- tests/phpunit/integration/includes/ResourceLoader/ForeignResourceStructureTest.php. These two share the same offline and non-expiring cache, validated by content hash. This is enabled in CI as well.
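For readers unfamiliar with this kind of cache, a content-hash-validated, non-expiring cache can be sketched as follows. This is an illustrative Python sketch, not the actual ForeignResourceManager logic: `cached_fetch`, the injected `download` callable and the cache file naming are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(
    cache_dir: Path,
    url: str,
    expected_sha256: str,
    download: Callable[[str], bytes],
) -> bytes:
    """Return the resource at `url`, touching the network only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one cache file per URL, keyed by a hash of the URL.
    entry = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".data")
    if entry.is_file():
        data = entry.read_bytes()
        # Entries never expire; they are trusted only if the content hash matches.
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    data = download(url)
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError(f"Integrity check failed for {url}")
    entry.write_bytes(data)
    return data
```

Because a hit is validated by content hash rather than by age, a warm cache makes repeated verification runs fully offline, which is why an empty cache directory in CI changes the picture so much.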
If CI is having general networking issues, then composer install won't succeed anyway, and individual tests like this make little difference (although the vendor job will reach PHPUnit without it, so on those jobs manageForeignResources might be the first visible failure). For this to happen, there would have to be an empty cache. This might happen once or twice a quarter after renaming or adding new Jenkins jobs, e.g. for a new PHP version. Or after a Debian upgrade of the CI runners at WMCS, where Castor would begin with an empty cache once.
I guess one of these scenarios happened recently, and thus uncovered a networking reliability problem.
With https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049879 I have emitted some debug statements, which seem to indicate the class does indeed write downloaded materials under /cache/mw-foreign, but I don't get why it is not saved. I have at least confirmed the files get written to /cache/mw-foreign, and if I trigger a job as if it was in postmerge, castor does save the mw-foreign files.
So my assumption is that eventually the cache gets nuked, or is saved while that directory does not exist.
I also note the Quibble step which installs the dev dependencies does download all dependencies:
11:31:10 - Downloading squizlabs/php_codesniffer (3.8.1)
11:31:10 - Downloading dealerdirect/phpcodesniffer-composer-installer (v1.0.0)
11:31:10 - Downloading composer/pcre (3.1.4)
11:31:10 - Downloading psr/cache (1.0.1)
11:31:10 - Downloading doctrine/deprecations (1.1.3)
11:31:10 - Downloading doctrine/event-manager (1.2.0)
Filed as T368550
I ran a job and confirmed the mw-foreign cache was saved. I was watching the cached directory on the Castor instance and eventually it vanished:
VANISHED ! Wed, 26 Jun 2024 15:30:02 +0000
I tracked back the build to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049832 from gate-and-submit (which does trigger castor). In the build artifact mw-debug-cli.log.gz there is:
[PHPUnit] Start test ForeignResourceStructureTest::testVerifyIntegrity
[PHPUnit] Skipped test ForeignResourceStructureTest::testVerifyIntegrity: T362425
Since the test is skipped, mw-foreign is not populated, and due to T368550 it is not saved even when a build generates it. Mystery solved.
Change #1049988 had a related patch set uploaded (by Ladsgroup; author: Bartosz Dziewoński):
[mediawiki/core@wmf/1.43.0-wmf.11] Skip failing ForeignResourceStructureTest
Change #1049989 had a related patch set uploaded (by Ladsgroup; author: Bartosz Dziewoński):
[mediawiki/core@wmf/1.43.0-wmf.10] Skip failing ForeignResourceStructureTest
I think the root cause is T368550, which prevents the cache from being kept between builds. That means that as the jobs running in parallel download materials over a short period of time, we might trigger a throttle/rate limit upstream which kills the connection and fails the build. As Timo said on IRC: We now roll the dice 300 times in every build instead of between 0-1 times per build.
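The "roll the dice" remark can be made quantitative with a small sketch. Assuming, purely for illustration, that each download fails independently with the same small probability p, the chance that a build hits at least one failure in n downloads is 1 - (1 - p)^n. The function name and the example value of p are assumptions; only the figure of 300 comes from the quote above.

```python
def failure_probability(per_download_failure: float, downloads: int) -> float:
    """Probability that at least one of `downloads` independent downloads fails."""
    if not 0.0 <= per_download_failure <= 1.0:
        raise ValueError("per_download_failure must be between 0 and 1")
    if downloads < 0:
        raise ValueError("downloads must not be negative")
    # All downloads succeed with probability (1 - p) ** n; the complement is the
    # chance that at least one of them fails and therefore fails the build.
    return 1.0 - (1.0 - per_download_failure) ** downloads
```

With a hypothetical p of 0.1% per download, `failure_probability(0.001, 1)` is 0.1%, while `failure_probability(0.001, 300)` is roughly 26%. An upstream error rate that is invisible with a warm cache becomes a frequent build failure without one, even before any rate limiting kicks in.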
The cache should be restored now, but that needs to be verified before we re-enable ForeignResourceStructureTest.
Change #1049988 merged by jenkins-bot:
[mediawiki/core@wmf/1.43.0-wmf.11] Skip failing ForeignResourceStructureTest
Change #1049989 merged by jenkins-bot:
[mediawiki/core@wmf/1.43.0-wmf.10] Skip failing ForeignResourceStructureTest
Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:05:56Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]]
Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:08:53Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwd
Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:14:49Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] (duration: 08m 52s)
The root cause was that we ran CI without any cache (T368550), and thus the integrity checker had to download the files again and again, as described above in T362425#9925584.
I am closing this since I restored the caching last week.
This is not fixed. The revert of the test skip (a) has not landed and, more importantly, (b) is not passing CI.
Change #1049606 merged by jenkins-bot:
[mediawiki/core@master] Revert "Skip failing ForeignResourceStructureTest"
Change #1060787 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/core@master] ForeignResourceManager: Ignore network errors during tests
Change #1060787 merged by jenkins-bot:
[mediawiki/core@master] ForeignResourceManager: Ignore network errors during tests