Page MenuHomePhabricator

npm cache saved by castor get corrupted for unknown reason
Open, Needs TriagePublic

Description

Apparently since we have upgraded CI to use npm 7, we encounter some cache corruption with castor (which is really just rsyncing the npm cache and restoring it for the following build). Previous reports: T294426 T293937 and T295341.

The fix is to find the MW_CASTOR_NAMESPACE variable in the build output and delete the directory on integration-castor03 under /srv/jenkins-workspace/caches/.

For the last occurrence in T295341, I kept a copy of the faulty cache:

mv /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-selenium-docker \
  /srv/corrupted-quibble-vendor-mysql-php72-selenium-docker

Which can be retrieved and investigated locally:

$ mkdir  corrupted-npm-cache
$ cd corrupted-npm-cache
$ rsync -va -zz integration-castor03.integration.eqiad1.wikimedia.cloud:/srv/corrupted-quibble-vendor-mysql-php72-selenium-docker
$ cd corrupted-quibble-vendor-mysql-php72-selenium-docker/npm
$ npm cache --cache "$(pwd)" verify
Cache verified and compressed (~/corrupted-npm-cache/corrupted-quibble-vendor-mysql-php72-selenium-docker/npm/_cacache)
Content verified: 0 (0 bytes)
Missing content: 6984
Index entries: 0
Finished in 1.613s

The npm, node-gyp have a modify time of Nov 5 17:12. The last change to files seem to be Nov 8 20:41 which I guess is the time a build ended up saving a faulty cache.

Event Timeline

Crawling through the build of quibble-vendor-mysql-php72-selenium-docker, I tried to find a build saving the cache around Nov 8 20:41. The job runs browser tests for any extension found to have npm run selenium.

I found https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-selenium-docker/86722/

Triggered by change: 736013,3
Project: mediawiki/extensions/Translate
Branch: master
Pipeline: gate-and-submit

The build passed and since that is from gate-and-submit we trigger a save of the cache but it failed:

20:41:04 [PostBuildScript] - [INFO] Executing post build scripts.
20:41:04 Waiting for the completion of castor-save-workspace-cache
20:41:41 castor-save-workspace-cache #2486487 started.
20:41:42 [PostBuildScript] - [ERROR] An error occured during post-build processing.
20:41:42 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
20:41:42 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
20:41:42 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
20:41:42 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
20:41:42 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
20:41:42 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
20:41:42 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
20:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:806)
20:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:755)
20:41:42 	at hudson.model.Build$BuildExecution.post2(Build.java:178)
20:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:699)
20:41:42 	at hudson.model.Run.execute(Run.java:1913)
20:41:42 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
20:41:42 	at hudson.model.ResourceController.execute(ResourceController.java:99)
20:41:42 	at hudson.model.Executor.run(Executor.java:431)
20:41:42 Caused by: java.lang.InterruptedException
20:41:42 	at java.base/java.lang.Object.wait(Native Method)
20:41:42 	at java.base/java.lang.Object.wait(Object.java:328)
20:41:42 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:79)
20:41:42 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:149)
20:41:42 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
20:41:42 	... 13 more
20:41:42 Build step 'Execute scripts' changed build result to FAILURE
20:41:42 Build step 'Execute scripts' marked build as failure

The saving cache build is https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/2486487/console

20:41:42 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-selenium-docker"
20:41:42 Creating directory holding cache:
20:41:42 Syncing cache
20:41:42 From.. 172.16.3.116:/srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker/cache
20:41:42 To.... /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-selenium-docker
20:41:42 + rsync --archive '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' '--rsync-path=docker run --rm -i --volume /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker/cache:/cache --entrypoint=/usr/bin/rsync docker-registry.wikimedia.org/releng/castor:0.2.4' --delete-delay --delay-updates jenkins-deploy@172.16.3.116:/cache/ /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-selenium-docker
20:41:42 Warning: Permanently added '172.16.3.116' (ECDSA) to the list of known hosts.
20:41:44 directory has vanished: "/cache/npm/_cacache/content-v2"
20:41:44 directory has vanished: "/cache/npm/_cacache/index-v5/80"
20:41:44 directory has vanished: "/cache/npm/_cacache/index-v5/1a/e1"
20:41:44 directory has vanished: "/cache/npm/_cacache/index-v5/1a/a3"
20:41:44 directory has vanished: "/cache/npm/_cacache/index-v5/1a/df"
20:41:44 directory has vanished: "/cache/npm/_cacache/index-v5/1a/96"
20:41:44 file has vanished: "/cache/npm/_cacache/index-v5/1a/6b/748dc1587558477ffb1dd7406d23baf5bf1b2db09ca0524268e980c1982a"
20:41:44 directory has vanished: "/cache/npm/_cacache/index-v5/1a/54"
...

20:41:48 rsync warning: some files vanished before they could be transferred (code 24) at main.c(1668) [generator=3.1.2]

The next build https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-selenium-docker/86723/consoleFull triggered at 20:41:49 which is just after that rsync failure from the previous build. It sync that faulty cache:

20:41:50 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-selenium-docker"
20:41:50 Syncing...

And it has the cache error which caused T295341:

INFO:quibble.commands:<<< Finish: Install MediaWiki, db=<MySQL /workspace/db/quibble-mysql-wyhpo7hl/socket> vendor=True, in 1.427 s
INFO:quibble.commands:>>> Start: npm install in /workspace/src
npm WARN old lockfile 
npm WARN old lockfile The package-lock.json file was created with an old version of npm,
npm WARN old lockfile so supplemental metadata must be fetched from the registry.
npm WARN old lockfile 
npm WARN old lockfile This is a one-time fix-up, please be patient...
npm WARN old lockfile 
npm WARN old lockfile FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/@babel%2fcompat-data: ENOENT: no such file or directory, lstat '/cache/npm/_cacache/content-v2/sha512/2f/d5/f60a2302e8511da1dc6d5287788ae504e66818011f2498a383ae486ca3487418e2995d426a9dd9e2a1df4e69d8d3d8f365072d48ed40a9d6619541c0d5ef'
npm WARN old lockfile     at Promise.then.then.catch.er (/srv/npm/node_modules/minipass-fetch/lib/body.js:162:15)
npm WARN old lockfile  Could not fetch metadata for @babel/compat-data@7.14.7 { FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/@babel%2fcompat-data: ENOENT: no such file or directory, lstat '/cache/npm/_cacache/content-v2/sha512/2f/d5/f60a2302e8511da1dc6d5287788ae504e66818011f2498a383ae486ca3487418e2995d426a9dd9e2a1df4e69d8d3d8f365072d48ed40a9d6619541c0d5ef'
npm WARN old lockfile     at Promise.then.then.catch.er (/srv/npm/node_modules/minipass-fetch/lib/body.js:162:15)
npm WARN old lockfile   stack:
npm WARN old lockfile    'FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/@babel%2fcompat-data: ENOENT: no such file or directory, lstat \'/cache/npm/_cacache/content-v2/sha512/2f/d5/f60a2302e8511da1dc6d5287788ae504e66818011f2498a383ae486ca3487418e2995d426a9dd9e2a1df4e69d8d3d8f365072d48ed40a9d6619541c0d5ef\'\n    at Promise.then.then.catch.er (/srv/npm/node_modules/minipass-fetch/lib/body.js:162:15)',
npm WARN old lockfile   code: 'ENOENT',
npm WARN old lockfile   errno: 'ENOENT',
npm WARN old lockfile   syscall: 'lstat',
npm WARN old lockfile   path:
npm WARN old lockfile    '/cache/npm/_cacache/content-v2/sha512/2f/d5/f60a2302e8511da1dc6d5287788ae504e66818011f2498a383ae486ca3487418e2995d426a9dd9e2a1df4e69d8d3d8f365072d48ed40a9d6619541c0d5ef',
npm WARN old lockfile   type: 'system' }

And the cache saving complained about 20:41:44 directory has vanished: "/cache/npm/_cacache/content-v2"

:-\

Looking at a file from the corrupted cache there is the timestamp 20:31:13 ( /srv/corrupted-quibble-vendor-mysql-php72-selenium-docker/npm/_cacache/index-v5/ff/d3/eca629ba696cac5a91e13f61bb768d7cbf82072f798d17071572ab3943f2 ). That would be the build https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-selenium-docker/86720/console

20:31:09 INFO:quibble.commands:Running webdriver test in /workspace/src/skins/Vector
20:31:11 npm WARN old lockfile 
20:31:11 npm WARN old lockfile The package-lock.json file was created with an old version of npm,
20:31:11 npm WARN old lockfile so supplemental metadata must be fetched from the registry.
20:31:11 npm WARN old lockfile 
20:31:11 npm WARN old lockfile This is a one-time fix-up, please be patient...
20:31:11 npm WARN old lockfile 
20:31:19 npm WARN EBADENGINE Unsupported engine { package: 'eslint-plugin-wdio@7.4.2',
20:31:19 npm WARN EBADENGINE   required: { node: '>=12.0.0' },
20:31:19 npm WARN EBADENGINE   current: { node: 'v10.24.0', npm: '7.21.0' } }
20:31:19 npm WARN EBADENGINE Unsupported engine { package: 'stylelint-no-unsupported-browser-features@5.0.1',
20:31:19 npm WARN EBADENGINE   required: { node: '>=12' },
20:31:19 npm WARN EBADENGINE   current: { node: 'v10.24.0', npm: '7.21.0' } }
20:31:26 npm WARN deprecated @stylelint/postcss-markdown@0.36.2: Use the original unforked package instead: postcss-markdown
20:31:26 npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
20:31:26 npm WARN deprecated har-validator@5.1.5: this library is no longer supported
20:31:26 npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
20:31:26 npm WARN deprecated chokidar@2.1.8: Chokidar 2 will break on node v14+. Upgrade to chokidar 3 with 15x less dependencies.
20:31:26 npm WARN deprecated chokidar@2.1.8: Chokidar 2 will break on node v14+. Upgrade to chokidar 3 with 15x less dependencies.
20:31:26 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
20:31:27 npm WARN deprecated uuid@3.4.0: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.
20:31:27 npm WARN deprecated request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
20:31:28 npm WARN deprecated popper.js@1.16.1: You can find the new Popper v2 at @popperjs/core, this package is dedicated to the legacy v1
20:31:29 npm WARN deprecated iltorb@2.4.5: The zlib module provides APIs for brotli compression/decompression starting with Node.js v10.16.0, please use it over iltorb
20:31:29 npm WARN deprecated highlight.js@9.12.0: Version no longer supported. Upgrade to @latest
20:31:33 npm WARN deprecated core-js@2.6.12: core-js@<3.3 is no longer maintained and not recommended for usage due to the number of issues. Because of the V8 engine whims, feature detection in old core-js versions could cause a slowdown up to 100x even if nothing is polyfilled. Please, upgrade your dependencies to the actual version of core-js.
20:31:56 
20:31:56 added 1608 packages, and audited 1609 packages in 47s

What I dont get is why some files would mysteriously vanish when rsync kicks in to save the cache :-\ The cache is stored on the Jenkins agent since we run the container with:

docker run --volume /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker/cache:/cache

The rsync is executed on the host and instructed to use a docker container to run the rsync command:

rsync --archive \
  '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no'
  '--rsync-path=docker run --rm -i --volume /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker/cache:/cache --entrypoint=/usr/bin/rsync docker-registry.wikimedia.org/releng/castor:0.2.4' \
  --delete-delay --delay-updates \
  jenkins-deploy@172.16.3.116:/cache/ \
  /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-selenium-docker

Maybe the directory entries exists (and rsync find them) but the file have not actually be written. It might be due to an issue when the previous container that mounted the cache terminates, Docker might not update the disk cache or something like that. The next container ends up encountering a filesystem that is not up to date. Maybe a filesystem sync would solve it.

Can we have Quibble detect this automatically, trigger a castor/cache clean, and then re-trigger the step? Regardless of the source bug and fixing it, we shouldn't punish users of CI with known-fails.

Mentioned in SAL (#wikimedia-releng) [2024-01-22T08:46:58Z] <hashar> integration-castor05: sudo rm -fR /srv/castor/castor-mw-ext-and-skins/master/mwgate-node18-docker # T295351