Many CI builds failing with ECONNRESET from npmjs.org during install:bridge steps
Closed, Resolved, Public

Description

Since today, an unusual number of selenium builds seem to fail with errors like this:

15:34:30 npm error code ECONNRESET
15:34:30 npm error errno ECONNRESET
15:34:30 npm error network Invalid response body while trying to fetch https://registry.npmjs.org/@typescript-eslint%2fvisitor-keys: aborted
15:34:30 npm error network This is a problem related to network connectivity.
15:34:30 npm error network In most cases you are behind a proxy or have bad network settings.
15:34:30 npm error network
15:34:30 npm error network If you are behind a proxy, please make sure that the
15:34:30 npm error network 'proxy' config is set properly.  See: 'npm help config'
15:34:30 npm error A complete log of this run can be found in: /cache/npm/_logs/2025-01-08T14_33_52_660Z-debug-0.log
15:34:30 ERROR: "install:bridge" exited with 1.
15:34:30 npm error code 1
15:34:30 npm error path /workspace/src/extensions/Wikibase
15:34:30 npm error command failed
15:34:30 npm error command sh -c npm-run-all -p install:*

@Michael noticed that these errors seemingly always come from the install:bridge step in Wikibase, even though (AFAIK) that’s far from the only place where we install dependencies from npm. But the errors don’t always happen – some builds go through successfully. My gut feeling is that roughly half the builds fail at the moment.
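To check the observation about install:bridge across many builds, one could scan each console log for the two markers visible in the excerpt above. The following is only a minimal sketch: the function name `summarize_console_log` is hypothetical, and the patterns are assumptions based solely on the log lines quoted in this task, not on any documented npm or npm-run-all output format.

```python
import re

# Patterns assumed from the log excerpt in the task description.
ECONNRESET_RE = re.compile(r"npm error code ECONNRESET")
FAILED_STEP_RE = re.compile(r'ERROR: "([^"]+)" exited with (\d+)\.')


def summarize_console_log(text):
    """Return (saw_econnreset, failed_steps) for one build's console log.

    saw_econnreset: True if any line reports the ECONNRESET error code.
    failed_steps: names of the steps reported as 'ERROR: "<step>" exited with N.'
    """
    saw_reset = bool(ECONNRESET_RE.search(text))
    failed_steps = [m.group(1) for m in FAILED_STEP_RE.finditer(text)]
    return saw_reset, failed_steps


sample = '''15:34:30 npm error code ECONNRESET
15:34:30 npm error errno ECONNRESET
15:34:30 ERROR: "install:bridge" exited with 1.
'''
print(summarize_console_log(sample))
```

Running this over the console logs of the failed builds and tallying `failed_steps` would show whether the resets really only surface in install:bridge or also in other install steps.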

Affected builds (incomplete list):

Event Timeline

Lucas_Werkmeister_WMDE updated the task description.

Something in the back of my head says we had similar failures before, maybe a few months ago (where my first instinct was to blame npm / Microsoft just out of habit ;) but it turned out to be something else), but I can’t find it right now.

I’ve also seen four builds (so far) where Cypress failed to download like this:

npm error The Cypress App could not be downloaded.
npm error 
npm error Does your workplace require a proxy to be used to access the Internet? If so, you must configure the HTTP_PROXY environment variable before downloading Cypress. Read more: https://on.cypress.io/proxy-configuration
npm error 
npm error Otherwise, please check network connectivity and try again:
npm error 
npm error ----------
npm error 
npm error URL: https://download.cypress.io/desktop/13.15.1?platform=linux&arch=x64
npm error Error: Corrupted download
npm error 
npm error Expected downloaded file to have checksum: 6f9f55b993bda7efaa62209a311fa95027b51e8c71f68b557d02821caf8187f5503149117df8ffc2c2d097f7a49758c48ae91fbc9d65d11acf14542970a2cfb0
npm error Computed checksum: ca1ea47e112b153d677b826eef8b0f1c87d7bb9783953d86dfa50bee8f58268709ef95f078b77d93e5039c04cbecbb7bd96ef6b3b73fe1ef1b76ca5f8d035e1e
npm error 
npm error Expected downloaded file to have size: 199413632
npm error Computed size: 3665553
npm error 
npm error ----------
npm error 
npm error Platform: linux-x64 (Debian - 10.13)
npm error Cypress Version: 13.15.1

I feel like this might also be connection resets under the hood which Cypress is just reporting differently (as truncated files that then don’t have the right length/checksum).

Yeah, the Cypress downloader code just pipes the body of one HTTP response into the file and then verifies it. I bet if the connection gets reset it looks exactly like this (rather than reporting the connection reset error or retrying or whatever).
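The pipe-then-verify behaviour described here can be illustrated with a small sketch. This is not Cypress's actual code (which is JavaScript); it is a Python approximation under two assumptions: that the checksum is SHA-512 (the 128-hex-digit values in the log are consistent with that), and that a reset connection shows up to the reader as a body that simply ends early. The function name `copy_and_verify` is hypothetical.

```python
import hashlib
import io


def copy_and_verify(response_body, dest, expected_size, expected_sha512):
    """Pipe a response body into dest, then compare size and checksum.

    Returns a list of problems; an empty list means the download verified.
    """
    digest = hashlib.sha512()
    size = 0
    while True:
        chunk = response_body.read(64 * 1024)
        if not chunk:
            # Normal end of stream. In this sketch, a body that was cut
            # short looks exactly the same at this point.
            break
        dest.write(chunk)
        digest.update(chunk)
        size += len(chunk)

    problems = []
    if digest.hexdigest() != expected_sha512:
        problems.append("checksum mismatch")
    if size != expected_size:
        problems.append(f"size mismatch: expected {expected_size}, got {size}")
    return problems


full = b"x" * 1000
expected = hashlib.sha512(full).hexdigest()

# Simulate a connection that dropped after only a few bytes arrived.
truncated = io.BytesIO(full[:18])
print(copy_and_verify(truncated, io.BytesIO(), len(full), expected))
```

A truncated body produces both a checksum mismatch and a size mismatch, which matches the shape of the "Corrupted download" report above (expected size 199413632, computed size 3665553) without any mention of the underlying network error.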

This happens a lot for the MediaWiki-extensions-CodeMirror repo, and is happening currently for https://gerrit.wikimedia.org/r/1099840 (see https://integration.wikimedia.org/ci/job/mwext-node20-rundoc/838/console)

In the past it has been fixed (for CodeMirror) by @Jdforrester-WMF or Release Engineering clearing the cache. James told me it happens roughly once every two months across all of CI, but I'd say it's maybe once a month for CodeMirror by itself.

Mentioned in SAL (#wikimedia-releng) [2025-01-08T22:07:53Z] <hashar> castor: deleting potentially corrupted npm cache. On integration-castor05: sudo rm -fR /srv/castor/castor-mw-ext-and-skins/master/{wmf-quibble-selenium-php74,quibble-vendor-mysql-php74-selenium}/npm # T383237

hashar subscribed.

I have nuked the npm cache of two jobs, which should address the checksum mismatch encountered when installing Cypress.

For the other builds failing due to ECONNRESET, and the topic of this task: I have no idea. It could be the npmjs registry hard rate limiting us or maybe the WMCS network dropping connections. Now browsing https://status.npmjs.org/ there is a notice:

WARNING: npm is experiencing intermittent degraded installs. This incident affects: Package installation. Posted 11 minutes ago. Jan 08, 2025 - 22:05 UTC

So I am marking that Upstream for now.

hashar claimed this task.

https://status.npmjs.org/incidents/jlqm624klvs2 :

Identified
The issue has been identified and a fix is being implemented.
Posted 9 hours ago. Jan 08, 2025 - 23:12 UTC

Resolved
This incident has been resolved.
Posted 7 hours ago. Jan 09, 2025 - 00:50 UTC

Searching the CI build logs for the last 7 hours (= 420 minutes) yields no matches, although there are matches before that.

I am assuming this one was indeed an issue with upstream. Thank you for the report!

That wouldn’t explain the Cypress download errors… but I guess I’ll see if any new instances of that crop up today. Thanks!