
[Regression] Jenkins: Jobs for npm testing are broken due to npm certificate issues on the new slaves
Closed, Resolved, Public

Description

Last working job:

https://integration.wikimedia.org/ci/job/mwext-VisualEditor-npm/942/console

- Feb 16, 2014
- Building remotely on integration-slave01

- node v0.10.22
- npm v1.1.38

First failing job:

https://integration.wikimedia.org/ci/job/mwext-VisualEditor-npm/943/console

- Feb 18, 2014
- Building remotely on integration-slave02

- node v0.8.2
- npm v1.1.39

16:23:41 npm ERR! Error: SSL Error: CERT_UNTRUSTED
16:23:41 npm ERR! at ClientRequest.<anonymous>
16:23:41 npm ERR! at Socket.ondata (stream.js:38:26)
16:23:41 npm ERR! [Error: SSL Error: CERT_UNTRUSTED]

Server log:

https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=99596&oldid=99549


- February 17
  - 16:15 hashar: Jenkins deleting slave integration-slave01
  - 16:14 hashar: Jenkins added two labs slaves with 4 CPU: integration-slave02 and integration-slave03
  - 08:46 hashar: Upgrading Jenkins, half an hour downtime

So npm was upgraded by one patch release (v1.1.38 to v1.1.39), nodejs was downgraded from the v0.10 branch to the v0.8 branch (v0.10.22 to v0.8.2), and (possibly unrelated) npm seems to be unable to verify the certificate properly.

According to existing bug reports, this is related to the certificate being self-signed. However, that shouldn't be a problem, since a validation mechanism for their official certificate ships with the npm package. Upstream recommends upgrading to the most recent minor version, but an outdated npm doesn't seem to be the problem here: the bug started happening for us overnight (no certificate change upstream) when we went up from v1.1.38 to v1.1.39, not down.


Version: wmf-deployment
Severity: normal
See Also:
https://github.com/npm/npm/issues/4838

Details

Reference
bz61508

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:03 AM
bzimport set Reference to bz61508.

So the certificates partially ship with nodejs, not the npm package.

Upstream nodejs takes care to backport these cert changes to v0.8. However, the slaves were not only downgraded from v0.10 to v0.8, but also to a much older release of the v0.8 branch.

There have been 24 (!) releases on the v0.8 branch since v0.8.2; the latest is v0.8.26.
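A quick way to catch this kind of silent downgrade is to compare the installed version against a floor before running any npm job. The following is a minimal sketch, not something that exists on the slaves: `version_ge` and `check_node` are hypothetical helper names, the 0.10.0 floor is an assumption for illustration, and it relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# A leading "v" (as printed by `node --version`) is stripped first.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    local have="${1#v}" want="${2#v}"
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Hypothetical check: report whether a node version meets a minimum.
# The default floor of 0.10.0 is an assumption for illustration.
check_node() {
    local found="$1" minimum="${2:-0.10.0}"
    if version_ge "$found" "$minimum"; then
        echo "ok: node $found"
    else
        echo "too old: node $found (need >= $minimum)"
        return 1
    fi
}
```

On a slave this could be called as `check_node "$(node --version)"` at the start of a job, so that a downgraded instance fails early with a clear message instead of a certificate error.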

In case this particular cert change can be mitigated by only upgrading npm, I've done a manual upgrade of npm on the individual instances using npm itself:

$ sudo -s
$ npm conf set strict-ssl false
$ npm install -g npm
npm 1.1.39
...
npm 1.4.3
$ npm conf set strict-ssl true
$ cd /tmp && mkdir foo123 && cd foo123
$ npm install jshint
success

The jobs still fail after this because the new labs slaves are also missing grunt.

$ sudo -s
$ npm install -g grunt-cli
success

Note that I couldn't interact with npm the normal way because, for some reason, /home is read-only on the integration instances (even as root, npm still uses your original home directory for some temporary work and caching).

I had to run export HOME=/root to work around that.
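The HOME workaround can be made a little more defensive by only switching when the current home is not writable. This is a rough sketch under that assumption; `first_writable_dir` is a hypothetical helper, and `[ -w ... ]` is assumed to report a read-only mount correctly on these instances.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first candidate that is an existing,
# writable directory; return non-zero if none of them qualifies.
first_writable_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Usage sketch (not run here): keep HOME if it is writable,
# otherwise fall back to /root as described above.
#   HOME="$(first_writable_dir "$HOME" /root)" && export HOME
```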

krinkle at integration-slave03:

# Enter root and fix HOME so that npm doesn't put its cache in /home,
# which is read-only in labs (why?)

$ sudo -s
$ export HOME=/root

# Temporarily disable ssl check

$ npm conf set strict-ssl false

# Remove symlink to the apt-get installed version,
# because npm is not allowed to delete this shadow

$ l /usr/bin/npm

/usr/bin/npm -> /etc/alternatives/npm*

$ rm /usr/bin/npm

# Upgrade npm

$ /etc/alternatives/npm install -g npm

npm@1.1.39
...
npm@1.4.3

# Re-enable ssl check

$ npm conf set strict-ssl true

# Verify that stuff works by doing an
# install of an example package (jshint)
# in a tmp dir

$ cd /tmp && mkdir foo123 && cd foo123
$ npm install jshint
...
success
$ cd ~
$ rm -rf /tmp/foo123

# Install Grunt

$ npm install -g grunt-cli

...
success
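One risk in the manual steps is leaving strict-ssl disabled if the upgrade fails halfway. A possible guard, sketched here as an assumption rather than what was actually run, wraps the risky command so the check is always switched back on. `with_ssl_check_disabled` is a hypothetical helper, the `NPM` variable exists only so the flow can be dry-run with a stub, and `npm config set` is used as the unabbreviated spelling of the command.

```shell
#!/usr/bin/env bash
# NPM is overridable so the flow can be exercised with a stub
# (an assumption for illustration; the manual steps called npm directly).
NPM="${NPM:-npm}"

# Hypothetical helper: run a command with npm's strict-ssl check off,
# then turn the check back on whether or not the command succeeded.
with_ssl_check_disabled() {
    local status=0
    "$NPM" config set strict-ssl false || return 1
    "$@" || status=$?
    "$NPM" config set strict-ssl true || return 1
    return "$status"
}

# Usage sketch (not run here), mirroring the manual steps:
#   export HOME=/root
#   with_ssl_check_disabled /etc/alternatives/npm install -g npm
```

The wrapper returns the wrapped command's exit status, so a failed upgrade is still visible to the caller after the check has been restored.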

integration-slave01 received nodejs 0.10.x when that package was added to apt.wikimedia.org. The package was later removed, but the instance never got cleaned up.

As for npm, I have no idea; probably similar.

I don't want the slaves to be tweaked manually; everything must be in puppet. So there are a few bugs that we should file, all related to updating packages in apt.wikimedia.org:

  • nodejs 0.10.x (that is apparently a work in progress)
  • npm 1.3.10 should be backported from Ubuntu Trusty
  • grunt-cli needs to be packaged

Then we can update the list of packages in the operations/puppet.git file ./modules/contint/manifests/packages/labs.pp. It lists npm but not grunt-cli, since there is no package for it.

Does it sound right?

Lowering priority and assigning back to Timo. He applied a workaround. We still have to file the bugs mentioned in comment #3.

(In reply to Antoine "hashar" Musso from comment #3)

> integration-slave01 received nodejs 0.10.x when that package was added to
> apt.wikimedia.org. The package was later removed, but the instance never got
> cleaned up.
>
> As for npm, I have no idea; probably similar.
>
> I don't want the slaves to be tweaked manually; everything must be in puppet.
> So there are a few bugs that we should file, all related to updating packages
> in apt.wikimedia.org:
>
>   • nodejs 0.10.x (that is apparently a work in progress)
>   • npm 1.3.10 should be backported from Ubuntu Trusty
>   • grunt-cli needs to be packaged
>
> Then we can update the list of packages in the operations/puppet.git file
> ./modules/contint/manifests/packages/labs.pp. It lists npm but not grunt-cli,
> since there is no package for it.
>
> Does it sound right?

Yes, except for grunt-cli needing to be packaged. We explicitly don't want to do that. Like the over 300 other arbitrary npm modules we fetch daily on the integration slaves, based on what is listed in package.json in local repositories, this is just another package like that. We can and should (for consistency, and to get the right version) install it via npm.

I'm sure there is a puppet syntax for ensuring a certain shell command has been executed (e.g. based on a certain file existing). This is similar to how we use git::clone in some places, and the puppet file{} syntax. Those aren't provisioned packages; they are specified inline in our manifest and created by something other than a package (an rb template file, a git clone or, in this case, an npm install).
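In Puppet, this guard is what the `exec` resource's `creates` parameter provides. The same idea can be sketched in plain shell as follows; `run_unless_exists` is a hypothetical helper, and using /usr/bin/grunt as the marker is an assumption based on where a global grunt-cli install is shown to land elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command only if a marker path does not exist.
# Usage: run_unless_exists <path> <command> [args...]
run_unless_exists() {
    local marker="$1"
    shift
    if [ -e "$marker" ]; then
        echo "skip: $marker already exists"
        return 0
    fi
    "$@"
}

# Usage sketch (not run here):
#   run_unless_exists /usr/bin/grunt npm install -g grunt-cli
```

Because the command is skipped once the marker exists, the call is safe to repeat on every provisioning run, which is the property the manifest would need.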

For some reason I managed to get the pmtpa slave nodes back to nodejs 0.8.x, which breaks the VisualEditor npm jobs.

I also created two new slaves in eqiad (integration-slave1001 and integration-slave1002) and they come up with nodejs 0.8.x as well.

Will mail ops list to figure out how to get nodejs 0.10.x marked for install on those hosts.

(In reply to Antoine "hashar" Musso from comment #6)

> Will mail ops list to figure out how to get nodejs 0.10.x marked for install
> on those hosts.

I don't know if it's the same for you, but docs were wrong for me.
https://www.mediawiki.org/w/index.php?title=Parsoid%2FSetup&diff=930615&oldid=930612

Mailed ops list. The Parsoid and VisualEditor npm jobs are now failing and preventing changes from being merged until the SSL cert issue is properly fixed.

(In reply to Nemo from comment #7)

> I don't know if it's the same for you, but docs were wrong for me.
> https://www.mediawiki.org/w/index.php?title=Parsoid%2FSetup&diff=930615&oldid=930612

The doc says to use a PPA, which provides 0.10.x. We do not use PPAs.

I was a bit upset this morning. I have applied Timo's fix from comment #2 on all four instances:

integration-slave02.pmtpa.wmflabs
integration-slave03.pmtpa.wmflabs
integration-slave1001.eqiad.wmflabs
integration-slave1002.eqiad.wmflabs

Seems to work now.

https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=115053

20:08 hashar: Jenkins unpolled integration-slave1003 npm is outdated there and does not trust npmregistry.org ( bug 61508 )
22:37 Krinkle: Hack-patching integration-slave1003.eqiad.wmflabs per https://bugzilla.wikimedia.org/show_bug.cgi?id=61508#c2

krinkle at integration-slave1003.eqiad.wmflabs in ~
$ node --version
v0.8.2

$ npm --version
1.1.39

$ sudo -s

# export HOME=/root
# npm conf set strict-ssl false
# l /usr/bin/npm
/usr/bin/npm -> /etc/alternatives/npm*

# rm /usr/bin/npm
# /etc/alternatives/npm install -g npm
...
npm@1.4.13 /usr/lib/node_modules/npm

# cd /tmp && mkdir foo123 && cd foo123
# npm install jshint
.. success ..
jshint@2.5.1 node_modules/jshint

# l `which npm`
/usr/bin/npm -> ../lib/node_modules/npm/bin/npm-cli.js*

# npm conf set strict-ssl true
# cd ~ && rm -rf /tmp/foo123/
# npm install -g grunt-cli
.. success ..
/usr/bin/grunt -> /usr/lib/node_modules/grunt-cli/bin/grunt
grunt-cli@0.1.13 /usr/lib/node_modules/grunt-cli

# npm --version
1.4.13

# grunt --version
grunt-cli v0.1.13

*** Bug 66048 has been marked as a duplicate of this bug. ***

(In reply to Antoine "hashar" Musso from bug 66048 comment #1)

> Node marked offline on
> https://integration.wikimedia.org/ci/computer/integration-slave1003/

Brought back online.

Thank you Timo for fixing up the installation on integration-slave1003. I guess we can close that bug, since you proposed using node 0.10.x on bug 66056, which would definitely fix the issue.

Existing instances have been patched, so marking this as fixed.

The fact that we need to puppetize the patches is a separate bug.

See:

  • bug 68256
  • bug 66056

And the patches are documented at:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup