Quibble CI jobs time out after 30min due to instance stalling at "npm install parse" step
Closed, ResolvedPublic

Description

Today some MediaWiki jobs that usually take ~ 10 minutes to accomplish end up reaching the 30 minutes timeout. They might be CPU starved, though I also noticed a few builds for which npm install took ~ 20 minutes by itself, so it might be a networking issue.

Example for the wmf-quibble-vendor-mysql-hhvm-docker job.

It roams among Jenkins slaves having the m4executor label.

That job supposedly takes 12 minutes to run in the best case. The last few builds, some failed after the 30 minutes timeout (see below).

Update looks like the issue is npm install issuing npm ERR! registry error parsing json and idling/delaying install by 10 minutes.

From https://support.cloudflare.com/hc/en-us/articles/200171896-Why-am-I-getting-a-503-Service-Temporarily-Unavailable-error-message-

$ curl https://registry.npmjs.org/cdn-cgi/trace
fl=16f139
h=registry.npmjs.org
ip=208.80.155.255
ts=1530197834.216
visit_scheme=https
uag=curl/7.38.0
colo=IAD
spdy=off
http=http/1.1
loc=US
Success#42518 minintegration-slave-docker-1011
Aborted#42422 minintegration-slave-docker-1013
Aborted#42324 minintegration-slave-docker-1006
Success#42213 minintegration-slave-docker-1004
Failed#42130 minintegration-slave-docker-1007
Failed#42030 minintegration-slave-docker-1013
Failed#41914 minintegration-slave-docker-1004
Failed#41830 minintegration-slave-docker-1006
Aborted#4176 min 42 secintegration-slave-docker-1005
Success#41620 minintegration-slave-docker-1005
Failed#41530 minintegration-slave-docker-1005
Failed#41422 minintegration-slave-docker-1011
Success#41314 minintegration-slave-docker-1005
Failed#4128 min 31 secintegration-slave-docker-1008
Failed#41130 minintegration-slave-docker-1001
Success#41012 minintegration-slave-docker-1005
Failed#40930 minintegration-slave-docker-1008
Failed#40830 minintegration-slave-docker-1005
Failed#40730 minintegration-slave-docker-1005
Failed#40615 minintegration-slave-docker-1004
Success#40514 minintegration-slave-docker-1004
Success#40429 minintegration-slave-docker-1003
Success#40321 minintegration-slave-docker-1010
Success#40211 minintegration-slave-docker-1012
Success#40116 minintegration-slave-docker-1006
Success#4009 min 54 secintegration-slave-docker-1015
There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 27 2018, 7:18 PM
hashar updated the task description. (Show Details)Jun 27 2018, 7:22 PM

An example of npm install, time is elapsed time in hh:mm:ss since start of build:

00:06:28.594 INFO:test.run_extskin:Running "npm test" for CirrusSearch
00:06:44.719 npm ERR! registry error parsing json
00:06:44.726 npm ERR! registry error parsing json
00:07:12.043 npm WARN deprecated nomnom@1.8.1: Package no longer supported. Contact support@npmjs.com for more info.
00:07:21.198 npm ERR! registry error parsing json
00:17:22.592 npm ERR! registry error parsing json
00:18:34.473 npm WARN prefer global coffeescript@1.10.0 should be installed with -g
00:18:35.518 npm WARN prefer global jsonlint@1.6.2 should be installed with -g
00:18:35.781 npm WARN prefer global jshint@2.9.5 should be installed with -g
00:18:36.149 
...
00:18:42.676 > grunt test

Or more than 12 minutes to do the install step. That build happened on integration-slave-docker-1009.

From labs-capacity-planning Grafana board:

  • labvirt1016 has huge spikes of disk IO every ten minutes
  • labvirt1010 seems IO starved as well
  • labvirt1006 has CPU guest raising every half hour
  • labvirt1001 and labvirt1012 seems to have CPU guest spike as well every half hour

Maybe CI, or another project, put too much pressures on the instances.

Finally from the top-instances Grafana board

Some instances have high iowait: for example fastcci-worker1 ( link ). fastcci-worker2 and fastcci-worker3 got fixed around 18:50 utc

Top instances per steal CPU, indicating the underlying kvm cant allocate them enough CPU. That might help pinning which labvirt has an issue:

InstanceSteal CPU
deployment-cache-text04.deployment-prep.eqiad.wmflabs13%
tools-flannel-etcd-02.tools.eqiad.wmflabs11%
dumpstest-please-dont-delete.testlabs.eqiad.wmflabs9%
d-3.analytics.eqiad.wmflabs8%
mediawiki2latex.collection-alt-renderer.eqiad.wmflabs7%
deployment-fluorine02.deployment-prep.eqiad.wmflabs6%
wikibase.wikidata-dev.eqiad.wmflabs6%
videodev.video.eqiad.wmflabs6%
togetherjs.visualeditor.eqiad.wmflabs6%
deployment-aqs02.deployment-prep.eqiad.wmflabs6%

The usual tools-webgrid-lighttpd having high user cpu:

InstanceUser CPU
tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs101%
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1406.tools.eqiad.wmflabs100%
tools-webgrid-lighttpd-1425.tools.eqiad.wmflabs98%
taxonbot.dwl.eqiad.wmflabs92%
Jdforrester-WMF triaged this task as High priority.Jun 27 2018, 7:50 PM

Not sure if this is quite at UBN yet, but it's at least High.

The MediaWiki jobs are running integration-slave-docker-1001 to 1015 which is the m4executor label in Jenkins. Their CPU usage do not seems alarming.

We would need a cloud admin to look at OpenStack to find out on which labvirt the integration-slave-docker instances are scheduled on. From there, look at what is going on on those labvirt servers. There is potentially a CPU starvation (some instances have high steal CPU) and certainly an I/O starvation on one of the host (labvirt1010 and labvirt1016).

hashar updated the task description. (Show Details)Jun 28 2018, 7:49 AM

Looks like the issue is npm install issuing npm ERR! registry error parsing json and idling/delaying install by 10 minutes. I will look at the cache, it might be corrupted.

Change 442789 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Use npm --verbose on some Quibble jobs

https://gerrit.wikimedia.org/r/442789

Change 442789 merged by jenkins-bot:
[integration/config@master] Use npm --verbose on some Quibble jobs

https://gerrit.wikimedia.org/r/442789

Change 442791 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Use a different job for npm --verbose

https://gerrit.wikimedia.org/r/442791

Change 442791 merged by jenkins-bot:
[integration/config@master] Use a different job for npm --verbose

https://gerrit.wikimedia.org/r/442791

hashar removed a project: Cloud-VPS.EditedJun 28 2018, 8:48 AM

A build using npm install --verbose is https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php70-docker/1480/console
It locks just after:

00:01:36.025 npm verb afterAdd /cache/npm/to-array/0.1.4/package/package.json written

That json file is valid, at least according to json_pp.

In the container there are four stall npm process:

`-npm(1)-+-{V8 WorkerThread}(1)
         |-{V8 WorkerThread}(1)
         |-{V8 WorkerThread}(1)
         |-{V8 WorkerThread}(1)
         |-{node}(1)
         |-{npm}(1)
         |-{npm}(1)
         |-{npm}(1)
         `-{npm}(1)

Eventually after 10 minutes, the registry gave some HTML output with a 503 Temporarily unavailable when trying to get the component-bind module:

00:11:45.014 npm http 503 https://registry.npmjs.org/component-bind
00:11:45.014 npm verb bad json <!DOCTYPE html>
00:11:45.015 npm verb bad json <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
00:11:45.015 npm verb bad json <!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
00:11:45.015 npm verb bad json <!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
00:11:45.016 npm verb bad json <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
00:11:45.016 npm verb bad json <head>
00:11:45.016 npm verb bad json <title>Temporarily unavailable | registry.npmjs.org | Cloudflare</title></title>
...
00:11:45.033 npm ERR! registry error parsing json

00:11:45.033 npm verb headers { date: 'Thu, 28 Jun 2018 08:40:28 GMT',
00:11:45.033 npm verb headers   'content-type': 'text/html; charset=UTF-8',
00:11:45.034 npm verb headers   'content-length': '3291',
00:11:45.034 npm verb headers   connection: 'keep-alive',
00:11:45.034 npm verb headers   'set-cookie': [ '__cfduid=d6e61ae1e4d173c478b8a965aedcfb5741530174618; expires=Fri, 28-Jun-19 08:30:18 GMT; path=/; domain=.registry.npmjs.org; HttpOnly' ],
00:11:45.035 npm verb headers   'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
00:11:45.035 npm verb headers   server: 'cloudflare',
00:11:45.035 npm verb headers   'cf-ray': '431eca64f8319fc6-IAD' }

The retry is similar:

00:11:45.036 npm info retry will retry, error on last attempt: SyntaxError: Unexpected token < in JSON at position 0
00:11:45.036 npm info retry <!DOCTYPE html>
00:11:45.038 npm info retry <title>Temporarily unavailable | registry.npmjs.org | Cloudflare</title></title>

But eventually it pass and get downloaded/cached:

00:11:55.032 npm info attempt registry request try #2 at 8:40:38 AM
00:11:55.034 npm verb etag W/"5861bd0e9c408f075f381328c70e835d"
00:11:55.034 npm verb lastModified Sat, 26 May 2018 20:24:17 GMT
00:11:55.034 npm http request GET https://registry.npmjs.org/component-bind
00:11:55.074 npm http 304 https://registry.npmjs.org/component-bind
00:11:55.076 npm verb headers { date: 'Thu, 28 Jun 2018 08:40:38 GMT',
00:11:55.076 npm verb headers   connection: 'close',
00:11:55.076 npm verb headers   'set-cookie': [ '__cfduid=d3267284850ea202cce4774c1211547931530175238; expires=Fri, 28-Jun-19 08:40:38 GMT; path=/; domain=.registry.npmjs.org; HttpOnly' ],
00:11:55.077 npm verb headers   'cache-control': 'max-age=300',
00:11:55.077 npm verb headers   'cf-cache-status': 'HIT',
00:11:55.077 npm verb headers   'cf-ray': '431ed9885f0c9580-IAD',
00:11:55.077 npm verb headers   etag: '"5861bd0e9c408f075f381328c70e835d"',
00:11:55.078 npm verb headers   'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
00:11:55.078 npm verb headers   'last-modified': 'Sat, 26 May 2018 20:24:17 GMT',
00:11:55.078 npm verb headers   vary: 'accept-encoding, accept',
00:11:55.078 npm verb headers   server: 'cloudflare' }
00:11:55.078 npm verb etag https://registry.npmjs.org/component-bind from cache
00:11:55.079 npm verb get saving component-bind to /cache/npm/registry.npmjs.org/component-bind/.cache.json

So that is most probably not related to Cloud-VPS , we might just be hammering npmjs too much and overflow some limitation they have.

https://www.cloudflare.com/5xx-error-landing/?utm_source=error_footer

Change 442799 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Remove npm --verbose

https://gerrit.wikimedia.org/r/442799

Change 442799 merged by jenkins-bot:
[integration/config@master] Remove npm --verbose

https://gerrit.wikimedia.org/r/442799

With system time (UTC):

08:30:19 npm verb afterAdd /cache/npm/to-array/0.1.4/package/package.json written
08:40:28 npm http 503 https://registry.npmjs.org/component-bind
08:40:28 npm verb headers { date: 'Thu, 28 Jun 2018 08:40:28 GMT',
08:40:28 npm verb headers   'set-cookie': [ '__cfduid=d6e61ae1e4d173c478b8a965aedcfb5741530174618; expires=Fri, 28-Jun-19 08:30:18 GMT; path=/; domain=.registry.npmjs.org; HttpOnly' ],

The request and set-cookie indicate 08:30. The response and date header happens at 08:40. So I guess it is CloudFlare holding the request for 10 minutes for some reason. I am not sure why npm does not abort the connection way before those 10 minutes though.

For the npm side: npm-registry-client relies on request which creates an http.ClientRequest. On Nodejs v6.11.0 it does not seem to have any default timeout. I guess after 10 minutes, CloudFlare aborts the connection with an HTML error page.

Change 442857 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Force npm loglevel to 'info'

https://gerrit.wikimedia.org/r/442857

Change 442857 merged by jenkins-bot:
[integration/config@master] Force npm loglevel to 'info'

https://gerrit.wikimedia.org/r/442857

Mentioned in SAL (#wikimedia-operations) [2018-06-28T14:26:22Z] <hashar> CI jobs running npm might suffer from a 10 minutes delay since June 27th | T198348

hashar updated the task description. (Show Details)Jun 28 2018, 2:58 PM

I have sent a summary to Cloudflare support via https://support.cloudflare.com/ . We might very well be rate limited by their CDN protection system :-(

See T179229 for a potential way to mitigate/resolve.

Other persons are reporting similar issues in https://github.com/npm/npm/issues/21101

There are also likely duplicates such as 21102 and 21103.

greg added a subscriber: greg.Jun 28 2018, 5:17 PM

Other persons are reporting similar issues in https://github.com/npm/npm/issues/21101

There are also likely duplicates such as 21102 and 21103.

And all of those have been closed due to "not the right issue tracker" reasons :/

They did link to https://status.npmjs.org/incidents/51c7q80zsj9f which is hopefully the same root issue.

Given that upstream marked them as invalid, I opened https://github.com/npm/npm-registry-client/issues/174 as a bit more pointed "do this" request.

@greg wrote:

They did link to https://status.npmjs.org/incidents/51c7q80zsj9f which is hopefully the same root issue.

https://github.com/npm/npm/issues/21101 was definitely on the wrong repo, I guess it eventually got forward to npmjs maintainer \o/ That status message is encouraging.

@Jdforrester-WMF wrote:

I opened https://github.com/npm/npm-registry-client/issues/174 as a bit more pointed "do this" request.

I will look at adding a few more details from my debugging session :] Thank you!

Intermittent 500 errors when installing packages

Investigating
We are currently investigating the issue
Posted about 2 hours ago. Jun 28, 2018 - 16:48 UTC

Monitoring
We have implemented a fix and are currently monitoring the situation
Posted about 1 hour ago. Jun 28, 2018 - 16:59 UTC

\o/

Krinkle renamed this task from CI jobs takes too long / instances overloaded to Quibble CI jobs time out after 30min due to instance stalling at "npm install parse" step.Jun 28 2018, 6:30 PM

That still happens has shown in the info log of npm install :\

Yeah, the time-outs are still happening.

@hashar Have we purged the castor caches for mediawiki-*/npm at least once since this issue started? I think that may be worth doing as well. Several of the reports I found relating to this error were relating to cache corruption, which can survive even after the upstream server-side is fixed.

cscott added a subscriber: cscott.Jun 28 2018, 8:46 PM

parallel-lint is timing out and failing as well: https://gerrit.wikimedia.org/r/442209

parallel-lint is timing out and failing as well: https://gerrit.wikimedia.org/r/442209

That is a patch for Wikibase which has a lot of php files. parallel-lint with HHVM is slow hence the 350 seconds timeout being reached. Can you fill it as another task please? It is probably missing a COMPOSER_PROCESS_TIMEOUT=600 and ideally we should only lint files that got changed by the patch (which we do for mediawiki/core).

Yeah, the time-outs are still happening.

@hashar Have we purged the castor caches for mediawiki-*/npm at least once since this issue started? I think that may be worth doing as well. Several of the reports I found relating to this error were relating to cache corruption, which can survive even after the upstream server-side is fixed.

I checked a few cache to make sure the .json files were indeed valid json. At least I haven't found any containing the CloudFlare HTML message. It is probably worth trying by nuking the cache for a couple jobs and see whether that fix it.

The caches are on castor02.integration.eqiad.wmflabs under /srv/jenkins-workspace/caches . The top of a job console has the 'castor namespace' shown which is the sub directory. Example: castor-mw-ext-and-skins/master/wmf-quibble-vendor-mysql-hhvm-docker

Mentioned in SAL (#wikimedia-releng) [2018-06-28T21:07:26Z] <hasharAway> castor: nuking caches mediawiki-core/master/mediawiki-quibble-vendor-mysql-php70-docker and mediawiki-core/master/mediawiki-quibble-vendor-mysql-hhvm-docker | T198348

Mentioned in SAL (#wikimedia-releng) [2018-06-28T21:09:23Z] <hasharAway> castor: nuking caches castor-mw-ext-and-skins/master/wmf-quibble-vendor-mysql-hhvm-docker/npm and castor-mw-ext-and-skins/master/wmf-quibble-vendor-mysql-php70-docker/npm | T198348

Mentioned in SAL (#wikimedia-releng) [2018-06-28T21:17:46Z] <hasharAway> castor02: nuking cache of npm/node jobs via rm -fR /srv/jenkins-workspace/caches/*/*/*node* (note: other jobs might still have a npm cache) | T198348

Change 442988 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Bump Quibble jobs timeout from 30 to 45 minutes

https://gerrit.wikimedia.org/r/442988

Change 442988 merged by jenkins-bot:
[integration/config@master] Bump Quibble jobs timeout from 30 to 45 minutes

https://gerrit.wikimedia.org/r/442988

Change 442989 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "Bump Quibble jobs timeout from 30 to 45 minutes"

https://gerrit.wikimedia.org/r/442989

Plan B will be to setup some npm registry cache on our infrastructure.

Tentatively the incident is resolved on npmjs side:

https://status.npmjs.org/incidents/51c7q80zsj9f
Resolved
This incident has been resolved.
Jun 28, 2018 - 21:56 UTC

Bumping the timeout to 45 minutes helped get builds to pass, thus avoiding to have the jobs to be rebuild.

I still some slow build but they don't show the npm ERR! registry error parsing json. I suspect it due to the Jenkins slave running on an overloaded labvirt somehow. That was my first hypothesis on that task, I will dig into that.

From my investigation it seems the npm issue is indeed resolved. There are still builds taking an unusual amount of time but it seems to be CPU related. There are a few builds with a 45 minutes timeout but it is due to a work in progress change in Gerrit which causes PHPUnit to timeout.

I will monitor it over the week-end.

matmarex removed a subscriber: matmarex.Jun 29 2018, 3:27 PM

parallel-lint is timing out and failing as well: https://gerrit.wikimedia.org/r/442209

That is a patch for Wikibase which has a lot of php files. parallel-lint with HHVM is slow hence the 350 seconds timeout being reached. Can you fill it as another task please? It is probably missing a COMPOSER_PROCESS_TIMEOUT=600 and ideally we should only lint files that got changed by the patch (which we do for mediawiki/core).

Filed as T198493. Did we discount the "general VM slowdown" theory for the timeouts? I'm starting to look into T198421 today, which seems to be another timeout issue that started ~yesterday. Wondering if there could be a relationship between these bugs...

Change 443380 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "Force npm loglevel to 'info'"

https://gerrit.wikimedia.org/r/443380

Change 443380 merged by jenkins-bot:
[integration/config@master] Revert "Force npm loglevel to 'info'"

https://gerrit.wikimedia.org/r/443380

hashar closed this task as Resolved.Jul 2 2018, 3:25 PM
hashar claimed this task.

As far as I can tell, the issue has been resolved by CloudFlare.

Mentioned in SAL (#wikimedia-releng) [2018-07-19T08:52:09Z] <hashar> Switching back timeout of Quibble jobs from 30 to 45 minutes | https://gerrit.wikimedia.org/r/442989 | T198348

Change 442989 merged by jenkins-bot:
[integration/config@master] Revert "Bump Quibble jobs timeout from 30 to 45 minutes"

https://gerrit.wikimedia.org/r/442989

hashar reopened this task as Open.Jul 19 2018, 3:51 PM

Obviously, npmjs has issues again as of 20 minutes ago.....

Change 446863 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Bump Quibble jobs timeout from 30 to 45 minutes [2]

https://gerrit.wikimedia.org/r/446863

Mentioned in SAL (#wikimedia-operations) [2018-07-19T15:53:01Z] <hashar> CI: npmjs is slow. Bumping Quibble jobs timeout from 30 to 45 minutes | https://gerrit.wikimedia.org/r/#/c/446863 | T198348

Mentioned in SAL (#wikimedia-releng) [2018-07-19T15:53:10Z] <hashar> CI: npmjs is slow. Bumping Quibble jobs timeout from 30 to 45 minutes | https://gerrit.wikimedia.org/r/#/c/446863 | T198348

Change 446863 merged by jenkins-bot:
[integration/config@master] Bump Quibble jobs timeout from 30 to 45 minutes [2]

https://gerrit.wikimedia.org/r/446863

Change 446972 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Force npm loglevel to 'info' [2]

https://gerrit.wikimedia.org/r/446972

Change 446972 merged by jenkins-bot:
[integration/config@master] Force npm loglevel to 'info' [2]

https://gerrit.wikimedia.org/r/446972

Mentioned in SAL (#wikimedia-releng) [2018-07-19T21:47:02Z] <hashar> CI: Force npm loglevel to 'info' via a Zuul parameter | T198348

npm --info is forced again. That will give us traces whenever there is a "registry error" occurring.

Change 448369 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "Force npm loglevel to 'info' [2]"

https://gerrit.wikimedia.org/r/448369

hashar closed this task as Resolved.Jul 27 2018, 6:50 AM

Change 448369 merged by jenkins-bot:
[integration/config@master] Revert "Force npm loglevel to 'info' [2]"

https://gerrit.wikimedia.org/r/448369