
castor rsync's taking 3-5 minutes for mwgate-npm jobs
Closed, Resolved, Public

Description

e.g. https://integration.wikimedia.org/ci/job/mwgate-npm-node-6-docker/23778/console

07:18:02 + exec docker run --rm --env-file /dev/fd/63 --volume /srv/jenkins-workspace/workspace/mwgate-npm-node-6-docker@2/cache:/cache docker-registry.wikimedia.org/releng/castor:0.1.3 load
07:18:02 ++ /usr/bin/env
07:18:02 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL|HHVM_REPO_CENTRAL_PATH)='
07:18:03 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker"
07:18:03 Syncing...
07:18:03 rsync: failed to set times on "/cache/.": Operation not permitted (1)

07:23:24 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [generator=3.1.2]
07:23:24 
07:23:24 Done

The whole npm install job would probably run faster than that with no cache at all :(

Event Timeline

On castor02.integration.eqiad.wmflabs:

$ sudo du -s -m /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker
249	/srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker

So that specific cache is 249 MBytes. Note it is shared by all MediaWiki extensions and skins.

Looking at the build time trend for the job at https://integration.wikimedia.org/ci/job/mwgate-npm-node-6-docker/buildTimeTrend, it looks like the transfer is usually quite fast, with occasional spikes of several minutes.

That might be IO or network starvation, potentially because several jobs are fetching at the same time.

Looking at the castor02 rsync daemon logs, there are 21 connections over a few seconds:

$ sudo journalctl -u rsync|grep 'Feb 27 07:1.*rsync on'
Feb 27 07:17:58 castor02 rsyncd[4603]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-php70-jessie-non-voting/ from ci-jessie-wikimedia-978588.contintcloud.eqiad.wmflabs (10.68.19.141)
Feb 27 07:17:58 castor02 rsyncd[4604]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1002.integration.eqiad.wmflabs (10.68.19.10)
Feb 27 07:18:02 castor02 rsyncd[4608]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ from ci-jessie-wikimedia-978585.contintcloud.eqiad.wmflabs (10.68.16.164)
Feb 27 07:18:02 castor02 rsyncd[4609]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1005.integration.eqiad.wmflabs (10.68.21.12)
Feb 27 07:18:02 castor02 rsyncd[4610]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1006.integration.eqiad.wmflabs (10.68.17.244)
Feb 27 07:18:02 castor02 rsyncd[4611]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1003.integration.eqiad.wmflabs (10.68.16.47)
Feb 27 07:18:02 castor02 rsyncd[4612]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1004.integration.eqiad.wmflabs (10.68.16.233)
Feb 27 07:18:02 castor02 rsyncd[4613]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ from ci-jessie-wikimedia-978580.contintcloud.eqiad.wmflabs (10.68.21.246)
Feb 27 07:18:02 castor02 rsyncd[4616]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ from ci-jessie-wikimedia-978589.contintcloud.eqiad.wmflabs (10.68.20.210)
Feb 27 07:18:02 castor02 rsyncd[4614]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ from ci-jessie-wikimedia-978587.contintcloud.eqiad.wmflabs (10.68.21.14)
Feb 27 07:18:02 castor02 rsyncd[4615]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1007.integration.eqiad.wmflabs (10.68.19.105)
Feb 27 07:18:02 castor02 rsyncd[4617]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-php70-jessie-non-voting/ from ci-jessie-wikimedia-978581.contintcloud.eqiad.wmflabs (10.68.22.108)
Feb 27 07:18:02 castor02 rsyncd[4618]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-php70-jessie-non-voting/ from ci-jessie-wikimedia-978579.contintcloud.eqiad.wmflabs (10.68.18.212)
Feb 27 07:18:02 castor02 rsyncd[4619]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-php70-jessie-non-voting/ from ci-jessie-wikimedia-978584.contintcloud.eqiad.wmflabs (10.68.17.170)
Feb 27 07:18:02 castor02 rsyncd[4620]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ from ci-jessie-wikimedia-978586.contintcloud.eqiad.wmflabs (10.68.23.188)
Feb 27 07:18:02 castor02 rsyncd[4621]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-php70-jessie-non-voting/ from ci-jessie-wikimedia-978583.contintcloud.eqiad.wmflabs (10.68.22.128)
Feb 27 07:18:02 castor02 rsyncd[4622]: rsync on caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ from ci-jessie-wikimedia-978578.contintcloud.eqiad.wmflabs (10.68.17.68)
Feb 27 07:18:03 castor02 rsyncd[4624]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1001.integration.eqiad.wmflabs (10.68.16.101)
Feb 27 07:18:03 castor02 rsyncd[4623]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1002.integration.eqiad.wmflabs (10.68.19.10)
Feb 27 07:18:03 castor02 rsyncd[4625]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1003.integration.eqiad.wmflabs (10.68.16.47)
Feb 27 07:18:03 castor02 rsyncd[4626]: rsync on caches/castor-mw-ext-and-skins/master/mwgate-npm-node-6-docker/ from integration-slave-docker-1001.integration.eqiad.wmflabs (10.68.16.101)

The network graph, in bytes per second:

castor02_network_bytes.png (308×586 px, 30 KB)

Sounds to me like the caching system should probably be more distributed: maybe using a distributed file system that all slaves would participate in, or brewing our own based on rsync from a local cache first and then from the central one.
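For the "local cache first, then central" idea, a minimal sketch of what the castor load step could look like, assuming a hypothetical per-host cache directory (the paths are illustrative, not the current setup):

# Hypothetical two-stage fetch: seed from a cache kept on the build host,
# then top up from the central castor rsync server.
LOCAL_CACHE="/srv/castor-local/${CASTOR_NAMESPACE}"
if [ -d "$LOCAL_CACHE" ]; then
    rsync -a "$LOCAL_CACHE"/ /cache/
fi
rsync -a "rsync://castor02.integration.eqiad.wmflabs/caches/${CASTOR_NAMESPACE}/" /cache/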

Could we set up a second castor instance and split the traffic? (Either randomly or via a consistent hash based on job and repo names.)
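A sketch of how such a split could be done in the job's shell wrapper, assuming a hypothetical second instance named castor03 (host names are illustrative):

# Pick a castor host deterministically from the namespace so a given
# job/repo combination always hits the same instance.
CASTOR_HOSTS=(castor02.integration.eqiad.wmflabs castor03.integration.eqiad.wmflabs)
INDEX=$(( $(printf '%s' "$CASTOR_NAMESPACE" | cksum | cut -d' ' -f1) % ${#CASTOR_HOSTS[@]} ))
rsync -a "rsync://${CASTOR_HOSTS[$INDEX]}/caches/${CASTOR_NAMESPACE}/" /cache/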

I did some tests with a 368MB cache directory (wmf-quibble-core-vendor-mysql-php72-docker). One observation is that it is much slower with rsync --compress: 17 seconds instead of 12, limited by CPU on the castor server instead of by the network.
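For reference, a comparison along those lines can be reproduced with something like the following (the host and namespace are placeholders, not the exact commands behind the numbers above):

# With compression: CPU-bound on the castor server in the test above (~17s)
time rsync -a --compress "rsync://integration-castor03.integration.eqiad.wmflabs/caches/${CASTOR_NAMESPACE}/" /tmp/cache-test/
rm -rf /tmp/cache-test/
# Without compression: network-bound (~12s)
time rsync -a "rsync://integration-castor03.integration.eqiad.wmflabs/caches/${CASTOR_NAMESPACE}/" /tmp/cache-test/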

I think it would be beneficial to tar up the whole cache directory and to have castor serve the tarballs via HTTP instead of rsyncd. The timings would be approximately as follows:

  • Download tarball: 6.8 seconds for 162MB
  • tar -xzf: 2.5 seconds
  • Post-build tar -czf: 7.8 seconds

These figures are on idle servers, but it would scale better since the network traffic is minimised without causing significant CPU usage on the castor server.
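A rough sketch of that flow, assuming castor exposed the cache directory over plain HTTP and accepted uploads (the URL and filename are hypothetical):

# Restore step, before the build
curl -sf "http://castor.example.wmflabs/caches/${CASTOR_NAMESPACE}.tar.gz" \
  | tar -xzf - -C /cache || echo "No cache yet, starting cold"
# Save step, after a successful gate build
tar -czf - -C /cache . \
  | curl -sf -T - "http://castor.example.wmflabs/caches/${CASTOR_NAMESPACE}.tar.gz"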

If you add a varnishd on every integration slave and proxy via localhost, then some network traffic could be avoided. That's my idea of a distributed filesystem.

Change 526603 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[integration/config@master] castor: Remove --compress from rsync

https://gerrit.wikimedia.org/r/526603

Change 526603 merged by jenkins-bot:
[integration/config@master] castor: Remove --compress from rsync

https://gerrit.wikimedia.org/r/526603

With releng/castor:0.2.3, which no longer uses --compress, a build of quibble-vendor-mysql-php72-docker took 7 minutes to fetch the cache:

00:00:00.763 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-docker"
00:00:00.764 Syncing...
...
00:07:18.949 Done

That cache is ~400 MBytes and 8700 files.

So even though dropping --compress saves on the CPU side, the transfer is still network bound, and even more so now that more data has to be transferred.

The specific example above happened at 9:33:49 UTC. Around that time, the instance showed a lot of iowait and an IO utilization above 80%. The network output capped at roughly 20 MBytes/s, when it was more like 4-5 MBytes/s previously.

When the system is mostly idle, that specific cache is fetched in as little as 11 seconds. When it is busy, I looked at the transfer rate using bwm-ng and it shows a network cap at 21 MBytes/s while the 8 vCPUs are mostly idle.
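For anyone reproducing that kind of measurement, tools along these lines give the throughput and IO utilization figures quoted above (the interface name is an assumption):

bwm-ng -u bytes -I eth0   # live per-interface throughput
iostat -x 5               # iowait and per-device %util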

So it sounds to me like we should restore --compress to cut down on the amount of data being transferred.

One last thing: the instance hosting the cache is integration-castor03.integration.eqiad.wmflabs (172.16.5.161). It is running on the server cloudvirt1002.

The Grafana view shows the same network cap at roughly 20 MBytes/s.

I talked with @Legoktm about the possibility of removing the npm cache from this rsync'd cache directory, and instead using a central network service for individual packages, for example using local-npm.
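If that route were taken, pointing the jobs at such a service would just be an npm configuration change; a sketch assuming a hypothetical registry mirror host and local-npm's default port 5080:

npm config set registry http://npm-cache.integration.eqiad.wmflabs:5080/
npm install
# or per invocation, without touching the config:
npm install --registry http://npm-cache.integration.eqiad.wmflabs:5080/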

...
If you add a varnishd on every integration slave and proxy via localhost, then some network traffic could be avoided. That's my idea of a distributed filesystem.

I have evaluated using Squid as a man-in-the-middle proxy (to cache requests made over HTTPS instead of just passing the TCP connection through), see T116015. I had an issue with rubygems, but I really can't remember why I dismissed it. Surely that is worth revisiting, and potentially we could get support from the Traffic team.

A promising one was Sonatype Nexus, which comes with support for the majority of package managers. It had a few interesting features, such as reporting security issues or flagging the usage of broken/outdated packages. Not sure whether it can be distributed though.

Anyway, we have T147635: Investigate again a central cache for package managers, but it is kind of stalled for now. Meanwhile the lame "rsync to a single instance" approach has shown its limits, so it is certainly time to build something better and, as you have mentioned, more distributed.

I have filed T232644 about the network bandwidth limitation. Apparently WMCS rate limits instances as a safeguard measure (which is a good thing).

I have found that integration-castor03 is on a server that has slow CPUs (cloudvirt1002, see T223971 / T225713). The instance would benefit from being migrated to another cloudvirt machine with faster CPUs. I filed T232646 to move it to a more powerful server. Then I guess we can roll back to using --compress.

Mentioned in SAL (#wikimedia-releng) [2019-09-17T14:31:24Z] <hashar> Raise egress filter traffic shaping on integration-castor03 from 240Mbit to 400Mbit ( T232644 / T188375 )
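For context, if the egress cap is implemented as a token-bucket filter on the instance's network interface (an assumption; the actual mechanism on WMCS may differ), raising it would look roughly like:

# Replace the root qdisc with a token bucket allowing ~400 Mbit/s egress
tc qdisc replace dev eth0 root tbf rate 400mbit burst 256k latency 400ms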

The network is barely capped anymore; it got bumped to 800 Mbit/s for egress traffic.

The instance has been moved to a faster host (cloudvirt1021), so potentially we could re-enable --compress. The default compression level is 6 on a scale of 0-9, so maybe we can lower it to save on CPU while still saving on network bandwidth. Can be done by replacing --compress with --compress-level=0.
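One note on that flag: in rsync, a non-zero --compress-level implies --compress, while --compress-level=0 leaves compression off entirely, so a low non-zero level is presumably what is wanted. A hypothetical invocation of the castor fetch with light compression:

rsync -a --compress-level=1 "rsync://integration-castor03.integration.eqiad.wmflabs/caches/${CASTOR_NAMESPACE}/" /cache/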

Current builds seem to take ~20s. Is this fast enough?

Krinkle assigned this task to hashar.
Krinkle added a subscriber: Krinkle.

Seems so. Definitely better than before. There are no longer multi-minute halts on the core jobs, and it also isn't the bottleneck in either the "test" or "gate" pipeline, so making it faster right now wouldn't get results shown in Gerrit any quicker.

There is still future work to be considered around pruning the cache and/or reducing it to only cover composer etc. and not npm (serving npm via an HTTP proxy instead). I believe some of that might be covered by other tasks, but if not, tasks could be created for it.

Change 734412 had a related patch set uploaded (by Hashar; author: Jforrester):

[integration/config@master] jjb: Update castor users from 0.2.1 (2018) to 0.2.4 (2019)

https://gerrit.wikimedia.org/r/734412

Mentioned in SAL (#wikimedia-releng) [2021-10-26T10:17:03Z] <hashar> Updating all Jenkins job to remove --compress option from rsync when saving cache | https://gerrit.wikimedia.org/r/c/integration/config/+/734412 | T188375

Change 734412 merged by jenkins-bot:

[integration/config@master] jjb: Update castor users from 0.2.1 (2018) to 0.2.4 (2019)

https://gerrit.wikimedia.org/r/734412