
Speedup mwext-codehealth-master-non-voting Castor job
Open, High, Public

Description

In T427450 we could see that the mwext-codehealth-master-non-voting job is the slowest one syncing data using Castor. It looks like the cache is dirty all the time and we sync it.

Could we just disable saving the cache for this job? Delete everything and let it drift over time, always downloading new stuff.
Or investigate whether something saved to the cache changes every time and shouldn't be saved?

If we can remove this bottleneck, all other jobs will spend less time queued waiting to save their cache.

There was a build this morning that waited about 6m36s for the job to start:

00:13:59.135 Waiting for the completion of castor-save-workspace-cache
00:20:35.295 castor-save-workspace-cache #6662819 started.
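
The wait can be read directly off the two console timestamps. A minimal sketch, assuming the `HH:MM:SS.mmm` format shown above; the helper names `to_seconds` and `elapsed` are made up for illustration:

```shell
# Convert an HH:MM:SS.mmm console timestamp to seconds.
to_seconds() {
  echo "$1" | awk -F: '{ printf "%.3f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# Print the number of seconds between two timestamps (start, end).
elapsed() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.3f\n", b - a }'
}

# Timestamps copied from the console output above.
elapsed 00:13:59.135 00:20:35.295   # prints 396.160
```

That is roughly six and a half minutes spent only waiting for castor-save-workspace-cache to start.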

Saves that took longer than 100 seconds all point to mwext-codehealth-master-non-voting:

$ grep -l -P '<duration>\d\d\d\d\d\d+' /srv/jenkins/builds/castor-save-workspace-cache/*/build.xml |xargs grep -A1 TRIGGERED_JOB_NAME|grep value
/srv/jenkins/builds/castor-save-workspace-cache/6670854/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6671387/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6671868/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6672133/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6672160/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6672824/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6674308/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6675449/build.xml-          <value>mwext-codehealth-master-non-voting</value>
/srv/jenkins/builds/castor-save-workspace-cache/6675904/build.xml-          <value>mwext-codehealth-master-non-voting</value>
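
The grep works because `<duration>` is in milliseconds, so six or more digits means at least 100 seconds. To also see how long each save took, something like the following could be used. It is only a sketch: `slow_saves` is a hypothetical helper, and it relies on the same layout the grep above relies on (a `<duration>` element, and the `<value>` line directly after `TRIGGERED_JOB_NAME`).

```shell
# Print "<seconds> <job name> <build.xml path>" for every save whose
# duration is at or above the given threshold in milliseconds.
slow_saves() {
  threshold_ms=$1
  shift
  for f in "$@"; do
    # First <duration> value in the file, in milliseconds.
    ms=$(sed -n 's:.*<duration>\([0-9][0-9]*\)</duration>.*:\1:p' "$f" | head -n 1)
    [ -n "$ms" ] || continue
    [ "$ms" -ge "$threshold_ms" ] || continue
    # Job name: the <value> on the line after TRIGGERED_JOB_NAME.
    job=$(grep -A1 TRIGGERED_JOB_NAME "$f" |
      sed -n 's:.*<value>\(.*\)</value>.*:\1:p' | head -n 1)
    echo "$((ms / 1000)) $job $f"
  done
}
```

Running `slow_saves 100000 /srv/jenkins/builds/castor-save-workspace-cache/*/build.xml | sort -n` would list the slow saves with the slowest last.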

On integration-castor06.integration.eqiad1.wikimedia.cloud, that cache holds 166k files and is 17GBytes:

$ sudo find /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting -type f|wc -l
166205
$ sudo du -hs /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting
17G	/srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting

Event Timeline

Peter renamed this task from Speedup mwext-codehealth-master-non-voting castor job to Speedup mwext-codehealth-master-non-voting Castor job. Thu, May 28, 6:48 AM
Peter updated the task description.

Change #1294992 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[integration/config@master] jjb: don't save castor cache after mwext-codehealth runs

https://gerrit.wikimedia.org/r/1294992

jjb: don't save castor cache after mwext-codehealth-master-non-voting
https://gerrit.wikimedia.org/r/c/integration/config/+/1294992

One of the problems is that the cache will become obsolete and the job will then fetch a bunch of artifacts from the remote registries.

I had a look at the cache on integration-castor06.integration.eqiad1.wikimedia.cloud:

(cd /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting; sudo du -m -d1)
3320	./npm
183	./node-gyp
2529	./sonar
2	./_logs
140	./composer
1	./node-sass
2	./mesa_shader_cache
3934	./Cypress
5693	./_cacache
15799	.

That is 15 GBytes (and 151 559 files), which is definitely way too large. So far I have dealt with that by manually deleting entries from the cache once in a while, which is surely not ideal.
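
To see at a glance which directories dominate, the `du -m -d1` output can be turned into shares of the total. A small sketch, where `size_shares` is a hypothetical helper that assumes the `.` line carries the total, as in the listing above:

```shell
# Read `du -m -d1` output on stdin and print "size<TAB>share<TAB>path",
# largest first. The line for "." is taken as the total.
size_shares() {
  awk '$2 == "." { total = $1; next }
       { size[$2] = $1 }
       END {
         if (total > 0)
           for (p in size)
             printf "%d\t%.0f%%\t%s\n", size[p], 100 * size[p] / total, p
       }' | sort -rn
}
```

It would be used as `(cd /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting; sudo du -m -d1) | size_shares`.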

Cypress is one of the offenders:

805	./Cypress/15.14.2
805	./Cypress/15.12.0
19	./Cypress/13.15.2
674	./Cypress/14.5.3
811	./Cypress/15.8.2
805	./Cypress/15.14.1
19	./Cypress/15.7.1

I think that is due to extensions using different versions of Cypress: since they all share the same cache, every version ends up stored in the same namespace and gets loaded for any extension. And indeed, looking in mediawiki/extensions using grep -A1 '"node_modules/cypress"' */package-lock.json:

Extension                   Cypress version
Cite                        15.15.0
CommunityConfiguration      15.15.0
EntitySchema                13.17.0
GrowthExperiments           15.15.0
GuidedTour                  15.14.1
Score                       15.8.2
WikibaseLexeme              13.17.0
Wikibase                    14.5.3
WikibaseQualityConstraints  15.14.2

I went ahead and deleted from the cache the ones that are not listed:

(cd /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting && \
 sudo rm -fR ./Cypress/15.12.0 ./Cypress/13.15.2 ./Cypress/15.14.1 ./Cypress/15.7.1)

I might have to check again later, because an ongoing build may end up putting them back.
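
For that later check, the comparison between cached and referenced Cypress versions could be scripted. The following is a sketch only: the function names are invented here, and it assumes that each `package-lock.json` puts the `"version"` line directly after `"node_modules/cypress"`, which is what the grep above relies on.

```shell
# Cypress versions referenced by extensions under the given directory.
used_cypress_versions() {
  grep -h -A1 '"node_modules/cypress"' "$1"/*/package-lock.json 2>/dev/null |
    sed -n 's/.*"version": *"\([^"]*\)".*/\1/p' | sort -u
}

# Cypress versions present in the given cache directory.
cached_cypress_versions() {
  ls "$1/Cypress" | sort -u
}

# Versions that are cached but not referenced by any extension.
# Arguments: cache directory, extensions directory.
stale_cypress_versions() {
  cached=$(mktemp)
  used=$(mktemp)
  cached_cypress_versions "$1" > "$cached"
  used_cypress_versions "$2" > "$used"
  # comm -23 keeps the lines that appear only in the first file.
  comm -23 "$cached" "$used"
  rm -f "$cached" "$used"
}
```

The output is only a list of candidates. It does not delete anything, so the result can be reviewed before running `rm`.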

node-gyp compiles modules and its cache is namespaced by node version (well ABI version I guess):

65	./node-gyp/24.14.1
3	./node-gyp/20.16.0
54	./node-gyp/20.20.2
3	./node-gyp/20.19.1
3	./node-gyp/18.20.4
56	./node-gyp/20.19.5
3	./node-gyp/20.18.1

Quibble images now come with NodeJS 24:

docker run --rm -it --entrypoint=node docker-registry.wikimedia.org/releng/quibble-bullseye-php83:1.18.0-s1 --version
v24.14.1

And thus the old caches can be disposed of. I have deleted them using:

(cd /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting &&
 sudo rm -fR node-gyp/{18,20}.*)
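
Rather than hardcoding the major versions, the stale directories could be derived from the Node version the image reports. A minimal sketch, assuming the `node-gyp` subdirectories are named after plain Node versions as in the listing above; `stale_node_gyp_dirs` is a hypothetical helper:

```shell
# List node-gyp cache directories that do not match the given Node version.
# Arguments: cache directory, Node version as printed by `node --version`.
stale_node_gyp_dirs() {
  cache_dir=$1
  current=${2#v}   # strip the leading "v", e.g. v24.14.1 -> 24.14.1
  for d in "$cache_dir"/node-gyp/*/; do
    [ -d "$d" ] || continue
    version=$(basename "$d")
    [ "$version" = "$current" ] || echo "$version"
  done
}
```

With the listing above and `v24.14.1`, this would print every directory except `24.14.1`.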

Sonar caches hold jar files which keep accumulating. My guess is that the easiest fix is to nuke the cache entirely and let it repopulate. There are a few large caches, with sizes in MBytes:

430	./analytics-refinery-source/master/analytics-refinery-maven-java8/sonar
1770	./wikimedia-event-utilities/master/wikimedia-event-utilities-maven-java8-site-publish/sonar
1981	./wikimedia-event-utilities/master/wikimedia-event-utilities-maven-java11/sonar
1981	./wikimedia-event-utilities/master/wikimedia-event-utilities-maven-java8/sonar
707	./search-glent/master/search-glent-maven-java8/sonar
707	./search-glent/master/search-glent-maven-java8-site-publish/sonar
707	./search-glent/master/search-glent-maven-java11/sonar
1106	./search-extra/master/search-extra-maven-java8/sonar
327	./search-extra/master/search-extra-maven-java17/sonar
1432	./search-extra/master/search-extra-maven-java11/sonar
1106	./search-extra/master/search-extra-maven-java8-site-publish/sonar
613	./search-extra-analysis/master/search-extra-analysis-maven-java11/sonar
218	./search-extra-analysis/master/search-extra-analysis-maven-java17/sonar
395	./search-extra-analysis/master/search-extra-analysis-maven-java8/sonar
395	./search-extra-analysis/master/search-extra-analysis-maven-java8-site-publish/sonar
110	./integration-gearman-java/master/gearman-java-maven-java11-site-publish/sonar
110	./integration-gearman-java/master/gearman-java-maven-java17/sonar
110	./integration-gearman-java/master/gearman-java-maven-java8/sonar
110	./integration-gearman-java/master/gearman-java-maven-java11/sonar
165	./search-highlighter/master/search-highlighter-maven-java17/sonar
1101	./search-highlighter/master/search-highlighter-maven-java8/sonar
1101	./search-highlighter/master/search-highlighter-maven-java8-site-publish/sonar
1264	./search-highlighter/master/search-highlighter-maven-java11/sonar
2529	./castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting/sonar
596	./wmf-jvm-utils/master/wmf-jvm-utils-maven-java8/sonar
596	./wmf-jvm-utils/master/wmf-jvm-utils-maven-java8-site-publish/sonar
596	./wmf-jvm-utils/master/wmf-jvm-utils-maven-java11/sonar
7147	./mediawiki-services-parsoid/master/mwext-codehealth-master-non-voting/sonar
198	./wikidata-query-rdf/master/wikidata-query-rdf-maven-java8/sonar

I have deleted all those SonarQube caches.
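
A listing like the one above, plus a total, could be produced along these lines. This is a sketch under the assumption that every cache sits at `<repository>/<branch>/<job>/sonar` below the Castor root, as the paths above suggest; `sonar_cache_sizes` and `total_mb` are made-up names.

```shell
# Print "size-in-MB<TAB>path" for every sonar cache below the given root,
# smallest first. Assumes the layout <root>/<repo>/<branch>/<job>/sonar.
sonar_cache_sizes() {
  find "$1" -mindepth 4 -maxdepth 4 -type d -name sonar -exec du -sm {} + |
    sort -n
}

# Sum the first column (sizes in MB) of a listing read from stdin.
total_mb() {
  awk '{ sum += $1 } END { print sum + 0 }'
}
```

For example, `sudo sh -c '...'` around `sonar_cache_sizes /srv/castor | total_mb` would give the space that wiping all SonarQube caches frees up.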

The npm cache takes 3 GBytes and, again, it keeps accumulating.

So I am tempted to simply wipe them entirely.

Fun times: Cypress ships with copies of ffmpeg (68 MB) and ffprobe (78 MB).

There is also:
117MB ./browser_v8_context_snapshot.bin
201MB ./Cypress

sigh.

Mentioned in SAL (#wikimedia-releng) [2026-05-28T16:48:11Z] <hashar> castor: nuked SonarQube cache: rm -fR /srv/castor/castor-mw-ext-and-skins/master/mwext-codehealth-master-non-voting/sonar/ # T427471

hashar triaged this task as High priority. Thu, May 28, 4:50 PM
hashar awarded a token.

Change #1294992 abandoned by Phedenskog:

[integration/config@master] jjb: don't save castor cache after mwext-codehealth-master-non-voting

Reason:

Let's skip this since it will drift. We need to find another solution for the slow job.

https://gerrit.wikimedia.org/r/1294992

Peter removed Peter as the assignee of this task. Fri, Jun 5, 1:43 PM

My suggestion: disable this job for a week, then compare the metrics between the two weeks to get a better understanding of how much this job slows down all other jobs. As it stands, we cannot keep a job that makes all other jobs slower.