Workspaces for mwgate-php55lint / mwgate-php70lint are getting huge
Open, High · Public

Description

legoktm@integration-slave-jessie-1003:/srv/jenkins-workspace/workspace$ du -hs *
311M	analytics-refinery-release
4.0K	analytics-refinery-release@tmp
873M	apps-android-wikipedia-publish
913M	apps-android-wikipedia-test
114M	commit-message-validator
227M	composer-package-validate
42M	debian-glue
472K	debian-glue-non-voting
4.0K	fail-archived-repositories
88M	integration-zuul-layoutdiff
16M	integration-zuul-layoutvalidation-gate
876K	labs-tools-ZppixBot-php55lint
788M	mediawiki-core-code-coverage
1.3G	mediawiki-core-jsduck
3.1G	mediawiki-core-php55lint
82M	mediawiki-vendor-composer-security
96M	mwext-CirrusSearch-whitespaces
163M	mwext-VisualEditor-jsduck
696K	mwgate-composer-validate
50M	mwgate-jsduck
5.4G	mwgate-php55lint
24M	mwgate-php56lint
15M	operations-dns-lint
1.6M	operations-dns-tabs
192M	operations-mw-config-php55lint
97M	operations-mw-config-typos
169M	phabricator-jessie-commits
3.7M	phabricator-jessie-debs
170M	phabricator-jessie-diffs
4.8M	php55lint
144M	php56lint
14M	php-compile-hhvm
15M	php-compile-php55
15M	php-compile-php70
88M	selenium-Wikibase-T167432
100K	test-csteipp-sensiolabs-securityadvisorieschecker
502M	wikimedia-fundraising-civicrm

High priority because this is filling up /srv.

Legoktm created this task. Nov 7 2017, 5:57 PM
Restricted Application added subscribers: Zppix, Aklapper. Nov 7 2017, 5:57 PM
Legoktm updated the task description. Nov 7 2017, 5:58 PM
Paladox added a subscriber: Paladox. Nov 7 2017, 5:59 PM

I guess we can set up a cron that deletes the repos every day?
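
For reference, a minimal sketch of what such a cron could look like (the path, user and retention period are assumptions, and as noted below this would only paper over the symptom):

# /etc/cron.d/jenkins-workspace-cleanup (hypothetical)
# Remove workspace directories untouched for a week; path, user and retention are assumptions.
0 3 * * * jenkins-deploy find /srv/jenkins-workspace/workspace -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +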

No, we need to fix the root cause. This wasn't an issue until recently.

hashar added a subscriber: hashar. Nov 7 2017, 9:26 PM
5.4G	mwgate-php55lint

Sounds like that job clones the whole repo with submodule processing, which takes a bunch of space for mediawiki/core patches on some wmf branch?

Then in the job config:

<scm class="hudson.plugins.git.GitSCM">
  <extensions>
    <hudson.plugins.git.extensions.impl.SubmoduleOption>
      <disableSubmodules>true</disableSubmodules>
      <recursiveSubmodules>false</recursiveSubmodules>

https://integration.wikimedia.org/ci/job/mwgate-php55lint/configure shows that "Disable submodules processing" is checked.

So that is not mediawiki/core


Looking on integration-slave-jessie-1001

$ du -m -d1 /srv/jenkins-workspace/workspace/mwgate-php55lint
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/data.template
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/config.template
3	/srv/jenkins-workspace/workspace/mwgate-php55lint/i18n
3749	/srv/jenkins-workspace/workspace/mwgate-php55lint/.git
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/src
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/includes
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/tests
14	/srv/jenkins-workspace/workspace/mwgate-php55lint/resources
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/doc
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/maintenance
1	/srv/jenkins-workspace/workspace/mwgate-php55lint/languages
3767	/srv/jenkins-workspace/workspace/mwgate-php55lint

.git/objects/pack has a lot of huge packfiles.

I think what happens is that the Jenkins WORKSPACE is not wiped on job start. Different repos are tested and each accumulates its pack files in .git/objects/pack until the disk is filled.
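
A quick way to confirm that on a slave is to inspect the object store of one of those workspaces directly (illustrative commands, path taken from the listing above):

cd /srv/jenkins-workspace/workspace/mwgate-php55lint
# how much is packed vs. loose
git count-objects -vH
# the largest pack files, each left behind by a different repository
ls -lhS .git/objects/pack | head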

I guess we can switch to wiping the workspace and using a shallow clone. IIRC that was quite slow for mediawiki/core.

Also, most probably we can ditch that job and just use composer test / parallel-lint?

I guess we can switch to wiping the workspace and using a shallow clone. IIRC that was quite slow for mediawiki/core.

We have mediawiki-core-php55lint as a separate job, so I think switching mwgate-php55lint over to shallow clone + workspace wipe should be fine.

Also, most probably we can ditch that job and just use composer test / parallel-lint?

We still need the job for the "check" pipeline regardless.

+1 +1. So yeah, I guess we can switch to wiping the workspace and using a shallow clone. There might be defaults for that in JJB already :)
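
In plain git terms, the behaviour we are after is roughly the following (a sketch only; $ZUUL_URL / $ZUUL_PROJECT / $ZUUL_REF are the parameters Zuul normally exports and are assumptions here, the Jenkins git plugin does the equivalent internally):

# wipe the workspace so nothing accumulates between builds
rm -rf "$WORKSPACE" && mkdir -p "$WORKSPACE" && cd "$WORKSPACE"
# shallow clone of the repository under test
git clone --depth 1 "$ZUUL_URL/$ZUUL_PROJECT" .
# fetch and check out the change being tested
git fetch --depth 1 origin "$ZUUL_REF"
git checkout -qf FETCH_HEAD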

Just had to clean this up on integration-slave-jessie-1003.

Mentioned in SAL (#wikimedia-releng) [2017-12-22T22:38:59Z] <thcipriani> integration-slave-jessie-1004 removed mediawiki-core-jsduck, mwgate-php55lint, mediawikicore-php55lint as /srv mount was full T179963

hashar added a comment (edited). Jan 22 2018, 2:15 PM

That is happening once per week or so. The workaround is to clean out the workspaces manually but that is terrible.

We apparently have two different templates to generate the PHP lint jobs. Comparing php55lint and mwgate-php55lint:

--- php55lint	2017-12-04 09:26:58.000000000 +0100
+++ mwgate-php55lint	2017-12-04 09:26:58.000000000 +0100
@@ -4,3 +4,4 @@
   <description>&lt;p&gt;Job is managed by &lt;a href=&quot;https://www.mediawiki.org/wiki/CI/JJB&quot;&gt;Jenkins Job Builder&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;This job is triggered by Zuul&lt;/p&gt;
+&lt;p&gt;This job is triggered by Zuul.&lt;/p&gt;
+&lt;p&gt;Git submodules are NOT processed.&lt;/p&gt;
 &lt;!-- Managed by Jenkins Job Builder --&gt;</description>
@@ -15,3 +16,3 @@
       <strategy class="hudson.tasks.LogRotator">
-        <daysToKeep>30</daysToKeep>
+        <daysToKeep>15</daysToKeep>
         <numToKeep>-1</numToKeep>
@@ -113,5 +114,3 @@
     <extensions>
-      <hudson.plugins.git.extensions.impl.CloneOption>
-        <shallow>true</shallow>
-      </hudson.plugins.git.extensions.impl.CloneOption>
+      <hudson.plugins.git.extensions.impl.CleanCheckout/>
       <hudson.plugins.git.extensions.impl.SubmoduleOption>
@@ -124,3 +123,2 @@
       </hudson.plugins.git.extensions.impl.SubmoduleOption>
-      <hudson.plugins.git.extensions.impl.WipeWorkspace/>
     </extensions>

EDIT: I had mixed them up.

php55lint

Wipes the workspace and does a shallow clone. That is the version we would want to use everywhere. I am not sure how git-changed-in-head.sh behaves with a shallow clone, but hopefully it will be fine.

mwgate-php55lint

Apparently does a full clone and a clean checkout. It does NOT wipe the workspace, hence the .git directory keeps growing as different repositories trigger that job (see my earlier comment T179963#3742837).

Change 405722 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Use shallow clone for phplint jobs

https://gerrit.wikimedia.org/r/405722

This issue keeps happening (well, running out of space on the Jenkins executors at least), but this patch has been open since right before All Hands :). Adding it officially to our (too big of a) short-term-ish backlog.

greg added a comment. Apr 26 2018, 4:49 PM

And again...

gjg@integration-slave-jessie-1001:/srv/jenkins-workspace/workspace$ du -sh *
511M	analytics-refinery-release
4.0K	analytics-refinery-release@tmp
422M	analytics-refinery-update-jars
4.0K	analytics-refinery-update-jars@tmp
822M	apps-android-wikipedia-publish
4.6G	apps-android-wikipedia-test
120M	commit-message-validator
236M	composer-package-validate
9.0M	debian-glue
1.8M	debian-glue-non-voting
81M	integration-zuul-layoutdiff
14M	integration-zuul-layoutvalidation-gate
481M	mediawiki-core-code-coverage
2.9G	mediawiki-core-php55lint
410M	mediawiki-core-php70lint
80M	mediawiki-vendor-composer-security
70M	mwext-CirrusSearch-whitespaces
764K	mwgate-composer-validate
1.5G	mwgate-php55lint
24M	mwgate-php56lint
1.5G	mwgate-php70lint
21M	operations-dns-lint
1.6M	operations-dns-tabs
236M	operations-mw-config-php55lint
99M	operations-mw-config-typos
236M	operations-puppet-wmf-style-guide
210M	phabricator-jessie-commits
3.4M	phabricator-jessie-debs
210M	phabricator-jessie-diffs
1.3M	php55lint
52M	php56lint
140M	selenium-Wikibase-chrome
541M	wikimedia-fundraising-civicrm
gjg@integration-slave-jessie-1001:/srv/jenkins-workspace/workspace$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                 10M     0   10M   0% /dev
tmpfs                               792M   81M  711M  11% /run
/dev/vda3                            19G   15G  3.1G  83% /
tmpfs                               2.0G     0  2.0G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               2.0G     0  2.0G   0% /sys/fs/cgroup
none                                256M  110M  147M  43% /var/lib/mysql
/dev/mapper/vd-second--local--disk   21G   20G     0 100% /srv
none                                256M     0  256M   0% /srv/home/jenkins-deploy/tmpfs
tmpfs                               396M     0  396M   0% /run/user/2947
tmpfs                               396M     0  396M   0% /run/user/11634
tmpfs                               396M     0  396M   0% /run/user/2890

Could this possibly be the cause of the fetch/lock failures I'm seeing?

https://integration.wikimedia.org/ci/job/mwgate-php70lint/1140/console

Building remotely on integration-slave-jessie-1001 (DebianGlue contintLabsSlave DebianJessie) in workspace /srv/jenkins-workspace/workspace/mwgate-php70lint
[..] Fetching changes from the remote Git repository
 > git config remote.origin.url git://contint1001.wikimedia.org/mediawiki/extensions/ReplaceText # timeout=10
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from git://contint1001.wikimedia.org/mediawiki/extensions/ReplaceText  [..]
Caused by: hudson.plugins.git.GitException: Command "git config remote.origin.url git://contint1001.wikimedia.org/mediawiki/extensions/ReplaceText" returned status code 4:
stderr: error: failed to write new configuration file .git/config.lock
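
The df output above shows /srv at 100%, which would explain git being unable to write .git/config.lock. Illustrative commands to confirm and clean up on a slave:

# confirm the partition is full and list the largest workspaces
df -h /srv
du -sh /srv/jenkins-workspace/workspace/* | sort -rh | head
# a stale lock left behind by a failed build can simply be removed, e.g.:
rm -f /srv/jenkins-workspace/workspace/mwgate-php70lint/.git/config.lock
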
Krinkle renamed this task from "mwgate-php55lint workspaces are getting huge" to "Workspaces for mwgate-php55lint / mwgate-php70lint are getting huge". May 2 2018, 5:29 PM
greg added a comment. May 2 2018, 5:42 PM
gjg@integration-slave-jessie-1002:/srv/jenkins-workspace/workspace$ du -sh
15G	.
gjg@integration-slave-jessie-1002:/srv/jenkins-workspace/workspace$ du -sh *
422M	analytics-refinery-update-jars
4.0K	analytics-refinery-update-jars@tmp
902M	apps-android-wikipedia-publish
5.0G	apps-android-wikipedia-test
100M	commit-message-validator
236M	composer-package-validate
1.6M	debian-glue-non-voting
4.0K	fail-archived-repositories
82M	integration-zuul-layoutdiff
11M	integration-zuul-layoutvalidation-gate
452M	mediawiki-core-code-coverage-php7
1.9G	mediawiki-core-php55lint
792M	mediawiki-core-php70lint
79M	mediawiki-vendor-composer-security
54M	mwext-CirrusSearch-whitespaces
776K	mwgate-composer-validate
1.4G	mwgate-php55lint
24M	mwgate-php56lint
2.0G	mwgate-php70lint
12M	operations-dns-lint
1.6M	operations-dns-tabs
211M	operations-mw-config-php55lint
100M	operations-mw-config-typos
238M	operations-puppet-wmf-style-guide
210M	phabricator-jessie-commits
4.1M	phabricator-jessie-debs
1.2M	php55lint
157M	php56lint
147M	selenium-Wikibase-chrome
524M	wikimedia-fundraising-civicrm

rm -rf'd again...

Mentioned in SAL (#wikimedia-releng) [2018-07-14T03:27:01Z] <Krinkle> Clearing various workspaces on integration-slave-jessie-1001 to fix operations-mw-config-php55lint Jenkins builds - T179963

Mentioned in SAL (#wikimedia-releng) [2018-08-27T13:47:38Z] <hashar> updating phplint jobs to use shallow clone AND wipe the workspace | https://gerrit.wikimedia.org/r/#/c/integration/config/+/405722/ | T179963

Change 405722 merged by jenkins-bot:
[integration/config@master] Use shallow clone for phplint jobs

https://gerrit.wikimedia.org/r/405722

hashar closed this task as Resolved. Aug 27 2018, 9:30 PM
hashar claimed this task.

Should be good now. I had simply forgotten about this task and its patch: https://gerrit.wikimedia.org/r/405722

Krinkle reopened this task as Open. Aug 31 2018, 3:56 AM

Shallow clones break the git-changed-in-head script, so it thinks that literally every file was changed.

Created a revert of this change as https://gerrit.wikimedia.org/r/#/c/integration/config/+/456515/

hashar removed hashar as the assignee of this task. Sep 24 2018, 4:40 PM
Seb35 added a subscriber: Seb35. Sep 24 2018, 5:29 PM

git-changed-in-head works with almost-shallow clones: git clone --depth 2 https://gerrit.wikimedia.org/...

Yup, that works most of the time, but --depth 2 is not sufficient when a chain of patchsets is being tested. I guess that is what prompted the revert.
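
To illustrate the constraint (a sketch; $REPO is a placeholder and the exact diff invocation used by git-changed-in-head is an assumption):

# depth 1: HEAD has no parent in the clone, so a HEAD~1..HEAD diff cannot work
git clone --depth 1 "$REPO" shallow
git -C shallow diff --name-only HEAD~1..HEAD          # fails: unknown revision HEAD~1
# depth 2: works for a single patchset...
git clone --depth 2 "$REPO" almost-shallow
git -C almost-shallow diff --name-only HEAD~1..HEAD
# ...but a chain of N patchsets would need a correspondingly larger --depth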

I think I will just phase out those jobs entirely; they predate having all MediaWiki extensions and skins normalized to use composer test as an entry point with jakub-onderka/php-parallel-lint.
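
For repositories that follow that convention, the lint step boils down to something like this (illustrative; the exact parallel-lint invocation varies per repo):

# install dev dependencies and run the repo's own test entry point
composer install --no-progress
composer test
# which, for most extensions/skins, ends up running something along the lines of:
vendor/bin/parallel-lint --exclude vendor .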

The git-changed-in-head part is an optimization that made the job faster on huge repositories such as mediawiki/core or Wikibase. Nowadays that is handled directly by Quibble.

The only remaining use case is to lint PHP files for untrusted users, and eventually we will drop that with T192217: Remove the "check" pipeline and Zuul's user-filter.