
reduce copies of mediawiki/core in workspaces
Closed, DeclinedPublic

Description

We had a CI outage caused by full SSD disks on gallium and lanthanum. The apparent cause is that the l10n bot triggered testextensions jobs on almost all extensions, which caused mediawiki/core to be cloned from Gerrit by zuul-cloner many times over and filled the disks quite fast.

MediaWiki core takes roughly 500MB. zuul-cloner does a regular git clone.

When we used the Jenkins git plugin, we had it set to use git clone --reference /some/mirror, which made clones considerably faster. zuul-cloner does no such thing.

We could keep a copy of mediawiki/core on the slaves (on the same disk as the workspace), then before invoking zuul-cloner on mediawiki/core, do a regular local git clone, which creates hardlinks and saves disk space. zuul-cloner would then fetch whatever is missing from the Zuul git repositories.
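A minimal sketch of that hardlink behaviour, using throwaway repositories in a temp directory (all paths and names here are stand-ins for the real mirror and workspace):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in for the local mediawiki/core mirror.
git init -q mirror
git -C mirror -c user.name=ci -c user.email=ci@example.org \
    commit -q --allow-empty -m 'initial commit'

# A plain clone from a local path hardlinks the object files
# instead of copying them, so the second copy is nearly free.
git clone -q mirror workspace

# Link count > 1 shows the object file is shared between both repos.
obj=$(find mirror/.git/objects -type f | head -n 1)
stat -c %h "$obj"
```

(stat -c %h is the GNU coreutils form, which is fine for the Linux slaves; the hardlink optimization only applies when both repositories live on the same filesystem.)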

Related Objects

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added subscribers: hashar, Legoktm, Krenair.

The outage is essentially still ongoing: gallium and lanthanum are taking turns running out of space.

The immediate trigger was the localisation commits, which triggered testextension jobs for repositories we had not seen since the introduction of zuul-cloner.

For production slaves (zend) we could possibly fix this by removing mediawiki/core from the zuul-cloner command and instead doing a plain git clone with --reference pointing at the local replica.
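For illustration, a self-contained sketch of the --reference approach, with throwaway repositories standing in for Gerrit and the local replica (all names here are hypothetical):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-ins for the Gerrit origin and the local replication of it.
git init -q origin
git -C origin -c user.name=ci -c user.email=ci@example.org \
    commit -q --allow-empty -m 'initial commit'
git clone -q --mirror origin replica.git

# --reference records the replica in objects/info/alternates, so the
# new clone borrows objects locally instead of fetching them all.
git clone -q --reference "$tmp/replica.git" origin src

cat src/.git/objects/info/alternates
```

The printed alternates path is exactly the mechanism described later in this task for the --shared variant.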

For labs slaves (hhvm) we could do a shallow git clone with a limited depth (e.g. 10). Perhaps we can revive the mw-core-get script and repurpose it to behave this way (instead of the archive approach it currently uses).
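A quick sketch of the shallow clone, again with temp repositories. One gotcha: --depth only takes effect over a fetch transport such as file://; git silently ignores it for plain local paths.

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in repository with a few commits of history.
git init -q origin
for i in 1 2 3; do
    git -C origin -c user.name=ci -c user.email=ci@example.org \
        commit -q --allow-empty -m "commit $i"
done

# file:// forces the fetch transport so --depth is honoured.
git clone -q --depth 1 "file://$tmp/origin" src

git -C src rev-list --count HEAD   # prints 1: only the tip commit
```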

From a discussion I had with @Legoktm yesterday:

When a job needs mediawiki/core and uses zuul-cloner, we could prepopulate mw from a local mirror on the same disk using something like:

if [ ! -d "src/.git" ]; then
    git clone --no-checkout --shared /mirror/of/mediawiki/core.git src/
fi

That will set up the working copy to point to the mirror:

$ cat src/.git/objects/info/alternates 
/srv/ssd/gerrit/mediawiki/core.git/objects

Which is lightning fast and takes only 108 KB :)

Two caveats:

The problem with git clone --shared is that whenever commits become unreferenced in the source repository, the corresponding objects become unreferenced (dangling) and might be removed from the source repository entirely (via git gc). If they are removed from the source but still referenced in the clone, the clone becomes corrupt.

If we never repack / gc the source repo we should be safe. But if we update it from time to time, we will eventually have to repack it as well. It might be possible to detect that the cloned repository is corrupt and recreate it in such a case.
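That detect-and-recreate step could look roughly like the following sketch with throwaway repositories; the amend / reflog-expire / gc sequence just simulates the mirror pruning an object the shared clone still references:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in mirror plus a shared clone of it.
git init -q mirror
git -C mirror -c user.name=ci -c user.email=ci@example.org \
    commit -q --allow-empty -m 'first version'
git clone -q --no-checkout --shared mirror src

# Simulate the hazard: rewrite history in the mirror and gc it,
# pruning the old commit object the shared clone still references.
git -C mirror -c user.name=ci -c user.email=ci@example.org \
    commit -q --amend --allow-empty -m 'rewritten'
git -C mirror reflog expire --expire=now --all
git -C mirror gc -q --prune=now

# fsck now fails in the clone; since --shared clones are cheap,
# throw the broken one away and clone again.
if ! git -C src fsck --connectivity-only >/dev/null 2>&1; then
    rm -rf src/.git
    git clone -q --no-checkout --shared mirror src
    echo recreated
fi
```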

Since clone --shared is quite fast, we could simply reclone core every time to make sure we never end up with a corrupted repo when the source has unreferenced some objects. That would mean doing something like:

rm -fR src/.git
git clone --bare --shared /mirror/of/mediawiki/core.git src/.git
zuul-cloner mediawiki/core ...

Zuul cloner will then run the usual git remote update / git checkout / git fetch and populate the workspace accordingly.

From https://www.mediawiki.org/wiki/Continuous_integration/meetings/2015-03-30/minutes

We need to find a champion to implement/test/validate/deploy the idea. I am unlikely to conduct it anytime soon though :(

Krinkle triaged this task as Medium priority.Apr 17 2015, 12:18 PM
Krinkle set Security to None.

While reviewing a zuul-cloner patch on upstream Gerrit, Jeremy Stanley noticed that hard linking is skipped when the mirror repository belongs to another user.

That is why I proposed above (T93703#1144542) to use git clone --shared.