
issues with artifact cache in an-coord1001
Closed, ResolvedPublic

Description

When deploying from deployment1001 to an-coord1001 we got an error because the target's disk was full.

Error looks like:

Unhandled error:
deploy-local failed: <ErrorReturnCode_11> {u'full_cmd': u'/usr/bin/git fat pull', u'stderr': u'\nrsync: write failed on "/srv/deployment/analytics/refinery-cache/revs/4e9894c0db04ee39be0f094d0ef11ebbff198834/.git/fat/objects/ecfba6edeb07a17a0d1b09981aff917e068d146c": No space left on device (28)\nrsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2]\nrsync error: error in file IO (code 11) at io.c(1642) [sender=3.1.2]\n', u'stdout': u'receiving file list ... \n 100 files...\r182 files to consider\n003d9b4

It looks like some artifacts might not be getting deleted when they should be, as our config is set to keep 2 cached revs:

nuria@deploy1001:/srv/deployment/analytics/refinery/scap$ more scap.cfg
[global]
git_repo: analytics/refinery
git_deploy_dir: /srv/deployment
git_repo_user: analytics-deploy
ssh_user: analytics-deploy
server_groups: canary, default
canary_dsh_targets: target-canary
dsh_targets: targets
git_submodules: False
git_fat: True
cache_revs: 2

Does this scap config take effect in the source as well as the target of the deploy?

Event Timeline

This also often affects other hosts with relatively small /srv partitions, like notebook* hosts.

Can Release Engineering chime in on whether the scap config settings should also cause artifacts to be deleted from the target of the deploy?

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Mentioned in SAL (#wikimedia-analytics) [2019-07-11T20:31:54Z] <ottomata> resized /srv on an-coord1001 from 60G to 115G - T227132

I believe this happens on an-coord1001 and the notebook* hosts because their /srv partitions are relatively small. When the disk fills up during a scap deploy, the deploy aborts without removing old cached deploys; subsequent successful deploys do remove them.
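A quick way to confirm the leftover-revs theory on a host is to count and size the entries under the cache's revs/ directory (a diagnostic sketch: the demo below runs against a throwaway tree; on a real target, point revs_dir at /srv/deployment/analytics/refinery-cache/revs, the path from the rsync error in the description):

```shell
#!/bin/sh
# Diagnostic sketch: count cached revs and see which are largest.
# Here revs_dir is a throwaway demo tree, not the real cache path.
revs_dir=$(mktemp -d)
mkdir "$revs_dir/rev-a" "$revs_dir/rev-b" "$revs_dir/rev-c"
count=$(ls -1 "$revs_dir" | wc -l)
echo "cached revs: $count (cache_revs is 2, so 3 or more means leftovers)"
du -sh "$revs_dir"/* | sort -rh   # largest cached revs first
```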

We had some room in an-coord1001's volume group, so I resized /srv from 60G to 115G. refinery is 15G, so hopefully an-coord1001 will be fine from now on.

On notebooks, however, /srv is used for both deployment and user /home dirs, and users often fill up /srv with their homes on notebook hosts. I'm not sure what we can do about that problem there other than implementing usage quotas, expanding disk hardware, or building a distributed Hadoop-based notebook project with the goal of dropping support for individual notebook hosts.

Nuria moved this task from Ready to Deploy to In Progress on the Analytics-Kanban board.

Let's figure out how to deploy only what we need to notebook hosts

Change 522542 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/scap@master] Deploy to notebook hosts as a separate environment to avoid deploying artifacts

https://gerrit.wikimedia.org/r/522542

Change 522542 merged by Ottomata:
[analytics/refinery/scap@master] Deploy to notebook hosts as a separate environment to avoid deploying artifacts

https://gerrit.wikimedia.org/r/522542

Change 523171 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/scap@master] Set deploy targets for notebook environment

https://gerrit.wikimedia.org/r/523171

Change 523171 merged by Ottomata:
[analytics/refinery/scap@master] Set deploy targets for notebook environment

https://gerrit.wikimedia.org/r/523171

I've manually removed a bunch of old refinery artifact jar versions from the refinery deploy on notebook hosts to free up space.
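For reference, that kind of manual cleanup can be sketched like this (the exact commands and age cutoff used on the notebook hosts are not recorded in this task; the paths and the 90-day threshold below are illustrative, and the demo runs against a throwaway tree):

```shell
#!/bin/sh
# Illustrative sketch of removing old artifact jars from cached revs to
# free space; the temp tree stands in for
# /srv/deployment/analytics/refinery-cache/revs.
cache=$(mktemp -d)
mkdir -p "$cache/revs/old-rev" "$cache/revs/new-rev"
touch -d '2019-01-01' "$cache/revs/old-rev/refinery-job.jar"   # stale artifact
touch "$cache/revs/new-rev/refinery-job.jar"                   # current artifact
# Delete jars not modified in the last 90 days (illustrative cutoff).
find "$cache/revs" -name '*.jar' -mtime +90 -delete
```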

Can Release Engineering chime in on whether the scap config settings should also cause artifacts to be deleted from the target of the deploy?

I'm not 100% sure what you are asking here, but I'll take my best stab at it:

When cleaning up any old revisions from the deployment cache, scap completely removes the directory, so that should remove artifacts as well.
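That cleanup behaviour can be sketched roughly like this (a minimal stand-in for what scap does, not its actual code):

```shell
#!/bin/sh
# Keep the newest $keep revision directories in a cache's revs/ dir and
# remove the rest wholesale, artifacts and git-fat objects included.
prune_revs() {
    revs_dir=$1
    keep=$2
    # Newest first by mtime; everything after the first $keep goes.
    ls -1t "$revs_dir" | tail -n +"$((keep + 1))" | while read -r rev; do
        rm -rf "${revs_dir:?}/$rev"
    done
}

# Demo with cache_revs: 2 on a throwaway tree.
demo=$(mktemp -d)
for i in 1 2 3 4; do
    mkdir "$demo/rev$i"
    touch -d "2019-07-0$i" "$demo/rev$i"   # rev4 is newest
done
prune_revs "$demo" 2
```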

Does that answer your question or did I misunderstand?

Does this scap config take effect in the source as well as the target of the deploy?

The cached revs are only used on the target, not on the source.

We're having a problem where the cached revs are not removed in the case of a scap failure. Occasionally we run out of disk space on a target, scap deploy fails, we free up space, deploy again, and then there are more cached revs than there should be.
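One mitigation worth considering (an assumption on my part, not an existing scap feature) is a pre-deploy free-space check on the target, so a deploy fails fast instead of dying mid-rsync and leaving stale cached revs behind:

```shell
#!/bin/sh
# Sketch of a pre-deploy headroom check; refinery is ~15G per the notes
# above, so a target needs at least that much free for a new cached rev.
need_kb=$((15 * 1024 * 1024))
# Demo against /tmp; on a real target this would be /srv.
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "not enough space on the target for another cached rev" >&2
fi
```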