
issues with artifact cache in an-coord1001
Closed, ResolvedPublic

Description

When deploying from deployment1001 to an-coord1001 we got an error because the target's disk was full.

Error looks like:

Unhandled error:
deploy-local failed: <ErrorReturnCode_11> {u'full_cmd': u'/usr/bin/git fat pull', u'stderr': u'\nrsync: write failed on "/srv/deployment/analytics/refinery-cache/revs/4e9894c0db04ee39be0f094d0ef11ebbff198834/.git/fat/objects/ecfba6edeb07a17a0d1b09981aff917e068d146c": No space left on device (28)\nrsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2]\nrsync error: error in file IO (code 11) at io.c(1642) [sender=3.1.2]\n', u'stdout': u'receiving file list ... \n 100 files...\r182 files to consider\n003d9b4

It looks like some artifacts might not be getting deleted when they should be, as our config is set to keep 2 cached revs:

nuria@deploy1001:/srv/deployment/analytics/refinery/scap$ more scap.cfg
[global]
git_repo: analytics/refinery
git_deploy_dir: /srv/deployment
git_repo_user: analytics-deploy
ssh_user: analytics-deploy
server_groups: canary, default
canary_dsh_targets: target-canary
dsh_targets: targets
git_submodules: False
git_fat: True
cache_revs: 2

Does this scap config take effect in the source as well as the target of the deploy?

Event Timeline

This also often affects other hosts with relatively small /srv partitions, like notebook* hosts.

Can Release Engineering chime in on whether the scap config settings should also cause artifacts to be deleted from the target of the deploy?

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Mentioned in SAL (#wikimedia-analytics) [2019-07-11T20:31:54Z] <ottomata> resized /srv on an-coord1001 from 60G to 115G - T227132

I believe this happens on an-coord1001 and the notebook* hosts because their /srv partitions are relatively small. When the disk fills up during a scap deploy, the deploy aborts without removing old cached deploys; subsequent successful deploys do remove them.
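A quick way to confirm the leftover-revs theory on a host is to count and size the entries under the cache's revs/ directory (a diagnostic sketch: the demo below runs against a throwaway tree; on a real target, point revs_dir at /srv/deployment/analytics/refinery-cache/revs, the path from the rsync error in the description):

```shell
#!/bin/sh
# Diagnostic sketch: count cached revs and see which are largest.
# Here revs_dir is a throwaway demo tree, not the real cache path.
revs_dir=$(mktemp -d)
mkdir "$revs_dir/rev-a" "$revs_dir/rev-b" "$revs_dir/rev-c"
count=$(ls -1 "$revs_dir" | wc -l)
echo "cached revs: $count (cache_revs is 2, so 3 or more means leftovers)"
du -sh "$revs_dir"/* | sort -rh   # largest cached revs first
```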

We had some room in an-coord1001's volume group, so I resized /srv from 60G to 115G. refinery is 15G, so hopefully an-coord1001 will be fine from now on.

On notebooks, however, /srv is used for both deployment and user /home dirs, and users often fill up /srv with their homes on notebook hosts. I'm not sure what we can do about that problem there other than implementing usage quotas, expanding disk hardware, or building a distributed Hadoop-based notebook project with the goal of dropping support for individual notebook hosts.

Nuria moved this task from Ready to Deploy to In Progress on the Analytics-Kanban board.

Let's figure out how to deploy only what we need to notebook hosts

Change 522542 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/scap@master] Deploy to notebook hosts as a separate environment to avoid deploying artifacts

https://gerrit.wikimedia.org/r/522542

Change 522542 merged by Ottomata:
[analytics/refinery/scap@master] Deploy to notebook hosts as a separate environment to avoid deploying artifacts

https://gerrit.wikimedia.org/r/522542

Change 523171 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/scap@master] Set deploy targets for notebook environment

https://gerrit.wikimedia.org/r/523171

Change 523171 merged by Ottomata:
[analytics/refinery/scap@master] Set deploy targets for notebook environment

https://gerrit.wikimedia.org/r/523171

I've manually removed a bunch of old refinery artifact jar versions from the refinery deploy on notebook hosts to free up space.
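For reference, that kind of manual cleanup can be sketched like this (the exact commands and age cutoff used on the notebook hosts are not recorded in this task; the paths and the 90-day threshold below are illustrative, and the demo runs against a throwaway tree):

```shell
#!/bin/sh
# Illustrative sketch of removing old artifact jars from cached revs to
# free space; the temp tree stands in for
# /srv/deployment/analytics/refinery-cache/revs.
cache=$(mktemp -d)
mkdir -p "$cache/revs/old-rev" "$cache/revs/new-rev"
touch -d '2019-01-01' "$cache/revs/old-rev/refinery-job.jar"   # stale artifact
touch "$cache/revs/new-rev/refinery-job.jar"                   # current artifact
# Delete jars not modified in the last 90 days (illustrative cutoff).
find "$cache/revs" -name '*.jar' -mtime +90 -delete
```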

Can Release Engineering chime in on whether the scap config settings should also cause artifacts to be deleted from the target of the deploy?

I'm not 100% sure what you are asking here, but I'll take my best stab at it:

When cleaning up any old revisions from the deployment cache, scap completely removes the directory, so that should remove artifacts as well.
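That cleanup behaviour can be sketched roughly like this (a minimal stand-in for what scap does, not its actual code):

```shell
#!/bin/sh
# Keep the newest $keep revision directories in a cache's revs/ dir and
# remove the rest wholesale, artifacts and git-fat objects included.
prune_revs() {
    revs_dir=$1
    keep=$2
    # Newest first by mtime; everything after the first $keep goes.
    ls -1t "$revs_dir" | tail -n +"$((keep + 1))" | while read -r rev; do
        rm -rf "${revs_dir:?}/$rev"
    done
}

# Demo with cache_revs: 2 on a throwaway tree.
demo=$(mktemp -d)
for i in 1 2 3 4; do
    mkdir "$demo/rev$i"
    touch -d "2019-07-0$i" "$demo/rev$i"   # rev4 is newest
done
prune_revs "$demo" 2
```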

Does that answer your question or did I misunderstand?

Does this scap config take effect in the source as well as the target of the deploy?

The cached revs are only used on the target, not on the source.

We're having a problem where the cached revs are not removed in the case of a scap failure. Occasionally we run out of disk space on a target, scap deploy fails, we free up space, deploy again, and then there are more cached revs than there should be.
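One mitigation worth considering (an assumption on my part, not an existing scap feature) is a pre-deploy free-space check on the target, so a deploy fails fast instead of dying mid-rsync and leaving stale cached revs behind:

```shell
#!/bin/sh
# Sketch of a pre-deploy headroom check; refinery is ~15G per the notes
# above, so a target needs at least that much free for a new cached rev.
need_kb=$((15 * 1024 * 1024))
# Demo against /tmp; on a real target this would be /srv.
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "not enough space on the target for another cached rev" >&2
fi
```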