Page MenuHomePhabricator

Scap deploy failed to sync git-fat artifacts
Closed, ResolvedPublic

Description

A recent (the first) deploy of https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/cassandra-twcs to the restbase clusters (prod. and staging), resulted in two failures to properly sync git-fat artifacts.

See: https://phabricator.wikimedia.org/T133395#2705719

Details

Revisions and Commits

Event Timeline

Was this the deploy today?

I looked at the log for /srv/deployment/cassandra/twcs (using scap deploy-log -v) and didn't see praseodymium.eqiad.wmnet or xenon.eqiad.wmnet listed there. The latest log looks like it was to the default environment rather than -e staging.

Do you happen to have the output of the deploy handy?

Was this the deploy today?

I looked at the log for /srv/deployment/cassandra/twcs (using scap deploy-log -v) and didn't see praseodymium.eqiad.wmnet or xenon.eqiad.wmnet listed there. The latest log looks like it was to the default environment rather than -e staging.

Do you happen to have the output of the deploy handy?

My understanding is that the deploy happened as the result of merging https://gerrit.wikimedia.org/r/313892; TBK, no one manually invoked a deploy (ala scap deploy) here; This was the first ever deploy from this repo.

thcipriani triaged this task as Medium priority.Oct 11 2016, 4:49 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.

My understanding is that the deploy happened as the result of merging https://gerrit.wikimedia.org/r/313892; TBK, no one manually invoked a deploy (ala scap deploy) here; This was the first ever deploy from this repo.

Ah, I see. Hrm, digging in logstash it shows that the git fat pull was attempted for praseodymium

Git fat pull '/srv/deployment/cassandra/twcs-cache/revs/0b0c8385b5ea05aa57d77ce42e239a86cf16b2e6'

https://logstash.wikimedia.org/goto/43a2728b1c37b4bd2a24aa0c36d1c7bc

Will do some more digging.

thcipriani raised the priority of this task from Medium to High.Dec 16 2016, 10:55 PM

Raising the priority since this is still happening. Again, all the logs say is that the pull was initiated:

Git fat pull '/srv/deployment/cassandra/twcs-cache/revs/3720e7af94976a802c476213fbad66752f1f90e8'

but nothing else which must mean that git fat pull is exiting 0 even when it fails to actually pull anything which is always indicative of good software :(

I managed to make the scap-vagrant box reproduce this (although I don't have steps to reproduce, just tried deploys until a large binary failed to hydrate).

The files are making it into the repository under .git/fat/objects they are just failing to materialize in the working directory.

What git-fat is doing internally is searching for any file in the git index with the "magic size" of 74 bytes, checking to see if that file starts with the "cookie" #$# git-fat and then attempts to run a git checkout-index --index --force [filename] to trigger the git smudge filter behavior.

The smudge behavior in the .git/config is setup to run git fat filter-smudge which should find the large binary file under .git/fat/objects and put it into place in the directory tree. The problem is the the smudge filter isn't triggered by git checkout-index --index --force [filename] because, I think, git thinks the smudge filter has already run (it has, for checkout) and won't re-run the smudge filter without modifications to the working tree to make it differ from the index.

git-fat attempts to work around this by, effectively, touching the file: https://github.com/jedbrown/git-fat/blob/master/git-fat#L409-L418; however, since all this happens very quickly, this touch doesn't seem to be enough to invalidate the cache.

Here you can see the problem in action: I stat a file, run a git checkout-index to trigger a smudge, but the file stat remains the same. After touching the file, a git checkout-index triggers the smudge.

vagrant@scap-target$ stat big1.bin
  File: ‘big1.bin’
  Size: 74              Blocks: 8          IO Block: 4096   regular file
Device: 36h/54d Inode: 28998       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2016-12-19 03:15:28.940999952 +0000
Modify: 2016-12-19 03:15:28.912999952 +0000
Change: 2016-12-19 03:15:28.912999952 +0000
 Birth: -
vagrant@scap-target$ git checkout-index --index --force big1.bin                   
vagrant@scap-target$ stat big1.bin
  File: ‘big1.bin’
  Size: 74              Blocks: 8          IO Block: 4096   regular file
Device: 36h/54d Inode: 28998       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2016-12-19 03:15:28.940999952 +0000
Modify: 2016-12-19 03:15:28.912999952 +0000
Change: 2016-12-19 03:15:28.912999952 +0000
 Birth: -
vagrant@scap-target$ touch big1.bin
vagrant@scap-target$ git checkout-index --index --force big1.bin                   
vagrant@scap-target$ stat big1.bin                                                 
  File: ‘big1.bin’
  Size: 10485760        Blocks: 20480      IO Block: 4096   regular file
Device: 36h/54d Inode: 29104       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2016-12-19 04:37:19.476999952 +0000
Modify: 2016-12-19 04:37:19.476999952 +0000
Change: 2016-12-19 04:37:19.476999952 +0000
 Birth: -

I'm not sure what the "right" fix is for this inside scap is. Certainly an ugly bug.

[ ... ]

I'm not sure what the "right" fix is for this inside scap is. Certainly an ugly bug.

Nice analysis!

To add to last weekend's event about twcs repo, at the time I forced git fat pull on restbase1* nodes after finding some were not updated:

root@neodymium:~# salt 'restbase1*' --out=txt cmd.run 'ls -la /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar'
restbase1016.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 13 19:16 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1018.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1009.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1011.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 18 07:02 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1013.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 18 07:02 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1017.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 15 16:08 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1007.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1012.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1008.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1015.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1014.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
restbase1010.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
thcipriani added a revision: Restricted Differential Revision.Dec 20 2016, 7:41 PM

Change 333300 had a related patch set uploaded (by Thcipriani):
Invalidate git index cache before smudging

https://gerrit.wikimedia.org/r/333300

Change 333300 merged by Chad:
Invalidate git index cache before smudging

https://gerrit.wikimedia.org/r/333300

Ottomata claimed this task.
Ottomata subscribed.

Oops, wrong ticket, meant to close T155856.

With the new version of git-fat deployed, this should now be fixed!

Please, reopen if this recurs.