A recent (the first) deploy of https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/cassandra-twcs to the restbase clusters (prod. and staging), resulted in two failures to properly sync git-fat artifacts.
Description
Details
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Invalidate git index cache before smudging | operations/debs/git-fat | master | +8 -3 |
Revisions and Commits
| Restricted Differential Revision |
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | Ottomata | T147856 Scap deploy failed to sync git-fat artifacts | |||
| Resolved | Ottomata | T155856 Package + deploy new version of git-fat |
Event Timeline
Was this the deploy today?
I looked at the log for /srv/deployment/cassandra/twcs (using scap deploy-log -v) and didn't see praseodymium.eqiad.wmnet or xenon.eqiad.wmnet listed there. The latest log looks like it was to the default environment rather than -e staging.
Do you happen to have the output of the deploy handy?
My understanding is that the deploy happened as the result of merging https://gerrit.wikimedia.org/r/313892; TBK, no one manually invoked a deploy (ala scap deploy) here; This was the first ever deploy from this repo.
Ah, I see. Hrm, digging in logstash it shows that the git fat pull was attempted for praseodymium
Git fat pull '/srv/deployment/cassandra/twcs-cache/revs/0b0c8385b5ea05aa57d77ce42e239a86cf16b2e6'
https://logstash.wikimedia.org/goto/43a2728b1c37b4bd2a24aa0c36d1c7bc
Will do some more digging.
Raising the priority since this is still happening. Again, all the logs say is that the pull was initiated:
Git fat pull '/srv/deployment/cassandra/twcs-cache/revs/3720e7af94976a802c476213fbad66752f1f90e8'
but nothing else which must mean that git fat pull is exiting 0 even when it fails to actually pull anything which is always indicative of good software :(
I managed to make the scap-vagrant box reproduce this (although I don't have steps to reproduce, just tried deploys until a large binary failed to hydrate).
The files are making it into the repository under .git/fat/objects they are just failing to materialize in the working directory.
What git-fat is doing internally is searching for any file in the git index with the "magic size" of 74 bytes, checking to see if that file starts with the "cookie" #$# git-fat and then attempts to run a git checkout-index --index --force [filename] to trigger the git smudge filter behavior.
The smudge behavior in the .git/config is setup to run git fat filter-smudge which should find the large binary file under .git/fat/objects and put it into place in the directory tree. The problem is the the smudge filter isn't triggered by git checkout-index --index --force [filename] because, I think, git thinks the smudge filter has already run (it has, for checkout) and won't re-run the smudge filter without modifications to the working tree to make it differ from the index.
git-fat attempts to work around this by, effectively, touching the file: https://github.com/jedbrown/git-fat/blob/master/git-fat#L409-L418; however, since all this happens very quickly, this touch doesn't seem to be enough to invalidate the cache.
Here you can see the problem in action: I stat a file, run a git checkout-index to trigger a smudge, but the file stat remains the same. After touching the file, a git checkout-index triggers the smudge.
vagrant@scap-target$ stat big1.bin File: ‘big1.bin’ Size: 74 Blocks: 8 IO Block: 4096 regular file Device: 36h/54d Inode: 28998 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant) Access: 2016-12-19 03:15:28.940999952 +0000 Modify: 2016-12-19 03:15:28.912999952 +0000 Change: 2016-12-19 03:15:28.912999952 +0000 Birth: - vagrant@scap-target$ git checkout-index --index --force big1.bin vagrant@scap-target$ stat big1.bin File: ‘big1.bin’ Size: 74 Blocks: 8 IO Block: 4096 regular file Device: 36h/54d Inode: 28998 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant) Access: 2016-12-19 03:15:28.940999952 +0000 Modify: 2016-12-19 03:15:28.912999952 +0000 Change: 2016-12-19 03:15:28.912999952 +0000 Birth: - vagrant@scap-target$ touch big1.bin vagrant@scap-target$ git checkout-index --index --force big1.bin vagrant@scap-target$ stat big1.bin File: ‘big1.bin’ Size: 10485760 Blocks: 20480 IO Block: 4096 regular file Device: 36h/54d Inode: 29104 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant) Access: 2016-12-19 04:37:19.476999952 +0000 Modify: 2016-12-19 04:37:19.476999952 +0000 Change: 2016-12-19 04:37:19.476999952 +0000 Birth: -
I'm not sure what the "right" fix is for this inside scap is. Certainly an ugly bug.
[ ... ]
I'm not sure what the "right" fix is for this inside scap is. Certainly an ugly bug.
Nice analysis!
To add to last weekend's event about twcs repo, at the time I forced git fat pull on restbase1* nodes after finding some were not updated:
root@neodymium:~# salt 'restbase1*' --out=txt cmd.run 'ls -la /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar' restbase1016.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 13 19:16 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1018.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1009.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1011.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 18 07:02 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1013.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 18 07:02 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1017.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 15 16:08 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1007.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 24507 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1012.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1008.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1015.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1014.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar restbase1010.eqiad.wmnet: -rw-r--r-- 1 deploy-service deploy-service 74 Dec 16 22:39 /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar
Change 333300 had a related patch set uploaded (by Thcipriani):
Invalidate git index cache before smudging
Done! https://apt.wikimedia.org/wikimedia/pool/main/g/git-fat/
Reopen if it doesn't work! :)
With the new version of git-fat deployed, this should now be fixed!
Please, reopen if this recurs.