articlequality repo mirroring is broken
Closed, Resolved, Public

Description

We have several repos, each containing large machine learning model files, and they are all configured to mirror *from* GitHub to Phabricator, and from there to Gerrit. This is a historical arrangement that lets us collaborate on GitHub but still deploy on WMF production. The mirroring has been misbehaving lately, often failing to push LFS objects through (see T212818), but the other repos have at least been partially usable. Just a few days ago, I discovered that the "articlequality" repo is only mirroring to Phabricator, and at some point the Gerrit repo was blanked. We need help resolving this issue. Please compare a working repo:
https://phabricator.wikimedia.org/source/editquality/
with the broken repo
https://phabricator.wikimedia.org/source/articlequality/

Here you can see that the repo is empty:
https://gerrit.wikimedia.org/g/scoring/ores/articlequality


Config at https://phabricator.wikimedia.org/source/articlequality/manage/uris/ has:

Information: This repository is hosted remotely. Phabricator is observing it.

That is, Phabricator observes GitHub wikimedia/articlequality and mirrors it to Gerrit scoring/ores/articlequality.

The basics https://phabricator.wikimedia.org/source/articlequality/manage/basics/ says:

Update Frequency 1 h, 40 m
Storage Directory OK /srv/repos
Working Copy OK /srv/repos/1914/
Updates OK Last updated Mon, Jan 7, 2:15 PM (1 h, 41 m ago).

See the comments below: Gerrit probably just garbage-collected the objects, possibly because the references were not pushed.

Event Timeline

awight triaged this task as High priority. Jan 4 2019, 6:41 PM

From the Gerrit log for git garbage collection:

gc_log:[2019-01-05 03:48:46,483] [WorkQueue-1] INFO : [scoring/ores/articlequality]

gc config: gc.aggressive=true;
pack config: maxDeltaDepth=50, deltaSearchWindowSize=10, deltaSearchMemoryLimit=0, deltaCacheSize=52428800, deltaCacheLimit=100, compressionLevel=-1, indexVersion=2, bigFileThreshold=52428800, threads=0, reuseDeltas=true, reuseObjects=true, deltaCompress=true, buildBitmaps=true, bitmapContiguousCommitCount=100, bitmapRecentCommitCount=20000, bitmapRecentCommitSpan=100, bitmapDistantCommitSpan=5000, bitmapExcessiveBranchCount=100, bitmapInactiveBranchAge=90, singlePack=false

Configuration:

maxDeltaDepth                 50
deltaSearchWindowSize         10
deltaSearchMemoryLimit        0
deltaCacheSize                52428800
deltaCacheLimit               100
compressionLevel              -1
indexVersion                  2
bigFileThreshold              52428800
threads                       0
reuseDeltas                   true
reuseObjects                  true
deltaCompress                 true
buildBitmaps                  true
bitmapContiguousCommitCount   100
bitmapRecentCommitCount       20000
bitmapRecentCommitSpan        100
bitmapDistantCommitSpan       5000
bitmapExcessiveBranchCount    100
bitmapInactiveBranchAge       90
singlePack                    false

before: sizeOfPackedObjects=614327754, sizeOfLooseObjects=0, numberOfPackedObjects=2271, numberOfPackFiles=2, numberOfPackedRefs=1, numberOfLooseRefs=0, numberOfLooseObjects=0
after:  sizeOfPackedObjects=279, sizeOfLooseObjects=595472479, numberOfPackedObjects=3, numberOfPackFiles=1, numberOfPackedRefs=1, numberOfLooseRefs=0, numberOfLooseObjects=2268

Human friendly version:

                        before      after
sizeOfPackedObjects     614327754   279
sizeOfLooseObjects      0           595472479
numberOfPackedObjects   2271        3
numberOfPackFiles       2           1
numberOfPackedRefs      1           1
numberOfLooseRefs       0           0
numberOfLooseObjects    0           2268
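
To make the before/after comparison easier to check, here is a small sketch that parses the two stat lines from the gc log and adds up packed and loose objects on each side. The helper name `gc_stat` is made up for this example; the numbers are copied from the log above.

```shell
#!/bin/sh
# Stat lines copied verbatim from the Gerrit gc log.
before='sizeOfPackedObjects=614327754, sizeOfLooseObjects=0, numberOfPackedObjects=2271, numberOfPackFiles=2, numberOfPackedRefs=1, numberOfLooseRefs=0, numberOfLooseObjects=0'
after='sizeOfPackedObjects=279, sizeOfLooseObjects=595472479, numberOfPackedObjects=3, numberOfPackFiles=1, numberOfPackedRefs=1, numberOfLooseRefs=0, numberOfLooseObjects=2268'

# gc_stat LINE KEY: print the value of KEY from a "key=value, key=value" line.
# (Hypothetical helper, not part of git or Gerrit.)
gc_stat() {
    printf '%s\n' "$1" | tr -d ' ' | tr ',' '\n' |
        awk -F= -v key="$2" '$1 == key { print $2 }'
}

packed_before=$(gc_stat "$before" numberOfPackedObjects)
loose_before=$(gc_stat "$before" numberOfLooseObjects)
packed_after=$(gc_stat "$after" numberOfPackedObjects)
loose_after=$(gc_stat "$after" numberOfLooseObjects)

total_before=$((packed_before + loose_before))
total_after=$((packed_after + loose_after))

echo "objects before gc: $total_before (packed $packed_before, loose $loose_before)"
echo "objects after gc:  $total_after (packed $packed_after, loose $loose_after)"
```

The totals match (2271 on both sides), so according to this log entry the run moved 2268 objects out of the packs and left them loose, keeping only 3 packed.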

And indeed looking on the server:

$ cd /srv/gerrit/git/scoring/ores/articlequality.git
$ git count-objects -vH
count: 2268
size: 575.61 MiB
in-pack: 2271
packs: 2
size-pack: 585.93 MiB
prune-packable: 2268
garbage: 0
size-garbage: 0 bytes

There are no references?

$ git ls-remote .
887c405ec37c7968c5a4a08373f65f53ba8526c7	refs/meta/config

So it is not replicating (there are no references), and git gc happily removed a lot of objects, possibly because a force push left many objects loose.
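
For anyone wanting to reproduce the suspected mechanism locally, here is a minimal sketch. It assumes plain command-line git as a stand-in; Gerrit runs JGit's garbage collector, which may differ in detail, and the temporary repositories and file names are invented for the demo. It builds a bare copy of a one-commit repo, deletes every ref, and then runs gc with immediate pruning.

```shell
#!/bin/sh
set -eu
demo=$(mktemp -d)

# Source repository with a single commit (stand-in for the upstream repo).
git init -q "$demo/src"
printf 'model\n' > "$demo/src/model.txt"
git -C "$demo/src" add model.txt
git -C "$demo/src" -c user.name=demo -c user.email=demo@example.org \
    commit -q -m 'add model'

# Bare copy (stand-in for the mirror target).
git clone -q --bare "$demo/src" "$demo/mirror.git"

# Delete every ref so that no object is reachable any more.
git -C "$demo/mirror.git" for-each-ref --format='%(refname)' |
while read -r ref; do
    git -C "$demo/mirror.git" update-ref -d "$ref"
done
refs_left=$(git -C "$demo/mirror.git" for-each-ref | wc -l | tr -d ' ')

# Garbage collect, pruning unreachable objects immediately.
git -C "$demo/mirror.git" gc -q --prune=now

# Sum loose ("count") and packed ("in-pack") objects that survived.
objects_left=$(git -C "$demo/mirror.git" count-objects -v |
    awk '/^count:|^in-pack:/ { n += $2 } END { print n + 0 }')

echo "refs left: $refs_left, objects left: $objects_left"
```

With no refs pointing at the history, gc has nothing to keep, which is consistent with the diagnosis above that missing references, not gc itself, are the root problem.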

There are a couple of reasons we deploy from Gerrit:

  • rely solely on our own infrastructure to deploy, e.g. when GitHub is unavailable for some reason (outage, rate limit, network issue, whatever)
  • limit the attack surface to Gerrit and WMCS authentication (in case a privileged GitHub account is compromised, which has happened at least a couple of times). Side effect: we have full forensic logs.

I would highly recommend migrating to developing directly on Gerrit. That would simplify the system by avoiding two replications. Given that you are using git-lfs, the repository would at least be at a manageable size and would fit just fine in Gerrit.

@hashar Thanks for all the helpful diagnostics! I pushed commits straight from my local repo to gerrit, and created a new reference to master in gerrit. This seems to have worked.
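
For reference, the workaround amounts to something like the sketch below. It is self-contained and uses a local bare repository as a stand-in for the blanked Gerrit repo; the remote name `gerrit`, the paths, and the demo commit are assumptions made for the example, not the commands that were actually run. LFS objects are not covered here.

```shell
#!/bin/sh
set -eu
demo=$(mktemp -d)

# Stand-in for the blanked Gerrit repository: a bare repo with no branches.
git init -q --bare "$demo/gerrit.git"

# Stand-in for the developer's local clone, with one commit.
git init -q "$demo/local"
printf 'model\n' > "$demo/local/model.txt"
git -C "$demo/local" add model.txt
git -C "$demo/local" -c user.name=demo -c user.email=demo@example.org \
    commit -q -m 'add model'

# Push the local history and create refs/heads/master on the remote.
git -C "$demo/local" remote add gerrit "$demo/gerrit.git"
git -C "$demo/local" push -q gerrit HEAD:refs/heads/master

# Confirm that the remote now has a master ref pointing at the local commit.
local_head=$(git -C "$demo/local" rev-parse HEAD)
remote_master=$(git -C "$demo/gerrit.git" rev-parse refs/heads/master)
echo "local HEAD:    $local_head"
echo "remote master: $remote_master"
```

Pushing to an explicit `refs/heads/master` is what recreates the missing reference, so the pushed objects are reachable again and should survive the next gc run.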

Once deployment proceeds successfully, I'll close this task.

awight claimed this task.

I was able to work around this by pushing directly to Gerrit. If it happens again, I'll reuse this task or T212818.

I have filed that as T213246, which would disable the automatic sync from GitHub to Gerrit. Let's follow up on that task.