
Create mirror of Gerrit repositories for consumption by various tools
Closed, Resolved · Public

Description

When looking at Gerrit issues during spring 2019 (T221026) we noticed a lot of git operations emanating from various tools on WMCS.

They are mostly reads, and there is no firm indication that the batch requests are overloading Gerrit. But given the load they impose, it might be wise to shift it to a mirror. Notably, these tools do not need to be 100% up to date with the master and can tolerate the slight delay incurred by replication.

Relevant extracts from T221026:

@mmodell wrote:

From looking at http requests per minute in javamelody, over 1 year, I see that traffic has increased a lot:

https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=httpHitsRate (HTTP hits per minute):

Screenshot from 2019-04-16 19-22-21.png (517×1 px, 27 KB)

Updated yearly view on Sep 25th:

Gerrit-yearly-httpHitsRate.png (499×1 px, 19 KB)

@thcipriani pointed out the mean stays identical, but the max grew in March 2019 from roughly 4k/minute to 6k/minute.

@hashar proposed: Would it make sense to set up a read-only replica such as git.wikimedia.org to offload Gerrit? The bots/scripts running on WMCS could easily be made to point to that mirror. And listed:

Out of 623k HTTPS requests in the April 17th access logs:

Requests   IP                          DNS PTR
84110      172.16.1.221                codesearch4.codesearch.eqiad.wmflabs.
69921      2620:0:861:102:10:64:16:8   phab1001.eqiad.wmnet.
51736      172.16.1.85                 extdist-02.extdist.eqiad.wmflabs.
51736      172.16.1.84                 extdist-01.extdist.eqiad.wmflabs.
51676      172.16.1.86                 extdist-03.extdist.eqiad.wmflabs.
16465      172.16.5.187                integration-slave-docker-1051
16116      172.16.5.162                integration-slave-docker-1048
14709      172.16.5.181                integration-slave-docker-1050
13660      172.16.1.36                 integration-slave-docker-1041
13579      172.16.0.26                 integration-slave-docker-1054
12990      172.16.6.184                integration-slave-docker-1043
12909      172.16.3.86                 integration-slave-docker-1040
11672      172.16.7.168                integration-slave-docker-1034
10847      172.16.5.190                integration-slave-docker-1052
9705       xxxxx                       some public internet IP
8786       172.16.3.87                 integration-slave-docker-1037

Probably codesearch ( https://codesearch.wmflabs.org/ ), Phabricator and extdist ( https://www.mediawiki.org/wiki/Extension:ExtensionDistributor ) could be moved to use a mirror.

The CI slaves do hammer Gerrit :-/

Note that this counts any HTTP request, not just git-upload-pack. But the result is similar when filtering for upload-pack.
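
For reference, a per-IP breakdown restricted to upload-pack could be produced with a pipeline along these lines (log file name and field number are assumptions modelled on the log excerpts further down, not a guaranteed log format):

# Count git-upload-pack requests per client IP (assumes a tab-separated
# log with the client IP in field 2; adjust to the real layout).
grep 'service=git-upload-pack' gerrit.wikimedia.org.https.access.log \
  | cut -f2 | sort | uniq -c | sort -rn | head -n 20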

Not taken into account: git fetches from the zuul-mergers, which are done over SSH with the jenkins-bot user.

Event Timeline

Candidates to be switched to a new Git mirror would be:

  • Phabricator. Or maybe we can have Gerrit replicate to the Phabricator host and let Phabricator consume from the local replica.
  • CI, which is hammering Gerrit on purpose but might potentially be made to use a replica (unsure)

The git repositories on /srv/gerrit take 25 GB. Some of them should NOT be replicated though, such as at least All-Users.git, and probably the few private repositories we have should not be either. But I guess the Gerrit replication mechanism is already smart enough to not replicate private repositories.

I would guess we can use some Ganeti VM and then pick a DNS entry (twist: git.wikimedia.org is already used as a redirect to Diffusion). But maybe we can do some evil path routing in Varnish to redirect git requests to the instance hosting the mirrors.

The CI slaves do hammer Gerrit :-/

Couldn’t they use a mirror as well? If they fetch a specific ref (refs/changes/ef/abcdef/xy), and we don’t expect the target of that ref to ever change (the next patch set instead increments the xy counter), then they can try to fetch from a mirror and sleep-and-retry until the ref is found, no? (I’m assuming that the delay would usually not be more than a few seconds – if it’s more like a few minutes, then this would probably delay CI too much.)
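
A minimal sketch of that sleep-and-retry idea, run inside an existing clone (the mirror URL and change ref below are made up for illustration):

MIRROR=https://gerrit-mirror.example.org/r/mediawiki/core   # hypothetical mirror URL
REF=refs/changes/45/512345/3                                # hypothetical patch set ref
until git fetch "$MIRROR" "$REF"; do
    sleep 5   # replica has not caught up yet, retry shortly
done
git checkout FETCH_HEAD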

(edit: I hadn’t seen @hashar’s comment above yet when I wrote this)

The delay would be just a few seconds for sure; the devil is that the CI jobs fetch branches (refs/heads/*), which thus change constantly. At the time a change gets merged, there is a race condition between the change being replicated and CI jobs that run assuming the head has already been updated.

@hashar: we already have a read-only replica of most repositories on phabricator. Is that not satisfactory? It's kept up to date, with a lag of just a few minutes in the worst case.

hashar triaged this task as Medium priority. Jun 24 2019, 1:46 PM

@hashar: we already have a read-only replica of most repositories on phabricator. Is that not satisfactory? It's kept up to date, with a lag of just a few minutes in the worst case.

My previous experience was that Phabricator replicas were extremely slow to update (on the scale of hours) for less frequently updated repositories because it was polling. Is that no longer the case?

(gah, sorry, fighting my own herald rule)

We can use the Gerrit slave feature for this (i.e. a read-only instance), with gerrit2001 being the slave.

See https://gerrit.googlesource.com/homepage/+/md-pages/docs/Scaling.md
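
For illustration, a Gerrit instance of that era can be started in read-only slave mode roughly like this (the site path is an assumption; the actual setup on gerrit2001 will differ):

java -jar gerrit.war daemon -d /srv/gerrit --slave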

@hashar: we already have a read-only replica of most repositories on phabricator. Is that not satisfactory? It's kept up to date, with a lag of just a few minutes in the worst case.

My previous experience was that Phabricator replicas were extremely slow to update (on the scale of hours) for less frequently updated repositories because it was polling. Is that no longer the case?

From the table above, phab1001.eqiad.wmnet shows up as one of the top users. So I guess that is the polling of all repositories to update Diffusion / the read-only mirror. Potentially we could have Phabricator use a git mirror instead of hitting the master Gerrit.


Of course, I do not have many metrics or much info to explain the rise in HTTPS requests per minute, nor whether it is actually a problem (but it might be).

hashar added a subscriber: MoritzMuehlenhoff.

OK, I set up a Gerrit mirror today. Clone URLs are https://ggmirror.wmflabs.org/git/<gerrit name>.git. And https://ggmirror.wmflabs.org/cgit/ as a web view/debugger. It will git fetch every 60 minutes.

codesearch is now pointing at the mirror, so all of that traffic should disappear (and the mirror should use less traffic, I hope). If the mirror handles that amount of traffic well, I'll start switching some more services over to it (extdist, libup).

Thanks for setting that up for your tools, @Legoktm !

I think we probably still want a mirror in production that can be used for other (production) things, e.g. Phabricator. And also to have a production-grade host for this so it's not impacted by any unforeseen stability issues in WMCS.

I think we probably still want a mirror in production that can be used for other (production) things, e.g. Phabricator. And also to have a production-grade host for this so it's not impacted by any unforeseen stability issues in WMCS.

Agreed 100%.

OK, I set up a Gerrit mirror today. Clone URLs are https://ggmirror.wmflabs.org/git/<gerrit name>.git. And https://ggmirror.wmflabs.org/cgit/ as a web view/debugger. It will git fetch every 60 minutes.

codesearch is now pointing at the mirror, so all of that traffic should disappear (and the mirror should use less traffic, I hope). If the mirror handles that amount of traffic well, I'll start switching some more services over to it (extdist, libup).

httpHitsRate-2019-08-03.png (499×1 px, 19 KB)

This made a huge difference in traffic volume! Thanks for doing this @Legoktm :)

This might have broken Diffusion mirrors: T229756.

There is now a replica of gerrit in codfw that can be used to clone from:

example:

git clone https://gerrit-replica.wikimedia.org/r/operations/puppet
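
For an existing checkout, the remote can simply be repointed at the replica (reusing the example repository above):

git remote set-url origin https://gerrit-replica.wikimedia.org/r/operations/puppet
git fetch origin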

@hashar Does this resolve the ticket or is it part of it as well to switch a lot of tools to using it?

How up to date is gerrit-replica? Is it updated immediately after gerrit (master)? Thanks.

I think possibly a few minutes' delay. (It runs on one thread.)

Replication should work again now. There was a syntax issue in the config that has been fixed.

@hashar Does this resolve the ticket or is it part of it as well to switch a lot of tools to using it?

This task asked to create a read-only mirror of Gerrit repositories and that part is indeed fulfilled now (via https://gerrit-replica.wikimedia.org/r/ ).

Left todo:

  • Migrate the few tools that have been identified in this task: codesearch, extdist, wikifarm.pluggableauth.eqiad.wmflabs, Phabricator, etc.
  • Phase out the transient https://ggmirror.wmflabs.org/git/ which was/is pulling every repo every hour
  • Update doc, or at least advertise the new replica

And I guess we are set :]

@hashar phabricator has been migrated to use the replica :)

Codesearch and extdist also use the replica (done by @Legoktm).

Apparently extdist is still reaching out to gerrit.wikimedia.org over https from at least:

  • extdist-04.extdist.eqiad.wmflabs.
  • extdist-05.extdist.eqiad.wmflabs.
  • extdist-01.extdist.eqiad.wmflabs.

@Legoktm can you revisit the extdist configuration on labs and have it point to gerrit-replica.wikimedia.org instead?

And there is still wikifarm.pluggableauth.eqiad.wmflabs., but I have no idea what that one is :-\

(note to self: gotta verify whether those https hits are actually git requests, they might be regular API traffic)
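
(One quick way to check, assuming the same log layout as the excerpts further down, would be something like:

grep 172.16.1.84 gerrit.wikimedia.org.https.access.log | grep -c 'service=git-upload-pack'
grep 172.16.1.84 gerrit.wikimedia.org.https.access.log | grep -vc 'service=git-upload-pack'

which counts git-upload-pack hits versus everything else for one of the extdist hosts.)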

@hashar phabricator has been migrated to use the replica :)

For projects on Phabricator that are "observing" Gerrit, I moved all of them. Some of them do mirroring; I did not update those. For posterity's sake, the script I used to update Phab URIs is P8857.

The yearly view has dramatically improved since mid-July:

Gerrit-yearly-httpHitsRate.png (499×1 px, 19 KB)

Requests left

Hits    IP            Host
14421   172.16.2.41   wikifarm.pluggableauth.eqiad.wmflabs.
3686    172.16.1.84   extdist-01.extdist.eqiad.wmflabs.
3679    172.16.0.54   extdist-04.extdist.eqiad.wmflabs.
3677    172.16.0.56   extdist-05.extdist.eqiad.wmflabs.

Wikifarm

Looking at requests made by wikifarm (which looks like it is http://wikifarm.wmflabs.org/ ):

# Count requests per endpoint, with numeric parts (account ids, change numbers) collapsed so similar requests group together:
grep 172.16.2.41 gerrit.wikimedia.org.https.access.log | cut -f7 | sed -e 's/[0-9]\+/<number>/g' | sort | uniq -c | sort -n

There is surely room for improvement to stop refreshing the accounts so often. In today's log I found one account refreshed a thousand times, and a few others a few hundred times or so.
The requests to /changes/ are spread among a wide variety of tasks.

extdist

The HTTPS queries are for git-upload-pack (which sends packed objects to the client); each repository is queried 4 times by each of the 3 instances (12 requests total), once per hour. Example for mediawiki/skins/WPtouch:

2019-09-25T06:30:28	172.16.1.84	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:28	172.16.1.84	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:29	172.16.1.84	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:29	172.16.1.84	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:47	172.16.0.54	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:48	172.16.0.54	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:48	172.16.0.54	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:48	172.16.0.54	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:55	172.16.0.56	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:56	172.16.0.56	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:56	172.16.0.56	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack
2019-09-25T06:30:56	172.16.0.56	http://gerrit.wikimedia.org/r/mediawiki/skins/WPtouch/info/refs?service=git-upload-pack

I guess that is because each instance iterates through each of the 4 snapshot-able branches:

wmf-config/CommonSettings.php
// Available snapshots
$wgExtDistSnapshotRefs = [
    'master',
    'REL1_33',
    'REL1_32',
    'REL1_31',
];

That can probably be optimized to a single query using something like:

git fetch origin \
 +refs/heads/master:refs/origin/heads/master \
 +refs/heads/REL1_31:refs/origin/heads/REL1_31 \
 +refs/heads/REL1_32:refs/origin/heads/REL1_32 \
 +refs/heads/REL1_33:refs/origin/heads/REL1_33

Regardless, it should be moved to use gerrit-replica.wikimedia.org.

Wikifarm: I am letting it through since those queries stay on the API, though surely they could be optimized.

For extdist I have filed the follow-up task T233781; apparently the config change has not been taken into account.

Anyway, not much left to do besides that ;]

For wikifarm.pluggableauth.eqiad.wmflabs I have filed T268759