
Make CI point to gerrit-replica instead of the main gerrit
Closed, Declined · Public


Gerrit is constantly being overloaded, and I assume one big reason is CI downloading lots of repos in every run. Pointing CI to gerrit-replica instead would reduce this load. I don't know what credentials CI uses to hit Gerrit, but if it's not anonymous, this change would also reduce the chance of Gerrit being taken over if CI gets compromised.

Event Timeline

Moving CI to Gerrit-Replica was previously discussed in T226240: Create mirror of Gerrit repositories for consumption by various tools. The current task about Gerrit being overloaded (T277127: Gerrit Apache out of workers) doesn't appear to mention CI as a likely cause at this stage.

Thanks for the ticket; it gives good context. I understand we might need the ref to be available immediately, but that actually misses the point. What hammers Gerrit from CI is not loading the patch itself, it's loading the dependencies (each of the extensions and skins), and those can tolerate a couple of minutes of delay. They also happen at least ten times more often than loading the patch. I assume it would be a huge win with little work to just load the dependencies from gerrit-replica. I'm adding people from the original task to this.
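The split described above could look something like the following sketch. The hostnames follow Wikimedia's public URL scheme and the example repository is just an illustration; neither is taken from the actual CI configuration.

```shell
# Sketch: fetch CI dependencies from the replica instead of the primary.
# Hostnames and the example repository are assumptions, not the real CI config.
primary="https://gerrit.wikimedia.org/r"
replica="https://gerrit-replica.wikimedia.org/r"
repo="mediawiki/extensions/Scribunto"   # an example dependency, not the patch under test

# Dependencies tolerate a few minutes of replication lag, so clone them
# from the replica...
dep_url="${replica}/${repo}"
# ...while the patch under test would still be fetched from the primary.
echo "$dep_url"
```

The point of the split is that only one fetch per run (the patch) needs the primary; the ten-plus dependency fetches can all go to the replica.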

Zuul (the CI workflow engine) does have support for waiting until git repositories have replicated to a replica, though I don't know the details of all the config changes that would be needed to support that.
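Independent of Zuul's built-in support, the "wait for replication" idea can be sketched with plain git: poll the replica with `git ls-remote` until the expected SHA shows up. The function name, timeouts, and calling convention here are hypothetical.

```shell
# Hypothetical sketch: block until a given SHA for a ref is visible on the
# replica, so CI only fetches from it once replication has caught up.
# `git ls-remote` is a real command; everything else is an assumption.
wait_for_replication() {
  local url="$1" ref="$2" sha="$3" tries=0
  # Re-check every 10 seconds; give up after ~5 minutes.
  until git ls-remote "$url" "$ref" | grep -q "^$sha"; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && return 1
    sleep 10
  done
}

# Usage (hypothetical): wait_for_replication \
#   https://gerrit-replica.wikimedia.org/r/mediawiki/core \
#   refs/changes/12/3456/1 deadbeef...
```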

Most CI fetches (be it via the API or git) happen on a small subset of repositories (e.g. mediawiki/core or operations/puppet), and Gerrit has so much caching that most requests end up hitting in-memory caches. It is definitely not an issue.

The primary issue we encounter is low-latency bots crawling without rate limits, especially when they hit slow code paths such as the gitblit repository browser, or when they crawl every single repository, causing cache contention (busy repositories might get expelled from the cache because some large, barely used repos get inserted into it).

The outage we had over the weekend and the one from a couple of weeks ago share the same root cause. Solutions are tracked at T277127.