
Move clients off of gerrit-replica.wikimedia.org back to gerrit.wikimedia.org
Stalled, Medium, Public

Description

This ticket is to replace https://gerrit-replica.wikimedia.org with https://gerrit.wikimedia.org in the various repos that may or may not be using it instead of the main host.

reasoning per list thread here:

https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/LHYBXQCQ5VZPJZEJS2L5MOKJFIQTH3T7/

and commit message here:

https://gerrit.wikimedia.org/r/c/labs/codesearch/+/919925

Also, creating this very ticket was a TODO from today's SRE Collab team meeting, even before the mail thread brought it up again the same day.

Event Timeline

Change 919925 had a related patch set uploaded (by Dzahn; author: Dzahn):

[labs/codesearch@master] switch to main gerrit server instead of using the replica

https://gerrit.wikimedia.org/r/919925

Change 919925 merged by jenkins-bot:

[labs/codesearch@master] switch to main gerrit server instead of using the replica

https://gerrit.wikimedia.org/r/919925

Bawolff subscribed.

Extdist is also using the replica currently.

I deployed the codesearch change, so it shouldn't be affected by the maint window. Maybe eventually we should have three hosts: the primary, one replica for read-heavy clients (codesearch and others put a lot of pressure on gerrit), and one standby. Just thinking.

Change 920243 had a related patch set uploaded (by Hashar; author: Hashar):

[labs/codesearch@master] Revert "switch to main gerrit server instead of using the replica"

https://gerrit.wikimedia.org/r/920243

I am replying here to the wikitech-l message; I think it will make it easier to find later. It came up in preparation for a network operation which affected access to the Gerrit replica and was scheduled for 30 minutes. My replies are inline.

This means codesearch will be affected (and won't get updated) and possibly even be down during that time.

More or less. Codesearch indexes repositories via a service named Hound, which we have configured to poll every 90 minutes. If it can't reach the repositories, I guess it bails out of indexing for that tick and tries again on the next poll 90 minutes later. Once the index is populated, the frontend serves searches from the index and does not need any access to the repositories. At least that is my understanding.
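
For reference, a minimal sketch of what a Hound repository entry for a Gerrit-hosted repo could look like (illustrative only: the repo name, path and values are assumptions based on Hound's standard config.json format, not the actual generated codesearch config; 5400000 ms is the 90-minute interval):

# hypothetical, hand-written config.json; codesearch generates its real one
cat > config.json <<'EOF'
{
  "dbpath": "data",
  "max-concurrent-indexers": 2,
  "repos": {
    "mediawiki-core": {
      "url": "https://gerrit.wikimedia.org/r/mediawiki/core",
      "ms-between-poll": 5400000
    }
  }
}
EOF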

We, at least in my team, would like to switch codesearch (and other clients) back to just using gerrit.wikimedia.org and not the replica directly.

Just today we agreed to make a new ticket specifically for this, because soon we have to reimage the replica to bullseye, which means more downtime.

I don't think it causes much trouble. The major services we offloaded to the replica are systems that routinely hammer every single repository (and sometimes multiple branches) but can otherwise work fine without live updates. Off the top of my head:

  • Extension distributor
  • Phabricator polling for mirroring to Diffusion
  • Code Search

That has offloaded 60% of the git traffic (roughly 4000 requests per minute) off of the primary Gerrit (which also serves the REST API requests and anything that requires live data). When the replica gets reimaged, CodeSearch (and others) will be delayed by a few hours, and I don't think that is a problem. The primary might be able to handle it, though I don't know how it will behave, given it will have to serve traffic for all repositories, which discards caches and might well affect the latency of more important use cases.

The reason we did the split in the past was to reduce load on the main Gerrit server, but meanwhile the underlying issue has been fixed in newer Gerrit versions, and just a few days ago we also switched to brand-new hardware. So now, if anything, it should be beefier than before, and even without that the split already seemed like a thing of the past.

Possibly, yes, but it still adds load on the application and I imagine it would add a bit of latency to requests made to the primary. It can potentially be measured by pointing gerrit-replica.wikimedia.org at the primary (which might just work; IIRC the ssh host key is shared between the primary and the replica) and then observing any change in service quality (primarily latency, which is affected by cache hit ratio, Java garbage collection and whatever race contention / locking mechanisms exist in the app).
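
A rough way to take that kind of measurement (a sketch only; mediawiki/core is just an example repository, and it assumes anonymous HTTP access works the same on both hostnames):

for host in gerrit.wikimedia.org gerrit-replica.wikimedia.org; do
  echo "== $host"
  # time a reference listing, which exercises the Git-over-HTTP endpoint
  time git ls-remote --heads "https://$host/r/mediawiki/core" >/dev/null
done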

And we pay for this with the issue that the replica becomes a second production system, with the need for scheduled downtimes. It complicates fail-over scenarios too, and in a way it means there is never a truly passive host when we do a DC switch-over.

The replica is a hot spare: it continuously receives updates from the primary, which lets one switch to it with all the data up to date thanks to the replication. I wanted to try switching over to it back in August when the server got changed, and would have proposed to follow that route in late April, but I was on vacation both times. Regardless, what I have found out is that it would have missing data:

  • LFS data are stored on disk and are not replicated; that never got implemented. Everyone uses either an NFS array or an S3 bucket shared between the hosts. We could tentatively move that to Swift, but most probably I will look at moving the data to Ceph when it is ready to accept new use cases (T308317). A quick way to check which repositories even have LFS-tracked files is sketched after this list.
  • The replica does not maintain the same set of caches (memory and Lucene) as the primary, it is notably missing the change and diff indexes. A switch to the replica thus implies starting with some caches being cold which might cause a bit of havoc.
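
For the LFS point, a quick client-side way to see whether a given repository would even be affected (a sketch; mediawiki/core is only an example, and it assumes LFS-tracked paths are declared in the top-level .gitattributes as usual):

# shallow-clone the repo and look for LFS-tracked patterns
git clone --depth 1 https://gerrit.wikimedia.org/r/mediawiki/core /tmp/lfs-check
grep -h 'filter=lfs' /tmp/lfs-check/.gitattributes 2>/dev/null || echo "no LFS-tracked files found"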

I think we should invest in a high-availability setup, which is T217174 (I declined it as part of a routine cleanup of Phabricator tasks). What does it entail? I don't know, to be fair, but that is how most run Gerrit these days: it lets one have multiple geographically distributed primaries, and when one goes down (maintenance, upgrade, whatever) traffic shifts to the others. That should make operations easier and reduce maintenance-related downtime, which is always stressful.

Until we go that route, there are a few changes in the pipeline to make it easier to switch over the service (use a fixed UID to ease rsync, and move files around to have all of them under a single directory).
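
Purely as an illustration of the kind of copy those changes are meant to simplify (the $PRIMARY_HOST placeholder and the /srv/gerrit path are assumptions, not the actual puppet layout):

# run on the standby; --archive preserves ownership, which only maps cleanly
# once both hosts use the same fixed UID for the gerrit user
rsync --archive --hard-links --delete "$PRIMARY_HOST":/srv/gerrit/ /srv/gerrit/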

So yea, I suggest we change the config of codesearch now to use the main gerrit unless you have concerns about that.

I'd like to keep the bot git traffic on the replica and proposed a revert of the CodeSearch change: https://gerrit.wikimedia.org/r/920243

This means codesearch will be affected (and won't get updated) and possibly even be down during that time.

More or less. Codesearch indexes repositories via a service named Hound, which we have configured to poll every 90 minutes. If it can't reach the repositories, I guess it bails out of indexing for that tick and tries again on the next poll 90 minutes later. Once the index is populated, the frontend serves searches from the index and does not need any access to the repositories. At least that is my understanding.

It's more complicated than that (I'm one of the maintainers of codesearch). gerrit-replica being down for an extended period of time can bring down codesearch entirely; I distinctly remember one case of that.

As a maintainer of codesearch: I don't care if you want to revert the patch after the maintenance or keep it as-is. Just point to the main gerrit during the maintenance, since that won't bring gerrit down even though it makes a lot of requests.

As an engineer: I think there is a difference of opinion on what gerrit-replica is or should be. Either make an extra host that's a hot stand-by (and switch them around as needed), or switch the DNS of gerrit-replica to point to the main gerrit during maintenance on gerrit-replica, or make it official that gerrit-replica is not stable and allow codesearch to use the main gerrit in the meantime.
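
For the DNS option, a simple sanity check of where each name points during such a window (plain lookups, nothing Wikimedia-specific assumed):

dig +short gerrit.wikimedia.org
dig +short gerrit-replica.wikimedia.org
# if gerrit-replica were temporarily repointed at the primary,
# both lookups would return the same address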

Change 920243 abandoned by Hashar:

[labs/codesearch@master] Revert "switch to main gerrit server instead of using the replica"

Reason:

We had a couple of meetings with Daniel and others and clarified a few things. In short, the server is beefier, codesearch is not that much traffic, and we can revisit later. Overall it is probably good to keep offloading some traffic off the primary, but this revert does not change much.

https://gerrit.wikimedia.org/r/920243

This caused T337263: full disk on codesearch8 on the codesearch side (I honestly wouldn't have caught that ahead of time either, so not trying to blame!!).

More or less. Codesearch indexes repositories via a service named Hound, which we have configured to poll every 90 minutes. If it can't reach the repositories, I guess it bails out of indexing for that tick and tries again on the next poll 90 minutes later. Once the index is populated, the frontend serves searches from the index and does not need any access to the repositories. At least that is my understanding.

Yep, this is what should happen and will happen in most cases: it'll just keep running, with an outdated index. There are some (IMO) edge cases where it could not start back up, but realistically, codesearch is not a production service, and if people can't live without it for a few hours or a day, that's an entirely different problem.

I guess we don't need to switch back to gerrit-replica? In any case, if we eventually do, we need to take care to remove all the old git repos first before cloning the new ones. Something like:

cd /srv
# move the existing clones aside so hound starts from a clean directory
mv hound hound.bak
mkdir hound
chown codesearch:codesearch hound
# regenerate the hound config; the repositories get re-cloned from there
systemctl start codesearch-write-config
# in a separate terminal, once the new clones are underway
rm -rf hound.bak

I guess we don't need to switch back to gerrit-replica?

The agreement so far was "don't switch back until after the reimage of gerrit-replica (tomorrow, Thu May 25), and then we ultimately decide whether it stays on main gerrit forever or goes back". I feel a tendency to keep it on main gerrit, but we should still make a formal decision afaict.

P.S. It also seems possible that in the future, with the regular 6-month DC switchover, we would consider flipping which host is gerrit and which is gerrit-replica between eqiad and codfw, which might or might not factor into this. The whole point of NOT pointing clients to gerrit-replica is that doing so makes the "passive" / hot spare a second production host. So if we don't have to use it, I think we should not use it.

LSobanski triaged this task as Medium priority. Jun 20 2023, 4:48 PM

@LSobanski so, summarizing the status here:

codesearch was switched to main gerrit in https://gerrit.wikimedia.org/r/c/labs/codesearch/+/919925

extdist, though, stayed on gerrit-replica; see the last comment on the abandoned patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/920761

Other than that, searching codesearch for things that use gerrit-replica.wikimedia.org, I can't see specific clients using it, at least not within what codesearch covers, of course.

https://codesearch.wmcloud.org/search/?q=gerrit-replica.wikimedia.org&files=&excludeFiles=&repos=

It seems that we either have to say "extdist stays on the replica forever", decline this ticket, and accept that the replica still does this one thing for production (so both gerrit machines are therefore production)... or we still switch extdist to the main gerrit now and resolve this.

If switching extdist to the main gerrit doesn't kill it, that seems by far the better solution to me.

codesearch was switched to main gerrit in https://gerrit.wikimedia.org/r/c/labs/codesearch/+/919925

We should switch it back to gerrit-replica :]

I gave some overview of the replica usage in T336710#8857046. The missing LARGE user, which can't be found in codesearch, is Phabricator, which is constantly pulling every single repository. I'd like one day to make it a Gerrit replica and push updates instead, but that is more work.

Dzahn changed the task status from Open to Stalled. Nov 9 2023, 9:52 PM

Stalled since we have no consensus on which direction to go.

If both gerrit and gerrit-replica are needed at all times, then we have to accept that there is no failover machine, or we create more machines.