git.wikimedia.org replication from gerrit stopped or lags
Closed, Resolved (Public)

Event Timeline

JanZerebecki raised the priority of this task to Needs Triage.
JanZerebecki updated the task description.
JanZerebecki added projects: acl*sre-team, Gerrit.
JanZerebecki subscribed.
Paladox closed this task as a duplicate of T100110: gerrit not mergin into gitblit.
Paladox set Security to None.
Paladox added subscribers: Krinkle, QChris, Legoktm, Paladox.

Hi, I had this patch https://gerrit.wikimedia.org/r/#/c/212813/ reviewed and given +2 in code review, and it said it was successfully merged. But gitblit still doesn't show anything except that the last update was 2 days ago, and that was a localisation update. http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FLiquidThreads

Paladox triaged this task as Unbreak Now! priority. May 23 2015, 9:13 PM

Since gerrit has stopped replicating into gitblit, the priority should be Unbreak Now!.

Beginning at 2015-05-21 15:47 Gerrit's replication logs are full of errors like

[2015-05-21 15:47:41,273] ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue : Cannot replicate to gerritslave@gallium.wikimedia.org:/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git
org.eclipse.jgit.errors.TransportException: gerritslave@gallium.wikimedia.org:/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git: Algorithm negotiation fail

Both the error message and the time it started hint at the latest
SSHD hardening getting in the way here.
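
To illustrate why tightening the server's algorithm lists can produce "Algorithm negotiation fail", here is a minimal sketch of SSH-style algorithm selection: the first entry on the client's list that the server also offers wins, and no overlap means failure. This is a simplification of the real key exchange rules, and the three lists are illustrative assumptions, not the actual JSch or sshd settings on these hosts.

```python
from typing import Optional, Sequence


def negotiate(client_prefs: Sequence[str], server_prefs: Sequence[str]) -> Optional[str]:
    """Return the first client-preferred algorithm the server also offers, or None."""
    server_set = set(server_prefs)
    for name in client_prefs:
        if name in server_set:
            return name
    return None  # no common algorithm: the connection attempt fails


# Illustrative lists only (assumptions, not the real client or server config).
CLIENT_KEX = [
    "diffie-hellman-group-exchange-sha1",
    "diffie-hellman-group14-sha1",
    "diffie-hellman-group1-sha1",
]
SERVER_KEX_BEFORE = [
    "curve25519-sha256@libssh.org",
    "ecdh-sha2-nistp256",
    "diffie-hellman-group14-sha1",
]
SERVER_KEX_HARDENED = [
    "curve25519-sha256@libssh.org",
    "diffie-hellman-group-exchange-sha256",
]

if __name__ == "__main__":
    print(negotiate(CLIENT_KEX, SERVER_KEX_BEFORE))    # a shared algorithm is found
    print(negotiate(CLIENT_KEX, SERVER_KEX_HARDENED))  # None: negotiation fails
```

The same selection applies separately to MACs, which is why both the KEX and the MAC lists matter for an older client.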

The commits

look relevant, as they got merged around that time, and they change
sshd config.

Judging from gerrit's logs, the affected replication targets are:

  • antimony.wikimedia.org
  • gallium.wikimedia.org
  • lanthanum.eqiad.wmnet
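
A list like the one above can be pulled out of the replication log with a few lines of Python. This is only a sketch: the regular expression is an assumption based on the single error line quoted in this task, and `failing_hosts` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the "Cannot replicate to user@host:path" part of a ReplicationQueue error.
# The pattern is derived from the one log line quoted in this task (assumption).
TARGET_RE = re.compile(
    r"Cannot replicate to (?P<user>[^@\s]+)@(?P<host>[^:\s]+):(?P<path>\S+)"
)


def failing_hosts(lines: Iterable[str]) -> Counter:
    """Count failed replication attempts per target host."""
    hosts: Counter = Counter()
    for line in lines:
        match = TARGET_RE.search(line)
        if match:
            hosts[match.group("host")] += 1
    return hosts


# The two log lines quoted in this task.
SAMPLE = [
    "[2015-05-21 15:47:41,273] ERROR com.googlesource.gerrit.plugins.replication."
    "ReplicationQueue : Cannot replicate to gerritslave@gallium.wikimedia.org:"
    "/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git",
    "org.eclipse.jgit.errors.TransportException: gerritslave@gallium.wikimedia.org:"
    "/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git: Algorithm negotiation fail",
]

if __name__ == "__main__":
    for host, count in failing_hosts(SAMPLE).most_common():
        print(host, count)
```

Only the ERROR line is counted; the TransportException line that follows it repeats the same target and is skipped to avoid double counting.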

Possible ways forward would be to

  • teach Java/Jsch/Gerrit's replication plugin to connect using the settings that we use, or
  • backpedal on the ssh hardening for the three affected hosts.

Since WMF wants to get rid of gerrit, I am not too fond of fiddling
with Java/Jsch/Gerrit's replication plugin.

Could we instead leverage the fresh $disable_nist_kex and
$explicit_macs on the three affected hosts?

Change 213216 had a related patch set uploaded (by QChris):
Turn off sshd MAC and KEX hardening for gerrit replication targets

https://gerrit.wikimedia.org/r/213216

Also, on http://git.wikimedia.org/summary/operations%2Fpuppet the branch master needs to be changed to production, or production needs to be changed to master, please.

That's T52152: 404 for Gitblit's "Tree > HEAD" in operations/puppet repository

Change 213216 merged by Alexandros Kosiaris:
Turn off sshd MAC and KEX hardening for gerrit replication targets

https://gerrit.wikimedia.org/r/213216

How long would it take for gerrit and git to pick up the change so that they both can start working again?

akosiaris claimed this task.
akosiaris subscribed.

The general rule is 20 minutes right now, though some changes might take up to 40 minutes or even 60 minutes. I see on http://git.wikimedia.org that gerrit replication works fine again, so I am resolving this.

Keep in mind also that Gerrit's not going to start replicating a repo again until the objects in it start changing (new commits, etc). So repos that lag may take a little longer to decide to catch up.

A gerrit admin could kick off a replication job for all repos but I'm on vacation now so it won't be me :p

Keep in mind also that Gerrit's not going to start replicating a repo again until the objects in it start changing [...]

If the replication plugin got restarted or so, that'd be true. But that's not the case here.

The replication plugin keeps failed pushes in its queue and retries them automatically. Gerrit had already caught up :-)

But meh. Starting forced replication nonetheless :-P

Starting forced replication nonetheless

That backfired due to GitHub overloading repo names (see T100409).
Fixed by forcing replication again for only the affected repos, using their gerrit names.
Added a warning against --all to wikitech: https://wikitech.wikimedia.org/w/index.php?title=Gerrit&diff=160816&oldid=153784
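
For reference, a minimal sketch of forcing replication per repository instead of with --all. It only builds and prints the commands (a dry run). It assumes the replication plugin's `replication start <project>` SSH command and Gerrit's default SSH port 29418. The helper name and the example project are illustrative, the project name being taken from the log line quoted earlier in this task.

```python
import shlex
from typing import List, Sequence

GERRIT_HOST = "gerrit.wikimedia.org"  # host name as used in this task
GERRIT_SSH_PORT = 29418               # Gerrit's default SSH port (assumed unchanged)


def replication_commands(projects: Sequence[str]) -> List[List[str]]:
    """Build one 'replication start <project>' command per explicitly named project."""
    if not projects:
        # Deliberately no fallback to --all: every project must be named.
        raise ValueError("refusing to build a command without explicit project names")
    return [
        [
            "ssh", "-p", str(GERRIT_SSH_PORT), GERRIT_HOST,
            "replication", "start", shlex.quote(project),
        ]
        for project in projects
    ]


if __name__ == "__main__":
    # Example project name only; substitute the repos that actually lag.
    for cmd in replication_commands(["mediawiki/extensions/ConfirmEdit"]):
        print(" ".join(cmd))  # dry run: print instead of executing
```

Naming projects by their gerrit names keeps the push scoped to the repos that need it, which is the point of the warning against --all.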