git.wikimedia.org replication from gerrit stopped or lags
Closed, ResolvedPublic

Assigned To

Authored By

	JanZerebecki
	May 22 2015, 2:36 AM

Description

Compare https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FWikibase.git with https://phabricator.wikimedia.org/diffusion/EWBA/: git.w.o is now 6 hours behind.

Details

	Subject	Repo	Branch	Lines +/-
	Turn off sshd MAC and KEX hardening for gerrit replication targets	operations/puppet	production	+6 -0


Related Objects

Mentioned In: T100509: Jenkins master / client ssh connection fails due to missing ssh algorithm
rOPUPc34bc58ee505: Turn off sshd MAC and KEX hardening for gerrit replication targets
T100022: Login link on Special:UserLogout should not have Special:UserLogout as returnto value
Mentioned Here: T100409: https://github.com/wikimedia/mediawiki/ release tags vanished
T52152: 404 for Gitblit's "Tree > HEAD" in operations/puppet repository
rOPUP598389d08a95: sshd: set Message Authentication Code ciphers
rOPUPf73786e3afb1: sshd: don't use NIST key exchange protocols

Event Timeline

JanZerebecki created this task. May 22 2015, 2:36 AM

JanZerebecki raised the priority of this task from to Needs Triage.

JanZerebecki updated the task description.

JanZerebecki added projects: acl*sre-team, Gerrit.

JanZerebecki subscribed.

Restricted Application added a subscriber: Aklapper. May 22 2015, 2:36 AM

Hi, I had this patch https://gerrit.wikimedia.org/r/#/c/212813/ reviewed and given +2, and it said it was successfully merged, but gitblit still shows nothing except that the last update was 2 days ago and was a localisation update. http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FLiquidThreads

Since gerrit has stopped replicating into gitblit, status should be Unbreak Now.

Paladox added a subscriber: Reedy. May 24 2015, 12:28 AM

Paladox added a subscriber: • demon. May 24 2015, 12:56 AM

Beginning at 2015-05-21 15:47 Gerrit's replication logs are full of errors like

[2015-05-21 15:47:41,273] ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue : Cannot replicate to gerritslave@gallium.wikimedia.org:/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git
org.eclipse.jgit.errors.TransportException: gerritslave@gallium.wikimedia.org:/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git: Algorithm negotiation fail

Both the error message and the time it started hint at the latest
SSHD hardening getting in the way here.
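To illustrate why hardening can produce "Algorithm negotiation fail": during SSH setup, both sides exchange lists of supported algorithms, and roughly speaking the first entry on the client's list that the server also offers is chosen. If the lists share nothing, the connection is aborted. The sketch below is a simplified model of that rule; the algorithm lists are illustrative assumptions, not the actual JSch defaults or the WMF sshd configuration.

```python
from typing import Optional, Sequence


def negotiate(client: Sequence[str], server: Sequence[str]) -> Optional[str]:
    """Return the first client-preferred algorithm the server also offers.

    Returns None when there is no overlap, which corresponds to a failed
    negotiation. This is a simplified model of the selection rule.
    """
    offered = set(server)
    for name in client:
        if name in offered:
            return name
    return None


# Illustrative lists only (assumptions, not taken from the real hosts).
client_kex = [
    "diffie-hellman-group-exchange-sha1",
    "diffie-hellman-group14-sha1",
    "diffie-hellman-group1-sha1",
]
client_macs = ["hmac-md5", "hmac-sha1"]

hardened_kex = [
    "curve25519-sha256@libssh.org",
    "diffie-hellman-group-exchange-sha256",
]
hardened_macs = [
    "hmac-sha2-512-etm@openssh.com",
    "hmac-sha2-256-etm@openssh.com",
]
# A relaxed server that additionally accepts an older MAC.
relaxed_macs = hardened_macs + ["hmac-sha1"]

if __name__ == "__main__":
    print("kex vs hardened server:", negotiate(client_kex, hardened_kex))
    print("mac vs hardened server:", negotiate(client_macs, hardened_macs))
    print("mac vs relaxed server: ", negotiate(client_macs, relaxed_macs))
```

With lists like these, an older client finds no common key exchange or MAC on a hardened server, while relaxing the server's list restores an overlap.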

The commits

look relevant, as they got merged around that time, and they change
sshd config.

Judging from gerrit's logs, the affected replication targets are:

antimony.wikimedia.org
gallium.wikimedia.org
lanthanum.eqiad.wmnet
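A list like the one above can be derived from the log with a small script. The following is a minimal sketch that assumes the "Cannot replicate to user@host:path" line format quoted earlier; the function name and the sample input are hypothetical, and the real log location is not stated in this task.

```python
import re
from collections import Counter

# Matches the target in lines such as
# "... ReplicationQueue : Cannot replicate to user@host:/path/to/repo.git"
TARGET_RE = re.compile(
    r"Cannot replicate to (?P<user>[^@\s]+)@(?P<host>[^:\s]+):(?P<path>\S+)"
)


def failing_hosts(log_lines):
    """Count 'Cannot replicate to' errors per target host."""
    counts = Counter()
    for line in log_lines:
        match = TARGET_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Sample input copied from the two log lines quoted in this task.
sample = [
    "[2015-05-21 15:47:41,273] ERROR "
    "com.googlesource.gerrit.plugins.replication.ReplicationQueue : "
    "Cannot replicate to gerritslave@gallium.wikimedia.org:"
    "/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git",
    "org.eclipse.jgit.errors.TransportException: "
    "gerritslave@gallium.wikimedia.org:"
    "/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git: "
    "Algorithm negotiation fail",
]

if __name__ == "__main__":
    for host, count in failing_hosts(sample).most_common():
        print(host, count)
```

Only the "Cannot replicate to" line is counted, so the follow-up exception line for the same push does not inflate the totals.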

Possible ways forward would be to

teach Java/Jsch/Gerrit's replication plugin to connect using the settings that we use, or
backpedal on the ssh hardening for the three affected hosts.

Since WMF wants to get rid of gerrit, I am not too fond of fiddling
with Java/Jsch/Gerrit's replication plugin.

Could we instead leverage the fresh $disable_nist_kex and
$explicit_macs on the three affected hosts?

Change 213216 had a related patch set uploaded (by QChris):
Turn off sshd MAC and KEX hardening for gerrit replication targets

https://gerrit.wikimedia.org/r/213216

gerritbot added a project: Patch-For-Review. May 24 2015, 9:41 AM

Also branch on http://git.wikimedia.org/summary/operations%2Fpuppet for master needs to be changed to production or production needs to be changed to master please.

In T99990#1306479, @Paladox wrote:

Also branch on http://git.wikimedia.org/summary/operations%2Fpuppet for master needs to be changed to production or production needs to be changed to master please.

That's T52152: 404 for Gitblit's "Tree > HEAD" in operations/puppet repository

zhuyifei1999 subscribed. May 25 2015, 12:39 PM

zhuyifei1999 mentioned this in T100022: Login link on Special:UserLogout should not have Special:UserLogout as returnto value. May 25 2015, 12:42 PM

Change 213216 merged by Alexandros Kosiaris:
Turn off sshd MAC and KEX hardening for gerrit replication targets

https://gerrit.wikimedia.org/r/213216

Paladox mentioned this in rOPUPc34bc58ee505: Turn off sshd MAC and KEX hardening for gerrit replication targets. May 25 2015, 6:07 PM

How long would it take for gerrit and git to pick up the change so that they can both start working again?

The general rule is 20 minutes right now, though some changes might take up to 40 or even 60 minutes. I see on http://git.wikimedia.org that gerrit replication works fine again, so resolving this.

Keep in mind also that Gerrit's not going to start replicating a repo again until the objects in it start changing (new commits, etc). So repos that lag may take a little longer to decide to catch up.

A gerrit admin could kick off a replication job for all repos but I'm on vacation now so it won't be me :p

In T99990#1311842, @demon wrote:

Keep in mind also that Gerrit's not going to start replicating a repo again until the objects in it start changing [...]

If the replication plugin got restarted or so, that'd be true. But that's not the case here.

The replication plugin keeps failed pushes in its queue and retries them automatically. Gerrit had already caught up :-)
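The retry behaviour described here can be pictured with a toy model: failed pushes stay queued and are retried, so once the targets accept connections again the queue drains by itself. This sketch is only an illustration of that idea, not the replication plugin's actual implementation; all names in it are hypothetical.

```python
from collections import deque


def drain(queue, push, max_rounds=10):
    """Retry queued pushes until each succeeds or max_rounds is reached.

    `push` is a callable returning True on success. Targets whose push
    fails are put back at the end of the queue for the next round.
    Returns the targets that are still pending.
    """
    pending = deque(queue)
    for _ in range(max_rounds):
        if not pending:
            break
        for _ in range(len(pending)):
            target = pending.popleft()
            if not push(target):
                pending.append(target)
    return list(pending)


# Hypothetical stand-in for the remote side: pushes fail until sshd is fixed.
state = {"sshd_fixed": False}


def fake_push(target):
    return state["sshd_fixed"]


if __name__ == "__main__":
    repos = ["ConfirmEdit.git", "Wikibase.git"]
    print("before fix:", drain(repos, fake_push, max_rounds=2))
    state["sshd_fixed"] = True
    print("after fix: ", drain(repos, fake_push))
```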

But meh. Starting forced replication nonetheless :-P

hashar mentioned this in T100509: Jenkins master / client ssh connection fails due to missing ssh algorithm. May 27 2015, 2:00 PM

In T99990#1312092, @QChris wrote:

Starting forced replication nonetheless

That backfired due to GitHub overloading repo names (see T100409).
Fixed by forcing replication of only the affected repos again from their gerrit names.
Added a warning against --all to wikitech: https://wikitech.wikimedia.org/w/index.php?title=Gerrit&diff=160816&oldid=153784
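For reference, a per-repo forced replication could be scripted roughly as follows. This is a hedged sketch: it only builds the command lines and does not run them, it assumes the replication plugin's `replication start <pattern>` SSH command and Gerrit's default SSH port 29418, and the helper name is hypothetical. The exact invocation used in this incident is not recorded in the task.

```python
def replication_commands(repos, host="gerrit.wikimedia.org", port=29418):
    """Build one 'replication start' SSH command per gerrit repo name.

    Option-like arguments such as '--all' are rejected, so that only
    explicitly named repos are ever replicated.
    """
    commands = []
    for repo in repos:
        if repo.startswith("-"):
            raise ValueError(f"refusing option-like repo name: {repo!r}")
        commands.append(
            ["ssh", "-p", str(port), host, "replication", "start", repo]
        )
    return commands


if __name__ == "__main__":
    # Example repo name taken from the task description.
    for cmd in replication_commands(["mediawiki/extensions/Wikibase"]):
        print(" ".join(cmd))
```

Keeping the commands as argument lists makes them easy to review before handing them to something like `subprocess.run`.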
