Compare https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FWikibase.git with https://phabricator.wikimedia.org/diffusion/EWBA/: git.wikimedia.org is now 6 hours behind.
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Turn off sshd MAC and KEX hardening for gerrit replication targets | operations/puppet | production | +6 -0 |
Related Objects
- Mentioned In
  - T100509: Jenkins master / client ssh connection fails due to missing ssh algorithm
  - rOPUPc34bc58ee505: Turn off sshd MAC and KEX hardening for gerrit replication targets
  - T100022: Login link on Special:UserLogout should not have Special:UserLogout as returnto value
- Mentioned Here
  - T100409: https://github.com/wikimedia/mediawiki/ release tags vanished
  - T52152: 404 for Gitblit's "Tree > HEAD" in operations/puppet repository
  - rOPUP598389d08a95: sshd: set Message Authentication Code ciphers
  - rOPUPf73786e3afb1: sshd: don't use NIST key exchange protocols
Event Timeline
Hi, I had this patch https://gerrit.wikimedia.org/r/#/c/212813/ reviewed and +2'd, and Gerrit said it was successfully merged, but Gitblit still doesn't show anything except that the last update was 2 days ago and was a localisation update. http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FLiquidThreads
Beginning at 2015-05-21 15:47 Gerrit's replication logs are full of errors like
[2015-05-21 15:47:41,273] ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue : Cannot replicate to gerritslave@gallium.wikimedia.org:/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git org.eclipse.jgit.errors.TransportException: gerritslave@gallium.wikimedia.org:/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git: Algorithm negotiation fail
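Both the error message and the time it started hint at the latest
sshd hardening getting in the way here.
The commits rOPUP598389d08a95 (sshd: set Message Authentication Code ciphers)
and rOPUPf73786e3afb1 (sshd: don't use NIST key exchange protocols)
look relevant, as they got merged around that time, and they change
sshd config.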
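To illustrate why restricting the server-side lists can produce "Algorithm negotiation fail": in SSH, each side sends a list of algorithms per category, and the one picked is the first entry in the client's list that the server also offers. If there is no overlap, the connection fails. The following is only a minimal sketch of that rule; the function name and the algorithm lists are hypothetical and are not the actual JSch or puppet-managed sshd settings.

```python
from typing import Optional, Sequence


def negotiate(client: Sequence[str], server: Sequence[str]) -> Optional[str]:
    """Return the first algorithm in the client's list that the server also offers.

    Returns None when the two lists do not overlap, which is the situation
    reported as an algorithm negotiation failure.
    """
    offered = set(server)
    for name in client:
        if name in offered:
            return name
    return None


# Hypothetical MAC lists, for illustration only (assumptions, not real config).
OLD_CLIENT_MACS = ["hmac-md5", "hmac-sha1", "hmac-sha1-96", "hmac-md5-96"]
HARDENED_SERVER_MACS = ["hmac-sha2-512", "hmac-sha2-256"]
RELAXED_SERVER_MACS = HARDENED_SERVER_MACS + ["hmac-sha1"]

if __name__ == "__main__":
    # No common MAC: negotiation fails.
    print(negotiate(OLD_CLIENT_MACS, HARDENED_SERVER_MACS))
    # One common MAC: negotiation succeeds.
    print(negotiate(OLD_CLIENT_MACS, RELAXED_SERVER_MACS))
```

The same reasoning applies to the key exchange list, although real key exchange selection has some extra conditions that this sketch leaves out.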
Judging from gerrit's logs, the affected replication targets are:
- antimony.wikimedia.org
- gallium.wikimedia.org
- lanthanum.eqiad.wmnet
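A list like the one above can be derived from the replication log mechanically. The following is a rough sketch, assuming every failing line has the same shape as the error quoted earlier; the function name `failing_hosts` and the regular expression are illustrative and not part of Gerrit.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the "Cannot replicate to user@host:/path" part of a replication error.
TARGET_RE = re.compile(
    r"Cannot replicate to (?P<user>[^@\s]+)@(?P<host>[^:\s]+):(?P<path>\S+)"
)


def failing_hosts(lines: Iterable[str]) -> Counter:
    """Count negotiation failures per replication target host."""
    hosts: Counter = Counter()
    for line in lines:
        if "Algorithm negotiation fail" not in line:
            continue
        match = TARGET_RE.search(line)
        if match:
            hosts[match.group("host")] += 1
    return hosts


# The log line quoted in this task, used as sample input.
SAMPLE = (
    "[2015-05-21 15:47:41,273] ERROR "
    "com.googlesource.gerrit.plugins.replication.ReplicationQueue : "
    "Cannot replicate to gerritslave@gallium.wikimedia.org:"
    "/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git "
    "org.eclipse.jgit.errors.TransportException: "
    "gerritslave@gallium.wikimedia.org:"
    "/srv/ssd/gerrit/mediawiki/extensions/ConfirmEdit.git: "
    "Algorithm negotiation fail"
)

if __name__ == "__main__":
    for host, count in failing_hosts([SAMPLE]).items():
        print(host, count)
```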
Possible ways forward would be to
- teach Java/Jsch/Gerrit's replication plugin to connect using the settings that we use, or
- backpedal on the ssh hardening for the three affected hosts.
Since WMF wants to get rid of Gerrit, I am not too fond of fiddling
with Java/Jsch/Gerrit's replication plugin.
Could we instead leverage the fresh $disable_nist_kex and
$explicit_macs on the three affected hosts?
Change 213216 had a related patch set uploaded (by QChris):
Turn off sshd MAC and KEX hardening for gerrit replication targets
Also, on http://git.wikimedia.org/summary/operations%2Fpuppet, either master needs to be changed to production or production needs to be changed to master, please.
Change 213216 merged by Alexandros Kosiaris:
Turn off sshd MAC and KEX hardening for gerrit replication targets
How long would it take for Gerrit and git to pick up the change so that they both can start working again?
The general rule is 20 minutes right now, though some changes might take up to 40 minutes or even 60 minutes. I see on http://git.wikimedia.org that Gerrit replication works fine again, so resolving this.
Keep in mind also that Gerrit's not going to start replicating a repo again until the objects in it start changing (new commits, etc). So repos that lag may take a little longer to decide to catch up.
A gerrit admin could kick off a replication job for all repos but I'm on vacation now so it won't be me :p
If the replication plugin got restarted or so, that'd be true. But that's not the case here.
The replication plugin keeps failed pushes in its queue and retries them automatically. Gerrit had already caught up :-)
But meh. Starting forced replication nonetheless :-P
That back-fired due to github overloading repo names (see T100409).
Fixed by forcing replication of only the affected repos again from their gerrit names.
Added a warning against --all to wikitech: https://wikitech.wikimedia.org/w/index.php?title=Gerrit&diff=160816&oldid=153784
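For reference, forcing replication per repository instead of with --all could be scripted roughly as below. This is only a sketch: it assumes the replication plugin's `replication start` SSH command accepts a project name as its argument, that Gerrit listens on its default SSH port 29418, and the helper name `replication_commands` is made up. It only builds the command lines and does not run anything.

```python
import shlex
from typing import List, Sequence

GERRIT_HOST = "gerrit.wikimedia.org"  # hostname as it appears in this task
GERRIT_SSH_PORT = 29418  # Gerrit's default SSH port (assumed here)


def replication_commands(
    repos: Sequence[str],
    host: str = GERRIT_HOST,
    port: int = GERRIT_SSH_PORT,
) -> List[List[str]]:
    """Build one 'replication start <repo>' ssh command per Gerrit repo name.

    Deliberately refuses an empty list so that nobody falls back to --all.
    """
    if not repos:
        raise ValueError("name the affected repos explicitly; do not use --all")
    return [
        ["ssh", "-p", str(port), host, "replication", "start", repo]
        for repo in repos
    ]


if __name__ == "__main__":
    # Repo names taken from the URLs mentioned in this task.
    for argv in replication_commands(
        ["mediawiki/extensions/Wikibase", "mediawiki/extensions/LiquidThreads"]
    ):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review exactly which Gerrit repo names would be replicated before anything is run.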