
git replication to antimony/gallium/github broken
Closed, ResolvedPublic

Description was merged, but isn't showing up in nor

Also, says "there has been no activity today" (false), and the active repositories sidebar is empty

Version: wmf-deployment
Severity: blocker



Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 2:20 AM)
bzimport added a project: Gerrit.
bzimport set Reference to bz55948.

Hmm, no problems replicating to lanthanum, just everything else :\

We're getting rejected host key errors in the logs. All the broken sites have valid fingerprints in known_hosts, and ssh'ing manually to the boxes works fine.

A side effect in Jenkins is that we use the local replication for extension jobs. It is used to install mediawiki/core@master as well as any extension dependencies, so stale replicas can lead to some crazy build failures.

In the auth log on gallium, I see rejected connections from [] since Oct 19 20:55 UTC

The last one working:

Oct 19 20:46:27
Set /proc/self/oom_score_adj to 0
Connection from port 44711
Found matching RSA key: /
Postponed publickey for gerritslave from port 44711 ssh2 [preauth]
Found matching RSA key:
Accepted publickey for gerritslave from port 44711 ssh2
pam_unix(sshd:session): session opened for user gerritslave by (uid=0)
User child is on pid 30532
pam_unix(sshd:session): session closed for user gerritslave

The first one failing:

Oct 19 20:56:30
Connection from port 45384
Received disconnect from 3: com.jcraft.jsch.JSchException: reject HostKey: [preauth]

Rest of the auth log is filled with such errors.
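
One way to find where the failures start in a log like this is to scan for the JSch "reject HostKey" message. The following is only a minimal Python sketch: the function name is illustrative, and the sample lines are copied from the excerpts above rather than read from a real auth log.

```python
def first_host_key_rejection(lines):
    """Return (index, line) for the first line reporting a rejected
    host key, or None if no such line is present."""
    for index, line in enumerate(lines):
        if "reject HostKey" in line:
            return index, line.rstrip("\n")
    return None


# Sample lines taken from the log excerpts quoted in this task.
sample = [
    "Accepted publickey for gerritslave from port 44711 ssh2",
    "Connection from port 45384",
    "Received disconnect from 3: com.jcraft.jsch.JSchException: reject HostKey: [preauth]",
]

result = first_host_key_rejection(sample)
```

Note that the disconnect is initiated by the client (JSch, inside Gerrit), so the message points at the client-side host key check rather than at sshd rejecting the key.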

October 19th:

20:54 ^d: gerrit: installed 2.7-rc2-507-g1e7090b, service back up

It seems the upgrade did not go well and broke something. Maybe replication is now run by a different user whose known_hosts does not have the hosts added.

The same issue appears on lanthanum.eqiad.wmnet and might be happening on as well.

(In reply to comment #4)

October 19th:

20:54 ^d: gerrit: installed 2.7-rc2-507-g1e7090b, service back up

It seems the upgrade did not go well and broke something. Maybe replication
is now run by a different user whose known_hosts does not have the hosts added

Upgrade didn't touch replication, it only added a minor change to the output format of gerrit query.

gerrit has always read /var/lib/gerrit2/.ssh/known_hosts, which hasn't changed since the move to ytterbium.

(In reply to comment #5)

The same issue appears on lanthanum.eqiad.wmnet and might be happening on as well.

lanthanum is replicating fine, it's antimony/gallium/github that are funky like I mentioned above.

This turned out to be an installation issue. For some reason the gerrit user's homedir was at /home/gerrit2 instead of /var/lib/gerrit2. For now I have just copied the files and restarted gerrit2, but I will fix it cleanly by moving the homedir to /var/lib/gerrit2 and deleting /home/gerrit2.
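
To illustrate the mismatch: assuming the SSH client resolves known_hosts relative to the account's home directory, a home directory of /home/gerrit2 means the file under /var/lib/gerrit2/.ssh is never consulted. Below is a minimal Python sketch of that check. It assumes the standard seven-field passwd layout; the function names and the sample passwd entry (UID, GID, shell) are hypothetical and not taken from the affected host.

```python
from pathlib import PurePosixPath


def passwd_home(passwd_line):
    """Extract the home directory (sixth colon-separated field)
    from a single passwd entry."""
    fields = passwd_line.strip().split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields")
    return fields[5]


def known_hosts_for(home):
    """Path where a client using <home>/.ssh would look for known_hosts."""
    return PurePosixPath(home) / ".ssh" / "known_hosts"


def home_mismatch(passwd_line, expected_home):
    """Return the actual home directory if it differs from the
    expected one, otherwise None."""
    actual = passwd_home(passwd_line)
    if PurePosixPath(actual) == PurePosixPath(expected_home):
        return None
    return actual


# Hypothetical passwd entry; only the home directory reflects this task.
entry = "gerrit2:x:1001:1001::/home/gerrit2:/bin/bash"

wrong_home = home_mismatch(entry, "/var/lib/gerrit2")
looked_up = known_hosts_for(passwd_home(entry))
expected = known_hosts_for("/var/lib/gerrit2")
```

Here `looked_up` and `expected` differ, which is the kind of discrepancy that would leave a valid known_hosts file in place while the client still rejects every host key.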

Bah, this is my fault. I'll clean it up.