
git replication to antimony/gallium/github broken
Closed, Resolved (Public)

Description

https://gerrit.wikimedia.org/r/#/c/90739/ was merged, but isn't showing up in https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FMassMessage.git nor https://github.com/wikimedia/mediawiki-extensions-MassMessage/commits/master

Also, https://git.wikimedia.org/ says "there has been no activity today" (which is false), and the active repositories sidebar is empty.


Version: wmf-deployment
Severity: blocker

Details

Reference
bz55948

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 2:20 AM)
bzimport added a project: Gerrit.
bzimport set Reference to bz55948.

Hmm, no problems replicating to lanthanum, just everything else :\

We're getting rejected host key errors in the logs. All the broken sites have valid fingerprints in known_hosts, and ssh'ing manually to the boxes works fine.

A side effect in Jenkins is that we use the local replication for extension jobs. It is used to install mediawiki/core@master as well as any extension dependencies, so stale replicas can lead to some crazy build failures.

In the auth log on gallium, I see rejected connections from ytterbium.wikimedia.org [208.80.154.80] since Oct 19 20:55 UTC.

The last one working:

Oct 19 20:46:27
Set /proc/self/oom_score_adj to 0
Connection from 208.80.154.80 port 44711
Found matching RSA key: /
Postponed publickey for gerritslave from 208.80.154.80 port 44711 ssh2 [preauth]
Found matching RSA key: /
Accepted publickey for gerritslave from 208.80.154.80 port 44711 ssh2
pam_unix(sshd:session): session opened for user gerritslave by (uid=0)
User child is on pid 30532
pam_unix(sshd:session): session closed for user gerritslave

The first one failing:

Oct 19 20:56:30
Connection from 208.80.154.80 port 45384
Received disconnect from 208.80.154.80: 3: com.jcraft.jsch.JSchException: reject HostKey: gallium.wikimedia.org [preauth]

Rest of the auth log is filled with such errors.
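
For anyone digging through a similar auth log, the rejection lines can be pulled out with a short script. This is only a sketch: the function name `find_hostkey_rejections` and the regular expression are illustrative, written against the single failing line quoted above, and other sshd versions may word the message differently.

```python
import re

# Matches the sshd line logged when the connecting client (JSch here)
# disconnects because it rejected the server's host key.
# The pattern is an assumption based on the one log line quoted above.
REJECT_RE = re.compile(
    r"Received disconnect from (?P<client>\S+): \d+: "
    r"com\.jcraft\.jsch\.JSchException: reject HostKey: (?P<host>\S+)"
)


def find_hostkey_rejections(lines):
    """Return a (client_ip, rejected_host) tuple for each matching line."""
    hits = []
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            hits.append((match.group("client"), match.group("host")))
    return hits


sample = [
    "Connection from 208.80.154.80 port 45384",
    "Received disconnect from 208.80.154.80: 3: "
    "com.jcraft.jsch.JSchException: reject HostKey: "
    "gallium.wikimedia.org [preauth]",
]

for client, host in find_hostkey_rejections(sample):
    print(client, "rejected the host key of", host)
```

Feeding it the lines of a real auth log instead of `sample` gives a quick list of which client rejected which host.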

October 19th:

20:54 ^d: gerrit: installed 2.7-rc2-507-g1e7090b, service back up

It seems the upgrade did not go well and broke something. Maybe replication is run by a different user that does not have gallium.wikimedia.org in its known_hosts.

The same issue appears on lanthanum.eqiad.wmnet and might be happening on antimony.wikimedia.org as well.

(In reply to comment #4)
> October 19th:
>
> 20:54 ^d: gerrit: installed 2.7-rc2-507-g1e7090b, service back up
>
> Seems the upgrade did not go well and broke something. Maybe replication is
> run by a different username that does not have gallium.wikimedia.org added
> to known_hosts.

The upgrade didn't touch replication; it only added a minor change to the output format of gerrit query.

gerrit has always read /var/lib/gerrit2/.ssh/known_hosts, which hasn't changed since the move to ytterbium.

(In reply to comment #5)
> The same issue appears on lanthanum.eqiad.wmnet and might be happening on
> antimony.wikimedia.org as well.

lanthanum is replicating fine, it's antimony/gallium/github that are funky like I mentioned above.

This turned out to be an installation issue. For some reason the gerrit user's homedir was at /home/gerrit2 instead of /var/lib/gerrit2. For now I just copied the files and restarted gerrit2, but I will fix it cleanly by moving the homedir to /var/lib/gerrit2 and deleting /home/gerrit2.
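
A quick way to spot this kind of mismatch is to compare the home directory recorded for the account with the one the service is expected to use. The sketch below assumes a Unix host and assumes known_hosts is looked up as .ssh/known_hosts under the account's home directory. The helper names and the injectable `lookup` parameter are hypothetical, added only so the check can be exercised without a real gerrit2 account.

```python
import os
import pwd


def known_hosts_for(user, lookup=pwd.getpwnam):
    """Return the known_hosts path under the home directory recorded for user."""
    home = lookup(user).pw_dir
    return os.path.join(home, ".ssh", "known_hosts")


def home_matches(user, expected_home, lookup=pwd.getpwnam):
    """Return True if the account's recorded home equals expected_home."""
    actual = os.path.normpath(lookup(user).pw_dir)
    return actual == os.path.normpath(expected_home)


if __name__ == "__main__":
    # Account name and expected path are taken from the comments above.
    user, expected = "gerrit2", "/var/lib/gerrit2"
    try:
        if home_matches(user, expected):
            print("home directory is", expected)
        else:
            print("unexpected home; known_hosts would be", known_hosts_for(user))
    except KeyError:
        print("no such account:", user)
```

On a host set up as described in this comment, the second branch would point at /home/gerrit2/.ssh/known_hosts rather than the file under /var/lib/gerrit2.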

Bah, this is my fault. I'll clean it up.